watcher/doc/source/datasources/prometheus.rst
Jaromir Wysoglad 8309d9848a Add Aetos datasource
Implement the spec for multi-tenancy support for metrics. This adds
a new 'Aetos' datasource very similar to the current Prometheus
datasource. Because of that, the original PrometheusHelper class
was split into two classes and the base class is now shared by
PrometheusHelper and AetosHelper. Apart from the split, there is
one more change to the original PrometheusHelper class code: the
addition and use of the _get_fqdn_label() and
_get_instance_uuid_label() methods.

As part of the change, I refactored the current prometheus datasource
unit tests. Most of them are now used to test the PrometheusBase class
with minimal changes. Changes I've made to the original tests:

- the ones that can be used to test the base class are moved into the
  TestPrometheusBase class
- the _setup_prometheus_client, _get_instance_uuid_label and
  _get_fqdn_label functions are mocked in the base class tests.
  Their concrete implementations are tested separately in each
  datasource's tests.
- a self._create_helper() method is used to instantiate the helper class
  with the correct mocking (a rough sketch follows this list).
- all config value modification in the original tests was moved out and,
  instead of modifying the config values, the _get_* methods are mocked
  to return the wanted values
- to keep similar test coverage, config retrieval is tested for each
  concrete class by testing the _get_* methods.
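
The following is a rough, hypothetical sketch of that test layout. The
class and method names come from the description above; the module path,
the constructor and the exact structure are assumptions rather than code
copied from the Watcher tree.

from unittest import mock
import unittest

from watcher.decision_engine.datasources import prometheus


class TestPrometheusBase(unittest.TestCase):
    """Exercises behaviour shared by the Prometheus and Aetos helpers."""

    def _create_helper(self):
        # Mock the hooks each concrete datasource implements itself, so
        # the shared base-class logic can be tested without real config.
        with mock.patch.object(prometheus.PrometheusHelper,
                               '_setup_prometheus_client'), \
                mock.patch.object(prometheus.PrometheusHelper,
                                  '_get_fqdn_label',
                                  return_value='fqdn'), \
                mock.patch.object(prometheus.PrometheusHelper,
                                  '_get_instance_uuid_label',
                                  return_value='resource'):
            return prometheus.PrometheusHelper()

    def test_create_helper_uses_mocked_hooks(self):
        helper = self._create_helper()
        self.assertIsNotNone(helper)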

New watcher-aetos-integration and watcher-aetos-integration-realdata
zuul jobs are added to test the new datasource. These use the same set
of tempest tests as the current watcher-prometheus-integration jobs.
The only difference is the environment setup and the Watcher config,
so that the job deploys Aetos and Watcher uses it instead of accessing
Prometheus directly.

At first this was generated by asking Cursor to implement the linked spec,
with additional prompts for some smaller changes. Afterwards I manually
went through the code doing some cleanups, ensuring it complies with
PEP8 and hacking and so on. Later on I manually adjusted the code to use
the latest observabilityclient changes.
The zuul job was also mostly generated by Cursor.

Implements: https://blueprints.launchpad.net/watcher/+spec/prometheus-multitenancy-support

Generated-By: Cursor with claude-4-sonnet model
Change-Id: I72c2171f72819bbde6c9cbbf565ee895e5d2bd53
Signed-off-by: Jaromir Wysoglad <jwysogla@redhat.com>
2025-08-14 02:27:24 -04:00


Prometheus datasource

Synopsis

The Prometheus datasource allows Watcher to use a Prometheus server as the source for collected metrics used by the Watcher decision engine. At a minimum, deployers must configure the host and port at which the Prometheus server is listening.

Requirements

It is required that Prometheus metrics contain a label to identify the hostname of the exporter from which the metric was collected. This is used to match against the Watcher cluster model ComputeNode.hostname. The default for this label is fqdn and in the Prometheus scrape configs it would look like:

scrape_configs:
- job_name: node
  static_configs:
  - targets: ['10.1.2.3:9100']
    labels:
      fqdn: "testbox.controlplane.domain"

This default can be overridden when a deployer uses a different label to identify the exporter host (for example hostname or host, or any other label, as long as it identifies the host).

Internally this label is used in creating fqdn_instance_labels, containing the list of values assigned to the label in the Prometheus targets. The elements of the resulting fqdn_instance_labels are expected to match the ComputeNode.hostname used in the Watcher decision engine cluster model. An example fqdn_instance_labels is the following:

[
 'ena.controlplane.domain',
 'dio.controlplane.domain',
 'tria.controlplane.domain',
]

For instance metrics, it is required that Prometheus contains a label with the uuid of the OpenStack instance in each relevant metric. By default, the datasource will look for the label resource. The instance_uuid_label config option in watcher.conf allows deployers to override this default to any other label name that stores the uuid.
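
As an illustration, the configured label name simply ends up as a PromQL label matcher when instance metrics are queried. The metric name and uuid in the short Python sketch below are placeholders and only show the shape of the resulting query, not Watcher's actual query-building code:

# Hypothetical example; 'ceilometer_cpu' and the uuid are placeholders.
instance_uuid_label = "resource"  # [prometheus_client]/instance_uuid_label
instance_uuid = "4f7c8f9a-2e3b-4d5c-9a1b-0c2d3e4f5a6b"
query = 'ceilometer_cpu{%s="%s"}' % (instance_uuid_label, instance_uuid)
# -> ceilometer_cpu{resource="4f7c8f9a-2e3b-4d5c-9a1b-0c2d3e4f5a6b"}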

Limitations

The current implementation doesn't support the statistic_series function of the Watcher class DataSourceBase. It is expected that the statistic_aggregation function (which is implemented) is sufficient for providing the current state of the managed resources in the cluster. The statistic_aggregation function defaults to querying back 300 seconds, starting from the present time (the time period is a function parameter and can be set to a value as required). Implementing statistic_series can always be revisited if the requisite interest and work cycles are volunteered by the interested parties.

One further note concerns a limitation in the implemented statistic_aggregation function. This function is defined with a granularity parameter, to be used when querying any of the Watcher DataSourceBase metrics providers. In the case of Prometheus, we do not fetch and then process individual metrics across the specified time period. Instead we use the PromQL querying operators and functions, so that the server itself processes the request across the specified parameters and returns the result. The granularity parameter is therefore redundant and remains unused in the Prometheus implementation of statistic_aggregation. The granularity of the data fetched by the Prometheus server is specified in its configuration as the scrape_interval (current default 15 seconds).
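
To make this concrete, the short Python sketch below uses the generic Prometheus HTTP API rather than Watcher's own client, and the metric name, server address and fqdn value are placeholders. It shows the style of query statistic_aggregation relies on: the PromQL engine aggregates the last 300 seconds server-side, so the client never handles individual samples or a granularity setting:

import requests

PROMETHEUS_URL = "http://10.2.3.4:9090"  # host/port from [prometheus_client]
# Average the metric over the last 300 seconds for one exporter host.
query = ('avg_over_time(node_cpu_seconds_total'
         '{fqdn="ena.controlplane.domain"}[300s])')

resp = requests.get(PROMETHEUS_URL + "/api/v1/query",
                    params={"query": query}, timeout=10)
resp.raise_for_status()
for result in resp.json()["data"]["result"]:
    # The value comes back already aggregated by the Prometheus server.
    print(result["metric"], result["value"])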

Configuration

A deployer must set the datasources parameter to include prometheus under the watcher_datasources section of watcher.conf (or add prometheus to datasources for a specific strategy if preferred, e.g. under the [watcher_strategies.workload_stabilization] section).

The watcher.conf configuration file is also used to set the parameter values required by the Watcher Prometheus data source. The configuration can be added under the [prometheus_client] section and the available options are duplicated below from the code, as they are self-documenting:

cfg.StrOpt('host',
           help="The hostname or IP address for the prometheus server."),
cfg.StrOpt('port',
           help="The port number used by the prometheus server."),
cfg.StrOpt('fqdn_label',
           default="fqdn",
           help="The label that Prometheus uses to store the fqdn of "
                "exporters. Defaults to 'fqdn'."),
cfg.StrOpt('instance_uuid_label',
           default="resource",
           help="The label that Prometheus uses to store the uuid of "
                "OpenStack instances. Defaults to 'resource'."),
cfg.StrOpt('username',
           help="The basic_auth username to use to authenticate with the "
                "Prometheus server."),
cfg.StrOpt('password',
           secret=True,
           help="The basic_auth password to use to authenticate with the "
                "Prometheus server."),
cfg.StrOpt('cafile',
           help="Path to the CA certificate for establishing a TLS "
                "connection with the Prometheus server."),
cfg.StrOpt('certfile',
           help="Path to the client certificate for establishing a TLS "
                "connection with the Prometheus server."),
cfg.StrOpt('keyfile',
           help="Path to the client key for establishing a TLS "
                "connection with the Prometheus server."),

The host and port are required configuration options which have no set default. These specify the hostname (or IP) and port at which the Prometheus server is listening. The fqdn_label allows deployers to override the required metric label used to match Prometheus node exporters against the Watcher ComputeNodes in the Watcher decision engine cluster data model. The default is fqdn and deployers can specify any other value (e.g. if they have an equivalent but different label such as host).

So a sample watcher.conf configured to use the Prometheus server at 10.2.3.4:9090 would look like the following:

[watcher_datasources]

datasources = prometheus

[prometheus_client]

host = 10.2.3.4
port = 9090
fqdn_label = fqdn