
Grafana datasource

Synopsis

Grafana can interface with many different types of storage backends that Grafana calls datasources. Since the term datasource overlaps with definitions used in Watcher and causes significant confusion, these Grafana datasources are referred to as projects instead. Some examples of supported projects are InfluxDB and Elasticsearch, while others such as Monasca or Gnocchi might be more familiar. The Grafana datasource provides the functionality to retrieve metrics from Grafana for different projects. This functionality is achieved by using the proxy interface exposed by Grafana to communicate with Grafana projects directly.

Background

Since queries to retrieve metrics from Grafana are proxied to the project, the format of these queries will change significantly depending on the type of project. The structure of the projects themselves will also vary significantly as they are structured by users and administrators. For instance, some developers might decide to store metrics about compute_nodes in MySQL and use the UUID as primary key, while others use InfluxDB and use the hostname as primary key. Furthermore, datasources in Watcher should return metrics in the specific units strictly defined in the base class; depending on how the units are stored in the projects, they might require conversion before being returned. The flexible configuration parameters of the Grafana datasource allow specifying exactly how the deployment is configured, which enables the retrieval of metrics in the correct units.

Requirements

The use of the Grafana datasource requires a reachable Grafana endpoint and an authentication token with access to the desired projects. The projects behind Grafana will need to contain the metrics for compute_nodes or instances, and these need to be identifiable by an attribute of the Watcher data model, for instance hostname or UUID.

Limitations

  • Only the InfluxDB project is currently supported [1].
  • All metrics must be retrieved from the same Grafana endpoint (same URL).
  • All metrics must be retrieved with the same authentication token.

Configuration

Several steps are required in order to use the Grafana datasource. Most steps are related to configuring Watcher to match the deployed Grafana setup, such as the queries proxied to the project or the type of project for any given metric. Most of the configuration can either be supplied via the traditional configuration file or in a special yaml file.

token

The first step is to generate an access token with access to the required projects. This can be done from the API or from the web interface. Tokens generated from the web interface will have the same access to projects as the user that created them, while using the API allows generating a key for a specific role. The token will only be displayed once, so store it securely. This token will go into the configuration file later and this parameter can not be placed in the yaml.
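
For illustration, the following minimal sketch requests such a token through Grafana's HTTP API; the URL, admin credentials and key name are placeholders and need to be adjusted to the deployment.

import requests

# Minimal sketch: request an API key with the Viewer role from Grafana's
# HTTP API. The URL, admin credentials and key name are placeholders.
GRAFANA_URL = "https://mygrafana.org"

response = requests.post(
    GRAFANA_URL + "/api/auth/keys",
    auth=("admin", "admin-password"),            # admin credentials
    json={"name": "watcher", "role": "Viewer"},  # role the token will have
)
response.raise_for_status()
# The key is only returned once; copy it into the [grafana_client] section.
print(response.json()["key"])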

base_url

The next step is supplying the base url of the Grafana endpoint. The base_url parameter will need to specify the HTTP scheme, and the use of plain text http is strongly discouraged due to the transmission of the access token. Additionally, the path to the proxy interface needs to be supplied as well in case Grafana is placed in a sub directory of the web server. An example would be: https://mygrafana.org/api/datasources/proxy/ where /api/datasources/proxy is the default path without any subdirectories. Likewise, this parameter can not be placed in the yaml.

To prevent many errors from occurring and potentially filling the log files, it is advised to specify the desired datasource in the configuration, as this prevents the datasource manager from having to iterate over and try all possible datasources at the launch of each audit. To do this, specify datasources in the [watcher_datasources] group.

The current configuration that is required to be placed in the traditional configuration file would look like the following:

[grafana_client]
token = 0JLbF0oB4R3Q2Fl337Gh4Df5VN12D3adBE3f==
base_url = https://mygrafana.org/api/datasources/proxy

[watcher_datasources]
datasources = grafana

metric parameters

The last five remaining configuration parameters can all be placed either in the traditional configuration file or in the yaml. It is not advised to mix and match, but in the case it does occur the yaml overrides the settings from the traditional configuration file. All five of these parameters are dictionaries mapping specific metrics to a configuration parameter. For instance, the project_id_map specifies the specific project id in Grafana to be used. The parameters are named as follows:

  • project_id_map
  • database_map
  • translator_map
  • attribute_map
  • query_map

These five parameters are named differently if configured using the yaml configuration file. The parameters are named as follows and are in identical order to the list for the traditional configuration file:

  • project
  • db
  • translator
  • attribute
  • query

When specified in the yaml the parameters are no longer dictionaries; instead, each parameter needs to be defined per metric as a sub-parameter. Examples of these parameters configured for both the yaml and traditional configuration are described at the end of this document.

project_id

The project ids can only be determined by someone with the admin role in Grafana, as that role is required to open the list of projects. The list of projects can be found on /datasources in the web interface, but unfortunately it does not immediately display the project id. To display the id, it is easiest to hover the mouse over a project: the url will show the project id, for example /datasources/edit/7563. Alternatively, the entire list of projects can be retrieved using the REST api, either with a tool such as Postman or with a small script like the sketch below.
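
The following minimal sketch lists the project ids through Grafana's REST api; the endpoint URL and token are placeholders, and the token must belong to a user or key with admin privileges.

import requests

# Minimal sketch: list all projects (Grafana datasources) and their ids.
# The URL and token are placeholders and require admin privileges.
GRAFANA_URL = "https://mygrafana.org"
TOKEN = "REPLACE_WITH_TOKEN"

response = requests.get(
    GRAFANA_URL + "/api/datasources",
    headers={"Authorization": "Bearer " + TOKEN},
)
response.raise_for_status()
for project in response.json():
    print(project["id"], project["name"], project["type"])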

database

The database parameter names the schema / database that is actually defined in the project. For instance, if the project were based on MySQL, this is where the name of the schema used within the MySQL server would be specified. For many different projects it is possible to list all the databases currently available. Tools like Postman can be used to list all the available databases per project. For InfluxDB based projects this would be done with the following path and query; be sure to construct these requests so that the header contains the authorization token:

https://URL.DOMAIN/api/datasources/proxy/PROJECT_ID/query?q=SHOW%20DATABASES
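
The same request can also be made without Postman. The following minimal sketch issues the SHOW DATABASES query through the proxy interface; the URL, project id and token are placeholders.

import requests

# Minimal sketch: list the databases of an InfluxDB project through the
# Grafana proxy interface. URL, project id and token are placeholders.
BASE_URL = "https://mygrafana.org/api/datasources/proxy"
PROJECT_ID = 7563
TOKEN = "REPLACE_WITH_TOKEN"

response = requests.get(
    "{0}/{1}/query".format(BASE_URL, PROJECT_ID),
    params={"q": "SHOW DATABASES"},
    headers={"Authorization": "Bearer " + TOKEN},
)
response.raise_for_status()
print(response.json())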

translator

Each translator is for a specific type of project and has a uniquely identifiable name, and the base class makes it easy to support new types of projects such as Elasticsearch or Prometheus. Currently only InfluxDB based projects are supported, so the only valid value for this parameter is influxdb.

attribute

The attribute parameter specifies which attribute from Watcher's data model to use in order to construct the query. The available attributes differ per type of object in the data model, but the following table shows the attributes for ComputeNodes, Instances and IronicNodes.

ComputeNode       Instance          IronicNode
uuid              uuid              uuid
id                name              human_id
hostname          project_id        power_state
status            watcher_exclude   maintenance
disabled_reason   locked            maintenance_reason
state             metadata          extra
memory            state
disk              memory
disk_capacity     disk
vcpus             disk_capacity
                  vcpus

Many if not all of these attributes map to attributes of the objects that are fetched from clients such as Nova. To see how these attributes are put into the data model, the source files that build the Nova and Ironic data models can be analyzed.

query

The query is the single most important parameter: it is passed to the project and should return the desired metric for the specific host in the correct unit. The units for all available metrics are documented in the datasource base class. This might mean the query specified in this parameter is responsible for converting the unit. The following query demonstrates how such a conversion could be achieved, in this case from bytes to megabytes.

SELECT value/1000000 FROM memory...

Queries will be formatted using the .format string method within Python. This format currently has five attributes exposed to it, labeled {0} to {4}. Every occurrence of these placeholders within the string will be replaced with the specific attribute, as shown in the sketch after the following list.

  • {0} is the aggregate, typically mean, min or max, but count is also supported.
  • {1} is the attribute as specified in the attribute parameter.
  • {2} is the period of time to aggregate data over in seconds.
  • {3} is the granularity or the interval between data points in seconds.
  • {4} is translator specific and in the case of InfluxDB it will be used for retention_periods.
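
The following minimal sketch illustrates how such a query template is expanded; the template is taken from the InfluxDB example further down, and the variable names and values are purely illustrative, not Watcher's internal ones.

# Minimal sketch of how a configured query template is expanded with
# .format; variable names and values are illustrative only.
query_template = (
    'SELECT {0}("{0}_value") FROM "vmstats" WHERE host =~ /^{1}$/ AND '
    '"type_instance" =~ /^mem$/ AND time >= now() - {2}s GROUP BY host'
)

aggregate = "mean"          # {0}: aggregate function
attribute = "autoserver01"  # {1}: value of the configured attribute
period = 7200               # {2}: period to aggregate over, in seconds
granularity = 300           # {3}: interval between data points, in seconds
retention = "one_week"      # {4}: translator specific (InfluxDB retention period)

print(query_template.format(aggregate, attribute, period, granularity, retention))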

InfluxDB

Constructing the queries, or rather anticipating how the results should look in order to be correctly interpreted by Watcher, can be a challenge. The following json example demonstrates what the result should look like and the query used to get this result.

{
"results": [
    {
        "statement_id": 0,
        "series": [
            {
                "name": "vmstats",
                "tags": {
                    "host": "autoserver01"
                },
                "columns": [
                    "time",
                    "mean"
                ],
                "values": [
                    [
                        1560848284284,
                        7680000
                    ]
                ]
            }
        ]
    }
]
}
SELECT {0}("{0}_value") FROM "vmstats" WHERE host =~ /^{1}$/ AND
"type_instance" =~ /^mem$/ AND time >= now() - {2}s GROUP BY host

Example configuration

The example configurations show how to achieve the entire configuration in the config file as well as how to use a combination of the regular file and yaml. Using yaml to define all the parameters for each metric is recommended since it has better human readability and supports multi-line option definitions.

Configuration file

It is important to note that the line breaks shown within the assignments of parameters can not be used in the actual configuration; they are only here for readability reasons.

[grafana_client]
# Authentication token to gain access (string value)
# Note: This option can be changed without restarting.
token = eyJrIjoiT0tTcG1pUlY2RnVKZTFVaDFsNFZXdE9ZWmNrMkZYbk==

# first part of the url (including https:// or http://) up until project id
# part. Example: https://secure.org/api/datasource/proxy/ (string value)
# Note: This option can be changed without restarting.
base_url = https://monitoring-grafana.com/api/datasources/proxy/

# Project id as in url (integer value)
# Note: This option can be changed without restarting.
project_id_map = host_cpu_usage:1337,host_ram_usage:6969,
instance_cpu_usage:1337,instance_ram_usage:9696

# Mapping of grafana databases to datasource metrics. (dict value)
# Note: This option can be changed without restarting.
database_map = host_cpu_usage:monit_production,
host_ram_usage:monit_production,instance_cpu_usage:prod_cloud,
instance_ram_usage:prod_cloud

translator_map = host_cpu_usage:influxdb,host_ram_usage:influxdb,
instance_cpu_usage:influxdb,instance_ram_usage:influxdb

attribute_map = host_cpu_usage:hostname,host_ram_usage:hostname,
instance_cpu_usage:name,instance_ram_usage:name

query_map = host_cpu_usage:SELECT 100-{0}("{0}_value") FROM {4}.cpu WHERE
("host" =~ /^{1}$/ AND "type_instance" =~/^idle$/ AND time > now()-{2}s),
host_ram_usage:SELECT {0}("{0}_value")/1000000 FROM {4}.memory WHERE
 ("host" =~ /^{1}$/) AND "type_instance" =~ /^used$/ AND time >= now()-{2}s
 GROUP BY "type_instance",instance_cpu_usage:SELECT {0}("{0}_value") FROM
 "vmstats" WHERE host =~ /^{1}$/ AND "type_instance" =~ /^cpu$/ AND time >=
 now() - {2}s GROUP BY host,instance_ram_usage:SELECT {0}("{0}_value") FROM
 "vmstats" WHERE host =~ /^{1}$/ AND "type_instance" =~ /^mem$/ AND time >=
 now() - {2}s GROUP BY host

[grafana_translators]

retention_periods = one_week:10080,one_month:302400,five_years:525600

[watcher_datasources]
datasources = grafana

yaml

When using the yaml configuration file some parameters, such as the path to the yaml file, still need to be defined using the regular configuration file. These parameters are detailed below:

[grafana_client]
token = eyJrIjoiT0tTcG1pUlY2RnVKZTFVaDFsNFZXdE9ZWmNrMkZYbk==

base_url = https://monitoring-grafana.com/api/datasources/proxy/

[watcher_datasources]
datasources = grafana

[watcher_decision_engine]
metric_map_path = /etc/watcher/metric_map.yaml

Using the yaml allows the parameters per metric to be defined more effectively and with greater human readability due to the availability of multi-line options. These multi-line options are demonstrated in the query parameters.

grafana:
  host_cpu_usage:
    project: 1337
    db: monit_production
    translator: influxdb
    attribute: hostname
    query: >
        SELECT 100-{0}("{0}_value") FROM {4}.cpu
        WHERE ("host" =~ /^{1}$/ AND "type_instance" =~/^idle$/ AND
        time > now()-{2}s)
  host_ram_usage:
    project: 6969
    db: monit_production
    translator: influxdb
    attribute: hostname
    query: >
        SELECT {0}("{0}_value")/1000000 FROM {4}.memory WHERE
         ("host" =~ /^{1}$/) AND "type_instance" =~ /^used$/ AND time >=
         now()-{2}s GROUP BY "type_instance"
  instance_cpu_usage:
    project: 1337
    db: prod_cloud
    translator: influxdb
    attribute: name
    query: >
        SELECT {0}("{0}_value") FROM
         "vmstats" WHERE host =~ /^{1}$/ AND "type_instance" =~ /^cpu$/ AND
         time >= now() - {2}s GROUP BY host
  instance_ram_usage:
    project: 9696
    db: prod_cloud
    translator: influxdb
    attribute: name
    query: >
        SELECT {0}("{0}_value") FROM
         "vmstats" WHERE host =~ /^{1}$/ AND "type_instance" =~ /^mem$/ AND
         time >= now() - {2}s GROUP BY host

  [1] A base class for projects is available and easily extensible.