::

  This work is licensed under a Creative Commons Attribution 3.0
  Unported License.
  http://creativecommons.org/licenses/by/3.0/legalcode

=====================
Test Metrics Database
=====================

https://storyboard.openstack.org/#!/story/156

Using subunit2sql, store data regarding gating test runs in a SQL database to
enable longer-term analytics on testing in the gate. Also, construct a
dashboard for the stored data to visualize gating job trends over the period
of time for which there is data in the DB.

Problem Description
===================

Right now we capture test run artifacts and archive them for roughly one
release cycle on logs.o.o. In addition, logstash provides 10 days of
API-queryable information about the test runs. This has proved invaluable both
for ascertaining the current status of the gate and for debugging failures.
However, because no long-term data is available, we are unable to perform
analysis to characterize the long-term trends in our test suite.

This specifically becomes an issue when we want to optimize how we're using
tempest in the gate. For example, trimming down the tests we gate on to stay
within a timing quota would be very difficult, because we don't have an API we
can use to figure out which tests never fail or which tests take the longest
to run on average.

Proposed Change
===============

Using the subunit2sql library, which was recently added to openstack-infra, a
new post-processing job (similar to how logs are processed for logstash) is
added to read the subunit file from each run and push it to a SQL DB set up
for storing the data. The SQL DB should be put on a separate server, possibly
using Trove to provision it. Having the data available in a DB allows for API
access to enable interesting analytics to be performed. However, to provide an
environment conducive to developing tooling around this analysis, public
read-only access to the DB will be needed; a tcp proxy, like simpleproxy, will
be used to enable connectivity to the database, and a "public" set of
credentials will be created which only has read permissions on the database.
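
As a rough illustration, the post-processing step for a single run could look
something like the sketch below. The connection string, file path, and exact
subunit2sql invocation are assumptions for illustration, not the final job
configuration::

  # Sketch of a post-processing step that feeds one run's subunit results
  # into the metrics DB. Paths, credentials, and CLI flags are assumptions.
  import subprocess

  # Read-write credentials used only by the post-processing workers.
  DB_CONNECTION = 'mysql://subunit_rw:secret@metrics-db.example.org/subunit'


  def push_subunit_to_db(subunit_path):
      """Feed a single subunit result file to subunit2sql."""
      subprocess.check_call([
          'subunit2sql',
          '--database-connection', DB_CONNECTION,
          subunit_path,
      ])


  if __name__ == '__main__':
      # In the real worker the path would come from the event emitted for a
      # finished gate job, much like the logstash log-processing workers.
      push_subunit_to_db('testrepository.subunit')
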
On top of the DB a new dashboard will be created that visualizes the stored
data. Its basic functionality will be to show an analysis of the performance
and stability of individual tests over time, as well as the broad long-term
trends of the jobs over the whole stored history (i.e. graphs of test success
and failure counts, and total run time for each run type over the whole stored
history).
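
For example, the dashboard could use the public read-only credentials exposed
through the tcp proxy to run queries such as the one sketched below. The host
name, credentials, and the column names on the ``tests`` table are assumptions
about the subunit2sql schema and deployment, shown only to illustrate the kind
of analysis the DB enables::

  # Sketch: list the ten slowest tests on average along with their failure
  # rates, using assumed read-only credentials and assumed column names.
  import sqlalchemy

  READ_ONLY_CONNECTION = 'mysql://query:query@subunit-db.example.org/subunit'

  engine = sqlalchemy.create_engine(READ_ONLY_CONNECTION)
  query = sqlalchemy.text("""
      SELECT test_id,
             run_time,
             run_count,
             failure / run_count AS failure_rate
      FROM tests
      WHERE run_count > 0
      ORDER BY run_time DESC
      LIMIT 10
  """)

  with engine.connect() as conn:
      for row in conn.execute(query):
          print('%s: avg %.2fs, %.1f%% failures' % (
              row.test_id, row.run_time, 100 * row.failure_rate))
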
Additionally, storing the test timing data in a database will allow us to
eventually use the timing data with testr to perform scheduler optimizations.
Pre-seeding the timing data into testr is something which has been discussed
previously, but there was not a good method of doing this before. Using the
data in the subunit2sql database, a script for nodepool can be created to
generate a subunit stream which can be loaded into testrepository during
image creation.
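
A sketch of such a script is shown below, assuming the per-test average run
times have already been fetched from the database (the ``fetch_average_times``
helper is a placeholder for that query, and the test names are hypothetical).
It uses the subunit v2 API to emit a stream of timed results which ``testr
load`` can then import during the image build::

  # Sketch: turn per-test average run times into a subunit v2 stream so the
  # testr scheduler starts with realistic timing data on fresh images.
  import datetime
  import sys

  import subunit.v2


  def fetch_average_times():
      # Placeholder: the real script would query the subunit2sql database.
      return {
          'tempest.api.compute.servers.test_list_servers': 4.2,
          'tempest.api.identity.test_tokens': 0.8,
      }


  def write_timing_stream(timings, stream=sys.stdout.buffer):
      output = subunit.v2.StreamResultToBytes(stream)
      output.startTestRun()
      start = datetime.datetime.now(datetime.timezone.utc)
      for test_id, seconds in timings.items():
          stop = start + datetime.timedelta(seconds=seconds)
          output.status(test_id=test_id, test_status='inprogress',
                        timestamp=start)
          output.status(test_id=test_id, test_status='success',
                        timestamp=stop)
      output.stopTestRun()


  if __name__ == '__main__':
      write_timing_stream(fetch_average_times())
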
Alternatives
------------

An alternative approach for the data collection would be to build a log
scraping tool that collects the required information from the archived logs
instead of from the subunit file directly. However, that seems like an
error-prone and less efficient approach.

As an alternative to using a SQL DB, a different data storage mechanism could
be used. For example Hadoop, a NoSQL DB, and Graphite have all been considered
as alternatives. However, because the parts of the subunit data we need are
already highly structured, using SQL makes sense for the storage. Using SQL
also enables us to exploit other tooling to make the task simpler.

Implementation
==============

Assignee(s)
-----------

Matthew Treinish <mtreinish@kortar.org>

Clark Boylan <cboylan@sapwetik.org>

Work Items
----------

* Add a server for the SQL DB using trove and set up the subunit2sql schema
  on it
* Add post-processing workers (similar to how logstash does it) to parse the
  subunit output from each run and push it to the DB using subunit2sql
* Configure a proxy to enable "public" access to the database
* Write a dashboard script to query the database and visualize the metrics we
  want from the database
* Start running the dashboard on top of the DB

Repositories
------------

The only required repository is subunit2sql, which has already been created.
However, the code for the dashboard will need to be stored somewhere; it will
likely be too specific to the CI environment to belong in the subunit2sql
repo. Depending on its size we can keep it directly in openstack-infra/config.
However, if it ends up being sufficiently large we might need to create a
separate repository to store it.

Servers
-------

To enable this a new SQL database is required. While a new server is not
strictly required, based on the estimated load caused by running subunit2sql
for each test run in the gate it probably makes the most sense. It makes sense
to use trove here to spin up the database. Another consideration is enabling
public read access to the database, which will be needed to allow development
on top of the data, as well as to incorporate the data into the testr
scheduler on gate slaves.

DNS Entries
-----------

A DNS entry could be created for the new DB server so that we can have the
post-processing job send the results to it. However, since it's a trove
instance and only has a private network, this might not be explicitly needed.
Additionally, the dashboard view will need to be given an address; however, a
separate DNS entry probably isn't required for that, since a new path on an
existing webserver (e.g. status.o.o/test-stats) is probably sufficient.

Documentation
-------------

subunit2sql is lacking substantial documentation right now. This documentation
will include both operational instructions for subunit2sql and a DB API and
schema guide. Additionally, a new doc for setting up the test metric workflow
should also be added.

Security
--------

There shouldn't be any additional security implications besides the risks
associated with running a new server and SQL DB. Care also needs to be taken
when storing the credentials that the post-processing jobs use to access the
DB server. Setting up a tcp proxy to allow public read access to the DB does
open up a new potential attack entry point.

Testing
-------

There shouldn't be any additional testing required. Both subunit2sql and
the dashboard should have their own unit testing.

Dependencies
============

- This BP will be the first use of subunit2sql in the infra workflow
- A tcp proxy will also need to be installed and configured