::

  This work is licensed under a Creative Commons Attribution 3.0
  Unported License.
  http://creativecommons.org/licenses/by/3.0/legalcode

=====================
Test Metrics Database
=====================

https://storyboard.openstack.org/#!/story/156

Using subunit2sql, store data regarding gating test runs in a SQL database to
enable longer-term analytics on testing in the gate. Also, construct a
dashboard for the stored data to visualize gating job trends over the period
of time for which there is data in the DB.

Problem Description
===================

Right now we capture test run artifacts and archive them for roughly one
release cycle on logs.o.o. In addition, logstash provides 10 days of
API-queryable information about the test runs. This has proved invaluable both
for ascertaining the current status of the gate and for debugging failures.
However, because no long-term data is available, we are unable to perform
analysis to characterize the long-term trends in our test suite.

This specifically becomes an issue when we want to optimize how we're using
tempest in the gate. For example, trimming down the tests we gate on to stay
within a timing quota would be very difficult, because we don't have an API we
can use to figure out which tests never fail or which tests take the longest
to run on average.

Proposed Change
===============

Using the subunit2sql library, which was recently added to openstack-infra, a
new post-processing job (similar to how logs are processed for logstash) is
added to read the subunit file from each run and push it to a SQL DB set up
for storing the data. The SQL DB should be put on a separate server, possibly
using Trove to provision it. Having the data available in a DB allows for API
access to enable interesting analytics to be performed. However, to provide an
environment conducive to developing tooling around this analysis, public
read-only access to the DB will be needed; a tcp proxy, like simpleproxy, will
be used to enable connectivity to the database, and a "public" set of
credentials will be created which only has read permissions on the database.
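
As a rough illustration, the post-processing step for a single run could look
something like the sketch below. The connection string, file path, and exact
subunit2sql invocation are assumptions for illustration, not the final job
configuration::

  # Sketch of a post-processing step that feeds one run's subunit results
  # into the metrics DB. Paths, credentials, and CLI flags are assumptions.
  import subprocess

  # Read-write credentials used only by the post-processing workers.
  DB_CONNECTION = 'mysql://subunit_rw:secret@metrics-db.example.org/subunit'


  def push_subunit_to_db(subunit_path):
      """Feed a single subunit result file to subunit2sql."""
      subprocess.check_call([
          'subunit2sql',
          '--database-connection', DB_CONNECTION,
          subunit_path,
      ])


  if __name__ == '__main__':
      # In the real worker the path would come from the event emitted for a
      # finished gate job, much like the logstash log-processing workers.
      push_subunit_to_db('testrepository.subunit')
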
On top of the DB a new dashboard will be created that visualizes the stored
data. Its basic functionality will be to show an analysis of the performance
and stability of individual tests over time, as well as the broad long-term
trends of the jobs over the whole stored history (i.e. graphs of test success
and failure counts, and total run time for each run type over the whole stored
history).
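
For example, the dashboard could use the public read-only credentials exposed
through the tcp proxy to run queries such as the one sketched below. The host
name, credentials, and the column names on the ``tests`` table are assumptions
about the subunit2sql schema and deployment, shown only to illustrate the kind
of analysis the DB enables::

  # Sketch: list the ten slowest tests on average along with their failure
  # rates, using assumed read-only credentials and assumed column names.
  import sqlalchemy

  READ_ONLY_CONNECTION = 'mysql://query:query@subunit-db.example.org/subunit'

  engine = sqlalchemy.create_engine(READ_ONLY_CONNECTION)
  query = sqlalchemy.text("""
      SELECT test_id,
             run_time,
             run_count,
             failure / run_count AS failure_rate
      FROM tests
      WHERE run_count > 0
      ORDER BY run_time DESC
      LIMIT 10
  """)

  with engine.connect() as conn:
      for row in conn.execute(query):
          print('%s: avg %.2fs, %.1f%% failures' % (
              row.test_id, row.run_time, 100 * row.failure_rate))
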
Additionally, storing the test timing data in a database will allow us to
eventually use the timing data with testr to perform scheduler optimizations.
Pre-seeding the timing data into testr is something which has been discussed
previously, but there was not a good method of doing this before. Using the
data in the subunit2sql database, a script for nodepool can be created to
generate a subunit stream which can be loaded into testrepository during
image creation.
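
A sketch of such a script is shown below, assuming the per-test average run
times have already been fetched from the database (the ``fetch_average_times``
helper is a placeholder for that query, and the test names are hypothetical).
It uses the subunit v2 API to emit a stream of timed results which ``testr
load`` can then import during the image build::

  # Sketch: turn per-test average run times into a subunit v2 stream so the
  # testr scheduler starts with realistic timing data on fresh images.
  import datetime
  import sys

  import subunit.v2


  def fetch_average_times():
      # Placeholder: the real script would query the subunit2sql database.
      return {
          'tempest.api.compute.servers.test_list_servers': 4.2,
          'tempest.api.identity.test_tokens': 0.8,
      }


  def write_timing_stream(timings, stream=sys.stdout.buffer):
      output = subunit.v2.StreamResultToBytes(stream)
      output.startTestRun()
      start = datetime.datetime.now(datetime.timezone.utc)
      for test_id, seconds in timings.items():
          stop = start + datetime.timedelta(seconds=seconds)
          output.status(test_id=test_id, test_status='inprogress',
                        timestamp=start)
          output.status(test_id=test_id, test_status='success',
                        timestamp=stop)
      output.stopTestRun()


  if __name__ == '__main__':
      write_timing_stream(fetch_average_times())
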
Alternatives
------------

An alternative approach for the data collection would be to build a log
scraping tool that collects the required information from the archived logs
instead of from the subunit file directly. However, that seems like an
error-prone and less efficient approach.

As an alternative to using a SQL DB, a different data storage mechanism could
be used. For example Hadoop, a NoSQL DB, and Graphite have all been considered
as alternatives. However, because the parts of the subunit data we need are
already highly structured, using SQL makes sense for the storage. Using SQL
also enables us to exploit other tooling to make the task simpler.

Implementation
==============

Assignee(s)
-----------

Matthew Treinish <mtreinish@kortar.org>

Clark Boylan <cboylan@sapwetik.org>

Work Items
----------

* Add a server for the SQL DB using trove and set up the subunit2sql schema
  on it
* Add post-processing workers (similar to how logstash does it) to parse the
  subunit output from each run and push it to the DB using subunit2sql
* Configure a proxy to enable "public" access to the database
* Write a dashboard script to query the database and visualize the metrics we
  want from the database
* Start running the dashboard on top of the DB

Repositories
------------

The only required repository is subunit2sql, which has already been created.
However, the code for the dashboard will need to be stored somewhere; it will
likely be too specific to the CI environment to belong in the subunit2sql
repo. Depending on its size we can keep it directly in openstack-infra/config.
However, if it ends up being sufficiently large we might need to create a
separate repository to store it.

Servers
-------

To enable this a new SQL database is required. While a new server is not
strictly required, based on the estimated load caused by running subunit2sql
for each test run in the gate it probably makes the most sense. It makes sense
to use trove here to spin up the database. Another consideration is enabling
public read access to the database, which will be needed to allow development
on top of the data, as well as to incorporate the data into the testr
scheduler on gate slaves.

DNS Entries
-----------

A DNS entry could be created for the new DB server so that we can have the
post-processing job send the results to it. However, since it's a trove
instance and only has a private network, this might not be explicitly needed.
Additionally, the dashboard view will need to be given an address; however, a
separate DNS entry probably isn't required for that, since a new path on an
existing webserver (e.g. status.o.o/test-stats) is probably sufficient.

Documentation
-------------

subunit2sql is lacking substantial documentation right now. This documentation
will include both operational instructions for subunit2sql and a DB API and
schema guide. Additionally, a new doc for setting up the test metric workflow
should also be added.

Security
--------

There shouldn't be any additional security implications besides the risks
associated with running a new server and SQL DB. Care also needs to be taken
when storing the credentials that the post-processing jobs use to access the
DB server. Setting up a tcp proxy to allow public read access to the DB does
open up a new potential attack entry point.

Testing
-------

There shouldn't be any additional testing required. Both subunit2sql and
the dashboard should have their own unit testing.

Dependencies
============

- This BP will be the first use of subunit2sql in the infra workflow
- A tcp proxy will also need to be installed and configured