Merge "clean up readme"
This commit is contained in:
commit
ce81aa0dbf
89
README.rst
89
README.rst
@ -1,6 +1,6 @@
|
||||
===============================
|
||||
===============
|
||||
elastic-recheck
|
||||
===============================
|
||||
===============
|
||||
|
||||
"Use ElasticSearch to classify OpenStack gate failures"
|
||||
|
||||
@ -8,27 +8,28 @@ elastic-recheck
|
||||
|
||||
Idea
|
||||
----
|
||||
Identifying the specific bug that is causing a transient error in the gate
|
||||
is very hard. Just identifying which tempest test failed is not enough
|
||||
because a single bug can potentially cause multiple tempest tests to fail.
|
||||
If we can find a fingerprint for a specific bug using logs, then we can use
|
||||
ElasticSearch to automatically detect any occurrences of the bug.
|
||||
|
||||
Identifying the specific bug that is causing a transient error in the gate is
|
||||
difficult. Just identifying which tempest test failed is not enough because a
|
||||
single tempest test can fail due to any number of underlying bugs. If we can
|
||||
find a fingerprint for a specific bug using logs, then we can use ElasticSearch
|
||||
to automatically detect any occurrences of the bug.
|
||||
|
||||
Using these fingerprints elastic-recheck can:
|
||||
|
||||
* Search ElasticSearch for all occurrences of a bug.
|
||||
* Identify bug trends such as: when it started, is the bug fixed, is it
|
||||
getting worse, etc.
|
||||
* Identify bug trends such as: when it started, is the bug fixed, is it getting
|
||||
worse, etc.
|
||||
* Classify bug failures in real time and report back to gerrit if we find a
|
||||
match, so a patch author knows why the test failed.
|
||||
|
||||
queries/
|
||||
--------
|
||||
|
||||
All queries are stored in separate yaml files in a queries directory
|
||||
at the top of the elastic-recheck code base. The format of these files
|
||||
is ######.yaml (where ###### is the launchpad bug number), the yaml should have
|
||||
a ``query`` keyword which is the query text for elastic search.
|
||||
All queries are stored in separate yaml files in a queries directory at the top
|
||||
of the elastic-recheck code base. The format of these files is ######.yaml
|
||||
(where ###### is the launchpad bug number), the yaml should have a ``query``
|
||||
keyword which is the query text for elastic search.
|
||||
|
||||
Guidelines for good queries:
|
||||
|
||||
@ -36,60 +37,60 @@ Guidelines for good queries:
|
||||
filename query is typically better than a console one, as that's matching a
|
||||
deep failure versus a surface symptom.
|
||||
|
||||
- Queries should not return any hits for successful jobs, this is a
|
||||
sign the query isn't specific enough. A rule of thumb is > 10% success hits
|
||||
probably means this isn't good enough.
|
||||
- Queries should not return any hits for successful jobs, this is a sign the
|
||||
query isn't specific enough. A rule of thumb is > 10% success hits probably
|
||||
means this isn't good enough.
|
||||
|
||||
- If it's impossible to build a query to target a bug, consider patching the
|
||||
upstream program to be explicit when it fails in a particular way.
|
||||
|
||||
- Use the 'tags' field rather than the 'filename' field for filtering. This is
|
||||
primarily because of grenade jobs where the same log file shows up in the
|
||||
'old' and 'new' side of the grenade job. For example, tags:"screen-n-cpu.txt"
|
||||
will query in logs/old/screen-n-cpu.txt and logs/new/screen-n-cpu.txt. The
|
||||
tags:"console" filter is also used to query in console.html as well as
|
||||
tempest and devstack logs.
|
||||
'old' and 'new' side of the grenade job. For example,
|
||||
``tags:"screen-n-cpu.txt"`` will query in ``logs/old/screen-n-cpu.txt`` and
|
||||
``logs/new/screen-n-cpu.txt``. The ``tags:"console"`` filter is also used to
|
||||
query in ``console.html`` as well as tempest and devstack logs.
|
||||
|
||||
- Avoid the use of wildcards in queries since they can put an undue burden on
|
||||
the query engine. A common case where wildcards are used and shouldn't be are
|
||||
in querying against a specific set of build_name fields,
|
||||
e.g. gate-nova-python26 and gate-nova-python27.
|
||||
Rather than use build_name:gate-nova-python*, list the jobs with an OR, e.g.:
|
||||
|
||||
::
|
||||
in querying against a specific set of ``build_name`` fields, e.g.
|
||||
``gate-nova-python26`` and ``gate-nova-python27``. Rather than use
|
||||
``build_name:gate-nova-python*``, list the jobs with an ``OR``. For example::
|
||||
|
||||
(build_name:"gate-nova-python26" OR build_name:"gate-nova-python27")
|
||||
|
||||
In order to support rapidly added queries, it's considered socially
|
||||
acceptable to +A changes that only add 1 new bug query, and to even
|
||||
self approve those changes by core reviewers.
|
||||
|
||||
In order to support rapidly added queries, it's considered socially acceptable
|
||||
to approve changes that only add 1 new bug query, and to even self approve
|
||||
those changes by core reviewers.
|
||||
|
||||
Adding Bug Signatures
|
||||
---------------------
|
||||
|
||||
Most transient bugs seen in gate are not bugs in tempest associated
|
||||
with a specific tempest test failure, but rather some sort of issue
|
||||
further down the stack that can cause many tempest tests to fail.
|
||||
Most transient bugs seen in gate are not bugs in tempest associated with a
|
||||
specific tempest test failure, but rather some sort of issue further down the
|
||||
stack that can cause many tempest tests to fail.
|
||||
|
||||
#. Given a transient bug that is seen during the gate, go through the
|
||||
logs (logs.openstack.org) and try to find a log that is associated
|
||||
with the failure. The closer to the root cause the better.
|
||||
#. Given a transient bug that is seen during the gate, go through `the logs
|
||||
<http://logs.openstack.org/>`_ and try to find a log that is associated with
|
||||
the failure. The closer to the root cause the better.
|
||||
|
||||
Note that queries can only be written against INFO level and higher log
|
||||
messages. This is by design to not overwhelm the search cluster.
|
||||
|
||||
#. Go to logstash.openstack.org and create an elastic search query to
|
||||
find the log message from step 1. To see the possible fields to
|
||||
search on click on an entry. Lucene query syntax is available at
|
||||
http://lucene.apache.org/core/4_0_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description
|
||||
#. Go to `logstash.openstack.org <http://logstash.openstack.org/>`_ and create
|
||||
an elastic search query to find the log message from step 1. To see the
|
||||
possible fields to search on click on an entry. Lucene query syntax is
|
||||
available at `lucene.apache.org
|
||||
<http://lucene.apache.org/core/4_0_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description>`_.
|
||||
|
||||
#. Add a comment to the bug with the query you identified and a link to
|
||||
the logstash url for that query search.
|
||||
#. Add the query to ``elastic-recheck/queries/BUGNUMBER.yaml`` and push
|
||||
the patch up for review.
|
||||
https://git.openstack.org/cgit/openstack-infra/elastic-recheck/tree/queries
|
||||
#. Tag your commit with a ``Related-Bug`` tag in the footer, or add a comment
|
||||
to the bug with the query you identified and a link to the logstash URL for
|
||||
that query search.
|
||||
|
||||
#. Add the query to ``elastic-recheck/queries/BUGNUMBER.yaml``
|
||||
(All queries can be found on `git.openstack.org
|
||||
<https://git.openstack.org/cgit/openstack-infra/elastic-recheck/tree/queries>`_)
|
||||
and push the patch up for review.
|
||||
|
||||
Future Work
|
||||
------------
|
||||
|
Loading…
Reference in New Issue
Block a user