neutron/doc/source/policies/gate-failure-triage.rst
Kyle Mestery 2486842857 Move Neutron Policy pages into the tree
The information from the NeutronPolicies wiki page [1] is useful. But
having it in a static wiki is less so. Moving this into the Neutron
documentation in-tree will allow for an easier maintenance of this
information.

[1] https://wiki.openstack.org/wiki/NeutronPolicies

Change-Id: I23cada04031e00f48c0f6dc26a992ed204195526
2015-03-05 15:53:21 -05:00

70 lines
4.3 KiB
ReStructuredText

Neutron Gate Failure Triage
===========================
This page provides guidelines for spotting and assessing neutron gate failures. Some hints for triaging
failures are also provided.
Spotting Gate Failures
----------------------
This can be achieved using several tools:
* `Joe Gordon's github.io pages <http://jogo.github.io/gate/>`_
* `logstash <http://logstash.openstack.org/>`_
Even though Joe's script is not an "official" OpenStack page it provides a quick snapshot of the current
status for the most important jobs This page is built using data available at graphite.openstack.org.
If you want to check how that is done go `here <https://github.com/jogo/jogo.github.io/tree/master/gate>`_
(caveat: the color of the neutron job is very similar to that of the full job with nova-network).
For checking gate failures with logstash the following query will return failures for a specific job:
> build_status:FAILURE AND message:Finished AND build_name:"check-tempest-dsvm-neutron" AND build_queue:"gate"
And divided by the total number of jobs executed:
> message:Finished AND build_name:"check-tempest-dsvm-neutron" AND build_queue:"gate"
It will return the failure rate in the selected period for a given job. It is important to remark that
failures in the check queue might be misleading as the problem causing the failure is most of the time in
the patch being checked. Therefore it is always advisable to work on failures occurred in the gate queue.
However, these failures are a precious resource for assessing frequency and determining root cause of
failures which manifest in the gate queue.
The step above will provide a quick outlook of where things stand. When the failure rate raises above 10% for
a job in 24 hours, it's time to be on alert. 25% is amber alert. 33% is red alert. Anything above 50% means
that probably somebody from the infra team has already a contract out on you. Whether you are relaxed, in
alert mode, or freaking out because you see a red dot on your chest, it is always a good idea to check on
daily bases the elastic-recheck pages.
Under the `gate pipeline <http://status.openstack.org/elastic-recheck/gate.html>`_ tab, you can see gate
failure rates for already known bugs. The bugs in this page are ordered by decreasing failure rates (for the
past 24 hours). If one of the bugs affecting Neutron is among those on top of that list, you should check
that the corresponding bug is already assigned and somebody is working on it. If not, and there is not a good
reason for that, it should be ensured somebody gets a crack at it as soon as possible. The other part of the
story is to check for `uncategorized <http://status.openstack.org/elastic-recheck/data/uncategorized.html>`_
failures. This is where failures for new (unknown) gate breaking bugs end up; on the other hand also infra
error causing job failures end up here. It should be duty of the diligent Neutron developer to ensure the
classification rate for neutron jobs is as close as possible to 100%. To this aim, the diligent Neutron
developer should adopt the following procedure:
1. Open logs for failed jobs and look for logs/testr_results.html.gz.
2. If that file is missing, check console.html and see where the job failed.
1. If there is a failure in devstack-gate-cleanup-host.txt it's likely to be an infra issue.
2. If the failure is in devstacklog.txt it could a devstack, neutron, or infra issue.
3. However, most of the time the failure is in one of the tempest tests. Take note of the error message and go to
logstash.
4. On logstash, search for occurrences of this error message, and try to identify the root cause for the failure
(see below).
5. File a bug for this failure, and push a elastic-recheck query for it (see below).
6. If you are confident with the area of this bug, and you have time, assign it to yourself; otherwise look for an
assignee or talk to the Neutron's bug czar to find an assignee.
Root Causing a Gate Failure
---------------------------
Time-based identification, i.e. find the naughty patch by log scavenging.
Filing An Elastic Recheck Query
-------------------------------
The `elastic recheck <http://status.openstack.org/elastic-recheck/>`_ page has all the current open ER queries.
To file one, please see the `ER Wiki <https://wiki.openstack.org/wiki/ElasticRecheck>`_.