This change converts alert expressions which relied on instant vectors
to use range aggregate functions instead - For just the 'basic_linux'
rules.
Change-Id: I30d6ab71d747b297f522bbeb12b8f4dbfce1eefe
Co-Authored-By: Meghan Heisler <mkheisler93@gmail.com>
This change updates the prometheus alerting rules to use ranged vectors
in their expressions, to avoid situations wher missed scrapes would
cause scalar metrics to "go stale" - resetting the alert timer.
Only the ceph alerts are affected by this change.
Change-Id: Ib47866d12616aaa808e6a09c58aa4352e338a152
Co-Authored-By: Meghan Heisler <mkheisler93@gmail.com>
This change converts alert expressions which relied on instant vectors
to use range aggregate functions instead.
Change-Id: I4df757f961524bed23b6a6ad361779c1749ca2c5
Co-Authored-By: Meghan Heisler <mkheisler93@gmail.com>
This change adds a means of introducing new storage classes
and local persistent volumes.
Change-Id: I340c75f3d0a1678f3149f3cf62e4ab104823cc49
Co-Authored-By: Steven Fitzpatrick <steven.fitzpatrick@att.com>
This updates the deployment scripts for Prometheus to leverage the
feature gate functionality rather than bash generation of the list
of override files to use for alerting rules
Change-Id: Ie497ae930f7cc4db690a4ddc812a92e4491cde93
Signed-off-by: Steve Wilkerson <sw5822@att.com>
I noticed a some nagios service checks were checking prometheus
alerts which did not exist in our default prometheus configuration.
In one case a prometheus alert did not match the naming convention
of similar alerts.
One nagios service check, ceph_monitor_clock_skew_high, does not
have a corresponding alert at all, so I've changed it to check the
node_ntmp_clock_skew_high
alert, where a node has the label ceph-mon="enabled".
Change-Id: I2ebf9a4954190b8e2caefc8a61270e28bf24d9fa
This updates the Prometheus chart to support federation. This
moves to defining the Prometheus configuration file via a template
in the values.yaml file instead of through raw yaml. This allows
for overriding the chart's default configuration wholesale, as
this would be required for a hierarchical federated setup. This
also strips out all of the default rules defined in the chart for
the same reason. There are example rules defined for the various
aspects of OSH's infrastructure in the prometheus/values_overrides
directory that are executed as part of the normal CI jobs. This
also adds a nonvoting federated-monitoring job that vets out the
ability to federate prometheus in a hierarchical fashion with
extremely basic overrides
Change-Id: I0f121ad5e4f80be4c790dc869955c6b299ca9f26
Signed-off-by: Steve Wilkerson <sw5822@att.com>