137 Commits

Author SHA1 Message Date
Andrii Ostapenko
731a6b4cfa Enable yamllint checks
- document-end
- document-start
- empty-lines
- hyphens
- indentation
- key-duplicates
- new-line-at-end-of-file
- new-lines
- octal-values

with corresponding code adjustment.

Change-Id: I92d6aa20df82aa0fe198f8ccd535cfcaf613f43a
2020-05-29 19:49:05 +00:00
Andrii Ostapenko
67d1409a74 Enable yamllint checks
- brackets
- braces
- colon
- commas

with corresponding code adjustment.

Change-Id: I8d294cfa8f358431bee6ecb97396dae66f955b86
2020-05-21 14:04:23 +00:00
diwakar thyagaraj
163c5aa780 Enable Apparmor to all osh-infra test pods
Also changed container names to static.

Change-Id: I51f53b480d18aaa38a9707429f01052ee122e7e9
Signed-off-by: diwakar thyagaraj <diwakar.chitoor.thyagaraj@att.com>
2020-05-19 15:36:07 +00:00
Zuul
e53d28718d Merge "Remove OSH Authors copyright" 2020-05-12 20:00:38 +00:00
diwakar thyagaraj
64ac469eb6 Enable Apparmor to Prometheus-init-containers
Change-Id: Ibea27338437c9c039b10bff02a28d60d3f5cf4b1
Signed-off-by: diwakar thyagaraj <diwakar.chitoor.thyagaraj@att.com>
2020-05-08 17:24:54 +00:00
Gage Hugo
d14d826b26 Remove OSH Authors copyright
The current copyright refers to a non-existent group
"openstack helm authors" with often out-of-date references that
are confusing when adding a new file to the repo.

This change removes all references to this copyright by the
non-existent group and any blank lines underneath.

Change-Id: I1882738cf9757c5350a8533876fd37b5920b5235
2020-05-07 02:11:15 +00:00
Zuul
01aa16620b Merge "Prometheus: Status Alerts Scalar/Vector Conversion" 2020-02-18 17:35:43 +00:00
Zuul
57ad8ad603 Merge "Prometheus: Ceph Alerts Scalar/Vector Conversion" 2020-02-18 17:35:42 +00:00
Zuul
3c7a9de243 Merge "Prometheus: Node Alerts Scalar/Vector Conversion" 2020-02-18 17:29:48 +00:00
dt241s@att.com
8bd4a2624a [FIX] Add apparmor to prometheus.
This also fixes Elasticsearch apparmor Jobs.

Change-Id: I8f2a9aa12beffe3ca394a2e9dd00aba7e5292f29
2020-02-14 23:13:38 +00:00
Steven Fitzpatrick
a41262e459 Prometheus: Node Alerts Scalar/Vector Conversion
This change converts alert expressions which relied on instant vectors
to use range aggregate functions instead, for just the 'basic_linux'
rules.

Change-Id: I30d6ab71d747b297f522bbeb12b8f4dbfce1eefe
Co-Authored-By: Meghan Heisler <mkheisler93@gmail.com>
2020-02-11 15:14:40 +00:00
Steven Fitzpatrick
f37865d6a0 Prometheus: Ceph Alerts Scalar/Vector Conversion
This change updates the prometheus alerting rules to use ranged vectors
in their expressions, to avoid situations where missed scrapes would
cause scalar metrics to "go stale" - resetting the alert timer.

Only the ceph alerts are affected by this change.
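
A minimal sketch of the effect described above, written in Python rather
than PromQL. It models a missed scrape as None and compares how long a
breach condition holds when only the latest sample is checked versus when
a trailing window of samples is checked. The function names, the sample
values, the threshold, and the three-sample window are illustrative
assumptions; this is a simplified model, not how Prometheus evaluates
rules internally.

```python
from typing import Callable, List, Optional

Samples = List[Optional[float]]


def instant_breach(samples: Samples, i: int, threshold: float) -> bool:
    """Check only the sample at index i; a missed scrape (None) never breaches."""
    value = samples[i]
    return value is not None and value > threshold


def ranged_breach(samples: Samples, i: int, threshold: float, window: int) -> bool:
    """Check the minimum over the trailing window, skipping missed scrapes."""
    recent = [s for s in samples[max(0, i - window + 1): i + 1] if s is not None]
    return bool(recent) and min(recent) > threshold


def longest_pending(samples: Samples, breach: Callable[[int], bool]) -> int:
    """Longest run of consecutive evaluations for which the condition held."""
    longest = current = 0
    for i in range(len(samples)):
        current = current + 1 if breach(i) else 0  # any miss resets the timer
        longest = max(longest, current)
    return longest


# Hypothetical utilization samples with one missed scrape in the middle.
samples: Samples = [95.0, 96.0, None, 97.0, 98.0]
instant = longest_pending(samples, lambda i: instant_breach(samples, i, 90.0))
ranged = longest_pending(samples, lambda i: ranged_breach(samples, i, 90.0, 3))
print(instant, ranged)  # prints "2 5"
```

In this model the single-sample check restarts its run at the missed
scrape, while the windowed check keeps the condition true throughout.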

Change-Id: Ib47866d12616aaa808e6a09c58aa4352e338a152
Co-Authored-By: Meghan Heisler <mkheisler93@gmail.com>
2020-02-11 15:14:35 +00:00
Steven Fitzpatrick
d408bed90d Prometheus: Status Alerts Scalar/Vector Conversion
This change converts alert expressions which relied on instant vectors
to use range aggregate functions instead.

Change-Id: I4df757f961524bed23b6a6ad361779c1749ca2c5
Co-Authored-By: Meghan Heisler <mkheisler93@gmail.com>
2020-02-11 15:14:27 +00:00
Zuul
cc399a08ed Merge "Fix incorrect prometheus alert names in nagios" 2020-01-15 23:43:05 +00:00
Zuul
c2ece6a45a Merge "Support for local storage" 2020-01-09 23:18:16 +00:00
Smruti Soumitra Khuntia
2ac08b59b4 Support for local storage
This change adds a means of introducing new storage classes
and local persistent volumes.

Change-Id: I340c75f3d0a1678f3149f3cf62e4ab104823cc49
Co-Authored-By: Steven Fitzpatrick <steven.fitzpatrick@att.com>
2020-01-09 10:24:31 -06:00
Tin Lam
c199addf3c Update apiVersion
This patch set updates and tests the apiVersion for rbac.authorization.k8s.io
from v1beta1 to v1 in preparation for its removal in k8s 1.20.
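
A small sketch of the kind of rewrite this describes, assuming a manifest
that has already been loaded into a Python dict. The helper name and the
example ClusterRole are hypothetical and are not taken from the patch set
itself.

```python
import copy

OLD_API_VERSION = "rbac.authorization.k8s.io/v1beta1"
NEW_API_VERSION = "rbac.authorization.k8s.io/v1"


def bump_rbac_api_version(manifest: dict) -> dict:
    """Return a copy of the manifest with the RBAC apiVersion moved to v1."""
    updated = copy.deepcopy(manifest)  # leave the caller's dict untouched
    if updated.get("apiVersion") == OLD_API_VERSION:
        updated["apiVersion"] = NEW_API_VERSION
    return updated


# Hypothetical manifest used only for illustration.
role = {
    "apiVersion": OLD_API_VERSION,
    "kind": "ClusterRole",
    "metadata": {"name": "example-role"},
}
bumped = bump_rbac_api_version(role)
print(bumped["apiVersion"])  # prints "rbac.authorization.k8s.io/v1"
```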

Change-Id: I4e68db1f75ff72eee55ecec93bd59c68c179c627
Signed-off-by: Tin Lam <tin@irrational.io>
2020-01-09 08:59:48 +00:00
Steve Wilkerson
ddd5a74319 Prometheus: Add feature-gate support in deployment scripts
This updates the deployment scripts for Prometheus to leverage the
feature gate functionality rather than bash generation of the list
of override files to use for alerting rules

Change-Id: Ie497ae930f7cc4db690a4ddc812a92e4491cde93
Signed-off-by: Steve Wilkerson <sw5822@att.com>
2020-01-07 22:06:19 +00:00
Steven Fitzpatrick
4fdcff593c Fix incorrect prometheus alert names in nagios
I noticed some nagios service checks were checking prometheus
alerts which did not exist in our default prometheus configuration.
In one case a prometheus alert did not match the naming convention
of similar alerts.

One nagios service check, ceph_monitor_clock_skew_high, does not
have a corresponding alert at all, so I've changed it to check the

node_ntmp_clock_skew_high

alert, where a node has the label ceph-mon="enabled".

Change-Id: I2ebf9a4954190b8e2caefc8a61270e28bf24d9fa
2020-01-03 10:30:08 -06:00
Steve Wilkerson
fbd34421f2 Prometheus: Update chart to support federation
This updates the Prometheus chart to support federation. This
moves to defining the Prometheus configuration file via a template
in the values.yaml file instead of through raw yaml. This allows
for overriding the chart's default configuration wholesale, as
this would be required for a hierarchical federated setup. This
also strips out all of the default rules defined in the chart for
the same reason. There are example rules defined for the various
aspects of OSH's infrastructure in the prometheus/values_overrides
directory that are executed as part of the normal CI jobs. This
also adds a nonvoting federated-monitoring job that vets out the
ability to federate prometheus in a hierarchical fashion with
extremely basic overrides

Change-Id: I0f121ad5e4f80be4c790dc869955c6b299ca9f26
Signed-off-by: Steve Wilkerson <sw5822@att.com>
2019-11-21 12:39:56 +00:00
Steve Wilkerson
c1555920e5 Update podManagementPolicy for Prometheus and Alertmanager
This updates the podManagementPolicy to 'Parallel' for Prometheus
and Alertmanager, as there's no need to handle deploying these
two services in a sequential manner

Change-Id: I2f33b9651bed20c4cb2e0c477ae2227cbf9310cf
Signed-off-by: Steve Wilkerson <sw5822@att.com>
2019-11-20 21:37:55 +00:00
Steve Wilkerson
0c51a9cab8 Prometheus: Update version
This updates the Prometheus version deployed by default from
2.3.2 to 2.12.0

Change-Id: Ic10e02a6b136a7f65fb686f5ef1adf1bcf6a9a9d
Signed-off-by: Steve Wilkerson <sw5822@att.com>
2019-11-19 12:03:43 -06:00
Zuul
81d2d687c8 Merge "Make corrections to pod lifecycle upgrade values" 2019-11-01 14:10:37 +00:00
Steven Fitzpatrick
1971d23da8 Make corrections to pod lifecycle upgrade values
It was observed in some charts' values.yaml that the values defining
lifecycle upgrade parameters were incorrectly placed.

This change aims to correct these instances by adding a deployment-
type subkey corresponding with the deployment types identified in
the chart's templates dir, and indenting the values appropriately.

Change-Id: Id5437b1eeaf6e71472520f1fee91028c9b6bfdd3
2019-10-31 20:34:07 +00:00
Steven Fitzpatrick
84113626bf Fix Prometheus Volume Claim Use Expression
This change updated the expression math so that the threshold value
can be reached.

Change-Id: Iae078d4c78a4403c410ae01e0a13a1dda25d40c7
2019-10-28 16:41:45 -05:00
Steve Wilkerson
b50fae62a4 Update kubernetes-entrypoint image reference
This updates the kubernetes-entrypoint image reference to consume
the publicly available kubernetes-entrypoint image that is built
and maintained under the airshipit namespace, as the stackanetes
image is no longer actively maintained

Change-Id: I5bfdc156ae228ab16da57569ac6b05a9a125cb6a
Signed-off-by: Steve Wilkerson <sw5822@att.com>
2019-10-18 18:20:11 +00:00
Andrii Ostapenko
fdcc9b7e0e Make all prints python3 compatible
Change-Id: Ie5a08859010453d276b42253f5f2130f80b82224
2019-10-01 01:28:35 +00:00
Pai, Radhika (rp592h)
2358a8a710 Prometheus: Relabeling the node-exporter label
Added the relabeling config lines to the kubernetes_sd_config key, to
replace the node_name with hostname for Node-exporter. The hostname
should now also be displayed as one of the labels of the Node-exporter
metrics.

Change-Id: Ic96a890552a1cd2f5e595c37330de048f31a0e75
2019-09-26 15:46:36 +00:00
Steve Wilkerson
40d26142d3 Prometheus: Fix volume utilization alert expression
Change-Id: I9a0ab85d7acf20e5b34ec62a95b3350aace8161a
Signed-off-by: Steve Wilkerson <sw5822@att.com>
2019-07-08 13:19:35 +00:00
sungil
fae650722f Fix templates of alert rules (ceph.rules)
This PS fixes templates which generate errors on alert-manager.

Change-Id: I4201cc353848a8f121c2a755a93c1b462d1ab816
2019-07-02 14:50:37 +00:00
Hemant
b9a9ee323b Change the expression of defined alert in prometheus to avoid unnecessary errors
There were some false alerts about volume_claim_capacity_high_utilization
due to the wrong formula being used to determine the percentage of used
capacity.
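
The commit message does not show the corrected formula. As a hedged
illustration only, the usual way to compute a used-capacity percentage and
compare it with a threshold can be sketched as follows; the function names
and the 80 percent default threshold are assumptions, not values taken
from the chart.

```python
def used_percent(used_bytes: float, capacity_bytes: float) -> float:
    """Percentage of capacity in use: 100 * used / capacity."""
    if capacity_bytes <= 0:
        raise ValueError("capacity_bytes must be positive")
    return 100.0 * used_bytes / capacity_bytes


def is_high_utilization(
    used_bytes: float, capacity_bytes: float, threshold_percent: float = 80.0
) -> bool:
    """True when usage is strictly above the (assumed) threshold."""
    return used_percent(used_bytes, capacity_bytes) > threshold_percent


print(used_percent(40, 50))         # prints "80.0"
print(is_high_utilization(45, 50))  # prints "True"
```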

Change-Id: I24afed7946f915e5e13f0ba759eca252c2598af9
2019-06-18 20:19:29 +00:00
caoyuan
040edeb79a Replace git.openstack.org URLs with opendev.org URLs
Change-Id: I0e3af4a3385f5b2a7705bc19b775863b16c2e08e
2019-05-31 01:52:10 +00:00
Jean-Philippe Evrard
5f5e988fb3 Point to OSH-images images
We now have a process for OSH-images image building,
using Zuul, so we should point the images by default to those
images, instead of pointing to stale images.

Without this, the osh-images build process is completely not
in use (and completely opaque to deployers), and updating the
osh-images process or patching its code has no impact on OSH.

This should fix it.

Change-Id: Ic00bd98c151669dc2485cd88e0e8c2ab05445959
2019-05-17 08:17:32 +00:00
Roy Tang (rt7380)
85bd731562 Expose Anti-Affinity Weight Setting
This ps exposes the anti-affinity weight value, including
default, that will be consumed by the updated htk function.

Change-Id: Id8eb303674764ef8b0664f62040723aaf77e0a54
2019-05-14 17:04:52 -05:00
Meg Heisler
e1f2a3cf78 Fix broken network policy check/gate
This adds a basic egress policy to the charts run by the
network-policy check. A change was recently merged requiring
the egress tag to be in the chart but did not add it. This
addresses that.

Change-Id: I60669c9351db7854cba8c69723eb783a966d2a56
2019-05-10 05:55:22 +00:00
RAHUL KHIYANI
916bdabee7 prometheus: Fix security context
This PS fixes the use of the security context macros for the
prometheus chart.

Change-Id: I0abb309132a9954a140cbf76463724c5e2c7c5f3
2019-04-23 00:00:36 +00:00
Zuul
d27e548f8f Merge "OSH-Infra: Add emptydirs for tmp" 2019-04-21 02:21:11 +00:00
Pete Birley
2abf62ff4d OSH-Infra: Add emptydirs for tmp
This PS adds emptydirs backing the /tmp directory in pods, which
is required in most cases for full operation when using a read only
filesystem backing the container.

Additionally some yaml indent issues are resolved.

Change-Id: I8b7f1614da059783254aa6efc09facf23fca3cad
Signed-off-by: Pete Birley <pete@port.direct>
2019-04-20 20:50:59 +00:00
Rahul Khiyani
f25e458515 Prometheus: Add pod/container security context
This updates the prometheus chart to include the pod
security context on the pod template. This changes the pod's
user from root to the nobody user instead

This also adds the container security context to explicitly set
allowPrivilegeEscalation to false and readOnlyRootFilesystem to true

Change-Id: I2a3a4b77d9b25c086dc23b4fd66dca92872c422d
2019-04-20 18:54:44 +00:00
Steve Wilkerson
84f30ec103 Add release-annotation to pod spec, add missing annotations
This adds the release-annotation to the pod spec for the charts in
openstack-helm-infra. This also adds missing configmap annotations
to charts in openstack-helm-infra

Change-Id: Ie23f0c16a7a21d3929e98928db2bbcef69ae6490
2019-03-21 09:10:48 -05:00
Steve Wilkerson
3413dba8c0 Update ingress controller image, ingress cookie annotations
This updates the ingress controller image to v0.23.0, which was
required to add support for configuring cookie max age and expires
for ingresses via annotations on the ingress.

This also removes the --enable-dynamic-configuration flag, as the
flag is no longer valid in 0.23.0 due to the functionality being
a default behavior of the nginx ingress controller in recent
releases

Change-Id: I4917797c43ec973ed0bb311fc305b01f10abd4e5
2019-03-07 20:39:03 +00:00
Rahul Khiyani
bfa58f9177 readOnlyRootFilesystem: true for Prometheus chart
Fix for adding readOnlyRootFilesystem flag at pod
level

Change-Id: I04079be87780292da1bf9b2142f0a01a8b575b5b
2019-03-07 17:42:48 +00:00
Zuul
e836707ad0 Merge "Add east-west ingress network policy to Prometheus" 2019-03-07 04:44:10 +00:00
Meg Heisler
243f6c7608 Add east-west ingress network policy to Prometheus
This adds an ingress policy to Prometheus and utilizes
the helm-toolkit used in openstack-helm

Change-Id: Ia89d42a5305c94da26337aaf716978c1defae503
2019-03-06 11:56:13 -06:00
Steve Wilkerson
4c0fd492ee Update logging format and config for apache reverse proxies
This updates the logging format and configuration for the apache
reverse proxies used for elasticsearch, kibana, nagios and
prometheus to enable logging of the remote clients used to access
these services

Change-Id: Id07e4294ea18203fbb890b78424a232c2d59cb82
2019-02-25 09:21:41 -06:00
Chris Wedgwood
332d7a4e39 [Prometheus] Tweak K8SApiServerLatency to ignore DELETECOLLECTION
DELETECOLLECTION for some things like namespaces can be very slow.  As
it's not critical it should be safe to ignore it.

Change-Id: I513b2af45b703a73d20a98a7a770776632ae4b39
2019-02-16 16:58:16 +00:00
Chris Wedgwood
d7808468fc [Prometheus] Relax disk IO constraints
Relax the timing constraints for disk IO to accommodate rotating disks;
a "measured IO" might be the result of a small number of physical IOs,
allow for enough time for a small number of disk rotations (this isn't
perfect but seems to be about right in testing under load).

Change-Id: Ifb067a2218528e5918d2f4b2ba169b6e739084e0
2019-01-29 06:41:51 +00:00
Chris Wedgwood
4fb6ee6e35 [Prometheus] Fix filesystem space checks
Change-Id: Id527ea6e08070cb7d2634417a7c203c1c5c3d97c
2019-01-29 06:34:54 +00:00
Steve Wilkerson
87ff958fb8 Prometheus: Update pod container status alerts
This updates the Prometheus pod container status alerts. This
ensures there are alerts defined for ImagePullBackOff,
ErrImagePull, and CreateContainerConfigError errors.

This also updates the Nagios service checks to include correct
checks for those alerts

Change-Id: I91544e7dff8c6aac8c79cd8aa7d8f7bc03adaa9a
2019-01-23 16:26:39 +00:00
Steve Wilkerson
9e5a295465 Update Elasticsearch health status expressions
This updates the Elasticsearch health status expressions used in
Prometheus, Nagios and Grafana.  The previous Prometheus rule
defined for Elasticsearch health checked for a status that was
> 0 to trigger an alarm for a green health status. The correct
returned values are: 1 for green, 0 for both red and yellow. This
changes the expression to use arithmetic operators to give us a
result that maps to: 2 for green, 1 for yellow, 0 for red.

This also updates the Elasticsearch dashboard in Grafana to add a
new mapping for the updated 2g,1y,0r scale.

Finally, this also updates the Nagios service check to be a bit
more verbose in its output.

For reference, see:
https://github.com/justwatchcom/elasticsearch_exporter/issues/120
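
The 2/1/0 mapping described above can be sketched in Python as follows.
This assumes the exporter reports one 0-or-1 indicator per color, with 1
for the color that is currently active; the function name and the label
table are illustrative, and the actual expression used in the chart is
not shown in this message.

```python
def health_score(green: int, yellow: int) -> int:
    """Combine per-color 0/1 indicators into 2 (green), 1 (yellow), 0 (red)."""
    return 2 * green + yellow


# Illustrative lookup from score back to a color name.
SCORE_LABELS = {2: "green", 1: "yellow", 0: "red"}

for green, yellow in [(1, 0), (0, 1), (0, 0)]:
    score = health_score(green, yellow)
    print(score, SCORE_LABELS[score])
```

A red cluster has neither the green nor the yellow indicator set, so it
falls out as 0 without needing a third input.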

Change-Id: I6ef2a7c308c6ebfdb693b46127a285bceb6ba872
2019-01-16 11:11:59 -06:00