102 Commits

Author SHA1 Message Date
RAHUL KHIYANI
916bdabee7 prometheus: Fix security context
This PS fixes the use of the security context macros for the
prometheus chart.

Change-Id: I0abb309132a9954a140cbf76463724c5e2c7c5f3
2019-04-23 00:00:36 +00:00
Zuul
d27e548f8f Merge "OSH-Infra: Add emptydirs for tmp" 2019-04-21 02:21:11 +00:00
Pete Birley
2abf62ff4d OSH-Infra: Add emptydirs for tmp
This PS adds emptydirs backing the /tmp directory in pods, which
is required in most cases for full operation when using a read only
filesystem backing the container.

Additionally some yaml indent issues are resolved.

Change-Id: I8b7f1614da059783254aa6efc09facf23fca3cad
Signed-off-by: Pete Birley <pete@port.direct>
2019-04-20 20:50:59 +00:00
Rahul Khiyani
f25e458515 Prometheus: Add pod/container security context
This updates the prometheus chart to include the pod
security context on the pod template. This changes the pod's
user from root to the nobody user instead

This also adds the container security context to explicitly set
allowPrivilegeEscalation to false and readOnlyRootFilesystem to true
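
A minimal sketch of the two container-level settings described above, written as a check over a dict shaped like a Kubernetes container spec. The function name and sample data are hypothetical and are not taken from the chart.

```python
def container_is_hardened(container: dict) -> bool:
    """Return True if the container sets both flags named in this change."""
    ctx = container.get("securityContext", {})
    return (
        ctx.get("allowPrivilegeEscalation") is False
        and ctx.get("readOnlyRootFilesystem") is True
    )


# Hypothetical container specs, for illustration only.
hardened = {
    "name": "prometheus",
    "securityContext": {
        "allowPrivilegeEscalation": False,
        "readOnlyRootFilesystem": True,
    },
}
unset = {"name": "prometheus"}

print(container_is_hardened(hardened))  # True
print(container_is_hardened(unset))     # False
```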

Change-Id: I2a3a4b77d9b25c086dc23b4fd66dca92872c422d
2019-04-20 18:54:44 +00:00
Steve Wilkerson
84f30ec103 Add release-annotation to pod spec, add missing annotations
This adds the release-annotation to the pod spec for the charts in
openstack-helm-infra. This also adds missing configmap annotations
to charts in openstack-helm-infra

Change-Id: Ie23f0c16a7a21d3929e98928db2bbcef69ae6490
2019-03-21 09:10:48 -05:00
Steve Wilkerson
3413dba8c0 Update ingress controller image, ingress cookie annotations
This updates the ingress controller image to v0.23.0, which was
required to add support for configuring cookie max age and expires
for ingresses via annotations on the ingress.

This also removes the --enable-dynamic-configuration flag, as the
flag is no longer valid in 0.23.0 due to the functionality being
a default behavior of the nginx ingress controller in recent
releases

Change-Id: I4917797c43ec973ed0bb311fc305b01f10abd4e5
2019-03-07 20:39:03 +00:00
Rahul Khiyani
bfa58f9177 readOnlyRootFilesystem: true for Prometheus chart
Fix for adding readOnlyRootFilesystem flag at pod
level

Change-Id: I04079be87780292da1bf9b2142f0a01a8b575b5b
2019-03-07 17:42:48 +00:00
Zuul
e836707ad0 Merge "Add east-west ingress network policy to Prometheus" 2019-03-07 04:44:10 +00:00
Meg Heisler
243f6c7608 Add east-west ingress network policy to Prometheus
This adds an ingress policy to Prometheus and utilizes
the helm-toolkit used in openstack-helm

Change-Id: Ia89d42a5305c94da26337aaf716978c1defae503
2019-03-06 11:56:13 -06:00
Steve Wilkerson
4c0fd492ee Update logging format and config for apache reverse proxies
This updates the logging format and configuration for the apache
reverse proxies used for elasticsearch, kibana, nagios and
prometheus to enable logging of the remote clients used to access
these services

Change-Id: Id07e4294ea18203fbb890b78424a232c2d59cb82
2019-02-25 09:21:41 -06:00
Chris Wedgwood
332d7a4e39 [Prometheus] Tweak K8SApiServerLatency to ignore DELETECOLLECTION
DELETECOLLECTION for some things like namespaces can be very slow.  As
it's not critical it should be safe to ignore it.

Change-Id: I513b2af45b703a73d20a98a7a770776632ae4b39
2019-02-16 16:58:16 +00:00
Chris Wedgwood
d7808468fc [Prometheus] Relax disk IO constraints
Relax the timing constraints for disk IO to accommodate rotating disks;
a "measured IO" might be the result of a small number of physical IOs,
allow for enough time for a small number of disk rotations (this isn't
perfect but seems to be about right in testing under load).

Change-Id: Ifb067a2218528e5918d2f4b2ba169b6e739084e0
2019-01-29 06:41:51 +00:00
Chris Wedgwood
4fb6ee6e35 [Prometheus] Fix filesystem space checks
Change-Id: Id527ea6e08070cb7d2634417a7c203c1c5c3d97c
2019-01-29 06:34:54 +00:00
Steve Wilkerson
87ff958fb8 Prometheus: Update pod container status alerts
This updates the Prometheus pod container status alerts. This
ensures there are alerts defined for ImagePullBackOff,
ErrImagePull, and CreateContainerConfigError errors.
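
As a rough illustration of what these alerts select, the sketch below filters waiting containers by the three reasons listed above. The tuple layout and sample pods are assumptions for the example; the chart's actual rule expressions are not shown here.

```python
# Waiting reasons named in this change.
ALERT_REASONS = {"ImagePullBackOff", "ErrImagePull", "CreateContainerConfigError"}


def containers_to_alert(waiting):
    """Return (pod, container) pairs whose waiting reason should alert.

    waiting: iterable of (pod, container, reason) tuples.
    """
    return [(pod, ctr) for pod, ctr, reason in waiting if reason in ALERT_REASONS]


# Hypothetical sample data.
samples = [
    ("web-0", "nginx", "ImagePullBackOff"),
    ("db-0", "postgres", "ContainerCreating"),
    ("api-1", "api", "CreateContainerConfigError"),
]
print(containers_to_alert(samples))
```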

This also updates the Nagios service checks to include correct
checks for those alerts

Change-Id: I91544e7dff8c6aac8c79cd8aa7d8f7bc03adaa9a
2019-01-23 16:26:39 +00:00
Steve Wilkerson
9e5a295465 Update Elasticsearch health status expressions
This updates the Elasticsearch health status expressions used in
Prometheus, Nagios and Grafana.  The previous Prometheus rule
defined for Elasticsearch health checked for a status that was
> 0 to trigger an alarm for a green health status. The correct
returned values are: 1 for green, 0 for both red and yellow. This
changes the expression to use arithmetic operators to give us a
result that maps to: 2 for green, 1 for yellow, 0 for red.
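
The mapping can be sketched as follows, assuming the exporter reports one 0/1 indicator per colour and exactly one of them is 1 at a time. This is an illustration of the arithmetic only; the exact Prometheus expression used in the chart is not reproduced here, and the function name is hypothetical.

```python
def health_score(green: int, yellow: int) -> int:
    """Combine per-colour 0/1 indicators into a single 2/1/0 scale.

    green=1 -> 2, yellow=1 -> 1, neither (red) -> 0.
    """
    return 2 * green + yellow


print(health_score(green=1, yellow=0))  # 2 (green)
print(health_score(green=0, yellow=1))  # 1 (yellow)
print(health_score(green=0, yellow=0))  # 0 (red)
```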

This also updates the Elasticsearch dashboard in Grafana to add a
new mapping for the updated 2g,1y,0r scale.

Finally, this also updates the Nagios service check to be a bit
more verbose in its output.

For reference, see:
https://github.com/justwatchcom/elasticsearch_exporter/issues/120

Change-Id: I6ef2a7c308c6ebfdb693b46127a285bceb6ba872
2019-01-16 11:11:59 -06:00
Steve Wilkerson
30d2cf00d4 Remove unused pod-etc-apache volumes
This removes unused pod-etc-apache volumes from the charts that
use an apache sidecar container as a reverse proxy.

Change-Id: Ibafff3b53f9d3c20f5aed30d40ee6470cb515a8a
2019-01-04 10:31:35 -06:00
Chris Wedgwood
0c4e37391f 'NOP' cleanup for more consistent white-space use in charts
Where we have the style '{{ ...' we should use the style '... }}'.

Change-Id: Ic3e779e4681370d396f95d3804ca27db5b9d3642
2019-01-03 22:45:49 +00:00
Pete Birley
0bf3674539 Revert "Add Egress Helm-toolkit function & enforce the nework policy at OSH-INFRA"
This reverts commit 8d33a2911cda0c9e88406b9eeacbd8dfa70286f2.

Change-Id: Ic861b9bf9b337449b47a3558da8355e7a5bcacee
2018-12-16 04:21:46 +00:00
Mike Pham
8d33a2911c Add Egress Helm-toolkit function & enforce the nework policy at OSH-INFRA
This PS implements the helm-toolkit function that generates the
egress section of a Kubernetes network policy manifest from overridable values.
It also enables the K8s network policy at the OSH-Infra gate.

Change-Id: Icbe2a18c98dba795d15398dcdcac64228f6a7b4c
2018-12-14 16:32:40 -05:00
Steve Wilkerson
71c1a16758 Prometheus: Add session affinity to ingress
This adds session affinity to Prometheus's ingress. This allows for
the use of cookies for Prometheus's session affinity

Change-Id: I2e7e1d1b5120c1fb3ddecb5883845e46d61273de
2018-11-26 14:30:08 +00:00
Zuul
a90ebb784c Merge "Prometheus: Update discovery configuration for ceph-mgr services" 2018-11-09 01:01:54 +00:00
Steve Wilkerson
e0f2d66ee3 Prometheus: Update discovery configuration for ceph-mgr services
This updates the Prometheus scrape configuration to use the
service based discovery mechanism instead of endpoints. This
removes issues associated with multiple ceph-mgr replicas deployed

Change-Id: I2c557af0c7200d0c4aea646c5f9ecd1a070db33e
2018-11-06 13:56:37 -06:00
kranthi guttikonda
fac358a575 prometheus ceph.rules changes
With the new Ceph Luminous release, the existing ceph.rules are obsolete.

Added a new rule for ceph-mgr count

Changed ceph_monitor_quorum_count to ceph_mon_quorum_count

Updated ceph_cluster_usage_high as ceph_cluster_used_bytes,
ceph_cluster_capacity_bytes aren't valid

Updated ceph_placement_group_degrade_pct_high as
ceph_degraded_pgs, ceph_total_pgs aren't valid

Updated ceph_osd_down_pct_high as ceph_osds_down,
ceph_osds_up aren't available, ceph_osd_up is
available but ceph_osd_down isn't. Need to
calculate the down based on count(ceph_osd_up==0)
and total osd using count(ceph_osd_metadata)
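
The calculation described above can be sketched like this, with a dict standing in for the ceph_osd_up series and a set standing in for ceph_osd_metadata. The function name and OSD names are hypothetical, and the alert threshold itself is not shown.

```python
def osd_down_percent(osd_up: dict, osd_metadata: set) -> float:
    """Percentage of OSDs that are down.

    osd_up:       OSD name -> 1 (up) or 0 (down), like ceph_osd_up.
    osd_metadata: names of all known OSDs, like ceph_osd_metadata.
    """
    down = sum(1 for value in osd_up.values() if value == 0)  # count(ceph_osd_up == 0)
    total = len(osd_metadata)                                 # count(ceph_osd_metadata)
    if total == 0:
        raise ValueError("no OSDs reported")
    return 100.0 * down / total


# Hypothetical cluster with two of four OSDs down.
up = {"osd.0": 1, "osd.1": 0, "osd.2": 1, "osd.3": 0}
print(osd_down_percent(up, set(up)))  # 50.0
```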

Removed ceph_monitor_clock_skew_high as the metric
ceph_monitor_clock_skew_seconds isn't valid anymore

Added new alarms ceph_osd_down, ceph_osd_out

Implements: prometheus ceph.rules changes with new valid metrics
Closes-Bug: #1800548
Change-Id: Id68e64472af12e8dadffa61373c18bbb82df96a3
Signed-off-by: Kranthi Guttikonda <kranthi.guttikonda@b-yond.com>
2018-10-31 10:23:11 -04:00
Zuul
11ec46bdce Merge "Prometheus kubelet.rules change" 2018-10-23 17:57:26 +00:00
Tin Lam
92e68d33ea Add network policy toolkit function
This patch set implements the helm toolkit function to generate a
kubernetes network policy manifest based on overrideable values.
This also adds a chart that shuts down all ingress and egress
traffic in the namespace. This can be used to ensure the
whitelisted network policy works as intended.

Additionally, implementation is done for some infrastructure charts.

Change-Id: I78e87ef3276e948ae4dd2eb462b4b8012251c8c8
Co-Authored-By: Mike Pham <tp6510@att.com>
Signed-off-by: Tin Lam <tin@irrational.io>
2018-10-15 13:50:50 +00:00
kranthi guttikonda
f995680e2a Prometheus kubelet.rules change
The kube_node_status_ready and up metrics are obsolete for checking the
Kubernetes node condition. When a kubelet is down, the node itself is in the
NotReady state. With the 1.3.1 kube-state-metrics exporter, the
kube_node_status_condition metric provides the status value of the kubelet
(essentially the node).
https://github.com/kubernetes/kube-state-metrics/blob/master/Documentation/node-metrics.md

kube_node_status_condition includes condition=Ready and a status of true,
false or unknown. When a kubelet is stopped, the status will be unknown
since the kubelet itself will be unable to talk to the API. In other cases it
will be false. When the node is registered and available it will be set to
true.

Replaced kube_node_status_ready with kube_node_status_condition,
changed the 1h to 1m and increased the severity to "critical". Also
modified the K8SKubeletDown definitions to use 1m and critical severity
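
A small sketch of how the Ready condition described above can be read: a node counts as not ready when its Ready condition is currently false or unknown. The tuple layout and node names are assumptions for the example, and the actual alert expression is not reproduced.

```python
def not_ready_nodes(series):
    """Return sorted node names whose Ready condition is false or unknown.

    series: iterable of (node, condition, status, value) tuples, where
    value is 1 for the status that currently applies and 0 otherwise.
    """
    return sorted({
        node
        for node, condition, status, value in series
        if condition == "Ready" and status in ("false", "unknown") and value == 1
    })


# Hypothetical samples: node-b's kubelet is stopped, node-c reports false.
samples = [
    ("node-a", "Ready", "true", 1),
    ("node-a", "Ready", "unknown", 0),
    ("node-b", "Ready", "unknown", 1),
    ("node-c", "Ready", "false", 1),
    ("node-c", "DiskPressure", "false", 1),
]
print(not_ready_nodes(samples))
```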

Implements: Bug 1797133
Closes-Bug: #1797133
Change-Id: I025adb13c9d8642a218dfda1ff30f1577fa8c826
Signed-off-by: Kranthi Kiran Guttikonda <kranthi.guttikonda@b-yond.com>
2018-10-11 16:31:16 -04:00
Steve Wilkerson
c7cbb9f4dd Charts: Update heat image used for jobs and helm tests
This changes the image used for various jobs and helm tests in the
osh-infra charts. This replaces the kolla heat image with the loci
based heat image used for jobs and helm tests in openstack-helm in
order to drive consistency

Change-Id: Ie9deedadb7507282fe62723ec4641dd508040364
2018-10-11 14:47:58 -05:00
Steve Wilkerson
bfa237d347 Charts: Update helm test pod templates
This updates the helm test pod templates in the charts with helm
tests defined. This change includes the addition of:

- Generate test pod cluster roles and role bindings
- Generate service accounts for test pods
- Add node selectors to the test pods
- Add service accounts to the test pods
- Addition of entrypoint container to the test pods
- Indentation fix for rabbitmq test pod template

Change-Id: I9a0dd8a1a87bfe5eaf1362e92b37bc004f9c2cdb
2018-10-09 21:00:00 +00:00
Steve Wilkerson
4c532bb8f3 Prometheus: Remove Kubernetes recording rules
This removes the recording rules for Kubernetes, as these rules
add significant overhead to the total evaluation time for rules.
Any recording rules should be handled as operator overrides and
not set by default, in order to prevent undesired overhead time
for rules that aren't currently used by the charts

Change-Id: I183d32e62619b71b5020cd3733e4707d7c9ad11b
2018-10-01 11:56:34 +00:00
rakesh-patnaik
db0d653b4d Monitor postgresql, Openstack virt resources, api, logs, pod and nodes status
Fixing OpenStack API monitors

Adding additional neutron services monitors
Adding new Pod CrashLoopBackOff status check
Adding new Host readiness check

Updated the nagios image reference (https://review.gerrithub.io/c/att-comdev/nagios/+/420590 - Pending)

This updated image provides a mechanism for querying Elasticsearch
with the goal of triggering alerts based on specified applications
and log levels.

Finally, this moves the endpoints resulting from the authenticated
endpoint lookups required for Nagios to the nagios secret instead
of handled via plain text environment variables

Change-Id: I517d8e6e6e8fa1d359382be8a131a8e45bf243e2
2018-09-21 08:22:13 +00:00
Zuul
1c6a33d979 Merge "Prometheus: Prune large unused time series metrics" 2018-09-18 05:30:39 +00:00
Zuul
5ec85a5d70 Merge "Prometheus: Fix Prometheus endpoints in apache config" 2018-09-17 15:19:03 +00:00
Pete Birley
bb3ff98d53 Add release uuid to pods and rc objects
This PS adds the ability to attach a release uuid to pods and rc
objects as desired. A follow up ps will add the ability to add arbitrary
annotations to the same objects.

Change-Id: Iceedba457a03387f6fc44eb763a00fd57f9d84a5
Signed-off-by: Pete Birley <pete@port.direct>
2018-09-13 05:35:35 +00:00
Scott Huang
bc54e72fd3 Monitor Cinder API and Scheduler
Change-Id: I159facb491d9a722d8c067ead25c470f00b83939
2018-09-07 15:12:32 +00:00
Steve Wilkerson
61b2dbf941 Prometheus: Fix Prometheus endpoints in apache config
This updates the endpoints in the apache configuration for
Prometheus to correctly define the file used for http basic auth
to validate the admin user. The Prometheus endpoints restricted to
the admin user specified file as the authbasicprovider, but did
not provide the file used for validating the user. This adds the
file correctly

Change-Id: I8561281236fb1efa2e51af342e30314aae8e5285
2018-09-04 13:26:02 +00:00
Steve Wilkerson
2e4db10e9b Prometheus: Prune large unused time series metrics
This begins to drop metrics from Prometheus scrape configurations.
The metrics dropped are metrics not currently used by any service
that interacts with Prometheus and are not used in any alerting
rules by default. Dropping these metrics reduces the resource use
by Prometheus, as it reduces the total number of time series data
ingested and analyzed by Prometheus
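
The effect of dropping metrics at scrape time can be sketched as below. Prometheus relabel regexes are anchored at both ends, which the sketch mirrors with a full match. The metric names and the regex are hypothetical examples, not the list dropped by this change, and Python's regex syntax is only a stand-in for the one Prometheus uses.

```python
import re


def keep_series(metric_names, drop_regex):
    """Return the metric names that survive a 'drop' rule.

    A name is dropped only when the whole name matches drop_regex.
    """
    pattern = re.compile(drop_regex)
    return [name for name in metric_names if not pattern.fullmatch(name)]


# Hypothetical scrape result and drop rule.
scraped = [
    "container_tasks_state",
    "container_cpu_usage_seconds_total",
    "container_memory_failures_total",
    "up",
]
print(keep_series(scraped, r"container_(tasks_state|memory_failures_total)"))
```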

Change-Id: Ia09ddd482da0119167a19e7e4b092879b672c2ec
2018-09-04 13:25:45 +00:00
Steve Wilkerson
9a311475ba Charts: Use secrets for configs in chart
This updates the osh-infra charts to use a secret for their
configuration files instead of a configmap, allowing for the
storage of sensitive information

Change-Id: Ia32587162288df0b297c45fd43b55cef381cb064
2018-08-24 15:56:53 -05:00
Steve Wilkerson
d5dc97a431 Prometheus: Remove block duration flags, update cadvisor job
This removes the min_block_duration and max_block_duration flags
from the Prometheus chart, as the suggested best practice is to
use the defaults (2h min, 10% of retention time as max).
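
For a sense of scale, the defaults quoted above work out as in this sketch. It only restates the 2h and 10% figures from this message; the names are hypothetical, and any further limits Prometheus itself applies are not modelled.

```python
from datetime import timedelta

# Default minimum block duration quoted in this message.
MIN_BLOCK_DURATION = timedelta(hours=2)


def default_max_block_duration(retention: timedelta) -> timedelta:
    """Maximum block duration as 10% of the retention time."""
    return retention / 10


# Example: a 15 day retention gives a 36 hour maximum block duration.
print(default_max_block_duration(timedelta(days=15)))
```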

This also updates the scrape target configuration for cadvisor to
match the upstream example endpoint for kubernetes versions 1.7.3
and later

Change-Id: I200969d6c4da9d17d0a7d3a34a114ccc5f5ee70f
2018-08-20 13:26:40 -05:00
Steve Wilkerson
faef231b0b Prometheus: Update version to 2.3.2
This updates the Prometheus version to 2.3.2, which includes a fix
for memory leak issues with the kubernetes client and also adds a
dashboard for evaluating prometheus rule evaluation performance

Change-Id: I7b9e7bee114fa149db3733c0dacfefae36be7fa8
2018-08-16 16:48:27 +00:00
Steve Wilkerson
8652e14acb Add auth for prometheus
This adds authentication to Prometheus with an apache reverse
proxy, similar to elasticsearch, kibana and nagios. This adds an
admin user and password via htpasswd along with adding ldap
support.

This required modifying the grafana chart to configure the
prometheus datasource's basic auth credentials in the data sources
provisioning configuration file by checking whether basic auth is
enabled and injecting the username/password defined in the
corresponding endpoint definition.
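
What a client such as the Grafana data source has to send once basic auth is enabled can be sketched as follows. The helper name and the credentials are placeholders; the chart takes the real values from its endpoint definition.

```python
import base64


def basic_auth_header(username: str, password: str) -> dict:
    """Build the HTTP Basic Authorization header for the given credentials."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}


# Placeholder credentials, for illustration only.
headers = basic_auth_header("admin", "changeme")
print(headers["Authorization"].split(" ")[0])  # Basic
```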

This also modifies the nagios chart to use the authenticated
endpoint for prometheus, which is required for nagios to
successfully query the prometheus endpoint for its service
checking mechanism

Change-Id: Ia4ccc3c44a89b2c56594be1f4cc28ac07169bf8c
2018-08-08 18:49:45 +00:00
Seungkyu Ahn
a430533e6a Quoting node_select_value in Ingress Controller
In most cases, the ingress controller's nodeSelector key and value
are "node-role.kubernetes.io/ingress" and "true".
This uses quote to treat the nodeSelector value as a string.

Change-Id: Ie1745629b90795e4d888d85f35565e6d6350e09b
2018-08-01 02:39:05 +00:00
Steve Wilkerson
a861c27a34 Prometheus: Update command line flags
This updates the default command line flags for Prometheus. It
explicitly sets the HTTP administrative settings to false and
gives a brief explanation of the security concerns associated
with enabling them

This also removes the honor_labels setting where set to false, as
false is the default setting for honor_labels

Change-Id: I69acdbce604864882d642e44c09a5f0b9c454a61
2018-07-27 16:33:37 -05:00
Steve Wilkerson
dc16a897d7 Add missing labels to helm test pods
This adds missing labels to the helm test pods in osh-infra

Change-Id: I618d9089bfde2d847411f5f876f0ff6afd9cce7f
2018-07-10 08:55:40 -05:00
Steve Wilkerson
c26a1b53f6 Update TLS secret templates, remove nagios readiness probe
This updates the TLS secret templates to include the backend
service in the dict supplied to the manifest template, as it is
required for the TLS secret to render correctly.

This also removes the readiness probe from the nagios container in
the deployment for the nagios chart, as it wasn't functioning as
intended due to the port not being available for the probe

Change-Id: Iabcfd40c74938e0497d08ffeeebc98ab722fa660
2018-06-27 18:56:45 -05:00
Steve Wilkerson
b823954787 Ingress: Add initial TLS Support for osh-infra public endpoints
Adds support for TLS on overridden fqdns for public endpoints for
the services that have them in openstack-helm-infra. Currently this
implementation is limited, in that it does not provide support for
dynamically loading CAs into the containers, or specifying them manually
via configuration. As a result, only well-known CAs or CAs added manually
to containers will be recognised.

Change-Id: I4ab4bbe24b6544b64cd365467e8efb2a421ac3f4
2018-06-26 14:47:19 -05:00
Pete Birley
abb00e97fd Gotpl: remove quote and trunc to suppress output
This PS removes the use of the `quote and truncate` approach to
suppress output from gotpl actions in templates and replaces it
with the recommended practice of defining `$_` instead.

Change-Id: I5fedc3471dcbecef37d2fe1302bf9760b3163467
Signed-off-by: Pete Birley <pete@port.direct>
2018-06-16 16:37:08 -05:00
Zuul
e718d4d39b Merge "Prometheus: update function to live in correct location" 2018-06-14 00:50:59 +00:00
Zuul
01d196e761 Merge "Use current kubernetes API version" 2018-06-13 13:00:58 +00:00
Pete Birley
b6a51fb57f Use current kubernetes API version
This PS moves to use the current API version for kubernetes rcs'
that were previously using `apps/v1beta1`.

Story: 2002205
Task: 21735

Change-Id: Icb4e7aa2392da6867427a58926be2da6f424bd56
Signed-off-by: Pete Birley <pete@port.direct>
2018-06-12 17:35:13 -05:00
Steve Wilkerson
561780f347 PVC monitoring: Add alerting rules and service check for PVCs
This adds a basic check for capacity utilization for persistent
volume claims. To accomplish this, it adds a basic alerting rule
to prometheus that fires once a persistent volume's usage has
exceeded 80% for 5 minutes. In addition, there is a service check
added to the nagios chart that will query Prometheus to check if
the alarm for that threshold is firing for any of the volume claims.
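
The "over 80% for 5 minutes" behaviour can be sketched as below: the condition has to hold continuously for the hold period before the alert fires, and any dip below the limit resets the clock. This is a simplified model with hypothetical names and sample data, not the rule shipped in the chart.

```python
def firing_times(samples, threshold=0.80, hold_seconds=300):
    """Return the timestamps at which the alert would be firing.

    samples: list of (timestamp_seconds, used_bytes, capacity_bytes),
    in time order, one entry per rule evaluation.
    """
    firing = []
    pending_since = None  # when usage first went over the threshold
    for ts, used, capacity in samples:
        if used / capacity > threshold:
            if pending_since is None:
                pending_since = ts
            if ts - pending_since >= hold_seconds:
                firing.append(ts)
        else:
            pending_since = None  # condition cleared, start over
    return firing


# Hypothetical volume: 70% used at t=0, then 85% used from t=60 onwards.
samples = [(0, 70, 100)] + [(t, 85, 100) for t in range(60, 481, 60)]
print(firing_times(samples))  # [360, 420, 480]
```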

Change-Id: I862c860ac479a715733202f679bb151885d7aa7c
2018-06-12 14:28:24 +00:00