This PS adds emptydirs backing the /tmp directory in pods, which
is required in most cases for full operation when using a read only
filesystem backing the container.
Additionally, some YAML indentation issues are resolved.
Change-Id: I8b7f1614da059783254aa6efc09facf23fca3cad
Signed-off-by: Pete Birley <pete@port.direct>
This updates the prometheus chart to include the pod
security context on the pod template. This changes the pod's
user from root to the nobody user.
This also adds the container security context to explicitly set
allowPrivilegeEscalation to false and readOnlyRootFilesystem to true.
Change-Id: I2a3a4b77d9b25c086dc23b4fd66dca92872c422d
This adds the release-annotation to the pod spec for the charts in
openstack-helm-infra. This also adds missing configmap annotations
to charts in openstack-helm-infra
Change-Id: Ie23f0c16a7a21d3929e98928db2bbcef69ae6490
This updates the ingress controller image to v0.23.0, which was
required to add support for configuring cookie max age and expires
for ingresses via annotations on the ingress.
This also removes the --enable-dynamic-configuration flag, as the
flag is no longer valid in 0.23.0 due to the functionality being
a default behavior of the nginx ingress controller in recent
releases
Change-Id: I4917797c43ec973ed0bb311fc305b01f10abd4e5
This updates the logging format and configuration for the apache
reverse proxies used for elasticsearch, kibana, nagios and
prometheus to enable logging of the remote clients used to access
these services
Change-Id: Id07e4294ea18203fbb890b78424a232c2d59cb82
DELETECOLLECTION for some things like namespaces can be very slow. As
it's not critical it should be safe to ignore it.
Change-Id: I513b2af45b703a73d20a98a7a770776632ae4b39
Relax the timing constraints for disk IO to accommodate rotating disks.
A "measured IO" might be the result of a small number of physical IOs,
so allow enough time for a small number of disk rotations (this isn't
perfect but seems to be about right in testing under load).
Change-Id: Ifb067a2218528e5918d2f4b2ba169b6e739084e0
This updates the Prometheus pod container status alerts. This
ensures there are alerts defined for ImagePullBackOff,
ErrImagePull, and CreateContainerConfigError errors.
This also updates the Nagios service checks to include correct
checks for those alerts
Change-Id: I91544e7dff8c6aac8c79cd8aa7d8f7bc03adaa9a
This updates the Elasticsearch health status expressions used in
Prometheus, Nagios and Grafana. The previous Prometheus rule
defined for Elasticsearch health checked for a status that was
> 0 to trigger an alarm for a green health status. The correct
returned values are: 1 for green, 0 for both red and yellow. This
changes the expression to use arithmetic operators to give us a
result that maps to: 2 for green, 1 for yellow, 0 for red.
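The arithmetic behind the new scale can be sketched in Python. This is a minimal illustration, not the PromQL expression used in the chart: it assumes the exporter reports one 0/1 flag per color, and the function name `health_score` is hypothetical.

```python
def health_score(green: int, yellow: int) -> int:
    """Collapse per-color 0/1 flags into one value: 2 green, 1 yellow, 0 red."""
    if green not in (0, 1) or yellow not in (0, 1):
        raise ValueError("flags must be 0 or 1")
    # Weight the green flag by 2 and the yellow flag by 1.
    # A red cluster has both flags at 0, so it scores 0.
    return 2 * green + yellow


# Assumed flag combinations for each cluster state: (green, yellow).
for state, flags in {"green": (1, 0), "yellow": (0, 1), "red": (0, 0)}.items():
    print(state, health_score(*flags))
```

With a single ordered value, an alert can distinguish yellow from red instead of treating both as 0.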
This also updates the Elasticsearch dashboard in Grafana to add a
new mapping for the updated 2g,1y,0r scale.
Finally, this also updates the Nagios service check to be a bit
more verbose in its output.
For reference, see:
https://github.com/justwatchcom/elasticsearch_exporter/issues/120
Change-Id: I6ef2a7c308c6ebfdb693b46127a285bceb6ba872
This removes unused pod-etc-apache volumes from the charts that
use an apache sidecar container as a reverse proxy.
Change-Id: Ibafff3b53f9d3c20f5aed30d40ee6470cb515a8a
This PS implements the helm toolkit function to generate the
egress rules in the kubernetes network policy manifest based on overrideable values.
It also enables the K8s network policy in the osh-infra gate.
Change-Id: Icbe2a18c98dba795d15398dcdcac64228f6a7b4c
This adds session affinity to Prometheus's ingress. This allows for
the use of cookies for Prometheus's session affinity
Change-Id: I2e7e1d1b5120c1fb3ddecb5883845e46d61273de
This updates the Prometheus scrape configuration to use the
service based discovery mechanism instead of endpoints. This
removes issues associated with multiple ceph-mgr replicas deployed
Change-Id: I2c557af0c7200d0c4aea646c5f9ecd1a070db33e
With the new ceph luminous release, the existing ceph.rules are obsolete.
- Added a new rule for ceph-mgr count
- Changed ceph_monitor_quorum_count to ceph_mon_quorum_count
- Updated ceph_cluster_usage_high, as ceph_cluster_used_bytes and
  ceph_cluster_capacity_bytes aren't valid
- Updated ceph_placement_group_degrade_pct_high, as
  ceph_degraded_pgs and ceph_total_pgs aren't valid
- Updated ceph_osd_down_pct_high, as ceph_osds_down and
  ceph_osds_up aren't available. ceph_osd_up is
  available but ceph_osd_down isn't, so the down count is
  calculated with count(ceph_osd_up==0)
  and the total osd count with count(ceph_osd_metadata)
- Removed ceph_monitor_clock_skew_high, as the metric
  ceph_monitor_clock_skew_seconds isn't valid anymore
- Added new alarms ceph_osd_down, ceph_osd_out
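The down-percentage calculation for ceph_osd_down_pct_high can be sketched in Python. This is only an illustration of the arithmetic, not the rule from the chart: the function name, the input shapes and the OSD names are assumptions.

```python
from typing import Dict, List


def osd_down_percent(osd_up: Dict[str, int], osd_metadata: List[str]) -> float:
    """Percentage of OSDs that are down.

    osd_up maps an OSD name to its ceph_osd_up value (1 = up, 0 = down).
    osd_metadata lists every OSD that reports ceph_osd_metadata.
    """
    down = sum(1 for value in osd_up.values() if value == 0)  # count(ceph_osd_up == 0)
    total = len(osd_metadata)                                 # count(ceph_osd_metadata)
    if total == 0:
        raise ValueError("no OSDs reported")
    return 100.0 * down / total


# Hypothetical cluster with four OSDs, one of them down.
up = {"osd.0": 1, "osd.1": 0, "osd.2": 1, "osd.3": 1}
print(osd_down_percent(up, list(up)))
```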
Implements: prometheus ceph.rules changes with new valid metrics
Closes-Bug: #1800548
Change-Id: Id68e64472af12e8dadffa61373c18bbb82df96a3
Signed-off-by: Kranthi Guttikonda <kranthi.guttikonda@b-yond.com>
This patch set implements the helm toolkit function to generate a
kubernetes network policy manifest based on overrideable values.
This also adds a chart that shuts down all ingress and egress
traffic in the namespace. This can be used to ensure the
whitelisted network policy works as intended.
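A policy that blocks all traffic in a namespace can be sketched as below. This is a minimal Python illustration of the manifest shape, not the template shipped in the chart: the function name, the policy name and the example namespace are assumptions.

```python
import json


def deny_all_policy(namespace: str, name: str = "deny-all") -> dict:
    """Build a NetworkPolicy that blocks ingress and egress for every pod."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # An empty selector matches every pod in the namespace.
            "podSelector": {},
            # Both types are listed with no allow rules, so nothing is permitted.
            "policyTypes": ["Ingress", "Egress"],
        },
    }


print(json.dumps(deny_all_policy("openstack"), indent=2))
```

With this in place, only traffic explicitly allowed by other policies should get through, which is what makes it useful for checking the whitelisted policies.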
Additionally, implementation is done for some infrastructure charts.
Change-Id: I78e87ef3276e948ae4dd2eb462b4b8012251c8c8
Co-Authored-By: Mike Pham <tp6510@att.com>
Signed-off-by: Tin Lam <tin@irrational.io>
The kube_node_status_ready and up metrics are obsolete for checking the
kubernetes node condition. When a kubelet is down, the node itself is in
the NotReady state. With the 1.3.1 kube-state-metrics exporter, the
kube_node_status_condition metric provides the status of the kubelet
(essentially the node).
https://github.com/kubernetes/kube-state-metrics/blob/master/Documentation/node-metrics.md
kube_node_status_condition includes condition=Ready and a status of true,
false or unknown. When a kubelet is stopped the status will be unknown,
since the kubelet itself will be unable to talk to the API. In other cases it
will be false. When the node is registered and available it will be set to
true.
Replaced kube_node_status_ready with kube_node_status_condition,
changed the 1h to 1m and increased the severity to "critical". Also
modified the K8SKubeletDown definitions to use 1m and critical severity.
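The three status values can be interpreted as in the Python sketch below. It is an illustration only, not the alert expression from the chart: it assumes one 0/1 sample per status label for condition=Ready, and the function names are hypothetical.

```python
from typing import Dict


def node_state(ready_samples: Dict[str, int]) -> str:
    """Map kube_node_status_condition{condition="Ready"} samples to a state.

    ready_samples maps the status label ("true", "false", "unknown") to 0 or 1.
    """
    active = [status for status, value in ready_samples.items() if value == 1]
    if len(active) != 1:
        raise ValueError("expected exactly one active status")
    return {"true": "Ready", "false": "NotReady", "unknown": "Unknown"}[active[0]]


def should_alert(ready_samples: Dict[str, int]) -> bool:
    """Alert whenever the node is not reported as Ready."""
    return node_state(ready_samples) != "Ready"


# A stopped kubelet cannot report to the API, so the status is unknown.
print(should_alert({"true": 0, "false": 0, "unknown": 1}))
```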
Implements: Bug 1797133
Closes-Bug: #1797133
Change-Id: I025adb13c9d8642a218dfda1ff30f1577fa8c826
Signed-off-by: Kranthi Kiran Guttikonda <kranthi.guttikonda@b-yond.com>
This changes the image used for various jobs and helm tests in the
osh-infra charts. This replaces the kolla heat image with the loci
based heat image used for jobs and helm tests in openstack-helm in
order to drive consistency
Change-Id: Ie9deedadb7507282fe62723ec4641dd508040364
This updates the helm test pod templates in the charts with helm
tests defined. This change includes the addition of:
- Generate test pod cluster roles and role bindings
- Generate service accounts for test pods
- Add node selectors to the test pods
- Add service accounts to the test pods
- Addition of entrypoint container to the test pods
- Indentation fix for rabbitmq test pod template
Change-Id: I9a0dd8a1a87bfe5eaf1362e92b37bc004f9c2cdb
This removes the recording rules for Kubernetes, as these rules
add significant overhead to the total evaluation time for rules.
Any recording rules should be handled as operator overrides and
not set by default, in order to prevent undesired overhead time
for rules that aren't currently used by the charts
Change-Id: I183d32e62619b71b5020cd3733e4707d7c9ad11b
Fixing openstack API monitors
Adding additional neutron services monitors
Adding new Pod CrashLoopBackOff status check
Adding new Host readiness check
Updated the nagios image reference (https://review.gerrithub.io/c/att-comdev/nagios/+/420590 - Pending)
This updated image provides a mechanism for querying Elasticsearch
with the goal of triggering alerts based on specified applications
and log levels.
Finally, this moves the endpoints resulting from the authenticated
endpoint lookups required for Nagios to the nagios secret instead
of handling them via plain text environment variables
Change-Id: I517d8e6e6e8fa1d359382be8a131a8e45bf243e2
This PS adds the ability to attach a release uuid to pods and rc
objects as desired. A follow up ps will add the ability to add arbitrary
annotations to the same objects.
Change-Id: Iceedba457a03387f6fc44eb763a00fd57f9d84a5
Signed-off-by: Pete Birley <pete@port.direct>
This updates the endpoints in the apache configuration for
Prometheus to correctly define the file used for http basic auth
to validate the admin user. The Prometheus endpoints restricted to
the admin user specified file as the AuthBasicProvider, but did
not provide the file used for validating the user. This adds the
file correctly.
Change-Id: I8561281236fb1efa2e51af342e30314aae8e5285
This begins to drop metrics from Prometheus scrape configurations.
The metrics dropped are metrics not currently used by any service
that interacts with Prometheus and are not used in any alerting
rules by default. Dropping these metrics reduces the resource use
by Prometheus, as it reduces the total number of time series data
ingested and analyzed by Prometheus
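The effect of dropping metrics at scrape time can be sketched in Python. This is an illustration only: the metric names and the `go_.*` pattern are examples, not the list dropped by the chart, and the function name is hypothetical.

```python
import re
from typing import Dict, List


def drop_metrics(samples: List[Dict[str, str]], drop_regex: str) -> List[Dict[str, str]]:
    """Return the samples whose metric name does not match drop_regex."""
    pattern = re.compile(drop_regex)
    # Prometheus anchors relabel regexes at both ends, so fullmatch is used here.
    return [s for s in samples if not pattern.fullmatch(s["__name__"])]


scraped = [
    {"__name__": "up"},
    {"__name__": "go_gc_duration_seconds"},
    {"__name__": "go_goroutines"},
    {"__name__": "process_cpu_seconds_total"},
]
kept = drop_metrics(scraped, r"go_.*")
print([s["__name__"] for s in kept])
```

Every dropped name is one or more time series that Prometheus no longer has to store or evaluate.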
Change-Id: Ia09ddd482da0119167a19e7e4b092879b672c2ec
This updates the osh-infra charts to use a secret for their
configuration files instead of a configmap, allowing for the
storage of sensitive information
Change-Id: Ia32587162288df0b297c45fd43b55cef381cb064
This removes the min_block_duration and max_block_duration flags
from the Prometheus chart, as the suggested best practice is to
use the defaults (2h min, 10% of retention time as max).
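As a quick check of what those defaults mean for a given retention period, here is a small Python sketch. The function name and the 15 day retention value are assumptions chosen for illustration.

```python
from datetime import timedelta
from typing import Tuple


def default_block_durations(retention: timedelta) -> Tuple[timedelta, timedelta]:
    """Return (min, max) block durations: 2h minimum, 10% of retention as maximum."""
    return timedelta(hours=2), retention / 10


min_block, max_block = default_block_durations(timedelta(days=15))
print(min_block, max_block)
```

For a 15 day retention this gives a maximum block duration of 36 hours.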
This also updates the scrape target configuration for cadvisor to
match the upstream example endpoint for kubernetes versions 1.7.3
and later
Change-Id: I200969d6c4da9d17d0a7d3a34a114ccc5f5ee70f
This updates the Prometheus version to 2.3.2, which includes a fix
for memory leak issues with the kubernetes client and also adds a
dashboard for evaluating prometheus rule evaluation performance
Change-Id: I7b9e7bee114fa149db3733c0dacfefae36be7fa8
This adds authentication to Prometheus with an apache reverse
proxy, similar to elasticsearch, kibana and nagios. This adds an
admin user and password via htpasswd along with adding ldap
support.
This required modifying the grafana chart to configure the
prometheus datasource's basic auth credentials in the data sources
provisioning configuration file by checking whether basic auth is
enabled and injecting the username/password defined in the
corresponding endpoint definition.
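The conditional injection can be sketched in Python as below. This is not the chart's template: the `auth` dictionary shape, the function name and the example URL are assumptions, and the `basicAuth*` keys follow Grafana's data source provisioning format as understood here.

```python
from typing import Dict, Optional


def prometheus_datasource(url: str, auth: Optional[Dict[str, str]] = None) -> dict:
    """Build a Grafana data source entry, adding credentials only when auth is given."""
    datasource = {
        "name": "prometheus",
        "type": "prometheus",
        "access": "proxy",
        "url": url,
    }
    if auth:
        # Basic auth is enabled: inject the credentials from the endpoint definition.
        datasource["basicAuth"] = True
        datasource["basicAuthUser"] = auth["username"]
        datasource["basicAuthPassword"] = auth["password"]
    return datasource


print(prometheus_datasource("http://prom-metrics:80", {"username": "admin", "password": "changeme"}))
```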
This also modifies the nagios chart to use the authenticated
endpoint for prometheus, which is required for nagios to
successfully query the prometheus endpoint for its service
checking mechanism
Change-Id: Ia4ccc3c44a89b2c56594be1f4cc28ac07169bf8c
In most cases, the ingress controller's nodeSelector key and value
are "node-role.kubernetes.io/ingress" and "true".
Use quote to treat the nodeSelector value as a string, so that
true is not rendered as a boolean.
Change-Id: Ie1745629b90795e4d888d85f35565e6d6350e09b
This updates the default command line flags for Prometheus. It
explicitly sets the HTTP administrative settings to false and
gives a brief explanation of the security concerns associated
with enabling them
This also removes the honor_labels setting where set to false, as
false is the default setting for honor_labels
Change-Id: I69acdbce604864882d642e44c09a5f0b9c454a61
This updates the TLS secret templates to include the backend
service in the dict supplied to the manifest template, as it is
required for the TLS secret to render correctly.
This also removes the readiness probe from the nagios container in
the deployment for the nagios chart, as it wasn't functioning as
intended due to the port not being available for the probe
Change-Id: Iabcfd40c74938e0497d08ffeeebc98ab722fa660
Adds support for TLS on overridden fqdns for public endpoints for
the services that have them in openstack-helm-infra. Currently this
implementation is limited, in that it does not provide support for
dynamically loading CAs into the containers, or specifying them manually
via configuration. As a result, only well-known CAs or CAs added manually
to containers will be recognised.
Change-Id: I4ab4bbe24b6544b64cd365467e8efb2a421ac3f4
This PS removes the use of the `quote and truncate` approach to
suppress output from gotpl actions in templates and replaces it
with the recommended practice of defining `$_` instead.
Change-Id: I5fedc3471dcbecef37d2fe1302bf9760b3163467
Signed-off-by: Pete Birley <pete@port.direct>
This PS moves to use the current API version for kubernetes rcs
that were previously using `apps/v1beta1`.
Story: 2002205
Task: 21735
Change-Id: Icb4e7aa2392da6867427a58926be2da6f424bd56
Signed-off-by: Pete Birley <pete@port.direct>
This adds a basic check for capacity utilization for persistent
volume claims. To accomplish this, it adds a basic alerting rule
to prometheus that fires once a persistent volume's usage has
exceeded 80% for 5 minutes. In addition, there is a service check
added to the nagios chart that will query Prometheus to check if
the alarm for that threshold is firing for any of the volume claims.
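The "above 80% for 5 minutes" behaviour can be sketched in Python. This models the idea of a pending period before an alert fires, not Prometheus's actual evaluation loop: the function name, the sample format and the one minute sample spacing are assumptions.

```python
from typing import List, Tuple


def alert_fires(samples: List[Tuple[int, float]],
                threshold: float = 0.80,
                hold_seconds: int = 300) -> bool:
    """Return True if usage stays above threshold for at least hold_seconds.

    samples is a time-ordered list of (timestamp_seconds, used_fraction).
    """
    pending_since = None
    for timestamp, used in samples:
        if used > threshold:
            if pending_since is None:
                pending_since = timestamp  # condition just became true
            if timestamp - pending_since >= hold_seconds:
                return True
        else:
            pending_since = None  # any dip below the threshold resets the timer
    return False


steady = [(t, 0.85) for t in range(0, 301, 60)]
print(alert_fires(steady))
```

A brief spike that drops back under the threshold within 5 minutes does not fire, which keeps the nagios check from flapping on short bursts.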
Change-Id: I862c860ac479a715733202f679bb151885d7aa7c