openstack-helm-infra

Author	SHA1	Message	Date
RAHUL KHIYANI	916bdabee7	prometheus: Fix security context This PS fixes the use of the security context macros for the prometheus chart. Change-Id: I0abb309132a9954a140cbf76463724c5e2c7c5f3	2019-04-23 00:00:36 +00:00
Zuul	d27e548f8f	Merge "OSH-Infra: Add emptydirs for tmp"	2019-04-21 02:21:11 +00:00
Pete Birley	2abf62ff4d	OSH-Infra: Add emptydirs for tmp This PS adds emptydirs backing the /tmp directory in pods, which is required in most cases for full operation when using a read only filesystem backing the container. Additionally some yaml indent issues are resolved. Change-Id: I8b7f1614da059783254aa6efc09facf23fca3cad Signed-off-by: Pete Birley <pete@port.direct>	2019-04-20 20:50:59 +00:00
Rahul Khiyani	f25e458515	Prometheus: Add pod/container security context This updates the prometheus chart to include the pod security context on the pod template. This changes the pod's user from root to the nobody user instead This also adds the container security context to explicitly set allowPrivilegeEscalation to false and readOnlyRootFilesystem to true Change-Id: I2a3a4b77d9b25c086dc23b4fd66dca92872c422d	2019-04-20 18:54:44 +00:00
Steve Wilkerson	84f30ec103	Add release-annotation to pod spec, add missing annotations This adds the release-annotation to the pod spec for the charts in openstack-helm-infra. This also adds missing configmap annotations to charts in openstack-helm-infra Change-Id: Ie23f0c16a7a21d3929e98928db2bbcef69ae6490	2019-03-21 09:10:48 -05:00
Steve Wilkerson	3413dba8c0	Update ingress controller image, ingress cookie annotations This updates the ingress controller image to v0.23.0, which was required to add support for configuring cookie max age and expires for ingresses via annotations on the ingress. This also removes the --enable-dynamic-configuration flag, as the flag is no longer valid in 0.23.0 due to the functionality being a default behavior of the nginx ingress controller in recent releases Change-Id: I4917797c43ec973ed0bb311fc305b01f10abd4e5	2019-03-07 20:39:03 +00:00
Rahul Khiyani	bfa58f9177	readOnlyRootFilesystem: true for Prometheus chart Fix for adding readOnlyRootFilesystem flag at pod level Change-Id: I04079be87780292da1bf9b2142f0a01a8b575b5b	2019-03-07 17:42:48 +00:00
Zuul	e836707ad0	Merge "Add east-west ingress network policy to Prometheus"	2019-03-07 04:44:10 +00:00
Meg Heisler	243f6c7608	Add east-west ingress network policy to Prometheus This adds an ingress policy to Prometheus and utilizes the helm-toolkit used in openstack-helm Change-Id: Ia89d42a5305c94da26337aaf716978c1defae503	2019-03-06 11:56:13 -06:00
Steve Wilkerson	4c0fd492ee	Update logging format and config for apache reverse proxies This updates the logging format and configuration for the apache reverse proxies used for elasticsearch, kibana, nagios and prometheus to enable logging of the remote clients used to access these services Change-Id: Id07e4294ea18203fbb890b78424a232c2d59cb82	2019-02-25 09:21:41 -06:00
Chris Wedgwood	332d7a4e39	[Prometheus] Tweak K8SApiServerLatency to ignore DELETECOLLECTION DELETECOLLECTION for some things like namespaces can be very slow. As it's not critical it should be safe to ignore it. Change-Id: I513b2af45b703a73d20a98a7a770776632ae4b39	2019-02-16 16:58:16 +00:00
Chris Wedgwood	d7808468fc	[Prometheus] Relax disk IO constraints Relax the timing constraints for disk IO to accommodate rotating disks; a "measured IO" might be the result of a small number of physical IOs, allow for enough time for a small number of disk rotations (this isn't perfect but seems to be about right in testing under load). Change-Id: Ifb067a2218528e5918d2f4b2ba169b6e739084e0	2019-01-29 06:41:51 +00:00
Chris Wedgwood	4fb6ee6e35	[Prometheus] Fix filesystem space checks Change-Id: Id527ea6e08070cb7d2634417a7c203c1c5c3d97c	2019-01-29 06:34:54 +00:00
Steve Wilkerson	87ff958fb8	Prometheus: Update pod container status alerts This updates the Prometheus pod container status alerts. This ensures there are alerts defined for ImagePullBackOff, ErrImagePull, and CreateContainerConfigError errors. This also updates the Nagios service checks to include correct checks for those alerts Change-Id: I91544e7dff8c6aac8c79cd8aa7d8f7bc03adaa9a	2019-01-23 16:26:39 +00:00
Steve Wilkerson	9e5a295465	Update Elasticsearch health status expressions This updates the Elasticsearch health status expressions used in Prometheus, Nagios and Grafana. The previous Prometheus rule defined for Elasticsearch health checked for a status that was > 0 to trigger an alarm for a green health status. The correct returned values are: 1 for green, 0 for both red and yellow. This changes the expression to use arithmetic operators to give us a result that maps to: 2 for green, 1 for yellow, 0 for red. This also updates the Elasticsearch dashboard in Grafana to add a new mapping for the updated 2g,1y,0r scale. Finally, this also updates the Nagios service check to be a bit more verbose in its output. For reference, see: https://github.com/justwatchcom/elasticsearch_exporter/issues/120 Change-Id: I6ef2a7c308c6ebfdb693b46127a285bceb6ba872	2019-01-16 11:11:59 -06:00
Steve Wilkerson	30d2cf00d4	Remove unused pod-etc-apache volumes This removes unused pod-etc-apache volumes from the charts that use an apache sidecar container as a reverse proxy. Change-Id: Ibafff3b53f9d3c20f5aed30d40ee6470cb515a8a	2019-01-04 10:31:35 -06:00
Chris Wedgwood	0c4e37391f	'NOP' cleanup for more consistent white-space use in charts Where we have the style '{{ ...' we should use the style '... }}'. Change-Id: Ic3e779e4681370d396f95d3804ca27db5b9d3642	2019-01-03 22:45:49 +00:00
Pete Birley	0bf3674539	Revert "Add Egress Helm-toolkit function & enforce the nework policy at OSH-INFRA" This reverts commit 8d33a2911cda0c9e88406b9eeacbd8dfa70286f2. Change-Id: Ic861b9bf9b337449b47a3558da8355e7a5bcacee	2018-12-16 04:21:46 +00:00
Mike Pham	8d33a2911c	Add Egress Helm-toolkit function & enforce the nework policy at OSH-INFRA This PS implements the helm toolkit function to generate the Egress in kubernetes network policy manifest based on overrideable values. It also enables the K8s network policy at the Osh-infra gate. Change-Id: Icbe2a18c98dba795d15398dcdcac64228f6a7b4c	2018-12-14 16:32:40 -05:00
Steve Wilkerson	71c1a16758	Prometheus: Add session affinity to ingress This adds session affinity to Prometheus's ingress. This allows for the use of cookies for Prometheus's session affinity Change-Id: I2e7e1d1b5120c1fb3ddecb5883845e46d61273de	2018-11-26 14:30:08 +00:00
Zuul	a90ebb784c	Merge "Prometheus: Update discovery configuration for ceph-mgr services"	2018-11-09 01:01:54 +00:00
Steve Wilkerson	e0f2d66ee3	Prometheus: Update discovery configuration for ceph-mgr services This updates the Prometheus scrape configuration to use the service based discovery mechanism instead of endpoints. This removes issues associated with multiple ceph-mgr replicas deployed Change-Id: I2c557af0c7200d0c4aea646c5f9ecd1a070db33e	2018-11-06 13:56:37 -06:00
kranthi guttikonda	fac358a575	prometheus ceph.rules changes With new ceph luminous ceph.rules are obsolete. Added a new rule for ceph-mgr count Changed ceph_monitor_quorum_count to ceph_mon_quorum_count Updated ceph_cluster_usage_high as ceph_cluster_used_bytes, ceph_cluster_capacity_bytes aren't valid Updated ceph_placement_group_degrade_pct_high as ceph_degraded_pgs, ceph_total_pgs aren't valid Updated ceph_osd_down_pct_high as ceph_osds_down, ceph_osds_up aren't available, ceph_osd_up is available but ceph_osd_down isn't. Need to calculate the down based on count(ceph_osd_up==0) and total osd using count(ceph_osd_metadata) Removed ceph_monitor_clock_skew_high as the metric ceph_monitor_clock_skew_seconds isn't valid anymore Added new alarms ceph_osd_down, ceph_osd_out Implements: prometheus ceph.rules changes with new valid metrics Closes-Bug: #1800548 Change-Id: Id68e64472af12e8dadffa61373c18bbb82df96a3 Signed-off-by: Kranthi Guttikonda <kranthi.guttikonda@b-yond.com>	2018-10-31 10:23:11 -04:00
Zuul	11ec46bdce	Merge "Prometheus kubelet.rules change"	2018-10-23 17:57:26 +00:00
Tin Lam	92e68d33ea	Add network policy toolkit function This patch set implements the helm toolkit function to generate a kubernetes network policy manifest based on overrideable values. This also adds a chart that shuts down all the ingress and egress traffics in the namespace. This can be used to ensure the whitelisted network policy works as intended. Additionally, implementation is done for some infrastructure charts. Change-Id: I78e87ef3276e948ae4dd2eb462b4b8012251c8c8 Co-Authored-By: Mike Pham <tp6510@att.com> Signed-off-by: Tin Lam <tin@irrational.io>	2018-10-15 13:50:50 +00:00
kranthi guttikonda	f995680e2a	Prometheus kubelet.rules change kube_node_status_ready and up metrics are obsolete to check the kubernetes node condition. When a kubelet is down that means the node itself is in NotReady state. With 1.3.1 kube-state-metrics exporter kube_node_status_condition metric provides the status value of the kubelet (essentially node). https://github.com/kubernetes/kube-state-metrics/blob/master/Documentation/node-metrics.md kube_node_status_condition includes condition=Ready and status as true, false and unknown. When a kubelet is stopped the status will be unknown since the kubelet itself will be unable to talk to the API. In other cases it will be false. When the node is registered and available it will be set to true. Replaced the kube_node_status_ready with kube_node_status_condition and changed the 1h to 1m and increased the severity to "critical". Also modified the K8SKubeletDown definitions with 1m and critical severity Implements: Bug 1797133 Closes-Bug: #1797133 Change-Id: I025adb13c9d8642a218dfda1ff30f1577fa8c826 Signed-off-by: Kranthi Kiran Guttikonda <kranthi.guttikonda@b-yond.com>	2018-10-11 16:31:16 -04:00
Steve Wilkerson	c7cbb9f4dd	Charts: Update heat image used for jobs and helm tests This changes the image used for various jobs and helm tests in the osh-infra charts. This replaces the kolla heat image with the loci based heat image used for jobs and helm tests in openstack-helm in order to drive consistency Change-Id: Ie9deedadb7507282fe62723ec4641dd508040364	2018-10-11 14:47:58 -05:00
Steve Wilkerson	bfa237d347	Charts: Update helm test pod templates This updates the helm test pod templates in the charts with helm tests defined. This change includes the addition of: - Generate test pod cluster roles and role bindings - Generate service accounts for test pods - Add node selectors to the test pods - Add service accounts to the test pods - Addition of entrypoint container to the test pods - Indentation fix for rabbitmq test pod template Change-Id: I9a0dd8a1a87bfe5eaf1362e92b37bc004f9c2cdb	2018-10-09 21:00:00 +00:00
Steve Wilkerson	4c532bb8f3	Prometheus: Remove Kubernetes recording rules This removes the recording rules for Kubernetes, as these rules add significant overhead to the total evaluation time for rules. Any recording rules should be handled as operator overrides and not set by default, in order to prevent undesired overhead time for rules that aren't currently used by the charts Change-Id: I183d32e62619b71b5020cd3733e4707d7c9ad11b	2018-10-01 11:56:34 +00:00
rakesh-patnaik	db0d653b4d	Monitor postgresql, Openstack virt resources, api, logs, pod and nodes status Fixing openstack API monitors Adding additional neutron services monitors Adding new Pod CrashLoopBackOff status check Adding new Host readiness check Updated the nagios image reference(https://review.gerrithub.io/c/att-comdev/nagios/+/420590 - Pending) This updated image provides a mechanism for querying Elasticsearch with the goal of triggering alerts based on specified applications and log levels. Finally, this moves the endpoints resulting from the authenticated endpoint lookups required for Nagios to the nagios secret instead of handled via plain text environment variables Change-Id: I517d8e6e6e8fa1d359382be8a131a8e45bf243e2	2018-09-21 08:22:13 +00:00
Zuul	1c6a33d979	Merge "Prometheus: Prune large unused time series metrics"	2018-09-18 05:30:39 +00:00
Zuul	5ec85a5d70	Merge "Prometheus: Fix Prometheus endpoints in apache config"	2018-09-17 15:19:03 +00:00
Pete Birley	bb3ff98d53	Add release uuid to pods and rc objects This PS adds the ability to attach a release uuid to pods and rc objects as desired. A follow up ps will add the ability to add arbitrary annotations to the same objects. Change-Id: Iceedba457a03387f6fc44eb763a00fd57f9d84a5 Signed-off-by: Pete Birley <pete@port.direct>	2018-09-13 05:35:35 +00:00
Scott Huang	bc54e72fd3	Monitor Cinder API and Scheduler Change-Id: I159facb491d9a722d8c067ead25c470f00b83939	2018-09-07 15:12:32 +00:00
Steve Wilkerson	61b2dbf941	Prometheus: Fix Prometheus endpoints in apache config This updates the endpoints in the apache configuration for Prometheus to correctly define the file used for http basic auth to validate the admin user. The Prometheus endpoints restricted to the admin user specified file for the authbasicprovider, but did not provide the file used for validating the user. This adds the file correctly Change-Id: I8561281236fb1efa2e51af342e30314aae8e5285	2018-09-04 13:26:02 +00:00
Steve Wilkerson	2e4db10e9b	Prometheus: Prune large unused time series metrics This begins to drop metrics from Prometheus scrape configurations. The metrics dropped are metrics not currently used by any service that interacts with Prometheus and are not used in any alerting rules by default. Dropping these metrics reduces the resource use by Prometheus, as it reduces the total number of time series data ingested and analyzed by Prometheus Change-Id: Ia09ddd482da0119167a19e7e4b092879b672c2ec	2018-09-04 13:25:45 +00:00
Steve Wilkerson	9a311475ba	Charts: Use secrets for configs in chart This updates the osh-infra charts to use a secret for their configuration files instead of a configmap, allowing for the storage of sensitive information Change-Id: Ia32587162288df0b297c45fd43b55cef381cb064	2018-08-24 15:56:53 -05:00
Steve Wilkerson	d5dc97a431	Prometheus: Remove block duration flags, update cadvisor job This removes the min_block_duration and max_block_duration flags from the Prometheus chart, as the suggested best practice is to use the defaults (2h min, 10% of retention time as max). This also updates the scrape target configuration for cadvisor to match the upstream example endpoint for kubernetes versions 1.7.3 and later Change-Id: I200969d6c4da9d17d0a7d3a34a114ccc5f5ee70f	2018-08-20 13:26:40 -05:00
Steve Wilkerson	faef231b0b	Prometheus: Update version to 2.3.2 This updates the Prometheus version to 2.3.2, which includes a fix for memory leak issues with the kubernetes client and also adds a dashboard for evaluating prometheus rule evaluation performance Change-Id: I7b9e7bee114fa149db3733c0dacfefae36be7fa8	2018-08-16 16:48:27 +00:00
Steve Wilkerson	8652e14acb	Add auth for prometheus This adds authentication to Prometheus with an apache reverse proxy, similar to elasticsearch, kibana and nagios. This adds an admin user and password via htpasswd along with adding ldap support. This required modifying the grafana chart to configure the prometheus datasource's basic auth credentials in the data sources provisioning configuration file by checking whether basic auth is enabled and injecting the username/password defined in the corresponding endpoint definition. This also modifies the nagios chart to use the authenticated endpoint for prometheus, which is required for nagios to successfully query the prometheus endpoint for its service checking mechanism Change-Id: Ia4ccc3c44a89b2c56594be1f4cc28ac07169bf8c	2018-08-08 18:49:45 +00:00
Seungkyu Ahn	a430533e6a	Quoting node_select_value in Ingress Controller In most cases, the ingress controller's nodeSelector key and value are "node-role.kubernetes.io/ingress" and "true". Using quote to treat the nodeSelector value as a string. Change-Id: Ie1745629b90795e4d888d85f35565e6d6350e09b	2018-08-01 02:39:05 +00:00
Steve Wilkerson	a861c27a34	Prometheus: Update command line flags This updates the default command line flags for Prometheus. It explicitly sets the HTTP administrative settings to false and gives a brief explanation of the security concerns associated with enabling them This also removes the honor_labels setting where set to false, as false is the default setting for honor_labels Change-Id: I69acdbce604864882d642e44c09a5f0b9c454a61	2018-07-27 16:33:37 -05:00
Steve Wilkerson	dc16a897d7	Add missing labels to helm test pods This adds missing labels to the helm test pods in osh-infra Change-Id: I618d9089bfde2d847411f5f876f0ff6afd9cce7f	2018-07-10 08:55:40 -05:00
Steve Wilkerson	c26a1b53f6	Update TLS secret templates, remove nagios readiness probe This updates the TLS secret templates to include the backend service in the dict supplied to the manifest template, as it is required for the TLS secret to render correctly. This also removes the readiness probe from the nagios container in the deployment for the nagios chart, as it wasn't functioning as intended due to the port not being available for the probe Change-Id: Iabcfd40c74938e0497d08ffeeebc98ab722fa660	2018-06-27 18:56:45 -05:00
Steve Wilkerson	b823954787	Ingress: Add initial TLS Support for osh-infra public endpoints Adds support for TLS on overridden fqdns for public endpoints for the services that have them in openstack-helm-infra. Currently this implementation is limited, in that it does not provide support for dynamically loading CAs into the containers, or specifying them manually via configuration. As a result only well known CAs or CAs added manually to containers will be recognised. Change-Id: I4ab4bbe24b6544b64cd365467e8efb2a421ac3f4	2018-06-26 14:47:19 -05:00
Pete Birley	abb00e97fd	Gotpl: remove quote and trunc to suppress output This PS removes the use of the `quote and truncate` approach to suppress output from gotpl actions in templates and replaces it with the recommended practice of defining `$_` instead. Change-Id: I5fedc3471dcbecef37d2fe1302bf9760b3163467 Signed-off-by: Pete Birley <pete@port.direct>	2018-06-16 16:37:08 -05:00
Zuul	e718d4d39b	Merge "Prometheus: update function to live in correct location"	2018-06-14 00:50:59 +00:00
Zuul	01d196e761	Merge "Use current kubernetes API version"	2018-06-13 13:00:58 +00:00
Pete Birley	b6a51fb57f	Use current kubernetes API version This PS moves to use the current API version for kubernetes rcs' that were previously using `apps/v1beta1`. Story: 2002205 Task: 21735 Change-Id: Icb4e7aa2392da6867427a58926be2da6f424bd56 Signed-off-by: Pete Birley <pete@port.direct>	2018-06-12 17:35:13 -05:00
Steve Wilkerson	561780f347	PVC monitoring: Add alerting rules and service check for PVCs This adds a basic check for capacity utilization for persistent volume claims. To accomplish this, it adds a basic alerting rule to prometheus that triggers after a persistent volume's usage exceeds 80%, and triggers 5 minutes after that state has been reached. In addition, there is a service check added to the nagios chart that will query Prometheus to check if the alarm for that threshold is firing for any of the volume claims. Change-Id: I862c860ac479a715733202f679bb151885d7aa7c	2018-06-12 14:28:24 +00:00


102 Commits