openstack-helm-infra

Author	SHA1	Message	Date
Steve Wilkerson	12529ddf0a	Nagios: Update default workers back to 4 This moves the default workers defined for Nagios back to 4, as reducing the number to one has made Nagios deployments in our jobs inconsistent Change-Id: I288554c4f9d57714b3d70a241942b2e4fd334500	2019-02-04 14:21:15 -06:00
Steve Wilkerson	442e4985c3	Nagios: Reduce Nagios concurrent checks and workers This reduces the maximum concurrent checks Nagios will execute to prevent process sprawl on the host. This also reduces the number of default workers to a single worker, to prevent Nagios from forking off multiple processes that then execute service checks and commands in parallel Change-Id: I0d8445a265740b4a2491bdfd739cb0f27955f06d	2019-02-01 08:31:31 -06:00
Steve Wilkerson	87ff958fb8	Prometheus: Update pod container status alerts This updates the Prometheus pod container status alerts. This ensures there are alerts defined for ImagePullBackOff, ErrImagePull, and CreateContainerConfigError errors. This also updates the Nagios service checks to include correct checks for those alerts Change-Id: I91544e7dff8c6aac8c79cd8aa7d8f7bc03adaa9a	2019-01-23 16:26:39 +00:00
Steve Wilkerson	046742c9c6	Nagios: Update logging, add readiness probe This updates the Nagios chart configuration to not use syslog for logging, removes the logging of notifications, and drastically increases the number of concurrent checks executed. This also removes the hostPath for Nagios logs, as it seems to add no value over what's already reported to the console. Finally, as Nagios's log file has the potential to grow very rapidly while the service has no means to disable logging to disk, this adds a readiness probe that both checks whether Nagios's endpoint is being served and clears out the log file by redirecting the no-op command's output to the nagios log file. Change-Id: I81151c48ef4e0b7877f595c271f55b8fd479e8c1	2019-01-17 11:12:16 -06:00
Zuul	379d918a20	Merge "Update Elasticsearch health status expressions"	2019-01-16 21:14:44 +00:00
Steve Wilkerson	9e5a295465	Update Elasticsearch health status expressions This updates the Elasticsearch health status expressions used in Prometheus, Nagios and Grafana. The previous Prometheus rule defined for Elasticsearch health checked for a status that was > 0 to trigger an alarm for a green health status. The correct returned values are: 1 for green, 0 for both red and yellow. This changes the expression to use arithmetic operators to give us a result that maps to: 2 for green, 1 for yellow, 0 for red. This also updates the Elasticsearch dashboard in Grafana to add a new mapping for the updated 2g,1y,0r scale. Finally, this also updates the Nagios service check to be a bit more verbose in its output. For reference, see: https://github.com/justwatchcom/elasticsearch_exporter/issues/120 Change-Id: I6ef2a7c308c6ebfdb693b46127a285bceb6ba872	2019-01-16 11:11:59 -06:00
Steve Wilkerson	00b40480a3	Nagios: Fix elasticsearch query clause volume mount This fixes the Nagios volume mount for the Elasticsearch query file. Previously, the check for adding the volumemount to the pod definition was incorrect. This fixes the conditional check, and also adds the same conditional check to the configuration secret This adds a simple check to the monitoring and multinode jobs to validate the resulting json gets mounted into the pod successfully Change-Id: I2af289ccc4e1cff1669cb5e6e829514781b14dd3	2019-01-15 16:18:01 -06:00
Steve Wilkerson	30d2cf00d4	Remove unused pod-etc-apache volumes This removes unused pod-etc-apache volumes from the charts that use an apache sidecar container as a reverse proxy. Change-Id: Ibafff3b53f9d3c20f5aed30d40ee6470cb515a8a	2019-01-04 10:31:35 -06:00
Chris Wedgwood	0c4e37391f	'NOP' cleanup for more consistent white-space use in charts Where we have the style '{{ ...' we should use the style '... }}'. Change-Id: Ic3e779e4681370d396f95d3804ca27db5b9d3642	2019-01-03 22:45:49 +00:00
Pete Birley	0bf3674539	Revert "Add Egress Helm-toolkit function & enforce the nework policy at OSH-INFRA" This reverts commit 8d33a2911cda0c9e88406b9eeacbd8dfa70286f2. Change-Id: Ic861b9bf9b337449b47a3558da8355e7a5bcacee	2018-12-16 04:21:46 +00:00
Mike Pham	8d33a2911c	Add Egress Helm-toolkit function & enforce the nework policy at OSH-INFRA This PS implements the helm toolkit function to generate the Egress in kubernetes network policy manifest based on overrideable values. It also enables the K8s network policy at the Osh-infra gate. Change-Id: Icbe2a18c98dba795d15398dcdcac64228f6a7b4c	2018-12-14 16:32:40 -05:00
Zuul	b591e0754a	Merge "Add Nagios Elasticsearch Query Command"	2018-12-06 20:50:28 +00:00
Huang, Scott (sh2725)	bd05126309	Add Nagios Elasticsearch Query Command Change-Id: I74a965a5397101793cae71228a6a5bd442bf9f5a	2018-12-03 09:09:03 -05:00
Steve Wilkerson	439079693d	Nagios: Update image tag This updates the Nagios image tag to include the updated plugin for querying Elasticsearch for alerting on logged events Change-Id: Idd61d82463b79baab0e94c20b32da1dc6a8b3634	2018-11-26 08:29:22 -06:00
Steve Wilkerson	dfb4654fba	Nagios: Configuration updates This moves to update the host used for the ceph health checks, as we should be checking the ceph-mgr service directly for ceph metrics instead of trying to curl the host directly. This also changes the ceph_health_check to use the base-os hostgroup instead of the placeholder ceph-mgr host group, as we're just executing a simple check against the ceph-mgr service. This also adds default configuration values for the max_concurrent_checks (60) and check_workers (4) values instead of leaving them at the defaults Nagios uses (0 and # cores, respectively) Change-Id: Ib4072fcd545d8c05d5e9e4a93085a8330be6dfe0	2018-11-09 13:28:50 -06:00
Steve Wilkerson	325b3cea4d	Nagios: Update host check mechanism This updates the Nagios image to use a tag that includes a fix for the service discovery mechanism used for updating host checks. After moving the Nagios chart to either run in shared or host PID namespaces, the service discovery mechanism no longer worked due to the plugin attempting to restart PID 1 instead of determining the appropriate PID to restart. For reference, see: https://review.gerrithub.io/#/c/att-comdev/nagios/+/432205/ Change-Id: Ie01c3a93dd109a9dc99cfac5d27991583546605a	2018-11-09 09:12:16 -06:00
Zuul	b55e9b10a7	Merge "Nagios: Add session affinity to ingress"	2018-11-09 04:45:36 +00:00
Steve Wilkerson	2c6aa8ad1b	Nagios: Add session affinity to ingress This adds session affinity to Nagios's ingress. This allows for the use of cookies for Nagios's session affinity Change-Id: I6054a92f644dc533dd06d35a2541fb44d46cba88	2018-11-09 02:07:39 +00:00
Steve Wilkerson	ba22b0e726	Nagios: Update ceph_health check The ceph_health check in Nagios incorrectly sets the warning and error level to 0. The ceph_health_status metric's value of 0 indicates the cluster is healthy, while 1 indicates a warning and 2 indicates an error state. The Nagios check for ceph_health is updated to reflect these values Change-Id: Iffe80f1c34f6edee6370dd7e707e5f55f83f1ec1	2018-11-06 14:51:40 -06:00
Steve Wilkerson	69196031cd	Nagios: Ensure processes are reaped This moves Nagios to run as child processes of either the pause container or use the hosts init system (for k8s <1.10) to prevent defunct process sprawl Change-Id: I6a93d446577674b0b012f9567d5e6a5794ebc44b	2018-11-02 08:12:24 -05:00
Steve Wilkerson	8d6cfd72d0	Nagios: Remove Nagios log monitors This removes the checks for Nagios to query Elasticsearch for logged events. The current plugin in the image is resulting in unstable behavior, and should be removed until this plugin has been improved Change-Id: If1bdd954956f063ac1eebbb94d1128df8b8d2695	2018-10-25 05:21:22 +00:00
Huang, Scott (sh2725)	b99d39dd95	[467551] Mount Nagios Logfile Mount Nagios logfile to host to enable log streaming to elasticsearch Change-Id: I297f61067c0ff3e870e14b124a5c6fdd49e12b01	2018-10-21 15:37:40 +00:00
Zuul	1f9c8d7f42	Merge "Nagios: Update image with Elasticsearch plugin headers"	2018-10-15 17:58:17 +00:00
Steve Wilkerson	19248c11e9	Nagios: Update image with Elasticsearch plugin headers This updates the Nagios image to include an update to the Elasticsearch plugin that adds the appropriate headers to the request sent to Elasticsearch. As Elasticsearch >=6.0 no longer tries to determine the request data type, we need to explicitly tell Elasticsearch the request body is JSON. Since we use Elasticsearch 5.6.4 as default, this change will make the deprecation warnings for the 6.0 breaking change go away. Change-Id: I0dbd8859ca8d0bd0893832b4edd92742e575598b	2018-10-15 14:20:22 +00:00
Tin Lam	92e68d33ea	Add network policy toolkit function This patch set implements the helm toolkit function to generate a kubernetes network policy manifest based on overrideable values. This also adds a chart that shuts down all the ingress and egress traffic in the namespace. This can be used to ensure the whitelisted network policy works as intended. Additionally, implementation is done for some infrastructure charts. Change-Id: I78e87ef3276e948ae4dd2eb462b4b8012251c8c8 Co-Authored-By: Mike Pham <tp6510@att.com> Signed-off-by: Tin Lam <tin@irrational.io>	2018-10-15 13:50:50 +00:00
rakesh-patnaik	db0d653b4d	Monitor postgresql, Openstack virt resources, api, logs, pod and nodes status Fixing OpenStack API monitors. Adding additional neutron services monitors. Adding new Pod CrashLoopBackOff status check. Adding new Host readiness check. Updated the nagios image reference (https://review.gerrithub.io/c/att-comdev/nagios/+/420590 - Pending). This updated image provides a mechanism for querying Elasticsearch with the goal of triggering alerts based on specified applications and log levels. Finally, this moves the endpoints resulting from the authenticated endpoint lookups required for Nagios to the nagios secret instead of handling them via plain text environment variables Change-Id: I517d8e6e6e8fa1d359382be8a131a8e45bf243e2	2018-09-21 08:22:13 +00:00
Pete Birley	bb3ff98d53	Add release uuid to pods and rc objects This PS adds the ability to attach a release uuid to pods and rc objects as desired. A follow up ps will add the ability to add arbitrary annotations to the same objects. Change-Id: Iceedba457a03387f6fc44eb763a00fd57f9d84a5 Signed-off-by: Pete Birley <pete@port.direct>	2018-09-13 05:35:35 +00:00
Scott Huang	bc54e72fd3	Monitor Cinder API and Scheduler Change-Id: I159facb491d9a722d8c067ead25c470f00b83939	2018-09-07 15:12:32 +00:00
Zuul	86d67cede3	Merge "Nagios: Use public endpoint for ldap"	2018-08-31 22:00:37 +00:00
Pete Birley	1d8caeede6	Nagios: Use public endpoint for ldap This PS updates the nagios chart to use the public endpoint for ldap in the apache config. Change-Id: Ia7c1881a15fda3100fb006e9cf1d06d22dcd6a8d Signed-off-by: Pete Birley <pete@port.direct>	2018-08-30 13:47:37 -05:00
Steve Wilkerson	9a311475ba	Charts: Use secrets for configs in chart This updates the osh-infra charts to use a secret for their configuration files instead of a configmap, allowing for the storage of sensitive information Change-Id: Ia32587162288df0b297c45fd43b55cef381cb064	2018-08-24 15:56:53 -05:00
Steve Wilkerson	8652e14acb	Add auth for prometheus This adds authentication to Prometheus with an apache reverse proxy, similar to elasticsearch, kibana and nagios. This adds an admin user and password via htpasswd along with adding ldap support. This required modifying the grafana chart to configure the prometheus datasource's basic auth credentials in the data sources provisioning configuration file by checking whether basic auth is enabled and injecting the username/password defined in the corresponding endpoint definition. This also modifies the nagios chart to use the authenticated endpoint for prometheus, which is required for nagios to successfully query the prometheus endpoint for its service checking mechanism Change-Id: Ia4ccc3c44a89b2c56594be1f4cc28ac07169bf8c	2018-08-08 18:49:45 +00:00
Seungkyu Ahn	a430533e6a	Quoting node_select_value in Ingress Controller In most cases, the ingress controller's nodeSelector key and value are "node-role.kubernetes.io/ingress" and "true". Using quote to treat the nodeSelector value as a string. Change-Id: Ie1745629b90795e4d888d85f35565e6d6350e09b	2018-08-01 02:39:05 +00:00
Steve Wilkerson	6f6c6b8b99	Nagios/Kibana: Update configmap annotations This changes the ordering of the configmap annotations for kibana, as older versions of helm require the configmap with the values template definition for the apache proxy to be listed last. This was addressed in the elasticsearch-client template but missed in kibana. This also adds the configmap hash annotations to the nagios chart as they were previously missing. It also places them in the correct order as above Change-Id: I13befe8684d975f310f2723c5172b8a0f9f365d6	2018-07-30 12:33:17 -05:00
Steve Wilkerson	4f78e1f6fc	Drive apache proxy configuration via values templates This proposes defining the apache proxy hosts entirely via values templates. While complicated on its face, this gives flexibility by allowing the ability to define the desired authentication mechanism via values templates. These options can range from using http basic auth for development purposes to defining more complex ldap configurations without a need to modify the chart directly Change-Id: Ief1b6890444ff90cc9c0ca872087af74836c0771 Signed-off-by: Pete Birley <pete@port.direct>	2018-07-30 07:52:26 -05:00
Steve Wilkerson	7ea9a075ba	Nagios: Update image reference to include discovery fix This updates the Nagios image tag to include a version that fixes the service discovery bug that resulted in duplicate host group entries. The duplicate host group entries would prevent Nagios from restarting, resulting in the service never coming back up when duplicate host groups were identified and added Change-Id: I555c525e47deffd95eeb5a7276c00cf044e61e3a	2018-07-10 14:40:55 -05:00
Steve Wilkerson	c26a1b53f6	Update TLS secret templates, remove nagios readiness probe This updates the TLS secret templates to include the backend service in the dict supplied to the manifest template, as it is required for the TLS secret to render correctly. This also removes the readiness probe from the nagios container in the deployment for the nagios chart, as it wasn't functioning as intended due to the port not being available for the probe Change-Id: Iabcfd40c74938e0497d08ffeeebc98ab722fa660	2018-06-27 18:56:45 -05:00
Steve Wilkerson	b823954787	Ingress: Add initial TLS Support for osh-infra public endpoints Adds support for TLS on overridden fqdns for public endpoints for the services that have them in openstack-helm-infra. Currently this implementation is limited, in that it does not provide support for dynamically loading CAs into the containers, or specifying them manually via configuration. As a result only well-known CAs or CAs added manually to containers will be recognised. Change-Id: I4ab4bbe24b6544b64cd365467e8efb2a421ac3f4	2018-06-26 14:47:19 -05:00
Steve Wilkerson	cb7bf2c0b3	Add missing readiness probes to openstack-helm-infra charts This adds missing readiness probes to the following charts in openstack-helm-infra: elasticsearch, fluent-logging, kibana, nagios, prometheus-kube-state-metrics, prometheus-node-exporter, and prometheus-openstack-exporter Change-Id: I6a2635b08667c31eadb1b05ba848c658935a17e5	2018-06-26 12:25:36 +00:00
Steve Wilkerson	2dd5bf0594	Update ordering of auth providers in apache reverse proxy This updates the ordering of the basic auth providers in the elasticsearch and nagios chart to check the file provider first before going out to check the configured ldap server. Change-Id: I47ff8a1c7b2cefa8425914c5d4d7a76aa8d43216 Signed-off-by: Steve Wilkerson <wilkers.steve@gmail.com>	2018-06-25 12:43:06 -05:00
Zuul	1051065c2c	Merge "Daemonsets: Use current kubernetes daemonset api version"	2018-06-14 16:24:33 +00:00
Zuul	0c9eae2d84	Merge "Nagios: update functions to live in correct locations"	2018-06-14 00:55:48 +00:00
Pete Birley	fa629cdbbd	Daemonsets: Use current kubernetes daemonset api version This PS moves to use the current ga version for kubernetes daemonsets, additionally any remaining deployments that were using the `extensions/v1beta1` have been updated to `apps/v1`. Story: 2002205 Task: 21735 Change-Id: If9703162dc472af1e6096bf2b9062802fd5ce8ab Signed-off-by: Pete Birley <pete@port.direct>	2018-06-13 21:53:18 +00:00
Steve Wilkerson	561780f347	PVC monitoring: Add alerting rules and service check for PVCs This adds a basic check for capacity utilization for persistent volume claims. To accomplish this, it adds a basic alerting rule to prometheus that triggers after a persistent volume's usage exceeds 80%, and triggers 5 minutes after that state has been reached. In addition, there is a service check added to the nagios chart that will query Prometheus to check if the alarm for that threshold is firing for any of the volume claims. Change-Id: I862c860ac479a715733202f679bb151885d7aa7c	2018-06-12 14:28:24 +00:00
Pete Birley	c48e47b47a	Nagios: update functions to live in correct locations This PS simply moves functions within the chart to their correct location. Change-Id: Ia3d693713903d226a864dcdcf9884dee67f07d2b Signed-off-by: Pete Birley <pete@port.direct>	2018-06-11 22:14:44 -05:00
Steve Wilkerson	c7d0317768	Add nagios cgi.cfg file control to values.yaml This adds the ability to drive the CGI configuration for nagios via values, similar to the other nagios configuration entities Change-Id: I8e9de21d141e0a87cdda11c4a778abec210277f3	2018-05-24 11:24:37 -07:00
Rakesh Patnaik	52c980b10c	Prometheus alerts, nagios defn - rabbitmq,mariadb,ES Change-Id: I71bc9f42aebc268ad2383a5a36a3405fc47c6c9e	2018-05-20 15:16:57 +00:00
Rakesh Patnaik	69cd66b7c9	Nagios notification on alerts and ceph monitoring Change-Id: I782f54b5ad8159e7a4375d336a42524f380e65d2	2018-05-20 15:16:42 +00:00
Steve Wilkerson	db89ab8204	Add ldap support to nagios This adds an apache reverse proxy to the nagios chart, similar to elasticsearch and kibana. It also adds authentication to nagios via ldap Change-Id: I7b17703b5d4c1e041691ffceb984a9f5951cbeb9	2018-05-15 09:21:18 -05:00
Zuul	cc06b57b42	Merge "Nagios chart modifications to use prometheus alert metric for monitoring"	2018-04-22 23:25:24 +00:00


59 Commits