This updates the logic to check for incomplete PGs in the Ceph
cluster and to proceed if there are no incomplete or inactive PGs,
rather than waiting for the Ceph cluster to become healthy.
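As a rough illustration of the check described above, the Python sketch below decides whether to proceed from a list of PG state strings. How the states are collected from the cluster is left out and assumed. The helper name `pgs_ready` is hypothetical. Ceph PG states are `+`-joined strings such as `active+clean`, so here a PG counts as inactive when `active` is not one of its components.

```python
from typing import Iterable


def pgs_ready(pg_states: Iterable[str]) -> bool:
    """Return True when no PG is incomplete or inactive.

    Hypothetical helper: each item is a Ceph PG state string such as
    "active+clean" or "active+undersized+degraded". Degraded or
    recovering PGs are tolerated on purpose, because the goal is to
    proceed without waiting for a fully healthy cluster.
    """
    for state in pg_states:
        components = set(state.split("+"))
        if "incomplete" in components:
            return False
        if "active" not in components:
            return False
    return True


if __name__ == "__main__":
    print(pgs_ready(["active+clean", "active+undersized+degraded"]))  # True
    print(pgs_ready(["active+clean", "incomplete"]))  # False
    print(pgs_ready(["peering"]))  # False
```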
Change-Id: I026d6cc378053e805680c31d75fdfb40bbb636f5
This patch fixes an issue with Postgres HA where
the PVC that stores the database was filling up with
WAL records that were never deleted, due to a
misconfiguration in Postgres. Once the PVC filled up,
replication would fail across the nodes and the
database would be unable to start, crashing the
system.
Specifically, archive_mode was turned on, but no
archive_command was supplied through which to archive
the logs. When WAL archiving is turned on, old WAL files
cannot be removed until the system has archived them.
However, since the system was never told how to archive
the files, archiving could never succeed, so the WAL
files were never cleaned up.
Also in this patch are some small housekeeping items:
- Lowered wal_keep_segments drastically so Postgres
  keeps fewer WAL segments around, minimizing the
  chance of the PVC filling up
- Changed wal_level from 'logical' to 'hot_standby'
  to keep it consistent with the fact that Patroni uses
  streaming replication rather than logical replication
- Removed the autovacuum configurations, as they are
  not needed
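To give a sense of scale for the wal_keep_segments change, the sketch below estimates the minimum disk space Postgres reserves for retained WAL. It assumes the default 16 MiB WAL segment size; the segment counts in the usage block are illustrative only, since the actual old and new values are not stated here.

```python
# Default WAL segment size in Postgres (16 MiB). It can be changed when
# the cluster is initialized, so treat this as an assumption.
WAL_SEGMENT_BYTES = 16 * 1024 * 1024


def retained_wal_bytes(wal_keep_segments: int,
                       segment_bytes: int = WAL_SEGMENT_BYTES) -> int:
    """Lower bound on disk used by WAL kept around for standbys.

    wal_keep_segments asks Postgres to keep at least this many past
    segments, so the retained volume is roughly the segment count
    multiplied by the segment size.
    """
    if wal_keep_segments < 0:
        raise ValueError("wal_keep_segments must be non-negative")
    return wal_keep_segments * segment_bytes


if __name__ == "__main__":
    for segments in (1000, 32):  # illustrative values, not the chart's
        gib = retained_wal_bytes(segments) / 1024 ** 3
        print(f"wal_keep_segments={segments}: ~{gib:.2f} GiB retained")
```

This estimate does not cover the archiving failure described above: with archive_mode on and no usable archive_command, unarchived segments are kept regardless of wal_keep_segments.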
Change-Id: Id48c3ee9976823b2bdb4395a029fe75476bdaa62
This adds a basic helm-toolkit snippet template for adding
Kubernetes liveness and readiness probes to a container. This adds
flexibility by allowing the probes' contents to be defined wholesale
via values overrides.
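A minimal Python sketch of the idea follows. It is not the helm-toolkit template itself, which uses Go templating. The probe definitions are taken wholesale from a values override and attached to the container spec unchanged, so any probe type (httpGet, tcpSocket, exec) can be configured without template changes. The values layout `probes.<container>.liveness/readiness` and the function name `apply_probes` are assumptions for illustration.

```python
import copy
from typing import Any, Dict

Spec = Dict[str, Any]


def apply_probes(container: Spec, values: Spec, name: str) -> Spec:
    """Return a copy of `container` with probes copied from values.

    Hypothetical layout: values["probes"][name] may contain "liveness"
    and/or "readiness" mappings, which are used verbatim as the
    Kubernetes livenessProbe/readinessProbe bodies.
    """
    result = copy.deepcopy(container)
    probes = values.get("probes", {}).get(name, {})
    for key, field in (("liveness", "livenessProbe"),
                       ("readiness", "readinessProbe")):
        if key in probes:
            # Wholesale: no merging with defaults, the override is used as-is.
            result[field] = copy.deepcopy(probes[key])
    return result


if __name__ == "__main__":
    values = {"probes": {"api": {"readiness": {
        "httpGet": {"path": "/healthz", "port": 8080},
        "periodSeconds": 10}}}}
    print(apply_probes({"name": "api", "image": "example"}, values, "api"))
```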
Change-Id: I0862ae59c87b8c0c4e2412030b1801bceb3e3c99
Signed-off-by: Pete Birley <pete@port.direct>
This updates the Nagios chart to include an init container for
generating the host and host group definitions Nagios requires to
function. The benefit is that Nagios does not need to constantly
attempt to update its host and host group definitions, which
currently triggers a restart of the Nagios service even in cases
where the host file hasn't changed. With the introduction of an
init container to handle this, we can also remove the service
check definition and command definition that executed the plugin
at periodic intervals.
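The init container's job can be pictured with the Python sketch below, which renders Nagios `define host` and `define hostgroup` objects from a node list. Here the node list is passed in directly rather than discovered from the cluster. The function name, the `generic-host` template name, and the host group assignments are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

# (host name, IP address, host group names) for each node.
Node = Tuple[str, str, Sequence[str]]


def render_nagios_objects(nodes: List[Node],
                          template: str = "generic-host") -> str:
    """Render Nagios host and hostgroup definitions for `nodes`.

    `template` is a hypothetical host template assumed to supply the
    remaining required directives (check_period, contacts, etc.).
    Output is sorted so an unchanged node list renders identically.
    """
    lines: List[str] = []
    members: Dict[str, List[str]] = defaultdict(list)
    for name, address, groups in sorted(nodes):
        lines += ["define host {",
                  f"    use        {template}",
                  f"    host_name  {name}",
                  f"    alias      {name}",
                  f"    address    {address}",
                  "}"]
        for group in groups:
            members[group].append(name)
    for group in sorted(members):
        lines += ["define hostgroup {",
                  f"    hostgroup_name  {group}",
                  f"    members         {','.join(members[group])}",
                  "}"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_nagios_objects([("node2", "10.0.0.2", ["workers"]),
                                 ("node1", "10.0.0.1", ["workers"])]))
```

Sorting the output keeps the generated file stable across runs, which fits the stated goal of not restarting Nagios when the host definitions have not changed.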
Depends-On: https://review.opendev.org/668197
Change-Id: Id1d63d8c99850b960eb352361d7796162bd6be2f
Signed-off-by: Steve Wilkerson <sw5822@att.com>
This updates the Nagios image to the image that is built
out of openstack-helm-images instead of the image hosted on quay.
This new image includes the updated host definition plugin that
uses the Kubernetes Python client instead of Prometheus queries,
so the check_prometheus_hosts command has also been updated to
reflect the change in required arguments.
Change-Id: If3440ca9be3227fc48cd698a7d44501e6747bb1e
Signed-off-by: Steve Wilkerson <sw5822@att.com>
The LDAP overrides values file has been moved to
keystone/values_overrides [1]. This patch updates the reference.
[1]
cede6c0d48 (diff-89208df3c46570cf56141a9353ce27a7)
Change-Id: Ib03bb979dc681a647abd36df77f55fd82e0d4df6
This replaces the static OSD ID logic with a variable to fix an
issue in the current logic. The issue occurs only with file-backed
journals and block-backed data, as in the following example:
storage:
  osd:
    - data:
        type: block-logical
        location: /dev/vdb
      journal:
        type: directory
        location: /var/lib/openstack-helm/ceph/osd/journal-one
    - data:
        type: block-logical
        location: /dev/vdc
      journal:
        type: directory
        location: /var/lib/openstack-helm/ceph/osd/journal-two
Change-Id: I36d08b1b7aa5925831a64c03259098f6c4753c3e
This adjusts the helm test logic to let the deployment proceed if
at least 80% of the OSDs in the cluster are up and running.
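The threshold check can be expressed as below. The function name and the choice to treat a cluster reporting zero OSDs as not ready are assumptions. The up and total counts would come from the cluster's OSD status.

```python
def enough_osds_up(num_up: int, num_total: int, min_percent: int = 80) -> bool:
    """Return True when at least `min_percent` of the OSDs are up.

    Integer arithmetic avoids floating-point edge cases at the boundary
    (exactly 4 of 5 OSDs up is 80%, which passes).
    """
    if num_total <= 0:
        return False  # no OSDs reported: treat the cluster as not ready
    return num_up * 100 >= num_total * min_percent


if __name__ == "__main__":
    print(enough_osds_up(4, 5))  # True
    print(enough_osds_up(3, 5))  # False
```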
Change-Id: I128266fd374426f75928332690e275b7f0175318
The zuul_site_mirror_fqdn env variable may not be set, in which case
the whole job fails instead of simply skipping mirror configuration
during the image build. With this patch, if set_fact fails, mirrors
are simply not configured during the image build, as intended by
lines 62 and 88 of this playbook.
Change-Id: I049c696c7fb0d7cadb527a9f17dd01a42a671baa
Occasionally the default config results in failed attempts
to bind to IPv6, so we explicitly set the host to IPv4.
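The general technique is illustrated with Python sockets below. The affected service and its configuration are not named here, so this only shows the difference in principle. Binding explicitly to an IPv4 address with `AF_INET` never involves IPv6, whereas a default host that resolves to an IPv6 address such as `::` fails on hosts without working IPv6. The function name `bind_ipv4` is hypothetical.

```python
import socket


def bind_ipv4(host: str = "0.0.0.0", port: int = 0) -> socket.socket:
    """Create a TCP listening socket bound explicitly to IPv4.

    Port 0 asks the kernel for any free port. Using AF_INET with an
    IPv4 literal avoids resolving to an IPv6 address, which fails on
    hosts where IPv6 is unavailable.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))
    sock.listen()
    return sock


if __name__ == "__main__":
    s = bind_ipv4("127.0.0.1")
    print("listening on", s.getsockname())
    s.close()
```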
Change-Id: I3c01ed0ef7c84cf779d88386c14f7c7bd2003310
Signed-off-by: Pete Birley <pete@port.direct>
This PS updates the start script to use `config set`, rather
than `config-key set`, which has been deprecated in Mimic.
Change-Id: I97d0c4385b016d73aa362c0fc293d235b532810c
Signed-off-by: Pete Birley <pete@port.direct>
This removes the tests that query the Grafana API to check
whether the Prometheus data source has been provisioned and to
compare the number of active dashboards against the number of
expected dashboards determined via the chart's values.yaml.
The reason for removing these is that Grafana can be configured
to use data source types beyond just Prometheus, and additional
dashboards can be added to Grafana via the Grafana UI. Dashboards
added via the Grafana UI are persisted in the Grafana database,
which causes helm test failures during upgrade scenarios. Now
that the selenium tests executed as part of the Grafana helm tests
validate that Grafana is functional, these API tests add little
value.
Change-Id: I9f20ca28e9c840fb3f4fa0707a43c9419fafa2c1
Signed-off-by: Steve Wilkerson <sw5822@att.com>
This disables the Grafana analytics settings that check
grafana.com for plugin/dashboard updates every 10 minutes and
send anonymous usage statistics.
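In grafana.ini these settings live in the `[analytics]` section: `reporting_enabled` controls anonymous usage statistics and `check_for_updates` controls the periodic grafana.com version checks. The sketch below renders such a section with Python's configparser. How the chart maps its values.yaml onto grafana.ini is not shown here and may differ.

```python
import configparser
import io


def render_analytics_section() -> str:
    """Render a grafana.ini fragment with analytics disabled."""
    config = configparser.ConfigParser()
    config["analytics"] = {
        # Stop sending anonymous usage statistics.
        "reporting_enabled": "false",
        # Stop the periodic grafana.com checks for newer versions of
        # Grafana and installed plugins.
        "check_for_updates": "false",
    }
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    print(render_analytics_section())
```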
Change-Id: I0f5283a8a54b563199528bb612aa0cdc6cf238e2
Signed-off-by: Steve Wilkerson <sw5822@att.com>
This PS adds several fixes to the Selenium tests (for Kibana) and
adds a role for collecting the results.
Change-Id: If9fb5f50e395379fdd3ccc46e945a93606dcbabe
This updates the Fluentd clusterrole to allow getting
namespaces, as this is required for the Fluentd Kubernetes
plugin to function correctly.
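Expressed as a Kubernetes RBAC rule, the change amounts to granting `get` on `namespaces` in the core API group. The sketch below builds such a ClusterRole as a Python dict and checks it. The role name and the pods rule are assumptions included only for context, since the exact rule set of the chart is not shown here.

```python
from typing import Any, Dict, List


def fluentd_clusterrole(name: str = "fluentd") -> Dict[str, Any]:
    """Build a ClusterRole manifest (as a dict) for Fluentd.

    The role name and the pods rule are illustrative assumptions;
    the namespaces "get" permission is the one this change adds.
    """
    rules: List[Dict[str, Any]] = [
        # "" is the core API group, where pods and namespaces live.
        {"apiGroups": [""], "resources": ["pods"],
         "verbs": ["get", "list", "watch"]},
        {"apiGroups": [""], "resources": ["namespaces"],
         "verbs": ["get"]},
    ]
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "ClusterRole",
        "metadata": {"name": name},
        "rules": rules,
    }


def allows(role: Dict[str, Any], resource: str, verb: str) -> bool:
    """Check whether any rule grants `verb` on a core-group `resource`."""
    return any("" in r["apiGroups"] and resource in r["resources"]
               and verb in r["verbs"] for r in role["rules"])


if __name__ == "__main__":
    print(allows(fluentd_clusterrole(), "namespaces", "get"))  # True
```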
Change-Id: Id9d735310c53a922a62c6a82121edd332e7df724
Signed-off-by: Steve Wilkerson <sw5822@att.com>
This fixes the whitespace chomps for adding extra volumes and
volume mounts via values.yaml for the Fluentd chart, as too much
whitespace is currently removed and the extra volumes and mounts
are not added correctly.
Change-Id: I9cf67c3321339078ac795a7290f441b16cc41d41
Signed-off-by: Steve Wilkerson <sw5822@att.com>
This updates the helm version from 2.13.1 to 2.14.1
Change-Id: I619351d846253bf17caa922ad7f7b0ff19c778a2
Signed-off-by: Steve Wilkerson <sw5822@att.com>
This PS stores the applied helm values for releases in the gate.
Change-Id: I6563104ded6631b63d9fced775b9b9dba7fd00ef
Signed-off-by: Pete Birley <pete@port.direct>
This adds a conditional check on the deployment type of the
Fluentd chart to determine whether to enable the current liveness
and readiness probes or not. The current probes are designed
around using fluentd as an aggregator and do not function properly
when fluentd is deployed as a daemonset. When run as a daemonset
and configured to tail files via the tail input plugin, fluentd
will prioritize reading the entirety of those files before
processing other input types, including opening the forward source
socket required for the current probes to function correctly. This
results in scenarios where the current probes will fail when in
fact fluentd is functioning correctly.
DaemonSet-focused probes will follow in a later change once a
proper path forward has been determined.
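The conditional can be summarized with the sketch below. It assumes a values key naming the deployment type (called `deployment_type` here, a hypothetical name) that is either `Deployment` or `DaemonSet`. It also assumes the existing probes are TCP checks against the forward source socket on Fluentd's usual port 24224; neither detail is taken from the chart.

```python
from typing import Any, Dict, Optional

# Assumed default Fluentd forward port; not taken from the chart.
DEFAULT_FORWARD_PORT = 24224


def _forward_socket_probe(port: int) -> Dict[str, Any]:
    """TCP check against the forward source socket (assumed probe shape)."""
    return {"tcpSocket": {"port": port}}


def fluentd_probes(values: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return probe definitions, or None when they should be skipped.

    A DaemonSet tailing large files may not open the forward socket
    for some time, so probes depending on it are omitted there and
    only rendered when Fluentd runs as an aggregator.
    """
    if values.get("deployment_type") == "DaemonSet":
        return None
    port = values.get("forward_port", DEFAULT_FORWARD_PORT)
    return {"livenessProbe": _forward_socket_probe(port),
            "readinessProbe": _forward_socket_probe(port)}


if __name__ == "__main__":
    print(fluentd_probes({"deployment_type": "DaemonSet"}))  # None
```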
Change-Id: I8a164bd47ce1950e0bd6c5043713f4cde9f85d79
Signed-off-by: Steve Wilkerson <sw5822@att.com>
The root_conf area is used for host-specific configuration and is
overwritten in each iteration of the loop, which causes all hosts to
share the same properties. This change makes each host use its own
area in the loop.
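The bug follows a common pattern: one shared mapping is reused on every iteration, so every host ends up referencing the same object and the last write wins. The Python sketch below reproduces it alongside the fix. The function names and the per-host layout are simplified, hypothetical stand-ins, not the real configuration structure.

```python
import copy
from typing import Dict, List

Conf = Dict[str, str]


def build_buggy(hosts: List[str], base: Conf) -> Dict[str, Conf]:
    """Reuse one shared dict, so every host sees the last host's values."""
    root_conf = dict(base)
    per_host: Dict[str, Conf] = {}
    for host in hosts:
        root_conf["hostname"] = host  # overwritten on every iteration
        per_host[host] = root_conf    # every entry is the same object
    return per_host


def build_fixed(hosts: List[str], base: Conf) -> Dict[str, Conf]:
    """Give each host its own copy before applying host-specific values."""
    per_host: Dict[str, Conf] = {}
    for host in hosts:
        host_conf = copy.deepcopy(base)
        host_conf["hostname"] = host
        per_host[host] = host_conf
    return per_host


if __name__ == "__main__":
    hosts = ["node1", "node2"]
    print(build_buggy(hosts, {})["node1"]["hostname"])  # node2
    print(build_fixed(hosts, {})["node1"]["hostname"])  # node1
```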
Task: 34282
Story: 2005936
Change-Id: I0afb0b32ab80456aa3439b4221f2a95ca05ddf24
This PS updates the ingress controller configmap to be valid with
k8s schema validation turned on.
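One common reason a ConfigMap fails schema validation is that every value under `data` must be a string, so unquoted YAML numbers or booleans are rejected. Whether that was the specific problem here is an assumption. The sketch below shows such a check and a normalization; both function names are hypothetical.

```python
from typing import Any, Dict, List


def invalid_configmap_keys(data: Dict[str, Any]) -> List[str]:
    """List keys whose values are not strings (rejected by the schema)."""
    return sorted(k for k, v in data.items() if not isinstance(v, str))


def stringify_configmap_data(data: Dict[str, Any]) -> Dict[str, str]:
    """Convert scalar values to the string form a ConfigMap expects.

    Booleans are lower-cased to match YAML/JSON spelling ("true"),
    since str(True) in Python would give "True".
    """
    out: Dict[str, str] = {}
    for key, value in data.items():
        if isinstance(value, bool):
            out[key] = "true" if value else "false"
        else:
            out[key] = str(value)
    return out
```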
Change-Id: Ibbc82be62398ee63eb353aa58f1ebdf98e66b30d
Signed-off-by: Pete Birley <pete@port.direct>
This PS introduces a simpler way to incorporate overrides into gate
runs, and also ensures that they are scoped to a single chart rather
than to all of the charts deployed within a gate run.
Change-Id: Iba80f645f33c6d5847fbbb28ce66ee3d23e4fce8
Signed-off-by: Pete Birley <pete@port.direct>
This PS allows tmpfs to be used for etcd during the gates.
The assumption is that this will improve performance and
eliminate some unexplained issues.
Change-Id: Id68645b6535c9b1d87c133431b7cd6eb50fb030e
This removes the old fluent-logging chart from network
policy and replaces it with the new fluentbit and fluentd
charts. This returns the network policy gate to passing.
Change-Id: I060c6c3034fa798a131a053b9d496e5d8781c55d