With OSH now publishing charts with each change, there
needs to be a way to track the changes between chart versions.
This change adds a reno check job to publish release notes
based on the changes to each chart, by version, as a way to
track and document all the changes that are made to OSH-infra
and published to tarballs.o.o.
Change-Id: I5e6eccc4b34a891078ba816249795b2bf1921a62
This brings Grafana up to the current version and fixes the
Selenium helm test and gate test for the new login dashboard.
Change-Id: I0b65412f4689c763b3f035055ecbb4ca63c21048
The volume naming convention prefixes logical volume names with
ceph-lv-, ceph-db-, or ceph-wal-. The code that was added recently
to remove orphaned DB and WAL volumes does a string replacement of
"db" or "wal" with "lv" when searching for corresponding data
volumes. This causes DB volumes to get identified incorrectly as
orphans and removed when "db" appears in the PV UUID portion of
the volume name.
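For illustration, a minimal sketch of the failure mode, using a
made-up volume name and bash-style replacement (not the chart's
exact code):

    # DB volume whose PV UUID happens to contain the substring "db"
    db_volume="ceph-db-4cdb9a7a-9c13-4a27-b0a2-bb884173e776"

    # The orphan check swaps "db" for "lv" to derive the matching data volume
    data_volume="${db_volume//db/lv}"
    echo "${data_volume}"
    # -> ceph-lv-4clv9a7a-...  The "db" inside the UUID is replaced as well,
    #    so the real data volume (ceph-lv-4cdb9a7a-...) is never found and
    #    the DB volume is wrongly treated as an orphan and removed.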
Change-Id: I0c9477483b70c9ec844b37a6de10a50c0f2e1df8
This commit introduces the following helm test improvements for the
ceph-client chart:
1) Reworks the pg_validation function so that it allows some time for
peering PGs to finish peering, but fails if any other critical errors are
seen. The actual pg validation was split out into a function called
check_pgs(), and pg_validation now manages the looping aspects (see the
sketch after this list).
2) The check_cluster_status function now calls pg_validation if the
cluster status is not OK. This is very similar to what was happening
before, except that the logic is no longer duplicated.
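A rough sketch of the split described in 1), with illustrative
function bodies (the jq field path assumes a recent Ceph JSON layout
and is not the chart's verbatim code):

    check_pgs() {
      # Fail immediately on clearly unhealthy states, report peering so
      # the caller can retry, otherwise consider the PGs healthy.
      local states
      states=$(ceph pg ls -f json | jq -r '.pg_stats[].state')
      echo "${states}" | grep -qE 'down|incomplete|stale' && return 1
      echo "${states}" | grep -q 'peering' && return 2
      return 0
    }

    pg_validation() {
      local retries=0
      while true; do
        check_pgs; rc=$?
        [ ${rc} -eq 0 ] && return 0          # all PGs healthy
        [ ${rc} -eq 1 ] && return 1          # critical PG state seen, fail
        [ ${retries} -ge 10 ] && return 1    # peering never settled
        retries=$((retries + 1))
        sleep 30                             # give peering PGs time to finish
      done
    }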
Change-Id: I65906380817441bd2ff9ff9cfbf9586b6fdd2ba7
ClusterIssuer does not belong to a single namespace (unlike Issuer)
and can be referenced by Certificate resources from multiple different
namespaces. When internal TLS is added to multiple namespaces, the same
ClusterIssuer can be used instead of one Issuer per namespace.
Change-Id: I1576f486f30d693c4bc6b15e25c238d8004b4568
This change adds a cleanup mechanism for the archive directory via the
following steps:
1) Add archive_cleanup.sh under the /tmp directory
2) The script is triggered through start.sh
3) It runs every hour, checking the utilization of the archive directory
4) If utilization is above the threshold, it deletes the older half of
the files (sketched below)
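A simplified sketch of such a cleanup loop (the path and the 80%
threshold are illustrative, not the chart's actual defaults):

    ARCHIVE_DIR=${ARCHIVE_DIR:-/var/lib/archive}
    THRESHOLD=80

    while true; do
      used=$(df --output=pcent "${ARCHIVE_DIR}" | tail -1 | tr -dc '0-9')
      if [ "${used}" -gt "${THRESHOLD}" ]; then
        # Remove the older half of the archived files
        total=$(ls -1t "${ARCHIVE_DIR}" | wc -l)
        ls -1t "${ARCHIVE_DIR}" | tail -n "$((total / 2))" \
          | xargs -r -I{} rm -f "${ARCHIVE_DIR}/{}"
      fi
      sleep 3600   # check utilization once an hour
    done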
Change-Id: I918284b0aa5a698a6028b9807fcbf6559ef0ff45
Found another issue in disk_zap() where a needed update was missed when
https://review.opendev.org/c/openstack/openstack-helm-infra/+/745166
changed the logical volume naming convention.
The above patch set renamed volumes that followed the old convention,
so the old matching logic can never be correct and must be updated.
Also added logic to clean up orphaned DB/WAL volumes if they are
encountered and removed some cases where a data disk is marked as in use
when it isn't set up correctly.
Change-Id: I8deeecfdb69df1f855f287caab8385ee3d6869e0
For any host mounts that include /var/lib/kubelet, use HostToContainer
mountPropagation, which avoids creating extra references to mounts in
other containers.
Affects the following resources:
* ingress deployment
* openvswitch-vswitchd daemonset
Change-Id: I5964c595210af60d54158e6f7c962d5abe77fc2f
This PS addresses security best practices by running
containers as a non-privileged user and disallowing privilege
escalation. Ceph-client is used for the mgr and mds pods.
Change-Id: Idbd87408c17907eaae9c6398fbc942f203b51515
ADD: new snapshot policy template job which creates templates for
the ES SLM manager to snapshot indices instead of using curator.
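For reference, the kind of SLM policy such a template job might
create (the policy name, schedule, repository and index pattern
below are purely illustrative):

    curl -X PUT "http://elasticsearch:9200/_slm/policy/index-snapshots" \
      -H 'Content-Type: application/json' \
      -d '{
        "schedule": "0 30 1 * * ?",
        "name": "<snap-{now/d}>",
        "repository": "default_repo",
        "config": { "indices": ["logstash-*"] },
        "retention": { "expire_after": "30d" }
      }'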
Change-Id: I629d30691d6d3f77646bde7d4838056b117ce091
OSD logical volume names used to be based on the logical disk path,
e.g. /dev/sdb, but that has changed. The lvremove logic in disk_zap()
is still using the old naming convention. This change fixes that.
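Roughly, the removal now has to key off the current name prefixes
rather than a name derived from the disk path; an illustrative (not
verbatim) loop:

    # vg_name is assumed to be the volume group tied to the disk being zapped
    for lv in $(lvs --noheadings -o lv_name "${vg_name}" 2>/dev/null); do
      case "${lv}" in
        ceph-lv-*|ceph-db-*|ceph-wal-*)
          lvremove -y "/dev/${vg_name}/${lv}"
          ;;
      esac
    done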
Change-Id: If32ab354670166a3c844991de1744de63a508303
There are many race conditions possible when multiple ceph-osd
pods are initialized on the same host at the same time using
shared metadata disks. The locked() function was introduced a
while back to address these, but some commands weren't locked,
locked() was being called all over the place, and there was a file
descriptor leak in locked(). This change cleans that up by
maintaining a single, global file descriptor for the lock file
that is only opened and closed once, and also by aliasing all of
the commands that need to use locked() and removing explicit calls
to locked() everywhere.
The global_locked() function has also been removed as it isn't
needed when individual commands that interact with disks use
locked() properly.
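A minimal sketch of the pattern described above (the lock path, fd
number and aliased commands are illustrative, not the chart's actual
values):

    shopt -s expand_aliases
    LOCK_FILE=/var/lib/ceph/tmp/init-osd.lock

    # Open the lock file once on a global file descriptor
    exec 9>"${LOCK_FILE}"

    locked() {
      flock -w 600 9    # serialize disk operations across osd pods on the host
      "$@"
      local ret=$?
      flock -u 9        # release the lock but keep the fd open for reuse
      return ${ret}
    }

    # Alias every disk-touching command so call sites are locked implicitly
    alias ceph-volume='locked ceph-volume'
    alias pvcreate='locked pvcreate'
    alias vgcreate='locked vgcreate'
    alias lvcreate='locked lvcreate'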
Change-Id: I0018cf0b3a25bced44c57c40e33043579c42de7a
Add the fix for the openvswitch gate issue with systemd
237-3ubuntu10.43 to the multinode jobs as well, using the code from [0].
Additionally, made changes to support kubeadm version 1.18.9.
[0] https://review.opendev.org/c/openstack/openstack-helm-infra/+/763619
Change-Id: I2681feb1029e5535f3f278513e8aece821c715f1
This is to address zombie processes found in ceph-mon containers due
to the mon-check.sh monitoring script. With shareProcessNamespace the
/pause container will properly handle the defunct processes.
Change-Id: Ic111fd28b517f4c9b59ab23626753e9c73db1b1b
The build-chart playbook task that points each chart at helm-toolkit
has a find command that, when used with another repo, will
include the charts for osh-infra as well.
This change modifies the playbook to only modify the requirements
of charts in the repo being published.
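Illustrative only (not the playbook's actual task), the idea is to
anchor the find at the repo being published, e.g.:

    # REPO_DIR is assumed to point at the chart repo being published; each
    # chart's requirements.yaml is assumed to list only helm-toolkit.
    find "${REPO_DIR}" -maxdepth 2 -name requirements.yaml \
      | xargs -r sed -i 's#^\(\s*repository:\).*#\1 file://../helm-toolkit#'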
Change-Id: I493b4c64fe2525bac0acae06bd40c3896c918e20
With the current version of rabbitmq-exporter, data retrieval
sometimes fails with RabbitMQ timeout issues. The RabbitMQ timeout
threshold is set to 10 seconds and is not configurable in that
version. This updates the rabbitmq-exporter image to
kbudde/rabbitmq-exporter:v1.0.0-RC7.1
(default "RABBITMQ_TIMEOUT" of 30 seconds)
to resolve the timeout issues.
Change-Id: Ia51f368a1bba2b0fd9195cf9991b55864cdebfc1
This reverts commit 42f3b3eaf5a8794b1f247915fffbef68137e6c1c.
Reason for revert: Docker Hub now sets a hard limit on daily pulls, so let's switch back to using the opendev docker proxy.
Change-Id: I87e399c89d5736f39d7bdba2011655e5f5766180
This PS makes the RABBIT_TIMEOUT parameter configurable
with the kbudde/rabbitmq-exporter:v1.0.0-RC7.1 image.
Change-Id: I8faf8cd706863f65afb5137d93a7627d421270e9
The default, directory-based OSD configuration doesn't appear to work
correctly and isn't really being used by anyone. It has been commented
out and the comments have been enhanced to document the OSD config
better. With this change there is no default configuration anymore, so
the user must configure OSDs properly in their environment in
values.yaml in order to deploy OSDs using this chart.
Change-Id: I8caecf847ffc1fefe9cb1817d1d2b6d58b297f72
This change updates the fluentd chart to use HTK probe templates
to allow configuration via values overrides.
Change-Id: I97a3cc0832554a31146cd2b6d86deb77fd73db41
OSD failures during an update can cause degraded and misplaced
objects. The post-apply job restarts OSDs in failure domain
batches in order to accomplish the restarts efficiently. There is
already a wait for degraded objects to ensure that OSDs are not
restarted on degraded PGs, but misplaced objects could mean that
multiple object replicas exist in the same failure domain, so the
job should wait for those to recover as well before restarting
OSDs in order to avoid potential disruption under these failure
conditions.
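Conceptually, the extra wait looks something like the following (the
JSON field name assumes a recent Ceph release and is not the job's
verbatim code):

    wait_for_misplaced_objects() {
      while true; do
        misplaced=$(ceph -s -f json | jq -r '.pgmap.misplaced_objects // 0')
        [ "${misplaced}" -eq 0 ] && break
        echo "waiting for ${misplaced} misplaced objects to recover"
        sleep 30
      done
    }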
Change-Id: I39606e388a9a1d3a4e9c547de56aac4fc5606ea2
According to the get-values-overrides.sh script, the directory is
expected to be named values_overrides, not value_overrides.
Change-Id: I53744117af6962d51519bc1d96329129473d9970
A recent change to wait_for_pods() to allow for fault tolerance
appears to be causing wait_for_pgs() to fail and exit the post-
apply script prematurely in some cases. The existing
wait_for_degraded_objects() logic won't pass until pods and PGs
have recovered while the noout flag is set, so the pod and PG
waits can simply be removed.
Change-Id: I5fd7f422d710c18dee237c0ae97ae1a770606605