3051 Commits

Author SHA1 Message Date
Zuul
b21126fed1 Merge "Add elasticsearch ILM functionality" 2021-01-22 23:43:08 +00:00
Graham Steffaniak
c1241918c2 Add elasticsearch ILM functionality
Add functionality to delete indexes older than 14 days. The ILM API
handles deleting the indexes.
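
As a rough illustration of what such a policy can look like, the sketch below builds an ILM policy body with a delete phase at 14 days. The helper name `build_delete_policy` is hypothetical, and this is not the template shipped in the chart; it only shows the general shape of a body that could be sent to Elasticsearch's `_ilm/policy` endpoint.

```python
import json


def build_delete_policy(max_age_days: int = 14) -> dict:
    """Build an ILM policy body whose delete phase starts at max_age_days.

    Hypothetical helper for illustration; the chart's own template may differ.
    """
    return {
        "policy": {
            "phases": {
                "delete": {
                    # Indexes enter the delete phase once they reach this age.
                    "min_age": f"{max_age_days}d",
                    "actions": {"delete": {}},
                }
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_delete_policy(), indent=2))
```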

Change-Id: I22c02af78b6ce979d0c70b420c106917b0fc5a4e
2021-01-21 09:02:57 -06:00
Gage Hugo
2a1677a36a Add reno job to openstack-helm-infra repo
With OSH now publishing charts regularly with each change, there
needs to be a way to track the changes between chart versions.

This proposed change adds a reno check job to publish notes
based on the changes to each chart by version, as a way to
track and document all the changes that get made to OSH-infra
and published to tarballs.o.o.

Change-Id: I5e6eccc4b34a891078ba816249795b2bf1921a62
2021-01-21 14:36:59 +00:00
Zuul
1336da0c6f Merge "Update Grafana version" 2021-01-20 22:23:58 +00:00
Zuul
be1c673fba Merge "[ceph-osd] Fix a bug with DB orphan volume removal" 2021-01-19 22:35:58 +00:00
Meghan
0e66ef972a Update Grafana version
This brings the Grafana version up to the current version
and fixes the selenium helm and gate test for the new login
dashboard.

Change-Id: I0b65412f4689c763b3f035055ecbb4ca63c21048
2021-01-19 12:36:59 -08:00
Zuul
9f0b100f5e Merge "Improvements for ceph-client helm tests" 2021-01-19 18:29:49 +00:00
Stephen Taylor
b2c0028349 [ceph-osd] Fix a bug with DB orphan volume removal
The volume naming convention prefixes logical volume names with
ceph-lv-, ceph-db-, or ceph-wal-. The code that was added recently
to remove orphaned DB and WAL volumes does a string replacement of
"db" or "wal" with "lv" when searching for corresponding data
volumes. This causes DB volumes to get identified incorrectly as
orphans and removed when "db" appears in the PV UUID portion of
the volume name.
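
The failure mode described above can be sketched as follows. The chart's scripts are not Python, so this is only an illustration; the function names and the sample volume name are hypothetical.

```python
def data_volume_name_naive(name: str) -> str:
    """Buggy approach: replaces "db" anywhere in the name, including the UUID."""
    return name.replace("db", "lv")


def data_volume_name(name: str) -> str:
    """Swap only a leading ceph-db- or ceph-wal- prefix for ceph-lv-."""
    for prefix in ("ceph-db-", "ceph-wal-"):
        if name.startswith(prefix):
            return "ceph-lv-" + name[len(prefix):]
    return name


# Hypothetical DB volume whose UUID portion happens to contain "db".
db_volume = "ceph-db-1a2bdb3c-0000-4000-8000-000000000000"

if __name__ == "__main__":
    print(data_volume_name_naive(db_volume))  # UUID portion is altered too
    print(data_volume_name(db_volume))        # only the prefix changes
```

With the naive replacement, the computed data volume name no longer matches any real volume, so the DB volume looks like an orphan.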

Change-Id: I0c9477483b70c9ec844b37a6de10a50c0f2e1df8
2021-01-19 10:10:38 -07:00
Parsons, Cliff (cp769u)
970c23acf4 Improvements for ceph-client helm tests
This commit introduces the following helm test improvements for the
ceph-client chart:

1) Reworks the pg_validation function so that it allows some time for
peering PGs to finish peering, but fails if any other critical errors are
seen. The actual pg validation was split out into a function called
check_pgs(), and the pg_validation function manages the looping aspects.

2) The check_cluster_status function now calls pg_validation if the
cluster status is not OK. This is very similar to what was happening
before, except now, the logic will not be repeated.
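
The split between a one-shot check and a looping wrapper might look roughly like the sketch below. It is not the chart's shell implementation: the `check_pgs` callable, its three return values, and the retry parameters are assumptions made for illustration.

```python
import time
from typing import Callable


def pg_validation(check_pgs: Callable[[], str],
                  retries: int = 60,
                  delay: float = 3.0) -> None:
    """Retry while PGs are still peering; fail at once on a critical error.

    check_pgs is a hypothetical callable returning "ok", "peering", or "error".
    """
    for _ in range(retries):
        state = check_pgs()
        if state == "ok":
            return
        if state == "error":
            raise RuntimeError("critical PG error detected")
        # Still peering: give the cluster some time before checking again.
        time.sleep(delay)
    raise TimeoutError("PGs were still peering after all retries")
```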

Change-Id: I65906380817441bd2ff9ff9cfbf9586b6fdd2ba7
2021-01-18 16:12:33 +00:00
sgupta
f60c94fc16 feat(tls): Change Issuer to ClusterIssuer
ClusterIssuer does not belong to a single namespace (unlike Issuer)
and can be referenced by Certificate resources from multiple different
namespaces. When internal TLS is added to multiple namespaces, the same
ClusterIssuer can be used instead of one Issuer per namespace.

Change-Id: I1576f486f30d693c4bc6b15e25c238d8004b4568
2021-01-15 18:46:09 +00:00
Apurva Gokani
25aa369025 postgres archive cleanup script
This change adds a cleanup mechanism for the archive, as follows:
1) archive_cleanup.sh is added under the /tmp directory
2) start.sh triggers this script
3) The script runs every hour and checks utilization of the archive dir
4) If utilization is above the threshold, it deletes half of the old files
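
The steps above can be sketched as follows. This is not the archive_cleanup.sh script itself: the function names and the 80% threshold are hypothetical, and "half of old files" is interpreted here as the oldest half by modification time.

```python
from pathlib import Path
from typing import List


def oldest_half(files: List[Path]) -> List[Path]:
    """Return the older half of the files, ordered by modification time."""
    ordered = sorted(files, key=lambda p: p.stat().st_mtime)
    return ordered[: len(ordered) // 2]


def cleanup_archive(archive_dir: Path,
                    used_percent: float,
                    threshold: float = 80.0) -> List[Path]:
    """Delete the oldest half of the archive when usage exceeds the threshold.

    used_percent is supplied by the caller; the threshold is an assumed value.
    """
    if used_percent <= threshold:
        return []
    files = [p for p in archive_dir.iterdir() if p.is_file()]
    removed = oldest_half(files)
    for path in removed:
        path.unlink()
    return removed
```

A caller could compute `used_percent` with `shutil.disk_usage` and invoke `cleanup_archive` once an hour.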

Change-Id: I918284b0aa5a698a6028b9807fcbf6559ef0ff45
2021-01-14 16:21:14 +00:00
Zuul
204c51a669 Merge "Run as ceph user and disallow privilege escalation" 2021-01-12 20:09:50 +00:00
Zuul
6af7303516 Merge "Add elasticsearch snapshot policy template for SLM" 2021-01-12 18:08:08 +00:00
Stephen Taylor
4c097b0300 [ceph-osd] dmsetup remove logical devices using correct device names
Found another issue in disk_zap() where a needed update was missed when
https://review.opendev.org/c/openstack/openstack-helm-infra/+/745166
changed the logical volume naming convention.

The above patch set renamed volumes that followed the old convention,
so this logic will never be correct and must be updated.

Also added logic to clean up orphaned DB/WAL volumes if they are
encountered and removed some cases where a data disk is marked as in use
when it isn't set up correctly.

Change-Id: I8deeecfdb69df1f855f287caab8385ee3d6869e0
2021-01-11 14:49:43 -07:00
Phil Sphicas
f08d30df6b Use HostToContainer mountPropagation
For any host mounts that include /var/lib/kubelet, use HostToContainer
mountPropagation, which avoids creating extra references to mounts in
other containers.

Affects the following resources:
* ingress deployment
* openvswitch-vswitchd daemonset

Change-Id: I5964c595210af60d54158e6f7c962d5abe77fc2f
2021-01-07 20:29:24 +00:00
Zuul
96e002c64e Merge "Fix spacing inconsistencies with flags" 2021-01-06 20:44:41 +00:00
Smith, David (ds3330)
1934d32cdd Fix spacing inconsistencies with flags
Change-Id: I83676f62a4cfc7d8e20145a72f28eeab5ef4cc8d
2021-01-06 00:16:16 +00:00
jh629g
67618474ce Update default Kubernetes API for use with Helm v3
Updated the Kubernetes API from extensions/v1beta1 to
networking.k8s.io/v1beta1 per the docs[0] for the Kubernetes
1.16 deprecations, because helm v3 linting fails
when it parses extensions/v1beta1, as seen here[1].

[0] https://kubernetes.io/blog/2019/07/18/api-deprecations-in-1-16/
[1] https://zuul.opendev.org/t/openstack/build/82f92508fb31418aa377f91d62e0d42e

Change-Id: I0439272587a2afbccc4d7c49ef6ad053c8b305e7
2021-01-05 16:43:38 +00:00
Frank Ritchie
abf8d1bc6e Run as ceph user and disallow privilege escalation
This PS is to address security best practices concerning running
containers as a non-privileged user and disallowing privilege
escalation. Ceph-client is used for the mgr and mds pods.

Change-Id: Idbd87408c17907eaae9c6398fbc942f203b51515
2021-01-04 12:58:09 -05:00
Graham Steffaniak
fcb4681cb1 Add elasticsearch snapshot policy template for SLM
Add a new snapshot policy template job that creates templates for
ES SLM to snapshot indices instead of Curator.

Change-Id: I629d30691d6d3f77646bde7d4838056b117ce091
2020-12-29 15:55:53 +00:00
Zuul
3ded481794 Merge "Fix openvswitch gate issue for multinode" 2020-12-29 02:18:56 +00:00
jh629g
63f0bc364e Update hardcoded Google Resource URLs
The Kubernetes chart URLs hosted by Google are
deprecated. Updated them to the helm
repositories for Kubernetes charts per [0].

[0] https://helm.sh/blog/new-location-stable-incubator-charts/

Change-Id: I31f29d8576b3d7e8a5ac1d14faa26f0fd6ba77a1
2020-12-23 15:16:09 +00:00
Stephen Taylor
213596d71c [ceph-osd] Correct naming convention for logical volumes in disk_zap()
OSD logical volume names used to be based on the logical disk path,
i.e. /dev/sdb, but that has changed. The lvremove logic in disk_zap()
is still using the old naming convention. This change fixes that.

Change-Id: If32ab354670166a3c844991de1744de63a508303
2020-12-17 09:29:51 -07:00
Zuul
81f928544b Merge "[ceph-osd] Alias synchronized commands and fix descriptor leak" 2020-12-16 20:51:51 +00:00
Zuul
794ee8ae6e Merge "Elasticsearch: Update to 7.6.2 image" 2020-12-16 20:01:52 +00:00
Stephen Taylor
885285139e [ceph-osd] Alias synchronized commands and fix descriptor leak
There are many race conditions possible when multiple ceph-osd
pods are initialized on the same host at the same time using
shared metadata disks. The locked() function was introduced a
while back to address these, but some commands weren't locked,
locked() was being called all over the place, and there was a file
descriptor leak in locked(). This change cleans that up
by maintaining a single, global file descriptor for the lock file
that is only opened and closed once, and also by aliasing all of
the commands that need to use locked() and removing explicit calls
to locked() everywhere.

The global_locked() function has also been removed as it isn't
needed when individual commands that interact with disks use
locked() properly.

Change-Id: I0018cf0b3a25bced44c57c40e33043579c42de7a
2020-12-16 07:22:15 -07:00
Steven Fitzpatrick
6c05fee08d Elasticsearch: Update to 7.6.2 image
Change-Id: Ic0f5b6c802938ca91726210c43f81d2c73969575
2020-12-14 20:29:16 +00:00
Gupta, Sangeet (sg774j)
1f3fe0cb45 Fix openvswitch gate issue for multinode
Apply the fix for the openvswitch gate issue with systemd
237-3ubuntu10.43 to multinode as well, using the code from [0].
Additionally, made changes to support version 1.18.9 of kubeadm.

[0] https://review.opendev.org/c/openstack/openstack-helm-infra/+/763619

Change-Id: I2681feb1029e5535f3f278513e8aece821c715f1
2020-12-11 17:10:55 +00:00
Frank Ritchie
9b1ac0ffcb Enable shareProcessNamespace in mon daemonset
This is to address zombie processes found in ceph-mon containers due
to the mon-check.sh monitoring script. With shareProcessNamespace the
/pause container will properly handle the defunct processes.

Change-Id: Ic111fd28b517f4c9b59ab23626753e9c73db1b1b
2020-12-11 11:57:39 -05:00
Zuul
90a0fd7252 Merge "Collect dpkg -l for host" 2020-12-08 15:46:05 +00:00
Zuul
b952e99828 Merge "Update to container image repo k8s.gcr.io" 2020-12-07 23:37:31 +00:00
Zuul
a26891e5be Merge "[ceph-osd] Remove default OSD configuration" 2020-12-07 23:14:18 +00:00
Zuul
91df918e87 Merge "Update build-chart playbook" 2020-12-07 22:27:13 +00:00
Chris Wedgwood
82a828ce8d Update to container image repo k8s.gcr.io
gcr.io/google_containers/ no longer contains the image versions we
require, use the new location.

Change-Id: Iabb9e672e494f27d1a3691a9ce0dd2ccf10d5797
2020-12-07 19:34:09 +00:00
Zuul
e4683420d7 Merge "Revert "Don't use opendev docker proxy"" 2020-12-07 14:51:47 +00:00
Gage Hugo
4047ff0fd4 Update build-chart playbook
The build-chart playbook task to point each chart to helm-toolkit
has a find command that, when used with another repo, also
includes the charts for osh-infra.

This change modifies the playbook to only modify requirements
in charts in the repo being published.

Change-Id: I493b4c64fe2525bac0acae06bd40c3896c918e20
2020-12-04 17:31:18 -06:00
Gayathri Devi Kathiri
20d2aa1553 Update Rabbitmq exporter version
With the current version of rabbitmq-exporter,
data retrieval sometimes fails
with rabbitmq timeout issues.
The rabbitmq timeout threshold is set to 10 sec
and is not configurable in the current version.

Updating the rabbitmq-exporter version to
kbudde/rabbitmq-exporter:v1.0.0-RC7.1
(Default "RABBITMQ_TIMEOUT" set as 30 sec)
to solve rabbitmq timeout issues.

Change-Id: Ia51f368a1bba2b0fd9195cf9991b55864cdebfc1
2020-12-04 11:01:11 +00:00
Andrii Ostapenko
7be813374f Collect dpkg -l for host
Change-Id: I8886e2bacb74f95ac117aad07c831c5c3803d5c0
Signed-off-by: Andrii Ostapenko <andrii.ostapenko@att.com>
2020-12-03 15:21:29 -06:00
Zuul
9187633822 Merge "Rabbitmq-exporter: Add configurable RABBIT_TIMEOUT parameter" 2020-12-03 20:56:00 +00:00
Gage Hugo
7fdf282271 Revert "Don't use opendev docker proxy"
This reverts commit 42f3b3eaf5a8794b1f247915fffbef68137e6c1c.

Reason for revert: dockerhub now sets a hard limit on daily pulls, so let's switch back to using the opendev docker proxy.

Change-Id: I87e399c89d5736f39d7bdba2011655e5f5766180
2020-12-03 19:42:47 +00:00
Zuul
970ec5128a Merge "Make publish jobs more generic" 2020-12-02 19:51:46 +00:00
Gayathri Devi Kathiri
d7107a5c5c Rabbitmq-exporter: Add configurable RABBIT_TIMEOUT parameter
This PS makes the RABBIT_TIMEOUT parameter configurable
with the kbudde/rabbitmq-exporter:v1.0.0-RC7.1 version.

Change-Id: I8faf8cd706863f65afb5137d93a7627d421270e9
2020-12-02 16:42:49 +00:00
Zuul
59164428d3 Merge "Fluentd: Add Configurable Readiness and Liveness Probes" 2020-12-01 20:22:57 +00:00
Singh, Jasvinder (js581j)
ae96308ef1 [ceph-osd] Remove default OSD configuration
The default, directory-based OSD configuration doesn't appear to work
correctly and isn't really being used by anyone. It has been commented
out and the comments have been enhanced to document the OSD config
better. With this change there is no default configuration anymore, so
the user must configure OSDs properly in their environment in
values.yaml in order to deploy OSDs using this chart.

Change-Id: I8caecf847ffc1fefe9cb1817d1d2b6d58b297f72
2020-12-01 10:44:21 -07:00
Steven Fitzpatrick
29489acf39 Fluentd: Add Configurable Readiness and Liveness Probes
This change updates the fluentd chart to use HTK probe templates
to allow configuration by value overrides

Change-Id: I97a3cc0832554a31146cd2b6d86deb77fd73db41
2020-11-30 18:39:07 +00:00
Taylor, Stephen (st053q)
e37d1fc2ab [ceph-osd] Add a check for misplaced objects to the post-apply job
OSD failures during an update can cause degraded and misplaced
objects. The post-apply job restarts OSDs in failure domain
batches in order to accomplish the restarts efficiently. There is
already a wait for degraded objects to ensure that OSDs are not
restarted on degraded PGs, but misplaced objects could mean that
multiple object replicas exist in the same failure domain, so the
job should wait for those to recover as well before restarting
OSDs in order to avoid potential disruption under these failure
conditions.
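
A minimal sketch of the combined wait is shown below, assuming the caller can fetch the `pgmap` section of the cluster status as a dict. The key names `degraded_objects` and `misplaced_objects`, the `get_pgmap` callable, and the delay are assumptions for illustration, not the post-apply script's actual logic.

```python
import time
from typing import Callable


def objects_recovering(pgmap: dict) -> bool:
    """True while any degraded or misplaced objects are reported.

    The key names are assumed; missing keys are treated as zero.
    """
    degraded = pgmap.get("degraded_objects", 0)
    misplaced = pgmap.get("misplaced_objects", 0)
    return degraded > 0 or misplaced > 0


def wait_for_recovery(get_pgmap: Callable[[], dict],
                      delay: float = 3.0) -> None:
    """Block until neither degraded nor misplaced objects remain."""
    while objects_recovering(get_pgmap()):
        time.sleep(delay)
```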

Change-Id: I39606e388a9a1d3a4e9c547de56aac4fc5606ea2
2020-11-30 10:17:40 -07:00
Zuul
3205c8b778 Merge "Fix values_overrides directory naming" 2020-11-27 19:22:34 +00:00
Zuul
5600c76e0b Merge "Changing the kube version to 1.18.9" 2020-11-27 19:20:56 +00:00
MirgDenis
5f6adeca06 Fix values_overrides directory naming
According to the get-values-overrides.sh script, the directory is
expected to be named values_overrides, not value_overrides.

Change-Id: I53744117af6962d51519bc1d96329129473d9970
2020-11-27 10:59:20 +02:00
Taylor, Stephen (st053q)
791b0de5ee [ceph-osd] Fix post-apply job failure related to fault tolerance
A recent change to wait_for_pods() to allow for fault tolerance
appears to be causing wait_for_pgs() to fail and exit the post-
apply script prematurely in some cases. The existing
wait_for_degraded_objects() logic won't pass until pods and PGs
have recovered while the noout flag is set, so the pod and PG
waits can simply be removed.

Change-Id: I5fd7f422d710c18dee237c0ae97ae1a770606605
2020-11-24 06:30:37 -07:00