This change updates all of the charts that use Ceph images to use
new images based on the Squid 19.2.1 release.
Rook is also updated to 1.16.3 and is configured to deploy Ceph
19.2.1.
Change-Id: Ie2c0353a4bfa181873c98ce5de655c3388aa9574
This is the action item to implement the spec:
doc/source/specs/2025.1/chart_versioning.rst
Also add environment variables for values overrides:
- OSH_VALUES_OVERRIDES_PATH
- OSH_INFRA_VALUES_OVERRIDES_PATH
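For example (the paths shown are hypothetical):
    export OSH_VALUES_OVERRIDES_PATH=${OSH_PATH}/values_overrides
    export OSH_INFRA_VALUES_OVERRIDES_PATH=${OSH_INFRA_PATH}/values_overrides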
This commit temporarily disables all jobs that involve scripts
in the OSH git repo because they need to be updated to work
with the new values_overrides structure in the OSH-infra repo.
Once I4974785c904cf7c8730279854e3ad9b6b7c35498 is merged, all of
these disabled test jobs must be re-enabled.
Depends-On: I327103c18fc0e10e989a17f69b3bff9995c45eb4
Change-Id: I7bfdef3ea2128bbb4e26e3a00161fe30ce29b8e7
The Ceph defragosds cronjob script used to connect to OSD pods
without explicitly specifying the ceph-osd-default container, so it
sometimes ended up running the defrag script in the log-runner
container, where the script is mounted with 0644 permissions and the
shell fails to run it.
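Illustrative only (pod name and script path are hypothetical):
    # Run the defrag script in the ceph-osd-default container explicitly,
    # not in whatever container kubectl picks by default (e.g. log-runner).
    kubectl -n ceph exec ceph-osd-default-abc12 -c ceph-osd-default \
      -- /tmp/utils-defragOSDs.sh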
Change-Id: I4ffc6653070dbbc6f0766b278acf0ebe2b4ae1e1
Use quay.io/airshipit/kubernetes-entrypoint:latest-ubuntu_focal
by default instead of 1.0.0, which is a v1-format image that is
no longer supported by Docker.
Change-Id: I6349a57494ed8b1e3c4b618f5bd82705bef42f7a
This change updates the Ceph images to 18.2.2 images patched with a
fix for https://tracker.ceph.com/issues/63684. It also reverts the
package repository in the deployment scripts to use the debian-reef
directory on download.ceph.com instead of debian-18.2.1. The issue
with the repo that prompted the previous change to debian-18.2.1
has been resolved and the more generic debian-reef directory may
now be used again.
Change-Id: I85be0cfa73f752019fc3689887dbfd36cec3f6b2
This change converts the readiness and liveness probes in the Ceph
charts to use the functions from the Helm toolkit rather than
having hard-coded probe definitions. This allows probe configs to
be overridden in values.yaml without rebuilding charts.
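As a rough illustration (the daemonset/container keys below are
examples, not taken verbatim from the charts), probe settings can
then be tuned via a values override:
    # Hypothetical override file; adjust keys to the chart in question.
    tee /tmp/probes-override.yaml <<EOF
    pod:
      probes:
        osd:
          ceph-osd-default:
            liveness:
              enabled: true
              params:
                initialDelaySeconds: 120
                periodSeconds: 60
    EOF
    # e.g. helm upgrade --install ceph-osd ./ceph-osd -f /tmp/probes-override.yaml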
Change-Id: I68a01b518f12d33fe4f87f86494a5f4e19be982e
Sometimes errors appear in the 'ceph osd pool get' output before
the JSON string. The returned string is saved and is assumed to
contain only the JSON string with the pool properties. When errors
appear in the string, pool properties are not read properly, which
can cause pools to be misconfigured. This change filters that
output so only the expected JSON string is returned. It can then be
parsed correctly.
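A minimal sketch of the kind of filtering involved (the chart's
actual filter may differ):
    # Keep only the JSON object, discarding any error text printed
    # before it, then parse it normally.
    props=$(ceph osd pool get rbd all -f json 2>/dev/null | grep -o '{.*}')
    echo "${props}" | jq -r '.size'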
Change-Id: I83347cc32da7e7af160b5cacc2a99de74eebebc7
This change allows the target pg_num_min value (global for all
pools) to be overridden on a per-pool basis by specifying a
pg_num_min value in an individual pool's values. A global value for
all pools may not suffice in all cases.
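A sketch of such an override (key layout follows the ceph-client
pool values; details may differ):
    tee /tmp/pg-num-min-override.yaml <<EOF
    conf:
      pool:
        target:
          pg_num_min: 8          # global target for all pools
        spec:
          - name: rbd
            application: rbd
            replication: 3
            percent_total_data: 40
            pg_num_min: 16       # per-pool override of the global value
    EOF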
Change-Id: I42c55606d48975b40bbab9501289a7a59c15683f
This is simply to document the fact that mon_allow_pool_size_one
must be configured via cluster_commands in the ceph-client chart.
Adding it to ceph.conf via the conf values in the ceph-mon chart
doesn't seem to configure the mons effectively.
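For reference, the underlying command and a hedged values sketch
(the cluster_commands key layout is assumed here; check the chart's
values.yaml):
    # The Ceph command the cluster_commands entry needs to run:
    ceph config set mon mon_allow_pool_size_one true
    # Hypothetical override sketch:
    # conf:
    #   features:
    #     cluster_commands:
    #       - config set mon mon_allow_pool_size_one true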
Change-Id: Ic7e9a0eade9c0b4028ec232ff7ad574b8574615d
This change updates all Ceph image references to use Focal images
for all charts in openstack-helm-infra.
Change-Id: I759d3bdcf1ff332413e14e367d702c3b4ec0de44
The Pacific release of Ceph disabled 1x replication by default, and
some of the gate scripts have not been updated to allow this explicitly.
Some gate jobs fail in some configurations as a result, so this
change adds 'mon_allow_pool_size_one = true' to those Ceph gate
scripts that don't already have it, along with
--yes-i-really-mean-it added to commands that set pool size.
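The commands/settings involved look roughly like this (sketch; the
gate scripts set the option via ceph.conf):
    # Allow pools with a single replica, then set size 1 explicitly.
    ceph config set mon mon_allow_pool_size_one true
    ceph osd pool set rbd size 1 --yes-i-really-mean-it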
Change-Id: I5fb08d3bb714f1b67294bb01e17e8a5c1ddbb73a
This change adjusts the minimum OSD count check to be based on the
osd value, and the maximum OSD count check to be based on the
final_osd value. This logic supports both full deployments and
partial deployments, with the caveat that it may allow partial
deployments to over-provision storage.
Change-Id: I93aac65df850e686f92347d406cd5bb5a803659d
The target OSD count and the final target OSD count may differ in
cases where a deployment may not include all of the hardware it is
expected to include eventually. This change corrects the check for
more OSDs running than expected to be based on the final OSD count
rather than the intermediate one to avoid false failures when the
intermediate target is exceeded and the final target is not.
Change-Id: I03a13cfe3b9053b6abc5d961426e7a8e92743808
The Ceph Pacific release added a noautoscale flag to enable and
disable the PG autoscaler for all pools globally. This change uses
that flag to enable and disable the autoscaler when the Ceph major
version is 16 or later.
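A rough sketch of the logic (CEPH_MAJOR is a hypothetical variable,
assumed to be derived elsewhere, e.g. from "ceph version"):
    # On Pacific (16) and later, toggle the autoscaler globally.
    if [ "${CEPH_MAJOR:-16}" -ge 16 ]; then
      ceph osd pool set noautoscale        # disable autoscaler for all pools
      # ceph osd pool unset noautoscale    # ...and re-enable it
    fi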
Change-Id: Iaa3f2d238850eb413f26b82d75b5f6835980877f
Based on spec in openstack-helm repo,
support-OCI-image-registry-with-authentication-turned-on.rst
Each Helm chart can configure an OCI image registry and the
credentials to use. A Kubernetes Secret is then created with this
information, and ServiceAccounts reference it via imagePullSecrets.
Any pod using one of these ServiceAccounts can then pull images from
an authenticated container registry.
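To illustrate only the underlying Kubernetes mechanism (the charts
render this via templates; names here are examples):
    # Create a docker-registry Secret holding the registry credentials.
    kubectl -n ceph create secret docker-registry ceph-oci-registry-key \
      --docker-server=registry.example.com \
      --docker-username=deployer --docker-password=secret
    # Reference it from a ServiceAccount so its pods can pull images.
    kubectl -n ceph patch serviceaccount ceph-mon \
      -p '{"imagePullSecrets": [{"name": "ceph-oci-registry-key"}]}'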
Change-Id: Iebda4c7a861aa13db921328776b20c14ba346269
The major reason for the addition of this feature is to facilitate
an upgrade to the Pacific Ceph release, which now requires the
require-osd-release flag to be set to the proper release in order
to avoid a cluster warning scenario. Any Ceph command can be run
against the cluster using this feature, however.
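For example, the command this feature was primarily added to run
(shown directly here for illustration):
    # Required after upgrading to Pacific to clear the cluster warning.
    ceph osd require-osd-release pacific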
Change-Id: I194264c420cfda8453c139ca2b737e56c63ef269
The mon version check in the rbd-pool job can cause the script to
error and abort if there are multiple mon versions present in the
Ceph cluster. This change chooses the lowest-numbered major version
from the available mon versions when performing the version check
since the check is performed in order to determine the right way to
parse JSON output from a mon query.
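A hedged sketch of how the lowest mon major version can be selected
(the job's actual parsing may differ):
    # "ceph mon versions" returns a JSON object keyed by version strings,
    # e.g. {"ceph version 14.2.22 (...) nautilus (stable)": 3}.
    LOWEST_MON_MAJOR=$(ceph mon versions -f json | jq -r 'keys[]' \
      | awk '{print $3}' | cut -d. -f1 | sort -n | head -1)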
Change-Id: I51cc6d1de0034affdc0cc616298c2d2cd3476dbb
Currently if multiple instances of the ceph-client chart are
deployed in the same Kubernetes cluster, the releases will
conflict because the clusterrole-checkdns ClusterRole is a global
resource and has a hard-coded name. This change scopes the
ClusterRole name by release name to address this.
Change-Id: I17d04720ca301f643f6fb9cf5a9b2eec965ef537
A Ceph cluster needs only one active manager to function properly.
This PS converts the ceph-client-tests rules related to the ceph-mgr
deployment from errors into warnings when the number of standby mgrs
is less than expected.
Change-Id: I53c83c872b95da645da69eabf0864daff842bbd1
This is a code improvement to reuse the Ceph monitor discovery
function in different templates. Calling the above-mentioned function
from a single place (helm-infra snippets) allows less code
maintenance and simplifies further development.
Rev. 0.1 Charts version bump for ceph-client, ceph-mon, ceph-osd,
ceph-provisioners and helm-toolkit
Rev. 0.2 Mon endpoint discovery functionality added for
the rados gateway. ClusterRole and ClusterRoleBinding added.
Rev. 0.3 checkdns is allowed to correct ceph.conf for RGW deployment.
Rev. 0.4 Added RoleBinding to the deployment-rgw.
Rev. 0.5 Remove _namespace-client-ceph-config-manager.sh.tpl and
the appropriate job, because of duplicated functionality.
Related configuration has been removed.
Rev. 0.6 RoleBinding logic has been changed to meet rules:
checkdns namespace - HAS ACCESS -> RGW namespace(s)
Change-Id: Ie0af212bdcbbc3aa53335689deed9b226e5d4d89
This change moves the ceph-mgr deployment from the ceph-client
chart to the ceph-mon chart. Its purpose is to facilitate the
proper Ceph upgrade procedure, which prescribes restarting mgr
daemons before mon daemons.
There will be additional work required to implement the correct
daemon restart procedure for upgrades. This change only addresses
the move of the ceph-mgr deployment.
Change-Id: I3ac4a75f776760425c88a0ba1edae5fb339f128d
This change updates the ceph.conf update job as follows:
* renames it to "ceph-ns-client-ceph-config"
* consolidates some Roles and RoleBindings
This change also moves the logic of figuring out the mon_host addresses
from the kubernetes endpoint object to a snippet, which is used by the
various bash scripts that need it.
In particular, this logic is added to the rbd-pool job, so that it does
not depend on the ceph-ns-client-ceph-config job.
Note that the ceph.conf update job has a race with several other jobs
and pods that mount ceph.conf from the ceph-client-etc configmap while
it is being modified. Depending on the restartPolicy, pods (such as the
one created for the ceph-rbd-pool job) may linger in StartError state.
This is not addressed here.
Change-Id: Id4fdbfa9cdfb448eb7bc6b71ac4c67010f34fc2c
This change fixes two issues with the recently introduced [0] job that
updates "ceph.conf" inside ceph-client-etc configmap with a discovered
mon_host value:
1. adds missing metadata.labels to the job
2. allows the job to be disabled
(fixes rendering when manifests.job_ns_client_ceph_config = false)
0: https://review.opendev.org/c/openstack/openstack-helm-infra/+/812159
Change-Id: I3a8f1878df4af5da52d3b88ca35ba0b97deb4c35
As Ceph clients expect the mon_host config shown below for Ceph
Nautilus and later releases, this change updates the ceph-client-etc
configmap to reflect the correct mon endpoint specification.
mon_host = [v1:172.29.1.139:6789/0,v2:172.29.1.139:3300/0],
[v1:172.29.1.140:6789/0,v2:172.29.1.140:3300/0],
[v1:172.29.1.145:6789/0,v2:172.29.1.145:3300/0]
Change-Id: Ic3a1cb7e56317a5a5da46f3bf97ee23ece36c99c
In cases where the pool deletion feature [0] is used but the pool
does not exist, a pool is created and then subsequently deleted.
This was broken by the performance optimizations introduced with [1], as
the job is trying to delete a pool that does not exist (yet).
This change makes the ceph-rbd-pool job wait for manage_pools to finish
before trying to delete the pool.
0: https://review.opendev.org/c/792851
1: https://review.opendev.org/c/806443
Change-Id: Ibb77e33bed834be25ec7fd215bc448e62075f52a
This change updates the helm-toolkit path in each chart as part
of the move to Helm v3, since helm serve is no longer available.
Change-Id: I011e282616bf0b5a5c72c1db185c70d8c721695e
This change attempts to reduce the number of Ceph commands required
in the ceph-rbd-pool job by collecting most pool properties in a
single call and by setting only those properties where the current
value differs from the target value.
Calls to manage_pool() are also run in the background in parallel,
so all pools are configured concurrently instead of serially. The
script waits for all of those calls to complete before proceeding
in order to avoid issues related to the script finishing before all
pools are completely configured.
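A simplified sketch of the parallelization pattern (function and
variable names are illustrative):
    # Configure every pool concurrently, then wait for all of them so
    # the script cannot exit before configuration is complete.
    for pool_name in ${POOL_NAMES}; do
      manage_pool "${pool_name}" &
    done
    wait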
Change-Id: If105cd7146313ab9074eedc09580671a0eafcec5
If labels are not specified on a Job, kubernetes defaults them
to include the labels of the underlying Pod template. Helm 3
injects metadata into all resources [0], including an
`app.kubernetes.io/managed-by: Helm` label. As a result, when
kubernetes sees a Job's labels they are no longer empty and do not
get defaulted to the underlying Pod template's labels. This is a
problem since Job labels are depended on by
- Armada pre-upgrade delete hooks
- Armada wait logic configurations
- kubernetes-entrypoint dependencies
Thus for each Job template this adds labels matching the
underlying Pod template to retain the same labels that were
present with Helm 2.
[0]: https://github.com/helm/helm/pull/7649
Change-Id: I3b6b25fcc6a1af4d56f3e2b335615074e2f04b6d
Currently, if the existing pg_num_min is less than the value specified
in values.yaml or in overrides, no change to pg_num_min is made during
updates even though the value should be increased. This PS ensures the
proper value is always set.
Change-Id: I79004506b66f2084402af59f9f41cda49a929794
The checkDNS script which is run inside the ceph-mon pods has had
a bug for a while now. If a value of "up" is passed in, it adds
brackets around it, but then doesn't check for the brackets when
checking for a value of "up". This causes a value of "{up}" to be
written into the ceph.conf for the mon_host line and that causes
the mon_host to not be able to respond to ceph/rbd commands. It's
normally not a problem if DNS is working, but if DNS stops working
this can happen.
This patch changes the comparison to look for "{up}" instead of
"up" in three different files, which should fix the problem.
Change-Id: I89cf07b28ad8e0e529646977a0a36dd2df48966d
This change configures Ceph daemon pods so that
/var/lib/ceph/crash maps to a hostPath location that persists
when the pod restarts. This will allow for post-mortem examination
of crash dumps to attempt to understand why daemons have crashed.
Change-Id: I53277848f79a405b0809e0e3f19d90bbb80f3df8
This will ease mirroring capabilities for the docker official images.
Signed-off-by: Thiago Brito <thiago.brito@windriver.com>
Change-Id: I0f9177b0b83e4fad599ae0c3f3820202bf1d450d
Two new values, "delete" and "delete_all_pool_data," have been
added to the Ceph pool spec to allow existing pools to be deleted
in a brownfield deployment. For deployments where a pool does not
exist, either for greenfield or because it has been deleted
previously, the pool will be created and then deleted in a single
step.
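A sketch of a pool spec using the new values (layout follows the
ceph-client pool spec; treat details as illustrative):
    tee /tmp/delete-pool-override.yaml <<EOF
    conf:
      pool:
        spec:
          - name: unused-pool
            application: rbd
            replication: 3
            percent_total_data: 0
            delete: true
            delete_all_pool_data: true
    EOF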
Change-Id: Ic22acf02ae2e02e03b834e187d8a6a1fa58249e7
A new value "rename" has been added to the Ceph pool spec to allow
pools to be renamed in a brownfield deployment. For greenfield the
pool will be created and renamed in a single deployment step, and
for a brownfield deployment in which the pool has already been
renamed previously no changes will be made to pool names.
Change-Id: I3fba88d2f94e1c7102af91f18343346a72872fde
The current pool init job only allows PGs to be found in the
"peering" or "activating" (or active) states, but it should also
allow the other states that can occur while the PG autoscaler is
running ("unknown", "creating" and "recover"). The helm test already
allows these states, so the pool init job is being changed to allow
them as well for consistency.
Change-Id: Ib2c19a459c6a30988e3348f8d073413ed687f98b
This patchset makes the current ceph-client helm test more specific
about checking each of the PGs that are transitioning through inactive
states during the test. If any single PG spends more than 30 seconds in
any of these inactive states (peering, activating, creating, unknown,
etc), then the test will fail.
Also, once the three-minute PG checking period has expired, we will
no longer fail the helm test, as it is very possible that the
autoscaler could still be adjusting the PGs for several minutes after
a deployment is done.
Change-Id: I7f3209b7b3399feb7bec7598e6e88d7680f825c4
This patchset will add the capability to configure the
Ceph RBD pool job to leave failed pods behind for debugging
purposes, if it is desired. Default is to not leave them
behind, which is the current behavior.
Change-Id: Ife63b73f89996d59b75ec617129818068b060d1c
This patch resolves a helm test problem where the test was failing
if it found a PG state of "activating". It could also potentially
find a number of other states, like premerge or unknown, that
could also fail the test. Note that if these transient PG states are
found for more than 3 minutes, the helm test fails.
Change-Id: I071bcfedf7e4079e085c2f72d2fbab3adc0b027c
When autoscaling is disabled after pools are created, there is an
opportunity for some autoscaling to take place before autoscaling
is disabled. This change checks to see if autoscaling needs to be
disabled before creating pools, then checks to see if it needs to
be enabled after creating pools. This ensures that autoscaling
won't happen when autoscaler is disabled and autoscaling won't
start prematurely as pools are being created when it is enabled.
Change-Id: I8803b799b51735ecd3a4878d62be45ec50bbbe19
The autoscaler was introduced in the Nautilus release. This
change only sets the pg_num value for a pool if the autoscaler
is disabled or the Ceph release is earlier than Nautilus.
When pools are created with the autoscaler enabled, a pg_num_min
value specifies the minimum value of pg_num that the autoscaler
will target. That default was recently changed from 8 to 32,
which severely limits the number of pools in a small cluster, per
https://github.com/rook/rook/issues/5091. This change overrides
the default pg_num_min value of 32 with a value of 8 (matching
the default pg_num value of 8) using the optional --pg-num-min
<value> argument at pool creation and the pg_num_min property for
existing pools.
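For reference, the underlying Ceph commands (shown directly here for
illustration):
    # New pools: request the smaller autoscaler floor at creation time.
    ceph osd pool create rbd 8 --pg-num-min 8
    # Existing pools: lower the autoscaler floor explicitly.
    ceph osd pool set rbd pg_num_min 8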
Change-Id: Ie08fb367ec8b1803fcc6e8cd22dc8da43c90e5c4
Currently pool quotas and pg_num calculations are both based on
percent_total_data values. This can be problematic when the amount
of data allowed in a pool doesn't necessarily match the percentage
of the cluster's data expected to be stored in the pool. It is
also more intuitive to define absolute quotas for pools.
This change adds an optional pool_quota value that defines an
explicit value in bytes to be used as a pool quota. If pool_quota
is omitted for a given pool, that pool's quota is set to 0 (no
quota).
A check_pool_quota_target() Helm test has also been added to
verify that the sum of all pool quotas does not exceed the target
quota defined for the cluster if present.
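For reference, pool quotas correspond to the following Ceph commands
(shown directly here for illustration):
    # Explicit quota in bytes for a pool (100 GiB here).
    ceph osd pool set-quota rbd max_bytes 107374182400
    # pool_quota omitted -> quota of 0, i.e. no quota.
    ceph osd pool set-quota rbd max_bytes 0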
Change-Id: I959fb9e95d8f1e03c36e44aba57c552a315867d0
This reverts commit 910ed906d0df247f826ad527211bc86382e16eaa.
Reason for revert: May be causing upstream multinode gates to fail.
Change-Id: I1ea7349f5821b549d7c9ea88ef0089821eff3ddf