
When the rook-ceph backend is configured, the system host-stor-delete
command does not completely remove the OSD from the cluster. The OSD is
deleted from the database but remains present in the cluster, so it is
never fully decommissioned, which could lead to inconsistencies or
resource issues. In addition, fully decommissioning an OSD requires
wiping its disk, but the rpcapi cannot execute the wipe if the host is
offline or there is a network issue between the hosts. Also, if the
host is reinstalled after the disk was added to the storage backend,
the disk is wiped during kickstart, which results in the
rook-ceph-operator trying to create a new OSD on the same mount.

The way the disks to be removed are obtained has also changed: when
the "host-stor-delete" command runs, the stor is set to the
"deleting-with-app" state, and the lifecycle then collects and removes
all OSDs in this state.

The removal uses an architecture based on Kubernetes jobs (triggered
on lifecycle post-apply) that execute a script to fully remove the
OSDs from the cluster and also trigger the job that wipes the disks
related to those OSDs. The lifecycle uses the new yaml templates
folder to create the k8s resources needed to perform this operation.

All the places that used rpcapi to prepare the disks were replaced
with the ceph wipe-disks job (the same one used by the remove-OSDs
job):
- pre-apply: wipes all disks that will be configured during the app
  apply
- post-apply: triggered by the remove-osds-job to fully delete the OSD
- pre-remove: runs when the ceph cluster is being cleaned up

The sync-osds-job now only checks whether the deployment OSD count is
less than or equal to the database OSD count, to avoid the jobs
getting stuck during an OSD removal operation.

Test Plan:
- PASS: Remove an OSD from the cluster using host-stor-delete
- PASS: Change the min replication factor and try to remove the OSD
- PASS: Change the deployment_model and redo the tests
- PASS: Add data to an OSD and check that the data was redistributed
- PASS: Check that the OSD was removed from the database
- PASS: Check that expanding an existing cluster was not impacted by
  the reduction change
- PASS: Install Rook-Ceph and abort the apply in the middle, then try
  to apply again
- PASS: Add data to the already configured OSDs and try to reapply;
  verify that no data was lost
- PASS: Add an OSD with host-stor-add and then immediately reinstall
  the host that houses the OSD
- PASS: Add an OSD with host-stor-add, apply rook-ceph, and then
  immediately reinstall the host that houses the OSD

Depends-On: https://review.opendev.org/c/starlingx/config/+/937730

Closes-Bug: 2093897
Change-Id: I969f891235b2b7fa6ba0a927a4a8e3419299ecb2
Signed-off-by: Erickson Silva de Oliveira <Erickson.SilvadeOliveira@windriver.com>
Signed-off-by: Gabriel Przybysz Gonçalves Júnior <gabriel.przybyszgoncalvesjunior@windriver.com>
Signed-off-by: Gustavo Ornaghi Antunes <gustavo.ornaghiantunes@windriver.com>
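
As a rough illustration of the flow described above, the sketch below
shows how a lifecycle post-apply hook could collect the stors left in
the "deleting-with-app" state, render a Job manifest from the packaged
yaml templates, and apply the relaxed sync-osds count check. The
helper and template names (istor_get_all, remove-osds-job.yaml,
OSD_IDS) are assumptions for the sketch, not the actual implementation
of this change.

# Minimal sketch only: helper and template names below are assumed,
# not taken from the real k8sapp_rook_ceph lifecycle code.
import os

import yaml
from kubernetes import client, config

TEMPLATE_DIR = os.path.join(os.path.dirname(__file__),
                            "lifecycle", "templates")
STATE_DELETING_WITH_APP = "deleting-with-app"


def get_stors_to_remove(dbapi):
    # Collect every stor that host-stor-delete left in the
    # "deleting-with-app" state (dbapi call name is an assumption).
    return [s for s in dbapi.istor_get_all()
            if s.state == STATE_DELETING_WITH_APP]


def create_remove_osds_job(osd_ids, namespace="rook-ceph"):
    # Render the remove-OSDs Job manifest from the packaged yaml
    # templates and submit it; the job's script later triggers the
    # wipe-disks job for the affected disks.
    with open(os.path.join(TEMPLATE_DIR, "remove-osds-job.yaml")) as f:
        manifest = yaml.safe_load(f)

    # Pass the OSD ids to the job's script through an environment
    # variable (assumed template layout: one container in the pod spec).
    container = manifest["spec"]["template"]["spec"]["containers"][0]
    container.setdefault("env", []).append(
        {"name": "OSD_IDS", "value": ",".join(str(i) for i in osd_ids)})

    config.load_incluster_config()
    client.BatchV1Api().create_namespaced_job(namespace=namespace,
                                              body=manifest)


def osds_in_sync(deployment_osd_count, database_osd_count):
    # Relaxed sync-osds check: fewer OSDs deployed than recorded in the
    # database is acceptable while a removal is in progress.
    return deployment_osd_count <= database_osd_count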
#
# Copyright (c) 2024-2025 Wind River Systems, Inc.
#
# SPDX-License-Identifier: Apache-2.0
#

import setuptools

setuptools.setup(
    setup_requires=['pbr>=0.5'],
    pbr=True,
    include_package_data=True,
    package_data={
        'k8sapp_rook_ceph': ['lifecycle/templates/*.yaml']
    },)
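
The include_package_data and package_data entries above are what
bundle the new lifecycle yaml templates into the installed package. As
a hedged illustration (the load_template helper is hypothetical, not
part of this change), a bundled template could be read at runtime like
this:

# Hypothetical helper: read a packaged lifecycle template by name.
from importlib import resources

import yaml


def load_template(name):
    path = resources.files("k8sapp_rook_ceph") / "lifecycle" / "templates" / name
    return yaml.safe_load(path.read_text())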