
When the rook-ceph backend is configured, the system host-stor-delete
command does not completely remove the OSD from the cluster. The OSD is
deleted from the database but remains present in the cluster, so it is
never fully decommissioned, which could lead to inconsistencies or
resource issues. In addition, fully decommissioning an OSD requires
wiping its disk, but the rpcapi cannot execute the wipe if the host is
offline or there is a network issue between the hosts. Also, if the
host is reinstalled after the disk was added to the storage backend,
the disk is wiped during kickstart, which results in the
rook-ceph-operator trying to create a new OSD on the same mount.

The way the disks to be removed are obtained has also changed: when
the "host-stor-delete" command runs, the stor is set to the
"deleting-with-app" state, and the lifecycle then collects and removes
all OSDs in this state.

The removal uses an architecture based on Kubernetes jobs (triggered
on lifecycle post-apply) that execute a script to fully remove the
OSDs from the cluster and also trigger the job that wipes the disks
related to those OSDs. The lifecycle uses the new yaml templates
folder to create the k8s resources needed to perform this operation.

All the places that used rpcapi to prepare the disks were replaced
with the ceph wipe-disks job (the same one used by the remove-OSDs
job):
- pre-apply: wipes all disks that will be configured during the app
  apply
- post-apply: triggered by the remove-osds-job to fully delete the OSD
- pre-remove: runs when the ceph cluster is being cleaned up

The sync-osds-job now only checks whether the deployment OSD count is
less than or equal to the database OSD count, to avoid the jobs
getting stuck during an OSD removal operation.

Test Plan:
- PASS: Remove an OSD from the cluster using host-stor-delete
- PASS: Change the min replication factor and try to remove the OSD
- PASS: Change the deployment_model and redo the tests
- PASS: Add data to an OSD and check that the data was redistributed
- PASS: Check that the OSD was removed from the database
- PASS: Check that expanding an existing cluster was not impacted by
  the reduction change
- PASS: Install Rook-Ceph and abort the apply in the middle, then try
  to apply again
- PASS: Add data to the already configured OSDs and try to reapply;
  verify that no data was lost
- PASS: Add an OSD with host-stor-add and then immediately reinstall
  the host that houses the OSD
- PASS: Add an OSD with host-stor-add, apply rook-ceph, and then
  immediately reinstall the host that houses the OSD

Depends-On: https://review.opendev.org/c/starlingx/config/+/937730

Closes-Bug: 2093897
Change-Id: I969f891235b2b7fa6ba0a927a4a8e3419299ecb2
Signed-off-by: Erickson Silva de Oliveira <Erickson.SilvadeOliveira@windriver.com>
Signed-off-by: Gabriel Przybysz Gonçalves Júnior <gabriel.przybyszgoncalvesjunior@windriver.com>
Signed-off-by: Gustavo Ornaghi Antunes <gustavo.ornaghiantunes@windriver.com>
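
As a rough illustration of the flow described above, the sketch below
shows how a lifecycle post-apply hook could collect the stors left in
the "deleting-with-app" state, render a Job manifest from the packaged
yaml templates, and apply the relaxed sync-osds count check. The
helper and template names (istor_get_all, remove-osds-job.yaml,
OSD_IDS) are assumptions for the sketch, not the actual implementation
of this change.

# Minimal sketch only: helper and template names below are assumed,
# not taken from the real k8sapp_rook_ceph lifecycle code.
import os

import yaml
from kubernetes import client, config

TEMPLATE_DIR = os.path.join(os.path.dirname(__file__),
                            "lifecycle", "templates")
STATE_DELETING_WITH_APP = "deleting-with-app"


def get_stors_to_remove(dbapi):
    # Collect every stor that host-stor-delete left in the
    # "deleting-with-app" state (dbapi call name is an assumption).
    return [s for s in dbapi.istor_get_all()
            if s.state == STATE_DELETING_WITH_APP]


def create_remove_osds_job(osd_ids, namespace="rook-ceph"):
    # Render the remove-OSDs Job manifest from the packaged yaml
    # templates and submit it; the job's script later triggers the
    # wipe-disks job for the affected disks.
    with open(os.path.join(TEMPLATE_DIR, "remove-osds-job.yaml")) as f:
        manifest = yaml.safe_load(f)

    # Pass the OSD ids to the job's script through an environment
    # variable (assumed template layout: one container in the pod spec).
    container = manifest["spec"]["template"]["spec"]["containers"][0]
    container.setdefault("env", []).append(
        {"name": "OSD_IDS", "value": ",".join(str(i) for i in osd_ids)})

    config.load_incluster_config()
    client.BatchV1Api().create_namespaced_job(namespace=namespace,
                                              body=manifest)


def osds_in_sync(deployment_osd_count, database_osd_count):
    # Relaxed sync-osds check: fewer OSDs deployed than recorded in the
    # database is acceptable while a removal is in progress.
    return deployment_osd_count <= database_osd_count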
#
# Copyright (c) 2024-2025 Wind River Systems, Inc.
#
# SPDX-License-Identifier: Apache-2.0
#

import setuptools

setuptools.setup(
    setup_requires=['pbr>=0.5'],
    pbr=True,
    include_package_data=True,
    package_data={
        'k8sapp_rook_ceph': ['lifecycle/templates/*.yaml']
    },)
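
The include_package_data and package_data entries above are what
bundle the new lifecycle yaml templates into the installed package. As
a hedged illustration (the load_template helper is hypothetical, not
part of this change), a bundled template could be read at runtime like
this:

# Hypothetical helper: read a packaged lifecycle template by name.
from importlib import resources

import yaml


def load_template(name):
    path = resources.files("k8sapp_rook_ceph") / "lifecycle" / "templates" / name
    return yaml.safe_load(path.read_text())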