35 Commits

Erickson Silva de Oliveira
3fb190bfe6 Fix check if restore in progress
When performing the BnR procedure with the wipe_ceph_osds
flag and the rook-ceph backend configured, an error occurred
when removing the app.

This happened because the app's lifecycle performed a
restore-in-progress check against the DB, which always
returned false because the corresponding record had not yet
been inserted at that point.

To fix this, the database query has been replaced by
checking the '/etc/platform/.restore_in_progress' flag.
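
A minimal sketch of the flag-based check described above (the helper
name is illustrative; only the flag path comes from this change):

    import os

    RESTORE_IN_PROGRESS_FLAG = '/etc/platform/.restore_in_progress'

    def restore_in_progress():
        # Check the on-disk flag instead of querying the DB, which has
        # not been populated yet at this point in the lifecycle.
        return os.path.isfile(RESTORE_IN_PROGRESS_FLAG)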

Test Plan:
- PASS: Build rook-ceph app
- PASS: optimized AIO-SX B&R with wipe_ceph_osds flag
- PASS: legacy STD + DX B&R with wipe_ceph_osds flag

Partial-Bug: 2086473

Change-Id: Ica3befe51ff08a53eb1b33af12e96fa4358e6c0f
Signed-off-by: Erickson Silva de Oliveira <Erickson.SilvadeOliveira@windriver.com>
2024-11-02 21:55:57 +00:00
Zuul
1f3081faaa Merge "Fix the upload and update from rook-ceph app" 2024-11-01 13:25:27 +00:00
Gustavo Ornaghi Antunes
538c42ecfa Fix the upload and update from rook-ceph app
This change fixes the rook-ceph app upload when service_config is
None and the rook-ceph app update when conductor_obj is None.

Both issues occur because data is read from these objects without
first checking that they are not None.

The first issue, related to uploading the rook-ceph application,
occurs in the subcloud environment: region_config is set to true in
the system, and when service_config is None the application cannot
be uploaded. The issue is solved by adding a new check
in rook_ceph_provisioner.py.

The second issue, related to the rook-ceph app update, occurs because
a new check for the --force argument was added: it uses conductor_obj
to read the metadata, but during the update operation the lifecycle
does not have access to conductor_obj. The issue is solved by adding
new checks in lifecycle_rook_ceph.py.
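
A rough sketch of the kind of None guards described above; the function
and attribute names are illustrative, not the actual code in
rook_ceph_provisioner.py or lifecycle_rook_ceph.py:

    def get_service_region(service_config):
        # Upload path: in a subcloud, service_config can be None, so
        # guard before reading attributes from it.
        if service_config is None:
            return None
        return getattr(service_config, 'region_name', None)

    def force_check_allowed(conductor_obj):
        # Update path: the lifecycle hook may not receive a conductor
        # object, so skip the metadata-based --force check in that case.
        return conductor_obj is not None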

Test Plan:
 - PASS: Upload/Apply Rook-ceph application
 - PASS: Update Rook-ceph application

Closes-Bug: 2086182

Change-Id: I4c2e51d3a79de1d5a37461f9f3c0d1da4bb244a5
Signed-off-by: Gustavo Ornaghi Antunes <gustavo.ornaghiantunes@windriver.com>
2024-10-31 10:58:00 -03:00
Caio Correa
75b9606be3 Fix ecblock as default storageClass
Marks ecblock's 'general' storage class as the default storage class.

Test Plan:
  - PASS: Test installing Rook Ceph with ecblock present on the
          service list and check the default storage class
  - PASS: Test installing Rook Ceph with block present on the
          service list and check for problems during the installation

Closes-bug: 2085652

Change-Id: Ib2bec10988294161891bd4ce8e2a6639486284b1
Signed-off-by: Caio Correa <caio.correa@windriver.com>
2024-10-30 08:15:46 -03:00
Zuul
91ec2a7f2b Merge "Fix rook-ceph app with auto-apply and remove without force argument" 2024-10-29 14:58:49 +00:00
Caio Correa
a304e00987 Fix min_replication propagation in pools
Fixes propagation of min_replication from storage-backend to every pool
of every service.

Fixes a problem that prevents reapply when a replication parameter
is changed while using ecblock service.

Adds support for storage-backend-modify reapply trigger

Fixes RBD storageClass and pools from values.yaml being present when
using ecblock.

Test Plan:
  - PASS: Test changing replication and min_replication on SX, DX
          and STD environments and check for correct propagation
          on pools
  - PASS: Test storage-backend-modify reapply trigger
  - PASS: Test reapply when a replication parameter is changed using
          ecblock service
  - PASS: Check for unwanted pools or storageClasses in all variations
          of services

Story: 2011066
Task: 51217

Change-Id: I174e44a71f5ed515feb32c7e5909dfedac85e684
Signed-off-by: Caio Correa <caio.correa@windriver.com>
2024-10-25 17:27:44 -03:00
Gustavo Ornaghi Antunes
52727806b0 Fix rook-ceph app with auto-apply and remove without force argument
These fixes prevent the rook-ceph application from being removed
without the force argument while keeping the auto-apply feature
working.

This issue occurred because the functions did not return after the
rook-ceph semantic checks, so super was still called, and when
executing the auto-apply hook super raised an exception that
prevented auto-apply from working.

Test Plan:
 - PASS: Check that an exception message is shown when trying to
         remove the rook-ceph app with no force argument
 - PASS: Check that the rook-ceph application removal occurs
         normally when using the force argument
 - PASS: Check that rook-ceph auto-applies when all requirements
         match

Closes-Bug: 2084681

Change-Id: I7bcb75f08b376d7c8a38dbc1c7df52e061fefd03
Signed-off-by: Gustavo Ornaghi Antunes <gustavo.ornaghiantunes@windriver.com>
2024-10-24 16:00:16 -03:00
Zuul
279a01075c Merge "Update osdid to sync with ceph cluster OSDs" 2024-10-23 03:49:31 +00:00
Zuul
660fa9e726 Merge "Fix the rook-ceph application removal without force argument" 2024-10-18 19:40:47 +00:00
Gustavo Ornaghi Antunes
dcdf500ac5 Fix the rook-ceph application removal without force argument
This fix prevents the rook-ceph application from being removed with
no force argument.

The issue occurred because the pre_remove_semantic_checks method was
invoked in a return statement, which stopped execution and prevented
the removal-blocking logic in the super call from being executed.
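
A toy illustration of the control-flow fix (class names and the hook
signature are simplified and are not the sysinv plugin API; only
pre_remove_semantic_checks comes from this change):

    class BaseLifecycle(object):
        def app_lifecycle_actions(self, hook_info):
            # Stand-in for the parent hook that blocks removal
            # unless --force was given.
            if hook_info.get('operation') == 'remove' and not hook_info.get('force'):
                raise RuntimeError('remove requires --force')

    class RookCephLifecycle(BaseLifecycle):
        def pre_remove_semantic_checks(self, hook_info):
            pass  # app-specific checks go here

        def app_lifecycle_actions(self, hook_info):
            # Fix: run the app checks but do NOT 'return' their result;
            # fall through to super() so the base-class guard still runs.
            self.pre_remove_semantic_checks(hook_info)
            return super(RookCephLifecycle, self).app_lifecycle_actions(hook_info)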

Test Plan:
 - PASS: Check that an exception message is shown when trying to
         remove the rook-ceph app with no force argument
 - PASS: Check that the rook-ceph application removal occurs
         normally when using the force argument

Closes-Bug: 2084681

Change-Id: I4ad11044659eed659c06f540a557b5f56c60c46a
Signed-off-by: Gustavo Ornaghi Antunes <gustavo.ornaghiantunes@windriver.com>
2024-10-18 09:39:04 -03:00
Zuul
bc2314dc52 Merge "Fix Rook Ceph removal stuck by a non-Ceph offline host" 2024-10-17 20:36:30 +00:00
Caio Correa
eba6fdab88 Fix Rook Ceph removal stuck by a non-Ceph offline host
Fixes a bug that prevents rook ceph removal when the system
has non-Ceph offline hosts.

Test Plan:
  - PASS: Test removal with STD 2c+2w with all variations of
          deployment models and topology of monitors.
  - PASS: Test removal with DX+ with all variations of deployment
          models and topology of monitors

Closes-Bug: 2084681

Change-Id: If2bd3bdd1b4e7199aa5547e6936ec8ed4ef81d21
Signed-off-by: Caio Correa <caio.correa@windriver.com>
2024-10-16 14:49:06 +00:00
Gustavo Ornaghi Antunes
7faf4a2b56 Update osdid to sync with ceph cluster OSDs
This change synchronizes the OSDs deployed in the ceph cluster with
the OSDs in the inventory so that they use the same OSD IDs.

The osdid shown in host-stor-list is updated to match the ID of the
corresponding OSD in the ceph cluster.

Test Plan:
 - PASS: Upload and apply the Rook-Ceph application and verify that the
         osdid in host-stor-list has been updated using the ceph cluster
         OSD IDs
 - PASS: Reapply the Rook-Ceph application after adding more osds
         and verify that the osdid in host-stor-list has been updated
         using the ceph cluster OSD IDs
 - PASS: Remove and apply the Rook-Ceph application and verify that the
         osdid in host-stor-list has been updated using the ceph cluster
         OSD IDs
 - PASS: Check the script using a shellcheck

Note: All tests were provisioned in SX IPv6, STD IPv6, DX+ IPv4,
STD IPv4

Depends-On: https://review.opendev.org/c/starlingx/config/+/931988

Closes-bug: 2083332

Change-Id: I1d48f634dcaf1ca4ebd5db375c5bd9c3d36b3967
Signed-off-by: Gustavo Ornaghi Antunes <gustavo.ornaghiantunes@windriver.com>
2024-10-15 14:51:28 -03:00
Zuul
4acf7905cc Merge "Fix unexpected behavior when moving the monitor" 2024-10-12 18:35:18 +00:00
Erickson Silva de Oliveira
0ebbcaefc9 Change the function to simplex checking
While analyzing the lifecycle, we observed an unusual way of
checking whether the system is AIO-SX, using
'is_host_simplex_controller'. To standardize and avoid
problems, it was replaced by 'is_aio_simplex_system'.
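
For illustration, the swap amounts to something like the following
(the import path and call signatures are assumptions based on common
sysinv usage):

    from sysinv.common import utils as cutils

    def _single_node_system(dbapi, host):
        # Before: a per-host check tied to the controller being handled
        #     return cutils.is_host_simplex_controller(host)
        # After: a system-wide AIO-SX check, as described above
        return cutils.is_aio_simplex_system(dbapi)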

Test Plan:
- PASS: Build app package
- PASS: STD fresh install
- PASS: Configure rook-ceph with the wrong amount
        of osd and mon
- PASS: Check the alarms
- PASS: Configure the missing osds and mons and
        reapply app
- PASS: Check if the alarms are gone

Closes-Bug: 2084202

Change-Id: I3fbc02f0973dce7b9318898b2f13775e3d1a7950
Signed-off-by: Erickson Silva de Oliveira <Erickson.SilvadeOliveira@windriver.com>
2024-10-11 12:29:21 +00:00
Erickson Silva de Oliveira
e8286d5bc0 Fix unexpected behavior when moving the monitor
When a monitor is moved from one host to another, the
monitor name is incremented and the monitor data
directory in /var/lib/ceph/data is not deleted.

To resolve the monitor data directory issue, the
AgentAPI function "execute_command" was used to
execute the "rm -rf" command on the old host.

To resolve the monitor increment, the mon to be moved
was removed from the rook-ceph-mon-endpoints configmap
mapping. With this, the operator will not have host
information and will create it on the new host.

In case the mon is removed instead of moved, the configmap
is patched, removing the mon name from data. As a result,
the operator stops using that monitor and the monitor is
then removed from the cluster. So that the next monitor
comes with the same name, the 'maxMonId' is adjusted.
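
A hedged sketch of the configmap edit described above, using the
Kubernetes Python client (the key layout of rook-ceph-mon-endpoints is
assumed from upstream Rook):

    from kubernetes import client, config

    def forget_monitor(mon_name):
        config.load_kube_config()
        core = client.CoreV1Api()
        cm = core.read_namespaced_config_map('rook-ceph-mon-endpoints',
                                             'rook-ceph')
        # 'data' holds entries such as "a=10.0.0.10:6789,b=10.0.0.11:6789";
        # dropping an entry makes the operator forget the host mapping.
        entries = [e for e in cm.data.get('data', '').split(',')
                   if e and not e.startswith(mon_name + '=')]
        cm.data['data'] = ','.join(entries)
        # Decrement maxMonId so the next monitor reuses the freed name.
        cm.data['maxMonId'] = str(max(int(cm.data.get('maxMonId', '0')) - 1, 0))
        core.replace_namespaced_config_map('rook-ceph-mon-endpoints',
                                           'rook-ceph', cm)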

Example of moving a monitor from compute-0 to compute-1:
$ system host-fs-modify --functions= compute-0 ceph
$ system host-fs-modify --functions=monitor compute-1 ceph
$ system application-apply rook-ceph

Test Plan:
- PASS: Build app package
- PASS: Update app package on STD
- PASS: Remove one monitor only
- PASS: Remove two monitors at a time
- PASS: Move one monitor only
- PASS: Move two monitors at a time
- PASS: AIO-DX -> AIO-DX+
- PASS: For each apply after moving or removing a mon,
        check whether the "/var/lib/ceph/data/mon-X"
        directory has been removed.

Closes-Bug: 2082658

Change-Id: If3f2a27f9244ceff13e30fb3c25f3fa46432a1c8
Signed-off-by: Erickson Silva de Oliveira <Erickson.SilvadeOliveira@windriver.com>
2024-10-04 19:48:57 -03:00
Caio Correa
e7c169573d Cleanup improvements with all services
Improvements on the cleanup to ensure that native cleanup
jobs are running well.

Added failsafe actions on cleanup in error cases.

Removed some pool checks in the post-install hook that were
preventing correct installation with ecblock.

Test Plan:
 - PASS: Remove successful Rook Ceph installations with all
            services on SX/DX/STD
 - PASS: Reapply the application after removal on all deployments
            without any issues
 - PASS: Test backup&restore with OSD wipe flag on SX/DX/STD

Story: 2011066
Task: 51030

Change-Id: Id2d49deaece07e6b314d3fab823f207f13b0da31
Signed-off-by: Caio Correa <caio.correa@windriver.com>
2024-10-04 09:21:11 -03:00
Hediberto C Silva
3e1430dc51 Updating lifecycle plugin to support the monitor function
Since the introduction of the capability, monitor availability has
been based on the monitor function assigned to the ceph host
filesystem and the ceph-float controller filesystem.

Lifecycle plugin changes:
  - Add a new method to update controller filesystem status in case of
    post-apply and post-remove.
  - The update_controller_fs method is responsible for updating the
    filesystem's state on the app transition status (uploaded/applied
    and apply-failed).
  - Rework update_host_fs to verify the capabilities to determine based
    on the functions whether the filesystem should be set to Ready or
    In-Use state. Also, required_state accepts more than one input now.
  - In the semantic checks, monitor availability is now based on the
    monitor capability function instead of the host-fs count, so both
    local monitors (ceph host-fs) and the floating monitor
    (ceph-float controller-fs) are considered. At least one monitor
    is required.
  - Rework apply_topology_labels to consider the monitor function to
    set the 'ceph-mon-placement' and 'ceph-mgr-placement' labels to
    the host.
  - Add a new method cleanup_mon_mgr in case of post-apply and
    post-remove.
  - The cleanup_mon_mgr is responsible for checking the
    monitor function, removing specific labels from the host when
    the capability is missing, and deleting the related deployments.
    The labels are removed from the database and also from the node
    using the Kubernetes client.
  - The is_floating_monitor_assigned method now references the
    sysinv common function that checks the monitor capability on the
    ceph-float controller-fs.
  - Add the new function is_local_monitor_assigned to verify if there is
    at least one ceph host-fs using the monitor function.
  - Add the new function is_monitor_assigned to verify if there is a
    local or floating monitor assigned.
  - Add a semantic check limiting the number of local monitors
    assigned to 2 while the floating monitor is assigned.
  - Add Semantic Check rejecting apply operation when the setup is an
    AIO-DX with worker nodes and the floating monitor is assigned.
  - Add a new method delete_deployment to delete the
    local monitor and manager deployments where the label was removed
    from the host, ensuring that the monitor has been removed from
    the monitor quorum.
Helm Overrides:
  - Enabling floating monitor when the function is assigned.
  - Update the desired mon count to use the monitor capability function.

Test Plan:
  PASS: Setup 2+2, add a monitor and check the mon and mgr labels.
  PASS: Setup 2+2, remove a monitor and check the mon and mgr labels.
  PASS: Setup 2+2, move a monitor from one worker to another (and back).
  PASS: Setup 2+2, delete the worker nodes, and add floating mon.
  PASS: Setup AIO-DX, with floating mon enabled, install worker,
        move floating mon to fixed mon on worker.

Depends-on: https://review.opendev.org/c/starlingx/config/+/926098

Story: 2011066
Task: 50827

Change-Id: I2d0073e7f8c8c76c8505f3ad1abb7ebd4f09d4e3
Signed-off-by: Hediberto C Silva <hediberto.cavalcantedasilva@windriver.com>
2024-09-06 18:29:39 -03:00
Caio Correa
e3b3b2cec5 Cleanup Improvements
Improvements on the cleanup to ensure that native cleanup
jobs are running well.

Fixed floating monitor jobs. They are now deleted when the
application is removed.

Test Plan:
 - PASS: Remove successful Rook Ceph installations on SX/DX/STD
 - PASS: Reapply the application after removal on all deployments
         without any issues

Story: 2011066
Task: 50976

Change-Id: I050df09a01a9f3869cac8544e8e6da512828d6b5
Signed-off-by: Caio Correa <caio.correa@windriver.com>
2024-09-05 19:38:34 -03:00
Zuul
f809270c27 Merge "Add OSD removal possibility in the rook-ceph app" 2024-09-05 21:55:18 +00:00
Gustavo Ornaghi Antunes
27cd4293c7 Add OSD removal possibility in the rook-ceph app
This change adds the possibility of OSD removal: the user can remove
an OSD from the ceph cluster using host-stor-delete, then run
application-apply and wait for the OSD removal.

Test Plan:
 - PASS: Upload/Apply the rook-ceph app with the dedicated deployment
         model and 3 workers (1 OSD on two workers, 2 OSDs on one
         worker); remove an OSD from the inventory using the
         host-stor-delete command, reapply the app, and wait for the
         OSD removal. After the OSD removal, add another OSD on the
         same worker using the host-stor-add command, reapply the app,
         and wait for the OSD to be recreated.

Story: 2011066
Task: 50938

Change-Id: I945222181f04b297b9e79ccd323e9316c1f3f230
Signed-off-by: Gustavo Ornaghi Antunes <gustavo.ornaghiantunes@windriver.com>
2024-09-05 11:00:31 -03:00
Gustavo Ornaghi Antunes
9126829739 Add restful service for rook-ceph app
This change adds a new service template named
rook-ceph-mgr-restful to the rook-ceph-provisioner chart
to enable access to ceph-mgr via the restful API, using
an endpoint with default port 7999, to support STX integration.
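
Purely as a usage illustration (the in-cluster service name, endpoint
path, and credentials are assumptions, not part of this change):

    import requests

    # ceph-mgr restful module reached through the rook-ceph-mgr-restful
    # service on its default port.
    url = 'https://rook-ceph-mgr-restful.rook-ceph.svc:7999/server'
    resp = requests.get(url, auth=('admin', '<api-key>'), verify=False)
    print(resp.status_code)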

Test Plan:
 - PASS: Upload/Apply rook-ceph app and check if the service and
         endpoint have been created.
 - PASS: Upload/Apply rook-ceph app and use cURL to access the
         restful service

Story: 2011066
Task: 50967

Change-Id: Ibb8deb69cd55122c2b6a2a08d2f8d9306ebcc06e
Signed-off-by: Gustavo Ornaghi Antunes <gustavo.ornaghiantunes@windriver.com>
2024-09-05 11:47:15 +00:00
Robert Church
17d9060285 Fix incorrect requirement: pycryptodome
This corrects the misspelled pycryptodomex requirement in
requirements.txt.

Change-Id: I858bb40f027470d5384309415bc7521bb09c9421
Story: 2011066
Task: 50873
Signed-off-by: Robert Church <robert.church@windriver.com>
2024-08-19 15:35:35 -05:00
Robert Church
41adbba935 Enable optional AIO-DX floating monitor
This will enable integration of the floating monitor chart into the
application with:
- SM service monitor changes:
  - Add and remove floating monitor placement labels in the start/stop
    functions. This will ensure that when SM is transitioning
    activity, labels will align on the active controller.
  - The stop function will delete the pod to force a reschedule.
  - The status function will detect the presence of the DRBD mounted
    filesystem and adjust the labeling accordingly in case start/stop
    functions did not label as desired.
- application plugin changes:
  - Add constants support for 'rook-ceph-floating-monitor' helmrelease
  - Provide initial utility functions to detect if the DRBD controller
    filesystem is enabled and if the floating monitor is assigned (via
    a helm user override)
  - Add a new function to get the IP family from the cluster-pod network
    to set overrides and determine the IPv4/IPv6 static address
  - Update the ceph cluster plugin to use a new utility function for
    detecting the IP family
  - Add the floating monitor helm plugin to generate the ip_family and
    static ip_address based on that family. Initial support provided for
    the cluster-pod network
  - Update the lifecycle plugin to optionally remove the floating
    monitor helm release on application remove
- application metadata
  - disable the 'rook-ceph-floating-monitor' chart by default
- FluxCD manifest changes
  - Change helmrepository API to v1 to clean up an error
  - Add manifests for the 'rook-ceph-floating-monitor' helm release
  - Temporarily set deletionPropagation in the rook-ceph-cluster, the
    rook-ceph-provisioner and rook-ceph-floating-monitor helmreleases to
    provide more predictive delete behavior
  - Update rook-ceph-cluster-static-overrides.yaml to add network
    defaults and disable the host network as the default provider. This
    was done to avoid port conflicts with the floating monitor. The
    cluster-pod network will now be the network used for the ceph
    cluster and its pods

Enable monitor at runtime:
 - system helm-override-list rook-ceph -l
 - system helm-override-show rook-ceph rook-ceph-floating-monitor \
     rook-ceph
 - system helm-override-update rook-ceph rook-ceph-floating-monitor \
     rook-ceph  --set assigned="true"
 - system helm-override-show rook-ceph rook-ceph-floating-monitor \
     rook-ceph
 - system application-apply rook-ceph

Disable monitor at runtime:
 - system helm-override-list rook-ceph -l
 - system helm-override-show rook-ceph rook-ceph-floating-monitor \
     rook-ceph
 - system helm-override-update rook-ceph rook-ceph-floating-monitor \
     rook-ceph --set assigned="false"
 - system helm-override-show rook-ceph rook-ceph-floating-monitor \
     rook-ceph
 - system application-apply rook-ceph

Future Improvements:
- Pick up the desired network from the storage backend (cluster-pod,
  cluster-host, etc) and
  - update _get_ip_family() to use this value
  - update _get_static_floating_mon_ip() to get address pool range and
    calculate an appropriate static IP address for the monitor

Test Plan:
PASS - Pkg build + ISO generation
PASS - Successful AIO-DX Installation
PASS - Initial Rook deployment without floating monitor.
PASS - Initial Rook deployment with floating monitor.
PASS - Runtime override enable of Rook floating monitor + reapply
PASS - Runtime override disable of Rook floating monitor + reapply

Change-Id: Ie1ff75481b6c2f0d9d34eb228d3019465e36bc1e
Depends-On: https://review.opendev.org/c/starlingx/config/+/926374
Story: 2011066
Task: 50838
Signed-off-by: Robert Church <robert.church@windriver.com>
2024-08-15 12:54:12 -05:00
Erickson Silva de Oliveira
b0d53974d9 Fix app stuck on applying
When applying the app in some systems, the app was
stuck at 67%.

Analyzing the pod logs, it was possible to
observe that stx-ceph-manager was running with python2
instead of python3. Additionally, restful module certificates
were not being generated for all mgrs.

Finally, the rook-ceph-provision job pod was also observed
to have an IPv6 formatting error.
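
A generic sketch of the kind of IPv6 formatting fix implied above (not
the actual job code):

    import ipaddress

    def format_endpoint(address, port):
        # IPv6 addresses must be wrapped in brackets when joined
        # with a port.
        if ipaddress.ip_address(address).version == 6:
            return '[%s]:%d' % (address, port)
        return '%s:%d' % (address, port)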

Test Plan:
- PASS: Build stx-ceph-manager image with the changes
	from the review in 'Depends-On' below.
- PASS: Change the stx-ceph-manager deployment in
	the rook-ceph app to use this image
- PASS: Build rook-ceph app
- PASS: Apply the app and check if it was applied successfully.

Story: 2011066
Task: 50703

Depends-On: https://review.opendev.org/c/starlingx/utilities/+/924883

Change-Id: Ic33dce418c11279462420c7f515fb443cbfe2379
Signed-off-by: Erickson Silva de Oliveira <Erickson.SilvadeOliveira@windriver.com>
2024-08-01 10:59:22 -03:00
Erickson Silva de Oliveira
75e7ee0966 Fix ceph cluster cleanup
During application-remove, the app sometimes gets stuck in the
removing state because k8s resources were not removed
successfully.

This happens because some resources did not have their
"finalizers" changed; in addition, the directories
in /var/lib/ceph/data are not cleaned.

To resolve this, "cleanupPolicy" was defined in ceph-cluster,
this way the operator itself will do a complete cleanup
before removing the app, including wiping OSDs.
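
For reference, the cleanupPolicy mentioned above corresponds to a Rook
CephCluster setting along these lines, expressed here as a Python
helm-override dict (the exact nesting used by this app's charts is an
assumption):

    def cleanup_policy_overrides():
        return {
            'cephClusterSpec': {
                'cleanupPolicy': {
                    # Tells the operator to wipe cluster data (including
                    # OSDs) before the CephCluster resource is deleted.
                    'confirmation': 'yes-really-destroy-data',
                },
            },
        }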

Additionally, it was also identified that the jobs.batch
resources were not being removed due to a permission failure
in the ClusterRole.

Finally, the versioning of the rook-ceph-provisioner chart
was fixed, which was always 2.0.0.

Before the change: rook-ceph-provisioner-2.0.0.tgz
After the change: rook-ceph-provisioner-2.0.6.tgz

NOTE: The operation of removing and deleting the application is
now forbidden and is only possible through the "--force" argument.

Test Plan:
- PASS: Remove rook-ceph app
- PASS: Check that the state is not stuck on removing
- PASS: Check that all resources in the rook-ceph
        namespace are deleted

Story: 2011066
Task: 50570

Change-Id: I007fcdf63ec9611c8839a6e7c0e2bff8d38e6086
Signed-off-by: Erickson Silva de Oliveira <Erickson.SilvadeOliveira@windriver.com>
2024-07-22 18:58:26 -03:00
Gustavo Ornaghi Antunes
211d2a1efa Auto-apply the rook-ceph app
This change includes some improvements to
support the auto-apply functionality of the app.
The improvements are:
 - Update host-stor or host-fs states only if
   ceph-rook backend exists
 - Desired_state updated to applied
 - Add new locks on semantic checking to only
   allow auto_apply when minimum criteria match

Test Plan:
- PASS: Check that the sysinv log does not show messages about
        host-stor or host-fs state updates.
- PASS: Check that the app auto-applies only
        when the minimum criteria match.

Story: 2011066
Task: 50567

Change-Id: Ifd324e3eddbd3c1d2d3a22ad8ca0967e3f07be55
Signed-off-by: Gustavo Ornaghi Antunes <gustavo.ornaghiantunes@windriver.com>
2024-07-22 14:32:50 -03:00
Robert Church
06c8fd61d7 Refactor application
Changes include:
 - Add common/utils.py for functionality common to lifecycle and helm
   plugins
 - Disable floating monitor support until properly integrated with the
   optional controllerfs
 - Determine the Ceph services count (MONs, MDSs, MGRs) based on
   host-fs assignments, not labels, as the lifecycle plugins apply
   labels AFTER the counts are needed for semantic checks
 - Disable ecblock and rgw until configuration can be validated
 - Disable the mon/osd audits temporarily
 - Disable host provisioning of /etc/ceph/ceph.conf so that the
   rook-ceph-tools pod is used for client access
 - Rename _get_hosts() to _get_hosts_by_deployment_model()
 - Rename _get_nodes_osds() to _get_osds_by_node()
 - Add support to the kustomize plugin to break out and enable/disable
   specific static overrides manifests based on whether the service is
   enabled in the backend
 - Rename pre_apply_check_ceph_rook() to pre_apply_semantic_checks()
 - Add handle_incomplete_config_alarm() to centralize alarming
   information
 - Update various messages for content and readability
 - Update the cephClusterSpec to use the ceph host-fs filesystem for
   dataDirHostPath
 - Disable logs when command succeeds
 - Update the remove custom resource finalizers command to run
   successfully
 - Enable host lock/unlock or app removal when the app is in the
   uploaded state and the ceph-rook backend has not yet been created.
 - Remove all resources from rook-ceph namespace
 - Enable the triggering of alarms, if necessary, in the lifecycle
   pre-apply hook

Test Plan:
 PASS - AIO-SX install and deployment of Rook via

        system storage-backend-add ceph-rook --confirmed
        system host-fs-add controller-0 ceph=20
        system host-disk-wipe -s --confirm controller-0 /dev/sdb
        system host-disk-list controller-0 | \
          awk '/\/dev\/sdb/{print $2}' | \
          xargs -i system host-stor-add controller-0 {}
        system host-stor-list controller-0
        system application-apply rook-ceph
        ROOK_TOOLS_POD=$(kubectl -n rook-ceph get pod \
          -l "app=rook-ceph-tools" \
          -o jsonpath='{.items[0].metadata.name}')
        kubectl -n rook-ceph exec -it $ROOK_TOOLS_POD -- ceph -s

Change-Id: Ib5964bbc3eaae173d6a47da3d44c71db9b35ee55
Depends-On: https://review.opendev.org/c/starlingx/config/+/922365
Story: 2011066
Task: 50391
Signed-off-by: Robert Church <robert.church@windriver.com>
Signed-off-by: Gustavo Ornaghi Antunes <gustavo.ornaghiantunes@windriver.com>
2024-07-02 13:57:51 +00:00
Gustavo Ornaghi Antunes
80502b1ef1 Add new pre-apply checks in rook-ceph app
This change implements new checks in the rook-ceph app. The checks are:
 - Raise an alarm based on the replication factor and OSDs (see the
   sketch after this list)
 - Block the app when the k8s version is not the latest
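
A rough sketch of the first check (the object attributes and the alarm
helper are illustrative):

    def check_replication(backend, osds, hosts_with_osds, raise_alarm):
        replication = int(backend.capabilities.get('replication', 2))
        if replication > len(osds) or replication > len(hosts_with_osds):
            # Not enough OSDs (or OSD hosts) to satisfy the
            # configured replication factor.
            raise_alarm('replication factor %d exceeds the number of '
                        'available OSDs/hosts' % replication)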

Test Plan:
 - PASS: Check whether the alarm is raised when the replication factor
         is greater than OSD quantity in SX.
 - PASS: Check whether the alarm is raised when the replication factor
         is greater than the number of hosts with OSDs.
 - PASS: Check if the app is being blocked when the k8s version is not
         the latest supported version.

Story: 2011066
Task: 50370

Change-Id: I71df6bb5816cc5e8271c8e5b77e702db772a9a72
Signed-off-by: Gustavo Ornaghi Antunes <gustavo.ornaghiantunes@windriver.com>
2024-07-02 12:13:05 +00:00
Gabriel de Araújo Cabral
808fae55e0 Add state changes to host-fs ceph in lifecycle
This commit adds state transitions in the rook-ceph app
lifecycle for each host's host-fs 'ceph'.

Added state transitions (a sketch of the update follows the list):
- constants.HOST_FS_STATUS_READY -> constants.HOST_FS_STATUS_IN_USE
  (During and after applying the app)
- constants.HOST_FS_STATUS_IN_USE -> constants.HOST_FS_STATUS_READY
  (When the app is uploaded and the old state of the host-fs was in-use)
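
A hedged sketch of the state update described above; the dbapi
accessors and field name are assumptions, while the status constants
come from this commit:

    from sysinv.common import constants   # assumed import path

    def set_ceph_host_fs_state(dbapi, host, app_applied):
        target = (constants.HOST_FS_STATUS_IN_USE if app_applied
                  else constants.HOST_FS_STATUS_READY)
        for fs in dbapi.host_fs_get_by_ihost(host.id):       # assumed accessor
            if fs.name == 'ceph':
                dbapi.host_fs_update(fs.id, {'state': target})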

Test Plan:
 PASS: AIO-SX -> Check that the state of host-fs 'ceph' of
        controller-0 is ready before applying and that it changes
        to in-use upon completion of the application-apply.
 PASS: AIO-SX -> With the app applied, check that host-fs 'ceph' is
       in use, remove the app and when returning to uploaded check
       that the state goes to ready.

Story: 2011117
Task: 50343

Depends-On: https://review.opendev.org/c/starlingx/config/+/921446

Change-Id: I63af4325c6386879794d7e09ca4de99d3ca0c37d
Signed-off-by: Gabriel de Araújo Cabral <gabriel.cabral@windriver.com>
2024-06-26 15:25:21 -03:00
Gustavo Ornaghi Antunes
cd79d4443a Add dynamic overrides in rook-ceph app
This change adds new dynamic overrides and enables/disables services
based on the storage-backend.

Dynamic overrides added (a rough sketch follows this list):
  Overrides based on how many hosts have host-fs ceph:
    - mds replicas size
    - mon count
    - mgr count
  Overrides based on host-stor
   - nodes
     - devices (osds)
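
An illustrative reduction of the overrides above (the mon/mgr/mds
sizing rules and the host-stor attribute names are assumptions; the
nodes/devices layout follows the Rook cluster chart):

    def ceph_service_counts(hosts_with_ceph_host_fs):
        # Counts derived from how many hosts have the host-fs ceph.
        count = len(hosts_with_ceph_host_fs)
        return {'mon': {'count': count},
                'mgr': {'count': count},
                'mds': {'replicas': count}}

    def osd_node_overrides(stors_by_hostname):
        # One node entry per host and one device entry per OSD (host-stor).
        return {'nodes': [{'name': hostname,
                           'devices': [{'name': stor.device_path}
                                       for stor in stors]}
                          for hostname, stors in stors_by_hostname.items()]}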

Services that can be enabled:
 - CephFS (filesystem)
 - RBD (block or ecblock)
 - RGW (object)

Test Plan:
 - PASS: Load the rook-ceph app and check system-overrides for each
         chart
 - PASS: Apply the rook-ceph app and check if system-overrides have
         changed, only if something has changed before applying the app
 - PASS: Check if the services are enabled correctly based on the
         storage-backend services column
 - PASS: Check if the ceph is in HEALTH_OK status

Depends-On: https://review.opendev.org/c/starlingx/config/+/921801

Story: 2011066
Task: 50298

Change-Id: Ib245b0f1195d4c6437ed45346fe00cf16a69f67f
Signed-off-by: Gustavo Ornaghi Antunes <gustavo.ornaghiantunes@windriver.com>
2024-06-19 10:53:47 -05:00
Gustavo Ornaghi Antunes
a9f3b1e3da Implementing lifecycle actions in rook-ceph app
This change implements lifecycle features in the pre-apply and
post-apply actions. The added features perform checks and apply
additions automatically.

Post-upload:
 - Change storage-backend task to app status

Pre-apply:
 - Add Topology labels in each host on the system
 - Block if there is no ceph-rook backend in configuring state
 - Block if there is another backend in storage-backend list.
 - Block if the cephmon-label was not correctly added (host-fs)
 - Block if the OSDs were not correctly added (host-stor)

Post-apply:
 - Change storage-backend state to configured when app applied success
 - Change storage-backend state to configuration-failed when app fails
   on apply
 - Change host-stor state to configured when app applied success
 - Change host-stor state to configuration-failed when app fails on
   apply
 - Change storage-backend task to app status

Post-remove:
 - Change storage-backend task to app status
 - Change host-stor state to configuring

Post-delete:
 - Change storage-backend task to app status

Test Plan:
 - PASS: Check that the Topology labels are added to each host
 - PASS: Check that the rook-ceph app can't be applied when there is
         no storage-backend
 - PASS: Check that the rook-ceph app can't be applied when the
         cephmon-label is not correctly added (host-fs)
 - PASS: Check that the rook-ceph app can't be applied when the OSDs
         are not correctly added
 - PASS: Check that when the rook-ceph app is applied successfully,
         the storage-backend and host-stor states change to configured
 - PASS: Check that when the rook-ceph app apply fails, the
         storage-backend and host-stor states change to
         configuration-failed
 - PASS: Check that the storage-backend task is updated on each
         lifecycle hook
 - PASS: Check that storage-backend and host-stor states are reset
         to configuring on app removal

Story: 2011066
Task: 50126

Change-Id: Ic77db9176b53411635ad0fc87b0fc57a12620679
Signed-off-by: Gustavo Ornaghi Antunes <gustavo.ornaghiantunes@windriver.com>
2024-06-14 09:09:24 -03:00
Caio Correa
4a335eb7f6 Fix mds pod scheduling
Fixed mds pod scheduling:
  - Adds control-plane and master tolerations.
  - Adds nodeAffinity for ceph-mon-placement.
  - Adds the starlingx.io/component label without the need for a
    post-install helm hook.

Test Plan:
  PASS: Test pod allocation on controller nodes in STD.
  PASS: Test nodeAffinity by turning off all nodes with
  ceph-mon-placement label. Pod should be pending.

Story: 2011066
Task: 50232

Change-Id: Iba7c097b9f58826d01008c41fd0caa84a24a94a3
Signed-off-by: Caio Correa <caio.correa@windriver.com>
2024-06-03 11:36:25 +00:00
Caio Correa
b2e3e03a0e Add starlingx label to all application pods
Adds the app.starlingx.io/component label to all pods to ensure that
the entire application runs on platform cores.

Also corrects the name of the app in setup.cfg.

Test Plan:
    PASS: build all app-rook-ceph packages successfully.
    PASS: app-rook-ceph upload/apply/remove/delete on
          SX platform.
    PASS: cluster status HEALTH_OK.
    PASS: all pods contain the app.starlingx.io/component=true
          label.

Story: 2011066
Task: 50109

Change-Id: Iee3055fa916828f4e5627a072e245aa9aec850a9
Signed-off-by: Caio Correa <caio.correa@windriver.com>
2024-05-16 20:07:41 +00:00
Caio Correa
326f833d3e Initial commit for app-rook-ceph
The app is based on the old StarlingX Rook Ceph application.

This provides support for the latest versions of Rook Ceph
storage and packs it as a StarlingX Application.

Auto-incrementing of helm chart versions is already present in this
initial commit.

Support for Dual-Stack.

Partial IPv6 support was added: there is a bug with DX IPv6
configuration involving the floating monitor.

Remove/delete is successful for FluxCD; however, some residual
kubernetes assets remain on the system after the removal.

Rook Ceph version: 1.13.7

Test Plan:
    PASS: build all app-rook-ceph packages successfully.
    PASS: app-rook-ceph upload/apply/remove/delete on
          SX/DX/DX+/Standard platforms.
    PASS: create a volume using a PVC through the cephfs and rbd
          storageClasses and test read/write on the corresponding
          pools on SX/DX/DX+/Standard platforms.

Story: 2011066
Task: 49846

Change-Id: I7aa6b08a30676095c86a974eaca79084b2f06859
Signed-off-by: Caio Correa <caio.correa@windriver.com>
2024-05-08 09:51:44 -03:00