Files
distcloud/distributedcloud/dcmanager/state
Victor Romano a3ddcf472d Improve dcmanager-state scalability
This commit includes the following changes:

- Implement the new fm-api methods regarding raising/clearing alarms
in batches. The new keep_existing_alarms option was also implemented
to make sure we don't update alarms as we're not checking if it
exist before trying to raise them again.

- Moving from FaultAPIs to FaultAPIsV2, which raises exceptions if
there's an error in FM, preventing state to continue and not
clearing/raising alarms when FM is offline. This can happen during a
swact where FM process is stopped before state.

- Introduce a db call to get the subcloud object and current status
of endpoint instead of receiving a simplified subcloud through RPC.
The reason for doing this instead of a simplified subcloud is that
dcmanager-audit is faster to process than state, so until state
updates, audit will keep sending information causing duplicated
updates, slowing down the time it takes to update every subcloud.

- Convert all logs into a default format with the subcloud name at
the start, for better traceability. E.g: "Subcloud: subcloud1. <msg>".

- Removed unused function update_subcloud_sync_endpoint_type.

Test plan:
  - PASS: Deploy a subcloud and verify state communicates to cert-mon
          that it became online and then updates the dc_cert endpoint
          after receiving the response.
  - PASS: Manage the subcloud and verify all endpoint are updated and
          the final sync status is in-sync.
  - PASS: Force a subcloud to have an out-of-sync kube root-ca and
          kubernetes and verify state correctly updates the db and
          raise the alarms.
  - PASS: Turn off the subcloud and verify:
            - Subcloud availability was updated in db
            - All endpoints were updated in db
            - Dcorch was communicated
            - All endpoints alarms were cleared
            - The offline alarm was raised
  - PASS: Unmanage the subcloud and verify all endpoints, whith the
          exception of dc_cert, were updated to unknown.
  - PASS: Unmanage and stop the fm-mgs service and turn off the
          subcloud. Verify the subcloud is not updated to offline
          until fm comes back on.
  - PASS: Perform scale tests and verify that updating availability
          and endpoints is faster.

Story: 2011311
Task: 52283

Depends-on: https://review.opendev.org/c/starlingx/fault/+/952671

Change-Id: I8792e1cbf8eb0af0cc9dd1be25987fac2503ecee
Signed-off-by: Victor Romano <victor.gluzromano@windriver.com>
2025-06-16 09:39:59 -03:00
..
2025-06-16 09:39:59 -03:00

Service

DC Manager State Service has responsibility for:

Subcloud state updates coming from dcmanager-manager service

service.py:

run DC Manager State Service in multi-worker mode, and establish RPC server

subcloud_state_manager.py:

Provide subcloud state updates