
This commit includes the following changes: - Implement the new fm-api methods regarding raising/clearing alarms in batches. The new keep_existing_alarms option was also implemented to make sure we don't update alarms as we're not checking if it exist before trying to raise them again. - Moving from FaultAPIs to FaultAPIsV2, which raises exceptions if there's an error in FM, preventing state to continue and not clearing/raising alarms when FM is offline. This can happen during a swact where FM process is stopped before state. - Introduce a db call to get the subcloud object and current status of endpoint instead of receiving a simplified subcloud through RPC. The reason for doing this instead of a simplified subcloud is that dcmanager-audit is faster to process than state, so until state updates, audit will keep sending information causing duplicated updates, slowing down the time it takes to update every subcloud. - Convert all logs into a default format with the subcloud name at the start, for better traceability. E.g: "Subcloud: subcloud1. <msg>". - Removed unused function update_subcloud_sync_endpoint_type. Test plan: - PASS: Deploy a subcloud and verify state communicates to cert-mon that it became online and then updates the dc_cert endpoint after receiving the response. - PASS: Manage the subcloud and verify all endpoint are updated and the final sync status is in-sync. - PASS: Force a subcloud to have an out-of-sync kube root-ca and kubernetes and verify state correctly updates the db and raise the alarms. - PASS: Turn off the subcloud and verify: - Subcloud availability was updated in db - All endpoints were updated in db - Dcorch was communicated - All endpoints alarms were cleared - The offline alarm was raised - PASS: Unmanage the subcloud and verify all endpoints, whith the exception of dc_cert, were updated to unknown. - PASS: Unmanage and stop the fm-mgs service and turn off the subcloud. Verify the subcloud is not updated to offline until fm comes back on. - PASS: Perform scale tests and verify that updating availability and endpoints is faster. Story: 2011311 Task: 52283 Depends-on: https://review.opendev.org/c/starlingx/fault/+/952671 Change-Id: I8792e1cbf8eb0af0cc9dd1be25987fac2503ecee Signed-off-by: Victor Romano <victor.gluzromano@windriver.com>
Service
- DC Manager State Service has responsibility for:
-
Subcloud state updates coming from dcmanager-manager service
- service.py:
-
run DC Manager State Service in multi-worker mode, and establish RPC server
- subcloud_state_manager.py:
-
Provide subcloud state updates