distcloud/distributedcloud/dcmanager/state
srana 91ef6f00b7 Fix: Clear Stale Subcloud Offline Alarm
Currently, subcloud status alarms are updated strictly when
the subcloud availability changes. An audit which doesn't
update the subcloud availability will not update the status
alarm. Typically, this is fine; however, it can be problematic
when alarming services are unavailable (i.e., FM failure) during
an availability update. In this rare case, the subcloud status
alarm will not match the availability status. Ultimately, this can
result in an inconsistent stale alarm status, with an offline alarm
raised indefinitely for an available subcloud. Overall, the subcloud
state manager should not assume that the subcloud availability status
and the subcloud status alarms are aligned. This change ensures that
the subcloud alarm status is eventually aligned with the actual
availability by forcing alarm updates when the availability remains
unchanged (during audit’s update_subcloud_availability).
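The behaviour described above can be illustrated with a minimal sketch. It is not the actual dcmanager code: `FakeAlarmClient`, `StateManager`, `AlarmServiceUnavailable`, and the `"online"`/`"offline"` strings are hypothetical stand-ins chosen for illustration. Only the idea comes from the commit message: reconcile the offline alarm on every availability update, including updates where the availability has not changed.

```python
from dataclasses import dataclass, field

# Assumed status values, for illustration only.
ONLINE = "online"
OFFLINE = "offline"


class AlarmServiceUnavailable(Exception):
    """Raised by the stand-in alarm client when FM cannot be reached."""


@dataclass
class FakeAlarmClient:
    """Hypothetical stand-in for the fault-management (FM) API."""
    available: bool = True
    raised: set = field(default_factory=set)

    def _check(self):
        if not self.available:
            raise AlarmServiceUnavailable("FM is not reachable")

    def raise_offline_alarm(self, name):
        self._check()
        self.raised.add(name)

    def clear_offline_alarm(self, name):
        self._check()
        self.raised.discard(name)


@dataclass
class StateManager:
    """Hypothetical, heavily simplified state manager."""
    alarms: FakeAlarmClient
    availability: dict = field(default_factory=dict)

    def update_subcloud_availability(self, name, status):
        # Record the availability reported by the audit.
        self.availability[name] = status
        # Reconcile the alarm on every update, not only when the
        # availability changed, so a stale alarm is eventually fixed.
        try:
            if status == OFFLINE:
                self.alarms.raise_offline_alarm(name)
            else:
                self.alarms.clear_offline_alarm(name)
        except AlarmServiceUnavailable:
            # Alarm state may now be stale; the next audit retries.
            pass
```

In this sketch, an update to `ONLINE` issued while `alarms.available` is `False` leaves the offline alarm raised. A later update with the same `ONLINE` status, once the client is reachable again, clears it, which mirrors the second test case in the test plan.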

Test Plan:
 1. PASS: Ensure subcloud offline (280.001) alarm is cleared for
          subcloud restarts interleaved with a host-swact.
          - Power off subcloud, confirm subcloud offline alarm raised,
            power-on subcloud and initiate host-swact
 2. PASS: Induce FM failure during an availability update and ensure
          that the subcloud offline (280.001) alarm status
          is eventually cleared:
          - Power-off subcloud
          - Wait for availability status of subcloud to show offline
            (dcmanager subcloud list)
            The subcloud offline alarm should be raised
          - Unmanage the FM-mgr service, kill the FM process, and
            power on the subcloud
          - Check the alarm list; the subcloud offline alarm should
            remain raised. It should FAIL to CLEAR at this point
          - Manage FM-mgr (ensure FM is connected) and wait for
            next "Handling update_subcloud_availability request"
            in state.log
          - Check offline alarm has been cleared

Closes-Bug: 2040204

Change-Id: I8c3dd10ca0b3cdfadf7672adfb6165b3194f64aa
Signed-off-by: Salman Rana <salman.rana@windriver.com>
2023-10-27 14:06:42 +00:00

Service

The DC Manager State Service is responsible for:

Subcloud state updates coming from the dcmanager-manager service

service.py:

Runs the DC Manager State Service in multi-worker mode and establishes the RPC server

subcloud_state_manager.py:

Provides subcloud state updates