Fix: Clear Stale Subcloud Offline Alarm

Currently, subcloud status alarms are updated strictly when
the subcloud availability changes. An audit which doesn't
update the subcloud availability will not update the status
alarm. Typically, this is fine; however, it can be problematic
when alarming services are unavailable (e.g., FM failure) during
an availability update. In this rare case, the subcloud status
alarm will not match the availability status. Ultimately, this can
result in an inconsistent stale alarm status, with an offline alarm
raised indefinitely for an available subcloud. Overall, the subcloud
state manager should not assume that the subcloud availability status
and the subcloud status alarms are aligned. This change ensures that
the subcloud alarm status is eventually aligned with the actual
availability by forcing alarm updates when the availability remains
unchanged (during audit’s update_subcloud_availability).
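
The behaviour described above can be illustrated with a small, self-contained sketch. It is not the dcmanager implementation: `FakeAlarmService`, `StateManager`, and their methods are hypothetical stand-ins for FM and the subcloud state manager, and the sketch assumes that a failed alarm update is swallowed rather than propagated to the caller. It only shows why re-asserting the alarm on an unchanged-availability audit lets a stale offline alarm converge.

```python
class AlarmServiceDown(Exception):
    """Raised by the stand-in alarm service while it is unreachable."""


class FakeAlarmService:
    """Hypothetical stand-in for FM: tracks raised offline alarms by name."""

    def __init__(self):
        self.available = True
        self.offline_alarms = set()

    def raise_or_clear(self, name, availability):
        if not self.available:
            raise AlarmServiceDown(name)
        if availability == "offline":
            self.offline_alarms.add(name)
        else:
            self.offline_alarms.discard(name)


class StateManager:
    """Hypothetical, heavily simplified subcloud state manager."""

    def __init__(self, alarms):
        self.alarms = alarms
        self.availability = {}

    def update_availability(self, name, availability, update_state_only=False):
        if not update_state_only:
            # Availability changed: record the new status.
            self.availability[name] = availability
        # In both cases, re-assert the alarm so that an earlier failed
        # alarm update is eventually corrected by a later audit.
        self._sync_alarm(name, availability)

    def _sync_alarm(self, name, availability):
        try:
            self.alarms.raise_or_clear(name, availability)
        except AlarmServiceDown:
            # Assumption: alarm failures are best-effort and not re-raised.
            pass


fm = FakeAlarmService()
manager = StateManager(fm)

manager.update_availability("subcloud1", "offline")
alarm_after_offline = "subcloud1" in fm.offline_alarms

fm.available = False  # alarm service fails during the availability update
manager.update_availability("subcloud1", "online")
alarm_while_fm_down = "subcloud1" in fm.offline_alarms

fm.available = True  # alarm service recovers; next audit reports no change
manager.update_availability("subcloud1", "online", update_state_only=True)
alarm_after_audit = "subcloud1" in fm.offline_alarms
```

In this sketch the offline alarm stays raised while the alarm service is down, even though the recorded availability is already online, and is cleared by the first state-only audit after the service recovers.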

Test Plan:
 1. PASS: Ensure subcloud offline (280.001) alarm is cleared for
          subcloud restarts interleaved with a host-swact.
          - Power off subcloud, confirm subcloud offline alarm raised,
            power-on subcloud and initiate host-swact
 2. PASS: Induce FM failure during an availability update and ensure
          that the subcloud offline (280.001) alarm status
          is eventually cleared:
          - Power-off subcloud
          - Wait for availability status of subcloud to show offline
            (dcmanager subcloud list)
            Subcloud offline alarm should be raised
          - Unmanage FM-mgr service, kill the FM process and power-on
            subcloud
          - Check alarm list; the subcloud offline alarm should remain
            raised (it should FAIL to CLEAR at this point)
          - Manage FM-mgr (ensure FM is connected) and wait for
            next "Handling update_subcloud_availability request"
            in state.log
          - Check offline alarm has been cleared

Closes-Bug: 2040204

Change-Id: I8c3dd10ca0b3cdfadf7672adfb6165b3194f64aa
Signed-off-by: Salman Rana <salman.rana@windriver.com>
srana
2023-10-24 10:20:24 -04:00
committed by Salman Rana
parent 2438807fc8
commit 91ef6f00b7
2 changed files with 38 additions and 0 deletions


@@ -429,6 +429,13 @@ class SubcloudStateManager(manager.Manager):
            raise

        if update_state_only:
            # Ensure that the status alarm is consistent with the
            # subcloud's availability. This is required to compensate
            # for rare alarm update failures, which may occur during
            # availability updates.
            self._raise_or_clear_subcloud_status_alarm(subcloud.name,
                                                       availability_status)

            # Nothing has changed, but we want to send a state update for this
            # subcloud as an audit. Get the most up-to-date data.
            self._update_subcloud_state(context, subcloud.name,


@@ -1484,6 +1484,37 @@ class TestSubcloudManager(base.DCManagerTestCase):
        fake_dcmanager_cermon_api.subcloud_online.\
            assert_called_once_with(self.ctx, subcloud.region_name)

    @mock.patch.object(subcloud_state_manager.SubcloudStateManager,
                       '_raise_or_clear_subcloud_status_alarm')
    def test_update_state_only(self, mock_update_status_alarm):
        subcloud = self.create_subcloud_static(self.ctx, name='subcloud1')
        self.assertIsNotNone(subcloud)

        # Set the subcloud to online/unmanaged
        db_api.subcloud_update(self.ctx, subcloud.id,
                               management_state=dccommon_consts.MANAGEMENT_UNMANAGED,
                               availability_status=dccommon_consts.AVAILABILITY_ONLINE)

        ssm = subcloud_state_manager.SubcloudStateManager()

        with mock.patch.object(db_api, "subcloud_update") as subcloud_update_mock:
            ssm.update_subcloud_availability(self.ctx, subcloud.region_name,
                                             availability_status=dccommon_consts.AVAILABILITY_ONLINE,
                                             update_state_only=True)
            # Verify that the subcloud was not updated
            subcloud_update_mock.assert_not_called()

        # Verify alarm status update was attempted
        mock_update_status_alarm.assert_called_once()

        # Verify dcorch was notified
        self.fake_dcorch_api.update_subcloud_states.assert_called_once_with(
            self.ctx, subcloud.region_name, subcloud.management_state,
            dccommon_consts.AVAILABILITY_ONLINE)

        # Verify audits were not triggered
        self.fake_dcmanager_audit_api.trigger_subcloud_audits.assert_not_called()

    def test_update_subcloud_availability_go_online_unmanaged(self):
        # create a subcloud
        subcloud = self.create_subcloud_static(self.ctx, name='subcloud1')