282 Commits

Author SHA1 Message Date
Zuul
73e3241b6d Merge "Improve DOR Recovery banner to include all hosts and their status" 2025-04-10 17:36:03 +00:00
Zuul
34207b1895 Merge "Remove Start Host Service Launch in mtcAgent & enhance fault detection" 2025-04-10 17:36:02 +00:00
Zuul
f5bd70538b Merge "Send Node Locked command on pxeboot network as corrective action." 2025-04-04 12:58:21 +00:00
Eric MacDonald
1daa670126 Improve DOR Recovery banner to include all hosts and their status
A power outage of a running system is referred to by StarlingX as a
DOR or Dead Office Recovery event, where some or all of the servers
in a system lose power and then automatically regain power and reboot
once the power outage is resolved.

The Maintenance Agent (mtcAgent) detects a DOR event when it starts
up on a controller whose uptime is less than 15 minutes, and it
remains in an active DOR mode state for up to 20 minutes.
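A minimal sketch of that detection rule, using assumed helper names and the pre-update 15 minute threshold (the real logic lives inside the C++ mtcAgent, not in standalone helpers like these):

```cpp
#include <cassert>

// Illustrative values only: 15 minute uptime threshold (pre-update
// default) and a 20 minute active DOR-mode window.
constexpr int DOR_UPTIME_THRESHOLD_SECS = 15 * 60;
constexpr int DOR_MODE_WINDOW_SECS      = 20 * 60;

// DOR is inferred when the controller itself has just booted: a short
// uptime at mtcAgent startup implies the site may have lost power.
bool dor_mode_detect(int controller_uptime_secs)
{
    return controller_uptime_secs < DOR_UPTIME_THRESHOLD_SECS;
}

// Once detected, DOR mode remains active for a bounded window.
bool dor_mode_active(bool detected, int secs_since_detect)
{
    return detected && (secs_since_detect < DOR_MODE_WINDOW_SECS);
}
```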

Servers of different models, makes and vintages recover from a power
outage at different rates; some may take longer than others.
Therefore, while in DOR mode, maintenance is more forgiving towards
node recovery.

The existing implementation of DOR handling produces what is called a
"DOR Recovery" banner in its log files. The logs that comprise this
banner are produced at the times when maintenance detects those hosts'
recoveries. The banner can be displayed with the following command:

cat /var/log/mtcAgent.log | grep "DOR Recovery"

See DOR banner as comment in this review.

The issue leading to this update was that hosts which experienced no
heartbeat failure over the DOR recovery were not included in the DOR
recovery banner. The banner contains key performance indicators
(KPIs) for DOR recovery and is much more useful when all hosts are
included.

This update adds a heartbeat soak to the maintenance add handler as a
means to affirmatively detect and report successful DOR recoveries.

The following additional fixes were implemented:

- The hard coded 15 minute (900 second) DOR uptime threshold is
  sometimes too short for some servers/systems. On such systems,
  DOR mode is never activated during real DOR events, so the
  banner is not produced.
  This update raises that threshold to 20 minutes and makes it
  configurable through a dor_mode_detect label in mtc.conf.
- Added a DOR recovery count summary; x of y nodes recovered
  during DOR.
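The commit names the new mtc.conf label but not its section or units; a hypothetical excerpt, assuming seconds and an [agent] section, might look like:

```ini
; Hypothetical mtc.conf excerpt -- section placement and units are
; assumptions; the value shown matches the new 20 minute default.
[agent]
dor_mode_detect = 1200
```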

Test Plan:

PASS: Install AIO DX plus 3 worker nodes system

Verify DOR recovery banner after the following DOR recovery conditions

PASS: - all nodes recover enabled    ; DOR Recovery Banner all ENABLED
PASS: - node that recovers degraded  ; DOR Recovery Banner DEGRADED
PASS: - node that gracefully recovers; DOR Recovery Banner ENABLED
PASS: - node that fails heartbeat    ; DOR Recovery Banner FAILED
PASS: - node that fails goenabled    ; DOR Recovery Banner FAILED
PASS: - node that fails config error ; DOR Recovery Banner FAILED
PASS: - node that fails host services; DOR Recovery Banner FAILED
PASS: - node that never comes online ; DOR Recovery Banner OFFLINE
PASS: - node in auto recovery disable; DOR Recovery Banner DISABLE

Combination Cases:

PASS: - all worker nodes do not recover online; with and without MNFA.
PASS: - when one controller powers up 90 seconds after the other,
        one compute is unlocked-disabled-offline while the other
        computes are powered up 4 minutes after initial controller.
PASS: - when both controllers reboot but computes don't ; no power loss
PASS: - when one worker experiences a double config error while another
        worker experiences an UNHEALTHY error.
PASS: - when one controller never recovers online.
PASS: - when only one controller of a 2+3 node set recovers online.
PASS: - when worker nodes come up well before the controller nodes.
PASS: - when one controller and 1 worker is locked.

Regression:

PASS: Verify locked nodes do not show up in the banner.
PASS: Verify heartbeat loss handling due to spontaneous reboot.
PASS: Verify mtcAgent and mtcClient logging for all above test cases.
PASS: Verify uptimes reported in the DOR Recovery banner.
PASS: Verify results of overnight DOR soak ; 30+ DORs.

Depends-On: https://review.opendev.org/c/starlingx/metal/+/942744

Closes-Bug: 2100667
Change-Id: I2f28dd1fd6e8544b9cda9dedda2023b6f76ceeda
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2025-04-03 15:23:05 +00:00
Eric MacDonald
6106051f1c Remove Start Host Service Launch in mtcAgent & enhance fault detection
This update removes the Start Host Services launch from mtcAgent,
leveraging a recent enhancement (1) where the mtcClient auto runs
the Start Host Services just as it already does for GoEnabled tests.
Since mtcClient now handles the necessary scripts based on learned
node traits, mtcAgent no longer needs to initiate these requests.

1: https://review.opendev.org/c/starlingx/metal/+/925987

Therefore, this update removes the mtcAgent code that initiates
the Start Host Services request as well as the handling of those
request responses while maintaining support for launching
Stop Host Services in its existing form.

This is done largely to handle the occasionally lost test request
ACK and/or result messages over Swact or Dead Office Recovery
scenarios following the introduction of IPSec on the management
network. IPSec needs time to start up or switch over, thereby
creating the potential for lost messages during those short early
periods. Since Maintenance uses UDP rather than TCP for messaging,
maintenance itself is required to handle lost messages.

This update introduces out-of-band failure signaling for host services
and goEnabled failures from mtcClient to mtcAgent within existing
mtcAlive messaging. This way the mtcAgent is persistently informed of
test failures that may have happened while the mtcAgent wasn't running.

The mtcAgent is enhanced to monitor this new out-of-band failure
signaling. This way, asynchronous goEnable or Host Services failures
that may happen during DOR recovery or puppet apply, such as a failure
over a mtcClient process restart, are also detected and handled.
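A sketch of the out-of-band signaling idea, carrying sticky failure flag bits inside the periodic mtcAlive message (the flag names and layout here are assumptions, not the actual mtcAlive format):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical flag bits; the real message layout differs.
constexpr uint32_t MTC_FLAG__GOENABLED_FAILED     = 1u << 0;
constexpr uint32_t MTC_FLAG__HOST_SERVICES_FAILED = 1u << 1;

// mtcClient side: a failed test sets a sticky bit that rides along in
// every subsequent mtcAlive, so a lost result message is harmless.
uint32_t report_failure(uint32_t mtcalive_flags, uint32_t failure_bit)
{
    return mtcalive_flags | failure_bit;
}

// mtcAgent side: inspect the flags on every received mtcAlive, which
// persistently reports failures that occurred while the mtcAgent was
// not running.
bool goenabled_failed(uint32_t mtcalive_flags)
{
    return (mtcalive_flags & MTC_FLAG__GOENABLED_FAILED) != 0;
}
```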

Testing active controller failure modes revealed some inconsistent
handling that was modified to align with the following active
controller failure case design behavior ...

Simplex - self reboot, auto recover disable applies, lock and then
          unlock to recover from auto recovery disable mode.
Duplex  - with no enabled standby controller, the active controller
          raises critical alarm, remains enabled but goes degraded.
        - with enabled standby controller there is a force swact to
          enabled standby controller followed by a self reboot.

Note: Cannot fail the active controller in a DX system without an
      unlocked-enabled standby to take over. Would need to support
      auto recovery disable to avoid a perpetual reboot loop.
      Recovery from an auto recovery disable case requires a lock/
      unlock of the active controller which is only supported in SX.

The following other minor issues were found to affect this update's
testing and so were also fixed:
- added the host's uptime to each mtc daemon first startup log.
- added the 'disabled' state to the list of node states that bypass
  pxeboot network monitoring (aka alarming).
- removed the need to delay (default 20 seconds) before adding
  hosts over a process restart. No longer required with this update.
- fixed auto recovery disable algorithm.
- migrated SX auto recovery failed host dir path to /var/persist/mtc

Test Plan:

PASS: Verify start host services is ...
PASS:  - not auto-run on a locked node
PASS:  - main and subf scripts are auto-run for AIO controller
PASS:  - auto-run on a standard controller node
PASS:  - auto-run on a worker node
PASS:  - auto-run on a storage node

PASS: Verify new Start Host Services and GoEnabled failure signaling
PASS: Verify start host services failure handling for all traits
PASS: Verify start host services timeout handling for all traits
PASS: Verify start host services auto recovery disable over
      back to back failures

PASS: Verify start host service 'main' failure handling on ...
PASS: - simplex controller unlock
PASS: - simplex controller over mtcClient restart
PASS: - simplex controller auto recovery disable handling for this case
PASS: - active controller over mtcClient restart ; with enabled standby
PASS: - standby controller unlock
PASS: - unlocked standby controller over mtcClient restart
PASS: - locked standby controller over mtcClient restart
PASS: - worker or storage unlock
PASS: - worker or storage heartbeat failure.
PASS: - worker or storage over mtcClient restart
PASS: - standby controller handling that leads to auto recovery disable
PASS: - no behavior on a locked node
COND: - active controller over mtcClient restart ; no enabled standby
        Note: SM shuts the system down due to goenabled file removal.

PASS: Verify start host service 'subf' failure handling AIO controller
PASS: - simplex controller unlock
PASS: - simplex controller over mtcClient restart
PASS: - simplex controller auto recovery disable handling for this case
PASS: - active controller over mtcClient restart ; no enabled standby
PASS: - active controller over mtcClient restart ; with enabled standby
PASS: - standby controller unlock
PASS: - standby controller over mtcClient restart
PASS: - standby controller handling that leads to auto recovery disable

PASS: Verify goEnabled 'main' failure handling on
PASS: - simplex controller unlock
PASS: - simplex controller over mtcClient restart
PASS: - simplex controller auto recovery disable handling for this case
PASS: - active controller over mtcClient restart ; with enabled standby
PASS: - unlocked standby controller over mtcClient restart
PASS: - locked standby controller over mtcClient restart
PASS: - worker or storage unlock
PASS: - worker or storage heartbeat failure.
PASS: - worker or storage over mtcClient restart
PASS: - standby controller handling that leads to auto recovery disable
PASS: - no behavior on a locked node
COND: - active controller over mtcClient restart ; no enabled standby
        Note: SM shuts the system down due to goenabled file removal.

PASS: Verify goEnabled 'subf' failure handling on
PASS: - simplex controller unlock
PASS: - simplex controller over mtcClient restart
PASS: - simplex controller auto recovery disable handling for this case
PASS: - active controller over mtcClient restart ; no enabled standby
PASS: - active controller over mtcClient restart ; with enabled standby
PASS: - standby controller unlock
PASS: - standby controller over mtcClient restart
PASS: - standby controller handling that leads to auto recovery disable

Regression:

PASS: Verify persistent goEnable or Start Host Services failure on
      simplex leads to autorecovery disable with critical enable alarm.
PASS: Verify persistent goEnable or Start Host Services failure on
      Duplex with no enabled standby controller leads to
      enabled-degrade active controller with critical alarm.
PASS: Verify persistent goEnable or Start Host Services failure on
      Duplex with enabled standby controller leads to force swact to
      standby followed by a node failure of the previous active that
      leads to auto recovery disable with critical enable alarm.
PASS: Verify stop host services behavior over a node lock operation
PASS: Verify mtcAgent/mtcClient logging for all test case verification

Closes-Bug: 2100227
Change-Id: Ie1bfa7b89c39636ff5dd7330ab9f8b89e6c14e53
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2025-04-03 15:17:05 +00:00
Zuul
121425ce70 Merge "Make Hardware Monitor sensor list a thread local variable" 2025-03-31 12:33:40 +00:00
Eric MacDonald
a1256a3c32 Make Hardware Monitor sensor list a thread local variable
The current sensor list is shared across all hosts. On large systems,
this can lead to list corruption when host sensor read threads output
data concurrently.

This update moves sensor_list to be thread local, so each thread
gets its own unique instance. Although thread_local variables are not
on the stack, their memory is tied to the thread’s resources. In many
cases, this memory is drawn from the same per-thread region as the
stack, also known as TLS (Thread-Local Storage).

The TLS area is often allocated adjacent to or within the thread’s
stack mapping. A large thread_local variable increases the TLS
requirement, and if it exceeds the reserved space or overlaps with the
stack, thread creation may fail with Resource temporarily unavailable.

To accommodate this, the per-thread stack size was increased.
The sensor_list allocates for up to 512 sensors per host, which is
excessive. This update reduces the max sensors per host to 256, cutting
the list size from 327 KB to 163 KB per thread.

Even with this reduction, the thread stack size needed to be increased
from 128 KB to 512 KB. The Mtce Thread utility was updated to support
custom stack sizes. This allows mtcAgent to remain at 128 KB while
hwmond threads can specify a larger size.
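A small demonstration of the core change, under assumed names (sensor_sample, sensor_list): declaring the array thread_local gives each sensor-read thread a private instance, which is what removes the cross-host corruption:

```cpp
#include <cassert>
#include <thread>

#define MAX_HOST_SENSORS 256   // reduced from 512 by this update

struct sensor_sample { double value; };

// thread_local: every sensor-read thread gets its own private list.
// (The real code also raises the pthread stack size, e.g. via
// pthread_attr_setstacksize, to make room for the larger TLS area.)
thread_local sensor_sample sensor_list[MAX_HOST_SENSORS];

// Show that a child thread sees its own zero-initialized instance and
// cannot corrupt the parent thread's copy.
bool list_is_private()
{
    sensor_list[0].value = 1.0;        // parent thread's copy
    bool child_saw_zero = false;
    std::thread t([&] {
        child_saw_zero = (sensor_list[0].value == 0.0);
        sensor_list[0].value = 2.0;    // touches only the child's copy
    });
    t.join();
    return child_saw_zero && sensor_list[0].value == 1.0;
}
```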

This update also adds a debug feature to create dated sensor reading
files for each host. While testing, it was found that output files were
created with inconsistent permissions. This update fixes the file mode
to 0644.

Test Plan: Verified in 2+2+50 node system

PASS: Verify large system install and sensor monitoring
PASS: Verify large system sensor monitoring over DOR and Swact
PASS: Verify the sensor_sample list storage is unique per thread
PASS: Verify sensor read file permissions
PASS: Verify dated debug sensor read files
PASS: Verify added debug options are disabled by default
PASS: Verify 24 hour provision/monitor/deprovision soak
PASS: Verify sensor monitoring following host delete and readd
PASS: Verify sensor model is deleted completely with host delete
PASS: Verify sensor model is recreated over host readd

Regression:

PASS: Verify sensor monitoring and alarm management
PASS: Verify hardware monitor process restart handling
PASS: Verify no coredumps
PASS: Verify logging for all test cases

Closes-Bug: 2102671
Change-Id: I9263ec2242e03d46e9dc768af965fed7e1ac9175
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2025-03-29 01:23:16 +00:00
Eric MacDonald
cbcb19420c Send Node Locked command on pxeboot network as corrective action.
This update affects only locked nodes.

If a remote node fails early config in a way that prevents IPSec
over management from being established, and no cluster interface
is configured or provisioned, then Node Locked commands sent from
mtcAgent over management and cluster networks are not received by
mtcClient.

This leads to a perpetual watchdog reset loop. The pmon process fails
to reach the configured state, and without the presence of
the .node_locked file, the watchdog treats the node as unlocked.
A quorum failure triggers a crashdump reset, repeating indefinitely.

The mtcAgent detects this and attempts corrective action by resending
the Node Locked command over the same failing networks, which also
fails.

This update adds a fallback: the Node Locked command is also sent
over the pxeboot network.

Testing also revealed that mtcClient socket recovery stops at the
first socket failure rather than trying to recover them all.

This update improves socket recovery by attempting all sockets in
order. The pxeboot socket is tried first, now followed by management
and cluster sockets.
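The improved recovery loop can be sketched like this (the function shape is an assumption; the mtcClient's real socket init routines are member functions):

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Attempt every socket init in order (pxeboot first, then management,
// then cluster) rather than aborting on the first failure, so one bad
// interface cannot block recovery of the others.
// Returns how many sockets recovered.
int recover_sockets(const std::vector<std::function<bool()>>& init_fns)
{
    int recovered = 0;
    for (const auto& init : init_fns)
        if (init())            // keep going even when one init fails
            ++recovered;
    return recovered;
}
```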

Test Plan:

PASS: Verify mtcClient socket init and failure recovery handling.
PASS: Verify the mtcAgent sends the Node Locked command on the
      pxeboot network when it sees a node locked state mismatch.
PASS: Verify a locked node with failing management and cluster
      networking will get the node locked command serviced and
      node locked file produced as expected on the remote node.
      This event is noted by the following host specific mtcAgent log.

      "hostname mtcAlive reporting unlocked while locked ; correcting"

      Note: before this update we see the above 'correcting' log
            every 5 seconds. With this update we see that log only
            once and the remote node does not go into a perpetual
            crashdump loop.

      Note: The host watchdog will not force a quorum failure
            crashdump if the /var/run/.node_locked file is present.

Closes-Bug: 2103863
Change-Id: I020c7ebe1e83254c52219546ec938f6cf3284c2e
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2025-03-25 12:48:10 +00:00
Eric MacDonald
0bd607fdcd Extend bmc secret timeout and randomize retry delay
The bmc secret query utility and fsm are largely common to the
mtcAgent and hwmond processes.

This update extends the timeout for bmc secret requests from 5 to 20
seconds.
This lines up with other http request timeouts and gives at scale
systems more time to respond during busy controller use cases like
swact or dead office recovery.

In the event of a failure, this update also randomizes the retry delay
between 10 and 100 seconds per host so that retries don't happen in
bulk all at once. This spreads the load that secret request query
retries place on system inventory.
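A sketch of the per-host randomized retry delay, assuming a simple uniform draw (the helper name is made up):

```cpp
#include <cassert>
#include <random>

// Each host draws an independent delay in [10, 100] seconds so bmc
// secret query retries are spread out rather than firing in bulk.
int bmc_secret_retry_delay_secs()
{
    static std::mt19937 gen(std::random_device{}());
    std::uniform_int_distribution<int> dist(10, 100);
    return dist(gen);
}
```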

Testing also revealed that the hardware monitor was not freeing its
connection resources after a successful bmc secret query.
This update fixes that and adds a failure handling path for the case
where the bmc secret response payload is empty. Similar handling was
fine for the mtcAgent.

Test Plan:

PASS: Verify large system bmc secret fetch over DOR soak;5 loops
PASS: Verify randomized delay handling upon all failure case handling.
PASS: Verify large system bmc provisioning/deprovisioning soak;10 loops
PASS: Verify large system swact soak;30 loops
PASS: Verify mtcAgent and hwmond logging over all test cases

Closes-Bug: 2103925
Change-Id: I4a696a7d5e4452a8fbd9f25cf11ddf0f065dbe1a
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2025-03-25 12:42:51 +00:00
Eric MacDonald
4074c6c25d Move the mtce /etc/mtc/tmp/.node_locked flag file out of /etc
When the mtcAgent locks a node, it commands the mtcClient to create a
persistent .node_locked flag file at /etc/mtc/tmp/.node_locked.
Conversely, when the node is unlocked, the mtcAgent commands the
mtcClient to remove this flag file.

However, an issue arises where an unlocked node may still have the
.node_locked file present after an upgrade-rollback or patch-removal
operation.

The issue occurs because the OSTree upgrade deployment process runs
while the node is locked. During this process, OSTree takes a snapshot
of the /etc directory, which includes the .node_locked file.

Even if the file is later removed by maintenance actions, after deploy
but before reboot, OSTree restores it from the snapshot resulting in
the reinstatement of the .node_locked file on an unlocked node.

To eliminate this file management conflict, this update moves the
persistent .node_locked flag file to a location outside of OSTree's
management, specifically to /var/persist/mtc/.node_locked.

The directory name 'persist' was chosen to clearly indicate that the
files in this directory are intended to persist across reboots.

This update also fixes a post install script logging error seen when
trying to rename the hwclock.sh.<init>.bak file while one is already
present.

Test Plan:

PASS: Verify the creation of the new /var/persist/mtc directory.
PASS: Verify any files under this directory persist over reboot.
PASS: Verify proper management of the node locked file over upgrade
      and rollback.
PASS: Install an AIO DX and verify the node locked file management.
PASS: Verify AIO DX upgrade from MR2PLUS to 24.09 master.
PASS: Install a Standard DX System with 1 worker and 2 storage and
      verify the node locked file management over and following an
      upgrade from MR2PLUS to 24.09 master.
PASS: Verify obsoleted /etc/mtc/tmp/.node_locked file is auto removed
      by both package install and over a mtcClient startup/restart.
PASS: Verify /etc/mtc/tmp dir remains.
PASS: Verify mtce debian package installs without error or warning.

Closes-Bug: 2095212
Change-Id: I3431abfef74c678fbeaa149bf6ac29ee254be111
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2025-02-06 16:54:33 +00:00
Eric MacDonald
0853bb3fcc Add configured add host delay to mtcAgent
The active 'controller' domain name is used by the mtcAgent
management interface to communicate with the mtcClient.

The System Swact (Switch Activity) function dynamically migrates
active controller services between controller-0 and controller-1.
During this process, the mtcAgent, along with other services, are
restarted on the newly active controller.

When the mtcAgent starts, it reads the system inventory and adds the
hosts to its internal control structure. During this "add" operation,
the mtcAgent sends commands and expects responses from the local and
remote mtcClients on individual nodes, using the controller domain
name, which represents the management network's floating IP address.

A new feature, the FQDN (Fully Qualified Domain Name) Resolution
Manager, was introduced to handle domain name resolution in the
StarlingX system. However, an issue was identified where the FQDN
resolution manager does not have the 'controller' domain name
resolution support fully available (qualified) when the mtcAgent
starts messaging with its mtcClients.

As a result, the communication between the mtcAgent and mtcClient can
lead to silent message loss. This issue can cause the "add host"
operation to fail, potentially being service affecting for that host.

This update adds a small, manually configurable delay to the start
of the mtcAgent host add operation. This gives FQDN time to complete
setting up name resolution for the required 'controller' domain name.

The default add_host_delay of 20 seconds was selected after seeing
occasional failures with a 10 second delay.

This update can be removed in the future if the system makes changes
to avoid starting the mtcAgent before all name resolution is ready.
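Based on the description above, a hypothetical mtc.conf fragment for this setting (the section name is an assumption; the label name and the disable-via-0 behavior come from the commit text):

```ini
; Hypothetical mtc.conf excerpt -- section placement is an assumption.
[agent]
; Seconds to wait before starting host add operations.
; Set to 0 or remove the label to impose no delay.
add_host_delay = 20
```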

Test Plan:

PASS: Verify issue in system, apply update, verify issue is resolved.
PASS: Verify package/iso build along with AIO DX system install.
PASS: Verify mtcAgent logging.

Regression:

PASS: Verify standby controller lock/unlock soak ; 10+ loops.
PASS: Verify Swact soak of 20+ swacts succeeds without reproducing
      the issue this update is designed to fix.
PASS: Verify heart beating is enabled on all remote hosts on both
      controllers following an install and multiple Swacts.
PASS: Verify sensor monitoring is enabled on all hosts that have
      their BMC provisioned over a Swact.
PASS: Verify mtcClient, mtcAgent, hbsAgent and hbsClient logs for
      unexpected behavior.
PASS: Verify the default add host delay can be changed and a mtcAgent
      configuration reload or process restart uses the modified value.
PASS: Verify no add host delay is imposed if the new configuration
      label is removed from the config file or set to 0.
PASS: Verify host lock immediately following a swact and
      successful system host-list.

Closes-Bug: 2093381
Change-Id: I694322eff0945c7c56bf21051b3d6cccacf829a2
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2025-01-10 12:54:46 +00:00
Zuul
1db4094905 Merge "Remove sw-patch-agent from kickstart" 2024-11-26 18:57:22 +00:00
mmachado
1924d5191a Remove sw-patch-agent from kickstart
sw-patch is being removed and mentions of it should be removed
as they serve no purpose.

Depends-On: https://review.opendev.org/c/starlingx/update/+/934968

Test-Plan:
PASS: AIO-SX upgrade using sw-manager strategy
PASS: AIO-DX System Controller upgrade using strategy
PASS: subcloud upgrade using dcmanager strategy
PASS: AIO-DX initial install and bootstrap

Story: 2010676
Task: 51401

Change-Id: I980aafc59b2abf6ecf405add8cdeef7ae4b3a7a3
Signed-off-by: mmachado <mmachado@windriver.com>
2024-11-25 19:34:53 +00:00
Jim Gauld
d368475197 Configure systemd CPUShares/CPUQuota for pmon.service
This update reduces CPUShares and CPUQuota for pmon.service.
Reduced shares and quota are appropriate since pmon.service has
sporadic CPU usage yet is not latency critical. Significant hirunner
CPU usage comes from various audits (unrelated to the pmon process
itself) running under the systemd pmon.service cgroup.

For example, the ceph health and ceph osd audits can easily require
100% CPU for several seconds, often holding 30% occupancy for
multiple seconds.

This reduces the pmon cgroup to 150 CPUShares from 1024 and sets
CPUQuota to 15%. This smooths out the behaviour of poorly behaved
audits, effectively slowing the audit behaviour by a few seconds due
to throttling.
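For reference, the equivalent settings expressed as a systemd drop-in (the commit does not state how the values are delivered; the shipped pmon.service unit may set these directives directly):

```ini
# Hypothetical drop-in: /etc/systemd/system/pmon.service.d/cpu.conf
[Service]
CPUShares=150
CPUQuota=15%
```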

This is part of an overall set of adjustments required for the
systemd cgroup CPUShares, CPUQuota, and AllowedCPUs settings of key
system services. This will improve latency of Kubernetes critical
components while throttling less important services.

Partial-Bug: 2084714

TEST PLAN:
AIO-SX, AIO-DX, Standard, Storage, DC, AIO-DX with ceph:
- PASS: Fresh install
- PASS: verify systemd parameters for pmon

        Example:
        systemctl show pmon.service | grep -e CPUShares -e CPUQuota

AIO-SX, AIO-DX:
- PASS: B&R

AIO-DX:
- PASS: K8S orchestrated upgrade 1.24 - 1.29
- TODO: controller swact

Change-Id: I6ee5c6029c2a5a0fae26e9231401e4d4f1c016df
Signed-off-by: Jim Gauld <James.Gauld@windriver.com>
2024-11-15 09:21:05 -05:00
fperez
e62642e97f Fix intermittent process failure alarms auto-clear issues
This commit addresses the issue of intermittent failures that occur
when errors are encountered while opening files with extra text for
specific processes. These errors led to mismatches between the
entity_instance_id of the created alarm and the alarm being deleted.

With this commit, the extra text is now appended only to the alarm
when it is created, and it will not be considered when the system
attempts to remove the alarm. This change helps prevent the
mismatches caused by file errors and ensures alarms are handled
correctly.

Test plan
PASS: Build package.
PASS: Install package and bootstrap system
PASS: Use Eric MacDonald's pmon regression tests to verify
      behavior.

closes-bug: 2078986

Change-Id: I622450c45770d251d62a80ccb964c65ce9e4d935
Signed-off-by: fperez <fabrizio.perez@windriver.com>
2024-09-11 16:09:12 -03:00
Eric MacDonald
dab9c4774b Maintenance does not auto-start worker host services in AIO
The mtcClient is required to 'start host services' autonomously
following a node reboot. This handles the use case where
the administrator disables maintenance heartbeat loss auto recovery.
If that node then reboots on its own, for whatever reason, maintenance
needs to ensure that it auto starts 'host services'.

A fairly recent update delivered support for that use case:

    https://opendev.org/starlingx/metal/commit/
    1335bc484df331771e995ae822df3af84cc5739d

However, the mechanism the mtcClient used to manage auto-starting
host services did not handle the worker subfunction case.
Moreover, the implementation did not handle the potential
concurrency between mtcClient process startup and mtcAgent
requests during unlock recovery.

This update also fixes an issue where the mtcClient sometimes gets
into a mode where it floods the mtcAgent with a start host services
result message ; 20 unnecessary messages / sec. The aforementioned
update modified the mtcAgent to log receipt of this message which
then floods the mtcAgent log leading to unnecessary message handling
and log rotations.

Test Plan:

Success Path:

PASS: Verify mtcClient success path handling of start and stop host
      services function for the various node types in a ...
      - standard system with worker and storage nodes
      - all-in-one system with worker node
PASS: Verify appropriate start host services are run on each node
      type following a Dead Office Recovery (DOR).
      - standard system with worker and storage nodes
      - all-in-one system with worker node
PASS: Verify the mtcClient does not unnecessarily send host services
      result messages.
PASS: Verify handling of periodic start host services message while
      a node is in service.

Failure Path:

PASS: Verify mtcClient failure path handling of start and stop host
      services function for the various node types in a ...
      - standard system with worker and storage nodes
      - all-in-one system with worker node

PASS: Verify mtcClient start host services command handling when
      message requests interleave with auto start handling
      during unlock recovery.

Closes-Bug: 2073802
Change-Id: I0da7a16c1f600cc60364f6bcec7587e2ff71c624
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2024-08-09 14:48:05 +00:00
Eric MacDonald
50204147ff Add PS Redundancy Sensor to Redfish server power sensor group
This update adds the Power Supply Redundancy sensor to the redfish
server power sensor group.

Some special handling is required so that the assertion of this
new sensor has a 'major' severity level and applies only while there
are 2 or more power supplies provisioned. See the code comments in
the review that highlight that the assertion only applies when the
redundancy sensor count is 2 and the severity is overridden from
critical to major.

This update does not apply to the IPMI 'server power' sensor group.
This is because the IPMI protocol does not distinguish between single
and redundant power supply provisioning cases and reports a redundancy
loss in the single power supply case even when that power supply is
operating fine.

Test Plan:

PASS: Verify new PS Redundancy sensor is added to the server
      power sensor group with redfish sensor monitoring.
PASS: Verify no PS Redundancy assertion with redundant power
      supplies installed while both have AC power input.
PASS: Verify major PS Redundancy assertion with redundant power
      supplies installed while one not receiving AC power input.
PASS: Verify no PS Redundancy assertion with single power supply.

PASS: Verify PS Redundancy sensor goes offline when 'state' is
      not 'Enabled' and returns to operating state when re-Enabled.

PASS: Verify PS Redundancy sensor goes 'offline' when
      Redundancy label is missing.
PASS: Verify PS Redundancy sensor goes 'offline' when
      RedundancySet count is missing.
PASS: Verify PS Redundancy sensor goes 'offline' when
      Status label is missing.

PASS: Verify PS Redundancy sensor assertion when Status:Health
      is not 'OK'.
PASS: Verify PS Redundancy sensor goes 'offline' when Status:State
      is not 'Enabled'.
PASS: Verify new PS Redundancy sensor survives a process restart.
PASS: Verify new PS Redundancy sensor asserts with non-OK status
      while redundancy count is greater than one.

Regression:

PASS: Verify host is degraded when PS redundancy alarm is asserted.
PASS: Verify alarm and degrade is cleared if sensor reads OK.
PASS: Verify alarm and degrade is cleared if sensor goes offline.
PASS: Verify a 'logged-major' PS Redundancy assertion raises alarm
      when the group action is changed to 'alarm'.
PASS: Verify an 'alarm-major' PS Redundancy assertion clears alarm
      when the group action is changed to 'log'.
PASS: Verify no PS Redundancy sensor is added to the server
      power sensor group with ipmi sensor monitoring.
PASS: Verify no PS Redundancy assertion with single or redundant
      power supplies with ipmi sensor monitoring.
PASS: Verify all sensor assertions are cleared when a server's BMC
      is reprovisioned by bm_type or bm_ip address or completely
      deprovisioned by bm_type=none.
PASS: Verify basic hardware monitor sensor assertion/clear operations.

Closes-Bug: 2076200
Change-Id: Ieae8f2b8681d1a2b29da0707b2f439cf10c47a2c
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2024-08-07 12:23:40 +00:00
Eric MacDonald
b29fb32f60 Clear 200.014 sensor=profile alarm over model relearn and deprovision
The 200.014 Sensor Config sensor=profile alarm does not get cleared
over a Sensor Profile Relearn nor a BMC Deprovision action.

This can lead to a stuck alarm if the sensor read / group create
issue never resolves. Sensor alarms against a host must be deleted
if the BMC for that host is deprovisioned.

This update removes the long-obsolete sensor=sensors alarm
references and adds a clear of the sensor config "profile" alarm to
the 'sensor group profile relearn' and 'bmc deprovisioning' code paths.

Test Plan:

PASS: Verify sensor config profile alarm is deleted when
PASS: - sensor model is relearned
PASS: - bmc deprovisioned
PASS: - sensor model is properly created (FIT tested)
PASS: Verify raised 200.014 alarm persists over a hwmond restart

Regression:

PASS: Verify basic hardware monitoring and alarming
PASS: Verify sensor deprovisioning
PASS: Verify sensor model relearn operation
PASS: Verify sensor alarming and clear function

Closes-Bug: 2074760
Change-Id: I3165105e9e4e933ab7b723bd0b6241a6a2b046ae
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2024-07-30 16:16:33 +00:00
Eric MacDonald
fd66519339 Fix Start Host Services race condition
The following update, merged in early June, introduced a change to
the mtcClient to auto-run the Start Host Services command on process
startup like it does for the goenable tests.

https://opendev.org/starlingx/metal/
        commit/1335bc484df331771e995ae822df3af84cc5739d

This change introduced the potential for a race condition that did not
occur during the testing of that update, likely due to its low
reproduction rate.

With that update in place it is possible for maintenance to receive
the acknowledgement of a "Start Host Services" request followed
immediately by the "Start Host Services Result" message.

Receiving these messages back to back in a batch does not give
maintenance enough time to update its command handler with the
next expected message. The Command handler is a separate time-sliced
FSM that needs to run at least once following the start request's
message ack. Otherwise, the result message is dropped which leads
to a Start Host Services timeout.

The fix is to accept a "Start Host Services Result" response anytime
it arrives while a "Start Host Services" request is outstanding.
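The fix described above can be sketched as a minimal command handler. This is an illustrative model only, not the actual mtce code; the struct, field and message names are made up for the sketch:

```cpp
// Sketch: accept a "Start Host Services Result" whenever the request is
// still outstanding, even if the time-sliced command FSM has not yet run
// after the request's ack. All names here are illustrative.
#include <string>

struct CommandFsm {
    bool request_outstanding = false;  // set when the request is sent
    bool result_received     = false;

    void send_start_host_services() { request_outstanding = true; }

    // Returns true if the message was consumed rather than dropped.
    bool handle_message(const std::string& msg) {
        if (msg == "start_host_services_result") {
            // Old behavior dropped a result that arrived before the FSM
            // advanced past the ack; the fix accepts it any time the
            // request is outstanding.
            if (request_outstanding) {
                result_received = true;
                request_outstanding = false;
                return true;
            }
            return false;  // an unsolicited result is still dropped
        }
        return false;
    }
};
```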

Test Plan:

PASS: Verify issue occurs at a rate greater than 75% and then apply
      this change and verify there are no failures in a lock/unlock
      soak of 100 iterations.

Closes-Bug: 2073802
Change-Id: I657e5fd917073f6c7a37dc13517559a9740a62e9
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2024-07-23 15:54:42 +00:00
Eric MacDonald
fb36d3b810 Prevent maintenance setup of the pxeboot network on simplex systems
The pxeboot network is used to install system nodes.
However, simplex systems do not have system nodes.
Therefore, the pxeboot network setup is not needed on SX systems.

This update implements changes to Maintenance, specifically the
mtcAgent and mtcClient processes, to not set up and service messaging
on the pxeboot network on simplex systems.

Test Plan:

PASS: Verify before and after update behavior
PASS: Verify Build, install and enable AIO SX
PASS: Verify the pxeboot network is not setup on SX systems
PASS: Verify pxeboot messaging and alarming works on DX systems
PASS: Verify install and enable DX systems with no pxeboot alarms
PASS: Verify mtcAgent and mtcClient logging
PASS: Verify SX to DX Migration

Closes-Bug: 2073292
Change-Id: I0e3749bab29d88917f36bc29e8b775dfd5e8a13f
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2024-07-23 13:51:34 +00:00
Zuul
13df7262c0 Merge "Remove CentOS/OpenSUSE build support" 2024-07-10 12:08:09 +00:00
Kyale, Eliud
94b9761011 Replace bmc system() commands with fork() execv()
Mtce uses the system() command to run the ipmitool and redfishtool.
The system() command launches a shell process that is susceptible
to code injection.
By switching to fork()/execv() we can prevent command injection attacks
if, for example, the bmc parameters are compromised.

The bmc parameters are:
- bm_type
- bm_ip
- bm_username
- bm_password

These are initially provided as user input and stored
in either barbican (bm_password) or the sysinv postgres database.

If these parameters are compromised, the injected code will not be run.
For example, if bm_username="root; reboot&"
the reboot command will not be run.
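The pattern can be sketched as below. This is a hedged illustration, not the mtce implementation; the function names and the flat argument list are assumptions made for the sketch. The key point is that with execv() each bmc parameter is a single argv element, so shell metacharacters in a compromised value such as "root; reboot&" are handed to ipmitool verbatim instead of being interpreted by a shell:

```cpp
// Sketch of replacing system() with fork()/execv(). Names illustrative.
#include <sys/wait.h>
#include <unistd.h>
#include <string>
#include <vector>

std::vector<std::string> build_ipmitool_args(const std::string& bm_ip,
                                             const std::string& bm_username,
                                             const std::string& request)
{
    // Each value is one argv element ; no shell parsing ever happens.
    return { "/usr/bin/ipmitool", "-H", bm_ip, "-U", bm_username, request };
}

int run_tool(const std::vector<std::string>& args)
{
    std::vector<char*> argv;
    for (const auto& a : args)
        argv.push_back(const_cast<char*>(a.c_str()));
    argv.push_back(nullptr);

    pid_t pid = fork();
    if (pid < 0)
        return -1;                     // fork failure path
    if (pid == 0) {
        execv(argv[0], argv.data());   // no shell involved: no injection
        _exit(127);                    // execv failure path
    }
    int status = 0;
    waitpid(pid, &status, 0);
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```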

Test plan:

PASS - Code testing: designer testing of failure paths, verifying logs
                     by compiling errors in the code
               - fork fail error path
               - file open failure path
               - dup/dup2 failure path
               - execv failure

PASS - AIO-SX: iso install
PASS - AIO-DX: iso install
PASS - AIO-SX: ipmi bmc sensor/device queries
               system host-sensor-list <controller-0>
PASS - AIO-SX: ipmi bmc reset
               designer modification of sysinv to allow simplex reset
PASS - AIO-SX: modify bmc parameters in postgres
               and verify bmc command failure and proper handling
               e.g bm_username="root; reboot&"
PASS - AIO-SX: file leak testing of execv error path
               sudo lsof -p `pidof mtcAgent`
               sudo lsof -p `pidof hwmond`
PASS - AIO-SX: memory leak and file leak testing soak
               sudo /usr/sbin/dmemchk.sh --C mtcAgent hwmond
PASS - AIO-DX: ipmi bmc reset
               Virtual machine AIO-DX configured to physical bmc
               simulate reset on virtual machine by power down
               at the same time as system host-reset <controller>
PASS - AIO-DX: ipmi bmc sensor/device queries
               system host-sensor-list <controller-0|1>

Example postgres commands to compromise the bm_username parameter:

sudo -u postgres \
psql -d sysinv \
-c "select bm_username from i_host where hostname='controller-0';"

sudo -u postgres \
psql -d sysinv \
-c \
"update i_host set bm_username='root; reboot&' "\
"where hostname='controller-0';"

Story: 2011095
Task: 50344

Change-Id: I250900d1c757d7e04058f4c954502b1a38db235e
Signed-off-by: Kyale, Eliud <Eliud.Kyale@windriver.com>
2024-06-13 14:43:44 -04:00
Zuul
dbb9543c08 Merge "Deprovision mtcClient bmc info when bmc for node is deprovisioned" 2024-06-10 16:37:24 +00:00
Eric MacDonald
508b619400 Deprovision mtcClient bmc info when bmc for node is deprovisioned
A node's BMC is provisioned and deprovisioned through the system CLI.

Maintenance shares controller node BMC provisioning info with
the mtcClient on each controller node. The mtcClient uses this
BMC provisioning info to reset its peer controller when it sees
the appropriate signal from SM (a flag file).

However, when a controller node's BMC is deprovisioned from the
system CLI, the mtcAgent does not send the deprovisioning data
to the mtcClient. Without that data the mtcClient will continue
to use the previous provisioning data. This is incorrect and the
reason for this fix.

This update fixes this by having the mtcAgent periodically share
controller node BMC provisioning data with each controller's mtcClient
regardless of its provisioning state. The BMC provisioning data update
period remains the same as it was while the BMCs were provisioned.

This update also offers the following messaging/logging improvements.

 - restrict the updates to the management network only.
   There is no need to send the same data over the pxeboot.

 - stop logging while the BMC is deprovisioned. The absence/presence
   of the logs is sufficient to know what the provisioning state is
   without needlessly logging when the BMCs are not provisioned.

 - bypass sending the bmc provisioning data to the controller-0
   mtcClient in an SX system. The data is only needed in a DX system.

Test Plan:

PASS: Verify mtcClient gets BMC deprovisioning data ; fix for this bug.
PASS: Verify mtcClient periodically logs valid BMC provisioning data.
PASS: Verify mtcClient doesn't log unprovisioned BMC provisioning data.
PASS: Verify mtcAgent does not send bmc provision data on SX systems.
PASS: Verify mtcAgent does send bmc provision data on DX systems.
PASS: Verify worker and storage never receive bmc provisioning data.

Closes-Bug: 2067925
Change-Id: I29e5eb0b072ee38358d99d682555c466de322f2d
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2024-06-05 15:27:19 +00:00
Eric MacDonald
1335bc484d Add auto run goenabled and start hosts services to mtcClient
The 'mtcClient' currently automatically runs the main function's
'goenabled' scripts on process startup for all nodes if and when
their run preconditions are met.

However, that is not true for 'start host services' and, in the AIO
system type case, the subfunction 'goenabled' scripts.

Typically, this is acceptable because the 'mtcAgent' will request
these scripts to be run during unlock and failure recovery scenarios.

However, if the system administrator reconfigures the maintenance
heartbeat fault handling action from the default 'fail' to any other
setting [degrade,alarm,none] and a node reboots outside of maintenance
control, then upon reboot recovery, the 'start host services' and,
if the node is an AIO controller, the required subfunction 'goenabled'
scripts are not executed. In such a case, the missing subfunction
'goenabled' flag file (/var/run/goenabled_subf) prevents the hbsAgent
and hbsClient on that node from entering their in-service mode of
operation. Instead they run waiting for the node's In-Test phase to
complete, which never happens.

This can lead to what appears to be stuck maintenance heartbeat alarms.
However, it is really caused by the maintenance heartbeat processes on
that node being gated from performing their mission mode function.

The /var/run/goenabled_subf flag file is the AIO In-Test complete gate.
It is set if the subfunction 'goenabled' tests pass. However, because
this flag file is in /var/run (a volatile directory) it is lost/cleared
over a reboot.

This update adds the automatic execution of the AIO controller's
subfunction 'goenabled' scripts and the 'start host services' for
all nodes. Once all the required preconditions are met the scripts
are run and that node is ready for service, regardless of how and
under which conditions it rebooted.

Testing of this update is focused on
- Verifying the originating issue is resolved.
- Verify the changed behavior over the install of all system types.
- Verify the changed behavior with an uncontrolled reboot or each
  node type for all the supported maintenance heartbeat failure
  action modes.

Test Plan:

PASS: Verify install of the following system types
PASS: - AIO SX
PASS: - AIO DX and AIO DX Plus
PASS: - Standard DX with worker and storage nodes (vbox)
PASS: - System Controller with 1 subcloud (dc-libvirt)

PASS: Verify spontaneous reboot of unlocked active AIO controller with
PASS: - heartbeat_failure_action=fail
PASS: - heartbeat_failure_action=degrade
PASS: - heartbeat_failure_action=alarm
PASS: - heartbeat_failure_action=none

PASS: Verify spontaneous reboot of unlocked standby AIO controller with
PASS:  - heartbeat_failure_action=fail
PASS:  - heartbeat_failure_action=degrade
PASS:  - heartbeat_failure_action=alarm
PASS:  - heartbeat_failure_action=none

PASS: Verify reboot recovery after spontaneous reboot of worker
PASS: Verify reboot recovery after spontaneous reboot of storage
PASS: Verify start host services is run on mtcClient process startup.
PASS: Verify start host services is run on worker and storage nodes
      when rebooted with all heartbeat failure recovery action modes.

Regression:

PASS: Verify degrade and alarm management over in-service heartbeat
      failure when heartbeat_failure_action=fail
PASS: Verify degrade and alarm management over in-service heartbeat
      failure when heartbeat_failure_action=degrade
PASS: Verify degrade and alarm management over in-service heartbeat
      failure when heartbeat_failure_action=alarm
PASS: Verify no alarm or degrade over in-service heartbeat
      failure when heartbeat_failure_action=none
PASS: Verify mtcClient over AIO standby controller lock/unlock
PASS: Verify start host services is run on mtcClient on every node
      by command from mtcAgent process startup.
PASS: Verify start host services is run on mtcClient over a unlock or
      graceful recovery by command from mtcAgent.
PASS: Verify start host services check follows goenabled test
      completion on process startup.
PASS: Verify stop host services is run over a node lock.
PASS: Verify goenable main and subfunction failure handling
PASS: Verify start hosts service failure handling
PASS: Verify no coredump or crashdumps
PASS: Verify no stuck alarms

Closes-Bug: 2067917
Change-Id: Ie8aaf5da20b092267f637ad3df125019c244991b
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2024-06-04 19:42:54 +00:00
Scott Little
b31d11314b Remove CentOS/OpenSUSE build support
StarlingX stopped supporting CentOS builds after release 7.0.
This update strips CentOS from our code base. It also removes
references to the failed OpenSUSE feature.

There are centos references in the kickstarts that still appear to be
packaged in the debian build.  I won't touch those.

Story: 2011110
Task: 49956
Change-Id: Ifb5aa75b71a17db52e66d6fd91e7c52ed931532d
Signed-off-by: Scott Little <scott.little@windriver.com>
2024-05-02 16:01:04 -04:00
Eric MacDonald
4e62e3ac9f Prevent process coredump due to missing token in response header
Both Maintenance and the Hardware Monitor use a common token refresh
utility that has been seen to crash the calling process when a token
'get' request is missing the token in its response header.

This update avoids that by exiting the token handler at error
detection point rather than continue handling the response with
invalid data.
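The early-exit pattern can be sketched as follows. This is a minimal illustration under assumptions: the struct, status values and helper name are made up, though X-Subject-Token is the standard Keystone token response header:

```cpp
// Sketch: exit the token 'get' handler at the point the token is found
// missing from the response header, rather than carry on with invalid
// data. Names and status values are illustrative.
#include <map>
#include <string>

struct TokenEvent {
    std::string token;
    int status = 0;   // 0 = ok, -1 = token missing (illustrative codes)
};

int handle_token_response(const std::map<std::string, std::string>& headers,
                          TokenEvent& event)
{
    auto it = headers.find("X-Subject-Token");
    if (it == headers.end() || it->second.empty()) {
        event.status = -1;     // record the failure for retry handling
        return event.status;   // early exit: never parse an absent token
    }
    event.token = it->second;
    event.status = 0;
    return event.status;
}
```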

Significant fault insertion testing was performed on the update,
which led to some additional improvements in token request error
handling that both processes benefit from.

Additional specific fixes include
- fixed race condition memory leak around authentication error handling
- differentiate token refresh from failure recovery renewal.
- fixed a few missing event status / rc updates.

Test Plan:
 - used mtce fault insertion tools to create failure modes
 - 24+ hr memory leak test run for both success & token error handling
 - all tests were done with both hwmond and mtcAgent

PASS: Verify build and AIO DX install.
PASS: Verify reported hwmon coredump issue is avoided/resolved.
PASS: Verify issue also exists in the mtcAgent and is also
      avoided/resolved by this update.

Regression:

PASS: Verify token get failure retry handling:
PASS: - get first token inline - retry cadence: 5 seconds
PASS: - refresh token by http  - retry cadence: 10, 30 and 1200 secs
PASS: Verify recovery handling cases:
PASS: - corrupt token
PASS: - no token present
PASS: - no token in header
PASS: Verify token renewal stress soak ; every 10 seconds for 24+ hrs
PASS: - repeat over token get failure cases
PASS: - in each success and failure case verify no memory leaks.
PASS: Verify authentication error handling soak
      - every 10-60 secs for 24+ hrs
      - token is corrupted followed by a sysinv request to
        exercise authentication error handling and renewal process.
PASS: Verify no coredumps.
PASS: Verify logging and token retry.
PASS: Verify process continues to use the previous token until a new
      one is acquired.
      - Token Refresh is on time.
      - Token Renew is on event.
PASS: Verify soak of persistent authentication error / token
      renewal cycle. No memory leak or coredumps.

Closes-Bug: 2063475
Change-Id: I5eef62518ac606e6b54323b46fbb6f9475b5c1ef
2024-04-29 13:11:26 +00:00
Zuul
975e868431 Merge "Change mtcInfo log in mtcCtrlMsg.cpp to a dlog" 2024-04-17 12:10:16 +00:00
Eric MacDonald
97092bd38b Change mtcInfo log in mtcCtrlMsg.cpp to a dlog
The mtcInfo message log was enhanced to include the payload in a
previous update without realizing that message contained the target
BMC's username and password.

This update switches that log to a debug log (not enabled by default)
to avoid revealing provisioned BMC credentials in the mtce logs.

Test Plan:

PASS: Verify mtce package build
PASS: Verify mtcInfo log with bmc info payload is no longer logged.

Story: 2010940
Task: 49857
Change-Id: I35db04e9292471d92c24c98922350cfb72b5035e
2024-04-11 17:17:45 +00:00
Eric MacDonald
649e94c8da Add pxeboot mtcAlive messaging alarm handling
This update adds alarm handling to the recently introduced pxeboot
network mtcAlive messaging, see depends on review below.

A new 200.003 maintenance alarm is introduced with the second depends
on update below. This new alarm is MINOR but also Management Affecting
because the pxeboot network is required for node installation.

This update enhances the new pxeboot_mtcAlive_monitor FSM for the
purpose of detecting pxeboot mtcAlive message loss, alarming and
then clearing the alarm once pxeboot mtcAlive messaging resumes.

The new alarm assertion and clear is debounced:
 - alarm is asserted if message loss persists to the accumulation of
   12 missed messages or after 2 minutes of complete message loss.
 - alarm is cleared after decrementing the message missed counter to
   zero or 1 minute of loss-less messaging.
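The debounce above can be sketched with a simple miss counter. The counter values follow the commit text (12 misses at the 10 second monitor period equals 2 minutes of complete loss); the struct and member names are illustrative:

```cpp
// Sketch of the pxeboot mtcAlive alarm debounce: assert after 12
// accumulated misses, clear once the miss counter decrements to zero.
// Names are illustrative, not the actual mtce code.
struct PxebootAlarmDebounce {
    int  misses  = 0;
    bool alarmed = false;
    static const int MISS_THRESHOLD = 12;  // 12 * 10 s period = 2 min

    // Called once per 10 second monitor period.
    void on_monitor_period(bool mtcalive_seen) {
        if (!mtcalive_seen) {
            if (++misses >= MISS_THRESHOLD)
                alarmed = true;            // assert the 200.003 alarm
        } else if (misses > 0) {
            if (--misses == 0)
                alarmed = false;           // clear after loss-less recovery
        }
    }
};
```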

Upgrades are supported with the addition of a features list to the
mtcClient ready event. All new mtcClients that support pxeboot network
messaging now publish pxeboot mtcAlive support through this new
features list. This is rendered in the logs like this:

    <hostname> mtcClient ready ; with pxeboot mtcAlive support

The mtcAgent does not expect/monitor pxeboot mtcAlive messages from
hosts that don't publish the feature support.

Test Plan:

PASS: Verify mtcAlive period is 5 seconds.
PASS: Verify pxeboot mtcAlive monitor period is 10 seconds.
PASS: Verify mtcAgent sends mtcClient a mtcAlive request on every
      mtcAlive monitor miss.
PASS: Verify pxeboot mtcAlive alarm is not raised while a node is
      locked.

Alarm attributes:

PASS: Verify severity is minor.
PASS: Verify alarm is cleared while node is locked.
PASS: Verify alarm can be suppressed while unlocked.
PASS: Verify asserted alarm is management affecting.
PASS: Verify alarm-show output format including cause and repair
      action text.

Process Restart Handling:

PASS: Verify alarm is maintained over a mtcAgent process restart.
PASS: Verify pxeboot monitoring resumes with or without asserted alarm
      immediately following a mtcAgent process restart.
PASS: Verify mtcClient learns and starts pxeboot mtcAlive messaging
      immediately following mtcClient process restart for locked or
      unlocked nodes.

Alarm Debounce Handling:

PASS: Verify alarm assertion only after 2 minutes of mtcAlive loss.
PASS: Verify alarm clear after 1 minute of mtcAlive recovery.
PASS: Verify assertion and recovery debounce logging.
PASS: Verify alarm management miss and loss controls handle all
      boundary conditions exercised by a 12 hr soak with randomized
      period between message loss and recovery.

Host Action Handling:

PASS: Verify mtcAlive alarm is not raised over a Host Unlock Enable.
PASS: Verify mtcAlive alarm is not raised over a Host Graceful Recovery.
PASS: Verify mtcAlive alarm is not raised over a Host Power Off/On.
PASS: Verify mtcAlive alarm is not raised over a Host Reboot/Reset.
PASS: Verify mtcAlive alarm is not raised over a Host Reinstall.
PASS: Verify pxeboot mtcAlive is factored into Host Offline Handling.
PASS: Verify pxeboot alarm handling for node that does not send
      pxeboot mtcAlive after unlock.

Stuck Alarm Avoidance Handling:

PASS: Verify typical alarm assertion and clear handling.
PASS: Verify alarm is maintained or cleared over node reboot if the
      messaging issue persists or resolves over the reboot recovery.
PASS: Verify mtcAlive alarm is maintained over a Swact and cleared
      if the messaging is ok on the newly active controller.
PASS: Verify mtcAlive alarm assertion recovery case over uncontrolled
      Swact due to active controller reboot.
PASS: Verify alarm is cleared over a spontaneous reboot if pxeboot
      messaging recovers over that reboot.

Upgrades Case:

PASS: Verify pxeboot mtcAlive monitoring only occurs on mtcClients
      that actually support pxeboot network mtcAlive monitoring.

PASS: Verify parsing of the mtcClient's new features list, which
      enables pxeboot mtcAlive monitoring for that node.

PASS: Verify pxeboot mtcAlive messaging monitoring is not enabled
      towards nodes whose mtcClient does not publish pxeboot mtcAlive
      messaging feature support.
PROG: Verify AIO DX upgrade from 22.12 to current master branch.
      Focus on pxeboot messaging over the upgrade process.

Depends-On: https://review.opendev.org/c/starlingx/metal/+/912654
Depends-On: https://review.opendev.org/c/starlingx/fault/+/914660
Story: 2010940
Task: 49542
Change-Id: I1b51ad9ebcf010f5dee9a86c0295be3da6e2f9b1
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2024-04-09 14:13:23 +00:00
Eric MacDonald
14bb67789e Add pxeboot network mtcAlive messaging to Maintenance
The introduction of the new pxeboot network requires maintenance
to verify and report on messaging failures over that network.

Towards that, this update introduces periodic mtcAlive messaging
between the mtcAgent and mtcClient.

Test Plan:

PASS: Verify install and provision each system type with a mix
             of networking modes ; ethernet, bond and vlan
             - AIO SX, AIO DX, AIO DX plus
             - Standard System 2+1
             - Storage System 2+1+1
PASS: Verify feature with physical on management interface
PASS: Verify feature with vlan on management interface
PASS: Verify feature with bonded management interface
PASS: Verify feature with bonded vlans on management interface
PASS: Verify in bonded cases handling with 2, 1 or no slaves found
PASS: Verify mgmt-combined or separate cluster-host network
PASS: Verify mtcClient pxeboot interface address learning
             - for worker and storage nodes       ; dhcp leases file
             - for controller nodes before unlock ; dhcp leases file
             - for controller nodes after unlock  ; static from ifcfg
             - from controller within 10 seconds of process restart
PASS: Verify mtcAgent pxeboot interface address learning from
             dnsmasq.hosts file
PASS: Verify pxeboot mtcAlive initiation, handling, loss detection
             and recovery
PASS: Verify success and failure handling of all new pxeboot ip
             address learning functions ;
             - dhcp - all system node installs.
             - dnsmasq.hosts - active controller for all hosts.
             - interfaces.d - controller's mtcClient pxeboot address.
             - pxeboot req mtcAlive - mtcAgent mtcAlive request message.
PASS: Verify mtcClient pxeboot network 'mtcAlive request' and 'reboot'
             command handling for ethernet, vlan and bond configs.
PASS: Verify mtcAlive sequence number monitoring, out-of-sequence
             detection, handling and logging.
PASS: Verify pxeboot rx socket binding and non-blocking attribute
PASS: Verify mtcAgent handling stress soaking of sustained incoming
             500+ msgs/sec ; batch handling and logging.
PASS: Verify mtcAgent and mtcClient pxeboot tx and rx socket messaging,
             failure recovery handling and logging.
PASS: Verify pxeboot receiver is not setup on the oam interface on
             controller-0 first install until after initial config
             complete.

Regression:

PASS: Verify mtcAgent/mtcClient online and offline state management
PASS: Verify mtcAgent/mtcClient command handling
      - over management network
      - over cluster-host network
PASS: Verify mtcClient interface chain log for all iface types
      - bond    : vlan123 -> pxeboot0 (802.3ad 4) -> enp0s8 and enp0s9
      - vlan    : vlan123 -> enp0s8
      - ethernet: enp0s8
PASS: Verify mtcAgent/mtcClient handling and logging including debug
      logging for standard operations
      - node install and unlock
      - node lock and unlock
      - node reinstall, reboot, reset
PASS: Verify graceful recovery handling of heartbeat loss failure.
      - node reboot
      - management interface down
PASS: Verify systemcontroller and subcloud install with dc-libvirt
PASS: Verify no log flooding, coredumps, memory leaks

Story: 2010940
Task: 49541
Change-Id: Ibc87b85e3e0e07c3b8c40b5291bd3372506fbdfb
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2024-03-28 15:28:27 +00:00
Eric MacDonald
3c94b0e552 Avoid creating non-volatile node locked file while in simplex mode
It is possible to lock controller-0 on a DX system before controller-1
has been configured/enabled. Due to the following recent updates this
can lead to SM disabling all controller services on that now locked
controller-0 thereby preventing any subsequent controller-0 unlock
attempts.

https://review.opendev.org/c/starlingx/metal/+/907620
https://review.opendev.org/c/starlingx/ha/+/910227

This update modifies the mtce node locked flag file management so that
the non-volatile node locked file (/etc/mtc/tmp/.node_locked) is only
created on a locked host after controller-1 is installed, provisioned
and configured.

This prevents SM from shutting down if the administrator locks
controller-0 before controller-1 is configured.

Test Plan:

PASS: Verify AIO DX Install.
PASS: Verify Standard System Install.
PASS: Verify Swact back and forth.
PASS: Verify lock/unlock of controller-0 prior to controller-1 config
PASS: Verify the non-volatile node locked flag file is not created
      while the /etc/platform/simplex file exists on the active
      controller.
PASS: Verify lock and delete of controller-1 puts the system back
      into simplex mode where the non-volatile node locked flag file
      is once again not created if controller-0 is then unlocked.
PASS: Verify an existing non-volatile node locked flag file is removed
      if present on a node that is locked without new persist option.
PASS: Verify original reported issue is resolved for DX systems.

Closes-Bug: 2051578
Change-Id: I40e9dd77aa3e5b0dc03dca3b1d3d73153d8816be
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2024-03-09 12:45:54 +00:00
Eric MacDonald
d9982a3b7e Mtce: Create non-volatile backup of node locked flag file
The existing /var/run/.node_locked flag file is volatile.
Meaning it is lost over a host reboot which has DOR implications.

Service Management (SM) sometimes selects and activates services
on a locked controller following a DOR (Dead Office Recovery).

This update is part one of a two-part update that solves both
of the above problems. Part two is a change to SM in the ha git.
This update can be merged without part two.

This update maintains the existing volatile node locked file because
it is looked at by other system services. So to minimize the change
and therefore patchback impact, a new non-volatile 'backup' of the
existing node locked flag file is created.

This update incorporates modifications to the mtcAgent and mtcClient,
introducing a new backup file and ensuring their synchronized
management to guarantee their simultaneous presence or absence.

Note: A design choice was made to not symlink one file to the
      other, avoiding the need to manage symlinks in the code.
      This approach was chosen for its simplicity and reliability
      in directly managing both files. At some point in the future
      the volatile file could be deprecated, contingent upon
      identifying and updating all services that directly reference it.

This update also removes some dead code that was adjacent to my update.

Test Plan: This test plan covers the maintenance management of
           both files to ensure they always align and the expected
           behavior exists.

PASS: Verify AIO DX Install.
PASS: Verify Storage System Install.
PASS: Verify Swact back and forth.
PASS: Verify mtcClient and mtcAgent logging.
PASS: Verify node lock/unlock soak.

Non-volatile (Nv) node locked management test cases:

PASS: Verify Nv node locked file is present when a node is locked.
      Confirmed on all node types.
PASS: Verify any system node install comes up locked with both node
      locked flag files present.
PASS: Verify mtcClient logs when a node is locked and unlocked.
PASS: Verify Nv node locked file present/absent state mirrors the
      already existing /var/run/.node_locked flag file.
PASS: Verify node locked file is present on controller-0 during
      ansible run following initial install and removed as part
      of the self-unlock.
PASS: Verify the Nv node locked file is removed over the unlock
      along with the administrative state change prior to the
      unlock reboot.
PASS: Verify both node locked files are always present or absent
      together.
PASS: Verify node locked file management while the management
      interface is down. File is still managed over cluster network.
PASS: Verify node locked file management while the cluster interface
      is down. File is still managed over management network.
PASS: Verify behavior if the new unlocked message is received by a
      mtcClient process that does not support it ; unknown command log.
PASS: Verify a node locked state is auto corrected while not in a
      locked/unlocked action change state.
      ... Manually remove either file on locked node and verify
          they are both recreated within 5 seconds.
      ... Manually create either node locked file on unlocked worker
          or storage node and verify the created files are removed
          within 5 seconds.
          Note: doing this to the new backup file on the active
                controller will cause SM to shutdown as expected.
PASS: Verify Nv node locked file is auto created on a node that
      spontaneously rebooted while it was unlocked. During the
      reboot the node was administratively locked.
      The node should come online with both node locked files present.

Partial-Bug: 2051578
Change-Id: I0c279b92491e526682d43d78c66f8736934221de
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2024-02-14 00:54:11 +00:00
Eric MacDonald
191c0aa6a8 Add a wait time between http request retries
Maintenance interfaces with sysinv, sm and the vim using http requests.
Request timeout's have an implicit delay between retries. However,
command failures or outright connection failures don't.

This has only become obvious in mtce's communication with the vim,
where a process startup timing change appears to leave the 'vim' not
yet ready to handle commands when the mtcAgent starts sending them
following a platform services group startup by sm.

This update adds a 10 second http retry wait as a configuration option
to mtc.conf. The mtcAgent loads this value at startup and uses it
in a new HTTP__RETRY_WAIT state of the http request work FSM.

The number of retries remains unchanged. This update is only forcing
a minimum wait time between retries, regardless of cause.
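The retry wait state can be sketched as below. This is a minimal model under assumptions: the struct, state names other than HTTP__RETRY_WAIT's intent, and the retry count are illustrative; only the 10 second configurable wait comes from the commit text:

```cpp
// Sketch: force a minimum wait between http retries regardless of the
// failure cause (timeout, connect error, command failure). Illustrative.
#include <chrono>

struct HttpRequest {
    int retries_left = 3;                  // count unchanged by the update
    std::chrono::seconds retry_wait{10};   // loaded from mtc.conf at startup
    enum State { IDLE, RETRY_WAIT, RESEND } state = IDLE;
    std::chrono::steady_clock::time_point wait_until;

    // On any failure enter RETRY_WAIT so even instant connection
    // refusals are paced by the configured minimum wait.
    void on_failure(std::chrono::steady_clock::time_point now) {
        if (retries_left > 0) {
            --retries_left;
            state = RETRY_WAIT;
            wait_until = now + retry_wait;
        }
    }
    void tick(std::chrono::steady_clock::time_point now) {
        if (state == RETRY_WAIT && now >= wait_until)
            state = RESEND;                // only now is the retry sent
    }
};
```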

Failure path testing was done using Fault Insertion Testing (FIT).

Test Plan:

PASS: Verify the reported issue is resolved by this update.
PASS: Verify http retry config value load on process startup.
PASS: Verify updated value is used over a process -sighup.
PASS: Verify default value if new mtc.conf config value is not found.
PASS: Verify http connection failure http retry handling.
PASS: Verify http request timeout failure retry handling.
PASS: Verify http request operation failure retry handling.

Regression:

PASS: Build and install ISO - Standard and AIO DX.
PASS: Verify http failures do not fail a lock operation.
PASS: Verify host unlock fails if its http done queue shows failures.
PASS: Verify host swact.
PASS: Verify handling of random and persistent http errors involving
      the need for retries.

Closes-Bug: 2047958
Change-Id: Icc758b0782be2a4f2882efd56f5de1a8dddea490
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2024-02-07 20:33:01 +00:00
Eric Macdonald
50dc29f6c0 Improve maintenance power/reset control command retry handling
This update improves on and drives consistency into the
maintenance power on/off and reset handling in terms of
retries and use of graceful and immediate commands.

This update maintains the 10 retries for both power-on
and power-off commands and increases the number of retries
for the reset command from 5 to 10 to line up with the
power operation commands.

This update also ensures that the first 5 retries are done
with the graceful action command while the last 5 are with
the immediate.

This update also removed a power on handling case that could
have led to a stuck state. This case was virtually impossible
to hit based on the required sequence of intermittent command
failures but that scenario handling was fixed up anyway.

Issues have been seen with the power-off handling on some servers,
which appear to need more time to power off. This update therefore
introduces a 30 second delay following a power-off command, before
issuing the power status query, to give the server time to power
off before the power-off command is retried.
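The graceful/immediate split can be sketched as a pure policy function (a hypothetical helper, not the actual handler code): retry 0 is the first try, retries 1-5 reuse the graceful command, and retries 6-10 switch to the immediate variant.

```python
def reset_action(retry, graceful_retries=5, max_retries=10):
    """Return the command variant to use for a given retry number.

    The first try and the first 5 retries use the graceful command;
    the final 5 retries use the immediate variant. None means the
    retries are exhausted.
    """
    if retry > max_retries:
        return None
    return "graceful" if retry <= graceful_retries else "immediate"
```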

Test Plan: Both IPMI and Redfish

PASS: Verify power on/off and reset handling support up to 10 retries
PASS: Verify graceful command is used for the first power on/off
      or reset try and the first 5 retries
PASS: Verify immediate command is used for the final 5 retries
PASS: Verify reset handling with/without retries (none/mid/max)
PASS: Verify power-on  handling with/without retries (none/mid/max)
PASS: Verify power-off handling  with/without retries (none/mid/max)
PASS: Verify power status command failure handling for power on/off
NOTE: FIT (fault insertion testing) was used to create retry scenarios

PASS: Verify power-off inter retry delay feature
PASS: Verify 30 second power-off to power query delay
PASS: Verify redfish power/reset commands used are logged by default
PASS: Verify power-off/on and reset logging

Regression:

PASS: verify power-on/off and reset handling without retries
PASS: Verify power-off handling when power is already off
PASS: Verify power-on handling when power is already on

Closes-Bug: 2031945
Signed-off-by: Eric Macdonald <eric.macdonald@windriver.com>
Change-Id: Ie39326bcb205702df48ff9dd090f461c7110dd36
2024-01-25 22:42:26 +00:00
Zuul
125601c2f9 Merge "Failure case handling of LUKS service" 2023-12-14 18:09:46 +00:00
Jagatguru Prasad Mishra
1210ed450a Failure case handling of LUKS service
luks-fs-mgr service creates and unseals the LUKS volume used to store
keys/secrets. This change handles the failure case where this essential
service is inactive. It introduces an alarm, LUKS_ALARM_ID, which is
raised when the service is inactive, implying an issue in creating or
unsealing the LUKS volume.
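A minimal sketch of the in-service decision, assuming the service state is polled periodically (the real daemon is C++ using the FM alarm APIs; `luks_alarm_action` is a hypothetical name):

```python
LUKS_ALARM_ID = "200.016"

def luks_alarm_action(service_active, alarm_raised):
    """Map the luks-fs-mgr service state to an alarm action.

    Returns "raise" when the service is inactive and no alarm is up,
    "clear" when the service is active again with an alarm still
    raised, and None when nothing needs to change.
    """
    if not service_active and not alarm_raised:
        return "raise"
    if service_active and alarm_raised:
        return "clear"
    return None
```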

Test Plan:
PASS: build-pkgs -c -p mtce-common
PASS: build-pkgs -c -p mtce
PASS: build-image
PASS: AIO-SX bootstrap with luks volume status active
PASS: AIO-DX bootstrap with volume status active
PASS: Standard setup with 2 controllers and 1 compute node with luks
      volume status active. There should not be any alarm and node
      status should be unlocked/enabled/available.
PASS: AIO-DX node enable failure on the controller where luks volume
      is inactive. Node availability should be failed. A critical
      alarm with id 200.016 should be displayed with 'fm alarm-list'
PASS: AIO-SX node enable failure on the controller-0. Node availability
      should be failed. A critical alarm with id 200.016 should be
      displayed with 'fm alarm-list'
PASS: Standard- node enable failure on the node (controller-0,
      controller-1, storage-0, compute-1). Node availability
      should be failed. A critical alarm with id 200.016 should be
      displayed with 'fm alarm-list' for the failed host.
PASS: AIO-DX In service volume inactive should be detected and a
      critical alarm should be raised with ID 200.016. Node
      availability should be changed to degraded.
PASS: AIO-SX In service volume inactive  status should be detected
      and a critical alarm should be raised with ID 200.016. Node
      availability should be changed to degraded.
PASS: Standard ( 2 controller, 1 storage, 1 compute) In service
      volume inactive status should be detected and a
      critical alarm should be raised with ID 200.016. Node
      availability should be changed to degraded.
PASS: AIO-DX In service: If volume becomes active and a LUKS alarm
      is active, alarm should be cleared. Node availability should
      be changed to available.
PASS: AIO-SX In service: If volume becomes active and a  LUKS alarm is
      active, alarm should be cleared. Node availability should be
      changed to available.
PASS: Standard ( 2 controller, 1 storage, 1 compute) In service:
      If volume becomes active and a LUKS alarm is active, alarm
      should be cleared. Node availability should be changed to
      available.
PASS: AIO-SX, AIO-DX, Standard- If intest fails and node availability
      is 'failed'. After fixing the volume issue, a lock/unlock should
      make the node available.

Story: 2010872
Task: 49108

Change-Id: I4621e7c546078c3cc22fe47079ba7725fbea5c8f
Signed-off-by: Jagatguru Prasad Mishra <jagatguruprasad.mishra@windriver.com>
2023-12-06 00:34:02 -05:00
Zuul
1332ebb7a7 Merge "Replace a file test from fsmond" 2023-12-04 14:03:11 +00:00
Teresa Ho
36814db843 Increase timeout for runtime manifest
In management network reconfiguration for AIO-SX, the runtime manifest
executed during host unlock could take more than five minutes to complete.
This commit extends the timeout period from five minutes to eight
minutes.

Test Plan:
PASS: AIO-SX subcloud mgmt network reconfiguration

Story: 2010722
Task: 49133

Change-Id: I6bc0bacad86e82cc1385132f9cf10b56002f385e
Signed-off-by: Teresa Ho <teresa.ho@windriver.com>
2023-11-23 16:51:22 -05:00
Erickson Silva de Oliveira
16181a2ce8 Replace a file test from fsmond
fsmond tries to create a test file at "/.fs-test", but this fails
because "/" is read-only under ostree.

The fix replaces this path in fsmond monitoring
with /sysroot/.fs_test.

Below is a comparison of the logs:
  - Before change:
  ( 196) fsmon_service : Warn : File (/.fs-test) test failed

  - After change:
  ( 201) fsmon_service : Info : tests passed
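The probe itself amounts to a create/write/remove test; a sketch of the idea (the real fsmond is C, and the path below is the new /sysroot/.fs_test location):

```python
import os

def fs_test(path):
    """Probe a filesystem by creating and removing a small test file.

    Returns False where the target is not writable, e.g. "/" on an
    ostree-based system, which is why the probe file moved to
    /sysroot/.fs_test.
    """
    try:
        with open(path, "w") as f:
            f.write("fs-test\n")
        os.remove(path)
        return True
    except OSError:
        return False
```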

Test Plan:
  - PASS: Build mtce package
  - PASS: Replace fsmond binary on AIO-SX
  - PASS: Check fsmond.log output

Closes-Bug: 2043712

Change-Id: Ib4bad73448735bce1dff598151fce86f867f4db7
Signed-off-by: Erickson Silva de Oliveira <Erickson.SilvadeOliveira@windriver.com>
2023-11-17 08:15:28 -03:00
Eric MacDonald
79d8644b1e Add bmc reset delay in the reset progression command handler
This update solves two issues involving bmc reset.

Issue #1: A race condition can occur if the mtcAgent finds an
          unlocked-disabled or heartbeat failing node early in
          its startup sequence, say over a swact or an SM service
          restart and needs to issue a one-time-reset. If at that
          point it has not yet established access to the BMC then
          the one-time-reset request is skipped.

Issue #2: When the issue #1 race condition does not occur, i.e. BMC
          access is established in time, the mtcAgent will issue its
          one-time reset to the node. If this occurs as a result of a
          crashdump then this one-time reset can interrupt the
          collection of the vmcore crashdump file.

This update solves both of these issues by introducing a bmc reset
delay following the detection and in the handling of a failed node
that 'may' need to be reset to recover from being network isolated.

The delay prevents the crashdump from being interrupted and removes
the race condition by giving maintenance more time to establish bmc
access required to send the reset command.

To handle significantly long bmc reset delay values this update
cancels the posted 'in waiting' reset if the target recovers online
before the delay expires.

It is recommended to use a bmc reset delay that is longer than a
typical node reboot time. This is so that in the typical case, where
there is no crashdump happening, we don't reset the node late in its
almost-done recovery. The number of seconds remaining in the pending
reset countdown is logged periodically.

It can take upwards of 2-3 minutes for a crashdump to complete.
To avoid the double reboot, in the typical case, the bmc reset delay
is set to 5 minutes which is longer than a typical boot time.
This means that if the node recovers online before the delay expires
then great, the reset wasn't needed and is cancelled.

However, if the node is truly isolated or the shutdown sequence
hangs then, although the recovery is delayed a bit to accommodate
the crashdump case, the node is still recovered after the bmc reset
delay period. This could lead to a double reboot if the node
recovery-to-online time is longer than the bmc reset delay.

This update implements this change by adding a new 'reset send wait'
phase to the existing reset progression command handler.

Some consistency-driven logging improvements were also implemented.
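The cancel-before-expiry behaviour can be sketched as a simple countdown simulation (a hypothetical function; the real logic is a phase of the reset progression FSM, ticking alongside mtcAlive monitoring):

```python
def run_posted_reset(delay_secs, online_at=None):
    """Simulate the posted one-time reset countdown.

    Each tick is one second. If the node recovers online at second
    `online_at`, before the delay expires, the posted reset is
    cancelled; otherwise the reset is issued when the delay expires.
    """
    for t in range(delay_secs):
        if online_at is not None and t >= online_at:
            return ("cancelled", t)
    return ("reset", delay_secs)
```

With the recommended 5 minute delay, a node that boots back in 2 minutes cancels its pending reset, while a hung or isolated node is still reset at the 5 minute mark.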

Test Plan:

PASS: Verify failed node crashdump is not interrupted by bmc reset.
PASS: Verify bmc is accessible after the bmc reset delay.
PASS: Verify handling of a node recovery case where the node does not
      come back before bmc_reset_delay timeout.
PASS: Verify posted reset is cancelled if the node goes online before
      the bmc reset delay and uptime shows less than 5 mins.
PASS: Verify reset is not cancelled if node comes back online without
      reboot before bmc reset delay while still seeing mtcAlive on one
      or more links. This handles the cluster-host-only heartbeat loss
      case. The node is still rebooted with the bmc reset delay as
      backup.
PASS: Verify reset progression command handling, with and
      without reboot ACKs, with and without bmc
PASS: Verify reset delay defaults to 5 minutes
PASS: Verify reset delay change over a manual change and sighup
PASS: Verify bmc reset delay of 0, 10, 60, 120, 300 (default), 500
PASS: Verify host-reset when host is already rebooting
PASS: Verify host-reboot when host is already rebooting
PASS: Verify timing of retries and bmc reset timeout
PASS: Verify posted reset throttled log countdown

Failure Mode Cases:

PASS: Verify recovery handling of failed powered off node
PASS: Verify recovery handling of failed node that never comes online
PASS: Verify recovery handling when bmc is never accessible
PASS: Verify recovery handling cluster-host network heartbeat loss
PASS: Verify recovery handling management network heartbeat loss
PASS: Verify recovery handling both heartbeat loss
PASS: Verify mtcAgent restart handling finding unlocked disabled host

Regression:

PASS: Verify build and DX system install
PASS: Verify lock/unlock (soak 10 loops)
PASS: Verify host-reboot
PASS: Verify host-reset
PASS: Verify host-reinstall
PASS: Verify reboot graceful recovery (force and no force)
PASS: Verify transient heartbeat failure handling
PASS: Verify persistent heartbeat loss handling of mgmt and/or cluster networks
PASS: Verify SM peer reset handling when standby controller is rebooted
PASS: Verify logging and issue debug ability

Closes-Bug: 2042567
Closes-Bug: 2042571
Change-Id: I195661702b0d843d0bac19f3d1ae70195fdec308
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2023-11-02 20:58:00 +00:00
Enzo Candotti
23143abbca Update crashDumpMgr to source config from envfile
This commit updates the crashDumpMgr service in order to:
- Cleanup of current service naming and packaging to follow the
  standard Linux naming convention:
    - Repackage /etc/init.d/crashDumpMgr to
      /usr/sbin/crash-dump-manager
    - Rename crashDumpMgr.service to crash-dump-manager.service
- Add EnvironmentFile to crash-dump-manager service file to source
  configuration from /etc/default/crash-dump-manager.
- Update ExecStart of crash-dump-manager service to use parameters
  from EnvironmentFile
- Update crash-dump-manager service dependencies to run after
  config.service.
- Update the logrotate configuration to support the retention policy
  for the maximum number of files. The “rotate 1” option was removed
  to permit crash-dump-manager to manage pruning of old files.
- Modify the crash-dump-manager script to enable updates to the
  max_files parameter to a lower value. If there are currently more
  files than the new max_files value, the oldest files will be
  deleted the next time a crash dump file needs to be stored, thus
  adhering to the new max_files values.
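The lowered-max_files handling reduces to choosing which oldest files to drop before the next dump is stored; a sketch of that selection (the real crash-dump-manager is a shell script, and this helper name is illustrative):

```python
def files_to_prune(existing_oldest_first, max_files):
    """Return the crash dump files to delete before storing one new
    dump, so that at most max_files files remain afterwards.

    `existing_oldest_first` lists the current files, oldest first.
    If max_files was lowered below the current count, the excess
    oldest files are pruned at the next store, as described above.
    """
    keep = max(max_files - 1, 0)  # leave room for the incoming dump
    excess = len(existing_oldest_first) - keep
    return existing_oldest_first[:excess] if excess > 0 else []
```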

Test Plan:

PASS: Build ISO and perform a fresh install. Verify the new
crash-dump-manager service is enabled and working as expected.
PASS: Add and apply new crashdump service parameters and force a kernel
panic. Verify that after the reboot, the max_files, max_used,
min_available and max_size values are updated accordingly to the service
parameters values.
PASS: Verify that the crashdump files are rotated as expected.

Story: 2010893
Task: 48910

Change-Id: I4a81fcc6ba456a0d73067b77588ee4a125e44e62
Signed-off-by: Enzo Candotti <enzo.candotti@windriver.com>
2023-10-06 23:06:54 +00:00
Enzo Candotti
a120cc5fea Add new configuration parameters to crashDumpMgr
This commit updates crashDumpMgr to add three new parameters
and enhance an existing one.

1. Maximum Files: Added 'max-files' parameter to specify the maximum
   number of saved crash dump files. The default value is 4.
2. Maximum Size: Updated the 'max-size' parameter to support
   the 'unlimited' value. The default value is 5GiB.
3. Maximum Used: Included 'max-used' parameter to limit the maximum
   storage used by saved crash dump files. It supports 'unlimited'
   and has a default value of unlimited.
4. Minimum Available: Implemented 'min-available' parameter, enabling
   the definition of a minimum available storage threshold on the
   crash dump file system. The value is restricted to a minimum of
   1GB and defaults to 10%.

These enhancements refine the crash dump management process and
offer more control over storage usage and crash dump file retention.
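Taken together, the four parameters gate whether a new dump may be saved; a sketch with sizes in bytes, where 'unlimited' disables a check (the function name and defaults shown are illustrative of the description above, not the script's actual interface):

```python
GiB = 1024 ** 3

def may_save_dump(dump_size, used, available,
                  max_size=5 * GiB, max_used="unlimited",
                  min_available=1 * GiB):
    """Check a candidate crash dump against the configured limits.

    dump_size: size of the new dump; used: space already consumed by
    saved dumps; available: free space on the crash filesystem.
    """
    if max_size != "unlimited" and dump_size > max_size:
        return False                      # 2. maximum size exceeded
    if max_used != "unlimited" and used + dump_size > max_used:
        return False                      # 3. maximum used exceeded
    if available - dump_size < min_available:
        return False                      # 4. minimum available breached
    return True
```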

Story: 2010893
Task: 48676

Test Plan:
1) max-files parameter:
  PASS: don't set max-files param. Ensure the default value is used.
  Create 5 directories inside /var/crash. Each of them contains
  dmesg.<date> and dump.<date>. run the crashDumpMgr script.
  Verify:
    PASS: the vmcore_first.tar.1.gz is created when the first
          directory is read.
    PASS: 4 more vmcore_<date>.tar files are created.
    PASS: There will be 1 vmcore_first.tar.1.gz and 4
          vmcore_<date>.tar inside /var/log/crash.
    PASS: There will be one summary file for each directory:
          <date>_dmesg.<date> inside /var/crash
2) max-size parameter
  PASS: don't set max-size param. Ensure the default value is used
        (5GiB).
  PASS: Set a fixed max-size param. Create a dump.<date> file greater
        than the max-size param. Run the crashDumpMgr script. Verify
        that the crash dump file is not generated and a log
        message is displayed.
3) max-used parameter:
  PASS: don't set max-used param. Ensure the default value is used
        (unlimited).
  PASS: Set a fixed max-used param. Create a dump.<date> file large
        enough that the used space exceeds the max-used param. Run
        the crashDumpMgr script. Verify that
        the crash dump file is not generated, a log message is
        displayed and the directory is deleted.
4) min-available parameter:
  PASS: don't set min-available param. Ensure the default value is
        used (10% of /var/log/crash).
  PASS: Set a fixed 'min-available' param. Generate a 'dump.<date>'
        file to simulate a situation where the remaining space is
        less than the 'min-available' parameter. Run the crashDumpMgr
        script and ensure that it does not create the crashdump file,
        displays a log message, and deletes the entry.
5) PASS: Since the crashDumpMgr.service file is not being modified,
         verify that the script takes the default values.

Note: All tests have also been conducted by generating a kernel panic
and ensuring the crashDumpMgr script follows the correct workflow.

Change-Id: I8948593469dae01f190fd1ea21da3d0852bd7814
Signed-off-by: Enzo Candotti <enzo.candotti@windriver.com>
2023-09-18 19:22:09 +00:00
Eric MacDonald
d863aea172 Increase mtce host offline threshold to handle slow host shutdown
Mtce polls/queries the remote host for mtcAlive messages
for 42 x 100 ms intervals in unlock or host-failed cases.
Absence of mtcAlive during this (~5 sec) period indicates
the node is offline.

However, in the rare case where shutdown is slow, 5 seconds
is not long enough. Rare cases have been seen where a 7 or 8
second wait is required to properly declare the node offline.

To avoid the rare transient 200.004 host alarm over an
unlock operation, this update increases the mtce host
offline window from 5 to 10 seconds (approx) by modifying
the mtce configuration file offline threshold from 42 to 90.
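The arithmetic behind the change: the nominal offline window is the threshold times the 100 ms poll period (per-cycle overhead is why 42 counts reads as roughly 5 seconds in practice):

```python
def offline_window_secs(offline_threshold, period_ms=100):
    """Nominal offline declaration window in seconds: the number of
    unanswered mtcAlive polls times the 100 ms poll period. Actual
    wall time runs a bit longer due to per-cycle overhead.
    """
    return offline_threshold * period_ms / 1000.0
```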

Test Plan:

PASS: Verify unchallenged failed to offline period to be ~10 secs
PASS: Verify algorithm restarts if there is mtcAlive received
      anytime during the polls/queries (challenge) window.
PASS: Verify challenge handling leads to a longer but
      successful offline declaration.
PASS: Verify above handling for both unlock and spontaneous
      failure handling cases.

Closes-Bug: 2024249
Change-Id: Ice41ed611b4ba71d9cf8edbfe98da4b65dcd05cf
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2023-06-16 18:14:08 +00:00
Matheus Guilhermino
a0e270b51b Add mpath support to wipedisk script
The wipedisk script was not able to find the boot device
when using multipath disks. This is due to the fact that
multipath devices are not listed under /dev/disk/by-path/.

To add support for multipath devices, the script should look
for the boot device under /dev/disk/by-id/ as well.
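The lookup order can be sketched as below, with the symlink maps passed in for clarity (a hypothetical helper; the actual wipedisk is a shell script that globs the two directories):

```python
def find_boot_device(by_path_links, by_id_links, boot_part):
    """Find the symlink naming the boot device, checking
    /dev/disk/by-path first and falling back to /dev/disk/by-id,
    where multipath (mpath) devices appear.

    Each map is {symlink_name: resolved_device}.
    """
    for links in (by_path_links, by_id_links):
        for name, target in links.items():
            if target == boot_part:
                return name
    return None
```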

Test Plan
PASS: Successfully run wipedisk on an AIO-SX with multipath
PASS: Successfully run wipedisk on an AIO-SX w/o multipath

Closes-bug: 2013391

Signed-off-by: Matheus Guilhermino <matheus.machadoguilhermino@windriver.com>
Change-Id: I3af76cd44f22795784a9184daf75c66fc1b9874f
2023-04-10 17:10:22 -03:00
Al Bailey
37c5910a62 Update mtce debian package ver based on git
Update debian package versions to use git commits for:
 - mtce         (old 9, new 30)
 - mtce-common  (old 1, new 9)
 - mtce-compute (old 3, new 4)
 - mtce-control (old 7, new 10)
 - mtce-storage (old 3, new 4)

The Debian packaging has been changed to reflect all the
git commits under the directory, and not just the commits
to the metadata folder.

This ensures that any new code submissions under those
directories will increment the versions.

Test Plan:
  PASS: build-pkgs -p mtce
  PASS: build-pkgs -p mtce-common
  PASS: build-pkgs -p mtce-compute
  PASS: build-pkgs -p mtce-control
  PASS: build-pkgs -p mtce-storage

Story: 2010550
Task: 47401
Task: 47402
Task: 47403
Task: 47404
Task: 47405

Signed-off-by: Al Bailey <al.bailey@windriver.com>
Change-Id: I4846804320b0ad3ec10799a468a9ee3bf7973587
2023-03-02 14:50:35 +00:00
Kyale, Eliud
502662a8a7 Cleanup mtcAgent error logging during startup
- reduced log level in http util to warning
- use inservice test handler to ensure state change notification
  is sent to vim
- reduce retry count from 3 to 1 for add_handler state_change
  vim notification

Test plan:
PASS - AIO-SX: ansible controller startup (race condition)
PASS - AIO-DX: ansible controller startup
PASS - AIO-DX: SWACT
PASS - AIO-DX: power off restart
PASS - AIO-DX: full ISO install
PASS - AIO-DX: Lock Host
PASS - AIO-DX: Unlock Host
PASS - AIO-DX: Fail Host ( by rebooting unlocked-enabled standby controller)

Story: 2010533
Task: 47338

Signed-off-by: Kyale, Eliud <Eliud.Kyale@windriver.com>
Change-Id: I7576e2642d33c69a4b355be863bd7183fbb81f45
2023-02-14 14:18:02 -05:00
Christopher Souza
56ab793bc5 Change hostwd emergency log to write to /dev/kmsg
The hostwd emergency logs were written to /dev/console;
this change adds the prefix "hostwd:" to the log message
and writes it to /dev/kmsg instead.
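Writing to /dev/kmsg instead of /dev/console puts the message into the kernel ring buffer, where it survives into the kernel logs; a sketch with the device path parameterized for illustration (the real host watchdog is C):

```python
def write_emergency_log(message, kmsg_path="/dev/kmsg"):
    """Write a prefixed emergency message to the kernel log device.

    The "hostwd: " prefix makes the entries easy to find when
    grepping the kernel log. Returns the line written.
    """
    line = "hostwd: " + message.rstrip("\n") + "\n"
    with open(kmsg_path, "w") as f:
        f.write(line)
    return line
```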

Test Plan:

Pass: AIO-SX and AIO DX full deployment.
Pass: kill pmond and wait for the emergency log to be written.
Pass: check if the emergency log was written to /dev/kmsg.
Pass: Verify logging for quorum report missing failure.
Pass: Verify logging for quorum process failure.
Pass: Verify emergency log crash dump logging to mesg and
      console logging for each of the 2 cases above with
      stressng overloading the server (CPU, FS and Memory);
      stress-ng --vm-bytes 4000000000 --vm-keep -m 30 -i 30 -c 30

Story: 2010533
Task: 47216

Co-authored-by: Eric MacDonald <eric.macdonald@windriver.com>
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
Co-authored-by: Christopher Souza <Christopher.DeOliveiraSouza@windriver.com>
Signed-off-by: Christopher Souza <Christopher.DeOliveiraSouza@windriver.com>
Change-Id: I0da82f964dd096840259c4d0ed4e5f558debdf22
2023-02-01 23:41:14 +00:00
Eric MacDonald
a3cba57a1f Adapt Host Watchdog to use kdump-tools
The Debian package for kdump changed from kdump to kdump-tools

Test Plan:

PASS: Verify build and install AIO DX system
PASS: Verify host watchdog detects kdump as active in debian

Closes-Bug: 2001692
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
Change-Id: Ie1ac29d3d29f3d9c843789cdedf85081fe790616
2023-01-04 12:57:19 -05:00
Robert Church
1796ed8740 Update wipedisk for LVM based rootfs
Now that the root filesystem is based on an LVM logical volume, discover
the root disk by searching for the boot partition.

Changes include:
 - remove detection of rootfs_part/rootfs and adjust rootfs-related
   references to use boot_disk.
 - run bashate on the script and resolve indentation and syntax related
   errors. Leave long-line errors alone for improved readability.
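A sketch of the new discovery, assuming a mapping of partitions to mount points (a hypothetical helper; the actual script uses shell tools to walk the partition table):

```python
import re

def find_root_disk(mounts, boot_mount="/boot"):
    """Discover the root disk by locating the partition mounted at
    /boot and stripping its partition suffix (handles both sdXN and
    nvmeXnYpZ style device names).

    `mounts` maps device paths to mount points.
    """
    for device, mount_point in mounts.items():
        if mount_point == boot_mount:
            return re.sub(r"p?\d+$", "", device)
    return None
```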

Test Plan:
PASS - run 'wipedisk', answer prompts, and ensure all partitions are
       cleaned up except for the platform backup partition
PASS - run 'wipedisk --include-backup', answer prompts, and ensure all
       partitions are cleaned up
PASS - run 'wipedisk --include-backup --force' and ensure all partitions
       are cleaned up

Change-Id: I036ce745353b6a26bc2615ffc6e3b8955b4dd1ec
Closes-Bug: #1998204
Signed-off-by: Robert Church <robert.church@windriver.com>
2022-11-29 05:04:38 -06:00