metal/mtce-common
Eric MacDonald 77b414114a Mtce power off action is declared successful before verification
The bmcUtil_is_power_on function declares power off whenever the
reading does not report "on", rather than waiting to see "off".
Some servers report a transient value of "poweringoff" during the
powering off process. Because that reading is not "on", the function
prematurely declares power as off.

This update adds an explicit check for the "off" state before
declaring the server powered off, and similarly an explicit check
for the "on" state before declaring it powered on.
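
The difference between the old and new checks can be sketched as
follows. This is a minimal Python illustration, not the mtce C++ code:
the function names, the PowerState enum and the lower-casing of the
reading are assumptions made for the example. Only the "on", "off" and
"poweringoff" values come from the description above.

```python
from enum import Enum


class PowerState(Enum):
    ON = "on"
    OFF = "off"
    UNKNOWN = "unknown"  # transient or unrecognized reading; keep polling


def is_power_on_old(reading: str) -> bool:
    """Old style check: anything that is not "on" is treated as off."""
    return reading.strip().lower() == "on"


def classify_power_state(reading: str) -> PowerState:
    """New style check: require an explicit "on" or "off" reading."""
    value = reading.strip().lower()
    if value == "on":
        return PowerState.ON
    if value == "off":
        return PowerState.OFF
    # e.g. "poweringoff": neither on nor off yet, so do not declare either
    return PowerState.UNKNOWN
```

With the old style check a "poweringoff" reading is indistinguishable
from "off"; with the explicit check it maps to UNKNOWN and the caller
keeps querying.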

This update also fixes 2 additional issues found during debug and
testing. Both had to be fixed in order to properly test the fix for
the original issue.

The issues:
 - in the presence of persistent power control failures,
   the Power-On and Power-Off FSMs are not timing out.
 - the power off action's completion phase (when power-off is done)
   produces a power-on action customer log.

This update makes power action try and retry handling consistent,
and fixes it, in the following ways:
 - each power (on/off) action starts with a query of the current state
 - each power (on/off) action retries the command up to 5 times
 - each action has 10 status query retries with
   -  5 second power status retry delay for power on
   -  0 second power action retry delay for power on
   - 10 second power status retry delay for power off
   - 30 second power action retry delay for power off
 - power off fails with max retries after 12 minutes
 - power  on fails with max retries after  3 minutes
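
The try and retry structure listed above can be sketched roughly as
follows. This is a hedged Python illustration, not the FSM code in
mtce: RetryPolicy, run_power_action and the injected query, command
and sleep callables are hypothetical names, and treating "up to 5
times" as 5 total command attempts is an assumption.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class RetryPolicy:
    command_retries: int     # max power command attempts (assumed total)
    status_retries: int      # status queries after each command
    status_delay_secs: int   # wait before each status query
    command_delay_secs: int  # wait before re-sending the command


# Values taken from the list above.
POWER_ON = RetryPolicy(5, 10, 5, 0)
POWER_OFF = RetryPolicy(5, 10, 10, 30)


def run_power_action(
    target: str,
    policy: RetryPolicy,
    query: Callable[[], str],
    command: Callable[[str], None],
    sleep: Callable[[int], None],
) -> bool:
    """Return True once the reading equals target, False on max retries."""
    # Each action starts with a query of the current state.
    if query() == target:
        return True
    for attempt in range(policy.command_retries):
        if attempt > 0:
            sleep(policy.command_delay_secs)
        command(target)
        for _ in range(policy.status_retries):
            sleep(policy.status_delay_secs)
            # Only an explicit match counts; "poweringoff" keeps polling.
            if query() == target:
                return True
    return False
```

The sketch only models the configured delays. It does not attempt to
reproduce the 3 and 12 minute limits, which may also depend on request
latency and on how the FSMs count attempts.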

 - each power (on/off) command start produces the appropriate
   customer command start log; in the adminActionChange function.
 - each power (on/off) action completion produces the appropriate
   customer action complete log; in the corresponding action FSM.

Failure path testing changes for the redfish protocol:
 - adds a compile-in FIT capability to test the send, recv and reading
   value failure paths.
 - adds a debug option that lets mtcAgent be configured, through a
   manual update to /etc/mtc.conf, to save redfishtool request
   responses in dated files for debug purposes.
   This is similar to what was added for the hardware monitor in
   https://opendev.org/starlingx/metal/commit/a1256a3c32

Test Plan: IPV4 and IPV6 verified

PASS: Verify AIO DX system install with BMC support.
PASS: Verify Power-off handling on servers that
      - do report intermediate power state change values
      - do not report intermediate power state change values

Power On: values can vary from server to server

PASS: Verify power on handling while already powered on ; 5 secs
PASS: Verify power on success handling;  10 secs (no retries)
PASS: Verify power on handling with partial retries
PASS: Verify power on handling with full retries - 3 minutes
PASS: Verify power on handling while handling power on
PASS: Verify power on handling while handling power off
PASS: Verify power on admin action command produces power on
      command customer log
PASS: Verify power on status change produces power on action
      customer log

Power Off: values can vary from server to server

PASS: Verify power off handling while already powered off - 5 secs
PASS: Verify power off success handling; 45 secs (no retries)
PASS: Verify power off handling with partial retries
PASS: Verify power off handling with full retries ; 15 minutes
PASS: Verify power off handling while handling power off
PASS: Verify power off handling while handling power on
PASS: Verify power off admin action command produces power off
      command customer log
PASS: Verify power off status change produces power off action
      customer log

Regression:

PASS: Verify system node reset with BMC provisioned
PASS: Verify system node pxeboot install with BMC provisioned
PASS: Verify power control success, failure and retry logging
PASS: Verify Horizon host status for all success & failure cases above

Closes-Bug: 2125596
Closes-Bug: 2125927
Change-Id: I16b35754ff700f9f6a92b090721b8763ad6933d5
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2025-09-30 20:01:40 -04:00