Eric MacDonald 0853bb3fcc Add configured add host delay to mtcAgent
The active 'controller' domain name is used by the mtcAgent
management interface to communicate with the mtcClient.

The System Swact (Switch Activity) function dynamically migrates
active controller services between controller-0 and controller-1.
During this process, the mtcAgent, along with other services, is
restarted on the newly active controller.

When the mtcAgent starts, it reads the system inventory and adds the
hosts to its internal control structure. During this "add" operation,
the mtcAgent sends commands to, and expects responses from, the local
and remote mtcClients on individual nodes, using the 'controller'
domain name, which represents the management network's floating IP
address.

A new feature, the FQDN (Fully Qualified Domain Name) Resolution
Manager, was introduced to handle domain name resolution in the
StarlingX system. However, an issue was identified where the FQDN
Resolution Manager does not yet have resolution of the 'controller'
domain name fully available when the mtcAgent starts messaging with
its mtcClients.

As a result, messages between the mtcAgent and mtcClient can be
silently lost. This can cause the "add host" operation to fail, which
is potentially service affecting for that host.

This update adds a small, manually configurable delay to the start of
the mtcAgent host add operation. This gives the FQDN Resolution
Manager time to finish setting up name resolution for the required
'controller' domain name.
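
As a rough illustration of the idea only, the following sketch holds
off host add handling until a configured number of seconds has passed
since process start. It is not the mtcAgent implementation; the
AddHostGate class and its names are hypothetical.

```python
import time


class AddHostGate:
    """Hypothetical helper: reports when host 'add' handling may begin."""

    def __init__(self, delay_secs, now=time.monotonic):
        # 'now' is injectable so the gate can be exercised with a fake clock.
        self._now = now
        # A zero or negative delay means the gate is open immediately.
        self._ready_at = now() + max(delay_secs, 0)

    def ready(self):
        """Return True once the configured delay has elapsed."""
        return self._now() >= self._ready_at
```

A main loop would keep servicing its other work and start adding
hosts only once ready() returns True.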

The default add_host_delay of 20 seconds was selected after occasional
failures were still seen with a 10 second delay.
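
A minimal sketch of how such a setting might be read, assuming an
ini-style config file. Only the add_host_delay label comes from this
update; the "agent" section name and the read_add_host_delay helper
are assumptions made for illustration. A missing label or a value of
0 yields no delay, matching the behavior verified in the test plan.

```python
import configparser


def read_add_host_delay(ini_text, section="agent", key="add_host_delay"):
    """Return the add host delay in seconds; 0 means no delay.

    The section name is an assumption, not taken from the real config file.
    """
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    # The fallback covers both a missing section and a missing label.
    raw = parser.get(section, key, fallback="0")
    try:
        delay = int(raw)
    except ValueError:
        # Treat an unparsable value as "no delay" in this sketch.
        return 0
    return max(delay, 0)
```

For example, read_add_host_delay("[agent]\nadd_host_delay = 20\n")
returns 20, while a file without the label returns 0.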

This update can be removed in the future if the system is changed so
that the mtcAgent is not started before all name resolution is ready.

Test Plan:

PASS: Verify issue in system, apply update, verify issue is resolved.
PASS: Verify package/iso build along with AIO DX system install.
PASS: Verify mtcAgent logging.

Regression:

PASS: Verify standby controller lock/unlock soak; 10+ loops.
PASS: Verify Swact soak of 20+ swacts succeeds without reproducing
      the issue this update is designed to fix.
PASS: Verify heart beating is enabled on all remote hosts on both
      controllers following an install and multiple Swacts.
PASS: Verify sensor monitoring is enabled on all hosts that have
      their BMC provisioned over a Swact.
PASS: Verify mtcClient, mtcAgent, hbsAgent and hbsClient logs for
      unexpected behavior.
PASS: Verify default add host delay can be changed and that an
      mtcAgent configuration reload or process restart uses the
      modified value.
PASS: Verify no add host delay is imposed if the new configuration
      label is removed from the config file or set to 0.
PASS: Verify host lock immediately following a Swact and a
      successful system host-list.

Closes-Bug: 2093381
Change-Id: I694322eff0945c7c56bf21051b3d6cccacf829a2
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2025-01-10 12:54:46 +00:00