0853bb3fcc
The active 'controller' domain name is used by the mtcAgent management
interface to communicate with the mtcClient.

The System Swact (Switch Activity) function dynamically migrates active
controller services between controller-0 and controller-1. During this
process the mtcAgent, along with other services, is restarted on the
newly active controller.

When the mtcAgent starts, it reads the system inventory and adds the
hosts to its internal control structure. During this "add" operation,
the mtcAgent sends commands to, and expects responses from, the local
and remote mtcClients on individual nodes, using the 'controller' domain
name, which represents the management network's floating IP address.

A new feature, the FQDN (Fully Qualified Domain Name) Resolution
Manager, was introduced to handle domain name resolution in the
StarlingX system.

However, an issue was identified where the FQDN Resolution Manager does
not yet have 'controller' domain name resolution fully available
(qualified) when the mtcAgent starts messaging with its mtcClients. As a
result, messages between the mtcAgent and mtcClient can be silently
lost. This can cause the "add host" operation to fail, which can be
service affecting for that host.

This update adds a small, manually configurable delay to the start of
the mtcAgent host add operation. This gives the FQDN Resolution Manager
time to complete setting up name resolution for the required
'controller' domain name.

The default add_host_delay of 20 seconds was selected after seeing
occasional failures with a 10 second delay.

This update can be removed in the future if the system changes to avoid
starting the mtcAgent before all name resolution is ready.

Test Plan:

PASS: Verify issue in system, apply update, verify issue is resolved.
PASS: Verify package/iso build along with AIO DX system install.
PASS: Verify mtcAgent logging.

Regression:

PASS: Verify standby controller lock/unlock soak; 10+ loops.
PASS: Verify Swact soak of 20+ swacts succeeds without reproducing the
      issue this update is designed to fix.
PASS: Verify heartbeating is enabled on all remote hosts on both
      controllers following an install and multiple Swacts.
PASS: Verify sensor monitoring is enabled over a Swact on all hosts
      that have their BMC provisioned.
PASS: Verify mtcClient, mtcAgent, hbsAgent and hbsClient logs for
      unexpected behavior.
PASS: Verify the default add host delay can be changed and that an
      mtcAgent configuration reload or process restart uses the
      modified value.
PASS: Verify no add host delay is imposed if the new configuration
      label is removed from the config file or set to 0.
PASS: Verify host lock immediately following a swact and successful
      system host-list.

Closes-Bug: 2093381
Change-Id: I694322eff0945c7c56bf21051b3d6cccacf829a2
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>