Files
ironic/releasenotes/notes/permit-retries-with-agent-startup-aebfc36a775794c3.yaml
Julia Kreger 91b28bc43b Fix agent get_XXX_steps retries from being treated as not fresh agents
It is possible that an agent is booting in an environment with firewalls
doing evil things like not closing sockets, or where a FIN-ACK never makes
it to the conductor, or whatever.

This can result in the client hanging and eventually timing out.

Ironic's agent client code automatically retries. Which is cool.

The agent records it got a command from the first attempt, and then again
from the retry. Everything goes swimmingly until Ironic goes to assess
if the agent is a "freshly booted" agent, or not. At which point, the
check logic would see the multiple "get_xxx_steps" calls in the agent
logs, and declare the agent not to be freshly booted due to the retry
attempts.

So instead, we now explicitly evaluate the results of the command in
whole to account for retires. This commit also adds additional tests
as the helper was previously only really being exercised with empty
lists in unit tests.

Closes-Bug: 2110698
Change-Id: I460751b761462dbb630368e474e207fed90f289a
2025-05-14 10:11:35 -07:00

12 lines
568 B
YAML

---
fixes:
- |
Fixes an issue with agent startup where the workflow from the first
agent heartbeat interaction could fail due to a transient networking
issue leaving the Agent and Ironic in a state where the node cannot be
deployed and continues to record errors upon each additional heartbeat
operation. Logic to check the state of the agent has been adjusted to
ignore retry operations which were recorded by the agent.
More information on this issue can be found in
`bug 2110698 <https://bugs.launchpad.net/ironic/+bug/2110698>`_.