There are some Ironic execution workflows where there is not an easy way
to retry, such as when attempting to hand off the processing of an async
task to a conductor. Task handoff can require releasing a lock on the
node, so the next entity processing the task can acquire the lock
itself. However, this is vulnerable to race conditions, as there is no
uniform retry mechanism built in to such handoffs. Consider the
continue_node_deploy/clean logic, which does this:
    method = 'continue_node_%s' % operation
    # Need to release the lock to let the conductor take it
    task.release_resources()
    getattr(rpc, method)(task.context, uuid, topic=topic)
If another process obtains a lock between the releasing of resources and
the acquiring of the lock during the continue_node_* operation, and
holds the lock longer than the max attempt * interval window (which
defaults to 3 seconds), then the handoff will never complete. Beyond
that, because there is no proper queue for processes waiting on the
lock, there is no fairness, so it's also possible that instead of one
long lock being held, the lock is obtained and held for a short window
several times by other competing processes.
This manifests as nodes occasionally getting stuck in the "DEPLOYING"
state during a deploy. For example, a user may attempt to open or access
the serial console before the deploy is complete--the serial console
process obtains a lock and starves the conductor of the lock, so the
conductor cannot finish the deploy. It's also possible a long heartbeat
or badly-timed sequence of heartbeats could do the same.
To fix this, this commit introduces the concept of a "patient" lock,
which will retry indefinitely until it doesn't encounter the NodeLocked
exception. This overrides any retry behavior.
.. note::
There may be other cases where such a lock is desired.
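A minimal sketch of the patient retry idea, not the actual Ironic
implementation. The ``NodeLocked`` class here is a local stand-in, and
``acquire_with_retry``, ``reserve``, and the default attempt count and
interval are hypothetical names and values chosen for illustration:

```python
import time


class NodeLocked(Exception):
    """Local stand-in for the error raised when a node is already locked."""


def acquire_with_retry(reserve, max_attempts=3, interval=1.0,
                       patient=False, sleep=time.sleep):
    """Call reserve() until it succeeds and return its result.

    With patient=True the attempt limit is ignored, so NodeLocked is
    retried indefinitely. Otherwise NodeLocked is re-raised once
    max_attempts attempts have failed.
    """
    attempt = 0
    while True:
        attempt += 1
        try:
            return reserve()
        except NodeLocked:
            if not patient and attempt >= max_attempts:
                raise
            sleep(interval)
```

With ``patient=True`` a handoff would keep waiting however long a competing
process holds the lock, at the cost of never giving up on its own.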
Story: #2008323
Change-Id: I9937fab18a50111ec56a3fd023cdb9d510a1e990
Previously, disk labels were not populated unless explicitly set by
an API user. This led to a dangerous case that sometimes worked but
was ultimately wrong: setting up a UEFI booting machine with a BIOS
MBR partition table.
Not all systems support this, but UEFI systems are supposed to
support GPT partition tables.
We now fall back to GPT when no explicit override is set and the
machine is set to UEFI mode.
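The fallback can be sketched as follows. This is an illustration only:
``choose_disk_label`` and its parameters are hypothetical names, and
returning ``None`` for the non-UEFI case is an assumption standing in for
whatever the existing default behavior is:

```python
def choose_disk_label(explicit_label, boot_mode):
    """Pick a disk label, preferring an explicit override.

    explicit_label: label set by the API user, or None.
    boot_mode: 'uefi' or 'bios' (assumed string values).
    """
    if explicit_label:
        # An explicit override always wins.
        return explicit_label
    if boot_mode == 'uefi':
        # UEFI systems are supposed to support GPT.
        return 'gpt'
    # No override and not UEFI: leave the label unset (assumption).
    return None
```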
Change-Id: I001d8c6ee3b1d6c466c71ea5179bdbca9bdd692d
Currently we're building a VFAT image with the network data just
to unpack it back on the next step. Just pass the file directly.
This fixes a permission denied problem on Bifrost on Fedora
(at least).
As a nice side effect, the change reduces the amount of IO done
for virtual media quite substantially.
Change-Id: I5499fa42c1d82a1a29099fbbba6f45d440448b72
When moving the node to ``manageable``, in addition to
``cleaning``, retrieve the BIOS configuration settings. In the
case of ``manageable``, this may allow the settings to be used
when choosing which node to deploy.
Change-Id: Ic2b162f31d4a1465fcb61671e7f48b3d31de788c
Story: 2008326
Task: 41224
The base implementation of get_node_network_data returns {} and is
not overridden in the noop network. Update the base implementation
to use task.node.network_data and remove the excessive logging.
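A rough sketch of the described behavior, using stand-in classes rather
than the real Ironic interfaces. ``BaseNetwork`` and ``NoopNetwork`` are
illustrative names, and falling back to ``{}`` when no data is set is an
assumption:

```python
from types import SimpleNamespace


class BaseNetwork:
    """Stand-in for a base network interface."""

    def get_node_network_data(self, task):
        # Use the node's own network_data instead of always returning {}.
        return task.node.network_data or {}


class NoopNetwork(BaseNetwork):
    """Inherits get_node_network_data without overriding it."""


# Hypothetical task object carrying a node with static network data.
task = SimpleNamespace(node=SimpleNamespace(network_data={'links': []}))
data = NoopNetwork().get_node_network_data(task)
```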
Change-Id: Ie50dcd1c2a151f5dd09794467792527032249809
When a port group has no MAC address assigned by operators, provisioning
unbinds and binds the tenant port with None, which causes the MAC address to
be regenerated twice. The result differs from the address originally
allocated by nova or users and packed into the config drive, so the bond port
is configured with a different MAC address and can't get the IP address
from neutron.
Change-Id: I92ed5d17239216324d6a69e0ed8771fd6948d6ec
Story: 2008300
Task: 41185
Each ironic-api process consumes a non-negligible amount of RAM,
so defaulting to the CPU core count may result in many hundreds of
megabytes occupied by ironic-api processes. Limit the default value
to 4 and let people who actually need more than that pick their value.
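The capped default amounts to something like the following sketch.
``default_api_workers`` is a hypothetical helper name, not the option
itself:

```python
import os


def default_api_workers(cpu_count=None, limit=4):
    """Return the default worker count: CPU cores, capped at limit."""
    if cpu_count is None:
        # os.cpu_count() may return None on some platforms.
        cpu_count = os.cpu_count() or 1
    return min(cpu_count, limit)
```

On a 64-core host this yields 4 workers instead of 64, while a 2-core
host still gets 2.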
Change-Id: I5aefa8c6c7aadc56aea151647e1c0a5af54ada4c
Instead of using process_event('fail') use error_handlers,
otherwise in case of failure node gets stuck and fails
because of timeout, instead of failing earlier due to
step failure.
Story: 2008307
Task: 41194
Change-Id: Ieec0173f57367587985d2baad77205bb83e8b69a
Instead of using process_event('fail') use error_handlers,
otherwise in case of failure node gets stuck and fails
because of timeout, instead of failing earlier due to
step failure.
Besides adding new unit tests, also update related unit tests
to test for success correctly and have realistic data.
Story: 2008307
Task: 41196
Change-Id: If28ccb252a87610e3fd3dc78e1ed75bb8ca1cdcf
Calling prepare_ramdisk may break fast-track, as it's the case with
redfish-virtual-media (it powers nodes off unconditionally). To
avoid timeouts, check fast-track status again after prepare_ramdisk.
Change-Id: Iad2d6f4827bd7e8b2a02005fe18d31ec8d37db97
The agent command exec model is based upon an incoming
heartbeat; however, heartbeats are independent and
commands can take a long time. For example, software RAID
setup in CI can encounter this.
From an IPA log:
[-] Picked root device /dev/md0 for node c6ca0af2-baec-40d6-879d-cbb5c751aafb
based on root device hints {'name': '/dev/md0'}
[-] Attempting to download image from http://199.204.45.248:3928/agent_images/
c6ca0af2-baec-40d6-879d-cbb5c751aafb
[-] Executing command: standby.get_partition_uuids with args: {} execute_command
/usr/local/lib/python3.6/site-packages/ironic_python_agent/extensions/base.py:255
[-] Tried to execute standby.get_partition_uuids, agent is still executing Command name:
execute_deploy_step, params: {'step': {'interface': 'deploy', 'step': 'write_image',
'args': {'image_info': {'id': 'cb9e199a-af1b-4a6f-b00e-f284008b8046',
'urls': ['http://199.204.45.248:3928/agent_images/c6ca0af2-baec-40d6-879d-cbb5c751aafb'],
'disk_format': 'raw', 'container_format': 'bare', 'stream_raw_images': True, 'os_hash_algo':
'sha512', 'os_hash_value':<trimed>
This was with code built on master, using master images.
The conductor log notes that this is likely an out-of-date
agent because only AgentAPIError is evaluated; however, any
API error is evaluated this way. In reality, we need
to explicitly flag *when* we have an error because
we've tried too soon, as something is already being worked upon.
The result is to evaluate and return an exception indicating work
is already in flight.
Update - It looks like the original fix to prevent busy agent
recognition did not fully detect all cases, as getting steps is a
command which can get skipped by accident with a busy agent, under
certain circumstances.
Change I5d86878b5ed6142ed2630adee78c0867c49b663f in ironic-python-agent
also changed the string that was being checked for the previous
handling, where we really should have just made the string we were
checking lower case in ironic. Oh well! This should fix things
right up.
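A small sketch of the case-insensitive check mentioned above. The marker
text is taken from the IPA log excerpt in this message; whether it is the
exact string ironic matches on is an assumption, and ``is_agent_busy`` is
a hypothetical helper name:

```python
# Marker taken from the log excerpt; assumed, not confirmed, to be
# the text ironic checks for.
BUSY_MARKER = 'agent is still executing'


def is_agent_busy(error_message):
    """Return True if the agent error says a command is still running.

    Lower-casing the message keeps the check working if the agent
    changes the capitalization of its error string.
    """
    return BUSY_MARKER in error_message.lower()
```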
Story: 2008167
Task: 41175
Change-Id: Ia169640b7084d17d26f22e457c7af512db6d21d6
If the conductor's host option is configured to be an IPv6 address, we
need to surround it with [] when incorporating it in a URL to contact
the conductor over json-rpc.
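One way to do the bracketing, sketched with the standard ``ipaddress``
module. ``format_host_for_url`` is a hypothetical helper name, and the
port in the usage line is only an example value:

```python
import ipaddress


def format_host_for_url(host):
    """Wrap host in [] if it is a literal IPv6 address."""
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal, so treat it as a hostname.
        return host
    if address.version == 6:
        return '[%s]' % host
    return host


# Example port chosen for illustration only.
url = 'http://%s:%d' % (format_host_for_url('2001:db8::1'), 8089)
```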
Change-Id: Ib3bc4c570ec0f2e5c73e3ce15b05684b8e4c1ff9
Story: 2008288
Task: 41166
An empty physical_network can be set on a port, making the port
unusable.
Change-Id: I58cf04839f40922cf0c7ddffc08b843cb3c50e06
Story: 2008279
Task: 41153
Using software RAID with whole disk images requires specifying
a root partition UUID, but it is only possible through Glance.
This change adds an explicit field for that.
Change-Id: I55e3727aab3960ef472ec2db1f23c25db405e801
After changing the boot device via Redfish, check that the boot mode being
reported matches what is configured and, if not, set it to the configured
value. Some BMCs change the boot mode when the device is set via Redfish;
this ensures the mode is set properly.
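The check-and-restore step can be sketched independently of any Redfish
library. ``ensure_boot_mode`` and the ``get_mode``/``set_mode`` callables
are hypothetical stand-ins for the calls that talk to the BMC:

```python
def ensure_boot_mode(get_mode, set_mode, configured_mode):
    """Restore the configured boot mode if the BMC reports another one.

    get_mode: callable returning the boot mode the BMC reports.
    set_mode: callable taking the boot mode to apply.
    Returns True if the mode had to be corrected.
    """
    reported_mode = get_mode()
    if reported_mode != configured_mode:
        set_mode(configured_mode)
        return True
    return False
```

This would run right after the boot device is changed, so a BMC that
silently flips the mode is corrected before the node boots.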
Change-Id: Ib077f7f32de029833e6bd936853c382305bce36e
Story: 2008252
Task: 41103
Calculating the ipmitool `-N` and `-R` arguments from ironic.conf
[ipmi] `command_retry_timeout` and `min_command_interval` now takes
into account the 1 second interval increment that ipmitool adds on
each retry event.
Failure-path ipmitool run duration will now be just less than
`command_retry_timeout` instead of much longer.
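A sketch of how such a calculation could work, assuming ipmitool waits
`-N` seconds before the first retry and one second longer before each
subsequent one. ``calculate_retries`` and ``ipmitool_timing_args`` are
hypothetical names, not the functions in the ipmitool driver:

```python
def calculate_retries(timeout, interval):
    """Return (num_tries, duration) that fit within timeout seconds.

    Each attempt waits one second longer than the previous one,
    starting from interval.
    """
    num_tries = 0
    duration = 0
    wait = interval
    while duration + wait <= timeout:
        duration += wait
        wait += 1  # ipmitool adds one second on each retry
        num_tries += 1
    # Always try at least once, even if interval exceeds timeout.
    return max(num_tries, 1), duration


def ipmitool_timing_args(timeout, interval):
    """Build the -R (retries) and -N (interval) arguments."""
    num_tries, _ = calculate_retries(timeout, interval)
    return ['-R', str(num_tries), '-N', str(interval)]
```

For a timeout of 60 and an interval of 5 the waits are 5, 6, 7, 8, 9, 10
and 11 seconds, so seven tries take 56 seconds and an eighth would
overshoot the timeout.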
Change-Id: Ia3d8d85497651290c62341ac121e2aa438b4ac50
The root device hint is not guaranteed to be valid after RAID
configuration in most cases, which could cause no matching device
to be found and the deployment to fail. AgentRAID implementations
can return the correct root device hint from create_configuration.
Change-Id: Iab97a16ef8ccea8186f0cc7a14b77d508804fc8d
Because of using an incorrect variable, reboot was treated as power on,
and the token was not wiped.
Change-Id: I656450c2bedc3dc0d20a70de78cc29bf64d5fe85
Story: #2008097
Task: #40799
This commit fixes an issue with inspecting nodes using idrac when
using redfish virtual media. We are using the redfish configuration
so it can be backported.
Change-Id: I478c25fac13b49867349c2d9fc8d206c9994c398
Story: #2008221
Task: #41010
This patch implements reporting the number of NVIDIA Tesla T4
devices connected to a system by discovering such devices
and reporting them through capability 'pci_gpu_devices'.
Change-Id: If713895f05f08a9827c4c085108abb3e388b2a2e
Story: 2008118
Task: 40839
Depends-On: https://review.opendev.org/#/c/750364/