Allows nodes with a single IP stack to be deployed from a dual-stack
Ironic.
The advertised address and the usable Ironic URLs are detected completely
independently, which does leave some room for misconfiguration. I hope
it's not likely in reality, especially since this feature is
targeting advanced standalone users.
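A minimal sketch of the idea, not the actual implementation; the function
name and flags here are illustrative:

    import ipaddress
    from urllib.parse import urlparse

    def usable_api_urls(api_urls, have_ipv4=True, have_ipv6=False):
        """Keep only the Ironic API URLs a single-stack node can reach."""
        usable = []
        for url in api_urls:
            host = urlparse(url).hostname
            try:
                version = ipaddress.ip_address(host).version
            except ValueError:
                # A hostname rather than an IP literal: keep it and let DNS
                # resolution sort out the address family.
                usable.append(url)
                continue
            if (version == 4 and have_ipv4) or (version == 6 and have_ipv6):
                usable.append(url)
        return usable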
Change-Id: Ifa506c58caebe00b37167d329b81c166cdb323f2
Closes-Bug: #2045548
Somehow, it has worked correctly for years, but now I've discovered that
the new inspection is (no longer?) tolerant of the missing header.
While here, copy all headers from the heartbeat code.
Change-Id: I9e5c609eb4435e520bc225dea08aedfdf169744b
This fixes several spelling issues identified by codespell. In some
cases, I may have manually modified a line to make the output more clear
or to correct grammatical issues which were obvious in the codespell
output.
Later changes in this chain will provide the codespell config used to
generate this, as well as adding this commit's SHA, once landed, to a
.git-blame-ignore-revs file to ensure it will not pollute git history
for modern clients.
Related-Bug: 2047654
Change-Id: I240cf8484865c9b748ceb51f3c7b9fd973cb5ada
Currently, if heartbeat fails, we reschedule it after 5 seconds.
This is fine for the first retry, but it can cause a thundering herd
problem when a lot of nodes fail to heartbeat at once.
This change adds jitter to the minimum wait of 5 seconds. The jitter is
not applied for forced heartbeats: they still have a minimum wait of
exactly 5 seconds from the last heartbeat.
The code is reordered to move the interval calculation into one place.
Bonus: the next interval is now logged correctly.
The unit tests have been rewritten to test the heartbeat process step by
step and not rely on the exact sequence of the calls.
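A rough sketch of the interval calculation described above (the names and
base interval are illustrative, not the agent's real attributes):

    import random
    import time

    MIN_WAIT = 5

    def next_heartbeat_in(last_heartbeat, base_interval, error=False,
                          forced=False):
        # last_heartbeat is assumed to be a time.monotonic() timestamp.
        if forced:
            # Forced heartbeats keep the exact 5-second floor.
            wait = MIN_WAIT
        elif error:
            # Spread failed-heartbeat retries between 5 and 10 seconds so a
            # whole fleet of nodes does not retry at the same instant.
            wait = MIN_WAIT + random.uniform(0, MIN_WAIT)
        else:
            wait = base_interval
        return max(0, last_heartbeat + wait - time.monotonic())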
Closes-Bug: #2038438
Change-Id: I4c4207b15fb3d48b55e340b7b3b54af833f92cb5
* ProxyError is derived from ConnectionError, but it's necessary
to check the Response object to identify it.
- Added ProxyError to retry_if_exception_type (see the sketch below)
- Updated _post_to_inspector to properly handle ProxyError
- Updated the wait to use wait_exponential instead of wait_fixed.
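A sketch of the resulting retry policy; the timings and body are
illustrative, not the module's exact code:

    import requests
    import tenacity

    @tenacity.retry(
        retry=tenacity.retry_if_exception_type(
            (requests.exceptions.ConnectionError,
             requests.exceptions.ProxyError)),
        wait=tenacity.wait_exponential(multiplier=1, max=30),
        stop=tenacity.stop_after_attempt(5),
        reraise=True)
    def _post_to_inspector(url, data):
        # Proxy failures surface as ProxyError (a ConnectionError subclass)
        # and are retried with exponential backoff by the decorator above.
        resp = requests.post(url, json=data, timeout=60)
        resp.raise_for_status()
        return resp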
Closes-Bug: 2045429
Change-Id: Iefe3fe581cd4e7c91a0da708e6f6d0fdaacab6fe
This reverts commit 33f01fa3c2f32f447ed36f00fea68321c3991c2e.
There are a few issues with the patch - see my comments there.
The most pressing and the reasons to revert are:
1) It breaks deployments when the vmedia is present but does not
have a network_data.json (the case for Metal3).
2) It assumes the presence of Glean which may not be the case.
Neither Julia nor I have time to thoroughly fix the issue,
leaving a revert as the only option to unblock Metal3.
Change-Id: I3f1a18a4910308699ca8f88d8e814c5efa78baee
Closes-Bug: #2045255
In some cases the output of multipath can differ,
and we would return the wrong parent device.
Closes-Bug: 2043992
Change-Id: I848d7df798cc736bd5a55eed8fa46110caea1dc3
This commit:
- fixes some "multipathd error handling improvement"
release notes
- fixes a related comment in the code
Related launchpad issue https://bugs.launchpad.net/ironic-python-agent/+bug/2031092
Change-Id: Ie3ba0601fa117b053cb8db6284e47249ca9c9134
Signed-off-by: Adam Rozman <adam.rozman@est.tech>
When performing DHCP-less deployments, the agent can start and
discover more than one configuration drive present on a host.
For example, a host was previously deployed using Ironic and
is now being re-deployed.
If Glean was present in the ramdisk, glean-early.sh would end up
mounting the folder based upon its label.
If cloud-init is somehow still in the ramdisk, the other folder
could get mounted instead.
This patch, which is intended to be backportable, causes the agent
to unmount any configuration drive folders, mount the most likely
candidate based upon device type, partition, and overall state of
the machine, and then utilize that configuration, if present,
to re-configure and reload networking.
This allows DHCP-less re-deployments to be fixed without
forcing any breaking changes.
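A loose sketch of the approach, assuming config drives carry the usual
config-2 label; the real logic weighs more machine state than this:

    import subprocess

    def remount_config_drive(mountpoint='/mnt/config'):
        # Unmount anything a ramdisk tool (glean, cloud-init) mounted already.
        subprocess.run(['umount', mountpoint], check=False)
        # Find every block device labelled as a config drive.
        out = subprocess.run(
            ['blkid', '-t', 'LABEL=config-2', '-o', 'device'],
            capture_output=True, text=True, check=False).stdout
        candidates = out.split()
        if not candidates:
            return None
        # Prefer a real partition (e.g. /dev/sda2) over a whole device.
        candidates.sort(key=lambda dev: dev[-1].isdigit(), reverse=True)
        best = candidates[0]
        subprocess.run(['mount', best, mountpoint], check=True)
        return best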
It should also be noted that this fix was generated in concert
with an additional tempest test case, because this overall failure
case needed to be reproduced to ensure we had a workable non-breaking
path forward.
Closes-Bug: 2032377
Change-Id: I9a3b3dbb9ca98771ce2decf893eba7a4c1890eee
Image caching was never fully supported in Ironic or IPA; this is vestigial
code left over from a partial implementation.
Even if we implemented it today, we'd likely use a completely different
methodology.
Change-Id: Id4ab7b3c4f106b209585dbd090cdcb229b1daa73
IPA now includes the NUMA node ID when collecting
information about PCI devices.
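Roughly, the value can come from sysfs; a sketch (the real collector may
read it differently):

    def pci_numa_node(pci_address):
        """Return the NUMA node for a PCI device, or None if unknown."""
        path = '/sys/bus/pci/devices/%s/numa_node' % pci_address
        try:
            with open(path) as f:
                node = int(f.read().strip())
        except (OSError, ValueError):
            return None
        # The kernel reports -1 when the device has no NUMA affinity.
        return node if node >= 0 else None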
Closes-bug: #1622940
Co-Authored-By: Jay Faulkner <jay@jvf.cc>
Change-Id: I70b0cb3eff66d67bb8168982acbbf335de0599cd
This commit:
- Adds the ability to ignore the inconsequential OS error caused
by starting the multipathd service when an instance of the
service is already running.
Related launchpad issue https://bugs.launchpad.net/ironic-python-agent/+bug/2031092
Change-Id: Iebf486915bfdc2546451e6b38a450b4c241e43a8
HTTP is a fun protocol.
Size is basically optional, and clients implicitly trust that the server
and the socket have transferred all the bytes. Which *really* means you
should always checksum.
But... previously we didn't checksum as part of retrying.
So if anything happened in python-requests, lower-level library code,
or the system itself that caused bytes to be lost off the buffer,
creating an incomplete transfer, we wouldn't know until the final
checksum.
So now, we checksum and re-trigger the download if the checksum fails.
This involved a minor shift in the download logic, and resulted in a
necessary minor fix to an image checksum test, which would otherwise
loop for 90 seconds as well.
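In rough terms (function and parameter names are illustrative, not the
actual code):

    import hashlib
    import requests

    def download_verified(url, path, expected_sha256, attempts=3):
        for _ in range(attempts):
            digest = hashlib.sha256()
            with requests.get(url, stream=True, timeout=60) as resp, \
                    open(path, 'wb') as out:
                resp.raise_for_status()
                for chunk in resp.iter_content(chunk_size=1024 * 1024):
                    digest.update(chunk)
                    out.write(chunk)
            if digest.hexdigest() == expected_sha256:
                return
            # Short read or corrupted transfer: retry the whole download.
        raise RuntimeError('checksum mismatch after %d attempts' % attempts)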
Closes-Bug: 2038934
Change-Id: I543a60555a2621b49dd7b6564bd0654a46db2e9a
Add file to the reno documentation build to show release notes for
stable/2023.2.
Use pbr instruction to increment the minor version number
automatically so that master versions are higher than the versions on
stable/2023.2.
Sem-Ver: feature
Change-Id: I8150eb8f35a444ef5a2bc7a648ec301e5094e52d
Follow-up from service steps addition change to add a deploy steps
alias for the Nvidia Mellanox network device firmware update clean
steps. This allows deploy time firmware updates to be codified as
part of a deployment with custom steps.
Change-Id: I9d80447dee7cfde4d3f8d81d9d39e738916b7824
Initial code patches for service steps have merged in
ironic, and it is now time to add support to the
agent which allows service steps to be raised to
the service.
Updates the default hardware manager version to 1.2,
which has *rarely* been incremented due to oversight.
Change-Id: Iabd2c6c551389ec3c24e94b71245b1250345f7a7
Changes the default lookup timeout to 600 seconds, which
reduces the risk of lookup failing: a write operation
to the backing database is performed upon lookup thanks to
the generation of an agent token.
Overall, this is fairly harmless since by default ramdisks
restart the agent if they were not able to successfully
start.
Change-Id: I35c64c0b4f9b3b607df1bc0c4c2a852aa3595cbd
When an underlying block device (or driver) only supports 4KB IO,
this can cause issues with things like an ISO9660 filesystem,
which can only support a maximum of 2KB IO.
The agent will now attempt to mount the filesystem *before* deleting the
supplied file, and should that fail it will mount the configuration drive
file from the ramdisk utilizing a loopback, and then extract its contents
into a newly created VFAT filesystem which supports 4KB
block IO.
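A very rough sketch of the fallback, with illustrative paths and without
the real error handling:

    import os
    import subprocess

    def mount_config_drive(partition, saved_iso, mountpoint='/mnt/config'):
        os.makedirs(mountpoint, exist_ok=True)
        # First try the partition directly; works on 512-byte IO devices.
        if subprocess.run(['mount', partition, mountpoint]).returncode == 0:
            return
        # 4KB-only device: read the saved ISO via a loopback mount instead,
        # then rewrite its contents onto a fresh VFAT filesystem.
        os.makedirs('/mnt/iso', exist_ok=True)
        subprocess.run(['mount', '-o', 'loop', saved_iso, '/mnt/iso'],
                       check=True)
        subprocess.run(['mkfs.vfat', partition], check=True)
        subprocess.run(['mount', partition, mountpoint], check=True)
        subprocess.run('cp -a /mnt/iso/. ' + mountpoint + '/',
                       shell=True, check=True)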
Closes-Bug: #2028002
Change-Id: I336acb8e8eb5a02dde2f5e24c258e23797d200ee
If the node is locked, a lookup cannot be performed when an agent
token needs to be generated, which tends to error like this:
ironic_python_agent.ironic_api_client [-] Failed looking up node
with addresses '00:6f:bb:34:b3:4d,00:6f:bb:34:b3:4b' at
https://172.22.0.2:6385. Error 409: Node
c25e451b-d2fb-4168-b690-f15bc8365520 is locked by host 172.22.0.2,
please retry after the current operation is completed..
Check if inspection has completed.
Problem is, if we keep pounding on the door, we can actually worsen
the situation, and previously we would just let tenacity
retry.
We will now hold for 30 seconds before proceeding, so we have
hopefully allowed the operation to complete.
Also fixes the error logging to preserve humans' sanity.
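Schematically, with illustrative names and a hard-coded hold:

    import time

    LOCKED_HOLD = 30  # seconds

    def lookup_node(session, lookup_url, params):
        resp = session.get(lookup_url, params=params, timeout=60)
        if resp.status_code == 409:
            # The node is locked; give the in-flight operation a chance to
            # finish instead of immediately hammering the API again.
            time.sleep(LOCKED_HOLD)
            return None  # the outer retry loop will try again
        resp.raise_for_status()
        return resp.json()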
Change-Id: I97d3e27e2adb731794a7746737d3788c6e7977a0
Rebuilding an instance on RAIDed ESPs will fail due to sgdisk
running against a non-clean disk and bailing out. Check if there
is a RAIDed ESP already and skip creation if it exists.
Change-Id: I13617ae77515a9d34bc4bb3caf9fae73d5e4e578
Bandit 1.7.5 was released with a timeout check for all requests and
urllib calls.
Fixed those.
In the process, this exposed a Bandit B310 issue, which was already
covered by the code; it is now explicitly marked as such.
Also enables the Bandit checks to be voting in CI.
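The shape of the fix (URLs here are placeholders):

    import urllib.request

    import requests

    # Every HTTP call now carries an explicit timeout.
    resp = requests.get('http://192.0.2.1:6385/v1', timeout=30)

    # The scheme of this URL is validated elsewhere, so the B310 finding is
    # a false positive and is explicitly marked as reviewed.
    data = urllib.request.urlopen('http://192.0.2.1/metadata',  # nosec
                                  timeout=30).read()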
Change-Id: If0e87790191f5f3648366d571e1d85dd7393a548
This was a significant breaking change that was landed despite explicit
disagreement by some community members (myself included). It has already
resulted in an accidental Ironic CI breakage, has broken Bifrost and has
a potential of breaking Metal3. In case of Metal3, MD5 support is a part
of its public API.
While MD5 is a potential security hazard, I don't see the need to hurry
this change without giving the community time to prepare. This change
reverts the default of the new option md5_enabled back to True.
Change-Id: I32b291ea162e8eb22429712c15cb5b225a6daafd
The CentOS Stream SUM files use the format:
# FILENAME: <size> bytes
ALGORITHM (FILENAME) = CHECKSUM
Compared to the more common format:
CHECKSUM *FILE_A
CHECKSUM FILE_B
Use regular expressions to check for the filename both
in the middle with parentheses and at the end of the line.
Similarly, look for valid checksums at the beginning or
end of the line. Also look for known checksum patterns in
case the file only contains the checksum itself.
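Illustrative patterns (not the exact expressions that landed):

    import re

    CHECKSUM = r'[a-fA-F0-9]{32,128}'

    def checksum_for(filename, sum_file_text):
        fname = re.escape(filename)
        for line in sum_file_text.splitlines():
            line = line.strip()
            # CentOS Stream style: "ALGORITHM (FILENAME) = CHECKSUM"
            m = re.match(r'^\w+ \(%s\) = (%s)$' % (fname, CHECKSUM), line)
            if m:
                return m.group(1)
            # Common style: "CHECKSUM *FILENAME" or "CHECKSUM  FILENAME"
            m = re.match(r'^(%s)\s+\*?%s$' % (CHECKSUM, fname), line)
            if m:
                return m.group(1)
            # A file that contains nothing but the checksum itself.
            if re.fullmatch(CHECKSUM, line):
                return line
        return None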
Change-Id: I9e49c1a6c66e51a7b884485f0bcaf7f1802bda33
Binary LLDP data is bloating the inventory, causing us to disable its collection
by default. For other similar low-level information, such as PCI devices
or DMI data, we already use inspection collectors instead. Now that the
inventory format is shared with out-of-band inspection, having LLDP
there makes even less sense.
This change adds a new collector ``lldp`` to replace the now-deprecated
inventory field.
Change-Id: I56be06a7d1db28407e1128c198c12bea0809d3a3
MD5 image checksums have long been superseded by the use of the
``os_hash_algo`` and ``os_hash_value`` fields as part of the
properties of an image.
In the process of doing this, we determined that checksum-via-URL
usage was non-trivial, and that an appropriate path was to allow
the checksum type to be determined as needed.
Change-Id: I26ba8f8c37d663096f558e83028ff463d31bd4e6
The tl;dr is that UEFI NVRAM is encoded
in UTF-16, and when we run the efibootmgr command,
we can get Unicode characters back.
Except we previously were forcing everything to be
treated as UTF-8 due to the way oslo.concurrency's
processutils module works.
This could be observed with Unicode character 0x00FF,
which raises a nice exception when we try to
decode it.
Anyhow! While fixing the handling of this, we discovered
we could basically get cruft out of the NVRAM,
in the form of what was most likely a truncated string
out of our own test VMs. As such, we also need to
permit decoding to be tolerant of failures.
This could be binary data, or something as simple as flipped
bits which get interpreted as invalid characters.
As such, we have introduced such data into one of our
tests involving UEFI record de-duplication.
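In essence (a sketch, assuming oslo.concurrency's binary flag; the real
code lives in the UEFI boot record handling):

    from oslo_concurrency import processutils

    def read_efibootmgr_output():
        # Ask for raw bytes so processutils does not try to decode as UTF-8.
        out, _err = processutils.execute('efibootmgr', '-v', binary=True)
        # Drop anything that is not valid UTF-8 (truncated UTF-16 leftovers,
        # flipped bits) instead of blowing up on a decode error.
        return out.decode('utf-8', errors='ignore')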
Closes-Bug: 2015602
Change-Id: I006535bf124379ed65443c7b283bc99ecc95568b