In order to ease logging of the various burn-in steps, this patch
proposes options to define the outpout files for all burn-in steps:
{'agent_burnin_cpu', 'agent_burnin_vm', 'agent_burnin_fio_network',
'agent_burnin_fio_disk'}_outputfile via a node's driver-info.
Story: #2007523
Task: #44102
Change-Id: I327cae5949d38e738d3c535487b3795d00ad8f1e
Doing this will cause it not to zero out the entire
block device which can be very costly on a slow HDD.
Story: 2009227
Task: 43315
Change-Id: I62ba2afc037d9844387e6b0984fe5008779d95d2
Add the option to run a SMART self test right after
the disk burn-in. The disk burn-in step will fail if
the SMART test on any of the disk fails.
Story: #2007523
Task: #43383
Change-Id: I1312d5b71bedd044581a136af0b4c43769d21877
Use add instead of update to re-read the partition table with partx.
See [1] for more details.
Co-authored-by: Arne Wiebalck <arne.wiebalck@cern.ch>
[1] https: //opendev.org/openstack/ironic-python-agent/commit/dc8c1f16f9a00e2bff21612d1a9cf0ea0f3addf0
Change-Id: I2336e22dadc790cfbde87904612fcaa3b8c501db
Re-read the partition table with 'partx -a', rather than 'partx -u'.
This should fix an timing issue where the bootloader installation
fails to mount the EFI partition from a whole disk image since it
is not yet aware of the new partitions (observed with both, the
iscsi and the direct deploy interface).
Change-Id: If5da3075e813ae01df3decf8f0647aba111b0515
I accidently put colons on the test data and remembered taking the
colon character out of the regex I was working on, but apparently
left it in, and accounted for the active entry indicator flag
which appears to have inconsistent support across vendors.
The regex has been fixed, and a test added from a Lenovo SR650
which has some additional string entry data in the UEFI output
which may separate entries.
Change-Id: I1f67b0fb1f645fa82e98bd7c7bba3ffc7755cc74
Some firmware seems to take an objection with EFI nvram
entries being deleted after one is added, resulting in the
entire entry table being reset to the last known good state.
This is problematic, as ultimately deployments can time out
if we previously booted with Networking, and the machine, while
commanded to do other wise, reboots back to networking regardless.
We will now delete entries first, before proceeding.
Additionally, for general use, this pattern may serve the
community better by avoiding cases where we would have
previously just relied upon efibootmgr[0] to warn us of duplicate
entries.
[0]: 103aa22ece/src/efibootmgr.c (L228)
Change-Id: Ib61a7100a059e79a8b0901fd8f46b9bc41d657dc
Story: 2009649
Task: 43808
Even if journald is present, there is no guarantee that IPA logs there
(this is the case in container-based ramdisks).
Change-Id: Iceeab0010827728711e19e5b031ccac55fe1efde
* Use the same TLS parameters as everything else
* Respect image_download_connection_timeout
* Do not ignore HTTP errors
Change-Id: I84f8021f731186d82e44ac3d4ef2d12df13f830a
The EFI partition UUID may be None and this will break
the fstab editing. While this is not necessarily fatal when
instantiating a node, it creates an exception at the end of
bootloader installation, so only attempt to add a line to
fstab when the UUID is not None.
Change-Id: I68799980e67c05afe4ca68ca9733605dd166d54d
This patch fixes a race during software RAID creation:
we create the partition with parted, the kernel then
notifies udev, but we need to wait for udevd to create
the device files before calling mdadm to create the
md device.
Credits to jcosmao for finding this.
Change-Id: I642f28acc351cf50263e37dfbc8468bf59de2cc5
Add file to the reno documentation build to show release notes for
stable/xena.
Use pbr instruction to increment the minor version number
automatically so that master versions are higher than the versions on
stable/xena.
Sem-Ver: feature
Change-Id: If28b1df9c76469062e6d9ce28edcf3026fdbfbaa
This exposes the MAC address of the first LAN channel with an assigned
IP address in the inventory data. This is useful for inventory
processes where the asset number is not discoverable from the software
side: the BMC MAC is going to be unique (at least within an
organization).
Change-Id: I8a4bee0c25743befd7f2033e4e0cba26895c8926
In order to make sure we have the correct time early, e.g.
by the time we create a TLS certificate, this patch proposes
to force an immediate NTP update when using chronyd. While
the previous approach uses the passed NTP server as well, the
update may happen only after chronyd has performed measurements
(which may be too late).
Story: #2009058
Task: #42843
Change-Id: I6edafe8edeb8549f324959e7a1ec175c3049a515
Add a clean step for network burn-in via fio. Get basic
run parameters from the node's driver_info.
Story: #2007523
Task: #42385
Change-Id: I2861696740b2de9ec38f7e9fc2c5e448c009d0bf
Check if the ESP is already mounted before attempting to mount it
for the bootloader installation.
Change-Id: Ifd738b2c5663f1a211d7e13b5ba386be631d8db1
The IPA sends heartbeats to the conductor periodically and when
requested, e.g. at the end of asynchronous commands. In order
to avoid to send such notifications in too quick succession,
e.g. when two asynchronous commands finish at the same time or
when the periodic heartbeat was just sent right before a command
ended, this patch proposes to coalesce heartbeats which are
close together timewise and send only one for all of them
in a time interval of 5 seconds.
Co-Authored-By: Arne Wiebalck <arne.wiebalck@cern.ch>
Story: #2008983
Task: 42633
Change-Id: Idfbce44065e1e5a8b730b94741b2604c51f0ab14
Adds support to identify and utilize a CSV file to signal which
bootloader to utilize, and set it when the OS is running as opposed
to when EFI is running. This works around EFI loader potentially
crashing some vendors hardware types when entry stored in the
image does not match the EFI loader record which was utilzied to
boot.
Grub2+shim specifically specifically needs the CSV file name
and entry label to match what the system was booted with in order
to prevent the machine from potentially crashing.
See https://storyboard.openstack.org/#!/story/2008962
and https://bugzilla.redhat.com/show_bug.cgi?id=1966129#c37
for more information.
Change-Id: Ibf1ef4fe0764c0a6f1a39cb7eebc23ecc0ee177d
Story: 2008962
Task: 42598
Co-Authored-By: Bob Fournier <bfournie@redhat.com>
Recent releases of redhat grub2 will always fail when installing to
EFI paths, to encourage a transition to the signed shim bootloader.
Partition image deploys avoid calling grub2-install with the
preserve-efi-assets functions. Deploying whole disk images doesn't
require grub2-install. This leaves whole disk images installed onto
softraid devices, which still attempts to call grub2-install.
This change will still attempt to run grub2-install in this
one remaining case, but will ignore any failure.
A future enhancement can avoid calling grub2-install entirely so that
non-redhat secure-boot capable images can keep their signed
bootloaders.
Story: 2008923
Task: 42521
Change-Id: If432ef795d64d76442d739eb4f7d155ff847041e
We're currently requiring it twice: in image_info and in a separate
configdrive argument. I think we should eventually settle on separate
arguments for separate entities, so this change makes the value in
image_info optional with a goal to stop accepting it.
We could probably just remove the handling in image_info, but a
deprecation is safer.
The (unused in ironic) cache_image call is updated with an optional
configdrive arguments.
Story: #2008904
Task: #42480
Change-Id: I1e2efa28efa3ea7e389774cb7633d916757bc6ed
qemu-img attempts to launch multiple threads by default *and*
attempts to have multiple memory allocation arenas to operate
from. While multithreading can be good for performance, this
pattern and the memory footprint for process launch and
dependencies can turn the memory footprint for a cirros image
conversion (16MB) into 1.2GB of memory being asked for by the
qemu-img tool.
In order to limit this impact, as the default number of arenas
is governed by the number of CPUs times the number 8, it seems
reasonable to lower this to a more reasonable number which
also helps keep our possible memory footprint from being exceeded.
Change-Id: I71a28ec59ec31c691205eb34d9fcab63a2ccb682
Story: 2008928
Task: 42528
Add a clean step for disk burn-in via fio. Get basic
run parameters from the node's driver_info.
Story: #2007523
Task: #42384
Change-Id: I5f5e336bd629846b3d779fd0fc7a2060b385b035
The command params can be huge when configdrive is used. There is no
point in sending them back, Ironic does not use them anyhow.
Story: #2008904
Task: #42479
Change-Id: I6e3db5db2042ca3fb5dafacfacf036fd7fc2fc4c
The _manage_uefi code has a check where it attempts to just
identify the precise partition number of the device, in order
for configuration to be parsed and passed. However, the same code
did not handle the existence of a `p1` partition instead of just a
partition #1. This is because the device naming format is different
with NVMe and Software RAID.
Likely, this wasn't an issue with software raid due to how complex the
code interaction is, but the docs also indicate to use only whole disk
images in that case.
This patch was pulled down my one RH's professional services folks
who has confirmed it does indeed fix the issue at hand. This is noted
as a public comment on the Red Hat bugzilla.
https://bugzilla.redhat.com/show_bug.cgi?id=1954096
Story: 2008881
Task: 42426
Related: rhbz#1954096
Change-Id: Ie3bd49add9a57fabbcdcbae4b73309066b620d02
Add a clean step for memory burn-in via stress-ng. Get basic
run parameters from the node's driver_info.
Story: #2007523
Task: #42383
Change-Id: I33a83968c9f87cf795ec7ec922bce98b52c5181c
Add a clean step for CPU burn-in via stress-ng. Get basic
run parameters from the node's driver_info.
Story: #2007523
Task: #42382
Change-Id: I14fd4164991fb94263757244f716b6bfe8edf875
Due to a regression in lshw introduced by
https://github.com/lyonel/lshw/pull/60, there are some versions in the
wild that do not return sizes for memory banks <32GiB. In those cases,
work around the problem by looking at the top-level size (if available)
to find the total size. Previously we assumed that we only needed the
top-level size when there was no list of memory banks.
The issue is fixed upstream by https://github.com/lyonel/lshw/pull/65,
but the erroneous patch is still present in the lshw-B.02.19.2-5.el8
package in CentOS 8.4 and 8.5.
Change-Id: I6eb5981d28b9ae368239af0c1d0ec32ff79d95b3
Story: #2008865
Task: 42395
After GPT and MBR are destroyed systemd-udevd gets triggered
which may hold /dev/sda open preventing qemu-img from writting
its image.
Story: 2008830
Task: 42312
Change-Id: I6105192a16fcb7f6898910e8d0ab824d731d491d
For software RAID in UEFI mode, we create ESPs on all holder disks
and copy the bootloader there. Since there is no mechanism to keep
the ESPs in sync, e.g. on kernel upgrades or when kernel parameters
are updated, the ESPs will get out of sync eventually. This may lead
to a situation where a node boots with outdated parameters or does
not have any of the installed kernels in the boot menu anymore.
This change proposes to RAID the ESPs. While the UEFI firmware will
find an ESP partition (one leg of the mirror), the node will see
an md device and all subsequent updates will go to all member disks.
Also, remove the source ESP after copying in order to avoid mount
confusion (same UUID!).
Story: #2008745
Task: #42103
Change-Id: I9078ef37f1e94382c645ae98ce724ac9ed87c287