510 Commits

Author SHA1 Message Date
Jay Faulkner
80575566b1 Allow manual setting of Ironic API Version
Typically, the Ironic API client in IPA will autodetect the API version
based on the output of a GET of the root of the API. If for some reason
this API endpoint is restricted, or the operator wishes to limit the
Ironic API version IPA uses, they can now set CONF.ironic_api_version to
avoid autodetection and force a version.

Change-Id: Ib96a1057792f45f2e4554671e32c436140463ee8
2020-10-23 15:38:42 +00:00
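The override described in this commit can be sketched as follows. This is a minimal illustration, not the IPA client code: `resolve_api_version`, `parse_version` and the `autodetect` callable are hypothetical names, and only the option name `ironic_api_version` comes from the commit message.

```python
from typing import Callable, Optional, Tuple

Version = Tuple[int, int]


def parse_version(text: str) -> Version:
    """Turn a string such as "1.68" into a comparable (major, minor) tuple."""
    major, minor = text.split(".")
    return int(major), int(minor)


def resolve_api_version(configured: Optional[str],
                        autodetect: Callable[[], Version]) -> Version:
    """Prefer an operator-supplied version; otherwise fall back to detection."""
    if configured:
        # A forced version means the API root is never queried.
        return parse_version(configured)
    return autodetect()
```

Passing the detection step as a callable keeps the forced path free of any network access, which matters when the root endpoint is restricted.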
Julia Kreger
6542a9cb04 Don't run os-prober from grub2-mkconfig
By default, grub2-mkconfig scans everything to look for other
environments and then load those into the grub configuration.

It makes sense, but on newer versions of grub2 in distribution
images, os-prober is taking an exceptionally long time in some
cases where more than one storage device exists with other
filesystems.

As a result of the os-prober execution by grub2-mkconfig, the
bootloader installation can completely time out and fail the
deployment. This is presently experienced with metalsmith on
centos8.

There are numerous sporadic reports of issues like this one,
where grub2-mkconfig hangs for some period of time, and this is
observable on Centos8.2 in our CI. While one report[0] mentions
this issue, another bug [1] has the dialog that actually helps us
frame the context as to what we likely should do.

Also, fixes the unit testing so we actually test if we're running
with grub2. :\

[0]: https://bugzilla.redhat.com/show_bug.cgi?id=1744693
[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1709682

Depends-On: https://review.opendev.org/#/c/748315
Change-Id: I14bf299afef3a1ddb2006fe5f182d7f0d249e734
2020-10-22 22:28:07 +00:00
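One way to keep grub2-mkconfig from running os-prober is the GRUB setting `GRUB_DISABLE_OS_PROBER`. The sketch below only builds an environment for a child process. Whether the patch passes the setting through the environment rather than through /etc/default/grub is an assumption here, and `grub_mkconfig_env` is a hypothetical helper.

```python
import os


def grub_mkconfig_env(base_env=None):
    """Return a copy of the environment with os-prober disabled for GRUB."""
    env = dict(os.environ if base_env is None else base_env)
    # GRUB_DISABLE_OS_PROBER is a documented GRUB setting; supplying it via
    # the environment (instead of /etc/default/grub) is an assumption.
    env["GRUB_DISABLE_OS_PROBER"] = "true"
    return env
```

The resulting dictionary could be handed to `subprocess.run(..., env=env)` when invoking grub2-mkconfig.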
Zuul
80b0a9a132 Merge "Software RAID: Re-add missing devices" 2020-10-12 12:24:24 +00:00
Dmitry Tantsur
420ebc0d73 Do not silently swallow errors in the write_image deploy step
Calling join() does not raise, we need to explicitly check the result.

Change-Id: I81d3d727af220c2b50358edab8139f07874611f0
Story: #2008240
Task: #41083
2020-10-09 11:24:12 +02:00
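The pitfall named in this commit is general Python behavior: `Thread.join()` returns normally even when the thread's target raised. A minimal sketch of the explicit check, using hypothetical names (`ResultThread`, `run_and_check`) rather than the IPA code:

```python
import threading


class ResultThread(threading.Thread):
    """Thread that records the exception raised by its target, if any."""

    def __init__(self, target, *args):
        super().__init__()
        self._work = target
        self._work_args = args
        self.error = None

    def run(self):
        try:
            self._work(*self._work_args)
        except Exception as exc:
            # Stored so the caller can inspect it after join().
            self.error = exc


def run_and_check(target, *args):
    """Run target in a thread and re-raise its failure in the caller."""
    worker = ResultThread(target, *args)
    worker.start()
    worker.join()  # returns normally even if the target failed
    if worker.error is not None:
        raise worker.error
```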
Dmitry Tantsur
fc4e0eed6a Don't try to call GRUB when root UUID is not provided
We don't have a really working way to detect root UUID for whole
disk images at the moment, which results in an ignored traceback
every time install_bootloader is called with whole disk images in
UEFI mode. Avoid it by skipping GRUB2 if root UUID is unknown.

Change-Id: I84245538f59c664b72d1cafbca8d61be0978f489
2020-10-07 12:06:42 +02:00
Zuul
abd9f91813 Merge "Add basic retries for inspection" 2020-10-06 17:07:20 +00:00
Arne Wiebalck
253b4887d5 Software RAID: Re-add missing devices
Upon md device creation, component devices are sometimes removed
immediately again due to a "disk failure". The disks seem healthy,
though. This patch re-adds component devices in such cases to
prevent the md device from remaining in a degraded state (which
would cause issues later, e.g. during ESP creation).

Story: #2008164
Task: #40914

Change-Id: I2ac7cb4a546de84686d5c3435e850c14b3f6c1d7
2020-10-06 14:00:57 +02:00
fb45e58d1c Update master for stable/victoria
Add file to the reno documentation build to show release notes for
stable/victoria.

Use pbr instruction to increment the minor version number
automatically so that master versions are higher than the versions on
stable/victoria.

Change-Id: Ia3696da8663c140504924b0a1cd23f9aaa517f0a
Sem-Ver: feature
2020-10-01 18:42:40 +00:00
Zuul
99dee5067e Merge "Software RAID: Get component devices by md UUID" 2020-09-30 18:30:56 +00:00
Zuul
faeb9441d3 Merge "Simplify heartbeating by removing use of select()" 2020-09-29 15:47:08 +00:00
Arne Wiebalck
044c64dbc0 Software RAID: Get component devices by md UUID
Scanning the output of mdadm commands for RAID members will
miss component devices which are currently not part of the
RAID. For proper cleaning it is better to scan block devices
for a signature of the md device for which we would like to
get the components.

Story: #2008186
Task: #40947

Change-Id: Ib46612697851e36a16d272ccaeb0115106253863
2020-09-29 17:08:40 +02:00
Arne Wiebalck
c7aec775ff Software RAID: Don't delete partitions too early
Partitions on the holder disk should only be deleted after
all RAID devices have been deleted. Otherwise, super blocks
on partitions which reside on the same disks cannot be cleaned.

Story: #2008199
Task: #40979
Change-Id: I19293f5b992cd1fa68957d6f306dcec8f3b7a820
2020-09-28 10:35:12 +02:00
Zuul
c7ff931fe6 Merge "Fix: make Intel CNA hardware manager none generic" 2020-09-23 14:57:40 +00:00
Zuul
11a87365fb Merge "Generate a TLS certificate and send it to ironic" 2020-09-23 12:14:38 +00:00
Qianbiao.NG
4b0ef13d08 Fix: make Intel CNA hardware manager none generic
Currently, IntelCnaHardwareManager inherits GenericHardwareManager
which makes it a new "GenericHardwareManager" with "MAINLINE" priority.
This causes all other hardware managers with lower priority than
"MAINLINE" to never be used. To fix this, make IntelCnaHardwareManager
inherit the basic HardwareManager.

Change-Id: I28b665d8841b0b2e83b132e1f25df95e03e7ba10
Story: 2008142
Task: 40882
2020-09-23 18:24:26 +08:00
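The shadowing effect described in this commit can be reproduced with a toy dispatcher. Everything below is hypothetical (class names, numeric priorities, and the `dispatch` helper); it assumes only that the highest-priority manager implementing a method wins.

```python
class BaseManager:
    """Hypothetical base: implements nothing, so dispatch falls through."""
    priority = 0

    def list_disks(self):
        raise NotImplementedError


class GenericManager(BaseManager):
    priority = 1

    def list_disks(self):
        return "generic"


class CustomManager(BaseManager):
    priority = 2

    def list_disks(self):
        return "custom"


class VendorInheritsGeneric(GenericManager):
    # Inherits every generic method while advertising a higher priority.
    priority = 3


class VendorInheritsBase(BaseManager):
    # Would define only vendor-specific methods.
    priority = 3


def dispatch(managers, method_name):
    """Call method_name on the highest-priority manager that implements it."""
    for manager in sorted(managers, key=lambda m: m.priority, reverse=True):
        try:
            return getattr(manager, method_name)()
        except NotImplementedError:
            continue
    raise RuntimeError("no manager implements %s" % method_name)
```

With `VendorInheritsGeneric` loaded, `dispatch` answers `list_disks` from the inherited generic code and `CustomManager` is never reached; with `VendorInheritsBase` the call falls through to `CustomManager`.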
Jay Faulkner
a01646f56b Simplify heartbeating by removing use of select()
Heartbeating in IPA has used select.poll() for years to work around
a bug where changing the time in the ramdisk could cause heartbeats
to stop and never resume.

Now that IPA syncs time at start and exit, this workaround is no
longer needed. So instead, we'll revert to using threading.Event()
in order to make the code simpler and easier to understand.

Since we need this to be an eventlet-event, and not a standard-thread
event, also monkey_patch threading.

Additionally, a few backoff interval values were set but never
applied. To preserve the 5+ year old behavior of not doing error
backoffs, that code was removed instead of being made to work.

Change-Id: Ibcde99de64bb7e95d5df63a42a4ca4999f0c4c9b
2020-09-22 16:59:47 +00:00
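A heartbeat loop built on `threading.Event()` might look like the sketch below. It uses plain standard-library threads, not eventlet, and the `Heartbeater` class with its `beat` callable is a hypothetical stand-in for the agent's heartbeater.

```python
import threading


class Heartbeater:
    """Call beat() every interval seconds until stop() is called."""

    def __init__(self, beat, interval):
        self._beat = beat
        self._interval = interval
        self._stop_event = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        # wait() returns False on timeout and True once the event is set,
        # so the loop ends promptly when stop() is called.
        while not self._stop_event.wait(self._interval):
            self._beat()

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop_event.set()
        self._thread.join()

    def is_running(self):
        return self._thread.is_alive()
```

The single `wait()` call serves as both the sleep between heartbeats and the shutdown signal.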
Julia Kreger
bb27badf76 Add basic retries for inspection
A transitory connection failure, such as one caused by
a port being held down for traffic forwarding, can result in
intermittent connectivity failures and therefore in failed
introspections.

Now the agent retries.

Change-Id: I72c5e3aca000d3854a17f8a461b1a2935e5c0d9b
2020-09-14 22:38:18 +00:00
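A basic retry wrapper of the kind this commit describes could be sketched like this. The helper name, the default attempt count, and the use of the builtin `ConnectionError` are assumptions for illustration, not the agent's actual settings.

```python
import time


def call_with_retries(func, attempts=3, delay=0.0,
                      retry_on=(ConnectionError,), sleep=time.sleep):
    """Call func(), retrying on the given exceptions up to attempts times."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on as exc:
            last_error = exc
            if attempt < attempts:
                # Give the network a moment to recover before the next try.
                sleep(delay)
    raise last_error
```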
Dmitry Tantsur
021e0a6a46 Generate a TLS certificate and send it to ironic
Adds a new flag (on by default) that enables generating a TLS
certificate and sending it to ironic via heartbeat. Whether
ironic supports auto-generated certificates is determined by
checking its API version.

Change-Id: I01f83dd04cfec2adc9e2a6b9c531391773ed36e5
Depends-On: https://review.opendev.org/747136
Depends-On: https://review.opendev.org/749975
Story: #2007214
Task: #40604
2020-09-11 17:46:52 +02:00
Julia Kreger
3426963552 Fix backup node lookup
The node lookup code added in change
I27201319f31cdc01605a3c5ae9ef4b4218e4a3f6
was slightly broken in that we call a method
with a keyword argument which doesn't exist.

uuid versus node_uuid.

It happens, it is a quick fix!

Spotted on a metalsmith job:

[-] Agent is requesting to perform an explicit node cache update.
    This is to pickup any chanages in the cache before deployment.
[-] Failed to update node cache. Error lookup_node() got an
    unexpected keyword argument 'uuid'

Change-Id: I59ecec65707a2f03918b233f1925395ebe59b8c4
2020-09-09 15:19:38 -07:00
Zuul
e73b7220c4 Merge "If listen_tls is true, enable TLS on wsgi server" 2020-09-03 18:59:48 +00:00
Zuul
09f6a4e3da Merge "Update the cache if we don't have a root device hint" 2020-09-03 09:41:58 +00:00
Jay Faulkner
1d11f0b7dd If listen_tls is true, enable TLS on wsgi server
This change enables operators to set [DEFAULT]listen_tls to
true to configure IPA to host its WSGI server over TLS using
existing SSL support in oslo.service.

In addition to configuring this in IPA, a deployer will need to
also set [ssl]cert_file, [ssl]key_file, and optionally
[ssl]ca_file in their ipa config, in addition to embedding those
files into the IPA ramdisk in order for this to be functional.

In order to make this change work, we also need to monkey patch
socket library early, or else oslo.service will end up passing an
unpatched socket to the eventlet wsgi server, which causes
deadlocks.

Change-Id: Ib7decae410915f3c27b045ee08538c94d455b030
2020-09-02 16:07:42 -07:00
Jay Faulkner
7d0ad36ebd Make WSGI server respect listen_* directives
The listen_port and listen_host directives are intended to allow
deployers of IPA to change the port and host IPA listens on. These
configs have not been obeyed since the migration to the oslo.service
wsgi server.

Story: 2008016
Task: 40668
Change-Id: I76235a6e6ffdf80a0f5476f577b055223cdf1585
2020-08-31 14:37:38 +00:00
Julia Kreger
d3c3d4dabe Update the cache if we don't have a root device hint
Or at least try to.

Some deployments just don't use root device hints, and this is okay.

However, other deployments need root device hints, and with fast
track mode in ramdisks, we created a situation where the node cache
could be updated by a human or software between the time the agent
was started, and the deployment was requested.

As a result, the agent has been updated to check if we have a hint
and if we don't, update the cache from the node lookup endpoint.

This is not needed when the inband deploy steps are executed, as
the process of updating the steps does force the node cache to be
updated.

Change-Id: I27201319f31cdc01605a3c5ae9ef4b4218e4a3f6
Story: 2008039
Task: 40701
2020-08-25 19:34:48 +00:00
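The check described in this commit can be sketched as below. The node is modelled as a plain dictionary with hints under `properties["root_device"]`; `ensure_fresh_node` and the `lookup_node` callable are hypothetical, and catching every exception reflects the "or at least try to" wording rather than the real error handling.

```python
def ensure_fresh_node(cached_node, lookup_node):
    """Refresh the cached node only when it carries no root device hint."""
    hints = cached_node.get("properties", {}).get("root_device")
    if hints:
        return cached_node
    try:
        # No hint cached: the node may have been updated since agent start.
        return lookup_node()
    except Exception:
        # Best effort only; keep using the cached copy if the lookup fails.
        return cached_node
```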
Zuul
cfede0c5bc Merge "Clarify connection error on heartbeats" 2020-08-24 13:29:27 +00:00
Julia Kreger
f670f704f3 Clarify connection error on heartbeats
Heartbeat connection errors are often a sign of transitory
network failures which may resolve themselves. But an operator
looking at the screen doesn't necessarily know that.

They don't understand that there could have been a network
failure, or a misconfiguration that caused the connectivity
failure, and sort of default to "well it failed"
without further clarification.

As such, this patch adds explicit catching of the requests
ConnectionError exception and raises a new internal error
with a more verbose error message in that event to provide
operators with additional clarity.

Change-Id: I4cb2c0d1f577df1c4451308bd86efa8f94390b0c
Story: 2008046
Task: 40709
2020-08-20 13:45:47 -07:00
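The pattern of catching a connection error and raising a more descriptive one can be sketched as follows. To stay self-contained this uses the builtin `ConnectionError` as a stand-in for the one from requests; `HeartbeatConnectionError`, `send_heartbeat`, and the message text are hypothetical.

```python
class HeartbeatConnectionError(Exception):
    """Hypothetical error carrying an operator-friendly explanation."""


def send_heartbeat(post, url):
    """Send a heartbeat via post(url), explaining connection failures."""
    try:
        return post(url)
    except ConnectionError as exc:  # stand-in for requests' ConnectionError
        raise HeartbeatConnectionError(
            "Could not connect to %s to send a heartbeat. This is often "
            "caused by a transitory network failure or a misconfiguration. "
            "Original error: %s" % (url, exc)
        ) from exc
```

Chaining with `from exc` keeps the original traceback available for debugging while the operator sees the longer message.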
Dmitry Tantsur
d50ff06b6b Enable the logs collection by default
It's incredibly helpful when debugging and most of consumers seem
to enable and rely on it.

Change-Id: I33bf58b3eb16b63b70f2a23e8a04449dc88fd94c
2020-08-19 17:25:24 +02:00
Zuul
3e938b6fcc Merge "Support changing the protocol part of callback_url to https" 2020-08-10 14:59:51 +00:00
Zuul
9f88a0cb59 Merge "Fix TypeError on agent lookup failure" 2020-08-07 16:32:30 +00:00
Dmitry Tantsur
353d09c3b0 Support changing the protocol part of callback_url to https
Adds a new kernel parameter for manual configuration and also creates
foundation for automatic TLS support later.

Change-Id: If341c3a8a268fc8cab6bd6be04b12ca32b31c8d8
Story: #2007214
Task: #40619
2020-08-06 15:14:31 +02:00
Julia Kreger
5eab9bced6 Fix TypeError on agent lookup failure
Agent lookups can fail as we presently use logging.exception,
better known in our code as LOG.exception, which can also generate
other fun issues on journald based systems where additional errors
could be raised resulting in us being unable to troubleshoot the
actual issue.

Because of the mis-use of LOG.exception and the default behavior
of the backoff retry handler, the retry logic was also not
functional as any error no matter how small caused IPA to
just exit.

Change-Id: Ic4608b7c6ff9773d1403926efb3d59869c71343b
Story: 2007968
Task: 40465
2020-08-04 20:43:02 -07:00
Kaifeng Wang
b424fbfa35 Extends pci devices metrics
Collects PCI class, revision, and bus information for the pci-devices
collector. These metrics, together with vendor id and device id, are
components which can be used to construct device information like
lspci output, which is how cyborg agent collects accelerator devices.

Accelerator device based scheduling is possible after ironic has such
information in place.

Change-Id: I6c37c554f37dd5f1d21c8fd4fad2a4f44a3c75d7
Story: 2007971
Task: 40474
2020-08-04 23:32:37 +08:00
Zuul
ad9c54f55c Merge "Return the final RAID configuration from apply_configuration" 2020-07-29 14:00:08 +00:00
Dmitry Tantsur
f03d72019a Return the final RAID configuration from apply_configuration
AgentRAID expects it and fails with TypeError if it's not provided.

Change-Id: Id84ac129bba97540338e25f0027aa0a0f51bde52
Story: #2006963
2020-07-29 10:10:18 +02:00
Dmitry Tantsur
eb87651496 Allow erase_devices_metadata to be used as a deploy step
Change-Id: I75f156dd76b0e3aaa1592ba24fe42fb2a7057cc8
Story: #2006963
2020-07-27 17:57:37 +02:00
Zuul
9ca640a1c5 Merge "Prevent un-needed iscsi cleanup" 2020-07-25 13:54:51 +00:00
Zuul
f6bf94fe64 Merge "Fix versions in release notes" 2020-07-23 00:09:02 +00:00
Zuul
daf61f33b0 Merge "Fix bootloader install issue with MDRAID" 2020-07-22 22:13:34 +00:00
Zuul
bfb395837d Merge "Adds poll mode deployment support" 2020-07-22 19:53:31 +00:00
Doug Szumski
5e95b1321d Fix bootloader install issue with MDRAID
When no root_device hint is set, an MDRAID partition can be incorrectly
selected as the root device which causes installation of the bootloader
to the physical disks behind the MDRAID volume to fail. See the notes
in the referenced Story for more detail.

This change adds a little more specificity to the listing of block
devices.

Change-Id: I66db457e71a0586723ee753bef961aec5bf58827
Story: 2007905
Task: 40303
2020-07-22 11:16:13 -07:00
Riccardo Pittau
ab585153c9 Fix versions in release notes
Change-Id: I2ba658d83a15554e135429d464c0a033063d4631
2020-07-22 15:41:38 +02:00
Julia Kreger
2a56ee03b6 Prevent un-needed iscsi cleanup
When we added software raid support, we started calling bootloader
installation. As time went on, we enhanced that code path for non
RAID cases in order to ensure that UEFI nvram was setup
for the instance to boot properly.

Somewhere in this process, we missed a possible failure case where
the iscsi client tgtadm may return failures. Obviously, the correct
path is to not call iscsi teardown if we don't need to.

Since it was always semi-opportunistic teardown, we can't blindly
catch any error, and if we started iSCSI and failed to tear the
connection down, we might want to still fail, so this change
moves the logic over to use a flag on the agent object, which
allows one extension to set the flag and the other to read it and
take action based upon that.

Change-Id: Id3b1ae5e59282f4109f6246d5614d44c93aefa7c
Story: 2007937
Task: 40395
2020-07-20 14:24:06 -07:00
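The flag hand-off between the two extensions could look like the sketch below. All names (`Agent`, `IscsiExtension`, `ImageExtension`, `iscsi_started`) are hypothetical, and the teardown callable stands in for the real iSCSI cleanup.

```python
class Agent:
    """Shared object both extensions can reach."""

    def __init__(self):
        self.iscsi_started = False


class IscsiExtension:
    def __init__(self, agent):
        self.agent = agent

    def start_iscsi_target(self):
        # Record that there is something to tear down later.
        self.agent.iscsi_started = True


class ImageExtension:
    def __init__(self, agent):
        self.agent = agent

    def install_bootloader(self, teardown):
        """Tear iSCSI down only if it was started; errors still propagate."""
        if self.agent.iscsi_started:
            teardown()
            self.agent.iscsi_started = False
```

Because teardown is skipped entirely when the flag is unset, a failing cleanup command can no longer break deployments that never used iSCSI.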
Dmitry Tantsur
1f3b70c4e9 Ignore devices with size 0 when collecting inventory
delete_configuration still fetches all devices as it needs to clean
ones with broken RAID.

Story: #2007907
Task: #40307
Change-Id: I4b0be2b0755108490f9cd3c4f3b71a5e036761a1
2020-07-09 18:28:20 +02:00
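The size filter described in this commit, with an opt-out for RAID cleanup, can be sketched as follows. `BlockDevice`, `list_devices`, and the `include_empty` flag are hypothetical names for illustration.

```python
from dataclasses import dataclass


@dataclass
class BlockDevice:
    name: str
    size: int  # bytes


def list_devices(devices, include_empty=False):
    """Return devices, skipping zero-sized ones unless include_empty is set."""
    if include_empty:
        # Cleanup paths need every device, including ones with broken RAID.
        return list(devices)
    return [dev for dev in devices if dev.size > 0]
```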
Zuul
2e9620a2c0 Merge "Limit Inspection->Lookup->Heartbeat lag" 2020-07-06 18:08:14 +00:00
Zuul
6218725610 Merge "Fix serializing ironic-lib exceptions" 2020-07-06 16:47:58 +00:00
Julia Kreger
c76b8b2c21 Limit Inspection->Lookup->Heartbeat lag
Caches hardware information collected during inspection
so that the initial lookup can occur without any delay.

Also adds logging to track how long inventory collection takes.

Co-Authored-By: Dmitry Tantsur <dtantsur@protonmail.com>
Change-Id: I3e0d237d37219e783d81913fa6cc490492b3f96a
2020-07-03 10:32:26 +02:00
Dmitry Tantsur
ba3caa6c64 Increase the ESP partition size to 550 MiB when using software RAID
This has been a popular guidance, and diskimage-builder has recently
started following it.

Change-Id: I794c846fb191c15b0a30546bf64d624dfbde0fd4
2020-07-02 17:30:33 +02:00
Dmitry Tantsur
a4855c544c Fix serializing ironic-lib exceptions
Change-Id: If1408e4b81d263c56b4bbab618dd0737db5f762e
Story: #2007889
Task: #40268
2020-07-02 12:18:53 +02:00
Julia Kreger
c77a7df851 Extend retries to 9, 10 seconds apart.
The download retry interval was previously five seconds which is
not long enough to recover after a hard network connectivity break
where we may be reliant upon network port forwarding hold-down
timers or even routing protocol route propagation to recover
communication.

Previously the time value was 5 seconds, with 3 attempts, meaning
15 seconds total ignoring the error detection timeouts.

Now it is 10 seconds, with 10 attempts, meaning 100 seconds before
the error detection timeouts.

Change-Id: I6d11edc9a3156f2bdc21c3d432ecc7625d652699
2020-06-23 20:27:49 +00:00
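The totals quoted in this commit follow from multiplying the attempt count by the interval, ignoring error detection timeouts as the message does. The helper name below is hypothetical.

```python
def total_retry_wait(attempts, interval_seconds):
    """Time spent waiting between attempts, excluding detection timeouts."""
    return attempts * interval_seconds


old_wait = total_retry_wait(3, 5)     # 15 seconds
new_wait = total_retry_wait(10, 10)   # 100 seconds
```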
Julia Kreger
159ab9f0ce Add full download retries
Instead of just trying to get the connection and handler
for the download, let's try to retry the whole action of
downloading.

Change-Id: I9217792d32e6f33c70f146a9b7d3ef58c5644d8a
2020-06-23 20:27:41 +00:00