1779 Commits

Author SHA1 Message Date
Dmitry Tantsur
219bf0c373 Prepare release 16.1
Change-Id: Ia37075d4aa2f39ca0862b03ca02c85bac17400e5
2020-12-14 13:29:36 +01:00
Zuul
eeda309bbd Merge "Add secure boot support to ilo-uefi-https" 2020-12-14 04:49:49 +00:00
vmud213
681940c8f0 Add secure boot support to ilo-uefi-https
Adds secure boot support to ilo-uefi-https boot interface.

Change-Id: I1d08b88496764bbee5cf0a1d306eb7be31d0d373
Story: #2008258
Task: #41114
2020-11-26 08:46:01 +00:00
Zuul
a08da8551a Merge "Add vendor_passthru method for virtual media" 2020-11-25 17:02:34 +00:00
Zuul
4cc375a747 Merge "Allow disabling automated_clean per node" 2020-11-25 12:42:49 +00:00
Zuul
0ac45a8d9d Merge "Always retry locking when performing task handoff" 2020-11-25 09:16:12 +00:00
Feruzjon Muyassarov
ee6119e774 Allow disabling automated_clean per node
This allows users to disable automated cleaning at the
node level.

Story: #2008113
Task: #40829
Change-Id: If583bae4108b9bfa99cc460509af84696c7003c5
2020-11-24 17:23:13 +00:00
Jason Anderson
bfc2ad56d5 Always retry locking when performing task handoff
There are some Ironic execution workflows where there is not an easy way
to retry, such as when attempting to hand off the processing of an async
task to a conductor. Task handoff can require releasing a lock on the
node, so the next entity processing the task can acquire the lock
itself. However, this is vulnerable to race conditions, as there is no
uniform retry mechanism built in to such handoffs. Consider the
continue_node_deploy/clean logic, which does this:

  method = 'continue_node_%s' % operation
  # Need to release the lock to let the conductor take it
  task.release_resources()
  getattr(rpc, method)(task.context, uuid, topic=topic)

If another process obtains a lock between the releasing of resources and
the acquiring of the lock during the continue_node_* operation, and
holds the lock longer than the max attempt * interval window (which
defaults to 3 seconds), then the handoff will never complete. Beyond
that, because there is no proper queue for processes waiting on the
lock, there is no fairness, so it's also possible that instead of one
long lock being held, the lock is obtained and held for a short window
several times by other competing processes.

This manifests as nodes occasionally getting stuck in the "DEPLOYING"
state during a deploy. For example, a user may attempt to open or access
the serial console before the deploy is complete--the serial console
process obtains a lock and starves the conductor of the lock, so the
conductor cannot finish the deploy. It's also possible a long heartbeat
or badly-timed sequence of heartbeats could do the same.

To fix this, this commit introduces the concept of a "patient" lock,
which will retry indefinitely until it doesn't encounter the NodeLocked
exception. This overrides any retry behavior.
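
The difference between a bounded retry and a "patient" one can be
sketched as follows. This is a minimal illustration, not the code from
this change: acquire_with_retry, its parameters, and the stand-in
NodeLocked class are hypothetical names chosen for the example.

```python
import time


class NodeLocked(Exception):
    """Stand-in for the NodeLocked exception named above."""


def acquire_with_retry(reserve, attempts=3, interval=1.0, patient=False,
                       sleep=time.sleep):
    """Call reserve() until it stops raising NodeLocked.

    With patient=False the call gives up after `attempts` tries.
    With patient=True the attempt limit is ignored and the call keeps
    retrying for as long as the lock is held by someone else.
    """
    attempt = 0
    while True:
        attempt += 1
        try:
            return reserve()
        except NodeLocked:
            if not patient and attempt >= attempts:
                raise
            sleep(interval)
```

A handoff path would pass patient=True, while ordinary callers keep the
bounded behavior so that they fail fast.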

  .. note::
     There may be other cases where such a lock is desired.

Story: #2008323
Change-Id: I9937fab18a50111ec56a3fd023cdb9d510a1e990
2020-11-24 09:41:38 -06:00
Bob Fournier
98958cd0a4 Add vendor_passthru method for virtual media
Add a vendor_passthru method to eject_vmedia for Redfish and idrac.

Story: 2008363
Task: 41271

Change-Id: Ib5ae16bacfd79f479a9aa8fbf69edc5cfdf73ce3
2020-11-24 09:25:44 -05:00
Zuul
3db362e5aa Merge "Fix disk label to account for UEFI" 2020-11-19 23:34:27 +00:00
Julia Kreger
ed4abbd519 Fix disk label to account for UEFI
Previously, disk labels were not populated unless explicitly
set by an API user, which led to a dangerous case: a UEFI
booting machine could be set up with a BIOS MBR partition table,
which sometimes worked but was ultimately wrong.

Not all systems support this, but UEFI systems are supposed to
support GPT partition tables.

We now fall back if no explicit override is set and assume GPT
if the machine is set to UEFI mode.
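
The fallback described above can be sketched roughly as below. The
function name and arguments are hypothetical and only illustrate the
decision order; they are not the identifiers used in the change.

```python
def pick_disk_label(explicit_label, boot_mode):
    """Choose a partition table type for a deployment.

    explicit_label: value set by the API user, or None (hypothetical).
    boot_mode: "uefi" or "bios" (hypothetical representation).
    """
    # An explicit override always wins.
    if explicit_label:
        return explicit_label
    # No override: UEFI machines are assumed to use GPT.
    if boot_mode == "uefi":
        return "gpt"
    # Otherwise leave the label unset, as before.
    return None
```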

Change-Id: I001d8c6ee3b1d6c466c71ea5179bdbca9bdd692d
2020-11-18 03:10:27 +00:00
Zuul
385aeb0143 Merge "Limit the default value of [api]api_workers to 4" 2020-11-18 03:06:29 +00:00
Zuul
0f3c63fde2 Merge "Simplify injecting network data into an ISO image" 2020-11-13 09:46:17 +00:00
Zuul
1cdf582d83 Merge "Fix incorrect network_data.json location" 2020-11-13 07:09:34 +00:00
Zuul
44027a1175 Merge "Remove root device hint after delete_configuration" 2020-11-12 23:41:19 +00:00
Zuul
029b875be1 Merge "Fixes the issue that instance bond port can't get IP address" 2020-11-12 23:41:14 +00:00
Zuul
b5932bc6bf Merge "Fix idrac-wsman RAID step async error handling" 2020-11-12 19:08:59 +00:00
Zuul
887f263c26 Merge "Retrieve BIOS configuration when moving node to `manageable`" 2020-11-12 18:29:56 +00:00
Zuul
e85c04ba3e Merge "Fix DHCP-less operations with the noop network interface" 2020-11-11 19:46:09 +00:00
Dmitry Tantsur
d48479b52d Simplify injecting network data into an ISO image
Currently we're building a VFAT image with the network data just
to unpack it back on the next step. Just pass the file directly.
This fixes a permission denied problem on Bifrost on Fedora
(at least).

As a nice side effect, the change reduces the amount of IO done
for virtual media quite substantially.

Change-Id: I5499fa42c1d82a1a29099fbbba6f45d440448b72
2020-11-11 12:20:20 +01:00
Dmitry Tantsur
3fd513ee1c Fix incorrect network_data.json location
There is no metadata subdirectory; the file goes right into openstack/latest.

Change-Id: I576c3c85515970262b5e7480913ff7daefa1b539
2020-11-11 12:17:36 +01:00
Zuul
d16daa3b52 Merge "Fix redfish BIOS apply config error handling" 2020-11-10 21:27:40 +00:00
Bob Fournier
9b18336f76 Retrieve BIOS configuration when moving node to `manageable`
When moving the node to ``manageable``, in addition to
``cleaning``, retrieve the BIOS configuration settings.  In the
case of ``manageable``, this may allow the settings to be used
when choosing which node to deploy.

Change-Id: Ic2b162f31d4a1465fcb61671e7f48b3d31de788c
Story: 2008326
Task: 41224
2020-11-10 14:57:20 -05:00
Dmitry Tantsur
2e5d01d48d Fix DHCP-less operations with the noop network interface
The base implementation of get_node_network_data returns {} and is
not overridden in the noop network. Update the base implementation
to use task.node.network_data and remove the excessive logging.

Change-Id: Ie50dcd1c2a151f5dd09794467792527032249809
2020-11-10 18:53:32 +01:00
Kaifeng Wang
fe01ddb2bc Fixes the issue that instance bond port can't get IP address
The issue is that when a port group doesn't have a MAC address assigned by
operators, we unbind/bind the tenant port with None during provisioning, which
causes the MAC address to be regenerated twice and to differ from the one
originally allocated by nova or users and packed into the config drive. The end
result is that the bond port has a different MAC address configured and can't
get the IP address from neutron.

Change-Id: I92ed5d17239216324d6a69e0ed8771fd6948d6ec
Story: 2008300
Task: 41185
2020-11-10 21:11:18 +08:00
Zuul
08bf8dee65 Merge "Add node name to ironic-conductor ramdisk log filename" 2020-11-03 15:57:31 +00:00
Dmitry Tantsur
c9c492725e Limit the default value of [api]api_workers to 4
Each ironic-api process consumes a non-negligible amount of RAM;
defaulting to the CPU core count may result in many hundreds of megabytes
occupied by ironic-api processes. Limit the default value to 4
and let people who actually need more than that pick their value.
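
A capped default of this kind can be sketched as below. The helper name
and its cpu_count parameter are hypothetical and exist only to make the
example testable; the real option definition may look different.

```python
import os


def default_api_workers(cpu_count=None, limit=4):
    """Return the CPU core count, capped at `limit`."""
    if cpu_count is None:
        # os.cpu_count() can return None when the count is unknown.
        cpu_count = os.cpu_count() or 1
    return min(cpu_count, limit)
```

Operators who need more workers would still set [api]api_workers
explicitly; only the default is capped.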

Change-Id: I5aefa8c6c7aadc56aea151647e1c0a5af54ada4c
2020-11-03 16:33:14 +01:00
Aija Jauntēva
23951f4b44 Fix idrac-wsman RAID step async error handling
Instead of using process_event('fail'), use error_handlers.
Otherwise, in case of failure, the node gets stuck and fails
because of a timeout instead of failing earlier due to
the step failure.

Story: 2008307
Task: 41194
Change-Id: Ieec0173f57367587985d2baad77205bb83e8b69a
2020-11-02 12:56:29 -05:00
Aija Jauntēva
70b7ca345f Fix redfish BIOS apply config error handling
Instead of using process_event('fail'), use error_handlers.
Otherwise, in case of failure, the node gets stuck and fails
because of a timeout instead of failing earlier due to
the step failure.

Besides adding new unit tests, also update related unit tests
to test for success correctly and have realistic data.

Story: 2008307
Task: 41196
Change-Id: If28ccb252a87610e3fd3dc78e1ed75bb8ca1cdcf
2020-11-02 12:55:26 -05:00
Zuul
7ea6e41b26 Merge "Prevent timeouts when using fast-track with redfish-virtual-media" 2020-11-02 13:59:05 +00:00
Zuul
31277f2c95 Merge "Handle agent still doing the prior command" 2020-10-30 16:10:10 +00:00
Dmitry Tantsur
551ca9c8f7 Prevent timeouts when using fast-track with redfish-virtual-media
Calling prepare_ramdisk may break fast-track, as is the case with
redfish-virtual-media (it powers nodes off unconditionally). To
avoid timeouts, check fast-track status again after prepare_ramdisk.

Change-Id: Iad2d6f4827bd7e8b2a02005fe18d31ec8d37db97
2020-10-30 16:41:01 +01:00
Julia Kreger
545dc2106b Handle agent still doing the prior command
The agent command exec model is based upon an incoming
heartbeat, however heartbeats are independent and
commands can take a long time. For example, software RAID
setup in CI can encounter this.

From an IPA log:

[-] Picked root device /dev/md0 for node c6ca0af2-baec-40d6-879d-cbb5c751aafb
    based on root device hints {'name': '/dev/md0'}
[-] Attempting to download image from http://199.204.45.248:3928/agent_images/
    c6ca0af2-baec-40d6-879d-cbb5c751aafb
[-] Executing command: standby.get_partition_uuids with args: {} execute_command
    /usr/local/lib/python3.6/site-packages/ironic_python_agent/extensions/base.py:255
[-] Tried to execute standby.get_partition_uuids, agent is still executing Command name:
    execute_deploy_step, params: {'step': {'interface': 'deploy', 'step': 'write_image',
    'args': {'image_info': {'id': 'cb9e199a-af1b-4a6f-b00e-f284008b8046',
    'urls': ['http://199.204.45.248:3928/agent_images/c6ca0af2-baec-40d6-879d-cbb5c751aafb'],
    'disk_format': 'raw', 'container_format': 'bare', 'stream_raw_images': True, 'os_hash_algo':
    'sha512', 'os_hash_value':<trimmed>

This was with code built on master, using master images.
The conductor log notes that this is likely an out-of-date
agent because only AgentAPIError is evaluated; however, any API
error is evaluated this way. In reality, we need to explicitly
flag *when* an error occurs because we tried too soon, while
something is still being worked on.

The result is to evaluate and return an exception indicating work
is already in flight.

Update - It looks like the original fix for busy agent
recognition did not fully detect all cases, as getting steps is a
command which can get skipped by accident with a busy agent
under certain circumstances.
Change I5d86878b5ed6142ed2630adee78c0867c49b663f in ironic-python-agent
also changed the string that was being checked for the previous
handling, where we really should have just made the string we were
checking lower case in ironic. Oh well! This should fix things
right up.
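
A case-insensitive check of the kind mentioned above could look like
the sketch below. The helper name is hypothetical, and the marker text
is an assumption taken from the IPA log excerpt quoted earlier, not
necessarily the exact string that ironic checks.

```python
# Assumed marker, based on the log line "agent is still executing".
BUSY_MARKER = "is still executing"


def is_agent_busy(error_message):
    """Return True if the agent error says a command is still running.

    Lowercasing the message first keeps the check working even if the
    agent changes the capitalization of its error text.
    """
    return BUSY_MARKER in error_message.lower()
```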

Story: 2008167
Task: 41175
Change-Id: Ia169640b7084d17d26f22e457c7af512db6d21d6
2020-10-29 14:58:34 -07:00
Dmitry Tantsur
2dfb3f5eca Make redfish-virtual-media respect default_boot_mode
Change-Id: I46c865ba1cc05a60aa9703f0b35247b62ad4235a
2020-10-29 17:44:00 +01:00
Zuul
7155ec7ef9 Merge "Add timeout to image operations in the direct deploy" 2020-10-29 12:19:59 +00:00
Zuul
90536bdbbc Merge "json-rpc: surround IPv6 address with [] in conductor URL" 2020-10-29 07:49:52 +00:00
Zuul
dc9cd24bc6 Merge "Sync boot mode when changing the boot device via Redfish" 2020-10-28 10:11:16 +00:00
Zane Bitter
85b4892886 json-rpc: surround IPv6 address with [] in conductor URL
If the conductor's host option is configured to be an IPv6 address, we
need to surround it with [] when incorporating it in a URL to contact
the conductor over json-rpc.
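
A minimal sketch of the bracketing, using the standard ipaddress
module. The function names and the port number in the usage note are
illustrative assumptions, not the identifiers or defaults from this
change.

```python
import ipaddress


def format_host_for_url(host):
    """Wrap IPv6 literals in [] so they can be followed by :port."""
    try:
        is_ipv6 = ipaddress.ip_address(host).version == 6
    except ValueError:
        # Not an IP literal at all, so treat it as a hostname.
        is_ipv6 = False
    return "[%s]" % host if is_ipv6 else host


def conductor_url(host, port, use_ssl=False):
    """Build the base URL used to reach a conductor."""
    scheme = "https" if use_ssl else "http"
    return "%s://%s:%d" % (scheme, format_host_for_url(host), port)
```

For example, conductor_url("2001:db8::1", 8089) gives
"http://[2001:db8::1]:8089", while IPv4 addresses and hostnames are
left unchanged.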

Change-Id: Ib3bc4c570ec0f2e5c73e3ce15b05684b8e4c1ff9
Story: 2008288
Task: 41166
2020-10-27 11:22:12 -04:00
Kaifeng Wang
91d6426b06 Fixes empty physical_network is not guarded
An empty physical_network can be set on a port, making the port
unusable.

Change-Id: I58cf04839f40922cf0c7ddffc08b843cb3c50e06
Story: 2008279
Task: 41153
2020-10-24 22:05:36 +08:00
dujinxiu
da4c583ea9 Add node name to ironic-conductor ramdisk log filename
Change-Id: Ide28c16806909f1bbf93bf7c72b5cec6f8ddc260
Story: #2008281
Task: #41155
2020-10-24 16:01:52 +08:00
Dmitry Tantsur
fe37fb6d5d Add timeout to image operations in the direct deploy
Currently they may hang when the remote server is not responding.

Change-Id: I1de17fed3b43a3d16795dc614ce76e2cfe1faca0
2020-10-22 13:16:47 +02:00
Dmitry Tantsur
0a68622187 Allow passing rootfs_uuid for the standalone case
Using software RAID with whole disk images requires specifying
a root partition UUID, but it is only possible through Glance.
This change adds an explicit field for that.

Change-Id: I55e3727aab3960ef472ec2db1f23c25db405e801
2020-10-20 18:22:25 +02:00
Bob Fournier
685131fd36 Sync boot mode when changing the boot device via Redfish
After changing the boot device via Redfish, check that the boot mode being
reported matches what is configured and, if not, set it to the configured
value.  Some BMCs change the boot mode when the device is set via Redfish;
this ensures the mode is set properly.

Change-Id: Ib077f7f32de029833e6bd936853c382305bce36e
Story: 2008252
Task: 41103
2020-10-19 14:34:44 -04:00
Zuul
a29417f46f Merge "Fix ipmitool timing argument calculation" 2020-10-19 07:17:02 +00:00
Steve Baker
1de3db3b16 Fix ipmitool timing argument calculation
Calculating the ipmitool `-N` and `-R` arguments from ironic.conf
[ipmi] `command_retry_timeout` and `min_command_interval` now takes
into account the 1 second interval increment that ipmitool adds on
each retry event.

Failure-path ipmitool run duration will now be just less than
`command_retry_timeout` instead of much longer.
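
The calculation can be sketched as below, assuming that each retry
waits one second longer than the previous one, as described above. The
function name and its arguments are hypothetical; the real code reads
the values from ironic.conf.

```python
def ipmitool_timing_args(command_retry_timeout, min_command_interval):
    """Return -R/-N arguments whose total wait fits in the timeout."""
    interval = min_command_interval
    duration = 0
    tries = 0
    # Add attempts only while the running total stays within the
    # timeout; every attempt waits one second longer than the last.
    while duration + interval <= command_retry_timeout:
        duration += interval
        interval += 1
        tries += 1
    return ["-R", str(tries), "-N", str(min_command_interval)]
```

With a 60 second timeout and a 5 second interval the waits are 5, 6,
7, 8, 9, 10 and 11 seconds (56 seconds in total), so the sketch yields
seven tries.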

Change-Id: Ia3d8d85497651290c62341ac121e2aa438b4ac50
2020-10-14 19:33:50 +00:00
Dmitry Tantsur
7a89ddcf0c Do not pass BOOTIF=None if no BOOTIF can be guessed
It breaks inspection with the default add_ports=pxe.

Change-Id: I730b4bbd48e7188148669670fdb742b88a62f820
2020-10-13 15:16:43 +02:00
Kaifeng Wang
6a34d47829 Remove root device hint after delete_configuration
In most cases, the root device hint is not guaranteed to be valid
after RAID configuration; this can result in no matching device
being found and a failed deployment. AgentRAID implementations can
return the correct root device hint from the create_configuration.

Change-Id: Iab97a16ef8ccea8186f0cc7a14b77d508804fc8d
2020-10-11 21:52:43 +08:00
Dmitry Tantsur
e39858dd8c Wiping agent tokens on reboot via API - take 2
Because of using an incorrect variable, reboot was treated as power on,
and the token was not wiped.

Change-Id: I656450c2bedc3dc0d20a70de78cc29bf64d5fe85
Story: #2008097
Task: #40799
2020-10-05 17:36:45 +02:00
Iury Gregory Melo Ferreira
db55700384 Fix inspection for idrac
This commit fixes an issue with inspecting nodes using idrac when
using redfish virtual media. We are using the redfish configuration
so it can be backported.

Change-Id: I478c25fac13b49867349c2d9fc8d206c9994c398
Story: #2008221
Task: #41010
2020-10-02 17:49:16 +02:00
Mudit
101fc29686 Add GPU reporting to idrac-wsman inspect interface
This patch implements reporting the number of NVIDIA Tesla T4
devices connected to a system by discovering such devices
and reporting them through the capability 'pci_gpu_devices'.

Change-Id: If713895f05f08a9827c4c085108abb3e388b2a2e
Story: 2008118
Task: 40839
Depends-On: https://review.opendev.org/#/c/750364/
2020-09-30 18:33:53 -04:00