This change adds support for verify steps in Ironic. Verify steps
allow executing actions on the transition from the "verifying" to the
"manageable" state, such as clearing the BMC job queue or resetting
the BMC on supported platforms. Verify steps are similar to deploy
and clean steps, just simpler.
Story: 2009025
Task: 42751
Change-Id: Iee27199a0315b8609e629bac272998c28274802b
Update python-dracclient version to indicate Xena compatibility with
7.*.* releases.
Version 7.0.0 is available from PyPI [1].
[1] https://pypi.org/project/python-dracclient/7.0.0/
Change-Id: I399f1525a473afc0783a52dabf5f85f820794e24
Neutron's firewall initialization with OVS seems
to be the source of our pain with ports not being found
by ironic jobs. This is because firewall startup errors
crash the agent with a RuntimeError while it is deep
in its initial __init__ sequence.
This ultimately seems to be rooted in communication
with OVS itself, but perhaps the easiest solution is
to just disable the firewall.
Related: https://bugs.launchpad.net/neutron/+bug/1944201
Change-Id: I303989a825a7e35f1cb7b401134fd63553f6791c
iDRAC jobs can finish in the 'Completed', 'Failed', or
'Completed with Errors' state. This fix handles
'Completed with Errors' as a finished, failed job. Otherwise the
node stays in a wait state, because such jobs are never considered
finished.
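
A minimal sketch of the classification described above. The names
FINISHED_OK, FINISHED_FAILED and classify_job are hypothetical and are
not taken from the ironic or python-dracclient code; only the three
status strings come from this message.

```python
# Hypothetical helper, for illustration only.
FINISHED_OK = {"Completed"}
FINISHED_FAILED = {"Failed", "Completed with Errors"}


def classify_job(status):
    """Map an iDRAC job status string to 'success', 'failed' or 'running'."""
    if status in FINISHED_OK:
        return "success"
    if status in FINISHED_FAILED:
        # 'Completed with Errors' is finished, but must count as a failure.
        return "failed"
    # Any other status is treated as still in progress.
    return "running"
```

Without the 'Completed with Errors' entry, such a job would fall through
to "running" and the node would keep waiting.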
Change-Id: I5018bf8ef6c86c6d303258f1497fa83d33b3cb76
Adds capability to copy bootloader assets from the system OS
into the network boot folders on conductor startup.
Change-Id: Ica8f9472d0a2409cf78832166c57f2bb96677833
Observed an OOM incident causing
ironic-tempest-ipa-partition-pxe_ipmitool to fail.
One VM started; the other seemed to try to start twice, but both times
it stopped shortly into the run, and the base OS recorded an OOM
failure.
It appears the actual QEMU memory footprint being consumed when
configured at 3GB is upwards of 4GB, which obviously is too big to
fit in our 8GB VM instance.
Dialing back slightly, in hopes it stabilizes the job.
Change-Id: Id8cef722ed305e96d89b9960a8f60f751f900221
Adds API for retrieving node history events
via a node. Includes pagination and limitation
of the response set.
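
A rough sketch of what limit-and-marker pagination over a list of
history events could look like. The function name, the "uuid" key and
the marker semantics are assumptions made for illustration; the actual
API parameters are defined by the change itself.

```python
def paginate(events, limit, marker=None):
    """Return at most `limit` events that follow the event with uuid `marker`.

    `events` is an ordered list of dicts with a 'uuid' key (hypothetical
    shape). With no marker, the first page is returned.
    """
    start = 0
    if marker is not None:
        uuids = [event["uuid"] for event in events]
        # Raises ValueError for an unknown marker.
        start = uuids.index(marker) + 1
    return events[start:start + limit]
```

A caller would pass the uuid of the last event of one page as the marker
for the next page.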
Story: 2002980
Task: 42961
Change-Id: I22a92fa6c30d721f6a5dd0670b2e0a9cf76ad7b1
Previously, set_power_state returned to the caller immediately without
confirming the system had reached the requested state. This fixes that
by synchronously waiting until the target state has been read before
returning.
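
The synchronous wait can be pictured as a simple polling loop. This is
only a sketch under assumptions: wait_for_power_state, get_state and the
timeout and interval defaults are hypothetical and do not reflect the
driver's actual implementation or configuration.

```python
import time


def wait_for_power_state(get_state, target, timeout=10.0, interval=0.5,
                         sleep=time.sleep, clock=time.monotonic):
    """Poll get_state() until it returns `target` or `timeout` elapses.

    Returns True if the target state was read, False on timeout.
    """
    deadline = clock() + timeout
    while True:
        if get_state() == target:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

The caller only returns success after the target state has actually
been read back, rather than right after issuing the request.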
That bug can cause instance workload deployments to fail on Dell EMC
PowerEdge server models on which IPA ramdisk soft power off fails and
ironic employs its OOB fallback strategy. After an otherwise successful
deployment, the node is active, but is powered off. No error is reported
in last_error. If the subsequent instance workflow expects the system to
be powered on into the operating system, it fails.
Story: 2009204
Task: 43261
Change-Id: I3112a22149c07e5508f26c79f33d09aeb905c308
After volumes are deleted in Redfish RAID, also
clear the foreign config, if there is any.
Story: 2009160
Task: 43145
Depends-On: https://review.opendev.org/c/x/sushy-oem-idrac/+/806888
Change-Id: Ifde4656b4edd387ce2db2dbfc4c5ede261fafc70
Previously, a pattern of periodic tasks was created where
nodes, in many cases all nodes that were neither actively locked
nor in maintenance state, were pulled in by a periodic task.
These periodic tasks would then create tasks which generated
additional database queries in order to populate the task object.
With the task object populated, the driver would then evaluate
if the driver in question was for the driver interface in
question and *then* evaluate if work had to be performed.
However, the field indicating whether work needed to be
performed was often already queried from the database in the
very initial query that generates the list of nodes to evaluate.
In essence, we've moved this up in the sequence so we evaluate
that field in question prior to creating the task, potentially
across every conductor, depending on the query, and ultimately
which drivers are enabled.
This potentially saves hundreds of thousands of needless
database queries on a medium size deployment per single day,
depending on which drivers and driver interfaces are in use.
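
The reordering can be sketched as below. All names here are
hypothetical: `rows` stands for the (uuid, driver_internal_info) pairs
returned by the initial query, `flag` for the field that signals pending
work, and `acquire_task` for the expensive step that populates a task
object with further database queries.

```python
def nodes_needing_work(rows, flag):
    """Filter on data the initial query already returned."""
    return [uuid for uuid, info in rows if info.get(flag)]


def run_periodic(rows, flag, acquire_task):
    """Create tasks only for nodes whose flag indicates pending work."""
    handled = 0
    for uuid in nodes_needing_work(rows, flag):
        acquire_task(uuid)  # extra DB queries happen only here
        handled += 1
    return handled
```

Nodes without the flag never reach acquire_task, which is where the
saved queries come from.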
Change-Id: I409e87de2808d442d39e4d0ae6e995668230cbba
Older iDRACs delete the task after 1 minute; since 5.00.00.00
the task is kept for 10 minutes.
If the issue is encountered nevertheless, handle it and advise the
user to either upgrade iDRAC, if not already done, or decrease the
checking interval.
Prior to this, the node got stuck in wait mode forever if the task was
deleted, as the exception raised by the periodic task didn't make the
step fail.
Change-Id: I5d500b7d53e9804aa3b54dc400d8621f40cd5d0c
* Adds periodic task to purge node_history entries based upon
provided configuration.
* Adds recording of node history entries for errors in the
core conductor code.
* Also changes the rescue abort behavior to remove the notice
from being recorded as an error, as this is a likely bug in
behavior for any process or service evaluating the node
last_error field.
* Makes use of a semi-free form event_type field to help
provide some additional context into what is going on and
why. For example if deployments are repeatedly failing,
then perhaps it is a configuration issue, as opposed to
a general failure. If a conductor has no resources, then
the failure, in theory would point back to the conductor
itself.
Story: 2002980
Task: 42960
Change-Id: Ibfa8ac4878cacd98a43dd4424f6d53021ad91166
This patch provides basic data model change to support node history.
Batch removal is not included in this patch.
Change-Id: I5c7cebd585ee84b5b57bd4690d4074baf0d05699
Story: 2002980
Task: 22989
In iDRAC, an import configuration task can complete with OK health
while still having some errors, for example, when one disk failed to be
created and another succeeded.
Also changed to exclude informational messages from error reporting.
Story: 2009198
Task: 43253
Change-Id: I02b63547566c94ffa1a5d0e84bd1b1f10d28bfc3
Currently we set parallel_image_downloads to False, which means that
all downloads that go through the image cache are serialized.
This change enables it by default and deprecates it in favour of a new,
more fine-grained mechanism: the new option image_download_concurrency
specifies how many downloads (and raw conversions) will run in parallel.
Update logging to trace how long each download takes.
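
One way to picture a concurrency limit with per-download timing is a
semaphore wrapped around the download function. This is a sketch, not
the code in this change: limit_concurrency is a hypothetical helper, and
only the idea of a maximum number of parallel downloads is taken from
the option described above.

```python
import logging
import threading
import time

LOG = logging.getLogger(__name__)


def limit_concurrency(func, concurrency):
    """Wrap func so that at most `concurrency` calls run at once."""
    semaphore = threading.BoundedSemaphore(concurrency)

    def wrapper(*args, **kwargs):
        with semaphore:
            started = time.monotonic()
            try:
                return func(*args, **kwargs)
            finally:
                # Trace how long each download took.
                LOG.debug("download took %.2f seconds",
                          time.monotonic() - started)
    return wrapper
```

Callers beyond the limit block on the semaphore until a running
download finishes.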
Change-Id: I8b85afda295029f85e82143cf7d4bcb2316860f6
Currently we default to assuming the cache is up-to-date. This is likely
wrong. Normal web servers provide Last-Modified for files they serve.
If it is absent, chances are high the image is served by some sort
of a dynamic service, which may modify the URL on the fly.
In any case, always updating the image is a safer choice.
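
The new default can be sketched as follows. The function name and the
use of plain timestamps are assumptions for illustration; the actual
cache code may compare values differently.

```python
def cache_is_up_to_date(cached_mtime, last_modified):
    """Decide whether a cached image can be reused.

    `cached_mtime` and `last_modified` are POSIX timestamps;
    `last_modified` is None when the server sent no Last-Modified header.
    """
    if last_modified is None:
        # No header: assume the image may have changed and re-download.
        return False
    return cached_mtime >= last_modified
```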
Change-Id: I0548db14a97638d26ebb687e8f47f1b295d1f774
Implements clean step "clear_ca_certificates" to remove any 3rd party
expired/revoked CA certificates from iLO.
Change-Id: I0a3c1da9b94e4037a53ade100354ac51ca08db35
Story: #2008784
Task: #42175