Docs: Troubleshooting: how to exit clean failed

I got pinged with some questions by an operator who had
issues attempting to exit cleaning. In the discussion,
it was realized we lack basic troubleshooting guidance,
which led them to try everything but the command they needed.

As such, adding some guidance in an attempt to help operators
navigate these sorts of issues moving forward.

Change-Id: Ia563f5e50bbcc789ccc768bef5800a64b38ff3d7
Julia Kreger 2023-01-19 11:31:28 -08:00
parent 9a85e4787b
commit 8604a799aa

@@ -1100,3 +1100,47 @@ of other variables, you may be able to leverage the `RAID <raid>`_
configuration interface to delete volumes/disks, and recreate them. This may
have the same effect as a clean disk, however that too is RAID controller
dependent behavior.
I'm in "clean failed" state, what do I do?
==========================================
There is only one way to exit the ``clean failed`` state. But before we visit
the answer as to **how**, we need to stress the importance of attempting to
understand **why** cleaning failed. On the simple end, the cause may be
something like a DHCP failure, but on the complex end, a cleaning action may
have failed against the underlying hardware, possibly due to a hardware fault.

As such, we encourage everyone to attempt to understand **why** before exiting
the ``clean failed`` state, because you could potentially make things worse
for yourself. For example, if firmware updates were being performed, you may
need to perform a rollback operation against the physical server, depending on
what firmware was being updated and how. Unfortunately this also borders on
the territory of "no simple answer".
This can be counterbalanced by the fact that sometimes the failure is
transient, for example a momentary networking issue which prevented a DHCP
address from being obtained. Such a case would be suggested by the
``last_error`` field indicating something like "Timeout reached while cleaning
the node". Either way, we recommend following several basic troubleshooting
steps, with example command invocations after the list:

* Consult the ``last_error`` field on the node, utilizing the
``baremetal node show <uuid>`` command.
* If the version of ironic supports the feature, consult the node history
log, ``baremetal node history list`` and
``baremetal node history get <uuid>``.
* Consult the actual console screen of the physical machine. *If* the ramdisk
booted, you will generally want to investigate the controller logs and see
if an uploaded agent log is being stored on the conductor responsible for
the baremetal node. Consult `Retrieving logs from the deploy ramdisk`_.
If the node did not boot for some reason, you can typically just retry
at this point and move on.
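
As a rough illustration (``<node_uuid>`` and ``<event_uuid>`` below are
placeholders, and exact arguments may vary with your client version), the
commands referenced above may be invoked as follows:

.. code-block:: console

   # Check the last recorded error and current provision state of the node
   baremetal node show <node_uuid>

   # If the deployment supports node history, list recorded events and
   # inspect a specific one
   baremetal node history list <node_uuid>
   baremetal node history get <node_uuid> <event_uuid>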

The way to get out of the state, once you've understood **why** you reached it
in the first place, is to utilize the ``baremetal node manage <node_id>``
command. This returns the node to the ``manageable`` state, from which you can
retry cleaning, either through automated cleaning with the ``provide`` command
or through manual cleaning with the ``clean`` command, or move on to the next
appropriate action in the workflow you are attempting to follow, which may
ultimately be decommissioning the node because it has failed and is being
removed or replaced.
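
For example, assuming the node should simply be retried through automated
cleaning (``<node_id>`` and the specific clean step shown are illustrative,
not prescriptive), the recovery sequence may look like this:

.. code-block:: console

   # Return the node from "clean failed" to the "manageable" state
   baremetal node manage <node_id>

   # Retry automated cleaning; on success the node becomes "available"
   baremetal node provide <node_id>

   # Alternatively, run manual cleaning with explicit clean steps
   # (the step below is only an example; choose steps suited to your hardware)
   baremetal node clean <node_id> \
       --clean-steps '[{"interface": "deploy", "step": "erase_devices_metadata"}]'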