ironic-python-agent

Author	SHA1	Message	Date
Jay Faulkner	80575566b1	Allow manual setting of Ironic API Version Typically, the Ironic API client in IPA will autodetect the API version based on the output of a GET of the root of the API. If for some reason this API endpoint is restricted, or the operator wishes to limit the Ironic API version IPA uses, they can now set CONF.ironic_api_version to avoid autodetection and force a version. Change-Id: Ib96a1057792f45f2e4554671e32c436140463ee8	2020-10-23 15:38:42 +00:00
Julia Kreger	6542a9cb04	Don't run os-prober from grub2-mkconfig By default, grub2-mkconfig scans everything to look for other environments and then load those into the grub configuration. It makes sense, but on newer versions of grub2 in distribution images, os-prober is taking an exceptionally long time in some cases where more than one storage device exists with other filesystems. As a result of the os-prober execution by grub2-mkconfig, the bootloader installation can completely time out and fail the deployment. This is presently experienced with metalsmith on centos8. There are numerous sporadic reports of issues like this one, where grub2-mkconfig hangs for some period of time, and this is observable on Centos8.2 in our CI. While one report[0] mentions this issue, another bug [1] has the dialog that actually helps us frame the context as to what we likely should do. Also, fixes the unit testing so we actually test if we're running with grub2. :\ [0]: https://bugzilla.redhat.com/show_bug.cgi?id=1744693 [1]: https://bugzilla.redhat.com/show_bug.cgi?id=1709682 Depends-On: https://review.opendev.org/#/c/748315 Change-Id: I14bf299afef3a1ddb2006fe5f182d7f0d249e734	2020-10-22 22:28:07 +00:00
Zuul	80b0a9a132	Merge "Software RAID: Re-add missing devices"	2020-10-12 12:24:24 +00:00
Dmitry Tantsur	420ebc0d73	Do not silently swallow errors in the write_image deploy step Calling join() does not raise, we need to explicitly check the result. Change-Id: I81d3d727af220c2b50358edab8139f07874611f0 Story: #2008240 Task: #41083	2020-10-09 11:24:12 +02:00
Dmitry Tantsur	fc4e0eed6a	Don't try to call GRUB when root UUID is not provided We don't have a really working way to detect root UUID for whole disk images at the moment, which results in an ignored traceback every time install_bootloader is called with whole disk images in UEFI mode. Avoid it by skipping GRUB2 if root UUID is unknown. Change-Id: I84245538f59c664b72d1cafbca8d61be0978f489	2020-10-07 12:06:42 +02:00
Zuul	abd9f91813	Merge "Add basic retries for inspection"	2020-10-06 17:07:20 +00:00
Arne Wiebalck	253b4887d5	Software RAID: Re-add missing devices Upon md device creation, component devices are sometimes removed immediately again due to a "disk failure". The disks seem healthy, though. This patch re-adds component devices in such cases to prevent the md device from remaining in a degraded state (which would cause issues later, e.g. during ESP creation). Story: #2008164 Task: #40914 Change-Id: I2ac7cb4a546de84686d5c3435e850c14b3f6c1d7	2020-10-06 14:00:57 +02:00
OpenStack Release Bot	fb45e58d1c	Update master for stable/victoria Add file to the reno documentation build to show release notes for stable/victoria. Use pbr instruction to increment the minor version number automatically so that master versions are higher than the versions on stable/victoria. Change-Id: Ia3696da8663c140504924b0a1cd23f9aaa517f0a Sem-Ver: feature	2020-10-01 18:42:40 +00:00
Zuul	99dee5067e	Merge "Software RAID: Get component devices by md UUID"	2020-09-30 18:30:56 +00:00
Zuul	faeb9441d3	Merge "Simplify heartbeating by removing use of select()"	2020-09-29 15:47:08 +00:00
Arne Wiebalck	044c64dbc0	Software RAID: Get component devices by md UUID Scanning the output of mdadm commands for RAID members will miss component devices which are currently not part of the RAID. For proper cleaning it is better to scan block devices for a signature of the md device for which we would like to get the components. Story: #2008186 Task: #40947 Change-Id: Ib46612697851e36a16d272ccaeb0115106253863	2020-09-29 17:08:40 +02:00
Arne Wiebalck	c7aec775ff	Software RAID: Don't delete partitions too early Partitions on the holder disk should only be deleted after all RAID devices have been deleted. Otherwise, super blocks on partitions which reside on the same disks cannot be cleaned. Story: #2008199 Task: #40979 Change-Id: I19293f5b992cd1fa68957d6f306dcec8f3b7a820	2020-09-28 10:35:12 +02:00
Zuul	c7ff931fe6	Merge "Fix: make Intel CNA hardware manager none generic"	2020-09-23 14:57:40 +00:00
Zuul	11a87365fb	Merge "Generate a TLS certificate and send it to ironic"	2020-09-23 12:14:38 +00:00
Qianbiao.NG	4b0ef13d08	Fix: make Intel CNA hardware manager none generic Currently, IntelCnaHardwareManager inherits GenericHardwareManager which makes it a new "GenericHardwareManager" with "MAINLINE" priority. This causes all other hardware managers with lower priority than "MAINLINE" to never be used. To fix this, make IntelCnaHardwareManager inherit the basic HardwareManager. Change-Id: I28b665d8841b0b2e83b132e1f25df95e03e7ba10 Story: 2008142 Task: 40882	2020-09-23 18:24:26 +08:00
Jay Faulkner	a01646f56b	Simplify heartbeating by removing use of select() Heartbeating in IPA has used select.poll() for years to work around a bug where changing the time in the ramdisk could cause heartbeats to stop and never resume. Now that IPA syncs time at start and exit, this workaround is no longer needed. So instead, we'll revert to using threading.Event() in order to make the code simpler and easier to understand. Since we need this to be an eventlet-event, and not a standard-thread event, also monkey_patch threading. Additionally, there were a few completely unused backoff interval values set that were never applied. In respect of maintaining the 5+ years old behavior of not doing error backoffs, that code was removed instead of being made to work. Change-Id: Ibcde99de64bb7e95d5df63a42a4ca4999f0c4c9b	2020-09-22 16:59:47 +00:00
Julia Kreger	bb27badf76	Add basic retries for inspection A transitory connection failure, such as one caused by a port being held down for traffic forwarding, can experience intermittent connectivity failures which result in failed introspections. Now the agent retries. Change-Id: I72c5e3aca000d3854a17f8a461b1a2935e5c0d9b	2020-09-14 22:38:18 +00:00
Dmitry Tantsur	021e0a6a46	Generate a TLS certificate and send it to ironic Adds a new flag (on by default) that enables generating a TLS certificate and sending it to ironic via heartbeat. Whether ironic supports auto-generated certificates is determined by checking its API version. Change-Id: I01f83dd04cfec2adc9e2a6b9c531391773ed36e5 Depends-On: https://review.opendev.org/747136 Depends-On: https://review.opendev.org/749975 Story: #2007214 Task: #40604	2020-09-11 17:46:52 +02:00
Julia Kreger	3426963552	Fix backup node lookup The node lookup code added in change I27201319f31cdc01605a3c5ae9ef4b4218e4a3f6 was slightly broken in that we call a method with a keyword argument which doesn't exist. uuid versus node_uuid. It happens, it is a quick fix! Spotted on a metalsmith job: [-] Agent is requesting to perform an explicit node cache update. This is to pickup any chanages in the cache before deployment. [-] Failed to update node cache. Error lookup_node() got an unexpected keyword argument 'uuid' Change-Id: I59ecec65707a2f03918b233f1925395ebe59b8c4	2020-09-09 15:19:38 -07:00
Zuul	e73b7220c4	Merge "If listen_tls is true, enable TLS on wsgi server"	2020-09-03 18:59:48 +00:00
Zuul	09f6a4e3da	Merge "Update the cache if we don't have a root device hint"	2020-09-03 09:41:58 +00:00
Jay Faulkner	1d11f0b7dd	If listen_tls is true, enable TLS on wsgi server This change enables operators to set [DEFAULT]listen_tls to true to configure IPA to host its WSGI server over TLS using existing SSL support in oslo.service. In addition to configuring this in IPA, a deployer will also need to set [ssl]cert_file, [ssl]key_file, and optionally [ssl]ca_file in their ipa config, as well as embed those files into the IPA ramdisk in order for this to be functional. In order to make this change work, we also need to monkey patch the socket library early, or else oslo.service will end up passing an unpatched socket to the eventlet wsgi server, which causes deadlocks. Change-Id: Ib7decae410915f3c27b045ee08538c94d455b030	2020-09-02 16:07:42 -07:00
Jay Faulkner	7d0ad36ebd	Make WSGI server respect listen_* directives The listen_port and listen_host directives are intended to allow deployers of IPA to change the port and host IPA listens on. These configs have not been obeyed since the migration to the oslo.service wsgi server. Story: 2008016 Task: 40668 Change-Id: I76235a6e6ffdf80a0f5476f577b055223cdf1585	2020-08-31 14:37:38 +00:00
Julia Kreger	d3c3d4dabe	Update the cache if we don't have a root device hint Or at least try to. Some deployments just don't use root device hints, and this is okay. However, other deployments need root device hints, and with fast track mode in ramdisks, we created a situation where the node cache could be updated by a human or software between the time the agent was started, and the deployment was requested. As a result, the agent has been updated to check if we have a hint and if we don't, update the cache from the node lookup endpoint. This is not needed when the inband deploy steps are executed, as the process of updating the steps does force the node cache to be updated. Change-Id: I27201319f31cdc01605a3c5ae9ef4b4218e4a3f6 Story: 2008039 Task: 40701	2020-08-25 19:34:48 +00:00
Zuul	cfede0c5bc	Merge "Clarify connection error on heartbeats"	2020-08-24 13:29:27 +00:00
Julia Kreger	f670f704f3	Clarify connection error on heartbeats Heartbeat connection errors are often a sign of transitory network failures which may resolve themselves. But an operator looking at the screen doesn't necessarily know that. They don't understand that there could have been a network failure, or a misconfiguration that caused the connectivity failure, and sort of default to "well it failed" without further clarification. As such, this patch adds explicit catching of the requests ConnectionError exception and raises a new internal error with a more verbose error message in that event to provide operators with additional clarity. Change-Id: I4cb2c0d1f577df1c4451308bd86efa8f94390b0c Story: 2008046 Task: 40709	2020-08-20 13:45:47 -07:00
Dmitry Tantsur	d50ff06b6b	Enable the logs collection by default It's incredibly helpful when debugging and most of consumers seem to enable and rely on it. Change-Id: I33bf58b3eb16b63b70f2a23e8a04449dc88fd94c	2020-08-19 17:25:24 +02:00
Zuul	3e938b6fcc	Merge "Support changing the protocol part of callback_url to https"	2020-08-10 14:59:51 +00:00
Zuul	9f88a0cb59	Merge "Fix TypeError on agent lookup failure"	2020-08-07 16:32:30 +00:00
Dmitry Tantsur	353d09c3b0	Support changing the protocol part of callback_url to https Adds a new kernel parameter for manual configuration and also creates foundation for automatic TLS support later. Change-Id: If341c3a8a268fc8cab6bd6be04b12ca32b31c8d8 Story: #2007214 Task: #40619	2020-08-06 15:14:31 +02:00
Julia Kreger	5eab9bced6	Fix TypeError on agent lookup failure Agent lookups can fail as we presently use logging.exception, better known in our code as LOG.exception, which can also generate other fun issues on journald based systems where additional errors could be raised resulting in us being unable to troubleshoot the actual issue. Because of the misuse of LOG.exception and the default behavior of the backoff retry handler, the retry logic was also not functional as any error no matter how small caused IPA to just exit. Change-Id: Ic4608b7c6ff9773d1403926efb3d59869c71343b Story: 2007968 Task: 40465	2020-08-04 20:43:02 -07:00
Kaifeng Wang	b424fbfa35	Extends pci devices metrics Collects PCI class, revision, and bus information for the pci-devices collector. These metrics, as well as vendor id and device id, are components which can be used to construct device information like lspci output, which is how cyborg agent collects accelerator devices. Accelerator device based scheduling is possible after ironic has such information in place. Change-Id: I6c37c554f37dd5f1d21c8fd4fad2a4f44a3c75d7 Story: 2007971 Task: 40474	2020-08-04 23:32:37 +08:00
Zuul	ad9c54f55c	Merge "Return the final RAID configuration from apply_configuration"	2020-07-29 14:00:08 +00:00
Dmitry Tantsur	f03d72019a	Return the final RAID configuration from apply_configuration AgentRAID expects it and fails with TypeError if it's not provided. Change-Id: Id84ac129bba97540338e25f0027aa0a0f51bde52 Story: #2006963	2020-07-29 10:10:18 +02:00
Dmitry Tantsur	eb87651496	Allow erase_devices_metadata to be used as a deploy step Change-Id: I75f156dd76b0e3aaa1592ba24fe42fb2a7057cc8 Story: #2006963	2020-07-27 17:57:37 +02:00
Zuul	9ca640a1c5	Merge "Prevent un-needed iscsi cleanup"	2020-07-25 13:54:51 +00:00
Zuul	f6bf94fe64	Merge "Fix versions in release notes"	2020-07-23 00:09:02 +00:00
Zuul	daf61f33b0	Merge "Fix bootloader install issue with MDRAID"	2020-07-22 22:13:34 +00:00
Zuul	bfb395837d	Merge "Adds poll mode deployment support"	2020-07-22 19:53:31 +00:00
Doug Szumski	5e95b1321d	Fix bootloader install issue with MDRAID When no root_device hint is set, an MDRAID partition can be incorrectly selected as the root device which causes installation of the bootloader to the physical disks behind the MDRAID volume to fail. See the notes in the referenced Story for more detail. This change adds a little more specificity to the listing of block devices. Change-Id: I66db457e71a0586723ee753bef961aec5bf58827 Story: 2007905 Task: 40303	2020-07-22 11:16:13 -07:00
Riccardo Pittau	ab585153c9	Fix versions in release notes Change-Id: I2ba658d83a15554e135429d464c0a033063d4631	2020-07-22 15:41:38 +02:00
Julia Kreger	2a56ee03b6	Prevent un-needed iscsi cleanup When we added software raid support, we started calling bootloader installation. As time went on, we enhanced that code path for non RAID cases in order to ensure that UEFI nvram was set up for the instance to boot properly. Somewhere in this process, we missed a possible failure case where the iscsi client tgtadm may return failures. Obviously, the correct path is to not call iscsi teardown if we don't need to. Since it was always semi-opportunistic teardown, we can't blindly catch any error, and if we started iSCSI and failed to tear the connection down, we might want to still fail, so this change moves the logic over to use a flag on the agent object which allows one extension to set the flag and the other to read it and take action based upon that. Change-Id: Id3b1ae5e59282f4109f6246d5614d44c93aefa7c Story: 2007937 Task: 40395	2020-07-20 14:24:06 -07:00
Dmitry Tantsur	1f3b70c4e9	Ignore devices with size 0 when collecting inventory delete_configuration still fetches all devices as it needs to clean ones with broken RAID. Story: #2007907 Task: #40307 Change-Id: I4b0be2b0755108490f9cd3c4f3b71a5e036761a1	2020-07-09 18:28:20 +02:00
Zuul	2e9620a2c0	Merge "Limit Inspection->Lookup->Heartbeat lag"	2020-07-06 18:08:14 +00:00
Zuul	6218725610	Merge "Fix serializing ironic-lib exceptions"	2020-07-06 16:47:58 +00:00
Julia Kreger	c76b8b2c21	Limit Inspection->Lookup->Heartbeat lag Caches hardware information collected during inspection so that the initial lookup can occur without any delay. Also adds logging to track how long inventory collection takes. Co-Authored-By: Dmitry Tantsur <dtantsur@protonmail.com> Change-Id: I3e0d237d37219e783d81913fa6cc490492b3f96a	2020-07-03 10:32:26 +02:00
Dmitry Tantsur	ba3caa6c64	Increase the ESP partition size to 550 MiB when using software RAID This has been a popular guidance, and diskimage-builder has recently started following it. Change-Id: I794c846fb191c15b0a30546bf64d624dfbde0fd4	2020-07-02 17:30:33 +02:00
Dmitry Tantsur	a4855c544c	Fix serializing ironic-lib exceptions Change-Id: If1408e4b81d263c56b4bbab618dd0737db5f762e Story: #2007889 Task: #40268	2020-07-02 12:18:53 +02:00
Julia Kreger	c77a7df851	Extend retries to 9, 10 seconds apart. The download retry interval was previously five seconds which is not long enough to recover after a hard network connectivity break where we may be reliant upon network port forwarding hold-down timers or even routing protocol route propagation to recover communication. Previously the time value was 5 seconds, with 3 attempts, meaning 15 seconds total ignoring the error detection timeouts. Now it is 10 seconds, with 10 attempts, meaning 100 seconds before the error detection timeouts. Change-Id: I6d11edc9a3156f2bdc21c3d432ecc7625d652699	2020-06-23 20:27:49 +00:00
Julia Kreger	159ab9f0ce	Add full download retries Instead of just trying to get the connection and handler for the download, let's try to retry the whole action of downloading. Change-Id: I9217792d32e6f33c70f146a9b7d3ef58c5644d8a	2020-06-23 20:27:41 +00:00

510 Commits