The healthcheck checks for a process called httpd, but Debian and Ubuntu
based images call it apache2. This results in the ironic_ipxe container
being marked as unhealthy.
This change fixes the issue by making the process name distro-dependent.
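A minimal sketch of the approach, assuming a helper variable (the
actual variable name used by the patch may differ):

apache_process_name: "{{ 'apache2' if kolla_base_distro in ['debian', 'ubuntu'] else 'httpd' }}"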
Change-Id: I0b0126e3071146e7f8593ba970ecbed65b36fcfa
Closes-Bug: #1937037
Ansible facts can have a large impact on the performance of the Ansible
control host. This patch introduces some control over which facts are
gathered (kolla_ansible_setup_gather_subset) and which facts are stored
(kolla_ansible_setup_filter). We do not change the default values of
these arguments to the setup module. The flexibility of these arguments
is limited, but they provide enough for a large performance improvement
in a typical moderate to large OpenStack cloud.
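Roughly, these variables feed the setup module's arguments, e.g. (a
sketch, task name illustrative):

- name: Gather facts
  setup:
    gather_subset: "{{ kolla_ansible_setup_gather_subset }}"
    filter: "{{ kolla_ansible_setup_filter }}"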
In particular, the large, complex dict fact generated for each network
interface has a significant effect, and on an OpenStack controller or
hypervisor there may be many virtual interfaces. We can use the
kolla_ansible_setup_filter variable to help:
kolla_ansible_setup_filter: 'ansible_[!qt]*'
This causes Ansible to collect, but not store, facts that do not match
the pattern, which includes the virtual interface facts (Neutron's
virtual interfaces typically have names starting with q or t, e.g. qbr,
tap). Currently we are not referencing any other facts dropped by the
pattern within Kolla Ansible.
Note that including the 'ansible_' prefix causes the meta facts
module_setup and gather_subset to be filtered out, but this seems to be
the only way to get a good match on the interface facts. To work around
this, we use ansible_facts rather than module_setup to detect whether
facts exist in the cache.
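For example, a guard along these lines (condition illustrative) works
whether or not module_setup survives the filter:

- name: Gather facts if not already cached
  setup:
  when: not hostvars[inventory_hostname].ansible_facts | default({})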
The exact improvement will vary, but has been reported to be as large as
18x on systems with many virtual interfaces.
For reference, here are some other tunings tried:
* Increasing the number of forks (a large speedup, depending on the
  size of the deployment)
* Using `strategy = mitogen_linear` (cut processing time in half)
* Ansible caching (little speedup)
* SSH tuning (little speedup)
Co-Authored-By: Mark Goddard <mark@stackhpc.com>
Closes-Bug: #1921538
Change-Id: Iae8ca4aae945892f1dc65e1b10381d2e26e88805
This commit adds two new CLI commands that allow an operator to read
and write passwords in a configured HashiCorp Vault KV store.
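Usage is along these lines (both the command names and the flags shown
here are assumptions; check each command's --help output):

kolla-writepwd --passwords /etc/kolla/passwords.yml --vault-addr https://vault.example.com:8200
kolla-readpwd --passwords /etc/kolla/passwords.yml --vault-addr https://vault.example.com:8200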
Change-Id: Icf0eaf7544fcbdf7b83f697cc711446f47118a4d
By default, Ansible injects a variable for every fact, prefixed with
ansible_. This can result in a large number of variables for each host,
which at scale can incur a performance penalty. Ansible provides a
configuration option [0] that can be set to False to prevent this
injection of facts. In this case, facts should be referenced via
ansible_facts.<fact>.
This change updates all references to Ansible facts within Kolla Ansible
from using individual fact variables to using the items in the
ansible_facts dictionary. This allows users to disable fact variable
injection in their Ansible configuration, which may provide some
performance improvement.
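For example:

Before: "{{ ansible_hostname }}"
After:  "{{ ansible_facts.hostname }}"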
This change disables fact variable injection in the Ansible
configuration used in CI, to catch any attempts to use the injected
variables.
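That is, in ansible.cfg (the documented option from [0]):

[defaults]
inject_facts_as_vars = False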
[0] https://docs.ansible.com/ansible/latest/reference_appendices/config.html#inject-facts-as-vars
Change-Id: I7e9d5c9b8b9164d4aee3abb4e37c8f28d98ff5d1
Partially-Implements: blueprint performance-improvements
Magnum has various sections in its configuration file for OpenStack
clients. When internal TLS is enabled, these may need a CA certificate
to be specified.
This change adds a CA certificate configuration, based on
openstack_cacert, for all clients using internal endpoints.
Note: we are explicitly not adding the configuration for the
[magnum_client] ca_file and [drivers] openstack_ca_file options, since
these use the public endpoint by default. These options may be
provided via custom configuration if necessary.
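For illustration, each internal client section in the rendered
magnum.conf gains an entry along these lines (the section shown is one
example; option spelling follows magnum's client options):

[cinder_client]
ca_file = {{ openstack_cacert }}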
Change-Id: Ie59b3777c0a2c142b580addd67e279bc4b2f2c90
Co-Authored-By: Kyle Dean
Closes-Bug: #1919389
Kolla Ansible runs iscsid in the foreground (-f), and a recent change
to iscsid in CentOS 8 (both Linux and Stream) caused it to reject
setting a PID file in that case. The PID file is irrelevant in this
scenario, so this commit removes the parameter.
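Illustratively, the container command changes along these lines (the
exact arguments are an assumption):

Before: iscsid -d 8 -f --pid=/run/iscsid.pid
After:  iscsid -d 8 -f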
Closes-Bug: #1933033
Change-Id: Ic0c4beae0c812f3ca68a6ee5cc4daa2fee0f277d
This reverts commit c6259158e3eff4aff9770b7044b0179a7de533aa.
Reason for revert: cAdvisor fails with:
invalid value "percpu,referenced_memory,cpu_topology,resctrl,udp,advtcp,sched,hugetlb,memory_numa,tcp,process" for flag -disable_metrics: unsupported metric "referenced_memory" specified in disable_metrics
Change-Id: I1a0eea5c20f95f38c707401b56b7d2454484377d
Adds support for passing extra runtime options to cAdvisor. By default,
the new options disable the export of rarely useful metrics and labels
by cAdvisor. This helps reduce the load on Prometheus and cAdvisor
itself.
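For illustration, the defaults amount to passing cAdvisor flags like
these (the disable_metrics list matches the one visible in the revert
above; store_container_labels is a real cAdvisor flag, though its use
here is an assumption):

--store_container_labels=false --disable_metrics=percpu,referenced_memory,cpu_topology,resctrl,udp,advtcp,sched,hugetlb,memory_numa,tcp,process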
Change-Id: Id0144e8fa518e3236cb94ba2e3961fb455d36443
With the new defaults since Wallaby, starting Docker causes it to
enable IP forwarding without filtering forwarded traffic at all.
This may pose a security risk and should be mitigated.
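For illustration, dockerd exposes the relevant knobs in daemon.json,
which Kolla Ansible can set via docker_custom_config; a sketch, not
necessarily this patch's exact mitigation:

docker_custom_config:
  ip-forward: false
  iptables: false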
Closes-Bug: #1931615
Change-Id: I5129136c066489fdfaa4d93741c22e5010b7e89d
The host list order seen during Ansible handlers may differ from the
usual play host list order, due to race conditions in notifying
handlers. This
means that restart_services.yml for RabbitMQ may be included in a
different order than the rabbitmq group, resulting in a node other than
the 'first' being restarted first. This can cause some nodes to fail to
join the cluster. The include_tasks loop was introduced in [1].
This change fixes the issue by splitting the handler into two tasks, and
restarting the first node before all others.
[1] https://review.opendev.org/c/openstack/kolla-ansible/+/763137
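A rough sketch of the split (task names and conditions illustrative);
because Ansible completes each handler task on all hosts before
starting the next, the first node is guaranteed to restart before the
others:

- name: Restart first rabbitmq container
  include_tasks: restart_services.yml
  when: inventory_hostname == groups['rabbitmq'] | first
  listen: Restart rabbitmq container

- name: Restart remaining rabbitmq containers
  include_tasks: restart_services.yml
  when: inventory_hostname != groups['rabbitmq'] | first
  listen: Restart rabbitmq container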
Change-Id: I1823301d5889589bfd48326ed7de03c6061ea5ba
Closes-Bug: #1930293
Since I0474324b60a5f792ef5210ab336639edf7a8cd9e, the swift role uses
the new service-cert-copy role introduced in
I6351147ddaff8b2ae629179a9bc3bae2ebac9519, but the swift role itself
doesn't contain the handler used by service-cert-copy. Right now,
restarting the swift container isn't necessary, but the handler should
exist. We should also fix the name of the service used.
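A sketch of the kind of handler required (container and service names
assumed):

- name: Restart swift-proxy-server container
  become: true
  kolla_docker:
    action: "restart_container"
    common_options: "{{ docker_common_options }}"
    name: "swift_proxy_server"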
Closes-Bug: #1931097
Change-Id: I2d0615ce6914e1f875a2647c8a95b86dd17eeb22
Signed-off-by: Maksim Malchuk <maksim.malchuk@gmail.com>
On machines with many cores, we were seeing excessive CPU load on systems
that were not very busy. With the following Erlang VM argument we saw
RabbitMQ CPU usage drop from about 150% to around 20%, on a system with
40 hyperthreads.
+S 2:2
By default, RabbitMQ starts N schedulers, where N is the number of CPU
cores, including hyper-threaded cores. This is fine when you assume all
your CPUs are dedicated to RabbitMQ, but it's not a good idea in a
typical Kolla Ansible setup. Here we go for two scheduler threads.
More details can be found here:
https://www.rabbitmq.com/runtime.html#scheduling
and here:
https://erlang.org/doc/man/erl.html#emulator-flags
+sbwt none
This stops busy waiting of the scheduler, for more details see:
https://www.rabbitmq.com/runtime.html#busy-waiting
Newer versions of rabbit may need additional flags:
"+sbwt none +sbwtdcpu none +sbwtdio none"
But this patch should be backportable to the older versions of RabbitMQ
used in Train and Stein.
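For reference, one way to apply such flags is
RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS in rabbitmq-env.conf (a sketch; the
patch may wire this differently):

RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS="+S 2:2 +sbwt none"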
Note that information on this tuning was found by looking at data from:
rabbitmq-diagnostics runtime_thread_stats
More details on that can be found here:
https://www.rabbitmq.com/runtime.html#thread-stats
Related-Bug: #1846467
Change-Id: Iced014acee7e590c10848e73feca166f48b622dc