Docker has no restart policy named 'never'. It has 'no'.
This has bitten us already (see [1]) and might bite us again whenever
we want to change the restart policy to 'no'.
This patch makes our docker integration honor all valid restart policies
and only valid restart policies.
All relevant docker restart policy usages are patched as well.
I added some FIXMEs which are relevant to the kolla-ansible docker
integration. They are not fixed here so as not to alter behavior.
[1] https://review.opendev.org/667363
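Valid Docker restart policies are: no, on-failure, always and
unless-stopped. A hedged example of passing one of them through
kolla_docker (the container name and image variable are illustrative):

  - name: Start example container
    become: true
    kolla_docker:
      action: start_container
      name: example
      image: "{{ example_image_full }}"  # illustrative variable
      restart_policy: "no"  # quoted so YAML does not parse it as boolean false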
Change-Id: I1c9764fb9bbda08a71186091aced67433ad4e3d6
Signed-off-by: Radosław Piliszek <radoslaw.piliszek@gmail.com>
A common class of problems goes like this:
* kolla-ansible deploy
* Hit a problem, often in ansible/roles/*/tasks/bootstrap.yml
* Re-run kolla-ansible deploy
* Service fails to start
This happens because the DB is created during the first run, but for some
reason we fail before performing the DB sync. This means that on the second run
we don't include ansible/roles/*/tasks/bootstrap_service.yml because the DB
already exists, and therefore still don't perform the DB sync. However this
time, the command may complete without apparent error.
We should be less careful about when we perform the DB sync, and do it whenever
it is necessary. There is an argument for not doing the sync during a
'reconfigure' command, although we will not change that here.
This change always performs the DB sync during the 'deploy' and
'reconfigure' commands.
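A minimal sketch of the intended flow, assuming the role layout
described above (file names match the patterns mentioned; the condition
variables are illustrative):

  # Previously the DB sync (bootstrap_service.yml) was only included when
  # the database had just been created, e.g.:
  #   when: database.changed
  # Now it runs for every deploy and reconfigure:
  - include_tasks: bootstrap_service.yml
    when: kolla_action in ["deploy", "reconfigure"]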
Change-Id: I82d30f3fcf325a3fdff3c59f19a1f88055b566cc
Closes-Bug: #1823766
Closes-Bug: #1797814
Due to a bug in ansible, kolla-ansible deploy currently fails in nova
with the following error when used with ansible earlier than 2.8:
TASK [nova : Waiting for nova-compute services to register themselves]
*********
task path:
/home/zuul/src/opendev.org/openstack/kolla-ansible/ansible/roles/nova/tasks/discover_computes.yml:30
fatal: [primary]: FAILED! => {
"failed": true,
"msg": "The field 'vars' has an invalid value, which
includes an undefined variable. The error was:
'nova_compute_services' is undefined\n\nThe error
appears to have been in
'/home/zuul/src/opendev.org/openstack/kolla-ansible/ansible/roles/nova/tasks/discover_computes.yml':
line 30, column 3, but may\nbe elsewhere in the file
depending on the exact syntax problem.\n\nThe
offending line appears to be:\n\n\n- name: Waiting
for nova-compute services to register themselves\n ^
here\n"
}
Example:
http://logs.openstack.org/00/669700/1/check/kolla-ansible-centos-source/81b65b9/primary/logs/ansible/deploy
This was caused by
https://review.opendev.org/#/q/I2915e2610e5c0b8d67412e7ec77f7575b8fe9921,
which hits upon an ansible bug described here:
https://github.com/markgoddard/ansible-experiments/tree/master/05-referencing-registered-var-do-until.
We can work around this by not using an intermediary variable.
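A hedged sketch of the workaround, with illustrative task content
rather than the exact discover_computes.yml tasks:

  # Before: an intermediary variable defined in task-level 'vars' that
  # references the registered result, which pre-2.8 ansible mishandles
  # inside a do-until loop:
  #
  #   register: nova_compute_services_result
  #   until: nova_compute_services | length > 0
  #   vars:
  #     nova_compute_services: "{{ nova_compute_services_result.stdout | from_json }}"
  #
  # After: reference the registered variable directly in the condition:
  - name: Waiting for nova-compute services to register themselves
    command: openstack compute service list --service nova-compute -f json
    register: nova_compute_services_result
    changed_when: false
    retries: 20
    delay: 10
    until: (nova_compute_services_result.stdout | from_json) | length > 0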
Change-Id: I58f8fd0a6e82cb614e02fef6e5b271af1d1ce9af
Closes-Bug: #1835817
In a single controller scenario, the "Upgrade status check result"
does nothing because the previous task can only succeed when
`nova-status upgrade check` returns code 0. This change allows this
command to fail, so that the value of the return code stored in
`nova_upgrade_check_stdout` can then be analysed.
This change also allows warnings (rc 1) to pass.
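A hedged sketch of the pattern (task names follow the ones quoted
above; the exact commands in the role may differ):

  - name: Run nova-status upgrade check
    become: true
    command: docker exec nova_api nova-status upgrade check
    register: nova_upgrade_check_stdout
    run_once: true
    failed_when: false  # let the command fail here ...

  - name: Upgrade status check result
    fail:
      msg: "nova-status upgrade check reported an error"
    # ... and analyse the return code here; rc 1 is only a warning and passes
    when: nova_upgrade_check_stdout.rc not in [0, 1]
    run_once: true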
Closes-Bug: 1834647
Change-Id: I6f5e37832f43f23604920b9d890cc505ca924ff9
There is a race condition during nova deploy since we wait for at least
one compute service to register itself before performing cells v2 host
discovery. It's quite possible that other compute nodes will not yet
have registered and will therefore not be discovered. This leaves them
not mapped into a cell, and results in the following error if the
scheduler picks one when booting an instance:
Host 'xyz' is not mapped to any cell
The problem has been exacerbated by merging a fix [1][2] for a nova race
condition, which disabled the dynamic periodic discovery mechanism in
the nova scheduler.
This change fixes the issue by waiting for all expected compute services
to register themselves before performing host discovery. This includes
both virtualised compute services and bare metal compute services.
[1] https://bugs.launchpad.net/kolla-ansible/+bug/1832987
[2] https://review.opendev.org/665554
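A hedged sketch of the new wait condition (the inventory group name is
illustrative, and the real change matches registered hostnames rather
than a simple count; ironic's bare metal computes are handled
analogously):

  - name: Waiting for nova-compute services to register themselves
    command: openstack compute service list --service nova-compute -f json
    register: nova_compute_services_result
    changed_when: false
    retries: 20
    delay: 10
    # Previously: until at least one service had registered.
    # Now: until every expected virtualised compute host has registered.
    until: >-
      (nova_compute_services_result.stdout | from_json) | length
      >= (groups['compute'] | length)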
Change-Id: I2915e2610e5c0b8d67412e7ec77f7575b8fe9921
Closes-Bug: #1835002
Currently, we have a lot of logic for checking if a handler should run,
depending on whether config files have changed and whether the
container configuration has changed. As rm_work pointed out during
the recent haproxy refactor, these conditionals are typically
unnecessary - we can rely on Ansible's handler notification system
to only trigger handlers when they need to run. This removes a lot
of error-prone code.
This patch removes conditional handler logic for all services. It is
important to ensure that we no longer trigger handlers unnecessarily,
because without these checks in place any notification will trigger a
restart of the containers.
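A hedged illustration of the pattern we now rely on (generic service
and file names, not the literal kolla-ansible tasks):

  # tasks/config.yml - the notification is recorded only when this task
  # reports 'changed':
  - name: Copying over example.conf
    become: true
    template:
      src: example.conf.j2
      dest: "{{ node_config_directory }}/example/example.conf"
      mode: "0660"
    notify:
      - Restart example container

  # handlers/main.yml - no extra 'when' comparing registered config or
  # container check results; the handler runs only if it was notified:
  - name: Restart example container
    become: true
    kolla_docker:
      action: recreate_or_restart_container
      name: example
      image: "{{ example_image_full }}"  # illustrative variable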
Implements: blueprint simplify-handlers
Change-Id: I4f1aa03e9a9faaf8aecd556dfeafdb834042e4cd
During an upgrade, nova pins the version of RPC calls to the minimum
seen across all services. This ensures that old services do not receive
data they cannot handle. After the upgrade is complete, all nova
services are supposed to be reloaded via SIGHUP to cause them to check
the RPC versions of services again and use the latest version, which
should now be supported by all running services.
Due to a bug [1] in oslo.service, sending services SIGHUP is currently
broken. We replaced the HUP with a restart for the nova_compute
container for bug 1821362, but not other nova services. It seems we need
to restart all nova services to allow the RPC version pin to be removed.
Testing in a Queens to Rocky upgrade, we find the following in the logs:
Automatically selected compute RPC version 5.0 from minimum service
version 30
However, the service version in Rocky is 35.
There is a second issue in that it takes some time for the upgraded
services to update the nova services database table with their new
version. We need to wait until all nova-compute services have done this
before the restart is performed, otherwise the RPC version cap will
remain in place. There is currently no interface in nova available for
checking these versions [2], so as a workaround we use a configurable
delay with a default duration of 30 seconds. Testing showed it takes
about 10 seconds for the version to be updated, so this gives us some
headroom.
This change restarts all nova services after an upgrade, after a 30
second delay.
[1] https://bugs.launchpad.net/oslo.service/+bug/1715374
[2] https://bugs.launchpad.net/nova/+bug/1833542
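A hedged sketch of the post-upgrade step (the delay variable name is
hypothetical; the 30 second default is the one described above):

  - name: Wait for nova-compute services to update their service version
    pause:
      seconds: "{{ nova_services_post_upgrade_delay | default(30) }}"  # hypothetical variable
    run_once: true

  - name: Restart nova services to unpin the RPC version
    become: true
    kolla_docker:
      action: restart_container
      name: "{{ item }}"
    loop:
      - nova_api
      - nova_scheduler
      - nova_conductor
      - nova_compute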
Change-Id: Ia6fc9011ee6f5461f40a1307b72709d769814a79
Closes-Bug: #1833069
Related-Bug: #1833542
They are used only to obtain keys for the next task.
Change-Id: I2fac22af4710b70e4df8e3a272bcfb6cc8b8532e
Signed-off-by: Radosław Piliszek <radoslaw.piliszek@gmail.com>
In rare cases, both kolla-ansible and nova-scheduler try to perform
the mapping at the same time, and one of them fails.
Since kolla-ansible runs host discovery on each deployment,
there is no need to change the default of no periodic host discovery.
I added some notes for the future. They are not critical.
I made the decision explicit in the comments.
I changed the task name to satisfy recommendations.
I removed the variable because it is not used (to avoid future doubts).
Closes-Bug: #1832987
Change-Id: I3128472f028a2dbd7ace02abc179a9629ad74ceb
Signed-off-by: Radosław Piliszek <radoslaw.piliszek@gmail.com>
Many tasks that use Docker already specify 'become', but not all.
This change ensures 'become' is specified for all tasks that use the
following modules:
* kolla_docker
* kolla_ceph_keyring
* kolla_toolbox
* kolla_container_facts
It also adds 'become' for 'command' tasks that use the docker CLI.
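For example (an illustrative task; the docker CLI requires root or
membership in the docker group):

  - name: Check nova containers
    become: true
    command: docker ps -a --filter name=nova
    register: nova_containers
    changed_when: false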
Change-Id: I4a5ebcedaccb9261dbc958ec67e8077d7980e496
Check if a base Nova cell already exists before calling `nova-manage
cell_v2 create_cell`, which would otherwise create a duplicate cell when
the transport URL or database connection change.
If a base cell already exists but the connection values have changed, we
now call `nova-manage cell_v2 update_cell` instead. This is only
possible if a duplicate cell has not yet been created. If one already
exists, we print a warning inviting the operator to perform a manual
cleanup. We don't use a hard fail to avoid an abrupt change of behavior
if this is backported to stable branches.
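A hedged sketch of the decision flow (the real implementation parses
the list_cells output for the existing cell's UUID, transport URL and
database connection; 'base_cell_uuid' is a hypothetical fact derived
from that output):

  - name: List existing cells
    become: true
    command: docker exec nova_api nova-manage cell_v2 list_cells --verbose
    register: existing_cells
    changed_when: false
    run_once: true

  - name: Create the base cell
    become: true
    command: docker exec nova_api nova-manage cell_v2 create_cell
    when: base_cell_uuid is not defined  # no matching cell was found
    run_once: true

  - name: Update the base cell
    become: true
    command: >-
      docker exec nova_api nova-manage cell_v2 update_cell
      --cell_uuid {{ base_cell_uuid }}
    when: base_cell_uuid is defined  # plus a check that the settings changed
    run_once: true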
Change-Id: I7841ce0cff08e315fd7761d84e1e681b1a00d43e
Closes-Bug: #1734872
When Ansible loops over items, by default it prints all the keys of the
item it is looping over. Some roles, when setting up the databases,
iterate over an object that includes the database password.
Override the loop label to hide everything but the database name.
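For example, with loop_control (the loop variable is illustrative, and
the real tasks delegate to kolla_toolbox rather than using mysql_user
directly):

  - name: Creating Nova databases user and setting permissions
    mysql_user:
      name: "{{ item.database_username }}"
      password: "{{ item.database_password }}"
      priv: "{{ item.database_name }}.*:ALL"
    loop: "{{ nova_databases }}"
    loop_control:
      label: "{{ item.database_name }}"  # only the database name is logged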
Change-Id: I336a81a5ecd824ace7d40e9a35942a1c853554cd
In a multi-region environment, each region is being deployed separately.
Cell discovery, however, would sometimes fail due to it picking a region
different than the one being deployed. Most likely, an internal endpoint
for region A will not be visible from region B. Furthermore, it is not
very useful to discover hosts on a region you're not modifying.
This changes the check to only run against nova compute services located
in the region being deployed.
Change-Id: I21eb1164c2f67098b81edbd5cc106472663b92cb
Several config file permissions are incorrect on the host. In general,
files should be 0660, and directories and executables 0770.
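For example, in a role's config tasks (paths and names are
illustrative):

  - name: Ensuring config directories exist
    become: true
    file:
      path: "{{ node_config_directory }}/example"
      state: directory
      mode: "0770"

  - name: Copying over example.conf
    become: true
    template:
      src: example.conf.j2
      dest: "{{ node_config_directory }}/example/example.conf"
      mode: "0660"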
Change-Id: Id276ac1864f280554e98b937f2845bb424d521de
Closes-Bug: #1821579
After upgrading from Rocky to Stein, nova-compute services fail to start
new instances with the following error message:
Failed to allocate the network(s), not rescheduling.
Looking in the nova-compute logs, we also see this:
Neutron Reported failure on event
network-vif-plugged-60c05a0d-8758-44c9-81e4-754551567be5 for instance
32c493c4-d88c-4f14-98db-c7af64bf3324: NovaException: In shutdown, no new
events can be scheduled
During the upgrade process, we send nova containers a SIGHUP to cause
them to reload their object version state. Speaking to the nova team in
IRC, there is a known issue with this, caused by oslo.service performing
a full shutdown in response to a SIGHUP, which breaks nova-compute.
There is a patch [1] in review to address this.
The workaround employed here is to restart the nova compute service.
[1] https://review.openstack.org/#/c/641907
Change-Id: Ia4fcc558a3f62ced2d629d7a22d0bc1eb6b879f1
Closes-Bug: #1821362
When Nova, Glance, or Cinder are deployed alongside an external Ceph
deployment, handlers will fail to trigger if keyring files are updated,
which results in the containers not being restarted.
This change adds the missing 'when' conditions for nova-libvirt, nova-compute,
cinder-volume, cinder-backup, and glance-api containers.
Change-Id: I8e183aac9a72e7a7210f7edc7cdcbaedd4fbcaa9
When adding the rolling upgrade support, some upgrade procedures were
modified to pull images explicitly. This is done inconsistently between
services, and is a change in behaviour from Rocky and earlier releases.
This change removes all image pulling from upgrade tasks.
Change-Id: Id0fed17714235e1daed60b83b1f30620f097eb97
This allows nova service endpoints to use custom hostnames, and adds the
following variables:
* nova_internal_fqdn
* nova_external_fqdn
* placement_internal_fqdn
* placement_external_fqdn
* nova_novncproxy_fqdn
* nova_spicehtml5proxy_fqdn
* nova_serialproxy_fqdn
These default to the old values of kolla_internal_fqdn or
kolla_external_fqdn.
This also adds the following variables:
* nova_api_listen_port
* nova_metadata_listen_port
* nova_novncproxy_listen_port
* nova_spicehtml5proxy_listen_port
* nova_serialproxy_listen_port
* placement_api_listen_port
These default to <service>_port, e.g. nova_api_port, for backward
compatibility.
These options allow the user to differentiate between the port the
service listens on, and the port the service is reachable on. This is
useful for external load balancers which live on the same host as the
service itself.
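A hedged example of how these might be combined in globals.yml (host
names and port values are illustrative):

  # Expose nova on dedicated hostnames instead of kolla_internal_fqdn /
  # kolla_external_fqdn:
  nova_internal_fqdn: "nova.int.example.com"
  nova_external_fqdn: "nova.example.com"

  # Bind nova-api to a different port from the one the load balancer
  # exposes (nova_api_port), e.g. when haproxy runs on the same host:
  nova_api_listen_port: 18774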
Change-Id: I7bcce56a2138eeadcabac79dd07c8dba1c5af644
Implements: blueprint service-hostnames
Nova services may reasonably expect cell databases to exist when they
start. The current cell setup tasks in kolla run after the nova
containers have started, meaning that cells may or may not exist in the
database when they start, depending on timing. In particular, we are
seeing issues in kolla CI currently with jobs timing out waiting for
nova compute services to start. The following error is seen in the nova
logs of these jobs, which may or may not be relevant:
No cells are configured, unable to continue
This change creates the cell0 and cell1 databases prior to starting nova
services.
In order to do this, we must create new containers in which to run the
nova-manage commands, because the nova-api container may not yet exist.
This required adding support to the kolla_docker module for specifying a
command for the container to run that overrides the image's command.
We also add the standard output and error to the module's result when a
non-detached container is run. A secondary benefit of this is that the
output of bootstrap containers is now displayed in the Ansible output if
the bootstrapping command fails, which will help with debugging.
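A hedged sketch of a bootstrap task using the new 'command' override
(image variable and exact command are illustrative; the real tasks also
pass volumes, labels and environment):

  - name: Create cell0 mappings
    become: true
    kolla_docker:
      action: start_container
      name: nova_api_map_cell0
      image: "{{ nova_api_image_full }}"
      command: bash -c 'nova-manage cell_v2 map_cell0'
      detach: false
      restart_policy: "no"
    run_once: true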
Change-Id: I2c1e991064f9f588f398ccbabda94f69dc285e61
Closes-Bug: #1808575
With this change, an operator can stop a service's containers without
stopping all services on a host.
This change is the starting point for fast-forward upgrade support.
In subsequent changes, new flags will be introduced to avoid stopping
data plane services during upgrades.
Change-Id: Ifde7a39d7d8596ef0d7405ecf1ac1d49a459d9ef
Implements: blueprint support-stop-containers
At the moment, the "databases user and setting permissions" tasks for
designate and nova leak the database_password because of the use
of with_items:
---snip---
TASK [nova : Creating Nova databases user and setting permissions] *********************************************************
ok: [x -> y] => (item={u'database_password': u'password', u'database_name': u'nova', u'database_username': u'nova'})
ok: [x -> y] => (item={u'database_password': u'password', u'database_name': u'nova_cell0', u'database_username': u'nova'})
ok: [x -> y] => (item={u'database_password': u'password', u'database_name': u'nova_api', u'database_username': u'nova_api'})
---snap---
Change-Id: I141e4153223c8772c82a31d81e58057ce266c0b9
Co-authored-by: Bernd Müller <mueller@b1-systems.de>
This adds two new parameters (migration_interface,
migration_interface_address) to make the use of a dedicated migration
network possible.
Change-Id: I723c9bea9cf1881e02ba39d5318c090960c22c47
If upgrading the nova, cinder or manila services via 'kolla-ansible
upgrade', the Ceph config files are not generated. Users will expect
that these files are generated, to pull in any changes from their
configuration or the base kolla configuration.
This change moves Ceph tasks inside config.yml to ensure that they are
performed during deploy, reconfigure and upgrade. This has been done for
nova, cinder, gnocchi and manila - glance already does this.
Change-Id: Ic75692c2bcba9b81dee922ff6fbbccd160e7fa19
Closes-Bug: #1794275
Kolla-ansible provides dev mode support for some OpenStack projects,
but there are still some projects that do not yet support a specific
release tag. This patch implements this function for those projects.
Change-Id: I917b27dd61295b542457a21b240afe2cd4e83e58
Various ceph-related tasks were missing a 'become' that would allow them
to work as a non-root user. This seems to only cause a problem after an
initial deployment, perhaps due to the recursive ownership & permissions
changes at the end of the ceph.yml and external_ceph.yml files.
This change adds the necessary becomes.
Change-Id: I887c7b3bdef49db1dd1bf9e5bdbf5dc47b7f41af
Closes-Bug: #1795125
Having all services in one giant haproxy file makes altering
configuration for a service both painful and dangerous. Each service
should be configured with a simple set of variables and rendered with a
single unified template.
Two new templates are available:
* haproxy_single_service_listen.cfg.j2: close to the original style, but
only one service per file
* haproxy_single_service_split.cfg.j2: using the newer haproxy syntax
for separated frontend and backend
For now the default will be the single listen block, for ease of
transition.
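A heavily hedged sketch of rendering one service with the new listen
template (the variables passed to the template and the destination
path are illustrative; only the template file names come from this
change):

  - name: Configure haproxy for nova-api
    become: true
    template:
      src: haproxy_single_service_listen.cfg.j2
      dest: /etc/kolla/haproxy/services.d/nova-api.cfg  # illustrative path
    vars:
      service_name: nova-api
      service_port: "{{ nova_api_port }}"
      service_hosts: "{{ groups['nova-api'] }}"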
Change-Id: I6e237438fbc0aa3c89a3c8bd706a53b74e71904b
When creating the Nova databases user and setting permissions, there
is no need to register database_user_create. It is used nowhere, so
removing it is safe.
Change-Id: If456b7c2ed25aa729be7d98ef875230c66581d65
Add the possibility to mount sources as volumes to containers, in a
"more than documentation" way. That will let us use kolla as a
replacement for devstack.
Partially implements: blueprint mount-sources
Co-Authored-By: zhulingjie <easyzlj@gmail.com>
Change-Id: I10677e5ad22f2107a0657feeeaf32287ab9f8e28
With more recent versions of ansible, we should now use "is" instead
of "|" for tests.
This change updates that usage.
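For example:

  # Old, filter-style test syntax (deprecated):
  #   when: result | changed
  # New test syntax:
  - name: Act only when the previous task changed something
    debug:
      msg: "the previous task reported a change"
    when: result is changed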
Change-Id: I6fba56fca182349972e8b0ee5452b37aa4090e0c