[ops-guide] Cleanup maintenance chapter

Change-Id: I421caf2a12ab192d4df6d5c197e2c5dfb1c9c9bb
Implements: blueprint ops-guide-rst

doc/ops-guide/source/ops_maintenance_complete.rst (new file)

===========================
Handling a Complete Failure
===========================

A common way of dealing with the recovery from a full system failure,
such as a power outage of a data center, is to assign each service a
priority, and restore in order.
:ref:`table_example_priority` shows an example.

.. _table_example_priority:

.. list-table:: Table. Example service restoration priority list
   :header-rows: 1

   * - Priority
     - Services
   * - 1
     - Internal network connectivity
   * - 2
     - Backing storage services
   * - 3
     - Public network connectivity for user virtual machines
   * - 4
     - ``nova-compute``, ``nova-network``, cinder hosts
   * - 5
     - User virtual machines
   * - 10
     - Message queue and database services
   * - 15
     - Keystone services
   * - 20
     - ``cinder-scheduler``
   * - 21
     - Image Catalog and Delivery services
   * - 22
     - ``nova-scheduler`` services
   * - 98
     - ``cinder-api``
   * - 99
     - ``nova-api`` services
   * - 100
     - Dashboard node

Use this example priority list to ensure that user-affected services are
restored as soon as possible, but not before a stable environment is in
place. Of course, despite being listed as a single-line item, each step
requires significant work. For example, just after starting the
database, you should check its integrity, and, after starting the nova
services, you should verify that the hypervisor matches the database and
fix any mismatches.

doc/ops-guide/source/ops_maintenance_compute.rst (new file)

=====================================
Compute Node Failures and Maintenance
=====================================

Sometimes a compute node either crashes unexpectedly or requires a
reboot for maintenance reasons.

Planned Maintenance
~~~~~~~~~~~~~~~~~~~

If you need to reboot a compute node due to planned maintenance (such as
a software or hardware upgrade), first ensure that all hosted instances
have been moved off the node. If your cloud is utilizing shared storage,
use the :command:`nova live-migration` command. First, get a list of instances
that need to be moved:

.. code-block:: console

   # nova list --host c01.example.com --all-tenants

Next, migrate them one by one:

.. code-block:: console

   # nova live-migration <uuid> c02.example.com

If you are not using shared storage, you can use the
:option:`--block-migrate` option:

.. code-block:: console

   # nova live-migration --block-migrate <uuid> c02.example.com
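
If the node hosts many instances, migrating them individually can be
tedious. As a rough sketch, assuming the target node has capacity for
all of them, you can extract the UUID column from the ``nova list``
table output with ``awk`` and migrate in a loop:

.. code-block:: console

   # for uuid in $(nova list --host c01.example.com --all-tenants | awk '$2 ~ /-/ {print $2}')
   > do
   >   nova live-migration $uuid c02.example.com
   > done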

After you have migrated all instances, ensure that the ``nova-compute``
service has stopped:

.. code-block:: console

   # stop nova-compute

If you use a configuration-management system, such as Puppet, that
ensures the ``nova-compute`` service is always running, you can
temporarily move the ``init`` files:

.. code-block:: console

   # mkdir /root/tmp
   # mv /etc/init/nova-compute.conf /root/tmp
   # mv /etc/init.d/nova-compute /root/tmp

Next, shut down your compute node, perform your maintenance, and turn
the node back on. You can reenable the ``nova-compute`` service by
undoing the previous commands:

.. code-block:: console

   # mv /root/tmp/nova-compute.conf /etc/init
   # mv /root/tmp/nova-compute /etc/init.d/

Then start the ``nova-compute`` service:

.. code-block:: console

   # start nova-compute

You can now optionally migrate the instances back to their original
compute node.

After a Compute Node Reboots
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When you reboot a compute node, first verify that it booted
successfully. This includes ensuring that the ``nova-compute`` service
is running:

.. code-block:: console

   # ps aux | grep nova-compute
   # status nova-compute

Also ensure that it has successfully connected to the AMQP server:

.. code-block:: console

   # grep AMQP /var/log/nova/nova-compute.log
   2013-02-26 09:51:31 12427 INFO nova.openstack.common.rpc.common [-] Connected to AMQP server on 199.116.232.36:5672

After the compute node is successfully running, you must deal with the
instances that are hosted on that compute node because none of them are
running. Depending on your SLA with your users or customers, you might
have to start each instance and ensure that they start correctly.

Instances
~~~~~~~~~

You can create a list of instances that are hosted on the compute node
by performing the following command:

.. code-block:: console

   # nova list --host c01.example.com --all-tenants

After you have the list, you can use the :command:`nova` command to start each
instance:

.. code-block:: console

   # nova reboot <uuid>
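
If many instances need to be started, a short loop saves typing. This is
a sketch under the same assumption as the migration loop above (UUIDs
extracted from the ``nova list`` table output with ``awk``):

.. code-block:: console

   # for uuid in $(nova list --host c01.example.com --all-tenants | awk '$2 ~ /-/ {print $2}')
   > do
   >   nova reboot $uuid
   > done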

.. note::

   Any time an instance shuts down unexpectedly, it might have problems
   on boot. For example, the instance might require an ``fsck`` on the
   root partition. If this happens, the user can use the dashboard VNC
   console to fix this.

If an instance does not boot, meaning ``virsh list`` never shows the
instance as even attempting to boot, do the following on the compute
node:

.. code-block:: console

   # tail -f /var/log/nova/nova-compute.log

Try executing the :command:`nova reboot` command again. You should see an
error message about why the instance was not able to boot.

In most cases, the error is the result of something in libvirt's XML
file (``/etc/libvirt/qemu/instance-xxxxxxxx.xml``) that no longer
exists. You can enforce re-creation of the XML file as well as rebooting
the instance by running the following command:

.. code-block:: console

   # nova reboot --hard <uuid>

Inspecting and Recovering Data from Failed Instances
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In some scenarios, instances are running but are inaccessible through
SSH and do not respond to any command. The VNC console could be
displaying a boot failure or kernel panic error messages. This could be
an indication of file system corruption on the VM itself. If you need to
recover files or inspect the content of the instance, qemu-nbd can be
used to mount the disk.

.. warning::

   If you access or view the user's content and data, get approval first!

To access the instance's disk
(``/var/lib/nova/instances/instance-xxxxxx/disk``), use the following
steps:

#. Suspend the instance using the ``virsh`` command.

#. Connect the qemu-nbd device to the disk.

#. Mount the qemu-nbd device.

#. Unmount the device after inspecting.

#. Disconnect the qemu-nbd device.

#. Resume the instance.

If you do not follow the last three steps, OpenStack Compute cannot
manage the instance any longer. It fails to respond to any command
issued by OpenStack Compute, and it is marked as shut down.

Once you mount the disk file, you should be able to access it and treat
it as a collection of normal directories with files and a directory
structure. However, we do not recommend that you edit or touch any files
because this could change the
:term:`access control lists (ACLs) <access control list>` that are used
to determine which accounts can perform what operations on files and
directories. Changing ACLs can make the instance unbootable if it is not
already.

#. Suspend the instance using the :command:`virsh` command, taking note of the
   internal ID:

   .. code-block:: console

      # virsh list
      Id Name                 State
      ----------------------------------
      1  instance-00000981    running
      2  instance-000009f5    running
      30 instance-0000274a    running

      # virsh suspend 30
      Domain 30 suspended

#. Connect the qemu-nbd device to the disk:

   .. code-block:: console

      # cd /var/lib/nova/instances/instance-0000274a
      # ls -lh
      total 33M
      -rw-rw---- 1 libvirt-qemu kvm  6.3K Oct 15 11:31 console.log
      -rw-r--r-- 1 libvirt-qemu kvm   33M Oct 15 22:06 disk
      -rw-r--r-- 1 libvirt-qemu kvm  384K Oct 15 22:06 disk.local
      -rw-rw-r-- 1 nova         nova 1.7K Oct 15 11:30 libvirt.xml
      # qemu-nbd -c /dev/nbd0 `pwd`/disk

#. Mount the qemu-nbd device.

   The qemu-nbd device tries to export the instance disk's different
   partitions as separate devices. For example, if vda is the disk and
   vda1 is the root partition, qemu-nbd exports the device as
   ``/dev/nbd0`` and ``/dev/nbd0p1``, respectively:

   .. code-block:: console

      # mount /dev/nbd0p1 /mnt/

   You can now access the contents of ``/mnt``, which correspond to the
   first partition of the instance's disk.

   To examine the secondary or ephemeral disk, use an alternate mount
   point if you want both primary and secondary drives mounted at the
   same time:

   .. code-block:: console

      # umount /mnt
      # qemu-nbd -c /dev/nbd1 `pwd`/disk.local
      # mount /dev/nbd1 /mnt/
      # ls -lh /mnt/
      total 76K
      lrwxrwxrwx.  1 root root    7 Oct 15 00:44 bin -> usr/bin
      dr-xr-xr-x.  4 root root 4.0K Oct 15 01:07 boot
      drwxr-xr-x.  2 root root 4.0K Oct 15 00:42 dev
      drwxr-xr-x. 70 root root 4.0K Oct 15 11:31 etc
      drwxr-xr-x.  3 root root 4.0K Oct 15 01:07 home
      lrwxrwxrwx.  1 root root    7 Oct 15 00:44 lib -> usr/lib
      lrwxrwxrwx.  1 root root    9 Oct 15 00:44 lib64 -> usr/lib64
      drwx------.  2 root root  16K Oct 15 00:42 lost+found
      drwxr-xr-x.  2 root root 4.0K Feb  3  2012 media
      drwxr-xr-x.  2 root root 4.0K Feb  3  2012 mnt
      drwxr-xr-x.  2 root root 4.0K Feb  3  2012 opt
      drwxr-xr-x.  2 root root 4.0K Oct 15 00:42 proc
      dr-xr-x---.  3 root root 4.0K Oct 15 21:56 root
      drwxr-xr-x. 14 root root 4.0K Oct 15 01:07 run
      lrwxrwxrwx.  1 root root    8 Oct 15 00:44 sbin -> usr/sbin
      drwxr-xr-x.  2 root root 4.0K Feb  3  2012 srv
      drwxr-xr-x.  2 root root 4.0K Oct 15 00:42 sys
      drwxrwxrwt.  9 root root 4.0K Oct 15 16:29 tmp
      drwxr-xr-x. 13 root root 4.0K Oct 15 00:44 usr
      drwxr-xr-x. 17 root root 4.0K Oct 15 00:44 var

#. Once you have completed the inspection, unmount the mount point and
   release the qemu-nbd device:

   .. code-block:: console

      # umount /mnt
      # qemu-nbd -d /dev/nbd0
      /dev/nbd0 disconnected

#. Resume the instance using :command:`virsh`:

   .. code-block:: console

      # virsh list
      Id Name                 State
      ----------------------------------
      1  instance-00000981    running
      2  instance-000009f5    running
      30 instance-0000274a    paused

      # virsh resume 30
      Domain 30 resumed

.. _volumes:

Volumes
~~~~~~~

If the affected instances also had attached volumes, first generate a
list of instance and volume UUIDs:

.. code-block:: mysql

   mysql> select nova.instances.uuid as instance_uuid,
          cinder.volumes.id as volume_uuid, cinder.volumes.status,
          cinder.volumes.attach_status, cinder.volumes.mountpoint,
          cinder.volumes.display_name from cinder.volumes
          inner join nova.instances on cinder.volumes.instance_uuid=nova.instances.uuid
          where nova.instances.host = 'c01.example.com';

You should see a result similar to the following:

.. code-block:: mysql

   +--------------+------------+-------+--------------+-----------+--------------+
   |instance_uuid |volume_uuid |status |attach_status |mountpoint | display_name |
   +--------------+------------+-------+--------------+-----------+--------------+
   |9b969a05      |1f0fbf36    |in-use |attached      |/dev/vdc   | test         |
   +--------------+------------+-------+--------------+-----------+--------------+
   1 row in set (0.00 sec)

Next, manually detach and reattach the volumes, where X is the proper
mount point:

.. code-block:: console

   # nova volume-detach <instance_uuid> <volume_uuid>
   # nova volume-attach <instance_uuid> <volume_uuid> /dev/vdX

Be sure that the instance has successfully booted and is at a login
screen before doing the above.

Total Compute Node Failure
~~~~~~~~~~~~~~~~~~~~~~~~~~

Compute nodes can fail the same way a cloud controller can fail. A
motherboard failure or some other type of hardware failure can cause an
entire compute node to go offline. When this happens, all instances
running on that compute node will not be available. Just like with a
cloud controller failure, if your infrastructure monitoring does not
detect a failed compute node, your users will notify you because of
their lost instances.

If a compute node fails and won't be fixed for a few hours (or at all),
you can relaunch all instances that are hosted on the failed node if you
use shared storage for ``/var/lib/nova/instances``.

To do this, generate a list of instance UUIDs that are hosted on the
failed node by running the following query on the nova database:

.. code-block:: mysql

   mysql> select uuid from instances
          where host = 'c01.example.com' and deleted = 0;

Next, update the nova database to indicate that all instances that used
to be hosted on c01.example.com are now hosted on c02.example.com:

.. code-block:: mysql

   mysql> update instances set host = 'c02.example.com'
          where host = 'c01.example.com' and deleted = 0;

If you're using the Networking service ML2 plug-in, update the
Networking service database to indicate that all ports that used to be
hosted on c01.example.com are now hosted on c02.example.com:

.. code-block:: mysql

   mysql> update ml2_port_bindings set host = 'c02.example.com'
          where host = 'c01.example.com';
   mysql> update ml2_port_binding_levels set host = 'c02.example.com'
          where host = 'c01.example.com';

After that, use the :command:`nova` command to reboot all instances that were
on c01.example.com while regenerating their XML files at the same time:

.. code-block:: console

   # nova reboot --hard <uuid>
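
As a sketch, the UUIDs from the earlier query can be fed straight into
the reboot loop. This assumes the ``mysql`` client on the host you run
it from can authenticate without prompting; adjust credentials as
needed:

.. code-block:: console

   # mysql -sN -e "select uuid from nova.instances where host = 'c02.example.com' and deleted = 0" | \
     while read uuid; do nova reboot --hard $uuid; done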

Finally, reattach volumes using the same method described in the section
:ref:`volumes`.

/var/lib/nova/instances
~~~~~~~~~~~~~~~~~~~~~~~

It's worth mentioning this directory in the context of failed compute
nodes. This directory contains the libvirt KVM file-based disk images
for the instances that are hosted on that compute node. If you are not
running your cloud in a shared storage environment, this directory is
unique across all compute nodes.

``/var/lib/nova/instances`` contains two types of directories.

The first is the ``_base`` directory. This contains all the cached base
images from glance for each unique image that has been launched on that
compute node. Files ending in ``_20`` (or a different number) are the
ephemeral base images.

The other directories are titled ``instance-xxxxxxxx``. These
directories correspond to instances running on that compute node. The
files inside are related to one of the files in the ``_base`` directory.
They're essentially differential-based files containing only the changes
made from the original ``_base`` directory.
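
You can see this relationship with :command:`qemu-img info`; the paths
and hashed file name below are illustrative, and the exact output varies
by environment:

.. code-block:: console

   # qemu-img info /var/lib/nova/instances/instance-0000274a/disk
   image: /var/lib/nova/instances/instance-0000274a/disk
   file format: qcow2
   virtual size: 10G (10737418240 bytes)
   disk size: 33M
   backing file: /var/lib/nova/instances/_base/77de68daecd823babbb58edb1c8e14d7106e83bb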

All files and directories in ``/var/lib/nova/instances`` are uniquely
named. The files in ``_base`` are uniquely titled for the glance image that
they are based on, and the directory names ``instance-xxxxxxxx`` are
uniquely titled for that particular instance. For example, if you copy
all data from ``/var/lib/nova/instances`` on one compute node to
another, you do not overwrite any files or cause any damage to images
that have the same unique name, because they are essentially the same
file.

Although this method is not documented or supported, you can use it when
your compute node is permanently offline but you have instances locally
stored on it.

doc/ops-guide/source/ops_maintenance_configuration.rst (new file)

========================
Configuration Management
========================

Maintaining an OpenStack cloud requires that you manage multiple
physical servers, and this number might grow over time. Because managing
nodes manually is error prone, we strongly recommend that you use a
configuration-management tool. These tools automate the process of
ensuring that all your nodes are configured properly and encourage you
to maintain your configuration information (such as packages and
configuration options) in a version-controlled repository.

.. note::

   Several configuration-management tools are available, and this guide
   does not recommend a specific one. The two most popular ones in the
   OpenStack community are `Puppet <https://puppetlabs.com/>`_, with
   available `OpenStack Puppet modules
   <https://github.com/puppetlabs/puppetlabs-openstack>`_; and
   `Chef <http://www.getchef.com/chef/>`_, with available `OpenStack
   Chef recipes <https://github.com/opscode/openstack-chef-repo>`_.
   Other newer configuration tools include
   `Juju <https://juju.ubuntu.com/>`_,
   `Ansible <https://www.ansible.com/>`_, and
   `Salt <http://www.saltstack.com/>`_; and more mature configuration
   management tools include `CFEngine <http://cfengine.com/>`_ and
   `Bcfg2 <http://bcfg2.org/>`_.

doc/ops-guide/source/ops_maintenance_controller.rst (new file)

===========================================================
Cloud Controller and Storage Proxy Failures and Maintenance
===========================================================

The cloud controller and storage proxy are very similar to each other
when it comes to expected and unexpected downtime. One of each server
type typically runs in the cloud, which makes them very noticeable when
they are not running.

For the cloud controller, the good news is if your cloud is using the
FlatDHCP multi-host HA network mode, existing instances and volumes
continue to operate while the cloud controller is offline. For the
storage proxy, however, no storage traffic is possible until it is back
up and running.

Planned Maintenance
~~~~~~~~~~~~~~~~~~~

One way to plan for cloud controller or storage proxy maintenance is to
simply do it off-hours, such as at 1 a.m. or 2 a.m. This strategy
affects fewer users. If your cloud controller or storage proxy is too
important to have unavailable at any point in time, you must look into
high-availability options.

Rebooting a Cloud Controller or Storage Proxy
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All in all, just issue the :command:`reboot` command. The operating system
cleanly shuts down services and then automatically reboots. If you want
to be very thorough, run your backup jobs just before you reboot.

After a cloud controller reboots, ensure that all required services were
successfully started. The following commands use :command:`ps` and
:command:`grep` to determine if nova, glance, keystone, and cinder are
currently running:

.. code-block:: console

   # ps aux | grep nova-
   # ps aux | grep glance-
   # ps aux | grep keystone
   # ps aux | grep cinder

Also check that all services are functioning. The following set of
commands sources the ``openrc`` file, then runs some basic glance, nova,
and openstack commands. If the commands work as expected, you can be
confident that those services are in working condition:

.. code-block:: console

   # source openrc
   # glance index
   # nova list
   # openstack project list

For the storage proxy, ensure that the :term:`Object Storage service` has
resumed:

.. code-block:: console

   # ps aux | grep swift

Also check that it is functioning:

.. code-block:: console

   # swift stat

Total Cloud Controller Failure
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The cloud controller could completely fail if, for example, its
motherboard goes bad. Users will immediately notice the loss of a cloud
controller since it provides core functionality to your cloud
environment. If your infrastructure monitoring does not alert you that
your cloud controller has failed, your users definitely will.
Unfortunately, this is a rough situation. The cloud controller is an
integral part of your cloud. If you have only one controller, you will
have many missing services if it goes down.

To avoid this situation, create a highly available cloud controller
cluster. This is outside the scope of this document, but you can read
more in the `OpenStack High Availability
Guide <http://docs.openstack.org/ha-guide/index.html>`_.

The next best approach is to use a configuration-management tool, such
as Puppet, to automatically build a cloud controller. This should not
take more than 15 minutes if you have a spare server available. After
the controller rebuilds, restore any backups taken
(see :doc:`ops_backup_recovery`).

Also, in practice, the ``nova-compute`` services on the compute nodes do
not always reconnect cleanly to rabbitmq hosted on the controller when
it comes back up after a long reboot; a restart of the nova services on
the compute nodes is required.
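
As a minimal sketch, assuming Upstart-managed services as elsewhere in
this chapter, that restart on each compute node looks like:

.. code-block:: console

   # restart nova-compute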

doc/ops-guide/source/ops_maintenance_database.rst (new file)

=========
Databases
=========

Almost all OpenStack components have an underlying database to store
persistent information. Usually this database is MySQL. Normal MySQL
administration is applicable to these databases. OpenStack does not
configure the databases out of the ordinary. Basic administration
includes performance tweaking, high availability, backup, recovery, and
repairing. For more information, see a standard MySQL administration guide.

You can perform a couple of tricks with the database to either more
quickly retrieve information or fix a data inconsistency error, for
example, an instance was terminated, but the status was not updated in
the database. These tricks are discussed throughout this book.

Database Connectivity
~~~~~~~~~~~~~~~~~~~~~

Review the component's configuration file to see how each OpenStack
component accesses its corresponding database. Look for either
``sql_connection`` or simply ``connection``. The following command uses
``grep`` to display the SQL connection string for nova, glance, cinder,
and keystone:

.. code-block:: console

   # grep -hE "connection ?=" /etc/nova/nova.conf /etc/glance/glance-*.conf \
     /etc/cinder/cinder.conf /etc/keystone/keystone.conf
   sql_connection = mysql+pymysql://nova:nova@cloud.alberta.sandbox.cybera.ca/nova
   sql_connection = mysql+pymysql://glance:password@cloud.example.com/glance
   sql_connection = mysql+pymysql://glance:password@cloud.example.com/glance
   sql_connection = mysql+pymysql://cinder:password@cloud.example.com/cinder
   connection = mysql+pymysql://keystone_admin:password@cloud.example.com/keystone

The connection strings take this format:

.. code-block:: console

   mysql+pymysql://<username>:<password>@<hostname>/<database name>
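
If you suspect a connectivity problem, the same credentials can be used
to test the connection directly with the ``mysql`` client; the host and
database names below are taken from the example output above and are
illustrative:

.. code-block:: console

   # mysql -u cinder -p -h cloud.example.com cinder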

Performance and Optimization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

As your cloud grows, MySQL is utilized more and more. If you suspect
that MySQL might be becoming a bottleneck, you should start researching
MySQL optimization. The MySQL manual has an entire section dedicated to
this topic: `Optimization Overview
<http://dev.mysql.com/doc/refman/5.5/en/optimize-overview.html>`_.

doc/ops-guide/source/ops_maintenance_determine.rst (new file)

=====================================
Determining Which Component Is Broken
=====================================

OpenStack's collection of different components interact with each other
strongly. For example, uploading an image requires interaction from
``nova-api``, ``glance-api``, ``glance-registry``, keystone, and
potentially ``swift-proxy``. As a result, it is sometimes difficult to
determine exactly where problems lie. Assisting in this is the purpose
of this section.

Tailing Logs
~~~~~~~~~~~~

The first place to look is the log file related to the command you are
trying to run. For example, if ``nova list`` is failing, try tailing a
nova log file and running the command again:

Terminal 1:

.. code-block:: console

   # tail -f /var/log/nova/nova-api.log

Terminal 2:

.. code-block:: console

   # nova list

Look for any errors or traces in the log file. For more information, see
:doc:`ops_logging_monitoring`.

If the error indicates that the problem is with another component,
switch to tailing that component's log file. For example, if nova cannot
access glance, look at the ``glance-api`` log:

Terminal 1:

.. code-block:: console

   # tail -f /var/log/glance/api.log

Terminal 2:

.. code-block:: console

   # nova list

Wash, rinse, and repeat until you find the core cause of the problem.

Running Daemons on the CLI
~~~~~~~~~~~~~~~~~~~~~~~~~~

Unfortunately, sometimes the error is not apparent from the log files.
In this case, switch tactics and use a different command; maybe run the
service directly on the command line. For example, if the ``glance-api``
service refuses to start and stay running, try launching the daemon from
the command line:

.. code-block:: console

   # sudo -u glance -H glance-api

This might print the error and cause of the problem.

.. note::

   The ``-H`` flag is required when running the daemons with sudo
   because some daemons will write files relative to the user's home
   directory, and this write may fail if ``-H`` is left off.

.. tip::

   **Example of Complexity**

   One morning, a compute node failed to run any instances. The log files
   were a bit vague, claiming that a certain instance was unable to be
   started. This ended up being a red herring because the instance was
   simply the first instance in alphabetical order, so it was the first
   instance that ``nova-compute`` would touch.

   Further troubleshooting showed that libvirt was not running at all.
   This made more sense. If libvirt wasn't running, then no instance
   could be virtualized through KVM. Upon trying to start libvirt, it
   would silently die immediately. The libvirt logs did not explain why.

   Next, the ``libvirtd`` daemon was run on the command line. Finally, a
   helpful error message appeared: it could not connect to d-bus. As
   ridiculous as it sounds, libvirt, and thus ``nova-compute``, relies on
   d-bus, and somehow d-bus crashed. Simply starting d-bus set the entire
   chain back on track, and soon everything was back up and running.

doc/ops-guide/source/ops_maintenance_hardware.rst (new file)

=====================
Working with Hardware
=====================

As with your initial deployment, you should ensure that all hardware is
appropriately burned in before adding it to production. Run software
that uses the hardware to its limits, maxing out RAM, CPU, disk, and
network. Many options are available, and normally double as benchmark
software, so you also get a good idea of the performance of your
system.
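
As one example (the tool choice is an assumption, not a recommendation
from this guide), the ``stress`` utility packaged by most distributions
can load CPU, memory, I/O, and disk simultaneously:

.. code-block:: console

   # stress --cpu 8 --io 4 --vm 2 --vm-bytes 1024M --hdd 2 --timeout 3600s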

Adding a Compute Node
~~~~~~~~~~~~~~~~~~~~~

If you find that you have reached or are reaching the capacity limit of
your computing resources, you should plan to add additional compute
nodes. Adding more nodes is quite easy. The process for adding compute
nodes is the same as when the initial compute nodes were deployed to
your cloud: use an automated deployment system to bootstrap the
bare-metal server with the operating system and then have a
configuration-management system install and configure OpenStack Compute.
Once the Compute service has been installed and configured in the same
way as the other compute nodes, it automatically attaches itself to the
cloud. The cloud controller notices the new node(s) and begins
scheduling instances to launch there.

If your OpenStack Block Storage nodes are separate from your compute
nodes, the same procedure still applies because the same queuing and
polling system is used in both services.

We recommend that you use the same hardware for new compute and block
storage nodes. At the very least, ensure that the CPUs are similar in
the compute nodes to not break live migration.

Adding an Object Storage Node
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Adding a new object storage node is different from adding compute or
block storage nodes. You still want to initially configure the server by
using your automated deployment and configuration-management systems.
After that is done, you need to add the local disks of the object
storage node into the object storage ring. The exact command to do this
is the same command that was used to add the initial disks to the ring.
Simply rerun this command on the object storage proxy server for all
disks on the new object storage node. Once this has been done, rebalance
the ring and copy the resulting ring files to the other storage nodes.
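
As an illustration only, adding a single disk and rebalancing might look
like the following; the region, zone, IP address, port, device name, and
weight are all environment specific:

.. code-block:: console

   # swift-ring-builder object.builder add r1z2-10.0.0.10:6000/sdb 100
   # swift-ring-builder object.builder rebalance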

.. note::

   If your new object storage node has a different number of disks than
   the original nodes have, the command to add the new node is
   different from the original commands. These parameters vary from
   environment to environment.

Replacing Components
~~~~~~~~~~~~~~~~~~~~

Failures of hardware are common in large-scale deployments such as an
infrastructure cloud. Consider your processes and balance time saving
against availability. For example, an Object Storage cluster can easily
live with dead disks in it for some period of time if it has sufficient
capacity. Or, if your compute installation is not full, you could
consider live migrating instances off a host with a RAM failure until
you have time to deal with the problem.

doc/ops-guide/source/ops_maintenance_hdmwy.rst (new file)

=====
HDWMY
=====

Here's a quick list of various to-do items for each hour, day, week,
month, and year. Please note that these tasks are neither required nor
definitive, but helpful ideas:

Hourly
~~~~~~

* Check your monitoring system for alerts and act on them.
* Check your ticket queue for new tickets.

Daily
~~~~~

* Check for instances in a failed or weird state and investigate why.
* Check for security patches and apply them as needed.

Weekly
~~~~~~

* Check cloud usage:

  * User quotas
  * Disk space
  * Image usage
  * Large instances
  * Network usage (bandwidth and IP usage)

* Verify your alert mechanisms are still working.

Monthly
~~~~~~~

* Check usage and trends over the past month.
* Check for user accounts that should be removed.
* Check for operator accounts that should be removed.

Quarterly
~~~~~~~~~

* Review usage and trends over the past quarter.
* Prepare any quarterly reports on usage and statistics.
* Review and plan any necessary cloud additions.
* Review and plan any major OpenStack upgrades.

Semiannually
~~~~~~~~~~~~

* Upgrade OpenStack.
* Clean up after an OpenStack upgrade (any unused or new services to be
  aware of?).

doc/ops-guide/source/ops_maintenance_slow.rst (new file)

=========================================
What to do when things are running slowly
=========================================

When you are getting slow responses from various services, it can be
hard to know where to start looking. The first thing to check is the
extent of the slowness: is it specific to a single service, or varied
among different services? If your problem is isolated to a specific
service, it can temporarily be fixed by restarting the service, but that
is often only a fix for the symptom and not the actual problem.

This is a collection of ideas from experienced operators on common
things to look at that may be the cause of slowness. It is not, however,
designed to be an exhaustive list.

OpenStack Identity service
~~~~~~~~~~~~~~~~~~~~~~~~~~

If OpenStack :term:`Identity service` is responding slowly, it could be due
to the token table getting large. This can be fixed by running the
:command:`keystone-manage token_flush` command.
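
Many operators schedule the flush rather than running it by hand. As a
sketch, assuming ``cron`` is available and with an illustrative log
path, a root crontab entry could look like:

.. code-block:: console

   # crontab -l
   @hourly keystone-manage token_flush >/var/log/keystone/token-flush.log 2>&1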

Additionally, for Identity-related issues, try the tips
in :ref:`sql_backend`.

OpenStack Image service
~~~~~~~~~~~~~~~~~~~~~~~

OpenStack :term:`Image service` can be slowed down by things related to the
Identity service, but the Image service itself can be slowed down if
connectivity to the back-end storage in use is slow or otherwise
problematic. For example, your back-end NFS server might have gone down.

OpenStack Block Storage service
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

OpenStack :term:`Block Storage service` is similar to the Image service, so
start by checking Identity-related services, and the back-end storage.
Additionally, both the Block Storage and Image services rely on AMQP and
SQL functionality, so consider these when debugging.

OpenStack Compute service
~~~~~~~~~~~~~~~~~~~~~~~~~

Services related to OpenStack Compute are normally fairly fast and rely
on a couple of back-end services: Identity (for authentication and
authorization) and AMQP (for interoperability). Any slowness related to
services is normally related to one of these. Also, as with all other
services, SQL is used extensively.

OpenStack Networking service
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Slowness in the OpenStack :term:`Networking service` can be caused by services
that it relies upon, but it can also be related to either physical or
virtual networking. For example: network namespaces that do not exist or
are not tied to interfaces correctly; DHCP daemons that have hung or are
not running; a cable being physically disconnected; a switch not being
configured correctly. When debugging Networking service problems, begin
by verifying all physical networking functionality (switch
configuration, physical cabling, etc.). After the physical networking is
verified, check to be sure all of the Networking services are running
(neutron-server, neutron-dhcp-agent, etc.), then check on AMQP and SQL
back ends.

AMQP broker
~~~~~~~~~~~

Regardless of which AMQP broker you use, such as RabbitMQ, there are
common issues which not only slow down operations, but can also cause
real problems. Sometimes messages queued for services stay on the queues
and are not consumed. This can be due to dead or stagnant services and
can commonly be cleared up by either restarting the AMQP-related
services or the OpenStack service in question.
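
With RabbitMQ, one quick way to spot stuck queues is to list queue
depths and consumer counts; a queue whose message count keeps growing
while it has zero consumers is a warning sign. A sketch:

.. code-block:: console

   # rabbitmqctl list_queues name messages consumers | sort -k2 -n | tail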

.. _sql_backend:

SQL back end
~~~~~~~~~~~~

Whether you use SQLite or an RDBMS (such as MySQL), SQL interoperability
is essential to a functioning OpenStack environment. A large or
fragmented SQLite file can cause slowness when using files as a back
end. A locked or long-running query can cause delays for most RDBMS
services. In this case, do not kill the query immediately, but look into
it to see if it is a problem with something that is hung, or something
that is just taking a long time to run and needs to finish on its own.
The administration of an RDBMS is outside the scope of this document,
but it should be noted that a properly functioning RDBMS is essential to
most OpenStack services.
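
On MySQL, a quick way to look for such long-running or locked queries is
the process list; the ``Time`` column shows how long each statement has
been running:

.. code-block:: mysql

   mysql> SHOW FULL PROCESSLIST;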

doc/ops-guide/source/ops_maintenance_storage.rst (new file)

=====================================
Storage Node Failures and Maintenance
=====================================

Because of the high redundancy of Object Storage, dealing with object
storage node issues is a lot easier than dealing with compute node
issues.

Rebooting a Storage Node
~~~~~~~~~~~~~~~~~~~~~~~~

If a storage node requires a reboot, simply reboot it. Requests for data
hosted on that node are redirected to other copies while the server is
rebooting.

Shutting Down a Storage Node
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If you need to shut down a storage node for an extended period of time
(one or more days), consider removing the node from the storage ring.
For example:

.. code-block:: console

   # swift-ring-builder account.builder remove <ip address of storage node>
   # swift-ring-builder container.builder remove <ip address of storage node>
   # swift-ring-builder object.builder remove <ip address of storage node>
   # swift-ring-builder account.builder rebalance
   # swift-ring-builder container.builder rebalance
   # swift-ring-builder object.builder rebalance

Next, redistribute the ring files to the other nodes:

.. code-block:: console

   # for i in s01.example.com s02.example.com s03.example.com
   > do
   >   scp *.ring.gz $i:/etc/swift
   > done

These actions effectively take the storage node out of the storage
cluster.

When the node is able to rejoin the cluster, just add it back to the
ring. The exact syntax you use to add a node to your swift cluster with
``swift-ring-builder`` depends heavily on the options used when you
originally created your cluster. Please refer back to those commands.

Replacing a Swift Disk
~~~~~~~~~~~~~~~~~~~~~~

If a hard drive fails in an Object Storage node, replacing it is
relatively easy. This assumes that your Object Storage environment is
configured correctly, where the data that is stored on the failed drive
is also replicated to other drives in the Object Storage environment.

This example assumes that ``/dev/sdb`` has failed.

First, unmount the disk:

.. code-block:: console

   # umount /dev/sdb

Next, physically remove the disk from the server and replace it with a
working disk.

Ensure that the operating system has recognized the new disk:

.. code-block:: console

   # dmesg | tail

You should see a message about ``/dev/sdb``.

Because it is recommended to not use partitions on a swift disk, simply
format the disk as a whole:

.. code-block:: console

   # mkfs.xfs /dev/sdb
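
The ``mount -a`` in the next step relies on the disk already having an
entry in ``/etc/fstab``. The mount point and options below are an
illustrative example of a typical swift entry, not a requirement:

.. code-block:: console

   /dev/sdb /srv/node/sdb xfs noatime,nodiratime,logbufs=8 0 0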

Finally, mount the disk:

.. code-block:: console

   # mount -a

Swift should notice the new disk and that no data exists. It then begins
replicating the data to the disk from the other existing replicas.

doc/ops-guide/source/ops_uninstall.rst (new file)

============
Uninstalling
============

While we'd always recommend using your automated deployment system to
reinstall systems from scratch, sometimes you do need to remove
OpenStack from a system the hard way. Here's how:

* Remove all packages.
* Remove remaining files.
* Remove databases.

These steps depend on your underlying distribution, but in general you
should be looking for :command:`purge` commands in your package manager, like
:command:`aptitude purge ~c $package`. Following this, you can look for
orphaned files in the directories referenced throughout this guide. To
uninstall the database properly, refer to the manual appropriate for the
product in use.