===========================
Handling a Complete Failure
===========================
A common way to recover from a full system failure, such as a data
center power outage, is to assign each service a priority and restore
them in order.
:ref:`table_example_priority` shows an example.
.. _table_example_priority:
.. list-table:: Table. Example service restoration priority list
:header-rows: 1
* - Priority
- Services
* - 1
- Internal network connectivity
* - 2
- Backing storage services
* - 3
- Public network connectivity for user virtual machines
* - 4
- ``nova-compute``, ``nova-network``, cinder hosts
* - 5
- User virtual machines
* - 10
- Message queue and database services
* - 15
- Keystone services
* - 20
- ``cinder-scheduler``
* - 21
- Image Catalog and Delivery services
* - 22
- ``nova-scheduler`` services
* - 98
- ``cinder-api``
* - 99
- ``nova-api`` services
* - 100
- Dashboard node
Use this example priority list to ensure that user-affected services are
restored as soon as possible, but not before a stable environment is in
place. Despite being listed as a single line item, each step requires
significant work. For example, after starting the database, you should
check its integrity; after starting the nova services, you should verify
that the hypervisor matches the database and fix any mismatches.
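As a sketch of what that verification might look like (the hostname is a
placeholder, and your own checks will differ), you could run:

.. code-block:: console

   # mysqlcheck --all-databases
   # virsh list --all
   # nova list --host c01.example.com --all-tenants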

=====================================
Compute Node Failures and Maintenance
=====================================
Sometimes a compute node either crashes unexpectedly or requires a
reboot for maintenance reasons.
Planned Maintenance
~~~~~~~~~~~~~~~~~~~
If you need to reboot a compute node due to planned maintenance (such as
a software or hardware upgrade), first ensure that all hosted instances
have been moved off the node. If your cloud is utilizing shared storage,
use the :command:`nova live-migration` command. First, get a list of instances
that need to be moved:
.. code-block:: console
# nova list --host c01.example.com --all-tenants
Next, migrate them one by one:
.. code-block:: console
# nova live-migration <uuid> c02.example.com
If you are not using shared storage, you can use the
:option:`--block-migrate` option:
.. code-block:: console
# nova live-migration --block-migrate <uuid> c02.example.com
After you have migrated all instances, ensure that the ``nova-compute``
service has stopped:
.. code-block:: console
# stop nova-compute
If you use a configuration-management system, such as Puppet, that
ensures the ``nova-compute`` service is always running, you can
temporarily move the ``init`` files:
.. code-block:: console
# mkdir /root/tmp
# mv /etc/init/nova-compute.conf /root/tmp
# mv /etc/init.d/nova-compute /root/tmp
Next, shut down your compute node, perform your maintenance, and turn
the node back on. You can reenable the ``nova-compute`` service by
undoing the previous commands:
.. code-block:: console
# mv /root/tmp/nova-compute.conf /etc/init
# mv /root/tmp/nova-compute /etc/init.d/
Then start the ``nova-compute`` service:
.. code-block:: console
# start nova-compute
You can now optionally migrate the instances back to their original
compute node.
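If many instances were migrated, a small loop saves repetition. The
following is a rough sketch, not a supported workflow: it extracts
instance UUIDs from the :command:`nova list` output with a regular
expression and assumes the hostnames used above:

.. code-block:: console

   # nova list --host c02.example.com --all-tenants \
     | grep -oE '[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}' \
     | while read uuid; do nova live-migration $uuid c01.example.com; done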
After a Compute Node Reboots
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
When you reboot a compute node, first verify that it booted
successfully. This includes ensuring that the ``nova-compute`` service
is running:
.. code-block:: console
# ps aux | grep nova-compute
# status nova-compute
Also ensure that it has successfully connected to the AMQP server:
.. code-block:: console
# grep AMQP /var/log/nova/nova-compute.log
2013-02-26 09:51:31 12427 INFO nova.openstack.common.rpc.common [-] Connected to AMQP server on 199.116.232.36:5672
After the compute node is successfully running, you must deal with the
instances that are hosted on that compute node because none of them are
running. Depending on your SLA with your users or customers, you might
have to start each instance and ensure that they start correctly.
Instances
~~~~~~~~~
You can create a list of instances that are hosted on the compute node
by running the following command:
.. code-block:: console
# nova list --host c01.example.com --all-tenants
After you have the list, you can use the :command:`nova` command to start each
instance:
.. code-block:: console
# nova reboot <uuid>
.. note::
Any time an instance shuts down unexpectedly, it might have problems
on boot. For example, the instance might require an ``fsck`` on the
root partition. If this happens, the user can use the dashboard VNC
console to fix this.
If an instance does not boot, meaning ``virsh list`` never shows the
instance as even attempting to boot, do the following on the compute
node:
.. code-block:: console
# tail -f /var/log/nova/nova-compute.log
Try executing the :command:`nova reboot` command again. You should see an
error message describing why the instance was not able to boot.
In most cases, the error is the result of something in libvirt's XML
file (``/etc/libvirt/qemu/instance-xxxxxxxx.xml``) that no longer
exists. You can enforce re-creation of the XML file as well as rebooting
the instance by running the following command:
.. code-block:: console
# nova reboot --hard <uuid>
Inspecting and Recovering Data from Failed Instances
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In some scenarios, instances are running but are inaccessible through
SSH and do not respond to any command. The VNC console could be
displaying a boot failure or kernel panic error message. This could be
an indication of file system corruption on the VM itself. If you need to
recover files or inspect the content of the instance, qemu-nbd can be
used to mount the disk.
.. warning::
If you access or view the user's content and data, get approval first!
To access the instance's disk
(``/var/lib/nova/instances/instance-xxxxxx/disk``), use the following
steps:
#. Suspend the instance using the ``virsh`` command.
#. Connect the qemu-nbd device to the disk.
#. Mount the qemu-nbd device.
#. Unmount the device after inspecting.
#. Disconnect the qemu-nbd device.
#. Resume the instance.
If you do not follow the last three steps, OpenStack Compute cannot
manage the instance any longer. It fails to respond to any command
issued by OpenStack Compute, and it is marked as shut down.
Once you mount the disk file, you should be able to access it and treat
it as a collection of normal directories with files and a directory
structure. However, we do not recommend that you edit or touch any files
because this could change the
:term:`access control lists (ACLs) <access control list>` that are used
to determine which accounts can perform what operations on files and
directories. Changing ACLs can make the instance unbootable if it is not
already.
#. Suspend the instance using the :command:`virsh` command, taking note of the
internal ID:
.. code-block:: console
# virsh list
Id Name State
----------------------------------
1 instance-00000981 running
2 instance-000009f5 running
30 instance-0000274a running
# virsh suspend 30
Domain 30 suspended
#. Connect the qemu-nbd device to the disk:
.. code-block:: console
# cd /var/lib/nova/instances/instance-0000274a
# ls -lh
total 33M
-rw-rw---- 1 libvirt-qemu kvm 6.3K Oct 15 11:31 console.log
-rw-r--r-- 1 libvirt-qemu kvm 33M Oct 15 22:06 disk
-rw-r--r-- 1 libvirt-qemu kvm 384K Oct 15 22:06 disk.local
-rw-rw-r-- 1 nova nova 1.7K Oct 15 11:30 libvirt.xml
# qemu-nbd -c /dev/nbd0 `pwd`/disk
#. Mount the qemu-nbd device.
The qemu-nbd device tries to export the instance disk's different
partitions as separate devices. For example, if vda is the disk and
vda1 is the root partition, qemu-nbd exports the device as
``/dev/nbd0`` and ``/dev/nbd0p1``, respectively:
.. code-block:: console
# mount /dev/nbd0p1 /mnt/
You can now access the contents of ``/mnt``, which correspond to the
first partition of the instance's disk.
To examine the secondary or ephemeral disk, use an alternate mount
point if you want both primary and secondary drives mounted at the
same time:
.. code-block:: console
# umount /mnt
# qemu-nbd -c /dev/nbd1 `pwd`/disk.local
# mount /dev/nbd1 /mnt/
# ls -lh /mnt/
total 76K
lrwxrwxrwx. 1 root root 7 Oct 15 00:44 bin -> usr/bin
dr-xr-xr-x. 4 root root 4.0K Oct 15 01:07 boot
drwxr-xr-x. 2 root root 4.0K Oct 15 00:42 dev
drwxr-xr-x. 70 root root 4.0K Oct 15 11:31 etc
drwxr-xr-x. 3 root root 4.0K Oct 15 01:07 home
lrwxrwxrwx. 1 root root 7 Oct 15 00:44 lib -> usr/lib
lrwxrwxrwx. 1 root root 9 Oct 15 00:44 lib64 -> usr/lib64
drwx------. 2 root root 16K Oct 15 00:42 lost+found
drwxr-xr-x. 2 root root 4.0K Feb 3 2012 media
drwxr-xr-x. 2 root root 4.0K Feb 3 2012 mnt
drwxr-xr-x. 2 root root 4.0K Feb 3 2012 opt
drwxr-xr-x. 2 root root 4.0K Oct 15 00:42 proc
dr-xr-x---. 3 root root 4.0K Oct 15 21:56 root
drwxr-xr-x. 14 root root 4.0K Oct 15 01:07 run
lrwxrwxrwx. 1 root root 8 Oct 15 00:44 sbin -> usr/sbin
drwxr-xr-x. 2 root root 4.0K Feb 3 2012 srv
drwxr-xr-x. 2 root root 4.0K Oct 15 00:42 sys
drwxrwxrwt. 9 root root 4.0K Oct 15 16:29 tmp
drwxr-xr-x. 13 root root 4.0K Oct 15 00:44 usr
drwxr-xr-x. 17 root root 4.0K Oct 15 00:44 var
#. Once you have completed the inspection, unmount the mount point and
release the qemu-nbd device:
.. code-block:: console
# umount /mnt
# qemu-nbd -d /dev/nbd0
/dev/nbd0 disconnected
#. Resume the instance using :command:`virsh`:
.. code-block:: console
# virsh list
Id Name State
----------------------------------
1 instance-00000981 running
2 instance-000009f5 running
30 instance-0000274a paused
# virsh resume 30
Domain 30 resumed
.. _volumes:
Volumes
~~~~~~~
If the affected instances also had attached volumes, first generate a
list of instance and volume UUIDs:
.. code-block:: mysql
mysql> select nova.instances.uuid as instance_uuid,
cinder.volumes.id as volume_uuid, cinder.volumes.status,
cinder.volumes.attach_status, cinder.volumes.mountpoint,
cinder.volumes.display_name from cinder.volumes
inner join nova.instances on cinder.volumes.instance_uuid=nova.instances.uuid
where nova.instances.host = 'c01.example.com';
You should see a result similar to the following:
.. code-block:: mysql
+--------------+------------+-------+--------------+-----------+--------------+
|instance_uuid |volume_uuid |status |attach_status |mountpoint | display_name |
+--------------+------------+-------+--------------+-----------+--------------+
|9b969a05 |1f0fbf36 |in-use |attached |/dev/vdc | test |
+--------------+------------+-------+--------------+-----------+--------------+
1 row in set (0.00 sec)
Next, manually detach and reattach the volumes, where X is the
appropriate device letter:
.. code-block:: console
# nova volume-detach <instance_uuid> <volume_uuid>
# nova volume-attach <instance_uuid> <volume_uuid> /dev/vdX
Be sure that the instance has successfully booted and is at a login
screen before doing the above.
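If many volumes are affected, the detach and reattach steps can be
scripted. The following is a sketch only: it reuses the query above,
takes the recorded mount point as the target device, and should be run
only after each instance is confirmed to have booted:

.. code-block:: console

   # mysql -N -e "select nova.instances.uuid, cinder.volumes.id, \
     cinder.volumes.mountpoint from cinder.volumes \
     inner join nova.instances \
     on cinder.volumes.instance_uuid = nova.instances.uuid \
     where nova.instances.host = 'c01.example.com';" \
     | while read instance_uuid volume_uuid mountpoint; do
         nova volume-detach $instance_uuid $volume_uuid
         nova volume-attach $instance_uuid $volume_uuid $mountpoint
     done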
Total Compute Node Failure
~~~~~~~~~~~~~~~~~~~~~~~~~~
Compute nodes can fail the same way a cloud controller can fail. A
motherboard failure or some other type of hardware failure can cause an
entire compute node to go offline. When this happens, all instances
running on that compute node will not be available. Just like with a
cloud controller failure, if your infrastructure monitoring does not
detect a failed compute node, your users will notify you because of
their lost instances.
If a compute node fails and won't be fixed for a few hours (or at all),
you can relaunch all instances that are hosted on the failed node if you
use shared storage for ``/var/lib/nova/instances``.
To do this, generate a list of instance UUIDs that are hosted on the
failed node by running the following query on the nova database:
.. code-block:: mysql
mysql> select uuid from instances
where host = 'c01.example.com' and deleted = 0;
Next, update the nova database to indicate that all instances that used
to be hosted on c01.example.com are now hosted on c02.example.com:
.. code-block:: mysql
mysql> update instances set host = 'c02.example.com'
where host = 'c01.example.com' and deleted = 0;
If you're using the Networking service ML2 plug-in, update the
Networking service database to indicate that all ports that used to be
hosted on c01.example.com are now hosted on c02.example.com:
.. code-block:: mysql
mysql> update ml2_port_bindings set host = 'c02.example.com'
where host = 'c01.example.com';
mysql> update ml2_port_binding_levels set host = 'c02.example.com'
where host = 'c01.example.com';
After that, use the :command:`nova` command to reboot all instances that were
on c01.example.com while regenerating their XML files at the same time:
.. code-block:: console
# nova reboot --hard <uuid>
Finally, reattach volumes using the same method described in the section
:ref:`volumes`.
/var/lib/nova/instances
~~~~~~~~~~~~~~~~~~~~~~~
It's worth mentioning this directory in the context of failed compute
nodes. This directory contains the libvirt KVM file-based disk images
for the instances that are hosted on that compute node. If you are not
running your cloud in a shared storage environment, this directory is
unique across all compute nodes.
``/var/lib/nova/instances`` contains two types of directories.
The first is the ``_base`` directory. This contains all the cached base
images from glance for each unique image that has been launched on that
compute node. Files ending in ``_20`` (or a different number) are the
ephemeral base images.
The other directories are titled ``instance-xxxxxxxx``. These
directories correspond to instances running on that compute node. The
files inside are related to one of the files in the ``_base`` directory.
They're essentially differential-based files containing only the changes
made from the original ``_base`` directory.
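You can see this relationship directly with :command:`qemu-img`: the
``backing file`` field in its output points at the cached image in
``_base`` (the instance directory below is just an example):

.. code-block:: console

   # qemu-img info /var/lib/nova/instances/instance-0000274a/disk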
All files and directories in ``/var/lib/nova/instances`` are uniquely
named. The files in ``_base`` are uniquely titled for the glance image that
they are based on, and the directory names ``instance-xxxxxxxx`` are
uniquely titled for that particular instance. For example, if you copy
all data from ``/var/lib/nova/instances`` on one compute node to
another, you do not overwrite any files or cause any damage to images
that have the same unique name, because they are essentially the same
file.
Although this method is not documented or supported, you can use it when
your compute node is permanently offline but you have instances locally
stored on it.

========================
Configuration Management
========================
Maintaining an OpenStack cloud requires that you manage multiple
physical servers, and this number might grow over time. Because managing
nodes manually is error prone, we strongly recommend that you use a
configuration-management tool. These tools automate the process of
ensuring that all your nodes are configured properly and encourage you
to maintain your configuration information (such as packages and
configuration options) in a version-controlled repository.
.. note::
Several configuration-management tools are available, and this guide
does not recommend a specific one. The two most popular ones in the
OpenStack community are `Puppet <https://puppetlabs.com/>`_, with
available `OpenStack Puppet
modules <https://github.com/puppetlabs/puppetlabs-openstack>`_; and
`Chef <http://www.getchef.com/chef/>`_, with available `OpenStack
Chef recipes <https://github.com/opscode/openstack-chef-repo>`_.
Other newer configuration tools include
`Juju <https://juju.ubuntu.com/>`_,
`Ansible <https://www.ansible.com/>`_, and
`Salt <http://www.saltstack.com/>`_; and more mature configuration
management tools include `CFEngine <http://cfengine.com/>`_ and
`Bcfg2 <http://bcfg2.org/>`_.

===========================================================
Cloud Controller and Storage Proxy Failures and Maintenance
===========================================================
The cloud controller and storage proxy are very similar to each other
when it comes to expected and unexpected downtime. One of each server
type typically runs in the cloud, which makes them very noticeable when
they are not running.
For the cloud controller, the good news is that if your cloud is using
the FlatDHCP multi-host HA network mode, existing instances and volumes
continue to operate while the cloud controller is offline. For the
storage proxy, however, no storage traffic is possible until it is back
up and running.
Planned Maintenance
~~~~~~~~~~~~~~~~~~~
One way to plan for cloud controller or storage proxy maintenance is to
simply do it off-hours, such as at 1 a.m. or 2 a.m. This strategy
affects fewer users. If your cloud controller or storage proxy is too
important to have unavailable at any point in time, you must look into
high-availability options.
Rebooting a Cloud Controller or Storage Proxy
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In most cases, simply issue the :command:`reboot` command. The operating
system cleanly shuts down services and then automatically reboots. If
you want to be very thorough, run your backup jobs just before you
reboot.
After a cloud controller reboots, ensure that all required services were
successfully started. The following commands use :command:`ps` and
:command:`grep` to determine if nova, glance, and keystone are currently
running:
.. code-block:: console
# ps aux | grep nova-
# ps aux | grep glance-
# ps aux | grep keystone
# ps aux | grep cinder
Also check that all services are functioning. The following set of
commands sources the ``openrc`` file, then runs some basic glance, nova,
and openstack commands. If the commands work as expected, you can be
confident that those services are in working condition:
.. code-block:: console
# source openrc
# glance index
# nova list
# openstack project list
For the storage proxy, ensure that the :term:`Object Storage service` has
resumed:
.. code-block:: console
# ps aux | grep swift
Also check that it is functioning:
.. code-block:: console
# swift stat
Total Cloud Controller Failure
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The cloud controller could completely fail if, for example, its
motherboard goes bad. Users will immediately notice the loss of a cloud
controller since it provides core functionality to your cloud
environment. If your infrastructure monitoring does not alert you that
your cloud controller has failed, your users definitely will.
Unfortunately, this is a rough situation. The cloud controller is an
integral part of your cloud. If you have only one controller, you will
have many missing services if it goes down.
To avoid this situation, create a highly available cloud controller
cluster. This is outside the scope of this document, but you can read
more in the `OpenStack High Availability
Guide <http://docs.openstack.org/ha-guide/index.html>`_.
The next best approach is to use a configuration-management tool, such
as Puppet, to automatically build a cloud controller. This should not
take more than 15 minutes if you have a spare server available. After
the controller rebuilds, restore any backups taken
(see :doc:`ops_backup_recovery`).
Also, in practice, the ``nova-compute`` services on the compute nodes do
not always reconnect cleanly to RabbitMQ hosted on the controller when
it comes back up after a long reboot; a restart of the nova services on
the compute nodes is required.
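For a handful of compute nodes, a quick way to do this is a loop over
SSH; the hostnames below are placeholders, and the :command:`restart`
form matches the upstart commands used earlier in this chapter:

.. code-block:: console

   # for host in c01.example.com c02.example.com
   > do
   >    ssh $host restart nova-compute
   > done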

=========
Databases
=========
Almost all OpenStack components have an underlying database to store
persistent information. Usually this database is MySQL. Normal MySQL
administration is applicable to these databases, as OpenStack does not
configure them in any unusual way. Basic administration includes
performance tweaking, high availability, backup, recovery, and repair.
For more information, see a standard MySQL administration guide.
You can perform a couple of tricks with the database to either more
quickly retrieve information or fix a data inconsistency error—for
example, an instance was terminated, but the status was not updated in
the database. These tricks are discussed throughout this book.
Database Connectivity
~~~~~~~~~~~~~~~~~~~~~
Review the component's configuration file to see how each OpenStack
component accesses its corresponding database. Look for either
``sql_connection`` or simply ``connection``. The following command uses
``grep`` to display the SQL connection string for nova, glance, cinder,
and keystone:
.. code-block:: console
# grep -hE "connection ?=" /etc/nova/nova.conf /etc/glance/glance-*.conf \
/etc/cinder/cinder.conf /etc/keystone/keystone.conf
sql_connection = mysql+pymysql://nova:nova@cloud.alberta.sandbox.cybera.ca/nova
sql_connection = mysql+pymysql://glance:password@cloud.example.com/glance
sql_connection = mysql+pymysql://glance:password@cloud.example.com/glance
sql_connection = mysql+pymysql://cinder:password@cloud.example.com/cinder
connection = mysql+pymysql://keystone_admin:password@cloud.example.com/keystone
The connection strings take this format:
.. code-block:: console
mysql+pymysql:// <username> : <password> @ <hostname> / <database name>
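To confirm that a connection string actually works, you can pass its
parts to the :command:`mysql` client directly. For example, to test the
keystone string shown above (you are prompted for the password):

.. code-block:: console

   # mysql -u keystone_admin -p -h cloud.example.com keystone -e "show tables;"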
Performance and Optimizing
~~~~~~~~~~~~~~~~~~~~~~~~~~
As your cloud grows, MySQL is utilized more and more. If you suspect
that MySQL might be becoming a bottleneck, you should start researching
MySQL optimization. The MySQL manual has an entire section dedicated to
this topic: `Optimization Overview
<http://dev.mysql.com/doc/refman/5.5/en/optimize-overview.html>`_.
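As a starting point, a few standard MySQL counters often show whether
the database is under pressure; nothing here is OpenStack specific:

.. code-block:: console

   # mysql -e "SHOW GLOBAL STATUS LIKE 'Slow_queries';"
   # mysql -e "SHOW GLOBAL STATUS LIKE 'Threads_connected';"
   # mysql -e "SHOW GLOBAL VARIABLES LIKE 'max_connections';"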

=====================================
Determining Which Component Is Broken
=====================================
OpenStack is a collection of components that interact with each other
strongly. For example, uploading an image requires interaction from
``nova-api``, ``glance-api``, ``glance-registry``, keystone, and
potentially ``swift-proxy``. As a result, it is sometimes difficult to
determine exactly where problems lie. This section aims to assist with
that.
Tailing Logs
~~~~~~~~~~~~
The first place to look is the log file related to the command you are
trying to run. For example, if ``nova list`` is failing, try tailing a
nova log file and running the command again:
Terminal 1:
.. code-block:: console
# tail -f /var/log/nova/nova-api.log
Terminal 2:
.. code-block:: console
# nova list
Look for any errors or traces in the log file. For more information, see
:doc:`ops_logging_monitoring`.
If the error indicates that the problem is with another component,
switch to tailing that component's log file. For example, if nova cannot
access glance, look at the ``glance-api`` log:
Terminal 1:
.. code-block:: console
# tail -f /var/log/glance/api.log
Terminal 2:
.. code-block:: console
# nova list
Wash, rinse, and repeat until you find the core cause of the problem.
Running Daemons on the CLI
~~~~~~~~~~~~~~~~~~~~~~~~~~
Unfortunately, sometimes the error is not apparent from the log files.
In this case, switch tactics and use a different command; maybe run the
service directly on the command line. For example, if the ``glance-api``
service refuses to start and stay running, try launching the daemon from
the command line:
.. code-block:: console
# sudo -u glance -H glance-api
This might print the error and cause of the problem.
.. note::
The ``-H`` flag is required when running the daemons with sudo
because some daemons will write files relative to the user's home
directory, and this write may fail if ``-H`` is left off.
.. Tip::
**Example of Complexity**
One morning, a compute node failed to run any instances. The log files
were a bit vague, claiming that a certain instance was unable to be
started. This ended up being a red herring because the instance was
simply the first instance in alphabetical order, so it was the first
instance that ``nova-compute`` would touch.
Further troubleshooting showed that libvirt was not running at all. This
made more sense. If libvirt wasn't running, then no instance could be
virtualized through KVM. When libvirt was started manually, it would
silently die immediately. The libvirt logs did not explain why.
Next, the ``libvirtd`` daemon was run on the command line. Finally, a
helpful error message appeared: it could not connect to d-bus. As
ridiculous as it sounds, libvirt, and thus ``nova-compute``, relies on
d-bus, and somehow d-bus had crashed. Simply starting d-bus set the
entire chain back on track, and soon everything was back up and running.

=====================
Working with Hardware
=====================
As with your initial deployment, you should ensure that all hardware is
appropriately burned in before adding it to production. Run software
that uses the hardware to its limits—maxing out RAM, CPU, disk, and
network. Many burn-in options are available, and they normally double as
benchmark software, so you also get a good idea of the performance of
your system.
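For example, ``stress-ng`` and ``fio`` are commonly used for burn-in;
the durations and sizes below are arbitrary and should be tuned to the
hardware under test:

.. code-block:: console

   # stress-ng --cpu 0 --vm 4 --vm-bytes 75% --timeout 24h
   # fio --name=burn-in --filename=/mnt/burnin.dat --size=10G \
     --rw=randrw --bs=4k --numjobs=4 --time_based --runtime=86400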
Adding a Compute Node
~~~~~~~~~~~~~~~~~~~~~
If you find that you have reached or are reaching the capacity limit of
your computing resources, you should plan to add additional compute
nodes. Adding more nodes is quite easy. The process for adding compute
nodes is the same as when the initial compute nodes were deployed to
your cloud: use an automated deployment system to bootstrap the
bare-metal server with the operating system and then have a
configuration-management system install and configure OpenStack Compute.
Once the Compute service has been installed and configured in the same
way as the other compute nodes, it automatically attaches itself to the
cloud. The cloud controller notices the new node(s) and begins
scheduling instances to launch there.
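You can verify that the new node has attached itself by listing the
``nova-compute`` services and confirming that the new host appears and
is reported as up:

.. code-block:: console

   # nova service-list --binary nova-compute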
If your OpenStack Block Storage nodes are separate from your compute
nodes, the same procedure still applies because the same queuing and
polling system is used in both services.
We recommend that you use the same hardware for new compute and block
storage nodes. At the very least, ensure that the CPUs are similar in
the compute nodes to not break live migration.
Adding an Object Storage Node
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Adding a new object storage node is different from adding compute or
block storage nodes. You still want to initially configure the server by
using your automated deployment and configuration-management systems.
After that is done, you need to add the local disks of the object
storage node into the object storage ring. The exact command to do this
is the same command that was used to add the initial disks to the ring.
Simply rerun this command on the object storage proxy server for all
disks on the new object storage node. Once this has been done, rebalance
the ring and copy the resulting ring files to the other storage nodes.
.. note::
If your new object storage node has a different number of disks than
the original nodes have, the command to add the new node is
different from the original commands. These parameters vary from
environment to environment.
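As a rough illustration only, adding a disk and rebalancing might look
like the following; the region, zone, IP address, port, and weight are
placeholders that must match the values used in your environment:

.. code-block:: console

   # swift-ring-builder object.builder add r1z2-10.0.0.5:6000/sdb 100
   # swift-ring-builder object.builder rebalance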
Replacing Components
~~~~~~~~~~~~~~~~~~~~
Hardware failures are common in large-scale deployments such as an
infrastructure cloud. Consider your processes and balance time saving
against availability. For example, an Object Storage cluster can easily
live with dead disks in it for some period of time if it has sufficient
capacity. Or, if your compute installation is not full, you could
consider live migrating instances off a host with a RAM failure until
you have time to deal with the problem.

=====
HDWMY
=====
Here's a quick list of various to-do items for each hour, day, week,
month, and year. These tasks are not required or definitive, but are
helpful ideas:
Hourly
~~~~~~
* Check your monitoring system for alerts and act on them.
* Check your ticket queue for new tickets.
Daily
~~~~~
* Check for instances in a failed or weird state and investigate why.
* Check for security patches and apply them as needed.
Weekly
~~~~~~
* Check cloud usage:
* User quotas
* Disk space
* Image usage
* Large instances
* Network usage (bandwidth and IP usage)
* Verify your alert mechanisms are still working.
Monthly
~~~~~~~
* Check usage and trends over the past month.
* Check for user accounts that should be removed.
* Check for operator accounts that should be removed.
Quarterly
~~~~~~~~~
* Review usage and trends over the past quarter.
* Prepare any quarterly reports on usage and statistics.
* Review and plan any necessary cloud additions.
* Review and plan any major OpenStack upgrades.
Semiannually
~~~~~~~~~~~~
* Upgrade OpenStack.
* Clean up after an OpenStack upgrade (any unused or new services to be
aware of?).

=========================================
What to do when things are running slowly
=========================================
When you are getting slow responses from various services, it can be
hard to know where to start looking. The first thing to check is the
extent of the slowness: is it specific to a single service, or varied
among different services? If your problem is isolated to a specific
service, it can temporarily be fixed by restarting the service, but that
is often only a fix for the symptom and not the actual problem.
This is a collection of ideas from experienced operators on common
things to look at that may be the cause of slowness. It is not, however,
designed to be an exhaustive list.
OpenStack Identity service
~~~~~~~~~~~~~~~~~~~~~~~~~~
If OpenStack :term:`Identity service` is responding slowly, it could be due
to the token table getting large. This can be fixed by running the
:command:`keystone-manage token_flush` command.
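For example:

.. code-block:: console

   # keystone-manage token_flush

Many deployments schedule this command to run regularly (hourly or
daily) from cron so the token table never grows unbounded.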
Additionally, for Identity-related issues, try the tips
in :ref:`sql_backend`.
OpenStack Image service
~~~~~~~~~~~~~~~~~~~~~~~
OpenStack :term:`Image service` can be slowed down by things related to the
Identity service, but the Image service itself can be slowed down if
connectivity to the back-end storage in use is slow or otherwise
problematic. For example, your back-end NFS server might have gone down.
OpenStack Block Storage service
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
OpenStack :term:`Block Storage service` is similar to the Image service, so
start by checking Identity-related services, and the back-end storage.
Additionally, both the Block Storage and Image services rely on AMQP and
SQL functionality, so consider these when debugging.
OpenStack Compute service
~~~~~~~~~~~~~~~~~~~~~~~~~
Services related to OpenStack Compute are normally fairly fast and rely
on a couple of back-end services: Identity for authentication and
authorization, and AMQP for interoperability. Any slowness related to
Compute services is normally related to one of these. Also, as with all
other services, SQL is used extensively.
OpenStack Networking service
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Slowness in the OpenStack :term:`Networking service` can be caused by services
that it relies upon, but it can also be related to either physical or
virtual networking. For example: network namespaces that do not exist or
are not tied to interfaces correctly; DHCP daemons that have hung or are
not running; a cable being physically disconnected; a switch not being
configured correctly. When debugging Networking service problems, begin
by verifying all physical networking functionality (switch
configuration, physical cabling, etc.). After the physical networking is
verified, check to be sure all of the Networking services are running
(``neutron-server``, ``neutron-dhcp-agent``, etc.), then check on AMQP
and SQL back ends.
AMQP broker
~~~~~~~~~~~
Regardless of which AMQP broker you use, such as RabbitMQ, there are
common issues that not only slow down operations but can also cause real
problems. Sometimes messages queued for services stay on the queues and
are not consumed. This can be due to dead or stagnant services and can
commonly be cleared up by restarting either the AMQP-related services or
the OpenStack service in question.
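With RabbitMQ, for example, one way to spot stagnant queues is to list
queues that hold messages but have no consumers; the ``awk`` filter here
is just one possible approach:

.. code-block:: console

   # rabbitmqctl list_queues name messages consumers | awk '$2 > 0 && $3 == 0'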
.. _sql_backend:
SQL back end
~~~~~~~~~~~~
Whether you use SQLite or an RDBMS (such as MySQL), SQL interoperability
is essential to a functioning OpenStack environment. A large or
fragmented SQLite file can cause slowness when using files as a back
end. A locked or long-running query can cause delays for most RDBMS
services. In this case, do not kill the query immediately, but look into
it to see if it is a problem with something that is hung, or something
that is just taking a long time to run and needs to finish on its own.
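On MySQL, for example, you can inspect the running queries before
deciding whether to intervene:

.. code-block:: console

   # mysql -e "SHOW FULL PROCESSLIST;"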
The administration of an RDBMS is outside the scope of this document,
but it should be noted that a properly functioning RDBMS is essential to
most OpenStack services.

=====================================
Storage Node Failures and Maintenance
=====================================
Because of the high redundancy of Object Storage, dealing with object
storage node issues is a lot easier than dealing with compute node
issues.
Rebooting a Storage Node
~~~~~~~~~~~~~~~~~~~~~~~~
If a storage node requires a reboot, simply reboot it. Requests for data
hosted on that node are redirected to other copies while the server is
rebooting.
Shutting Down a Storage Node
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If you need to shut down a storage node for an extended period of time
(one or more days), consider removing the node from the storage ring.
For example:
.. code-block:: console
# swift-ring-builder account.builder remove <ip address of storage node>
# swift-ring-builder container.builder remove <ip address of storage node>
# swift-ring-builder object.builder remove <ip address of storage node>
# swift-ring-builder account.builder rebalance
# swift-ring-builder container.builder rebalance
# swift-ring-builder object.builder rebalance
Next, redistribute the ring files to the other nodes:
.. code-block:: console
# for i in s01.example.com s02.example.com s03.example.com
> do
> scp *.ring.gz $i:/etc/swift
> done
These actions effectively take the storage node out of the storage
cluster.
When the node is able to rejoin the cluster, just add it back to the
ring. The exact syntax for adding a node to your swift cluster with
``swift-ring-builder`` depends heavily on the options used when you
originally created the cluster. Please refer back to those commands.
Replacing a Swift Disk
~~~~~~~~~~~~~~~~~~~~~~
If a hard drive fails in an Object Storage node, replacing it is
relatively easy. This assumes that your Object Storage environment is
configured correctly, meaning that the data stored on the failed drive
is also replicated to other drives in the Object Storage environment.
This example assumes that ``/dev/sdb`` has failed.
First, unmount the disk:
.. code-block:: console
# umount /dev/sdb
Next, physically remove the disk from the server and replace it with a
working disk.
Ensure that the operating system has recognized the new disk:
.. code-block:: console
# dmesg | tail
You should see a message about ``/dev/sdb``.
Because it is recommended not to use partitions on a swift disk, simply
format the whole disk:
.. code-block:: console
# mkfs.xfs /dev/sdb
Finally, mount the disk:
.. code-block:: console
# mount -a
Swift should notice the new disk and that no data exists on it. It then
begins replicating data to the disk from the other existing replicas.

============
Uninstalling
============
While we'd always recommend using your automated deployment system to
reinstall systems from scratch, sometimes you do need to remove
OpenStack from a system the hard way. Here's how:
* Remove all packages.
* Remove remaining files.
* Remove databases.
These steps depend on your underlying distribution, but in general you
should be looking for :command:`purge` commands in your package manager, like
:command:`aptitude purge ~c $package`. Following this, you can look for
orphaned files in the directories referenced throughout this guide. To
uninstall the database properly, refer to the manual appropriate for the
product in use.