[ops-guide] Cleanup maintenance chapter

Change-Id: I421caf2a12ab192d4df6d5c197e2c5dfb1c9c9bb
Implements: blueprint ops-guide-rst

doc/ops-guide/source/ops_maintenance_complete.rst (new file)

===========================
Handling a Complete Failure
===========================

A common way of dealing with the recovery from a full system failure,
such as a power outage of a data center, is to assign each service a
priority, and restore in order.
:ref:`table_example_priority` shows an example.

.. _table_example_priority:

.. list-table:: Table. Example service restoration priority list
   :header-rows: 1

   * - Priority
     - Services
   * - 1
     - Internal network connectivity
   * - 2
     - Backing storage services
   * - 3
     - Public network connectivity for user virtual machines
   * - 4
     - ``nova-compute``, ``nova-network``, cinder hosts
   * - 5
     - User virtual machines
   * - 10
     - Message queue and database services
   * - 15
     - Keystone services
   * - 20
     - ``cinder-scheduler``
   * - 21
     - Image Catalog and Delivery services
   * - 22
     - ``nova-scheduler`` services
   * - 98
     - ``cinder-api``
   * - 99
     - ``nova-api`` services
   * - 100
     - Dashboard node

Use this example priority list to ensure that user-affected services are
restored as soon as possible, but not before a stable environment is in
place. Of course, despite being listed as a single-line item, each step
requires significant work. For example, just after starting the
database, you should check its integrity, and, after starting the nova
services, you should verify that the hypervisor matches the database and
fix any mismatches.

doc/ops-guide/source/ops_maintenance_compute.rst (new file)

=====================================
Compute Node Failures and Maintenance
=====================================

Sometimes a compute node either crashes unexpectedly or requires a
reboot for maintenance reasons.

Planned Maintenance
~~~~~~~~~~~~~~~~~~~

If you need to reboot a compute node due to planned maintenance (such as
a software or hardware upgrade), first ensure that all hosted instances
have been moved off the node. If your cloud is utilizing shared storage,
use the :command:`nova live-migration` command. First, get a list of instances
that need to be moved:

.. code-block:: console

   # nova list --host c01.example.com --all-tenants

Next, migrate them one by one:

.. code-block:: console

   # nova live-migration <uuid> c02.example.com

If you are not using shared storage, you can use the
:option:`--block-migrate` option:

.. code-block:: console

   # nova live-migration --block-migrate <uuid> c02.example.com
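
If the node hosts many instances, migrating them individually can be
tedious. As a rough sketch, assuming the target node has capacity for
all of them, you can extract the UUID column from the ``nova list``
table output with ``awk`` and migrate in a loop:

.. code-block:: console

   # for uuid in $(nova list --host c01.example.com --all-tenants | awk '$2 ~ /-/ {print $2}')
   > do
   >   nova live-migration $uuid c02.example.com
   > done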

After you have migrated all instances, ensure that the ``nova-compute``
service has stopped:

.. code-block:: console

   # stop nova-compute

If you use a configuration-management system, such as Puppet, that
ensures the ``nova-compute`` service is always running, you can
temporarily move the ``init`` files:

.. code-block:: console

   # mkdir /root/tmp
   # mv /etc/init/nova-compute.conf /root/tmp
   # mv /etc/init.d/nova-compute /root/tmp

Next, shut down your compute node, perform your maintenance, and turn
the node back on. You can reenable the ``nova-compute`` service by
undoing the previous commands:

.. code-block:: console

   # mv /root/tmp/nova-compute.conf /etc/init
   # mv /root/tmp/nova-compute /etc/init.d/

Then start the ``nova-compute`` service:

.. code-block:: console

   # start nova-compute

You can now optionally migrate the instances back to their original
compute node.

After a Compute Node Reboots
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When you reboot a compute node, first verify that it booted
successfully. This includes ensuring that the ``nova-compute`` service
is running:

.. code-block:: console

   # ps aux | grep nova-compute
   # status nova-compute

Also ensure that it has successfully connected to the AMQP server:

.. code-block:: console

   # grep AMQP /var/log/nova/nova-compute.log
   2013-02-26 09:51:31 12427 INFO nova.openstack.common.rpc.common [-] Connected to AMQP server on 199.116.232.36:5672

After the compute node is successfully running, you must deal with the
instances that are hosted on that compute node because none of them are
running. Depending on your SLA with your users or customers, you might
have to start each instance and ensure that they start correctly.

Instances
~~~~~~~~~

You can create a list of instances that are hosted on the compute node
by performing the following command:

.. code-block:: console

   # nova list --host c01.example.com --all-tenants

After you have the list, you can use the :command:`nova` command to start each
instance:

.. code-block:: console

   # nova reboot <uuid>
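
If many instances need to be started, a short loop saves typing. This is
a sketch under the same assumption as the migration loop above (UUIDs
extracted from the ``nova list`` table output with ``awk``):

.. code-block:: console

   # for uuid in $(nova list --host c01.example.com --all-tenants | awk '$2 ~ /-/ {print $2}')
   > do
   >   nova reboot $uuid
   > done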

.. note::

   Any time an instance shuts down unexpectedly, it might have problems
   on boot. For example, the instance might require an ``fsck`` on the
   root partition. If this happens, the user can use the dashboard VNC
   console to fix this.

If an instance does not boot, meaning ``virsh list`` never shows the
instance as even attempting to boot, do the following on the compute
node:

.. code-block:: console

   # tail -f /var/log/nova/nova-compute.log

Try executing the :command:`nova reboot` command again. You should see an
error message about why the instance was not able to boot.

In most cases, the error is the result of something in libvirt's XML
file (``/etc/libvirt/qemu/instance-xxxxxxxx.xml``) that no longer
exists. You can enforce re-creation of the XML file as well as rebooting
the instance by running the following command:

.. code-block:: console

   # nova reboot --hard <uuid>

Inspecting and Recovering Data from Failed Instances
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In some scenarios, instances are running but are inaccessible through
SSH and do not respond to any command. The VNC console could be
displaying a boot failure or kernel panic error messages. This could be
an indication of file system corruption on the VM itself. If you need to
recover files or inspect the content of the instance, qemu-nbd can be
used to mount the disk.

.. warning::

   If you access or view the user's content and data, get approval first!

To access the instance's disk
(``/var/lib/nova/instances/instance-xxxxxx/disk``), use the following
steps:

#. Suspend the instance using the ``virsh`` command.

#. Connect the qemu-nbd device to the disk.

#. Mount the qemu-nbd device.

#. Unmount the device after inspecting.

#. Disconnect the qemu-nbd device.

#. Resume the instance.

If you do not follow the last three steps, OpenStack Compute cannot
manage the instance any longer. It fails to respond to any command
issued by OpenStack Compute, and it is marked as shut down.

Once you mount the disk file, you should be able to access it and treat
it as a collection of normal directories with files and a directory
structure. However, we do not recommend that you edit or touch any files
because this could change the
:term:`access control lists (ACLs) <access control list>` that are used
to determine which accounts can perform what operations on files and
directories. Changing ACLs can make the instance unbootable if it is not
already.

#. Suspend the instance using the :command:`virsh` command, taking note of the
   internal ID:

   .. code-block:: console

      # virsh list
      Id Name                 State
      ----------------------------------
      1  instance-00000981    running
      2  instance-000009f5    running
      30 instance-0000274a    running

      # virsh suspend 30
      Domain 30 suspended

#. Connect the qemu-nbd device to the disk:

   .. code-block:: console

      # cd /var/lib/nova/instances/instance-0000274a
      # ls -lh
      total 33M
      -rw-rw---- 1 libvirt-qemu kvm  6.3K Oct 15 11:31 console.log
      -rw-r--r-- 1 libvirt-qemu kvm   33M Oct 15 22:06 disk
      -rw-r--r-- 1 libvirt-qemu kvm  384K Oct 15 22:06 disk.local
      -rw-rw-r-- 1 nova         nova 1.7K Oct 15 11:30 libvirt.xml
      # qemu-nbd -c /dev/nbd0 `pwd`/disk

#. Mount the qemu-nbd device.

   The qemu-nbd device tries to export the instance disk's different
   partitions as separate devices. For example, if vda is the disk and
   vda1 is the root partition, qemu-nbd exports the device as
   ``/dev/nbd0`` and ``/dev/nbd0p1``, respectively:

   .. code-block:: console

      # mount /dev/nbd0p1 /mnt/

   You can now access the contents of ``/mnt``, which correspond to the
   first partition of the instance's disk.

   To examine the secondary or ephemeral disk, use an alternate mount
   point if you want both primary and secondary drives mounted at the
   same time:

   .. code-block:: console

      # umount /mnt
      # qemu-nbd -c /dev/nbd1 `pwd`/disk.local
      # mount /dev/nbd1 /mnt/
      # ls -lh /mnt/
      total 76K
      lrwxrwxrwx.  1 root root    7 Oct 15 00:44 bin -> usr/bin
      dr-xr-xr-x.  4 root root 4.0K Oct 15 01:07 boot
      drwxr-xr-x.  2 root root 4.0K Oct 15 00:42 dev
      drwxr-xr-x. 70 root root 4.0K Oct 15 11:31 etc
      drwxr-xr-x.  3 root root 4.0K Oct 15 01:07 home
      lrwxrwxrwx.  1 root root    7 Oct 15 00:44 lib -> usr/lib
      lrwxrwxrwx.  1 root root    9 Oct 15 00:44 lib64 -> usr/lib64
      drwx------.  2 root root  16K Oct 15 00:42 lost+found
      drwxr-xr-x.  2 root root 4.0K Feb  3  2012 media
      drwxr-xr-x.  2 root root 4.0K Feb  3  2012 mnt
      drwxr-xr-x.  2 root root 4.0K Feb  3  2012 opt
      drwxr-xr-x.  2 root root 4.0K Oct 15 00:42 proc
      dr-xr-x---.  3 root root 4.0K Oct 15 21:56 root
      drwxr-xr-x. 14 root root 4.0K Oct 15 01:07 run
      lrwxrwxrwx.  1 root root    8 Oct 15 00:44 sbin -> usr/sbin
      drwxr-xr-x.  2 root root 4.0K Feb  3  2012 srv
      drwxr-xr-x.  2 root root 4.0K Oct 15 00:42 sys
      drwxrwxrwt.  9 root root 4.0K Oct 15 16:29 tmp
      drwxr-xr-x. 13 root root 4.0K Oct 15 00:44 usr
      drwxr-xr-x. 17 root root 4.0K Oct 15 00:44 var

#. Once you have completed the inspection, unmount the mount point and
   release the qemu-nbd device:

   .. code-block:: console

      # umount /mnt
      # qemu-nbd -d /dev/nbd0
      /dev/nbd0 disconnected

#. Resume the instance using :command:`virsh`:

   .. code-block:: console

      # virsh list
      Id Name                 State
      ----------------------------------
      1  instance-00000981    running
      2  instance-000009f5    running
      30 instance-0000274a    paused

      # virsh resume 30
      Domain 30 resumed

.. _volumes:

Volumes
~~~~~~~

If the affected instances also had attached volumes, first generate a
list of instance and volume UUIDs:

.. code-block:: mysql

   mysql> select nova.instances.uuid as instance_uuid,
          cinder.volumes.id as volume_uuid, cinder.volumes.status,
          cinder.volumes.attach_status, cinder.volumes.mountpoint,
          cinder.volumes.display_name from cinder.volumes
          inner join nova.instances on cinder.volumes.instance_uuid=nova.instances.uuid
          where nova.instances.host = 'c01.example.com';

You should see a result similar to the following:

.. code-block:: mysql

   +--------------+------------+-------+--------------+-----------+--------------+
   |instance_uuid |volume_uuid |status |attach_status |mountpoint | display_name |
   +--------------+------------+-------+--------------+-----------+--------------+
   |9b969a05      |1f0fbf36    |in-use |attached      |/dev/vdc   | test         |
   +--------------+------------+-------+--------------+-----------+--------------+
   1 row in set (0.00 sec)

Next, manually detach and reattach the volumes, where X is the proper
mount point:

.. code-block:: console

   # nova volume-detach <instance_uuid> <volume_uuid>
   # nova volume-attach <instance_uuid> <volume_uuid> /dev/vdX

Be sure that the instance has successfully booted and is at a login
screen before doing the above.

Total Compute Node Failure
~~~~~~~~~~~~~~~~~~~~~~~~~~

Compute nodes can fail the same way a cloud controller can fail. A
motherboard failure or some other type of hardware failure can cause an
entire compute node to go offline. When this happens, all instances
running on that compute node will not be available. Just like with a
cloud controller failure, if your infrastructure monitoring does not
detect a failed compute node, your users will notify you because of
their lost instances.

If a compute node fails and won't be fixed for a few hours (or at all),
you can relaunch all instances that are hosted on the failed node if you
use shared storage for ``/var/lib/nova/instances``.

To do this, generate a list of instance UUIDs that are hosted on the
failed node by running the following query on the nova database:

.. code-block:: mysql

   mysql> select uuid from instances
          where host = 'c01.example.com' and deleted = 0;

Next, update the nova database to indicate that all instances that used
to be hosted on c01.example.com are now hosted on c02.example.com:

.. code-block:: mysql

   mysql> update instances set host = 'c02.example.com'
          where host = 'c01.example.com' and deleted = 0;

If you're using the Networking service ML2 plug-in, update the
Networking service database to indicate that all ports that used to be
hosted on c01.example.com are now hosted on c02.example.com:

.. code-block:: mysql

   mysql> update ml2_port_bindings set host = 'c02.example.com'
          where host = 'c01.example.com';
   mysql> update ml2_port_binding_levels set host = 'c02.example.com'
          where host = 'c01.example.com';

After that, use the :command:`nova` command to reboot all instances that were
on c01.example.com while regenerating their XML files at the same time:

.. code-block:: console

   # nova reboot --hard <uuid>
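
As a sketch, the UUIDs from the earlier query can be fed straight into
the reboot loop. This assumes the ``mysql`` client on the host you run
it from can authenticate without prompting; adjust credentials as
needed:

.. code-block:: console

   # mysql -sN -e "select uuid from nova.instances where host = 'c02.example.com' and deleted = 0" | \
     while read uuid; do nova reboot --hard $uuid; done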

Finally, reattach volumes using the same method described in the section
:ref:`volumes`.

/var/lib/nova/instances
~~~~~~~~~~~~~~~~~~~~~~~

It's worth mentioning this directory in the context of failed compute
nodes. This directory contains the libvirt KVM file-based disk images
for the instances that are hosted on that compute node. If you are not
running your cloud in a shared storage environment, this directory is
unique across all compute nodes.

``/var/lib/nova/instances`` contains two types of directories.

The first is the ``_base`` directory. This contains all the cached base
images from glance for each unique image that has been launched on that
compute node. Files ending in ``_20`` (or a different number) are the
ephemeral base images.

The other directories are titled ``instance-xxxxxxxx``. These
directories correspond to instances running on that compute node. The
files inside are related to one of the files in the ``_base`` directory.
They're essentially differential-based files containing only the changes
made from the original ``_base`` directory.
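
You can see this relationship with :command:`qemu-img info`; the paths
and hashed file name below are illustrative, and the exact output varies
by environment:

.. code-block:: console

   # qemu-img info /var/lib/nova/instances/instance-0000274a/disk
   image: /var/lib/nova/instances/instance-0000274a/disk
   file format: qcow2
   virtual size: 10G (10737418240 bytes)
   disk size: 33M
   backing file: /var/lib/nova/instances/_base/77de68daecd823babbb58edb1c8e14d7106e83bb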

All files and directories in ``/var/lib/nova/instances`` are uniquely
named. The files in ``_base`` are uniquely titled for the glance image that
they are based on, and the directory names ``instance-xxxxxxxx`` are
uniquely titled for that particular instance. For example, if you copy
all data from ``/var/lib/nova/instances`` on one compute node to
another, you do not overwrite any files or cause any damage to images
that have the same unique name, because they are essentially the same
file.

Although this method is not documented or supported, you can use it when
your compute node is permanently offline but you have instances locally
stored on it.

doc/ops-guide/source/ops_maintenance_configuration.rst (new file)

========================
Configuration Management
========================

Maintaining an OpenStack cloud requires that you manage multiple
physical servers, and this number might grow over time. Because managing
nodes manually is error prone, we strongly recommend that you use a
configuration-management tool. These tools automate the process of
ensuring that all your nodes are configured properly and encourage you
to maintain your configuration information (such as packages and
configuration options) in a version-controlled repository.

.. note::

   Several configuration-management tools are available, and this guide
   does not recommend a specific one. The two most popular ones in the
   OpenStack community are `Puppet <https://puppetlabs.com/>`_, with
   available `OpenStack Puppet modules
   <https://github.com/puppetlabs/puppetlabs-openstack>`_; and
   `Chef <http://www.getchef.com/chef/>`_, with available `OpenStack
   Chef recipes <https://github.com/opscode/openstack-chef-repo>`_.
   Other newer configuration tools include
   `Juju <https://juju.ubuntu.com/>`_,
   `Ansible <https://www.ansible.com/>`_, and
   `Salt <http://www.saltstack.com/>`_; and more mature configuration
   management tools include `CFEngine <http://cfengine.com/>`_ and
   `Bcfg2 <http://bcfg2.org/>`_.

doc/ops-guide/source/ops_maintenance_controller.rst (new file)

===========================================================
Cloud Controller and Storage Proxy Failures and Maintenance
===========================================================

The cloud controller and storage proxy are very similar to each other
when it comes to expected and unexpected downtime. One of each server
type typically runs in the cloud, which makes them very noticeable when
they are not running.

For the cloud controller, the good news is if your cloud is using the
FlatDHCP multi-host HA network mode, existing instances and volumes
continue to operate while the cloud controller is offline. For the
storage proxy, however, no storage traffic is possible until it is back
up and running.

Planned Maintenance
~~~~~~~~~~~~~~~~~~~

One way to plan for cloud controller or storage proxy maintenance is to
simply do it off-hours, such as at 1 a.m. or 2 a.m. This strategy
affects fewer users. If your cloud controller or storage proxy is too
important to have unavailable at any point in time, you must look into
high-availability options.

Rebooting a Cloud Controller or Storage Proxy
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All in all, just issue the :command:`reboot` command. The operating system
cleanly shuts down services and then automatically reboots. If you want
to be very thorough, run your backup jobs just before you reboot.

After a cloud controller reboots, ensure that all required services were
successfully started. The following commands use :command:`ps` and
:command:`grep` to determine if nova, glance, keystone, and cinder are
currently running:

.. code-block:: console

   # ps aux | grep nova-
   # ps aux | grep glance-
   # ps aux | grep keystone
   # ps aux | grep cinder

Also check that all services are functioning. The following set of
commands sources the ``openrc`` file, then runs some basic glance, nova,
and openstack commands. If the commands work as expected, you can be
confident that those services are in working condition:

.. code-block:: console

   # source openrc
   # glance index
   # nova list
   # openstack project list

For the storage proxy, ensure that the :term:`Object Storage service` has
resumed:

.. code-block:: console

   # ps aux | grep swift

Also check that it is functioning:

.. code-block:: console

   # swift stat

Total Cloud Controller Failure
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The cloud controller could completely fail if, for example, its
motherboard goes bad. Users will immediately notice the loss of a cloud
controller since it provides core functionality to your cloud
environment. If your infrastructure monitoring does not alert you that
your cloud controller has failed, your users definitely will.
Unfortunately, this is a rough situation. The cloud controller is an
integral part of your cloud. If you have only one controller, you will
have many missing services if it goes down.

To avoid this situation, create a highly available cloud controller
cluster. This is outside the scope of this document, but you can read
more in the `OpenStack High Availability
Guide <http://docs.openstack.org/ha-guide/index.html>`_.

The next best approach is to use a configuration-management tool, such
as Puppet, to automatically build a cloud controller. This should not
take more than 15 minutes if you have a spare server available. After
the controller rebuilds, restore any backups taken
(see :doc:`ops_backup_recovery`).

Also, in practice, the ``nova-compute`` services on the compute nodes do
not always reconnect cleanly to rabbitmq hosted on the controller when
it comes back up after a long reboot; a restart of the nova services on
the compute nodes is required.
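
As a minimal sketch, assuming Upstart-managed services as elsewhere in
this chapter, that restart on each compute node looks like:

.. code-block:: console

   # restart nova-compute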

doc/ops-guide/source/ops_maintenance_database.rst (new file)

=========
Databases
=========

Almost all OpenStack components have an underlying database to store
persistent information. Usually this database is MySQL. Normal MySQL
administration is applicable to these databases. OpenStack does not
configure the databases out of the ordinary. Basic administration
includes performance tweaking, high availability, backup, recovery, and
repairing. For more information, see a standard MySQL administration guide.

You can perform a couple of tricks with the database to either more
quickly retrieve information or fix a data inconsistency error, for
example, an instance was terminated, but the status was not updated in
the database. These tricks are discussed throughout this book.

Database Connectivity
~~~~~~~~~~~~~~~~~~~~~

Review the component's configuration file to see how each OpenStack
component accesses its corresponding database. Look for either
``sql_connection`` or simply ``connection``. The following command uses
``grep`` to display the SQL connection string for nova, glance, cinder,
and keystone:

.. code-block:: console

   # grep -hE "connection ?=" /etc/nova/nova.conf /etc/glance/glance-*.conf \
     /etc/cinder/cinder.conf /etc/keystone/keystone.conf
   sql_connection = mysql+pymysql://nova:nova@cloud.alberta.sandbox.cybera.ca/nova
   sql_connection = mysql+pymysql://glance:password@cloud.example.com/glance
   sql_connection = mysql+pymysql://glance:password@cloud.example.com/glance
   sql_connection = mysql+pymysql://cinder:password@cloud.example.com/cinder
   connection = mysql+pymysql://keystone_admin:password@cloud.example.com/keystone

The connection strings take this format:

.. code-block:: console

   mysql+pymysql://<username>:<password>@<hostname>/<database name>
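
If you suspect a connectivity problem, the same credentials can be used
to test the connection directly with the ``mysql`` client; the host and
database names below are taken from the example output above and are
illustrative:

.. code-block:: console

   # mysql -u cinder -p -h cloud.example.com cinder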

Performance and Optimization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

As your cloud grows, MySQL is utilized more and more. If you suspect
that MySQL might be becoming a bottleneck, you should start researching
MySQL optimization. The MySQL manual has an entire section dedicated to
this topic: `Optimization Overview
<http://dev.mysql.com/doc/refman/5.5/en/optimize-overview.html>`_.

doc/ops-guide/source/ops_maintenance_determine.rst (new file)

=====================================
Determining Which Component Is Broken
=====================================

OpenStack's collection of different components interact with each other
strongly. For example, uploading an image requires interaction from
``nova-api``, ``glance-api``, ``glance-registry``, keystone, and
potentially ``swift-proxy``. As a result, it is sometimes difficult to
determine exactly where problems lie. Assisting in this is the purpose
of this section.

Tailing Logs
~~~~~~~~~~~~

The first place to look is the log file related to the command you are
trying to run. For example, if ``nova list`` is failing, try tailing a
nova log file and running the command again:

Terminal 1:

.. code-block:: console

   # tail -f /var/log/nova/nova-api.log

Terminal 2:

.. code-block:: console

   # nova list

Look for any errors or traces in the log file. For more information, see
:doc:`ops_logging_monitoring`.

If the error indicates that the problem is with another component,
switch to tailing that component's log file. For example, if nova cannot
access glance, look at the ``glance-api`` log:

Terminal 1:

.. code-block:: console

   # tail -f /var/log/glance/api.log

Terminal 2:

.. code-block:: console

   # nova list

Wash, rinse, and repeat until you find the core cause of the problem.

Running Daemons on the CLI
~~~~~~~~~~~~~~~~~~~~~~~~~~

Unfortunately, sometimes the error is not apparent from the log files.
In this case, switch tactics and use a different command; maybe run the
service directly on the command line. For example, if the ``glance-api``
service refuses to start and stay running, try launching the daemon from
the command line:

.. code-block:: console

   # sudo -u glance -H glance-api

This might print the error and cause of the problem.

.. note::

   The ``-H`` flag is required when running the daemons with sudo
   because some daemons will write files relative to the user's home
   directory, and this write may fail if ``-H`` is left off.

.. tip::

   **Example of Complexity**

   One morning, a compute node failed to run any instances. The log files
   were a bit vague, claiming that a certain instance was unable to be
   started. This ended up being a red herring because the instance was
   simply the first instance in alphabetical order, so it was the first
   instance that ``nova-compute`` would touch.

   Further troubleshooting showed that libvirt was not running at all.
   This made more sense. If libvirt wasn't running, then no instance
   could be virtualized through KVM. Upon trying to start libvirt, it
   would silently die immediately. The libvirt logs did not explain why.

   Next, the ``libvirtd`` daemon was run on the command line. Finally, a
   helpful error message appeared: it could not connect to d-bus. As
   ridiculous as it sounds, libvirt, and thus ``nova-compute``, relies on
   d-bus, and somehow d-bus crashed. Simply starting d-bus set the entire
   chain back on track, and soon everything was back up and running.

doc/ops-guide/source/ops_maintenance_hardware.rst (new file)

=====================
Working with Hardware
=====================

As with your initial deployment, you should ensure that all hardware is
appropriately burned in before adding it to production. Run software
that uses the hardware to its limits, maxing out RAM, CPU, disk, and
network. Many options are available, and normally double as benchmark
software, so you also get a good idea of the performance of your
system.
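
As one example (the tool choice is an assumption, not a recommendation
from this guide), the ``stress`` utility packaged by most distributions
can load CPU, memory, I/O, and disk simultaneously:

.. code-block:: console

   # stress --cpu 8 --io 4 --vm 2 --vm-bytes 1024M --hdd 2 --timeout 3600s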

Adding a Compute Node
~~~~~~~~~~~~~~~~~~~~~

If you find that you have reached or are reaching the capacity limit of
your computing resources, you should plan to add additional compute
nodes. Adding more nodes is quite easy. The process for adding compute
nodes is the same as when the initial compute nodes were deployed to
your cloud: use an automated deployment system to bootstrap the
bare-metal server with the operating system and then have a
configuration-management system install and configure OpenStack Compute.
Once the Compute service has been installed and configured in the same
way as the other compute nodes, it automatically attaches itself to the
cloud. The cloud controller notices the new node(s) and begins
scheduling instances to launch there.

If your OpenStack Block Storage nodes are separate from your compute
nodes, the same procedure still applies because the same queuing and
polling system is used in both services.

We recommend that you use the same hardware for new compute and block
storage nodes. At the very least, ensure that the CPUs are similar in
the compute nodes to not break live migration.

Adding an Object Storage Node
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Adding a new object storage node is different from adding compute or
block storage nodes. You still want to initially configure the server by
using your automated deployment and configuration-management systems.
After that is done, you need to add the local disks of the object
storage node into the object storage ring. The exact command to do this
is the same command that was used to add the initial disks to the ring.
Simply rerun this command on the object storage proxy server for all
disks on the new object storage node. Once this has been done, rebalance
the ring and copy the resulting ring files to the other storage nodes.
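
As an illustration only, adding a single disk and rebalancing might look
like the following; the region, zone, IP address, port, device name, and
weight are all environment specific:

.. code-block:: console

   # swift-ring-builder object.builder add r1z2-10.0.0.10:6000/sdb 100
   # swift-ring-builder object.builder rebalance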

.. note::

   If your new object storage node has a different number of disks than
   the original nodes have, the command to add the new node is
   different from the original commands. These parameters vary from
   environment to environment.

Replacing Components
~~~~~~~~~~~~~~~~~~~~

Failures of hardware are common in large-scale deployments such as an
infrastructure cloud. Consider your processes and balance time saving
against availability. For example, an Object Storage cluster can easily
live with dead disks in it for some period of time if it has sufficient
capacity. Or, if your compute installation is not full, you could
consider live migrating instances off a host with a RAM failure until
you have time to deal with the problem.

doc/ops-guide/source/ops_maintenance_hdmwy.rst (new file)

=====
HDWMY
=====

Here's a quick list of various to-do items for each hour, day, week,
month, and year. Please note that these tasks are neither required nor
definitive, but helpful ideas:

Hourly
~~~~~~

* Check your monitoring system for alerts and act on them.
* Check your ticket queue for new tickets.

Daily
~~~~~

* Check for instances in a failed or weird state and investigate why.
* Check for security patches and apply them as needed.

Weekly
~~~~~~

* Check cloud usage:

  * User quotas
  * Disk space
  * Image usage
  * Large instances
  * Network usage (bandwidth and IP usage)

* Verify your alert mechanisms are still working.

Monthly
~~~~~~~

* Check usage and trends over the past month.
* Check for user accounts that should be removed.
* Check for operator accounts that should be removed.

Quarterly
~~~~~~~~~

* Review usage and trends over the past quarter.
* Prepare any quarterly reports on usage and statistics.
* Review and plan any necessary cloud additions.
* Review and plan any major OpenStack upgrades.

Semiannually
~~~~~~~~~~~~

* Upgrade OpenStack.
* Clean up after an OpenStack upgrade (any unused or new services to be
  aware of?).

doc/ops-guide/source/ops_maintenance_slow.rst (new file)

=========================================
What to do when things are running slowly
=========================================

When you are getting slow responses from various services, it can be
hard to know where to start looking. The first thing to check is the
extent of the slowness: is it specific to a single service, or varied
among different services? If your problem is isolated to a specific
service, it can temporarily be fixed by restarting the service, but that
is often only a fix for the symptom and not the actual problem.

This is a collection of ideas from experienced operators on common
things to look at that may be the cause of slowness. It is not, however,
designed to be an exhaustive list.

OpenStack Identity service
~~~~~~~~~~~~~~~~~~~~~~~~~~

If OpenStack :term:`Identity service` is responding slowly, it could be due
to the token table getting large. This can be fixed by running the
:command:`keystone-manage token_flush` command.
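
Many operators schedule the flush rather than running it by hand. As a
sketch, assuming ``cron`` is available and with an illustrative log
path, a root crontab entry could look like:

.. code-block:: console

   # crontab -l
   @hourly keystone-manage token_flush >/var/log/keystone/token-flush.log 2>&1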

Additionally, for Identity-related issues, try the tips
in :ref:`sql_backend`.

OpenStack Image service
~~~~~~~~~~~~~~~~~~~~~~~

OpenStack :term:`Image service` can be slowed down by things related to the
Identity service, but the Image service itself can be slowed down if
connectivity to the back-end storage in use is slow or otherwise
problematic. For example, your back-end NFS server might have gone down.

OpenStack Block Storage service
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

OpenStack :term:`Block Storage service` is similar to the Image service, so
start by checking Identity-related services, and the back-end storage.
Additionally, both the Block Storage and Image services rely on AMQP and
SQL functionality, so consider these when debugging.

OpenStack Compute service
~~~~~~~~~~~~~~~~~~~~~~~~~

Services related to OpenStack Compute are normally fairly fast and rely
on a couple of back-end services: Identity (for authentication and
authorization) and AMQP (for interoperability). Any slowness related to
services is normally related to one of these. Also, as with all other
services, SQL is used extensively.

OpenStack Networking service
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Slowness in the OpenStack :term:`Networking service` can be caused by services
that it relies upon, but it can also be related to either physical or
virtual networking. For example: network namespaces that do not exist or
are not tied to interfaces correctly; DHCP daemons that have hung or are
not running; a cable being physically disconnected; a switch not being
configured correctly. When debugging Networking service problems, begin
by verifying all physical networking functionality (switch
configuration, physical cabling, etc.). After the physical networking is
verified, check to be sure all of the Networking services are running
(neutron-server, neutron-dhcp-agent, etc.), then check on AMQP and SQL
back ends.

AMQP broker
~~~~~~~~~~~

Regardless of which AMQP broker you use, such as RabbitMQ, there are
common issues which not only slow down operations, but can also cause
real problems. Sometimes messages queued for services stay on the queues
and are not consumed. This can be due to dead or stagnant services and
can commonly be cleared up by either restarting the AMQP-related
services or the OpenStack service in question.
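
With RabbitMQ, one quick way to spot stuck queues is to list queue
depths and consumer counts; a queue whose message count keeps growing
while it has zero consumers is a warning sign. A sketch:

.. code-block:: console

   # rabbitmqctl list_queues name messages consumers | sort -k2 -n | tail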

.. _sql_backend:

SQL back end
~~~~~~~~~~~~

Whether you use SQLite or an RDBMS (such as MySQL), SQL interoperability
is essential to a functioning OpenStack environment. A large or
fragmented SQLite file can cause slowness when using files as a back
end. A locked or long-running query can cause delays for most RDBMS
services. In this case, do not kill the query immediately, but look into
it to see if it is a problem with something that is hung, or something
that is just taking a long time to run and needs to finish on its own.
The administration of an RDBMS is outside the scope of this document,
but it should be noted that a properly functioning RDBMS is essential to
most OpenStack services.
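
On MySQL, a quick way to look for such long-running or locked queries is
the process list; the ``Time`` column shows how long each statement has
been running:

.. code-block:: mysql

   mysql> SHOW FULL PROCESSLIST;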

doc/ops-guide/source/ops_maintenance_storage.rst (new file)

=====================================
Storage Node Failures and Maintenance
=====================================

Because of the high redundancy of Object Storage, dealing with object
storage node issues is a lot easier than dealing with compute node
issues.

Rebooting a Storage Node
~~~~~~~~~~~~~~~~~~~~~~~~

If a storage node requires a reboot, simply reboot it. Requests for data
hosted on that node are redirected to other copies while the server is
rebooting.

Shutting Down a Storage Node
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If you need to shut down a storage node for an extended period of time
(one or more days), consider removing the node from the storage ring.
For example:

.. code-block:: console

   # swift-ring-builder account.builder remove <ip address of storage node>
   # swift-ring-builder container.builder remove <ip address of storage node>
   # swift-ring-builder object.builder remove <ip address of storage node>
   # swift-ring-builder account.builder rebalance
   # swift-ring-builder container.builder rebalance
   # swift-ring-builder object.builder rebalance

Next, redistribute the ring files to the other nodes:

.. code-block:: console

   # for i in s01.example.com s02.example.com s03.example.com
   > do
   >   scp *.ring.gz $i:/etc/swift
   > done

These actions effectively take the storage node out of the storage
cluster.

When the node is able to rejoin the cluster, just add it back to the
ring. The exact syntax you use to add a node to your swift cluster with
``swift-ring-builder`` depends heavily on the options used when you
originally created your cluster. Please refer back to those commands.

Replacing a Swift Disk
~~~~~~~~~~~~~~~~~~~~~~

If a hard drive fails in an Object Storage node, replacing it is
relatively easy. This assumes that your Object Storage environment is
configured correctly, where the data that is stored on the failed drive
is also replicated to other drives in the Object Storage environment.

This example assumes that ``/dev/sdb`` has failed.

First, unmount the disk:

.. code-block:: console

   # umount /dev/sdb

Next, physically remove the disk from the server and replace it with a
working disk.

Ensure that the operating system has recognized the new disk:

.. code-block:: console

   # dmesg | tail

You should see a message about ``/dev/sdb``.

Because it is recommended to not use partitions on a swift disk, simply
format the disk as a whole:

.. code-block:: console

   # mkfs.xfs /dev/sdb
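
The ``mount -a`` in the next step relies on the disk already having an
entry in ``/etc/fstab``. The mount point and options below are an
illustrative example of a typical swift entry, not a requirement:

.. code-block:: console

   /dev/sdb /srv/node/sdb xfs noatime,nodiratime,logbufs=8 0 0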

Finally, mount the disk:

.. code-block:: console

   # mount -a

Swift should notice the new disk and that no data exists. It then begins
replicating the data to the disk from the other existing replicas.

doc/ops-guide/source/ops_uninstall.rst (new file)

============
Uninstalling
============

While we'd always recommend using your automated deployment system to
reinstall systems from scratch, sometimes you do need to remove
OpenStack from a system the hard way. Here's how:

* Remove all packages.
* Remove remaining files.
* Remove databases.

These steps depend on your underlying distribution, but in general you
should be looking for :command:`purge` commands in your package manager, like
:command:`aptitude purge ~c $package`. Following this, you can look for
orphaned files in the directories referenced throughout this guide. To
uninstall the database properly, refer to the manual appropriate for the
product in use.