From 8760c94427adbd0d4b526cfac6eb834e8fa5a5d6 Mon Sep 17 00:00:00 2001 From: KATO Tomoyuki Date: Sun, 8 May 2016 19:44:10 +0900 Subject: [PATCH] [ops-guide] Cleanup maintenance chapter Change-Id: I421caf2a12ab192d4df6d5c197e2c5dfb1c9c9bb Implements: blueprint ops-guide-rst --- doc/ops-guide/source/ops_maintenance.rst | 1058 +---------------- .../source/ops_maintenance_complete.rst | 50 + .../source/ops_maintenance_compute.rst | 401 +++++++ .../source/ops_maintenance_configuration.rst | 27 + .../source/ops_maintenance_controller.rst | 96 ++ .../source/ops_maintenance_database.rst | 49 + .../source/ops_maintenance_determine.rst | 92 ++ .../source/ops_maintenance_hardware.rst | 64 + .../source/ops_maintenance_hdmwy.rst | 54 + doc/ops-guide/source/ops_maintenance_slow.rst | 90 ++ .../source/ops_maintenance_storage.rst | 91 ++ doc/ops-guide/source/ops_uninstall.rst | 18 + 12 files changed, 1047 insertions(+), 1043 deletions(-) create mode 100644 doc/ops-guide/source/ops_maintenance_complete.rst create mode 100644 doc/ops-guide/source/ops_maintenance_compute.rst create mode 100644 doc/ops-guide/source/ops_maintenance_configuration.rst create mode 100644 doc/ops-guide/source/ops_maintenance_controller.rst create mode 100644 doc/ops-guide/source/ops_maintenance_database.rst create mode 100644 doc/ops-guide/source/ops_maintenance_determine.rst create mode 100644 doc/ops-guide/source/ops_maintenance_hardware.rst create mode 100644 doc/ops-guide/source/ops_maintenance_hdmwy.rst create mode 100644 doc/ops-guide/source/ops_maintenance_slow.rst create mode 100644 doc/ops-guide/source/ops_maintenance_storage.rst create mode 100644 doc/ops-guide/source/ops_uninstall.rst diff --git a/doc/ops-guide/source/ops_maintenance.rst b/doc/ops-guide/source/ops_maintenance.rst index 18142830eb..df937796c4 100644 --- a/doc/ops-guide/source/ops_maintenance.rst +++ b/doc/ops-guide/source/ops_maintenance.rst @@ -2,1049 +2,21 @@ Maintenance, Failures, and Debugging ==================================== +.. toctree:: + :maxdepth: 2 + + ops_maintenance_controller.rst + ops_maintenance_compute.rst + ops_maintenance_storage.rst + ops_maintenance_complete.rst + ops_maintenance_configuration.rst + ops_maintenance_hardware.rst + ops_maintenance_database.rst + ops_maintenance_hdmwy.rst + ops_maintenance_determine.rst + ops_maintenance_slow.rst + ops_uninstall.rst + Downtime, whether planned or unscheduled, is a certainty when running a cloud. This chapter aims to provide useful information for dealing proactively, or reactively, with these occurrences. - -Cloud Controller and Storage Proxy Failures and Maintenance -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -The cloud controller and storage proxy are very similar to each other -when it comes to expected and unexpected downtime. One of each server -type typically runs in the cloud, which makes them very noticeable when -they are not running. - -For the cloud controller, the good news is if your cloud is using the -FlatDHCP multi-host HA network mode, existing instances and volumes -continue to operate while the cloud controller is offline. For the -storage proxy, however, no storage traffic is possible until it is back -up and running. - -Planned Maintenance -------------------- - -One way to plan for cloud controller or storage proxy maintenance is to -simply do it off-hours, such as at 1 a.m. or 2 a.m. This strategy -affects fewer users. 
If your cloud controller or storage proxy is too -important to have unavailable at any point in time, you must look into -high-availability options. - -Rebooting a Cloud Controller or Storage Proxy ---------------------------------------------- - -All in all, just issue the :command:`reboot` command. The operating system -cleanly shuts down services and then automatically reboots. If you want -to be very thorough, run your backup jobs just before you -reboot. - -After a cloud controller reboots, ensure that all required services were -successfully started. The following commands use :command:`ps` and -:command:`grep` to determine if nova, glance, and keystone are currently -running: - -.. code-block:: console - - # ps aux | grep nova- - # ps aux | grep glance- - # ps aux | grep keystone - # ps aux | grep cinder - -Also check that all services are functioning. The following set of -commands sources the ``openrc`` file, then runs some basic glance, nova, -and openstack commands. If the commands work as expected, you can be -confident that those services are in working condition: - -.. code-block:: console - - # source openrc - # glance index - # nova list - # openstack project list - -For the storage proxy, ensure that the :term:`Object Storage service` has -resumed: - -.. code-block:: console - - # ps aux | grep swift - -Also check that it is functioning: - -.. code-block:: console - - # swift stat - -Total Cloud Controller Failure ------------------------------- - -The cloud controller could completely fail if, for example, its -motherboard goes bad. Users will immediately notice the loss of a cloud -controller since it provides core functionality to your cloud -environment. If your infrastructure monitoring does not alert you that -your cloud controller has failed, your users definitely will. -Unfortunately, this is a rough situation. The cloud controller is an -integral part of your cloud. If you have only one controller, you will -have many missing services if it goes down. - -To avoid this situation, create a highly available cloud controller -cluster. This is outside the scope of this document, but you can read -more in the `OpenStack High Availability -Guide `_. - -The next best approach is to use a configuration-management tool, such -as Puppet, to automatically build a cloud controller. This should not -take more than 15 minutes if you have a spare server available. After -the controller rebuilds, restore any backups taken -(see :doc:`ops_backup_recovery`). - -Also, in practice, the ``nova-compute`` services on the compute nodes do -not always reconnect cleanly to rabbitmq hosted on the controller when -it comes back up after a long reboot; a restart on the nova services on -the compute nodes is required. - -Compute Node Failures and Maintenance -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Sometimes a compute node either crashes unexpectedly or requires a -reboot for maintenance reasons. - -If you need to reboot a compute node due to planned maintenance (such as -a software or hardware upgrade), first ensure that all hosted instances -have been moved off the node. If your cloud is utilizing shared storage, -use the :command:`nova live-migration` command. First, get a list of instances -that need to be moved: - -.. code-block:: console - - # nova list --host c01.example.com --all-tenants - -Next, migrate them one by one: - -.. code-block:: console - - # nova live-migration c02.example.com - -If you are not using shared storage, you can use the -:option:`--block-migrate` option: - -.. 
code-block:: console - - # nova live-migration --block-migrate c02.example.com - -After you have migrated all instances, ensure that the ``nova-compute`` -service has stopped: - -.. code-block:: console - - # stop nova-compute - -If you use a configuration-management system, such as Puppet, that -ensures the ``nova-compute`` service is always running, you can -temporarily move the ``init`` files: - -.. code-block:: console - - # mkdir /root/tmp - # mv /etc/init/nova-compute.conf /root/tmp - # mv /etc/init.d/nova-compute /root/tmp - -Next, shut down your compute node, perform your maintenance, and turn -the node back on. You can reenable the ``nova-compute`` service by -undoing the previous commands: - -.. code-block:: console - - # mv /root/tmp/nova-compute.conf /etc/init - # mv /root/tmp/nova-compute /etc/init.d/ - -Then start the ``nova-compute`` service: - -.. code-block:: console - - # start nova-compute - -You can now optionally migrate the instances back to their original -compute node. - -After a Compute Node Reboots ----------------------------- - -When you reboot a compute node, first verify that it booted -successfully. This includes ensuring that the ``nova-compute`` service -is running: - -.. code-block:: console - - # ps aux | grep nova-compute - # status nova-compute - -Also ensure that it has successfully connected to the AMQP server: - -.. code-block:: console - - # grep AMQP /var/log/nova/nova-compute - 2013-02-26 09:51:31 12427 INFO nova.openstack.common.rpc.common [-] Connected to AMQP server on 199.116.232.36:5672 - -After the compute node is successfully running, you must deal with the -instances that are hosted on that compute node because none of them are -running. Depending on your SLA with your users or customers, you might -have to start each instance and ensure that they start correctly. - -Instances ---------- - -You can create a list of instances that are hosted on the compute node -by performing the following command: - -.. code-block:: console - - # nova list --host c01.example.com --all-tenants - -After you have the list, you can use the :command:`nova` command to start each -instance: - -.. code-block:: console - - # nova reboot - -.. note:: - - Any time an instance shuts down unexpectedly, it might have problems - on boot. For example, the instance might require an ``fsck`` on the - root partition. If this happens, the user can use the dashboard VNC - console to fix this. - -If an instance does not boot, meaning ``virsh list`` never shows the -instance as even attempting to boot, do the following on the compute -node: - -.. code-block:: console - - # tail -f /var/log/nova/nova-compute.log - -Try executing the :command:`nova reboot` command again. You should see an -error message about why the instance was not able to boot - -In most cases, the error is the result of something in libvirt's XML -file (``/etc/libvirt/qemu/instance-xxxxxxxx.xml``) that no longer -exists. You can enforce re-creation of the XML file as well as rebooting -the instance by running the following command: - -.. code-block:: console - - # nova reboot --hard - -Inspecting and Recovering Data from Failed Instances ----------------------------------------------------- - -In some scenarios, instances are running but are inaccessible through -SSH and do not respond to any command. The VNC console could be -displaying a boot failure or kernel panic error messages. This could be -an indication of file system corruption on the VM itself. 
If you need to -recover files or inspect the content of the instance, qemu-nbd can be -used to mount the disk. - -.. warning:: - - If you access or view the user's content and data, get approval - first! - -To access the instance's disk -(``/var/lib/nova/instances/instance-xxxxxx/disk``), use the following -steps: - -#. Suspend the instance using the ``virsh`` command. - -#. Connect the qemu-nbd device to the disk. - -#. Mount the qemu-nbd device. - -#. Unmount the device after inspecting. - -#. Disconnect the qemu-nbd device. - -#. Resume the instance. - -If you do not follow last three steps, OpenStack Compute cannot manage -the instance any longer. It fails to respond to any command issued by -OpenStack Compute, and it is marked as shut down. - -Once you mount the disk file, you should be able to access it and treat -it as a collection of normal directories with files and a directory -structure. However, we do not recommend that you edit or touch any files -because this could change the -:term:`access control lists (ACLs) ` that are used -to determine which accounts can perform what operations on files and -directories. Changing ACLs can make the instance unbootable if it is not -already. - -#. Suspend the instance using the :command:`virsh` command, taking note of the - internal ID: - - .. code-block:: console - - # virsh list - Id Name State - ---------------------------------- - 1 instance-00000981 running - 2 instance-000009f5 running - 30 instance-0000274a running - - # virsh suspend 30 - Domain 30 suspended - -#. Connect the qemu-nbd device to the disk: - - .. code-block:: console - - # cd /var/lib/nova/instances/instance-0000274a - # ls -lh - total 33M - -rw-rw---- 1 libvirt-qemu kvm 6.3K Oct 15 11:31 console.log - -rw-r--r-- 1 libvirt-qemu kvm 33M Oct 15 22:06 disk - -rw-r--r-- 1 libvirt-qemu kvm 384K Oct 15 22:06 disk.local - -rw-rw-r-- 1 nova nova 1.7K Oct 15 11:30 libvirt.xml - # qemu-nbd -c /dev/nbd0 `pwd`/disk - -#. Mount the qemu-nbd device. - - The qemu-nbd device tries to export the instance disk's different - partitions as separate devices. For example, if vda is the disk and - vda1 is the root partition, qemu-nbd exports the device as - ``/dev/nbd0`` and ``/dev/nbd0p1``, respectively: - - .. code-block:: console - - # mount /dev/nbd0p1 /mnt/ - - You can now access the contents of ``/mnt``, which correspond to the - first partition of the instance's disk. - - To examine the secondary or ephemeral disk, use an alternate mount - point if you want both primary and secondary drives mounted at the - same time: - - .. code-block:: console - - # umount /mnt - # qemu-nbd -c /dev/nbd1 `pwd`/disk.local - # mount /dev/nbd1 /mnt/ - - .. code-block:: console - - # ls -lh /mnt/ - total 76K - lrwxrwxrwx. 1 root root 7 Oct 15 00:44 bin -> usr/bin - dr-xr-xr-x. 4 root root 4.0K Oct 15 01:07 boot - drwxr-xr-x. 2 root root 4.0K Oct 15 00:42 dev - drwxr-xr-x. 70 root root 4.0K Oct 15 11:31 etc - drwxr-xr-x. 3 root root 4.0K Oct 15 01:07 home - lrwxrwxrwx. 1 root root 7 Oct 15 00:44 lib -> usr/lib - lrwxrwxrwx. 1 root root 9 Oct 15 00:44 lib64 -> usr/lib64 - drwx------. 2 root root 16K Oct 15 00:42 lost+found - drwxr-xr-x. 2 root root 4.0K Feb 3 2012 media - drwxr-xr-x. 2 root root 4.0K Feb 3 2012 mnt - drwxr-xr-x. 2 root root 4.0K Feb 3 2012 opt - drwxr-xr-x. 2 root root 4.0K Oct 15 00:42 proc - dr-xr-x---. 3 root root 4.0K Oct 15 21:56 root - drwxr-xr-x. 14 root root 4.0K Oct 15 01:07 run - lrwxrwxrwx. 1 root root 8 Oct 15 00:44 sbin -> usr/sbin - drwxr-xr-x. 
2 root root 4.0K Feb 3 2012 srv - drwxr-xr-x. 2 root root 4.0K Oct 15 00:42 sys - drwxrwxrwt. 9 root root 4.0K Oct 15 16:29 tmp - drwxr-xr-x. 13 root root 4.0K Oct 15 00:44 usr - drwxr-xr-x. 17 root root 4.0K Oct 15 00:44 var - -#. Once you have completed the inspection, unmount the mount point and - release the qemu-nbd device: - - .. code-block:: console - - # umount /mnt - # qemu-nbd -d /dev/nbd0 - /dev/nbd0 disconnected - -#. Resume the instance using :command:`virsh`: - - .. code-block:: console - - # virsh list - Id Name State - ---------------------------------- - 1 instance-00000981 running - 2 instance-000009f5 running - 30 instance-0000274a paused - - # virsh resume 30 - Domain 30 resumed - -.. _volumes: - -Volumes -------- - -If the affected instances also had attached volumes, first generate a -list of instance and volume UUIDs: - -.. code-block:: console - - mysql> select nova.instances.uuid as instance_uuid, - cinder.volumes.id as volume_uuid, cinder.volumes.status, - cinder.volumes.attach_status, cinder.volumes.mountpoint, - cinder.volumes.display_name from cinder.volumes - inner join nova.instances on cinder.volumes.instance_uuid=nova.instances.uuid - where nova.instances.host = 'c01.example.com'; - -You should see a result similar to the following: - -.. code-block:: console - - +--------------+------------+-------+--------------+-----------+--------------+ - |instance_uuid |volume_uuid |status |attach_status |mountpoint | display_name | - +--------------+------------+-------+--------------+-----------+--------------+ - |9b969a05 |1f0fbf36 |in-use |attached |/dev/vdc | test | - +--------------+------------+-------+--------------+-----------+--------------+ - 1 row in set (0.00 sec) - -Next, manually detach and reattach the volumes, where X is the proper -mount point: - -.. code-block:: console - - # nova volume-detach - # nova volume-attach /dev/vdX - -Be sure that the instance has successfully booted and is at a login -screen before doing the above. - -Total Compute Node Failure --------------------------- - -Compute nodes can fail the same way a cloud controller can fail. A -motherboard failure or some other type of hardware failure can cause an -entire compute node to go offline. When this happens, all instances -running on that compute node will not be available. Just like with a -cloud controller failure, if your infrastructure monitoring does not -detect a failed compute node, your users will notify you because of -their lost instances. - -If a compute node fails and won't be fixed for a few hours (or at all), -you can relaunch all instances that are hosted on the failed node if you -use shared storage for ``/var/lib/nova/instances``. - -To do this, generate a list of instance UUIDs that are hosted on the -failed node by running the following query on the nova database: - -.. code-block:: console - - mysql> select uuid from instances where host = \ - 'c01.example.com' and deleted = 0; - -Next, update the nova database to indicate that all instances that used -to be hosted on c01.example.com are now hosted on c02.example.com: - -.. code-block:: console - - mysql> update instances set host = 'c02.example.com' where host = \ - 'c01.example.com' and deleted = 0; - -If you're using the Networking service ML2 plug-in, update the -Networking service database to indicate that all ports that used to be -hosted on c01.example.com are now hosted on c02.example.com: - -.. 
code-block:: console - - mysql> update ml2_port_bindings set host = 'c02.example.com' where host = \ - 'c01.example.com'; - -.. code-block:: console - - mysql> update ml2_port_binding_levels set host = 'c02.example.com' where host = \ - 'c01.example.com'; - -After that, use the :command:`nova` command to reboot all instances that were -on c01.example.com while regenerating their XML files at the same time: - -.. code-block:: console - - # nova reboot --hard - -Finally, reattach volumes using the same method described in the section -:ref:`volumes`. - -var/lib/nova/instances ----------------------- - -It's worth mentioning this directory in the context of failed compute -nodes. This directory contains the libvirt KVM file-based disk images -for the instances that are hosted on that compute node. If you are not -running your cloud in a shared storage environment, this directory is -unique across all compute nodes. - -``/var/lib/nova/instances`` contains two types of directories. - -The first is the ``_base`` directory. This contains all the cached base -images from glance for each unique image that has been launched on that -compute node. Files ending in ``_20`` (or a different number) are the -ephemeral base images. - -The other directories are titled ``instance-xxxxxxxx``. These -directories correspond to instances running on that compute node. The -files inside are related to one of the files in the ``_base`` directory. -They're essentially differential-based files containing only the changes -made from the original ``_base`` directory. - -All files and directories in ``/var/lib/nova/instances`` are uniquely -named. The files in \_base are uniquely titled for the glance image that -they are based on, and the directory names ``instance-xxxxxxxx`` are -uniquely titled for that particular instance. For example, if you copy -all data from ``/var/lib/nova/instances`` on one compute node to -another, you do not overwrite any files or cause any damage to images -that have the same unique name, because they are essentially the same -file. - -Although this method is not documented or supported, you can use it when -your compute node is permanently offline but you have instances locally -stored on it. - -Storage Node Failures and Maintenance -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Because of the high redundancy of Object Storage, dealing with object -storage node issues is a lot easier than dealing with compute node -issues. - -Rebooting a Storage Node ------------------------- - -If a storage node requires a reboot, simply reboot it. Requests for data -hosted on that node are redirected to other copies while the server is -rebooting. - -If you need to shut down a storage node for an extended period of time -(one or more days), consider removing the node from the storage ring. -For example: - -.. code-block:: console - - # swift-ring-builder account.builder remove - # swift-ring-builder container.builder remove - # swift-ring-builder object.builder remove - # swift-ring-builder account.builder rebalance - # swift-ring-builder container.builder rebalance - # swift-ring-builder object.builder rebalance - -Next, redistribute the ring files to the other nodes: - -.. code-block:: console - - # for i in s01.example.com s02.example.com s03.example.com - > do - > scp *.ring.gz $i:/etc/swift - > done - -These actions effectively take the storage node out of the storage -cluster. - -When the node is able to rejoin the cluster, just add it back to the -ring. 
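As an illustration only (the zone, ports, device name, and weight below are
assumptions; match them to the values used when the ring was first built),
re-adding a node's device and rebalancing might look like:

.. code-block:: console

   # swift-ring-builder account.builder add z1-<ip address of storage node>:6002/sdb 100
   # swift-ring-builder container.builder add z1-<ip address of storage node>:6001/sdb 100
   # swift-ring-builder object.builder add z1-<ip address of storage node>:6000/sdb 100
   # swift-ring-builder account.builder rebalance
   # swift-ring-builder container.builder rebalance
   # swift-ring-builder object.builder rebalance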
The exact syntax you use to add a node to your swift cluster with -``swift-ring-builder`` heavily depends on the original options used when -you originally created your cluster. Please refer back to those -commands. - -Replacing a Swift Disk ----------------------- - -If a hard drive fails in an Object Storage node, replacing it is -relatively easy. This assumes that your Object Storage environment is -configured correctly, where the data that is stored on the failed drive -is also replicated to other drives in the Object Storage environment. - -This example assumes that ``/dev/sdb`` has failed. - -First, unmount the disk: - -.. code-block:: console - - # umount /dev/sdb - -Next, physically remove the disk from the server and replace it with a -working disk. - -Ensure that the operating system has recognized the new disk: - -.. code-block:: console - - # dmesg | tail - -You should see a message about ``/dev/sdb``. - -Because it is recommended to not use partitions on a swift disk, simply -format the disk as a whole: - -.. code-block:: console - - # mkfs.xfs /dev/sdb - -Finally, mount the disk: - -.. code-block:: console - - # mount -a - -Swift should notice the new disk and that no data exists. It then begins -replicating the data to the disk from the other existing replicas. - -Handling a Complete Failure -~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -A common way of dealing with the recovery from a full system failure, -such as a power outage of a data center, is to assign each service a -priority, and restore in order. The below table shows an example. - -.. list-table:: Example service restoration priority list - :widths: 50 50 - :header-rows: 1 - - * - Priority - - Services - * - 1 - - Internal network connectivity - * - 2 - - Backing storage services - * - 3 - - Public network connectivity for user virtual machines - * - 4 - - ``nova-compute``, ``nova-network``, cinder hosts - * - 5 - - User virtual machines - * - 10 - - Message queue and database services - * - 15 - - Keystone services - * - 20 - - ``cinder-scheduler`` - * - 21 - - Image Catalog and Delivery services - * - 22 - - ``nova-scheduler`` services - * - 98 - - ``cinder-api`` - * - 99 - - ``nova-api`` services - * - 100 - - Dashboard node - -Use this example priority list to ensure that user-affected services are -restored as soon as possible, but not before a stable environment is in -place. Of course, despite being listed as a single-line item, each step -requires significant work. For example, just after starting the -database, you should check its integrity, or, after starting the nova -services, you should verify that the hypervisor matches the database and -fix any mismatches. - -Configuration Management -~~~~~~~~~~~~~~~~~~~~~~~~ - -Maintaining an OpenStack cloud requires that you manage multiple -physical servers, and this number might grow over time. Because managing -nodes manually is error prone, we strongly recommend that you use a -configuration-management tool. These tools automate the process of -ensuring that all your nodes are configured properly and encourage you -to maintain your configuration information (such as packages and -configuration options) in a version-controlled repository. - -.. note:: - - Several configuration-management tools are available, and this guide - does not recommend a specific one. The two most popular ones in the - OpenStack community are `Puppet `_, with - available `OpenStack Puppet - modules `_; and - `Chef `_, with available `OpenStack - Chef recipes `_. 
- Other newer configuration tools include - `Juju `_, - `Ansible `_, and - `Salt `_; and more mature configuration - management tools include `CFEngine `_ and - `Bcfg2 `_. - -Working with Hardware -~~~~~~~~~~~~~~~~~~~~~ - -As for your initial deployment, you should ensure that all hardware is -appropriately burned in before adding it to production. Run software -that uses the hardware to its limits—maxing out RAM, CPU, disk, and -network. Many options are available, and normally double as benchmark -software, so you also get a good idea of the performance of your -system. - -Adding a Compute Node ---------------------- - -If you find that you have reached or are reaching the capacity limit of -your computing resources, you should plan to add additional compute -nodes. Adding more nodes is quite easy. The process for adding compute -nodes is the same as when the initial compute nodes were deployed to -your cloud: use an automated deployment system to bootstrap the -bare-metal server with the operating system and then have a -configuration-management system install and configure OpenStack Compute. -Once the Compute service has been installed and configured in the same -way as the other compute nodes, it automatically attaches itself to the -cloud. The cloud controller notices the new node(s) and begins -scheduling instances to launch there. - -If your OpenStack Block Storage nodes are separate from your compute -nodes, the same procedure still applies because the same queuing and -polling system is used in both services. - -We recommend that you use the same hardware for new compute and block -storage nodes. At the very least, ensure that the CPUs are similar in -the compute nodes to not break live migration. - -Adding an Object Storage Node ------------------------------ - -Adding a new object storage node is different from adding compute or -block storage nodes. You still want to initially configure the server by -using your automated deployment and configuration-management systems. -After that is done, you need to add the local disks of the object -storage node into the object storage ring. The exact command to do this -is the same command that was used to add the initial disks to the ring. -Simply rerun this command on the object storage proxy server for all -disks on the new object storage node. Once this has been done, rebalance -the ring and copy the resulting ring files to the other storage nodes. - -.. note:: - - If your new object storage node has a different number of disks than - the original nodes have, the command to add the new node is - different from the original commands. These parameters vary from - environment to environment. - -Replacing Components --------------------- - -Failures of hardware are common in large-scale deployments such as an -infrastructure cloud. Consider your processes and balance time saving -against availability. For example, an Object Storage cluster can easily -live with dead disks in it for some period of time if it has sufficient -capacity. Or, if your compute installation is not full, you could -consider live migrating instances off a host with a RAM failure until -you have time to deal with the problem. - -Databases -~~~~~~~~~ - -Almost all OpenStack components have an underlying database to store -persistent information. Usually this database is MySQL. Normal MySQL -administration is applicable to these databases. OpenStack does not -configure the databases out of the ordinary. 
Basic administration -includes performance tweaking, high availability, backup, recovery, and -repairing. For more information, see a standard MySQL administration guide. - -You can perform a couple of tricks with the database to either more -quickly retrieve information or fix a data inconsistency error—for -example, an instance was terminated, but the status was not updated in -the database. These tricks are discussed throughout this book. - -Database Connectivity ---------------------- - -Review the component's configuration file to see how each OpenStack -component accesses its corresponding database. Look for either -``sql_connection`` or simply ``connection``. The following command uses -``grep`` to display the SQL connection string for nova, glance, cinder, -and keystone: - -.. code-block:: console - - # grep -hE "connection ?=" /etc/nova/nova.conf /etc/glance/glance-*.conf - /etc/cinder/cinder.conf /etc/keystone/keystone.conf - sql_connection = mysql+pymysql://nova:nova@cloud.alberta.sandbox.cybera.ca/nova - sql_connection = mysql+pymysql://glance:password@cloud.example.com/glance - sql_connection = mysql+pymysql://glance:password@cloud.example.com/glance - sql_connection = mysql+pymysql://cinder:password@cloud.example.com/cinder - connection = mysql+pymysql://keystone_admin:password@cloud.example.com/keystone - -The connection strings take this format: - -.. code-block:: console - - mysql+pymysql:// : @ / - -Performance and Optimizing --------------------------- - -As your cloud grows, MySQL is utilized more and more. If you suspect -that MySQL might be becoming a bottleneck, you should start researching -MySQL optimization. The MySQL manual has an entire section dedicated to -this topic: `Optimization -Overview `_. - -HDWMY -~~~~~ - -Here's a quick list of various to-do items for each hour, day, week, -month, and year. Please note that these tasks are neither required nor -definitive but helpful ideas: - -Hourly ------- - -- Check your monitoring system for alerts and act on them. - -- Check your ticket queue for new tickets. - -Daily ------ - -- Check for instances in a failed or weird state and investigate why. - -- Check for security patches and apply them as needed. - -Weekly ------- - -- Check cloud usage: - - - User quotas - - - Disk space - - - Image usage - - - Large instances - - - Network usage (bandwidth and IP usage) - -- Verify your alert mechanisms are still working. - -Monthly -------- - -- Check usage and trends over the past month. - -- Check for user accounts that should be removed. - -- Check for operator accounts that should be removed. - -Quarterly ---------- - -- Review usage and trends over the past quarter. - -- Prepare any quarterly reports on usage and statistics. - -- Review and plan any necessary cloud additions. - -- Review and plan any major OpenStack upgrades. - -Semiannually ------------- - -- Upgrade OpenStack. - -- Clean up after an OpenStack upgrade (any unused or new services to be - aware of?). - -Determining Which Component Is Broken -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -OpenStack's collection of different components interact with each other -strongly. For example, uploading an image requires interaction from -``nova-api``, ``glance-api``, ``glance-registry``, keystone, and -potentially ``swift-proxy``. As a result, it is sometimes difficult to -determine exactly where problems lie. Assisting in this is the purpose -of this section. 
- -Tailing Logs ------------- - -The first place to look is the log file related to the command you are -trying to run. For example, if ``nova list`` is failing, try tailing a -nova log file and running the command again: - -Terminal 1: - -.. code-block:: console - - # tail -f /var/log/nova/nova-api.log - -Terminal 2: - -.. code-block:: console - - # nova list - -Look for any errors or traces in the log file. For more information, see -:doc:`ops_logging_monitoring`. - -If the error indicates that the problem is with another component, -switch to tailing that component's log file. For example, if nova cannot -access glance, look at the ``glance-api`` log: - -Terminal 1: - -.. code-block:: console - - # tail -f /var/log/glance/api.log - -Terminal 2: - -.. code-block:: console - - # nova list - -Wash, rinse, and repeat until you find the core cause of the problem. - -Running Daemons on the CLI --------------------------- - -Unfortunately, sometimes the error is not apparent from the log files. -In this case, switch tactics and use a different command; maybe run the -service directly on the command line. For example, if the ``glance-api`` -service refuses to start and stay running, try launching the daemon from -the command line: - -.. code-block:: console - - # sudo -u glance -H glance-api - -This might print the error and cause of the problem. - -.. note:: - - The ``-H`` flag is required when running the daemons with sudo - because some daemons will write files relative to the user's home - directory, and this write may fail if ``-H`` is left off. - -**Example of Complexity** - -One morning, a compute node failed to run any instances. The log files -were a bit vague, claiming that a certain instance was unable to be -started. This ended up being a red herring because the instance was -simply the first instance in alphabetical order, so it was the first -instance that ``nova-compute`` would touch. - -Further troubleshooting showed that libvirt was not running at all. This -made more sense. If libvirt wasn't running, then no instance could be -virtualized through KVM. Upon trying to start libvirt, it would silently -die immediately. The libvirt logs did not explain why. - -Next, the ``libvirtd`` daemon was run on the command line. Finally a -helpful error message: it could not connect to d-bus. As ridiculous as -it sounds, libvirt, and thus ``nova-compute``, relies on d-bus and -somehow d-bus crashed. Simply starting d-bus set the entire chain back -on track, and soon everything was back up and running. - -What to do when things are running slowly -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -When you are getting slow responses from various services, it can be -hard to know where to start looking. The first thing to check is the -extent of the slowness: is it specific to a single service, or varied -among different services? If your problem is isolated to a specific -service, it can temporarily be fixed by restarting the service, but that -is often only a fix for the symptom and not the actual problem. - -This is a collection of ideas from experienced operators on common -things to look at that may be the cause of slowness. It is not, however, -designed to be an exhaustive list. - -OpenStack Identity service --------------------------- - -If OpenStack :term:`Identity service` is responding slowly, it could be due -to the token table getting large. This can be fixed by running the -:command:`keystone-manage token_flush` command. 
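For example, you can run the flush by hand and, assuming a ``keystone``
system user exists on the controller, schedule it from cron so the table
does not grow unbounded again (the user name, path, and schedule shown here
are illustrative):

.. code-block:: console

   # su -s /bin/sh -c "keystone-manage token_flush" keystone
   # crontab -u keystone -l
   @hourly /usr/bin/keystone-manage token_flush >/dev/null 2>&1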
- -Additionally, for Identity-related issues, try the tips -in :ref:`sql_backend`. - -OpenStack Image service ------------------------ - -OpenStack :term:`Image service` can be slowed down by things related to the -Identity service, but the Image service itself can be slowed down if -connectivity to the back-end storage in use is slow or otherwise -problematic. For example, your back-end NFS server might have gone down. - -OpenStack Block Storage service -------------------------------- - -OpenStack :term:`Block Storage service` is similar to the Image service, so -start by checking Identity-related services, and the back-end storage. -Additionally, both the Block Storage and Image services rely on AMQP and -SQL functionality, so consider these when debugging. - -OpenStack Compute service -------------------------- - -Services related to OpenStack Compute are normally fairly fast and rely -on a couple of backend services: Identity for authentication and -authorization), and AMQP for interoperability. Any slowness related to -services is normally related to one of these. Also, as with all other -services, SQL is used extensively. - -OpenStack Networking service ----------------------------- - -Slowness in the OpenStack :term:`Networking service` can be caused by services -that it relies upon, but it can also be related to either physical or -virtual networking. For example: network namespaces that do not exist or -are not tied to interfaces correctly; DHCP daemons that have hung or are -not running; a cable being physically disconnected; a switch not being -configured correctly. When debugging Networking service problems, begin -by verifying all physical networking functionality (switch -configuration, physical cabling, etc.). After the physical networking is -verified, check to be sure all of the Networking services are running -(neutron-server, neutron-dhcp-agent, etc.), then check on AMQP and SQL -back ends. - -AMQP broker ------------ - -Regardless of which AMQP broker you use, such as RabbitMQ, there are -common issues which not only slow down operations, but can also cause -real problems. Sometimes messages queued for services stay on the queues -and are not consumed. This can be due to dead or stagnant services and -can be commonly cleared up by either restarting the AMQP-related -services or the OpenStack service in question. - -.. _sql_backend: - -SQL back end ------------- - -Whether you use SQLite or an RDBMS (such as MySQL), SQL interoperability -is essential to a functioning OpenStack environment. A large or -fragmented SQLite file can cause slowness when using files as a back -end. A locked or long-running query can cause delays for most RDBMS -services. In this case, do not kill the query immediately, but look into -it to see if it is a problem with something that is hung, or something -that is just taking a long time to run and needs to finish on its own. -The administration of an RDBMS is outside the scope of this document, -but it should be noted that a properly functioning RDBMS is essential to -most OpenStack services. - -Uninstalling -~~~~~~~~~~~~ - -While we'd always recommend using your automated deployment system to -reinstall systems from scratch, sometimes you do need to remove -OpenStack from a system the hard way. Here's how: - -- Remove all packages. - -- Remove remaining files. - -- Remove databases. 
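A rough sketch of these three steps on a Debian or Ubuntu system, taking
nova as the example service (the package globs, paths, and database name
are assumptions to verify against your deployment):

.. code-block:: console

   # apt-get purge nova-* python-nova
   # rm -rf /etc/nova /var/lib/nova /var/log/nova
   # mysql -u root -p -e "DROP DATABASE nova;"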
- -These steps depend on your underlying distribution, but in general you -should be looking for :command:`purge` commands in your package manager, like -:command:`aptitude purge ~c $package`. Following this, you can look for -orphaned files in the directories referenced throughout this guide. To -uninstall the database properly, refer to the manual appropriate for the -product in use. diff --git a/doc/ops-guide/source/ops_maintenance_complete.rst b/doc/ops-guide/source/ops_maintenance_complete.rst new file mode 100644 index 0000000000..822388e743 --- /dev/null +++ b/doc/ops-guide/source/ops_maintenance_complete.rst @@ -0,0 +1,50 @@ +=========================== +Handling a Complete Failure +=========================== + +A common way of dealing with the recovery from a full system failure, +such as a power outage of a data center, is to assign each service a +priority, and restore in order. +:ref:`table_example_priority` shows an example. + +.. _table_example_priority: + +.. list-table:: Table. Example service restoration priority list + :header-rows: 1 + + * - Priority + - Services + * - 1 + - Internal network connectivity + * - 2 + - Backing storage services + * - 3 + - Public network connectivity for user virtual machines + * - 4 + - ``nova-compute``, ``nova-network``, cinder hosts + * - 5 + - User virtual machines + * - 10 + - Message queue and database services + * - 15 + - Keystone services + * - 20 + - ``cinder-scheduler`` + * - 21 + - Image Catalog and Delivery services + * - 22 + - ``nova-scheduler`` services + * - 98 + - ``cinder-api`` + * - 99 + - ``nova-api`` services + * - 100 + - Dashboard node + +Use this example priority list to ensure that user-affected services are +restored as soon as possible, but not before a stable environment is in +place. Of course, despite being listed as a single-line item, each step +requires significant work. For example, just after starting the +database, you should check its integrity, or, after starting the nova +services, you should verify that the hypervisor matches the database and +fix any mismatches. diff --git a/doc/ops-guide/source/ops_maintenance_compute.rst b/doc/ops-guide/source/ops_maintenance_compute.rst new file mode 100644 index 0000000000..de92edc142 --- /dev/null +++ b/doc/ops-guide/source/ops_maintenance_compute.rst @@ -0,0 +1,401 @@ +===================================== +Compute Node Failures and Maintenance +===================================== + +Sometimes a compute node either crashes unexpectedly or requires a +reboot for maintenance reasons. + +Planned Maintenance +~~~~~~~~~~~~~~~~~~~ + +If you need to reboot a compute node due to planned maintenance (such as +a software or hardware upgrade), first ensure that all hosted instances +have been moved off the node. If your cloud is utilizing shared storage, +use the :command:`nova live-migration` command. First, get a list of instances +that need to be moved: + +.. code-block:: console + + # nova list --host c01.example.com --all-tenants + +Next, migrate them one by one: + +.. code-block:: console + + # nova live-migration c02.example.com + +If you are not using shared storage, you can use the +:option:`--block-migrate` option: + +.. code-block:: console + + # nova live-migration --block-migrate c02.example.com + +After you have migrated all instances, ensure that the ``nova-compute`` +service has stopped: + +.. 
code-block:: console + + # stop nova-compute + +If you use a configuration-management system, such as Puppet, that +ensures the ``nova-compute`` service is always running, you can +temporarily move the ``init`` files: + +.. code-block:: console + + # mkdir /root/tmp + # mv /etc/init/nova-compute.conf /root/tmp + # mv /etc/init.d/nova-compute /root/tmp + +Next, shut down your compute node, perform your maintenance, and turn +the node back on. You can reenable the ``nova-compute`` service by +undoing the previous commands: + +.. code-block:: console + + # mv /root/tmp/nova-compute.conf /etc/init + # mv /root/tmp/nova-compute /etc/init.d/ + +Then start the ``nova-compute`` service: + +.. code-block:: console + + # start nova-compute + +You can now optionally migrate the instances back to their original +compute node. + +After a Compute Node Reboots +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +When you reboot a compute node, first verify that it booted +successfully. This includes ensuring that the ``nova-compute`` service +is running: + +.. code-block:: console + + # ps aux | grep nova-compute + # status nova-compute + +Also ensure that it has successfully connected to the AMQP server: + +.. code-block:: console + + # grep AMQP /var/log/nova/nova-compute + 2013-02-26 09:51:31 12427 INFO nova.openstack.common.rpc.common [-] Connected to AMQP server on 199.116.232.36:5672 + +After the compute node is successfully running, you must deal with the +instances that are hosted on that compute node because none of them are +running. Depending on your SLA with your users or customers, you might +have to start each instance and ensure that they start correctly. + +Instances +~~~~~~~~~ + +You can create a list of instances that are hosted on the compute node +by performing the following command: + +.. code-block:: console + + # nova list --host c01.example.com --all-tenants + +After you have the list, you can use the :command:`nova` command to start each +instance: + +.. code-block:: console + + # nova reboot + +.. note:: + + Any time an instance shuts down unexpectedly, it might have problems + on boot. For example, the instance might require an ``fsck`` on the + root partition. If this happens, the user can use the dashboard VNC + console to fix this. + +If an instance does not boot, meaning ``virsh list`` never shows the +instance as even attempting to boot, do the following on the compute +node: + +.. code-block:: console + + # tail -f /var/log/nova/nova-compute.log + +Try executing the :command:`nova reboot` command again. You should see an +error message about why the instance was not able to boot + +In most cases, the error is the result of something in libvirt's XML +file (``/etc/libvirt/qemu/instance-xxxxxxxx.xml``) that no longer +exists. You can enforce re-creation of the XML file as well as rebooting +the instance by running the following command: + +.. code-block:: console + + # nova reboot --hard + +Inspecting and Recovering Data from Failed Instances +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +In some scenarios, instances are running but are inaccessible through +SSH and do not respond to any command. The VNC console could be +displaying a boot failure or kernel panic error messages. This could be +an indication of file system corruption on the VM itself. If you need to +recover files or inspect the content of the instance, qemu-nbd can be +used to mount the disk. + +.. warning:: + + If you access or view the user's content and data, get approval first! 
+ +To access the instance's disk +(``/var/lib/nova/instances/instance-xxxxxx/disk``), use the following +steps: + +#. Suspend the instance using the ``virsh`` command. + +#. Connect the qemu-nbd device to the disk. + +#. Mount the qemu-nbd device. + +#. Unmount the device after inspecting. + +#. Disconnect the qemu-nbd device. + +#. Resume the instance. + +If you do not follow last three steps, OpenStack Compute cannot manage +the instance any longer. It fails to respond to any command issued by +OpenStack Compute, and it is marked as shut down. + +Once you mount the disk file, you should be able to access it and treat +it as a collection of normal directories with files and a directory +structure. However, we do not recommend that you edit or touch any files +because this could change the +:term:`access control lists (ACLs) ` that are used +to determine which accounts can perform what operations on files and +directories. Changing ACLs can make the instance unbootable if it is not +already. + +#. Suspend the instance using the :command:`virsh` command, taking note of the + internal ID: + + .. code-block:: console + + # virsh list + Id Name State + ---------------------------------- + 1 instance-00000981 running + 2 instance-000009f5 running + 30 instance-0000274a running + + # virsh suspend 30 + Domain 30 suspended + +#. Connect the qemu-nbd device to the disk: + + .. code-block:: console + + # cd /var/lib/nova/instances/instance-0000274a + # ls -lh + total 33M + -rw-rw---- 1 libvirt-qemu kvm 6.3K Oct 15 11:31 console.log + -rw-r--r-- 1 libvirt-qemu kvm 33M Oct 15 22:06 disk + -rw-r--r-- 1 libvirt-qemu kvm 384K Oct 15 22:06 disk.local + -rw-rw-r-- 1 nova nova 1.7K Oct 15 11:30 libvirt.xml + # qemu-nbd -c /dev/nbd0 `pwd`/disk + +#. Mount the qemu-nbd device. + + The qemu-nbd device tries to export the instance disk's different + partitions as separate devices. For example, if vda is the disk and + vda1 is the root partition, qemu-nbd exports the device as + ``/dev/nbd0`` and ``/dev/nbd0p1``, respectively: + + .. code-block:: console + + # mount /dev/nbd0p1 /mnt/ + + You can now access the contents of ``/mnt``, which correspond to the + first partition of the instance's disk. + + To examine the secondary or ephemeral disk, use an alternate mount + point if you want both primary and secondary drives mounted at the + same time: + + .. code-block:: console + + # umount /mnt + # qemu-nbd -c /dev/nbd1 `pwd`/disk.local + # mount /dev/nbd1 /mnt/ + # ls -lh /mnt/ + total 76K + lrwxrwxrwx. 1 root root 7 Oct 15 00:44 bin -> usr/bin + dr-xr-xr-x. 4 root root 4.0K Oct 15 01:07 boot + drwxr-xr-x. 2 root root 4.0K Oct 15 00:42 dev + drwxr-xr-x. 70 root root 4.0K Oct 15 11:31 etc + drwxr-xr-x. 3 root root 4.0K Oct 15 01:07 home + lrwxrwxrwx. 1 root root 7 Oct 15 00:44 lib -> usr/lib + lrwxrwxrwx. 1 root root 9 Oct 15 00:44 lib64 -> usr/lib64 + drwx------. 2 root root 16K Oct 15 00:42 lost+found + drwxr-xr-x. 2 root root 4.0K Feb 3 2012 media + drwxr-xr-x. 2 root root 4.0K Feb 3 2012 mnt + drwxr-xr-x. 2 root root 4.0K Feb 3 2012 opt + drwxr-xr-x. 2 root root 4.0K Oct 15 00:42 proc + dr-xr-x---. 3 root root 4.0K Oct 15 21:56 root + drwxr-xr-x. 14 root root 4.0K Oct 15 01:07 run + lrwxrwxrwx. 1 root root 8 Oct 15 00:44 sbin -> usr/sbin + drwxr-xr-x. 2 root root 4.0K Feb 3 2012 srv + drwxr-xr-x. 2 root root 4.0K Oct 15 00:42 sys + drwxrwxrwt. 9 root root 4.0K Oct 15 16:29 tmp + drwxr-xr-x. 13 root root 4.0K Oct 15 00:44 usr + drwxr-xr-x. 17 root root 4.0K Oct 15 00:44 var + +#. 
Once you have completed the inspection, unmount the mount point and + release the qemu-nbd device: + + .. code-block:: console + + # umount /mnt + # qemu-nbd -d /dev/nbd0 + /dev/nbd0 disconnected + +#. Resume the instance using :command:`virsh`: + + .. code-block:: console + + # virsh list + Id Name State + ---------------------------------- + 1 instance-00000981 running + 2 instance-000009f5 running + 30 instance-0000274a paused + + # virsh resume 30 + Domain 30 resumed + +.. _volumes: + +Volumes +~~~~~~~ + +If the affected instances also had attached volumes, first generate a +list of instance and volume UUIDs: + +.. code-block:: mysql + + mysql> select nova.instances.uuid as instance_uuid, + cinder.volumes.id as volume_uuid, cinder.volumes.status, + cinder.volumes.attach_status, cinder.volumes.mountpoint, + cinder.volumes.display_name from cinder.volumes + inner join nova.instances on cinder.volumes.instance_uuid=nova.instances.uuid + where nova.instances.host = 'c01.example.com'; + +You should see a result similar to the following: + +.. code-block:: mysql + + +--------------+------------+-------+--------------+-----------+--------------+ + |instance_uuid |volume_uuid |status |attach_status |mountpoint | display_name | + +--------------+------------+-------+--------------+-----------+--------------+ + |9b969a05 |1f0fbf36 |in-use |attached |/dev/vdc | test | + +--------------+------------+-------+--------------+-----------+--------------+ + 1 row in set (0.00 sec) + +Next, manually detach and reattach the volumes, where X is the proper +mount point: + +.. code-block:: console + + # nova volume-detach + # nova volume-attach /dev/vdX + +Be sure that the instance has successfully booted and is at a login +screen before doing the above. + +Total Compute Node Failure +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Compute nodes can fail the same way a cloud controller can fail. A +motherboard failure or some other type of hardware failure can cause an +entire compute node to go offline. When this happens, all instances +running on that compute node will not be available. Just like with a +cloud controller failure, if your infrastructure monitoring does not +detect a failed compute node, your users will notify you because of +their lost instances. + +If a compute node fails and won't be fixed for a few hours (or at all), +you can relaunch all instances that are hosted on the failed node if you +use shared storage for ``/var/lib/nova/instances``. + +To do this, generate a list of instance UUIDs that are hosted on the +failed node by running the following query on the nova database: + +.. code-block:: mysql + + mysql> select uuid from instances + where host = 'c01.example.com' and deleted = 0; + +Next, update the nova database to indicate that all instances that used +to be hosted on c01.example.com are now hosted on c02.example.com: + +.. code-block:: mysql + + mysql> update instances set host = 'c02.example.com' + where host = 'c01.example.com' and deleted = 0; + +If you're using the Networking service ML2 plug-in, update the +Networking service database to indicate that all ports that used to be +hosted on c01.example.com are now hosted on c02.example.com: + +.. 
code-block:: mysql + + mysql> update ml2_port_bindings set host = 'c02.example.com' + where host = 'c01.example.com'; + mysql> update ml2_port_binding_levels set host = 'c02.example.com' + where host = 'c01.example.com'; + +After that, use the :command:`nova` command to reboot all instances that were +on c01.example.com while regenerating their XML files at the same time: + +.. code-block:: console + + # nova reboot --hard + +Finally, reattach volumes using the same method described in the section +:ref:`volumes`. + +/var/lib/nova/instances +~~~~~~~~~~~~~~~~~~~~~~~ + +It's worth mentioning this directory in the context of failed compute +nodes. This directory contains the libvirt KVM file-based disk images +for the instances that are hosted on that compute node. If you are not +running your cloud in a shared storage environment, this directory is +unique across all compute nodes. + +``/var/lib/nova/instances`` contains two types of directories. + +The first is the ``_base`` directory. This contains all the cached base +images from glance for each unique image that has been launched on that +compute node. Files ending in ``_20`` (or a different number) are the +ephemeral base images. + +The other directories are titled ``instance-xxxxxxxx``. These +directories correspond to instances running on that compute node. The +files inside are related to one of the files in the ``_base`` directory. +They're essentially differential-based files containing only the changes +made from the original ``_base`` directory. + +All files and directories in ``/var/lib/nova/instances`` are uniquely +named. The files in \_base are uniquely titled for the glance image that +they are based on, and the directory names ``instance-xxxxxxxx`` are +uniquely titled for that particular instance. For example, if you copy +all data from ``/var/lib/nova/instances`` on one compute node to +another, you do not overwrite any files or cause any damage to images +that have the same unique name, because they are essentially the same +file. + +Although this method is not documented or supported, you can use it when +your compute node is permanently offline but you have instances locally +stored on it. diff --git a/doc/ops-guide/source/ops_maintenance_configuration.rst b/doc/ops-guide/source/ops_maintenance_configuration.rst new file mode 100644 index 0000000000..4c8302698a --- /dev/null +++ b/doc/ops-guide/source/ops_maintenance_configuration.rst @@ -0,0 +1,27 @@ +======================== +Configuration Management +======================== + +Maintaining an OpenStack cloud requires that you manage multiple +physical servers, and this number might grow over time. Because managing +nodes manually is error prone, we strongly recommend that you use a +configuration-management tool. These tools automate the process of +ensuring that all your nodes are configured properly and encourage you +to maintain your configuration information (such as packages and +configuration options) in a version-controlled repository. + +.. note:: + + Several configuration-management tools are available, and this guide + does not recommend a specific one. The two most popular ones in the + OpenStack community are `Puppet `_, with + available `OpenStack Puppet + modules `_; and + `Chef `_, with available `OpenStack + Chef recipes `_. + Other newer configuration tools include + `Juju `_, + `Ansible `_, and + `Salt `_; and more mature configuration + management tools include `CFEngine `_ and + `Bcfg2 `_. 
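Whichever tool you choose, even simple ad-hoc checks pay off. As a sketch
using Ansible with an inventory group named ``compute`` (both the tool and
the group name are assumptions), you can confirm in one command that the
``nova-compute`` service is running on every node, starting it where it is
not:

.. code-block:: console

   # ansible compute -m service -a "name=nova-compute state=started"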
diff --git a/doc/ops-guide/source/ops_maintenance_controller.rst b/doc/ops-guide/source/ops_maintenance_controller.rst new file mode 100644 index 0000000000..af9a34c285 --- /dev/null +++ b/doc/ops-guide/source/ops_maintenance_controller.rst @@ -0,0 +1,96 @@ +=========================================================== +Cloud Controller and Storage Proxy Failures and Maintenance +=========================================================== + +The cloud controller and storage proxy are very similar to each other +when it comes to expected and unexpected downtime. One of each server +type typically runs in the cloud, which makes them very noticeable when +they are not running. + +For the cloud controller, the good news is if your cloud is using the +FlatDHCP multi-host HA network mode, existing instances and volumes +continue to operate while the cloud controller is offline. For the +storage proxy, however, no storage traffic is possible until it is back +up and running. + +Planned Maintenance +~~~~~~~~~~~~~~~~~~~ + +One way to plan for cloud controller or storage proxy maintenance is to +simply do it off-hours, such as at 1 a.m. or 2 a.m. This strategy +affects fewer users. If your cloud controller or storage proxy is too +important to have unavailable at any point in time, you must look into +high-availability options. + +Rebooting a Cloud Controller or Storage Proxy +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +All in all, just issue the :command:`reboot` command. The operating system +cleanly shuts down services and then automatically reboots. If you want +to be very thorough, run your backup jobs just before you +reboot. + +After a cloud controller reboots, ensure that all required services were +successfully started. The following commands use :command:`ps` and +:command:`grep` to determine if nova, glance, and keystone are currently +running: + +.. code-block:: console + + # ps aux | grep nova- + # ps aux | grep glance- + # ps aux | grep keystone + # ps aux | grep cinder + +Also check that all services are functioning. The following set of +commands sources the ``openrc`` file, then runs some basic glance, nova, +and openstack commands. If the commands work as expected, you can be +confident that those services are in working condition: + +.. code-block:: console + + # source openrc + # glance index + # nova list + # openstack project list + +For the storage proxy, ensure that the :term:`Object Storage service` has +resumed: + +.. code-block:: console + + # ps aux | grep swift + +Also check that it is functioning: + +.. code-block:: console + + # swift stat + +Total Cloud Controller Failure +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The cloud controller could completely fail if, for example, its +motherboard goes bad. Users will immediately notice the loss of a cloud +controller since it provides core functionality to your cloud +environment. If your infrastructure monitoring does not alert you that +your cloud controller has failed, your users definitely will. +Unfortunately, this is a rough situation. The cloud controller is an +integral part of your cloud. If you have only one controller, you will +have many missing services if it goes down. + +To avoid this situation, create a highly available cloud controller +cluster. This is outside the scope of this document, but you can read +more in the `OpenStack High Availability +Guide `_. + +The next best approach is to use a configuration-management tool, such +as Puppet, to automatically build a cloud controller. 
This should not
+take more than 15 minutes if you have a spare server available. After
+the controller rebuilds, restore any backups taken
+(see :doc:`ops_backup_recovery`).
+
+Also, in practice, the ``nova-compute`` services on the compute nodes do
+not always reconnect cleanly to RabbitMQ hosted on the controller when
+it comes back up after a long reboot; a restart of the nova services on
+the compute nodes is then required.
diff --git a/doc/ops-guide/source/ops_maintenance_database.rst b/doc/ops-guide/source/ops_maintenance_database.rst
new file mode 100644
index 0000000000..a2e72ac344
--- /dev/null
+++ b/doc/ops-guide/source/ops_maintenance_database.rst
@@ -0,0 +1,49 @@
+=========
+Databases
+=========
+
+Almost all OpenStack components have an underlying database to store
+persistent information. Usually this database is MySQL. Normal MySQL
+administration applies to these databases; OpenStack does not configure
+the databases in any unusual way. Basic administration includes
+performance tuning, high availability, backup, recovery, and repair.
+For more information, see a standard MySQL administration guide.
+
+You can perform a couple of tricks with the database to either more
+quickly retrieve information or fix a data inconsistency error—for
+example, an instance was terminated, but the status was not updated in
+the database. These tricks are discussed throughout this book.
+
+Database Connectivity
+~~~~~~~~~~~~~~~~~~~~~
+
+Review the component's configuration file to see how each OpenStack
+component accesses its corresponding database. Look for either
+``sql_connection`` or simply ``connection``. The following command uses
+``grep`` to display the SQL connection string for nova, glance, cinder,
+and keystone:
+
+.. code-block:: console
+
+   # grep -hE "connection ?=" /etc/nova/nova.conf /etc/glance/glance-*.conf \
+   /etc/cinder/cinder.conf /etc/keystone/keystone.conf
+   sql_connection = mysql+pymysql://nova:nova@cloud.alberta.sandbox.cybera.ca/nova
+   sql_connection = mysql+pymysql://glance:password@cloud.example.com/glance
+   sql_connection = mysql+pymysql://glance:password@cloud.example.com/glance
+   sql_connection = mysql+pymysql://cinder:password@cloud.example.com/cinder
+   connection = mysql+pymysql://keystone_admin:password@cloud.example.com/keystone
+
+The connection strings take this format:
+
+.. code-block:: console
+
+   mysql+pymysql://<username>:<password>@<hostname>/<database name>
+
+Performance and Optimizing
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+As your cloud grows, MySQL is used more and more. If you suspect that
+MySQL is becoming a bottleneck, you should start researching MySQL
+optimization. The MySQL manual has an entire section dedicated to this
+topic: `Optimization Overview
+`_.
diff --git a/doc/ops-guide/source/ops_maintenance_determine.rst b/doc/ops-guide/source/ops_maintenance_determine.rst
new file mode 100644
index 0000000000..eea2a70947
--- /dev/null
+++ b/doc/ops-guide/source/ops_maintenance_determine.rst
@@ -0,0 +1,92 @@
+=====================================
+Determining Which Component Is Broken
+=====================================
+
+OpenStack's components interact with each other strongly. For example,
+uploading an image requires interaction from ``nova-api``,
+``glance-api``, ``glance-registry``, keystone, and potentially
+``swift-proxy``. As a result, it is sometimes difficult to determine
+exactly where problems lie. This section aims to help you narrow down
+the source of the problem.
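+
+Before digging into log files, it can help to exercise each service in
+the chain on its own and see which one misbehaves. As a rough first
+pass (this assumes admin credentials have already been sourced, and the
+exact clients available depend on your release):
+
+.. code-block:: console
+
+   # openstack token issue      # keystone
+   # openstack image list       # glance-api and glance-registry
+   # nova list                  # nova-api
+   # swift stat                 # swift-proxy, if Object Storage is in use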
+ +Tailing Logs +~~~~~~~~~~~~ + +The first place to look is the log file related to the command you are +trying to run. For example, if ``nova list`` is failing, try tailing a +nova log file and running the command again: + +Terminal 1: + +.. code-block:: console + + # tail -f /var/log/nova/nova-api.log + +Terminal 2: + +.. code-block:: console + + # nova list + +Look for any errors or traces in the log file. For more information, see +:doc:`ops_logging_monitoring`. + +If the error indicates that the problem is with another component, +switch to tailing that component's log file. For example, if nova cannot +access glance, look at the ``glance-api`` log: + +Terminal 1: + +.. code-block:: console + + # tail -f /var/log/glance/api.log + +Terminal 2: + +.. code-block:: console + + # nova list + +Wash, rinse, and repeat until you find the core cause of the problem. + +Running Daemons on the CLI +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Unfortunately, sometimes the error is not apparent from the log files. +In this case, switch tactics and use a different command; maybe run the +service directly on the command line. For example, if the ``glance-api`` +service refuses to start and stay running, try launching the daemon from +the command line: + +.. code-block:: console + + # sudo -u glance -H glance-api + +This might print the error and cause of the problem. + +.. note:: + + The ``-H`` flag is required when running the daemons with sudo + because some daemons will write files relative to the user's home + directory, and this write may fail if ``-H`` is left off. + +.. Tip:: + + **Example of Complexity** + + One morning, a compute node failed to run any instances. The log files + were a bit vague, claiming that a certain instance was unable to be + started. This ended up being a red herring because the instance was + simply the first instance in alphabetical order, so it was the first + instance that ``nova-compute`` would touch. + + Further troubleshooting showed that libvirt was not running at all. This + made more sense. If libvirt wasn't running, then no instance could be + virtualized through KVM. Upon trying to start libvirt, it would silently + die immediately. The libvirt logs did not explain why. + + Next, the ``libvirtd`` daemon was run on the command line. Finally a + helpful error message: it could not connect to d-bus. As ridiculous as + it sounds, libvirt, and thus ``nova-compute``, relies on d-bus and + somehow d-bus crashed. Simply starting d-bus set the entire chain back + on track, and soon everything was back up and running. diff --git a/doc/ops-guide/source/ops_maintenance_hardware.rst b/doc/ops-guide/source/ops_maintenance_hardware.rst new file mode 100644 index 0000000000..64ead9f6c0 --- /dev/null +++ b/doc/ops-guide/source/ops_maintenance_hardware.rst @@ -0,0 +1,64 @@ +===================== +Working with Hardware +===================== + +As for your initial deployment, you should ensure that all hardware is +appropriately burned in before adding it to production. Run software +that uses the hardware to its limits—maxing out RAM, CPU, disk, and +network. Many options are available, and normally double as benchmark +software, so you also get a good idea of the performance of your +system. + +Adding a Compute Node +~~~~~~~~~~~~~~~~~~~~~ + +If you find that you have reached or are reaching the capacity limit of +your computing resources, you should plan to add additional compute +nodes. Adding more nodes is quite easy. 
The process for adding compute +nodes is the same as when the initial compute nodes were deployed to +your cloud: use an automated deployment system to bootstrap the +bare-metal server with the operating system and then have a +configuration-management system install and configure OpenStack Compute. +Once the Compute service has been installed and configured in the same +way as the other compute nodes, it automatically attaches itself to the +cloud. The cloud controller notices the new node(s) and begins +scheduling instances to launch there. + +If your OpenStack Block Storage nodes are separate from your compute +nodes, the same procedure still applies because the same queuing and +polling system is used in both services. + +We recommend that you use the same hardware for new compute and block +storage nodes. At the very least, ensure that the CPUs are similar in +the compute nodes to not break live migration. + +Adding an Object Storage Node +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Adding a new object storage node is different from adding compute or +block storage nodes. You still want to initially configure the server by +using your automated deployment and configuration-management systems. +After that is done, you need to add the local disks of the object +storage node into the object storage ring. The exact command to do this +is the same command that was used to add the initial disks to the ring. +Simply rerun this command on the object storage proxy server for all +disks on the new object storage node. Once this has been done, rebalance +the ring and copy the resulting ring files to the other storage nodes. + +.. note:: + + If your new object storage node has a different number of disks than + the original nodes have, the command to add the new node is + different from the original commands. These parameters vary from + environment to environment. + +Replacing Components +~~~~~~~~~~~~~~~~~~~~ + +Failures of hardware are common in large-scale deployments such as an +infrastructure cloud. Consider your processes and balance time saving +against availability. For example, an Object Storage cluster can easily +live with dead disks in it for some period of time if it has sufficient +capacity. Or, if your compute installation is not full, you could +consider live migrating instances off a host with a RAM failure until +you have time to deal with the problem. diff --git a/doc/ops-guide/source/ops_maintenance_hdmwy.rst b/doc/ops-guide/source/ops_maintenance_hdmwy.rst new file mode 100644 index 0000000000..7651aaca93 --- /dev/null +++ b/doc/ops-guide/source/ops_maintenance_hdmwy.rst @@ -0,0 +1,54 @@ +===== +HDWMY +===== + +Here's a quick list of various to-do items for each hour, day, week, +month, and year. Please note that these tasks are neither required nor +definitive but helpful ideas: + +Hourly +~~~~~~ + +* Check your monitoring system for alerts and act on them. +* Check your ticket queue for new tickets. + +Daily +~~~~~ + +* Check for instances in a failed or weird state and investigate why. +* Check for security patches and apply them as needed. + +Weekly +~~~~~~ + +* Check cloud usage: + + * User quotas + * Disk space + * Image usage + * Large instances + * Network usage (bandwidth and IP usage) + +* Verify your alert mechanisms are still working. + +Monthly +~~~~~~~ + +* Check usage and trends over the past month. +* Check for user accounts that should be removed. +* Check for operator accounts that should be removed. 
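+
+Some of these checks lend themselves to a couple of quick commands. For
+example, a rough look at last month's usage per project and the current
+set of user accounts might look like this (the dates are placeholders):
+
+.. code-block:: console
+
+   # openstack usage list --start 2016-04-01 --end 2016-05-01
+   # openstack user list --long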
+
+Quarterly
+~~~~~~~~~
+
+* Review usage and trends over the past quarter.
+* Prepare any quarterly reports on usage and statistics.
+* Review and plan any necessary cloud additions.
+* Review and plan any major OpenStack upgrades.
+
+Semiannually
+~~~~~~~~~~~~
+
+* Upgrade OpenStack.
+* Clean up after an OpenStack upgrade (any unused or new services to be
+  aware of?).
diff --git a/doc/ops-guide/source/ops_maintenance_slow.rst b/doc/ops-guide/source/ops_maintenance_slow.rst
new file mode 100644
index 0000000000..41b3a62c41
--- /dev/null
+++ b/doc/ops-guide/source/ops_maintenance_slow.rst
@@ -0,0 +1,90 @@
+=========================================
+What to do when things are running slowly
+=========================================
+
+When you are getting slow responses from various services, it can be
+hard to know where to start looking. The first thing to check is the
+extent of the slowness: is it specific to a single service, or spread
+across different services? If your problem is isolated to a specific
+service, it can temporarily be fixed by restarting the service, but that
+is often only a fix for the symptom and not the actual problem.
+
+This is a collection of ideas from experienced operators on common
+things to look at that may be the cause of slowness. It is not, however,
+designed to be an exhaustive list.
+
+OpenStack Identity service
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If OpenStack :term:`Identity service` is responding slowly, it could be due
+to the token table getting large. This can be fixed by running the
+:command:`keystone-manage token_flush` command.
+
+Additionally, for Identity-related issues, try the tips
+in :ref:`sql_backend`.
+
+OpenStack Image service
+~~~~~~~~~~~~~~~~~~~~~~~
+
+OpenStack :term:`Image service` can be slowed down by things related to the
+Identity service, but the Image service itself can be slowed down if
+connectivity to the back-end storage in use is slow or otherwise
+problematic. For example, your back-end NFS server might have gone down.
+
+OpenStack Block Storage service
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+OpenStack :term:`Block Storage service` is similar to the Image service, so
+start by checking Identity-related services, and the back-end storage.
+Additionally, both the Block Storage and Image services rely on AMQP and
+SQL functionality, so consider these when debugging.
+
+OpenStack Compute service
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Services related to OpenStack Compute are normally fairly fast and rely
+on a couple of back-end services: Identity (for authentication and
+authorization) and AMQP (for interoperability). Any slowness in Compute
+is normally related to one of these. Also, as with all other services,
+SQL is used extensively.
+
+OpenStack Networking service
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Slowness in the OpenStack :term:`Networking service` can be caused by services
+that it relies upon, but it can also be related to either physical or
+virtual networking. For example: network namespaces that do not exist or
+are not tied to interfaces correctly; DHCP daemons that have hung or are
+not running; a cable being physically disconnected; a switch not being
+configured correctly. When debugging Networking service problems, begin
+by verifying all physical networking functionality (switch
+configuration, physical cabling, etc.). After the physical networking is
+verified, check to be sure all of the Networking services are running
+(neutron-server, neutron-dhcp-agent, etc.), then check on AMQP and SQL
+back ends.
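+
+As a quick sketch of that first pass over the virtual networking layer
+(run on the network node; the agent list and namespace names vary by
+deployment and plug-in):
+
+.. code-block:: console
+
+   # neutron agent-list
+   # ip netns list
+   # ip netns exec <namespace> ip a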
+
+AMQP broker
+~~~~~~~~~~~
+
+Regardless of which AMQP broker you use, such as RabbitMQ, there are
+common issues that not only slow down operations but can also cause
+real problems. Sometimes messages queued for services stay on the queues
+and are not consumed. This can be due to dead or stagnant services and
+can commonly be cleared up by restarting either the AMQP-related
+services or the OpenStack service in question.
+
+.. _sql_backend:
+
+SQL back end
+~~~~~~~~~~~~
+
+Whether you use SQLite or an RDBMS (such as MySQL), SQL interoperability
+is essential to a functioning OpenStack environment. A large or
+fragmented SQLite file can cause slowness when using files as a back
+end. A locked or long-running query can cause delays for most RDBMS
+services. In this case, do not kill the query immediately, but look into
+it to see if it is a problem with something that is hung, or something
+that is just taking a long time to run and needs to finish on its own.
+The administration of an RDBMS is outside the scope of this document,
+but it should be noted that a properly functioning RDBMS is essential to
+most OpenStack services.
diff --git a/doc/ops-guide/source/ops_maintenance_storage.rst b/doc/ops-guide/source/ops_maintenance_storage.rst
new file mode 100644
index 0000000000..52e5b31b34
--- /dev/null
+++ b/doc/ops-guide/source/ops_maintenance_storage.rst
@@ -0,0 +1,91 @@
+=====================================
+Storage Node Failures and Maintenance
+=====================================
+
+Because of the high redundancy of Object Storage, dealing with object
+storage node issues is a lot easier than dealing with compute node
+issues.
+
+Rebooting a Storage Node
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+If a storage node requires a reboot, simply reboot it. Requests for data
+hosted on that node are redirected to other copies while the server is
+rebooting.
+
+Shutting Down a Storage Node
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If you need to shut down a storage node for an extended period of time
+(one or more days), consider removing the node from the storage ring.
+For example:
+
+.. code-block:: console
+
+   # swift-ring-builder account.builder remove <ip address of storage node>
+   # swift-ring-builder container.builder remove <ip address of storage node>
+   # swift-ring-builder object.builder remove <ip address of storage node>
+   # swift-ring-builder account.builder rebalance
+   # swift-ring-builder container.builder rebalance
+   # swift-ring-builder object.builder rebalance
+
+Next, redistribute the ring files to the other nodes:
+
+.. code-block:: console
+
+   # for i in s01.example.com s02.example.com s03.example.com
+   > do
+   > scp *.ring.gz $i:/etc/swift
+   > done
+
+These actions effectively take the storage node out of the storage
+cluster.
+
+When the node is able to rejoin the cluster, just add it back to the
+ring. The exact syntax you use to add a node to your swift cluster with
+``swift-ring-builder`` heavily depends on the options you used when you
+originally created your cluster. Please refer back to those commands.
+
+Replacing a Swift Disk
+~~~~~~~~~~~~~~~~~~~~~~
+
+If a hard drive fails in an Object Storage node, replacing it is
+relatively easy. This assumes that your Object Storage environment is
+configured correctly, where the data that is stored on the failed drive
+is also replicated to other drives in the Object Storage environment.
+
+This example assumes that ``/dev/sdb`` has failed.
+
+First, unmount the disk:
+
+.. code-block:: console
+
+   # umount /dev/sdb
+
+Next, physically remove the disk from the server and replace it with a
+working disk.
+ +Ensure that the operating system has recognized the new disk: + +.. code-block:: console + + # dmesg | tail + +You should see a message about ``/dev/sdb``. + +Because it is recommended to not use partitions on a swift disk, simply +format the disk as a whole: + +.. code-block:: console + + # mkfs.xfs /dev/sdb + +Finally, mount the disk: + +.. code-block:: console + + # mount -a + +Swift should notice the new disk and that no data exists. It then begins +replicating the data to the disk from the other existing replicas. diff --git a/doc/ops-guide/source/ops_uninstall.rst b/doc/ops-guide/source/ops_uninstall.rst new file mode 100644 index 0000000000..792023cf6d --- /dev/null +++ b/doc/ops-guide/source/ops_uninstall.rst @@ -0,0 +1,18 @@ +============ +Uninstalling +============ + +While we'd always recommend using your automated deployment system to +reinstall systems from scratch, sometimes you do need to remove +OpenStack from a system the hard way. Here's how: + +* Remove all packages. +* Remove remaining files. +* Remove databases. + +These steps depend on your underlying distribution, but in general you +should be looking for :command:`purge` commands in your package manager, like +:command:`aptitude purge ~c $package`. Following this, you can look for +orphaned files in the directories referenced throughout this guide. To +uninstall the database properly, refer to the manual appropriate for the +product in use.
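+
+As a very rough sketch on an Ubuntu-based host (package patterns, file
+paths, and database names depend on which services were installed, so
+treat these as examples rather than a complete list):
+
+.. code-block:: console
+
+   # aptitude purge ~nnova ~nglance ~nkeystone
+   # aptitude purge ~c
+   # rm -rf /etc/nova /var/lib/nova /var/log/nova
+   # mysql -u root -p -e "DROP DATABASE nova;"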