Merge "[ops-guide] Cleanup maintenance chapter"
This commit is contained in:
commit
f366908017
File diff suppressed because it is too large
Load Diff
50
doc/ops-guide/source/ops_maintenance_complete.rst
Normal file
50
doc/ops-guide/source/ops_maintenance_complete.rst
Normal file
@ -0,0 +1,50 @@
===========================
Handling a Complete Failure
===========================

A common way of dealing with the recovery from a full system failure,
such as a power outage of a data center, is to assign each service a
priority, and restore in order.
:ref:`table_example_priority` shows an example.

.. _table_example_priority:

.. list-table:: Table. Example service restoration priority list
   :header-rows: 1

   * - Priority
     - Services
   * - 1
     - Internal network connectivity
   * - 2
     - Backing storage services
   * - 3
     - Public network connectivity for user virtual machines
   * - 4
     - ``nova-compute``, ``nova-network``, cinder hosts
   * - 5
     - User virtual machines
   * - 10
     - Message queue and database services
   * - 15
     - Keystone services
   * - 20
     - ``cinder-scheduler``
   * - 21
     - Image Catalog and Delivery services
   * - 22
     - ``nova-scheduler`` services
   * - 98
     - ``cinder-api``
   * - 99
     - ``nova-api`` services
   * - 100
     - Dashboard node

Use this example priority list to ensure that user-affected services are
restored as soon as possible, but not before a stable environment is in
place. Of course, despite being listed as a single-line item, each step
requires significant work. For example, just after starting the
database, you should check its integrity, or, after starting the nova
services, you should verify that the hypervisor matches the database and
fix any mismatches.
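As a sketch of what those checks might look like (assuming a MySQL back
end and the example hosts used elsewhere in this guide), you could
verify database integrity and then compare the hypervisor's view of
running instances against nova's:

.. code-block:: console

   # mysqlcheck --all-databases
   # nova list --host c01.example.com --all-tenants
   # ssh c01.example.com virsh list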

doc/ops-guide/source/ops_maintenance_compute.rst (new file)
@@ -0,0 +1,401 @@
=====================================
Compute Node Failures and Maintenance
=====================================

Sometimes a compute node either crashes unexpectedly or requires a
reboot for maintenance reasons.

Planned Maintenance
~~~~~~~~~~~~~~~~~~~

If you need to reboot a compute node due to planned maintenance (such as
a software or hardware upgrade), first ensure that all hosted instances
have been moved off the node. If your cloud uses shared storage,
use the :command:`nova live-migration` command. First, get a list of instances
that need to be moved:

.. code-block:: console

   # nova list --host c01.example.com --all-tenants

Next, migrate them one by one:

.. code-block:: console

   # nova live-migration <uuid> c02.example.com

If you are not using shared storage, you can use the
:option:`--block-migrate` option:

.. code-block:: console

   # nova live-migration --block-migrate <uuid> c02.example.com
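Rather than migrating each instance by hand, you can script the loop. A
minimal sketch, assuming the UUID is the second whitespace-separated
field of the :command:`nova list` output and that every instance should
land on c02.example.com:

.. code-block:: console

   # for uuid in $(nova list --host c01.example.com --all-tenants | awk '$2 ~ /-/ {print $2}')
   > do
   >    nova live-migration $uuid c02.example.com
   > done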
After you have migrated all instances, ensure that the ``nova-compute``
service has stopped:

.. code-block:: console

   # stop nova-compute

If you use a configuration-management system, such as Puppet, that
ensures the ``nova-compute`` service is always running, you can
temporarily move the ``init`` files:

.. code-block:: console

   # mkdir /root/tmp
   # mv /etc/init/nova-compute.conf /root/tmp
   # mv /etc/init.d/nova-compute /root/tmp

Next, shut down your compute node, perform your maintenance, and turn
the node back on. You can reenable the ``nova-compute`` service by
undoing the previous commands:

.. code-block:: console

   # mv /root/tmp/nova-compute.conf /etc/init
   # mv /root/tmp/nova-compute /etc/init.d/

Then start the ``nova-compute`` service:

.. code-block:: console

   # start nova-compute

You can now optionally migrate the instances back to their original
compute node.

After a Compute Node Reboots
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When you reboot a compute node, first verify that it booted
successfully. This includes ensuring that the ``nova-compute`` service
is running:

.. code-block:: console

   # ps aux | grep nova-compute
   # status nova-compute

Also ensure that it has successfully connected to the AMQP server:

.. code-block:: console

   # grep AMQP /var/log/nova/nova-compute.log
   2013-02-26 09:51:31 12427 INFO nova.openstack.common.rpc.common [-] Connected to AMQP server on 199.116.232.36:5672

After the compute node is successfully running, you must deal with the
instances that are hosted on that compute node because none of them are
running. Depending on your SLA with your users or customers, you might
have to start each instance and ensure that they start correctly.

Instances
~~~~~~~~~

You can create a list of instances that are hosted on the compute node
by running the following command:

.. code-block:: console

   # nova list --host c01.example.com --all-tenants

After you have the list, you can use the :command:`nova` command to start each
instance:

.. code-block:: console

   # nova reboot <uuid>

.. note::

   Any time an instance shuts down unexpectedly, it might have problems
   on boot. For example, the instance might require an ``fsck`` on the
   root partition. If this happens, the user can use the dashboard VNC
   console to fix it.

If an instance does not boot, meaning ``virsh list`` never shows the
instance as even attempting to boot, do the following on the compute
node:

.. code-block:: console

   # tail -f /var/log/nova/nova-compute.log

Try executing the :command:`nova reboot` command again. You should see an
error message about why the instance was not able to boot.

In most cases, the error is the result of something in libvirt's XML
file (``/etc/libvirt/qemu/instance-xxxxxxxx.xml``) that no longer
exists. You can enforce re-creation of the XML file as well as rebooting
the instance by running the following command:

.. code-block:: console

   # nova reboot --hard <uuid>

Inspecting and Recovering Data from Failed Instances
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In some scenarios, instances are running but are inaccessible through
SSH and do not respond to any command. The VNC console could be
displaying a boot failure or kernel panic error message. This could be
an indication of file system corruption on the VM itself. If you need to
recover files or inspect the content of the instance, qemu-nbd can be
used to mount the disk.

.. warning::

   If you access or view the user's content and data, get approval first!

To access the instance's disk
(``/var/lib/nova/instances/instance-xxxxxx/disk``), use the following
steps:

#. Suspend the instance using the ``virsh`` command.

#. Connect the qemu-nbd device to the disk.

#. Mount the qemu-nbd device.

#. Unmount the device after inspecting.

#. Disconnect the qemu-nbd device.

#. Resume the instance.

If you do not follow the last three steps, OpenStack Compute cannot manage
the instance any longer. It fails to respond to any command issued by
OpenStack Compute, and it is marked as shut down.

Once you mount the disk file, you should be able to access it and treat
it as a collection of normal directories with files and a directory
structure. However, we do not recommend that you edit or touch any files
because this could change the
:term:`access control lists (ACLs) <access control list>` that are used
to determine which accounts can perform what operations on files and
directories. Changing ACLs can make the instance unbootable if it is not
already.

#. Suspend the instance using the :command:`virsh` command, taking note of the
   internal ID:

   .. code-block:: console

      # virsh list
       Id    Name                 State
      ----------------------------------
       1     instance-00000981    running
       2     instance-000009f5    running
       30    instance-0000274a    running

      # virsh suspend 30
      Domain 30 suspended

#. Connect the qemu-nbd device to the disk:

   .. code-block:: console

      # cd /var/lib/nova/instances/instance-0000274a
      # ls -lh
      total 33M
      -rw-rw---- 1 libvirt-qemu kvm  6.3K Oct 15 11:31 console.log
      -rw-r--r-- 1 libvirt-qemu kvm   33M Oct 15 22:06 disk
      -rw-r--r-- 1 libvirt-qemu kvm  384K Oct 15 22:06 disk.local
      -rw-rw-r-- 1 nova         nova 1.7K Oct 15 11:30 libvirt.xml
      # qemu-nbd -c /dev/nbd0 `pwd`/disk

#. Mount the qemu-nbd device.

   The qemu-nbd device tries to export the instance disk's different
   partitions as separate devices. For example, if vda is the disk and
   vda1 is the root partition, qemu-nbd exports the device as
   ``/dev/nbd0`` and ``/dev/nbd0p1``, respectively:

   .. code-block:: console

      # mount /dev/nbd0p1 /mnt/

   You can now access the contents of ``/mnt``, which correspond to the
   first partition of the instance's disk.

   To examine the secondary or ephemeral disk, use an alternate mount
   point if you want both primary and secondary drives mounted at the
   same time:

   .. code-block:: console

      # umount /mnt
      # qemu-nbd -c /dev/nbd1 `pwd`/disk.local
      # mount /dev/nbd1 /mnt/
      # ls -lh /mnt/
      total 76K
      lrwxrwxrwx.  1 root root    7 Oct 15 00:44 bin -> usr/bin
      dr-xr-xr-x.  4 root root 4.0K Oct 15 01:07 boot
      drwxr-xr-x.  2 root root 4.0K Oct 15 00:42 dev
      drwxr-xr-x. 70 root root 4.0K Oct 15 11:31 etc
      drwxr-xr-x.  3 root root 4.0K Oct 15 01:07 home
      lrwxrwxrwx.  1 root root    7 Oct 15 00:44 lib -> usr/lib
      lrwxrwxrwx.  1 root root    9 Oct 15 00:44 lib64 -> usr/lib64
      drwx------.  2 root root  16K Oct 15 00:42 lost+found
      drwxr-xr-x.  2 root root 4.0K Feb  3  2012 media
      drwxr-xr-x.  2 root root 4.0K Feb  3  2012 mnt
      drwxr-xr-x.  2 root root 4.0K Feb  3  2012 opt
      drwxr-xr-x.  2 root root 4.0K Oct 15 00:42 proc
      dr-xr-x---.  3 root root 4.0K Oct 15 21:56 root
      drwxr-xr-x. 14 root root 4.0K Oct 15 01:07 run
      lrwxrwxrwx.  1 root root    8 Oct 15 00:44 sbin -> usr/sbin
      drwxr-xr-x.  2 root root 4.0K Feb  3  2012 srv
      drwxr-xr-x.  2 root root 4.0K Oct 15 00:42 sys
      drwxrwxrwt.  9 root root 4.0K Oct 15 16:29 tmp
      drwxr-xr-x. 13 root root 4.0K Oct 15 00:44 usr
      drwxr-xr-x. 17 root root 4.0K Oct 15 00:44 var

#. Once you have completed the inspection, unmount the mount point and
   release the qemu-nbd device:

   .. code-block:: console

      # umount /mnt
      # qemu-nbd -d /dev/nbd0
      /dev/nbd0 disconnected

#. Resume the instance using :command:`virsh`:

   .. code-block:: console

      # virsh list
       Id    Name                 State
      ----------------------------------
       1     instance-00000981    running
       2     instance-000009f5    running
       30    instance-0000274a    paused

      # virsh resume 30
      Domain 30 resumed

.. _volumes:

Volumes
~~~~~~~

If the affected instances also had attached volumes, first generate a
list of instance and volume UUIDs:

.. code-block:: mysql

   mysql> select nova.instances.uuid as instance_uuid,
       cinder.volumes.id as volume_uuid, cinder.volumes.status,
       cinder.volumes.attach_status, cinder.volumes.mountpoint,
       cinder.volumes.display_name from cinder.volumes
       inner join nova.instances on cinder.volumes.instance_uuid=nova.instances.uuid
       where nova.instances.host = 'c01.example.com';

You should see a result similar to the following:

.. code-block:: mysql

   +--------------+------------+-------+--------------+-----------+--------------+
   |instance_uuid |volume_uuid |status |attach_status |mountpoint | display_name |
   +--------------+------------+-------+--------------+-----------+--------------+
   |9b969a05      |1f0fbf36    |in-use |attached      |/dev/vdc   | test         |
   +--------------+------------+-------+--------------+-----------+--------------+
   1 row in set (0.00 sec)

Next, manually detach and reattach the volumes, where X is the proper
mount point:

.. code-block:: console

   # nova volume-detach <instance_uuid> <volume_uuid>
   # nova volume-attach <instance_uuid> <volume_uuid> /dev/vdX

Be sure that the instance has successfully booted and is at a login
screen before doing the above.

Total Compute Node Failure
~~~~~~~~~~~~~~~~~~~~~~~~~~

Compute nodes can fail the same way a cloud controller can fail. A
motherboard failure or some other type of hardware failure can cause an
entire compute node to go offline. When this happens, all instances
running on that compute node become unavailable. Just like with a
cloud controller failure, if your infrastructure monitoring does not
detect a failed compute node, your users will notify you because of
their lost instances.

If a compute node fails and won't be fixed for a few hours (or at all),
you can relaunch all instances that are hosted on the failed node if you
use shared storage for ``/var/lib/nova/instances``.

To do this, generate a list of instance UUIDs that are hosted on the
failed node by running the following query on the nova database:

.. code-block:: mysql

   mysql> select uuid from instances
       where host = 'c01.example.com' and deleted = 0;

Next, update the nova database to indicate that all instances that used
to be hosted on c01.example.com are now hosted on c02.example.com:

.. code-block:: mysql

   mysql> update instances set host = 'c02.example.com'
       where host = 'c01.example.com' and deleted = 0;

If you're using the Networking service ML2 plug-in, update the
Networking service database to indicate that all ports that used to be
hosted on c01.example.com are now hosted on c02.example.com:

.. code-block:: mysql

   mysql> update ml2_port_bindings set host = 'c02.example.com'
       where host = 'c01.example.com';
   mysql> update ml2_port_binding_levels set host = 'c02.example.com'
       where host = 'c01.example.com';

After that, use the :command:`nova` command to reboot all instances that were
on c01.example.com while regenerating their XML files at the same time:

.. code-block:: console

   # nova reboot --hard <uuid>
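If there are many instances, you can combine the database query with a
shell loop. A rough sketch, assuming the root account can run
:command:`mysql` against the nova database without a password:

.. code-block:: console

   # for uuid in $(mysql -N -B -e "select uuid from instances \
   >   where host = 'c01.example.com' and deleted = 0" nova)
   > do
   >    nova reboot --hard $uuid
   > done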
Finally, reattach volumes using the same method described in the section
:ref:`volumes`.

/var/lib/nova/instances
~~~~~~~~~~~~~~~~~~~~~~~

It's worth mentioning this directory in the context of failed compute
nodes. This directory contains the libvirt KVM file-based disk images
for the instances that are hosted on that compute node. If you are not
running your cloud in a shared storage environment, this directory is
unique across all compute nodes.

``/var/lib/nova/instances`` contains two types of directories.

The first is the ``_base`` directory. This contains all the cached base
images from glance for each unique image that has been launched on that
compute node. Files ending in ``_20`` (or a different number) are the
ephemeral base images.

The other directories are titled ``instance-xxxxxxxx``. These
directories correspond to instances running on that compute node. The
files inside are related to one of the files in the ``_base`` directory.
They're essentially differential-based files containing only the changes
made from the original ``_base`` directory.
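As a hypothetical illustration, a listing on a compute node hosting the
instances shown earlier might look like this:

.. code-block:: console

   # ls /var/lib/nova/instances
   _base  instance-00000981  instance-000009f5  instance-0000274a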
All files and directories in ``/var/lib/nova/instances`` are uniquely
named. The files in ``_base`` are uniquely titled for the glance image that
they are based on, and the directory names ``instance-xxxxxxxx`` are
uniquely titled for that particular instance. For example, if you copy
all data from ``/var/lib/nova/instances`` on one compute node to
another, you do not overwrite any files or cause any damage to images
that have the same unique name, because they are essentially the same
file.

Although this method is not documented or supported, you can use it when
your compute node is permanently offline but you have instances locally
stored on it.

doc/ops-guide/source/ops_maintenance_configuration.rst (new file)
@@ -0,0 +1,27 @@
========================
Configuration Management
========================

Maintaining an OpenStack cloud requires that you manage multiple
physical servers, and this number might grow over time. Because managing
nodes manually is error prone, we strongly recommend that you use a
configuration-management tool. These tools automate the process of
ensuring that all your nodes are configured properly and encourage you
to maintain your configuration information (such as packages and
configuration options) in a version-controlled repository.

.. note::

   Several configuration-management tools are available, and this guide
   does not recommend a specific one. The two most popular ones in the
   OpenStack community are `Puppet <https://puppetlabs.com/>`_, with
   available `OpenStack Puppet
   modules <https://github.com/puppetlabs/puppetlabs-openstack>`_; and
   `Chef <http://www.getchef.com/chef/>`_, with available `OpenStack
   Chef recipes <https://github.com/opscode/openstack-chef-repo>`_.
   Other newer configuration tools include
   `Juju <https://juju.ubuntu.com/>`_,
   `Ansible <https://www.ansible.com/>`_, and
   `Salt <http://www.saltstack.com/>`_; and more mature configuration
   management tools include `CFEngine <http://cfengine.com/>`_ and
   `Bcfg2 <http://bcfg2.org/>`_.

doc/ops-guide/source/ops_maintenance_controller.rst (new file)
@@ -0,0 +1,96 @@
===========================================================
Cloud Controller and Storage Proxy Failures and Maintenance
===========================================================

The cloud controller and storage proxy are very similar to each other
when it comes to expected and unexpected downtime. One of each server
type typically runs in the cloud, which makes them very noticeable when
they are not running.

For the cloud controller, the good news is that if your cloud uses the
FlatDHCP multi-host HA network mode, existing instances and volumes
continue to operate while the cloud controller is offline. For the
storage proxy, however, no storage traffic is possible until it is back
up and running.

Planned Maintenance
~~~~~~~~~~~~~~~~~~~

One way to plan for cloud controller or storage proxy maintenance is to
simply do it off-hours, such as at 1 a.m. or 2 a.m. This strategy
affects fewer users. If your cloud controller or storage proxy is too
important to have unavailable at any point in time, you must look into
high-availability options.

Rebooting a Cloud Controller or Storage Proxy
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All in all, just issue the :command:`reboot` command. The operating system
cleanly shuts down services and then automatically reboots. If you want
to be very thorough, run your backup jobs just before you
reboot.

After a cloud controller reboots, ensure that all required services were
successfully started. The following commands use :command:`ps` and
:command:`grep` to determine if nova, glance, keystone, and cinder are
currently running:

.. code-block:: console

   # ps aux | grep nova-
   # ps aux | grep glance-
   # ps aux | grep keystone
   # ps aux | grep cinder

Also check that all services are functioning. The following set of
commands sources the ``openrc`` file, then runs some basic glance, nova,
and openstack commands. If the commands work as expected, you can be
confident that those services are in working condition:

.. code-block:: console

   # source openrc
   # glance index
   # nova list
   # openstack project list

For the storage proxy, ensure that the :term:`Object Storage service` has
resumed:

.. code-block:: console

   # ps aux | grep swift

Also check that it is functioning:

.. code-block:: console

   # swift stat

Total Cloud Controller Failure
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The cloud controller could completely fail if, for example, its
motherboard goes bad. Users will immediately notice the loss of a cloud
controller since it provides core functionality to your cloud
environment. If your infrastructure monitoring does not alert you that
your cloud controller has failed, your users definitely will.
Unfortunately, this is a rough situation. The cloud controller is an
integral part of your cloud. If you have only one controller, you will
have many missing services if it goes down.

To avoid this situation, create a highly available cloud controller
cluster. This is outside the scope of this document, but you can read
more in the `OpenStack High Availability
Guide <http://docs.openstack.org/ha-guide/index.html>`_.

The next best approach is to use a configuration-management tool, such
as Puppet, to automatically build a cloud controller. This should not
take more than 15 minutes if you have a spare server available. After
the controller rebuilds, restore any backups taken
(see :doc:`ops_backup_recovery`).

Also, in practice, the ``nova-compute`` services on the compute nodes do
not always reconnect cleanly to rabbitmq hosted on the controller when
it comes back up after a long reboot; a restart of the nova services on
the compute nodes is then required.
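A sketch of that restart, assuming the Ubuntu-style upstart jobs used in
the examples elsewhere in this guide, run on each compute node:

.. code-block:: console

   # restart nova-compute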

doc/ops-guide/source/ops_maintenance_database.rst (new file)
@@ -0,0 +1,49 @@
=========
Databases
=========

Almost all OpenStack components have an underlying database to store
persistent information. Usually this database is MySQL. Normal MySQL
administration is applicable to these databases. OpenStack does not
configure the databases out of the ordinary. Basic administration
includes performance tweaking, high availability, backup, recovery, and
repairing. For more information, see a standard MySQL administration guide.

You can perform a couple of tricks with the database to either more
quickly retrieve information or fix a data inconsistency error: for
example, an instance was terminated, but the status was not updated in
the database. These tricks are discussed throughout this book.

Database Connectivity
~~~~~~~~~~~~~~~~~~~~~

Review the component's configuration file to see how each OpenStack
component accesses its corresponding database. Look for either
``sql_connection`` or simply ``connection``. The following command uses
``grep`` to display the SQL connection string for nova, glance, cinder,
and keystone:

.. code-block:: console

   # grep -hE "connection ?=" /etc/nova/nova.conf /etc/glance/glance-*.conf \
     /etc/cinder/cinder.conf /etc/keystone/keystone.conf
   sql_connection = mysql+pymysql://nova:nova@cloud.alberta.sandbox.cybera.ca/nova
   sql_connection = mysql+pymysql://glance:password@cloud.example.com/glance
   sql_connection = mysql+pymysql://glance:password@cloud.example.com/glance
   sql_connection = mysql+pymysql://cinder:password@cloud.example.com/cinder
   connection = mysql+pymysql://keystone_admin:password@cloud.example.com/keystone

The connection strings take this format:

.. code-block:: console

   mysql+pymysql:// <username> : <password> @ <hostname> / <database name>
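To verify connectivity by hand, you can feed the pieces of a connection
string to the :command:`mysql` client. A sketch using the keystone
string shown above:

.. code-block:: console

   # mysql -u keystone_admin -p -h cloud.example.com keystone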
Performance and Optimizing
~~~~~~~~~~~~~~~~~~~~~~~~~~

As your cloud grows, MySQL is utilized more and more. If you suspect
that MySQL might be becoming a bottleneck, you should start researching
MySQL optimization. The MySQL manual has an entire section dedicated to
this topic: `Optimization Overview
<http://dev.mysql.com/doc/refman/5.5/en/optimize-overview.html>`_.

doc/ops-guide/source/ops_maintenance_determine.rst (new file)
@@ -0,0 +1,92 @@
=====================================
Determining Which Component Is Broken
=====================================

OpenStack's collection of different components interact with each other
strongly. For example, uploading an image requires interaction from
``nova-api``, ``glance-api``, ``glance-registry``, keystone, and
potentially ``swift-proxy``. As a result, it is sometimes difficult to
determine exactly where problems lie. Assisting in this is the purpose
of this section.

Tailing Logs
~~~~~~~~~~~~

The first place to look is the log file related to the command you are
trying to run. For example, if ``nova list`` is failing, try tailing a
nova log file and running the command again:

Terminal 1:

.. code-block:: console

   # tail -f /var/log/nova/nova-api.log

Terminal 2:

.. code-block:: console

   # nova list

Look for any errors or traces in the log file. For more information, see
:doc:`ops_logging_monitoring`.

If the error indicates that the problem is with another component,
switch to tailing that component's log file. For example, if nova cannot
access glance, look at the ``glance-api`` log:

Terminal 1:

.. code-block:: console

   # tail -f /var/log/glance/api.log

Terminal 2:

.. code-block:: console

   # nova list

Wash, rinse, and repeat until you find the core cause of the problem.

Running Daemons on the CLI
~~~~~~~~~~~~~~~~~~~~~~~~~~

Unfortunately, sometimes the error is not apparent from the log files.
In this case, switch tactics and use a different command; maybe run the
service directly on the command line. For example, if the ``glance-api``
service refuses to start and stay running, try launching the daemon from
the command line:

.. code-block:: console

   # sudo -u glance -H glance-api

This might print the error and cause of the problem.

.. note::

   The ``-H`` flag is required when running the daemons with sudo
   because some daemons will write files relative to the user's home
   directory, and this write may fail if ``-H`` is left off.

.. Tip::

   **Example of Complexity**

   One morning, a compute node failed to run any instances. The log files
   were a bit vague, claiming that a certain instance was unable to be
   started. This ended up being a red herring because the instance was
   simply the first instance in alphabetical order, so it was the first
   instance that ``nova-compute`` would touch.

   Further troubleshooting showed that libvirt was not running at all. This
   made more sense. If libvirt wasn't running, then no instance could be
   virtualized through KVM. Upon trying to start libvirt, it would silently
   die immediately. The libvirt logs did not explain why.

   Next, the ``libvirtd`` daemon was run on the command line. Finally a
   helpful error message appeared: it could not connect to d-bus. As
   ridiculous as it sounds, libvirt, and thus ``nova-compute``, relies on
   d-bus and somehow d-bus crashed. Simply starting d-bus set the entire
   chain back on track, and soon everything was back up and running.

doc/ops-guide/source/ops_maintenance_hardware.rst (new file)
@@ -0,0 +1,64 @@
=====================
Working with Hardware
=====================

As with your initial deployment, you should ensure that all hardware is
appropriately burned in before adding it to production. Run software
that uses the hardware to its limits: maxing out RAM, CPU, disk, and
network. Many options are available, and normally double as benchmark
software, so you also get a good idea of the performance of your
system.

Adding a Compute Node
~~~~~~~~~~~~~~~~~~~~~

If you find that you have reached or are reaching the capacity limit of
your computing resources, you should plan to add additional compute
nodes. Adding more nodes is quite easy. The process for adding compute
nodes is the same as when the initial compute nodes were deployed to
your cloud: use an automated deployment system to bootstrap the
bare-metal server with the operating system and then have a
configuration-management system install and configure OpenStack Compute.
Once the Compute service has been installed and configured in the same
way as the other compute nodes, it automatically attaches itself to the
cloud. The cloud controller notices the new node(s) and begins
scheduling instances to launch there.

If your OpenStack Block Storage nodes are separate from your compute
nodes, the same procedure still applies because the same queuing and
polling system is used in both services.

We recommend that you use the same hardware for new compute and block
storage nodes. At the very least, ensure that the CPUs are similar in
the compute nodes to avoid breaking live migration.

Adding an Object Storage Node
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Adding a new object storage node is different from adding compute or
block storage nodes. You still want to initially configure the server by
using your automated deployment and configuration-management systems.
After that is done, you need to add the local disks of the object
storage node into the object storage ring. The exact command to do this
is the same command that was used to add the initial disks to the ring.
Simply rerun this command on the object storage proxy server for all
disks on the new object storage node. Once this has been done, rebalance
the ring and copy the resulting ring files to the other storage nodes.

.. note::

   If your new object storage node has a different number of disks than
   the original nodes have, the command to add the new node is
   different from the original commands. These parameters vary from
   environment to environment.
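As an illustration (the region, zone, IP address, device name, and
weight here are hypothetical and must match how your ring was originally
built), adding one disk and rebalancing might look like:

.. code-block:: console

   # swift-ring-builder object.builder add r1z1-203.0.113.10:6000/sdb 100
   # swift-ring-builder object.builder rebalance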
Replacing Components
~~~~~~~~~~~~~~~~~~~~

Failures of hardware are common in large-scale deployments such as an
infrastructure cloud. Consider your processes and balance time saving
against availability. For example, an Object Storage cluster can easily
live with dead disks in it for some period of time if it has sufficient
capacity. Or, if your compute installation is not full, you could
consider live migrating instances off a host with a RAM failure until
you have time to deal with the problem.

doc/ops-guide/source/ops_maintenance_hdmwy.rst (new file)
@@ -0,0 +1,54 @@
=====
HDWMY
=====

Here's a quick list of various to-do items for each hour, day, week,
month, and year. Please note that these tasks are neither required nor
definitive but are helpful ideas:

Hourly
~~~~~~

* Check your monitoring system for alerts and act on them.
* Check your ticket queue for new tickets.

Daily
~~~~~

* Check for instances in a failed or weird state and investigate why.
* Check for security patches and apply them as needed.

Weekly
~~~~~~

* Check cloud usage:

  * User quotas
  * Disk space
  * Image usage
  * Large instances
  * Network usage (bandwidth and IP usage)

* Verify your alert mechanisms are still working.

Monthly
~~~~~~~

* Check usage and trends over the past month.
* Check for user accounts that should be removed.
* Check for operator accounts that should be removed.

Quarterly
~~~~~~~~~

* Review usage and trends over the past quarter.
* Prepare any quarterly reports on usage and statistics.
* Review and plan any necessary cloud additions.
* Review and plan any major OpenStack upgrades.

Semiannually
~~~~~~~~~~~~

* Upgrade OpenStack.
* Clean up after an OpenStack upgrade (any unused or new services to be
  aware of?).

doc/ops-guide/source/ops_maintenance_slow.rst (new file)
@@ -0,0 +1,90 @@
=========================================
What to do when things are running slowly
=========================================

When you are getting slow responses from various services, it can be
hard to know where to start looking. The first thing to check is the
extent of the slowness: is it specific to a single service, or varied
among different services? If your problem is isolated to a specific
service, it can temporarily be fixed by restarting the service, but that
is often only a fix for the symptom and not the actual problem.

This is a collection of ideas from experienced operators on common
things to look at that may be the cause of slowness. It is not, however,
designed to be an exhaustive list.

OpenStack Identity service
~~~~~~~~~~~~~~~~~~~~~~~~~~

If OpenStack :term:`Identity service` is responding slowly, it could be due
to the token table getting large. This can be fixed by running the
:command:`keystone-manage token_flush` command.
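Because the token table keeps growing, operators commonly schedule this
flush rather than running it by hand. A sketch of a crontab entry (the
hourly cadence is an arbitrary choice, and many deployments run it as
the keystone user):

.. code-block:: console

   @hourly keystone-manage token_flush >/dev/null 2>&1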
Additionally, for Identity-related issues, try the tips
in :ref:`sql_backend`.

OpenStack Image service
~~~~~~~~~~~~~~~~~~~~~~~

OpenStack :term:`Image service` can be slowed down by things related to the
Identity service, but the Image service itself can be slowed down if
connectivity to the back-end storage in use is slow or otherwise
problematic. For example, your back-end NFS server might have gone down.

OpenStack Block Storage service
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

OpenStack :term:`Block Storage service` is similar to the Image service, so
start by checking Identity-related services, and the back-end storage.
Additionally, both the Block Storage and Image services rely on AMQP and
SQL functionality, so consider these when debugging.

OpenStack Compute service
~~~~~~~~~~~~~~~~~~~~~~~~~

Services related to OpenStack Compute are normally fairly fast and rely
on a couple of backend services: Identity for authentication and
authorization, and AMQP for interoperability. Any slowness in Compute is
normally related to one of these. Also, as with all other
services, SQL is used extensively.

OpenStack Networking service
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Slowness in the OpenStack :term:`Networking service` can be caused by services
that it relies upon, but it can also be related to either physical or
virtual networking. For example: network namespaces that do not exist or
are not tied to interfaces correctly; DHCP daemons that have hung or are
not running; a cable being physically disconnected; a switch not being
configured correctly. When debugging Networking service problems, begin
by verifying all physical networking functionality (switch
configuration, physical cabling, etc.). After the physical networking is
verified, check to be sure all of the Networking services are running
(neutron-server, neutron-dhcp-agent, etc.), then check on AMQP and SQL
back ends.

AMQP broker
~~~~~~~~~~~

Regardless of which AMQP broker you use, such as RabbitMQ, there are
common issues which not only slow down operations, but can also cause
real problems. Sometimes messages queued for services stay on the queues
and are not consumed. This can be due to dead or stagnant services and
can commonly be cleared up by either restarting the AMQP-related
services or the OpenStack service in question.

.. _sql_backend:

SQL back end
~~~~~~~~~~~~

Whether you use SQLite or an RDBMS (such as MySQL), SQL interoperability
is essential to a functioning OpenStack environment. A large or
fragmented SQLite file can cause slowness when using files as a back
end. A locked or long-running query can cause delays for most RDBMS
services. In this case, do not kill the query immediately, but look into
it to see if it is a problem with something that is hung, or something
that is just taking a long time to run and needs to finish on its own.
The administration of an RDBMS is outside the scope of this document,
but it should be noted that a properly functioning RDBMS is essential to
most OpenStack services.

doc/ops-guide/source/ops_maintenance_storage.rst (new file)
@@ -0,0 +1,91 @@
=====================================
Storage Node Failures and Maintenance
=====================================

Because of the high redundancy of Object Storage, dealing with object
storage node issues is a lot easier than dealing with compute node
issues.

Rebooting a Storage Node
~~~~~~~~~~~~~~~~~~~~~~~~

If a storage node requires a reboot, simply reboot it. Requests for data
hosted on that node are redirected to other copies while the server is
rebooting.

Shutting Down a Storage Node
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If you need to shut down a storage node for an extended period of time
(one or more days), consider removing the node from the storage ring.
For example:

.. code-block:: console

   # swift-ring-builder account.builder remove <ip address of storage node>
   # swift-ring-builder container.builder remove <ip address of storage node>
   # swift-ring-builder object.builder remove <ip address of storage node>
   # swift-ring-builder account.builder rebalance
   # swift-ring-builder container.builder rebalance
   # swift-ring-builder object.builder rebalance

Next, redistribute the ring files to the other nodes:

.. code-block:: console

   # for i in s01.example.com s02.example.com s03.example.com
   > do
   > scp *.ring.gz $i:/etc/swift
   > done

These actions effectively take the storage node out of the storage
cluster.

When the node is able to rejoin the cluster, just add it back to the
ring. The exact syntax you use to add a node to your swift cluster with
``swift-ring-builder`` heavily depends on the options you used when you
originally created your cluster. Please refer back to those
commands.

Replacing a Swift Disk
~~~~~~~~~~~~~~~~~~~~~~

If a hard drive fails in an Object Storage node, replacing it is
relatively easy. This assumes that your Object Storage environment is
configured correctly, where the data that is stored on the failed drive
is also replicated to other drives in the Object Storage environment.

This example assumes that ``/dev/sdb`` has failed.

First, unmount the disk:

.. code-block:: console

   # umount /dev/sdb

Next, physically remove the disk from the server and replace it with a
working disk.

Ensure that the operating system has recognized the new disk:

.. code-block:: console

   # dmesg | tail

You should see a message about ``/dev/sdb``.

Because it is recommended to not use partitions on a swift disk, simply
format the disk as a whole:

.. code-block:: console

   # mkfs.xfs /dev/sdb

Finally, mount the disk:

.. code-block:: console

   # mount -a
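For :command:`mount -a` to pick the disk up, ``/etc/fstab`` needs an
entry for it. A sketch of a typical swift mount line (the mount point
and options are assumptions; match your cluster's convention):

.. code-block:: console

   /dev/sdb /srv/node/sdb xfs noatime,nodiratime,logbufs=8 0 0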
Swift should notice the new disk and that no data exists. It then begins
replicating the data to the disk from the other existing replicas.

doc/ops-guide/source/ops_uninstall.rst (new file)
@@ -0,0 +1,18 @@
============
Uninstalling
============

While we'd always recommend using your automated deployment system to
reinstall systems from scratch, sometimes you do need to remove
OpenStack from a system the hard way. Here's how:

* Remove all packages.
* Remove remaining files.
* Remove databases.

These steps depend on your underlying distribution, but in general you
should be looking for :command:`purge` commands in your package manager, like
:command:`aptitude purge ~c $package`. Following this, you can look for
orphaned files in the directories referenced throughout this guide. To
uninstall the database properly, refer to the manual appropriate for the
product in use.
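For example, on a MySQL back end, removing a service's database might
look like the following (``nova`` here stands in for whichever database
you are removing):

.. code-block:: mysql

   mysql> drop database nova;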