<?xml version="1.0" encoding="UTF-8"?>
<chapter xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink" version="5.0" xml:id="troubleshooting-openstack-object-storage">
<title>Troubleshooting OpenStack Object Storage</title>
<para>For OpenStack Object Storage, everything is logged in <filename>/var/log/syslog</filename> (or <filename>messages</filename> on some distributions). Several settings enable further customization of logging, such as <literal>log_name</literal>, <literal>log_facility</literal>, and <literal>log_level</literal>, within the object server configuration files.</para>
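<para>For example, a minimal logging stanza in the <literal>[DEFAULT]</literal> section of an object server configuration file might look like the following (the values shown are illustrative defaults):</para>
<programlisting language="ini">[DEFAULT]
# Label that identifies this service in each log line
log_name = object-server
# Syslog facility that receives the log messages
log_facility = LOG_LOCAL0
# Lowest severity that is logged: DEBUG, INFO, WARNING, ERROR
log_level = INFO</programlisting>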
<section xml:id="handling-drive-failure">
<title>Handling Drive Failure</title>
<para>In the event that a drive has failed, the first step is to make sure the drive is unmounted. This makes it easier for OpenStack Object Storage to work around the failure until it has been resolved. If the drive can be replaced immediately, then it is best to replace it, format it, remount it, and let replication fill it up.</para>
<para>If the drive cannot be replaced immediately, then it is best to leave it unmounted and remove the drive from the ring. This allows all the replicas that were on that drive to be replicated elsewhere until the drive is replaced. Once the drive is replaced, it can be re-added to the ring.</para>
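<para>As an illustration, assuming the failed device is <literal>sdb1</literal> on a storage node with IP address <literal>192.168.1.1</literal> (both hypothetical values), removing it from the account ring might look like the following. Repeat the <literal>remove</literal> and <literal>rebalance</literal> steps for the container and object builder files, then push the new rings out to all nodes:</para>
<screen># umount /dev/sdb1
# swift-ring-builder account.builder remove 192.168.1.1/sdb1
# swift-ring-builder account.builder rebalance</screen>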
<para>Rackspace has seen hints of drive failure in the error messages in <filename>/var/log/kern.log</filename>; consider watching this file as part of your monitoring.</para>
</section>
<section xml:id="handling-server-failure">
<title>Handling Server Failure</title>
<para>If a server is having hardware issues, it is a good idea to make sure the OpenStack Object Storage services are not running. This will allow OpenStack Object Storage to work around the failure while you troubleshoot.</para>
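<para>For example, you might stop all Object Storage services on the affected node with <literal>swift-init</literal>:</para>
<screen># swift-init all stop</screen>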
<para>If the server just needs a reboot, or a small amount of work that should last only a couple of hours, then it is probably best to let OpenStack Object Storage work around the failure while you get the machine fixed and back online. When the machine comes back online, replication makes sure that anything missed during the downtime gets updated.</para>
<para>If the server has more serious issues, then it is probably best to remove all of the server's devices from the ring. Once the server has been repaired and is back online, the server's devices can be added back into the ring. It is important to reformat the devices before putting them back into the ring, because each device is likely to be responsible for a different set of partitions than before.</para>
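<para>Because <literal>swift-ring-builder</literal> accepts a search value of just an IP address, one command per ring can remove every device on the server. For example, for a hypothetical node at <literal>192.168.1.1</literal>:</para>
<screen># swift-ring-builder account.builder remove 192.168.1.1
# swift-ring-builder account.builder rebalance</screen>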
</section>
<section xml:id="detecting-failed-drives">
<title>Detecting Failed Drives</title>
<para>It has been our experience that when a drive is about to fail, error messages appear in <filename>/var/log/kern.log</filename>. There is a script called <literal>swift-drive-audit</literal> that you can run from cron to watch for bad drives. If errors are detected, it unmounts the bad drive so that OpenStack Object Storage can work around it. The script takes a configuration file with the following settings:
</para>
<xi:include href="tables/swift-drive-audit-drive-audit.xml"/>
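<para>A typical deployment runs the script periodically from cron. For example, an <filename>/etc/cron.d</filename> entry such as the following (the schedule and the configuration file path are illustrative) checks the drives once an hour:</para>
<programlisting># Run the drive audit script hourly as root
0 * * * * root /usr/bin/swift-drive-audit /etc/swift/drive-audit.conf</programlisting>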
<para>This script has been tested only on Ubuntu 10.04, so if you are using a different distribution or operating system, take care before using it in production.
</para></section>
<section xml:id="recover-ring-builder-file">
<title>Emergency Recovery of Ring Builder Files</title>
<para>You should always keep a backup of Swift ring builder files.
However, if an emergency occurs, this procedure may assist in returning
your cluster to an operational state.</para>
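<para>One simple way to keep such a backup, for example, is to copy the
builder files somewhere safe (the destination path here is hypothetical)
every time you change the rings:</para>
<screen># cp /etc/swift/*.builder /srv/swift-ring-backup/</screen>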
<para>Using existing Swift tools, there is no way to recover a builder
file from a <filename>ring.gz</filename> file. However, if you have some
knowledge of Python, it is possible to construct a builder file that is
pretty close to the one you have lost. The following procedure describes
how.</para>
<warning><title>Warning</title>
<para>This procedure is a last resort for emergency circumstances. It
requires knowledge of the Swift Python code and may not succeed.</para></warning>
<para>First, in a Python REPL, import the ring classes and load the ring:</para>
<programlisting language="python">>>> from swift.common.ring import RingData, RingBuilder
>>> ring = RingData.load('/path/to/account.ring.gz')</programlisting>
<para>Now, start copying the data we have in the ring into the builder.</para>
<programlisting language="python">>>> import math
>>> partitions = len(ring._replica2part2dev_id[0])
>>> replicas = len(ring._replica2part2dev_id)
>>> builder = RingBuilder(int(Math.log(partitions, 2)), replicas, 1)
>>> builder.devs = ring.devs
>>> builder._replica2part2dev = ring.replica2part2dev_id
>>> builder._last_part_moves_epoch = 0
>>> builder._last_part_moves = array('B', (0 for _ in xrange(self.parts)))
>>> builder._set_parts_wanted()
>>> for d in builder._iter_devs():
d['parts'] = 0
>>> for p2d in builder._replica2part2dev:
for dev_id in p2d:
builder.devs[dev_id]['parts'] += 1</programlisting>
<para>This is the extent of the recoverable fields. For
<literal>min_part_hours</literal>, you will either have to remember
the value you used, or just make up a new one.</para>
<programlisting language="python">>>> builder.change_min_part_hours(24) # or whatever you want it to be</programlisting>
<para>Try some validation: if this doesn't raise an exception, you may
feel some hope. Not too much, though.</para>
<programlisting language="python">>>> builder.validate()</programlisting>
<para>Save the builder.</para>
<programlisting language="python">>>> import pickle
>>> pickle.dump(builder.to_dict(), open('account.builder', 'wb'), protocol=2)</programlisting>
<para>You should now have a file called <filename>account.builder</filename>
in the current working directory.
Next, run <literal>swift-ring-builder account.builder write_ring</literal>
and compare the new <filename>account.ring.gz</filename> to the
<filename>account.ring.gz</filename> that you started
from. They probably won't be byte-for-byte identical, but if you load them
up in a REPL and their <literal>_replica2part2dev_id</literal> and
<literal>devs</literal> attributes are the same (or nearly so), then you're
in good shape.</para>
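<para>One way to make that comparison, assuming the ring file you started
from is still available (the paths here are placeholders), is another
short REPL session; if both comparisons print <literal>True</literal>,
you are in good shape:</para>
<programlisting language="python">>>> from swift.common.ring import RingData
>>> old_ring = RingData.load('/path/to/old/account.ring.gz')
>>> new_ring = RingData.load('account.ring.gz')
>>> old_ring._replica2part2dev_id == new_ring._replica2part2dev_id
True
>>> old_ring.devs == new_ring.devs
True</programlisting>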
<para>Next, repeat the procedure for <literal>container.ring.gz</literal>
and <literal>object.ring.gz</literal>, and you might get usable builder
files.</para>
</section>
</chapter>