<?xml version="1.0" encoding="utf-8"?>
<chapter xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="module003-ch005-the-ring">
<title>Ring Builder</title>
<para>The rings are built and managed manually by a utility called
the ring-builder. The ring-builder assigns partitions to
devices and writes an optimized Python structure to a gzipped,
serialized file on disk for shipping out to the servers. The
server processes just check the modification time of the file
occasionally and reload their in-memory copies of the ring
structure as needed. Because of how the ring-builder manages
changes to the ring, using a slightly older ring usually just
means one of the three replicas for a subset of the partitions
will be incorrect, which can be easily worked around.</para>
<para>The ring-builder also keeps its own builder file with the
ring information and additional data required to build future
rings. It is very important to keep multiple backup copies of
these builder files. One option is to copy the builder files
out to every server while copying the ring files themselves.
Another is to upload the builder files into the cluster
itself. Complete loss of a builder file will mean creating a
new ring from scratch, nearly all partitions will end up
assigned to different devices, and therefore nearly all data
stored will have to be replicated to new locations. So,
recovery from a builder file loss is possible, but data will
definitely be unreachable for an extended time.</para>
<para><guilabel>Ring Data Structure</guilabel></para>
<para>The ring data structure consists of three top-level
fields: a list of devices in the cluster, a list of lists
of device IDs indicating partition-to-device assignments,
and an integer indicating the number of bits to shift an
MD5 hash to calculate the partition for the hash.</para>
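<para>As an illustration, the three fields might be sketched in
Python as follows. The values are invented for a tiny
four-partition, two-replica ring, and the key names only loosely
follow the identifiers mentioned in this chapter:</para>
<programlisting language="python"># A minimal sketch of the ring data structure; values are invented
# for a tiny four-partition, two-replica ring.
ring_data = {
    # A list of devices in the cluster.
    'devs': [
        {'id': 0, 'zone': 1, 'weight': 100.0,
         'ip': '10.0.0.1', 'device': 'sdb'},
        {'id': 1, 'zone': 2, 'weight': 100.0,
         'ip': '10.0.0.2', 'device': 'sdb'},
    ],
    # One list per replica; each maps a partition number to an
    # index in the 'devs' list above.
    'replica2part2dev_id': [
        [0, 1, 0, 1],   # replica 0
        [1, 0, 1, 0],   # replica 1
    ],
    # Bits to right-shift the top four bytes of an MD5 hash to
    # obtain the partition number (32 minus the partition power).
    'part_shift': 30,
}</programlisting>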
<para><guilabel>Partition Assignment
List</guilabel></para>
<para>This is a list of array('H') arrays of device IDs. The
outermost list contains one array('H') for each
replica. Each array('H') has a length equal to the
partition count for the ring. Each integer in the
array('H') is an index into the above list of devices.
The partition list is known internally to the Ring
class as _replica2part2dev_id.</para>
<para>So, to create a list of device dictionaries assigned
to a partition, the Python code would look like:</para>
<programlisting language="python">devices = [self.devs[part2dev_id[partition]]
           for part2dev_id in self._replica2part2dev_id]</programlisting>
<para>That code is a little simplistic, as it does not
account for the removal of duplicate devices. If a
ring has more replicas than devices, then a partition
will have more than one replica on one device; that's
simply the pigeonhole principle at work.</para>
<para>array('H') is used for memory conservation, as there
may be millions of partitions.</para>
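<para>For instance, the standard library's array module packs
unsigned 16-bit integers contiguously, which is far more compact
than a list of Python integer objects:</para>
<programlisting language="python">from array import array

# One unsigned 16-bit entry per partition; a ring with
# part_power 20 has 2**20 = 1,048,576 partitions per replica.
part2dev_id = array('H', [0] * 2**20)
print(part2dev_id.itemsize)  # 2 bytes per entry</programlisting>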
<para><guilabel>Fractional Replicas</guilabel></para>
<para>A ring is not restricted to having an integer number
of replicas. In order to support the gradual changing
of replica counts, the ring is able to have a real
number of replicas.</para>
<para>When the number of replicas is not an integer, then
the last element of _replica2part2dev_id will have a
length that is less than the partition count for the
ring. This means that some partitions will have more
replicas than others. For example, if a ring has 3.25
replicas, then 25% of its partitions will have four
replicas, while the remaining 75% will have just
three.</para>
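<para>A small sketch of that arithmetic follows; the row layout is
illustrative and only shows how a fractional replica count maps to
a shorter final row of _replica2part2dev_id:</para>
<programlisting language="python"># With 3.25 replicas, the last row covers only a quarter of the
# partitions, so those partitions get a fourth replica.
replicas = 3.25
partition_count = 8

full_rows = int(replicas)                                     # 3
last_row_len = int((replicas - full_rows) * partition_count)  # 2
row_lengths = [partition_count] * full_rows + [last_row_len]
print(row_lengths)  # [8, 8, 8, 2]: 25% of partitions get 4 replicas</programlisting>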
<para><guilabel>Partition Shift Value</guilabel></para>
<para>The partition shift value is known internally to the
Ring class as _part_shift. This value is used to shift an
MD5 hash to calculate the partition on which the data
for that hash should reside. Only the top four bytes
of the hash are used in this process. For example, to
compute the partition for the path
/account/container/object, the Python code might look
like:</para>
<programlisting language="python">partition = unpack_from('&gt;I',
    md5('/account/container/object').digest())[0] &gt;&gt; self._part_shift</programlisting>
<para>For a ring generated with part_power P, the
partition shift value is 32 - P.</para>
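<para>A self-contained version of that computation might look like
the following sketch; note that a real Swift cluster also mixes a
per-cluster hash path prefix and suffix into the hashed path,
which this example omits:</para>
<programlisting language="python">import hashlib
from struct import unpack_from

part_power = 20
part_shift = 32 - part_power

digest = hashlib.md5(b'/account/container/object').digest()
partition = unpack_from('&gt;I', digest)[0] &gt;&gt; part_shift
print(partition)  # a value between 0 and 2**part_power - 1</programlisting>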
<para><guilabel>Building the Ring</guilabel></para>
<para>The initial building of the ring first calculates the
number of partitions that should ideally be assigned to
each device, based on the device's weight. For example,
given a partition power of 20, the ring will have 1,048,576
partitions. If there are 1,000 devices of equal weight,
they will each desire 1,048.576 partitions. The devices
are then sorted by the number of partitions they desire
and kept in order throughout the initialization
process.</para>
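<para>The "desired partitions" figure above can be reproduced
with a short sketch; the parts_wanted name here is purely
illustrative:</para>
<programlisting language="python">part_power = 20
partition_count = 2 ** part_power        # 1,048,576 partitions
devices = [{'id': i, 'weight': 100.0} for i in range(1000)]

total_weight = sum(dev['weight'] for dev in devices)
for dev in devices:
    dev['parts_wanted'] = (partition_count *
                           dev['weight'] / total_weight)

# Sort devices by the number of partitions they desire, most first.
devices.sort(key=lambda dev: dev['parts_wanted'], reverse=True)
print(devices[0]['parts_wanted'])  # 1048.576</programlisting>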
<para>Note: each device is also assigned a random tiebreaker
value that is used when two devices desire the same number
of partitions. This tiebreaker is not stored on disk
anywhere, and so two different rings created with the same
parameters will have different partition assignments. For
repeatable partition assignments, RingBuilder.rebalance()
takes an optional seed value that will be used to seed
Python's pseudo-random number generator.</para>
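<para>As a minimal sketch, assuming the swift package is
installed, a repeatable build could pass a fixed seed to
rebalance():</para>
<programlisting language="python">from swift.common.ring import RingBuilder

# Arguments: part_power, replicas, min_part_hours.
builder = RingBuilder(20, 3, 1)
builder.add_dev({'id': 0, 'region': 1, 'zone': 1, 'weight': 100,
                 'ip': '10.0.0.1', 'port': 6000, 'device': 'sdb'})
# The same seed reproduces the same partition assignments.
builder.rebalance(seed=42)</programlisting>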
<para>Then, the ring builder assigns each replica of each
partition to the device that desires the most partitions
at that point while keeping it as far away as possible
from other replicas. The ring builder prefers to assign a
replica to a device in a region that has no replicas
already; should there be no such region available, the
ring builder will try to find a device in a different
zone; if that is not possible, it will look on a different
server; failing that, it will just look for a device that
has no replicas; finally, if all other options are
exhausted, the ring builder will assign the replica to the
device that has the fewest replicas already assigned. Note
that assignment of multiple replicas to one device will
only happen if the ring has fewer devices than it has
replicas.</para>
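<para>That preference ladder can be sketched as a small function;
this is illustrative rather than Swift's actual implementation,
and the device dictionaries and parts_wanted key are
hypothetical:</para>
<programlisting language="python"># Pick a device for one new replica of a partition, preferring a
# new region, then a new zone, then a new server, then any device
# that holds no replica of the partition yet.
def pick_device(candidates, current_replicas):
    used = lambda key: {dev[key] for dev in current_replicas}
    tests = (
        lambda d: d['region'] not in used('region'),
        lambda d: d['zone'] not in used('zone'),
        lambda d: d['server'] not in used('server'),
        lambda d: d not in current_replicas,
    )
    for test in tests:
        eligible = [d for d in candidates if test(d)]
        if eligible:
            # Among eligible devices, take the one that still
            # desires the most partitions.
            return max(eligible, key=lambda d: d['parts_wanted'])
    # All options exhausted: fall back to the device holding the
    # fewest replicas of this partition already.
    return min(candidates, key=current_replicas.count)</programlisting>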
<para>When building a new ring based on an old ring, the
desired number of partitions each device wants is
recalculated. Next, the partitions to be reassigned are
gathered up. Any removed devices have all their assigned
partitions unassigned and added to the gathered list. Any
partition replicas that (due to the addition of new
devices) can be spread out for better durability are
unassigned and added to the gathered list. Any devices
that have more partitions than they now desire have random
partitions unassigned from them and added to the gathered
list. Lastly, the gathered partitions are then reassigned
to devices using a method similar to the initial
assignment described above.</para>
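<para>A rough sketch of the gathering step follows; it is
illustrative only, the parts and parts_wanted keys are
hypothetical, and the "spread out for better durability" check is
omitted:</para>
<programlisting language="python">import random

def gather_partitions(devices, removed_devices):
    gathered = []
    # Everything on a removed device must be reassigned.
    for dev in removed_devices:
        gathered.extend(dev['parts'])
        dev['parts'] = []
    # Devices holding more partitions than they now desire give
    # up randomly chosen partitions.
    for dev in devices:
        while len(dev['parts']) &gt; dev['parts_wanted']:
            part = random.choice(dev['parts'])
            dev['parts'].remove(part)
            gathered.append(part)
    return gathered</programlisting>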
<para>Whenever a partition has a replica reassigned, the time
of the reassignment is recorded. This is taken into
account when gathering partitions to reassign, so that no
partition is moved twice in a configurable amount of time.
This configurable amount of time is known internally to
the RingBuilder class as min_part_hours. This restriction
is ignored for replicas of partitions on devices that have
been removed, as removing a device only happens on device
failure and there's no choice but to make a
reassignment.</para>
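<para>The min_part_hours check can be sketched as follows (the
names are illustrative):</para>
<programlisting language="python">import time

MIN_PART_HOURS = 1  # the configurable window described above

def can_move(last_moved_epoch, on_removed_device=False):
    # Replicas on removed devices may always move.
    if on_removed_device:
        return True
    # Otherwise the partition must not have moved recently.
    return time.time() - last_moved_epoch &gt;= MIN_PART_HOURS * 3600</programlisting>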
<para>The above processes don't always perfectly rebalance a
ring due to the random nature of gathering partitions for
reassignment. To help reach a more balanced ring, the
rebalance process is repeated until nearly perfect (less
than 1% off) or until the balance doesn't improve by at
least 1% (indicating we probably can't get perfect balance
due to wildly imbalanced zones or too many partitions
recently moved).</para>
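<para>That stopping rule can be sketched as a loop around a single
rebalance pass; rebalance_pass is a hypothetical callable that
returns the ring's balance as a percentage off from perfect:</para>
<programlisting language="python">def rebalance_until_stable(rebalance_pass):
    last_balance = float('inf')
    while True:
        balance = rebalance_pass()
        if balance &lt; 1:                 # near perfect
            return balance
        if last_balance - balance &lt; 1:  # no longer improving
            return balance
        last_balance = balance</programlisting>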
</chapter>