<?xml version="1.0" encoding="utf-8"?>
<chapter xmlns="http://docbook.org/ns/docbook"
    xmlns:xi="http://www.w3.org/2001/XInclude"
    xmlns:xlink="http://www.w3.org/1999/xlink"
    version="5.0"
    xml:id="module003-ch005-the-ring">
    <title>Ring Builder</title>
    <para>The rings are built and managed manually by a utility called
        the ring-builder. The ring-builder assigns partitions to
        devices and writes an optimized Python structure to a gzipped,
        serialized file on disk for shipping out to the servers. The
        server processes just check the modification time of the file
        occasionally and reload their in-memory copies of the ring
        structure as needed. Because of how the ring-builder manages
        changes to the ring, using a slightly older ring usually just
        means one of the three replicas for a subset of the partitions
        will be incorrect, which can be easily worked around.</para>
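    <para>As a minimal sketch of that reload behavior (the class and
        path names here are illustrative, and the gzipped-pickle
        format is an assumption; this is not the actual Swift
        implementation), a server process might do something
        like:</para>
    <programlisting language="python">import gzip
import os
import pickle

class RingReloader(object):
    """Reload a serialized ring file only when its mtime changes."""

    def __init__(self, ring_path):
        self.ring_path = ring_path
        self._mtime = None
        self._ring_data = None

    def get_ring(self):
        # Stat the file; reload the in-memory copy only if it changed.
        mtime = os.path.getmtime(self.ring_path)
        if mtime != self._mtime:
            with gzip.open(self.ring_path, 'rb') as f:
                self._ring_data = pickle.load(f)
            self._mtime = mtime
        return self._ring_data</programlisting>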
    <para>The ring-builder also keeps its own builder file with the
        ring information and additional data required to build future
        rings. It is very important to keep multiple backup copies of
        these builder files. One option is to copy the builder files
        out to every server while copying the ring files themselves.
        Another is to upload the builder files into the cluster
        itself. Complete loss of a builder file means creating a new
        ring from scratch; nearly all partitions will end up assigned
        to different devices, and therefore nearly all of the stored
        data will have to be replicated to new locations. So, recovery
        from the loss of a builder file is possible, but data will
        definitely be unreachable for an extended time.</para>
    <para><guilabel>Ring Data Structure</guilabel></para>
    <para>The ring data structure consists of three top-level
        fields: a list of devices in the cluster, a list of lists
        of device ids indicating partition-to-device assignments,
        and an integer indicating the number of bits to shift an
        MD5 hash to calculate the partition for the hash.</para>
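    <para>Put concretely, a Python sketch of those three fields
        (the field names and the device dictionary keys are
        illustrative; the on-disk layout is an implementation detail
        of the Ring class) could look like:</para>
    <programlisting language="python">from array import array

ring_data = {
    # One dictionary per device in the cluster.
    'devs': [
        {'id': 0, 'zone': 1, 'weight': 100.0,
         'ip': '10.0.0.1', 'port': 6000, 'device': 'sda'},
        {'id': 1, 'zone': 2, 'weight': 100.0,
         'ip': '10.0.0.2', 'port': 6000, 'device': 'sdb'},
    ],
    # One array('H') per replica; each entry maps a partition
    # index to an index into the 'devs' list above.
    'replica2part2dev_id': [array('H', [0, 1, 0, 1]),
                            array('H', [1, 0, 1, 0])],
    # Number of bits to right-shift an MD5 hash to get a partition
    # (4 partitions here, so a part power of 2 and a shift of 30).
    'part_shift': 30,
}</programlisting>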
    <para><guilabel>Partition Assignment List</guilabel></para>
    <para>This is a list of array('H') of device ids. The
        outermost list contains an array('H') for each
        replica. Each array('H') has a length equal to the
        partition count for the ring. Each integer in the
        array('H') is an index into the above list of devices.
        The partition list is known internally to the Ring
        class as _replica2part2dev_id.</para>
    <para>So, to create a list of device dictionaries assigned
        to a partition, the Python code would look like:</para>
    <programlisting language="python">devices = [self.devs[part2dev_id[partition]]
           for part2dev_id in self._replica2part2dev_id]</programlisting>
    <para>That code is a little simplistic, as it does not
        account for the removal of duplicate devices. If a
        ring has more replicas than devices, then a partition
        will have more than one replica on one device; that's
        simply the pigeonhole principle at work.</para>
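    <para>A minimal sketch of de-duplicating that device list while
        preserving replica order (not an actual Ring method, just an
        illustration of the point above) might be:</para>
    <programlisting language="python">def unique_devices(devs, replica2part2dev_id, partition):
    """Return the devices for a partition with duplicates removed."""
    seen = set()
    devices = []
    for part2dev_id in replica2part2dev_id:
        dev = devs[part2dev_id[partition]]
        if dev['id'] not in seen:
            seen.add(dev['id'])
            devices.append(dev)
    return devices</programlisting>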
<para>array(‘H’) is used for memory conservation as there
|
||
may be millions of partitions.</para>
|
||
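    <para>For a rough sense of the savings (measured sizes vary by
        Python version and platform): an array('H') stores each
        device id in two bytes, whereas a plain Python list stores a
        full pointer per entry.</para>
    <programlisting language="python">import sys
from array import array

part_count = 2 ** 20  # a million-plus partitions

as_array = array('H', [0] * part_count)
as_list = [0] * part_count

# The array is roughly 2 bytes per entry; the list is roughly
# 8 bytes per pointer, before even counting the integer objects.
print(sys.getsizeof(as_array))  # ~2 MiB
print(sys.getsizeof(as_list))   # ~8 MiB</programlisting>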
    <para><guilabel>Fractional Replicas</guilabel></para>
    <para>A ring is not restricted to having an integer number
        of replicas. In order to support the gradual changing
        of replica counts, the ring is able to have a real
        number of replicas.</para>
    <para>When the number of replicas is not an integer, then
        the last element of _replica2part2dev_id will have a
        length that is less than the partition count for the
        ring. This means that some partitions will have more
        replicas than others. For example, if a ring has 3.25
        replicas, then 25% of its partitions will have four
        replicas, while the remaining 75% will have just
        three.</para>
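    <para>The arithmetic is straightforward; a quick sketch of how
        the last, shorter row would be sized (illustrative code, not
        the builder's internals):</para>
    <programlisting language="python">part_count = 2 ** 20   # partitions in the ring
replica_count = 3.25   # a real (non-integer) replica count

full_rows = int(replica_count)  # 3 full-length rows
last_row_len = int(part_count * (replica_count - full_rows))

print(full_rows)     # 3
print(last_row_len)  # 262144: 25% of the 1,048,576 partitions
                     # get a fourth replica</programlisting>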
    <para><guilabel>Partition Shift Value</guilabel></para>
    <para>The partition shift value is known internally to the
        Ring class as _part_shift. This value is used to shift an
        MD5 hash to calculate the partition on which the data
        for that hash should reside. Only the top four bytes
        of the hash are used in this process. For example, to
        compute the partition for the path
        /account/container/object, the Python code might look
        like:</para>
    <programlisting language="python">from hashlib import md5
from struct import unpack_from

partition = unpack_from(
    '>I', md5(b'/account/container/object').digest())[0] >> self._part_shift</programlisting>
    <para>For a ring generated with part_power P, the
        partition shift value is 32 - P.</para>
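    <para>To make the relationship concrete: with a part_power of
        20, the top four bytes of the hash form a 32-bit integer,
        and shifting it right by 32 - 20 = 12 bits leaves a 20-bit
        partition number in the range 0 to 1,048,575:</para>
    <programlisting language="python">from hashlib import md5
from struct import unpack_from

part_power = 20
part_shift = 32 - part_power  # 12

partition = unpack_from(
    '>I', md5(b'/account/container/object').digest())[0] >> part_shift

# The result always fits in part_power bits.
assert partition >> part_power == 0</programlisting>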
    <para><guilabel>Building the Ring</guilabel></para>
    <para>The initial building of the ring first calculates the
        number of partitions that should ideally be assigned to
        each device, based on the device's weight. For example,
        given a partition power of 20, the ring will have 1,048,576
        partitions. If there are 1,000 devices of equal weight,
        they will each desire 1,048.576 partitions. The devices
        are then sorted by the number of partitions they desire
        and kept in order throughout the initialization
        process.</para>
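    <para>The desired count is just a weight-proportional share of
        the partition space (the real builder also multiplies by the
        replica count, omitted here for simplicity); a small
        illustrative sketch:</para>
    <programlisting language="python">part_power = 20
part_count = 2 ** part_power  # 1,048,576 partitions

devices = [{'id': i, 'weight': 100.0} for i in range(1000)]
total_weight = sum(dev['weight'] for dev in devices)

for dev in devices:
    # Each device's ideal share of the partitions.
    dev['parts_wanted'] = part_count * dev['weight'] / total_weight

print(devices[0]['parts_wanted'])  # 1048.576</programlisting>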
    <para>Note: each device is also assigned a random tiebreaker
        value that is used when two devices desire the same number
        of partitions. This tiebreaker is not stored on disk
        anywhere, and so two different rings created with the same
        parameters will have different partition assignments. For
        repeatable partition assignments, RingBuilder.rebalance()
        takes an optional seed value that will be used to seed
        Python's pseudo-random number generator.</para>
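    <para>For example, two builders created with identical
        parameters and rebalanced with the same seed should end up
        with identical assignments (builder is assumed to be a
        RingBuilder instance already populated with devices):</para>
    <programlisting language="python"># Seeding makes the pseudo-random tiebreakers repeatable.
builder.rebalance(seed=42)</programlisting>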
    <para>Then, the ring builder assigns each replica of each
        partition to the device that desires the most partitions
        at that point while keeping it as far away as possible
        from other replicas. The ring builder prefers to assign a
        replica to a device in a region that has no replicas
        already; should there be no such region available, the
        ring builder will try to find a device in a different
        zone; if that is not possible, it will look on a different
        server; failing that, it will just look for a device that
        has no replicas; finally, if all other options are
        exhausted, the ring builder will assign the replica to the
        device that has the fewest replicas already assigned. Note
        that assignment of multiple replicas to one device will
        only happen if the ring has fewer devices than it has
        replicas.</para>
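    <para>That preference order can be pictured as walking down a
        list of failure-domain tiers; the following is only an
        illustrative sketch of the ordering, with invented dictionary
        keys, not the builder's actual algorithm:</para>
    <programlisting language="python">def pick_device(candidates, replicas_so_far):
    """Pick a device for the next replica of a partition.

    candidates: device dicts with 'region', 'zone', 'server',
        'id', and 'parts_wanted' keys
    replicas_so_far: devices already holding a replica
    """
    used_regions = {d['region'] for d in replicas_so_far}
    used_zones = {(d['region'], d['zone']) for d in replicas_so_far}
    used_servers = {(d['region'], d['zone'], d['server'])
                    for d in replicas_so_far}
    used_devs = {d['id'] for d in replicas_so_far}

    # Try each tier in turn: new region, then new zone, then new
    # server, then any device without a replica.
    tiers = [
        lambda d: d['region'] not in used_regions,
        lambda d: (d['region'], d['zone']) not in used_zones,
        lambda d: (d['region'], d['zone'], d['server']) not in used_servers,
        lambda d: d['id'] not in used_devs,
    ]
    for acceptable in tiers:
        winners = [d for d in candidates if acceptable(d)]
        if winners:
            # Among acceptable devices, take the one that desires
            # the most partitions.
            return max(winners, key=lambda d: d['parts_wanted'])
    # All else exhausted: device with fewest replicas already assigned.
    return min(candidates, key=lambda d: sum(
        1 for r in replicas_so_far if r['id'] == d['id']))</programlisting>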
    <para>When building a new ring based on an old ring, the
        desired number of partitions each device wants is
        recalculated. Next, the partitions to be reassigned are
        gathered up. Any removed devices have all their assigned
        partitions unassigned and added to the gathered list. Any
        partition replicas that (due to the addition of new
        devices) can be spread out for better durability are
        unassigned and added to the gathered list. Any devices
        that have more partitions than they now desire have random
        partitions unassigned from them and added to the gathered
        list. Lastly, the gathered partitions are then reassigned
        to devices using a method similar to the initial
        assignment described above.</para>
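    <para>The gathering step, in outline (an illustrative sketch
        with invented helper names, not the RingBuilder source):</para>
    <programlisting language="python">def gather_parts_for_reassignment(builder):
    """Collect (partition, replica) pairs that need a new home."""
    gathered = []
    # 1. Everything on removed devices must move.
    gathered.extend(builder.parts_on_removed_devices())
    # 2. Replicas that new devices allow to be spread out for
    #    better durability are also unassigned.
    gathered.extend(builder.parts_with_poor_dispersion())
    # 3. Overweight devices give up randomly chosen partitions.
    for dev in builder.devices:
        excess = dev['parts_assigned'] - dev['parts_wanted']
        if excess > 0:
            gathered.extend(builder.random_parts_from(dev, excess))
    return gathered</programlisting>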
    <para>Whenever a partition has a replica reassigned, the time
        of the reassignment is recorded. This is taken into
        account when gathering partitions to reassign, so that no
        partition is moved twice in a configurable amount of time.
        This configurable amount of time is known internally to
        the RingBuilder class as min_part_hours. This restriction
        is ignored for replicas of partitions on devices that have
        been removed, as removing a device only happens on device
        failure and there's no choice but to make a
        reassignment.</para>
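    <para>A sketch of that cooldown check (illustrative only; the
        function and parameter names here are assumptions):</para>
    <programlisting language="python">import time

def may_move(part_last_moved, min_part_hours, device_removed=False):
    """Return True if a partition replica is allowed to move."""
    if device_removed:
        # Removed devices force a reassignment regardless of cooldown.
        return True
    elapsed = time.time() - part_last_moved
    return elapsed >= min_part_hours * 3600</programlisting>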
    <para>The above processes don't always perfectly rebalance a
        ring due to the random nature of gathering partitions for
        reassignment. To help reach a more balanced ring, the
        rebalance process is repeated until the ring is nearly
        balanced (less than 1% off) or until the balance doesn't
        improve by at least 1% (indicating we probably can't get
        perfect balance due to wildly imbalanced zones or too many
        partitions recently moved).</para>
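    <para>The retry loop amounts to iterating until the balance
        stops improving meaningfully (a simplified sketch; it
        assumes rebalance() returns the percentage the ring is off
        from a perfect balance, with 0 meaning perfectly
        balanced):</para>
    <programlisting language="python">def rebalance_until_stable(builder):
    """Repeat rebalancing until nearly balanced or no real progress."""
    last_balance = builder.rebalance()
    while last_balance > 1:  # still more than 1% off ideal
        balance = builder.rebalance()
        improvement = last_balance - balance
        last_balance = balance
        if improvement >= 1:
            continue  # still making real progress; keep going
        break  # improved by less than 1%; as good as it gets
    return last_balance</programlisting>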
</chapter>