<?xml version="1.0" encoding="utf-8"?>
<chapter xmlns="http://docbook.org/ns/docbook"
  xmlns:xi="http://www.w3.org/2001/XInclude"
  xmlns:xlink="http://www.w3.org/1999/xlink"
  version="5.0"
  xml:id="module003-ch005-the-ring">
  <title>Ring Builder</title>
  <para>The rings are built and managed manually by a utility called
    the ring-builder. The ring-builder assigns partitions to
    devices and writes an optimized Python structure to a gzipped,
    serialized file on disk for shipping out to the servers. The
    server processes just check the modification time of the file
    occasionally and reload their in-memory copies of the ring
    structure as needed. Because of how the ring-builder manages
    changes to the ring, using a slightly older ring usually just
    means one of the three replicas for a subset of the partitions
    will be incorrect, which can be easily worked around.</para>
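  <para>The reload-on-modification behaviour described above can be
    sketched as follows. This is only an illustration: the class
    ReloadingRing, the helper write_ring, and the on-disk format (a
    gzipped pickle of a plain dictionary) are assumptions made for
    this example, not the actual ring loader.</para>

```python
import gzip
import os
import pickle
import tempfile


def write_ring(path, data):
    """Write a hypothetical ring structure as a gzipped, pickled file."""
    with gzip.open(path, "wb") as f:
        pickle.dump(data, f)


class ReloadingRing:
    """Hypothetical loader that re-reads the file when its mtime changes."""

    def __init__(self, path):
        self.path = path
        self._mtime = None
        self.data = None
        self.reload_if_changed()

    def reload_if_changed(self):
        """Reload the in-memory copy if the file was modified; report whether it was."""
        mtime = os.path.getmtime(self.path)
        if mtime == self._mtime:
            return False
        with gzip.open(self.path, "rb") as f:
            self.data = pickle.load(f)
        self._mtime = mtime
        return True
```

  <para>A server process would call reload_if_changed() occasionally
    rather than on every request, so a slightly stale ring may be in
    use for a short time.</para>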
  <para>The ring-builder also keeps its own builder file with the
    ring information and additional data required to build future
    rings. It is very important to keep multiple backup copies of
    these builder files. One option is to copy the builder files
    out to every server while copying the ring files themselves.
    Another is to upload the builder files into the cluster
    itself. Complete loss of a builder file means creating a
    new ring from scratch: nearly all partitions will end up
    assigned to different devices, and therefore nearly all stored
    data will have to be replicated to new locations. So,
    recovery from a builder file loss is possible, but data will
    definitely be unreachable for an extended time.</para>
  <para><guilabel>Ring Data Structure</guilabel></para>
    <para>The ring data structure consists of three top-level
      fields: a list of devices in the cluster, a list of lists
      of device ids indicating partition to device assignments,
      and an integer indicating the number of bits to shift an
      MD5 hash to calculate the partition for the hash.</para>
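    <para>As a rough picture of these three fields, the sketch below
      builds a tiny ring by hand. The dictionary keys and the device
      attributes (zone, ip, device) are illustrative assumptions; only
      the overall shape follows the description above.</para>

```python
from array import array

# Hypothetical two-replica ring with four partitions (part_power = 2).
ring_data = {
    # List of device dictionaries; the list index is the device id.
    "devs": [
        {"id": 0, "zone": 1, "ip": "10.0.0.1", "device": "sdb"},
        {"id": 1, "zone": 2, "ip": "10.0.0.2", "device": "sdb"},
    ],
    # One array('H') per replica, each indexed by partition number.
    "replica2part2dev_id": [
        array("H", [0, 1, 0, 1]),
        array("H", [1, 0, 1, 0]),
    ],
    # Number of bits to shift the 32-bit hash prefix: 32 - part_power.
    "part_shift": 30,
}

partition_count = len(ring_data["replica2part2dev_id"][0])
```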
  <para><guilabel>Partition Assignment List</guilabel></para>
    <para>This is a list of array('H') of device ids. The
      outermost list contains an array('H') for each
      replica. Each array('H') has a length equal to the
      partition count for the ring. Each integer in the
      array('H') is an index into the above list of devices.
      The partition list is known internally to the Ring
      class as _replica2part2dev_id.</para>
    <para>So, to create a list of device dictionaries assigned
      to a partition, the Python code would look like:</para>
    <programlisting language="python">devices = [self.devs[part2dev_id[partition]] for
           part2dev_id in self._replica2part2dev_id]</programlisting>
    <para>That code is a little simplistic, as it does not
      account for the removal of duplicate devices. If a
      ring has more replicas than devices, then a partition
      will have more than one replica on one device; that’s
      simply the pigeonhole principle at work.</para>
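    <para>One way to handle the duplicates is to remember which device
      ids have already been seen. The sketch below assumes a toy ring
      with three replicas and only two devices; the function name
      devices_for_partition and the sample data are made up for
      illustration.</para>

```python
from array import array

# Hypothetical ring with three replicas but only two devices.
devs = [{"id": 0, "device": "sdb"}, {"id": 1, "device": "sdc"}]
replica2part2dev_id = [
    array("H", [0, 1]),
    array("H", [1, 0]),
    array("H", [0, 0]),
]


def devices_for_partition(partition):
    """Return the distinct devices holding a partition, in replica order."""
    seen = set()
    devices = []
    for part2dev_id in replica2part2dev_id:
        dev_id = part2dev_id[partition]
        if dev_id not in seen:
            seen.add(dev_id)
            devices.append(devs[dev_id])
    return devices
```

    <para>Here every partition has three replicas, but
      devices_for_partition() returns at most two device
      dictionaries.</para>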
    <para>array('H') is used for memory conservation as there
      may be millions of partitions.</para>
  <para><guilabel>Fractional Replicas</guilabel></para>
    <para>A ring is not restricted to having an integer number
      of replicas. In order to support the gradual changing
      of replica counts, the ring is able to have a real
      number of replicas.</para>
    <para>When the number of replicas is not an integer, then
      the last element of _replica2part2dev_id will have a
      length that is less than the partition count for the
      ring. This means that some partitions will have more
      replicas than others. For example, if a ring has 3.25
      replicas, then 25% of its partitions will have four
      replicas, while the remaining 75% will have just
      three.</para>
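    <para>The 3.25 replica example can be checked with a small
      calculation. The helper replica_rows below is hypothetical and
      fills the rows with placeholder device ids; it only demonstrates
      how the length of the last row follows from the fractional
      part.</para>

```python
from array import array


def replica_rows(replicas, part_count):
    """Build placeholder rows; the last row is shorter when replicas is fractional."""
    whole = int(replicas)
    fraction = replicas - whole
    rows = [array("H", [0] * part_count) for _ in range(whole)]
    if fraction:
        # Only this share of the partitions gets the extra replica.
        rows.append(array("H", [0] * int(part_count * fraction)))
    return rows


rows = replica_rows(3.25, 1024)
row_lengths = [len(row) for row in rows]
```

    <para>With 1,024 partitions, the fourth row holds 256 entries, so
      256 partitions have four replicas and the other 768 have
      three.</para>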
  <para><guilabel>Partition Shift Value</guilabel></para>
    <para>The partition shift value is known internally to the
      Ring class as _part_shift. This value is used to shift an
      MD5 hash to calculate the partition on which the data
      for that hash should reside. Only the top four bytes
      of the hash are used in this process. For example, to
      compute the partition for the path
      /account/container/object the Python code might look
      like:</para>
    <programlisting language="python">partition = unpack_from('>I',
    md5('/account/container/object').digest())[0] >> self._part_shift</programlisting>
    <para>For a ring generated with part_power P, the
      partition shift value is 32 - P.</para>
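    <para>A self-contained version of that calculation might look like
      the following. The constant names, the function partition_for,
      and the part_power of 20 are choices made for this sketch; the
      path is encoded to bytes because hashlib requires bytes under
      Python 3.</para>

```python
from hashlib import md5
from struct import unpack_from

PART_POWER = 20                 # example value chosen for this sketch
PART_SHIFT = 32 - PART_POWER    # partition shift for part_power P is 32 - P


def partition_for(path):
    """Map a path to a partition using the top four bytes of its MD5 hash."""
    digest = md5(path.encode("utf-8")).digest()
    # '>I' reads the first four bytes as a big-endian unsigned integer.
    top_four_bytes = unpack_from(">I", digest)[0]
    return top_four_bytes >> PART_SHIFT


partition = partition_for("/account/container/object")
```

    <para>Because only the top P bits survive the shift, the result
      always falls between 0 and 2**P - 1.</para>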
  <para><guilabel>Building the Ring</guilabel></para>
  <para>The initial building of the ring first calculates the
    number of partitions that should ideally be assigned to
    each device based on the device’s weight. For example, given
    a partition power of 20, the ring will have 1,048,576
    partitions. If there are 1,000 devices of equal weight
    they will each desire 1,048.576 partitions. The devices
    are then sorted by the number of partitions they desire
    and kept in order throughout the initialization
    process.</para>
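  <para>The arithmetic in this example is a simple proportional
    share. The function desired_partitions below is a hypothetical
    helper written for this illustration, not part of the
    ring-builder.</para>

```python
def desired_partitions(weights, part_power):
    """Share of partitions each device ideally wants, proportional to weight."""
    part_count = 2 ** part_power
    total_weight = sum(weights)
    return [part_count * weight / total_weight for weight in weights]


# 1,000 devices of equal weight with a partition power of 20.
equal = desired_partitions([1.0] * 1000, 20)

# Three devices where the second has twice the weight of the others.
uneven = desired_partitions([100.0, 200.0, 100.0], 2)
```

  <para>In the second call the ring has only four partitions, and the
    heavier device wants two of them while the others want one
    each.</para>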
  <para>Note: each device is also assigned a random tiebreaker
    value that is used when two devices desire the same number
    of partitions. This tiebreaker is not stored on disk
    anywhere, and so two different rings created with the same
    parameters will have different partition assignments. For
    repeatable partition assignments, RingBuilder.rebalance()
    takes an optional seed value that will be used to seed
    Python’s pseudo-random number generator.</para>
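  <para>The effect of seeding can be seen with Python's random module
    alone. The function tiebreakers is a stand-in invented for this
    example and does not call the ring-builder; it only shows why a
    fixed seed gives repeatable values.</para>

```python
import random


def tiebreakers(device_count, seed=None):
    """Assign each device a random tiebreaker; a fixed seed makes it repeatable."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(device_count)]


first = tiebreakers(4, seed=42)
second = tiebreakers(4, seed=42)
```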
  <para>Then, the ring builder assigns each replica of each
    partition to the device that desires the most partitions
    at that point while keeping it as far away as possible
    from other replicas. The ring builder prefers to assign a
    replica to a device in a region that has no replicas
    already; should there be no such region available, the
    ring builder will try to find a device in a different
    zone; if not possible, it will look on a different server;
    failing that, it will just look for a device that has no
    replicas; finally, if all other options are exhausted, the
    ring builder will assign the replica to the device that
    has the fewest replicas already assigned. Note that
    assignment of multiple replicas to one device will only
    happen if the ring has fewer devices than it has
    replicas.</para>
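  <para>The order of preference described above can be sketched as a
    series of progressively weaker filters. This is a simplified
    illustration, not the ring-builder's implementation: the function
    pick_device and the dictionary keys id, region, zone, ip, and wants
    are assumptions, and the sketch does not update wants after an
    assignment.</para>

```python
def pick_device(devices, placed):
    """Pick the device for the next replica of one partition.

    devices: list of dicts with hypothetical keys id, region, zone, ip, wants.
    placed:  list of devices already holding a replica of this partition.
    """
    def unused(key_fn):
        used = {key_fn(d) for d in placed}
        return [d for d in devices if key_fn(d) not in used]

    # Tiers from most to least dispersed: region, zone, server, device.
    tiers = [
        lambda d: d["region"],
        lambda d: (d["region"], d["zone"]),
        lambda d: (d["region"], d["zone"], d["ip"]),
        lambda d: d["id"],
    ]
    for key_fn in tiers:
        candidates = unused(key_fn)
        if candidates:
            # Among acceptable devices, take the one wanting the most partitions.
            return max(candidates, key=lambda d: d["wants"])

    # Every device already holds a replica: fall back to the fewest replicas.
    counts = {d["id"]: 0 for d in devices}
    for d in placed:
        counts[d["id"]] += 1
    return min(devices, key=lambda d: counts[d["id"]])
```

  <para>The final fallback is only reached when every device already
    holds a replica of the partition, which matches the note that
    doubling up happens only when there are fewer devices than
    replicas.</para>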
  <para>When building a new ring based on an old ring, the
    number of partitions each device desires is
    recalculated. Next, the partitions to be reassigned are
    gathered up. Any removed devices have all their assigned
    partitions unassigned and added to the gathered list. Any
    partition replicas that (due to the addition of new
    devices) can be spread out for better durability are
    unassigned and added to the gathered list. Any devices
    that have more partitions than they now desire have random
    partitions unassigned from them and added to the gathered
    list. Lastly, the gathered partitions are reassigned
    to devices using a method similar to the initial
    assignment described above.</para>
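  <para>A reduced version of the gathering step is sketched below. To
    keep it short it assumes a single replica per partition and skips
    the step that spreads replicas out for durability; the function
    gather and its arguments are hypothetical.</para>

```python
import random


def gather(part2dev, removed, desired, seed=None):
    """Collect partitions to reassign from a single-replica assignment list.

    part2dev: list mapping partition number to device id.
    removed:  set of device ids that have left the ring.
    desired:  dict of device id to the number of partitions it now wants.
    """
    rng = random.Random(seed)
    # Everything on a removed device must move.
    gathered = [p for p, dev in enumerate(part2dev) if dev in removed]
    # Overloaded devices give up randomly chosen partitions.
    for dev, wanted in desired.items():
        held = [p for p, d in enumerate(part2dev) if d == dev]
        surplus = len(held) - wanted
        if surplus > 0:
            gathered.extend(rng.sample(held, surplus))
    return sorted(gathered)
```

  <para>The gathered partitions would then be handed to the same
    placement logic used for the initial assignment.</para>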
  <para>Whenever a partition has a replica reassigned, the time
    of the reassignment is recorded. This is taken into
    account when gathering partitions to reassign so that no
    partition is moved twice in a configurable amount of time.
    This configurable amount of time is known internally to
    the RingBuilder class as min_part_hours. This restriction
    is ignored for replicas of partitions on devices that have
    been removed, as removing a device only happens on device
    failure and there’s no choice but to make a
    reassignment.</para>
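  <para>The rule amounts to a simple time check with one exception.
    In the sketch below, the function can_reassign, the constant
    MIN_PART_HOURS, and the use of timestamps in seconds are
    assumptions for illustration only.</para>

```python
MIN_PART_HOURS = 1  # hypothetical setting for this sketch


def can_reassign(last_moved_at, now, device_removed=False):
    """Decide whether a replica may move, given timestamps in seconds."""
    if device_removed:
        # Replicas on removed devices must move regardless of the timer.
        return True
    return now - last_moved_at >= MIN_PART_HOURS * 3600
```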
  <para>The above processes don’t always perfectly rebalance a
    ring due to the random nature of gathering partitions for
    reassignment. To help reach a more balanced ring, the
    rebalance process is repeated until the balance is near
    perfect (less than 1% off) or until the balance doesn’t
    improve by at least 1% (indicating we probably can’t get
    perfect balance due to wildly imbalanced zones or too many
    partitions recently moved).</para>
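  <para>The stopping rule can be written as a short loop. This is a
    sketch under stated assumptions: rebalance_once is a hypothetical
    callable that performs one pass and returns the balance as a
    percentage (0 meaning perfectly balanced), and max_passes is a
    safety limit added for the example.</para>

```python
THRESHOLD = 1.0  # percent


def rebalance_until_stable(rebalance_once, max_passes=100):
    """Repeat passes until balance is under 1% or stops improving by 1%.

    Returns the final balance and the number of passes performed.
    """
    last = None
    balance = None
    for passes in range(1, max_passes + 1):
        balance = rebalance_once()
        if THRESHOLD > balance:
            # Near perfect: less than 1% off.
            return balance, passes
        if last is not None and THRESHOLD > last - balance:
            # The last pass improved the balance by less than 1%.
            return balance, passes
        last = balance
    return balance, max_passes
```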
</chapter>