b2235bf3fb
Execluded all XML files in the directory doc/common/tables because they are autogenerated. The XML root element of Docbook XML files should match the following format: <ELEMENT xmlns="http://docbook.org/ns/docbook" xmlns:xi="http://www.w3.org/2001/XInclude" xmlns:xlink="http://www.w3.org/1999/xlink" version="5.0" xml:id="THE_XML_ID_OF_THE_ELEMENT"> Change-Id: If12091be81ec8b2e6e53bfcb4c3a883a65e24736
227 lines
13 KiB
XML
227 lines
13 KiB
XML
<?xml version="1.0" encoding="UTF-8"?>
|
||
<section xmlns="http://docbook.org/ns/docbook"
|
||
xmlns:xi="http://www.w3.org/2001/XInclude"
|
||
xmlns:xlink="http://www.w3.org/1999/xlink"
|
||
version="5.0"
|
||
xml:id="section_objectstorage-ringbuilder">
|
||
<title>Ring-builder</title>
|
||
<para>Use the swift-ring-builder utility to build and manage rings. This
|
||
utility assigns partitions to devices and writes an optimized
|
||
Python structure to a gzipped, serialized file on disk for
|
||
transmission to the servers. The server processes occasionally
|
||
check the modification time of the file and reload in-memory
|
||
copies of the ring structure as needed. If you use a slightly
|
||
older version of the ring, one of the three replicas for a
|
||
partition subset will be incorrect because of the way the
|
||
ring-builder manages changes to the ring. You can work around
|
||
this issue.</para>
|
||
<para>The ring-builder also keeps its own builder file with the
|
||
ring information and additional data required to build future
|
||
rings. It is very important to keep multiple backup copies of
|
||
these builder files. One option is to copy the builder files
|
||
out to every server while copying the ring files themselves.
|
||
Another is to upload the builder files into the cluster
|
||
itself. If you lose the builder file, you have to create a new
|
||
ring from scratch. Nearly all partitions would be assigned to
|
||
different devices and, therefore, nearly all of the stored
|
||
data would have to be replicated to new locations. So,
|
||
recovery from a builder file loss is possible, but data would
|
||
be unreachable for an extended time.</para>
|
||
<section xml:id="section_ring-data-structure">
|
||
<title>Ring data structure</title>
|
||
<para>The ring data structure consists of three top level
|
||
fields: a list of devices in the cluster, a list of lists
|
||
of device ids indicating partition to device assignments,
|
||
and an integer indicating the number of bits to shift an
|
||
MD5 hash to calculate the partition for the hash.</para>
|
||
</section>
|
||
<section xml:id="section_partition-assignment">
|
||
<title>Partition assignment list</title>
|
||
<para>This is a list of <literal>array(‘H’)</literal> of
|
||
devices ids. The outermost list contains an
|
||
<literal>array(‘H’)</literal> for each replica. Each
|
||
<literal>array(‘H’)</literal> has a length equal to
|
||
the partition count for the ring. Each integer in the
|
||
<literal>array(‘H’)</literal> is an index into the
|
||
above list of devices. The partition list is known
|
||
internally to the Ring class as
|
||
<literal>_replica2part2dev_id</literal>.</para>
|
||
<para>So, to create a list of device dictionaries assigned to
|
||
a partition, the Python code would look like:
|
||
<programlisting>devices = [self.devs[part2dev_id[partition]] for
|
||
part2dev_id in self._replica2part2dev_id]</programlisting></para>
|
||
<para>That code is a little simplistic because it does not
|
||
account for the removal of duplicate devices. If a ring
|
||
has more replicas than devices, a partition will have more
|
||
than one replica on a device.</para>
|
||
<para><literal>array(‘H’)</literal> is used for memory
|
||
conservation as there may be millions of
|
||
partitions.</para>
|
||
</section>
|
||
<section xml:id="section_fractional-replicas">
|
||
<title>Replica counts</title>
|
||
<para>To support the gradual change in replica counts, a ring
|
||
can have a real number of replicas and is not restricted
|
||
to an integer number of replicas.</para>
|
||
<para>A fractional replica count is for the whole ring and not
|
||
for individual partitions. It indicates the average number
|
||
of replicas for each partition. For example, a replica
|
||
count of 3.2 means that 20 percent of partitions have four
|
||
replicas and 80 percent have three replicas.</para>
|
||
<para>The replica count is adjustable.</para>
|
||
<para>Example:</para>
|
||
<screen><prompt>$</prompt> <userinput>swift-ring-builder account.builder set_replicas 4</userinput>
|
||
<prompt>$</prompt> <userinput>swift-ring-builder account.builder rebalance</userinput></screen>
|
||
<para>You must rebalance the replica ring in globally
|
||
distributed clusters. Operators of these clusters
|
||
generally want an equal number of replicas and regions.
|
||
Therefore, when an operator adds or removes a region, the
|
||
operator adds or removes a replica. Removing unneeded
|
||
replicas saves on the cost of disks.</para>
|
||
<para>You can gradually increase the replica count at a rate
|
||
that does not adversely affect cluster performance.</para>
|
||
<para>For example:</para>
|
||
<screen><prompt>$</prompt> <userinput>swift-ring-builder object.builder set_replicas 3.01</userinput>
|
||
<prompt>$</prompt> <userinput>swift-ring-builder object.builder rebalance</userinput>
|
||
<computeroutput><distribute rings and wait>...</computeroutput>
|
||
|
||
<prompt>$</prompt> <userinput>swift-ring-builder object.builder set_replicas 3.02</userinput>
|
||
<prompt>$</prompt> <userinput>swift-ring-builder object.builder rebalance</userinput>
|
||
<computeroutput><creatdistribute rings and wait>...</computeroutput></screen>
|
||
<para>Changes take effect after the ring is rebalanced.
|
||
Therefore, if you intend to change from 3 replicas to 3.01
|
||
but you accidentally type <literal>2.01</literal>, no data
|
||
is lost.</para>
|
||
<para>Additionally, <command>swift-ring-builder
|
||
<replaceable>X.builder</replaceable>
|
||
create</command> can now take a decimal argument for
|
||
the number of replicas.</para>
|
||
</section>
|
||
<section xml:id="section_partition-shift-value">
|
||
<title>Partition shift value</title>
|
||
<para>The partition shift value is known internally to the
|
||
Ring class as <literal>_part_shift</literal>. This value
|
||
is used to shift an MD5 hash to calculate the partition
|
||
where the data for that hash should reside. Only the top
|
||
four bytes of the hash is used in this process. For
|
||
example, to compute the partition for the
|
||
<literal>/account/container/object</literal> path, the
|
||
Python code might look like the following code:
|
||
<programlisting>partition = unpack_from('>I',
|
||
md5('/account/container/object').digest())[0] >>
|
||
self._part_shift</programlisting></para>
|
||
<para>For a ring generated with part_power P, the partition
|
||
shift value is <literal>32 - P</literal>.</para>
|
||
</section>
|
||
<section xml:id="section_build-ring">
|
||
<title>Build the ring</title>
|
||
<para>The ring builder process includes these high-level
|
||
steps:</para>
|
||
<orderedlist>
|
||
<listitem>
|
||
<para>The utility calculates the number of partitions to
|
||
assign to each device based on the weight of the
|
||
device. For example, for a partition at the power
|
||
of 20, the ring has 1,048,576 partitions. One
|
||
thousand devices of equal weight each want
|
||
1,048.576 partitions. The devices are sorted by
|
||
the number of partitions they desire and kept in
|
||
order throughout the initialization
|
||
process.</para>
|
||
<note>
|
||
<para>Each device is also assigned a random
|
||
tiebreaker value that is used when two devices
|
||
desire the same number of partitions. This
|
||
tiebreaker is not stored on disk anywhere, and
|
||
so two different rings created with the same
|
||
parameters will have different partition
|
||
assignments. For repeatable partition
|
||
assignments,
|
||
<literal>RingBuilder.rebalance()</literal>
|
||
takes an optional seed value that seeds the
|
||
Python pseudo-random number generator.</para>
|
||
</note>
|
||
</listitem>
|
||
<listitem>
|
||
<para>The ring builder assigns each partition replica
|
||
to the device that requires most partitions at
|
||
that point while keeping it as far away as
|
||
possible from other replicas. The ring builder
|
||
prefers to assign a replica to a device in a
|
||
region that does not already have a replica. If no
|
||
such region is available, the ring builder
|
||
searches for a device in a different zone, or on a
|
||
different server. If it does not find one, it
|
||
looks for a device with no replicas. Finally, if
|
||
all options are exhausted, the ring builder
|
||
assigns the replica to the device that has the
|
||
fewest replicas already assigned.</para>
|
||
<note>
|
||
<para>The ring builder assigns multiple replicas
|
||
to one device only if the ring has fewer
|
||
devices than it has replicas.</para>
|
||
</note>
|
||
</listitem>
|
||
<listitem>
|
||
<para>When building a new ring from an old ring, the
|
||
ring builder recalculates the desired number of
|
||
partitions that each device wants.</para>
|
||
</listitem>
|
||
<listitem>
|
||
<para>The ring builder unassigns partitions and
|
||
gathers these partitions for reassignment, as
|
||
follows: <itemizedlist>
|
||
<listitem>
|
||
<para>The ring builder unassigns any
|
||
assigned partitions from any removed
|
||
devices and adds these partitions to
|
||
the gathered list.</para>
|
||
</listitem>
|
||
<listitem>
|
||
<para>The ring builder unassigns any
|
||
partition replicas that can be spread
|
||
out for better durability and adds
|
||
these partitions to the gathered list.
|
||
</para>
|
||
</listitem>
|
||
<listitem>
|
||
<para>The ring builder unassigns random
|
||
partitions from any devices that have
|
||
more partitions than they need and
|
||
adds these partitions to the gathered
|
||
list.</para>
|
||
</listitem>
|
||
|
||
</itemizedlist></para>
|
||
</listitem>
|
||
<listitem>
|
||
<para>The ring builder reassigns the gathered
|
||
partitions to devices by using a similar method to
|
||
the one described previously.</para>
|
||
</listitem>
|
||
<listitem>
|
||
<para>When the ring builder reassigns a replica to a
|
||
partition, the ring builder records the time of
|
||
the reassignment. The ring builder uses this value
|
||
when it gathers partitions for reassignment so
|
||
that no partition is moved twice in a configurable
|
||
amount of time. The RingBuilder class knows this
|
||
configurable amount of time as
|
||
<literal>min_part_hours</literal>. The ring
|
||
builder ignores this restriction for replicas of
|
||
partitions on removed devices because removal of a
|
||
device happens on device failure only, and
|
||
reassignment is the only choice.</para>
|
||
</listitem>
|
||
</orderedlist>
|
||
<para>Theses steps do not always perfectly rebalance a ring
|
||
due to the random nature of gathering partitions for
|
||
reassignment. To help reach a more balanced ring, the
|
||
rebalance process is repeated until near perfect (less
|
||
than 1 percent off) or when the balance does not improve
|
||
by at least 1 percent (indicating we probably cannot get
|
||
perfect balance due to wildly imbalanced zones or too many
|
||
partitions recently moved).</para>
|
||
</section>
|
||
</section>
|