Ring doc cleanups

Change-Id: Ie51ea5c729341da793887e1e25c1e45301a96751
parent 41c8f1330f
commit b5ee8c88d0

=========
The Rings
=========

The rings determine where data should reside in the cluster. There is a
separate ring for account databases, container databases, and individual
object storage policies, but each ring works in the same way. These rings are
externally managed. The server processes themselves do not modify the
rings; they are instead given new rings modified by other tools.

The ring uses a configurable number of bits from the MD5 hash of an item's path
as a partition index that designates the device(s) on which that item should
be stored. The number of bits kept from the hash is known as the partition
power, and 2 to the partition power indicates the partition count. Partitioning
the full MD5 hash ring allows the cluster components to process resources in
batches. This ends up either more efficient or at least less complex than
working with each item separately or the entire cluster all at once.

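As a sketch of that mapping (the partition power of ``16`` here is a
hypothetical value, not a recommendation), the partition index can be derived
from the top four bytes of a path's MD5 hash:

```python
from hashlib import md5
from struct import unpack_from

part_power = 16               # hypothetical partition power
part_shift = 32 - part_power  # bits to discard from the 32-bit hash prefix

# Take the top four bytes of the MD5 hash of the item's path and keep
# only the top part_power bits as the partition index.
path = '/account/container/object'
prefix = unpack_from('>I', md5(path.encode('utf-8')).digest())[0]
partition = prefix >> part_shift

assert 0 <= partition < 2 ** part_power
```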
Another configurable value is the replica count, which indicates how many
devices to assign to each partition in the ring. By having multiple devices
responsible for each partition, the cluster can recover from drive or network
failures.

Devices are added to the ring to describe the capacity available for
partition replica assignments. Devices are placed into failure domains
consisting of region, zone, and server. Regions can be used to describe
geographical systems characterized by lower bandwidth or higher latency between
machines in different regions. Many rings will consist of only a single
region. Zones can be used to group devices based on physical locations, power
separations, network separations, or any other attribute that would lessen
the chance of multiple replicas being unavailable at the same time.

Devices are given a weight which describes the relative storage capacity
contributed by the device in comparison to other devices.

When building a ring, replicas for each partition will be assigned to devices
according to the devices' weights. Additionally, each replica of a partition
will preferentially be assigned to a device whose failure domain does not
already have a replica for that partition. Only a single replica of a
partition may be assigned to each device - you must have at least as many
devices as replicas.

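A toy illustration of these constraints (a deliberately simplified greedy
sketch with invented devices and weights, not Swift's actual balancing
algorithm): each partition gets its replicas on distinct devices, and devices
are chosen by how far they are below their weight-proportional share:

```python
# Simplified sketch, NOT Swift's real placement code: invented devices,
# a greedy weight-based choice, and only the one-replica-per-device rule.
devices = [
    {'id': 0, 'weight': 100.0},
    {'id': 1, 'weight': 100.0},
    {'id': 2, 'weight': 200.0},
    {'id': 3, 'weight': 200.0},
]
replicas, part_count = 3, 16
total_weight = sum(d['weight'] for d in devices)
assigned = {d['id']: 0 for d in devices}

ring = []
for part in range(part_count):
    placed = []
    for _ in range(replicas):
        # never place two replicas of one partition on the same device
        candidates = [d for d in devices if d['id'] not in placed]
        # prefer the device furthest below its weight-proportional share
        best = max(candidates, key=lambda d: (
            replicas * part_count * d['weight'] / total_weight
            - assigned[d['id']]))
        placed.append(best['id'])
        assigned[best['id']] += 1
    ring.append(placed)

# every partition holds its replicas on distinct devices
assert all(len(set(p)) == replicas for p in ring)
```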
.. _ring_builder:

------------
Ring Builder
------------

The rings are built and managed manually by a utility called the ring-builder.
The ring-builder assigns partitions to devices and writes an optimized
structure to a gzipped, serialized file on disk for shipping out to the
servers. The server processes check the modification time of the file
occasionally and reload their in-memory copies of the ring structure as needed.
Because of how the ring-builder manages changes to the ring, using a slightly
older ring usually just means that for a subset of the partitions the device
for one of the replicas will be incorrect, which can be easily worked around.

The ring-builder also keeps a separate builder file which includes the ring
information as well as additional data required to build future rings. It is
very important to keep multiple backup copies of these builder files. One
option is to copy the builder files out to every server while copying the ring
files themselves. Another is to upload the builder files into the cluster
itself. Complete loss of a builder file will mean creating a new ring from
scratch; nearly all partitions will end up assigned to different devices, and
therefore nearly all data stored will have to be replicated to new locations.
So, recovery from a builder file loss is possible, but data will definitely be
unreachable for an extended time.

-------------------
Ring Data Structure
-------------------

The ring data structure consists of three top level fields: a list of devices
in the cluster, a list of lists of device ids indicating partition-to-device
assignments, and an integer indicating the number of bits to shift an MD5 hash
to calculate the partition for the hash.

***************
List of Devices
***************

The list of devices is known internally to the Ring class as ``devs``. Each
item in the list of devices is a dictionary with the following keys:

====== ======= ==============================================================
id     integer The index into the list of devices.
zone   integer The zone in which the device resides.
region integer The region in which the zone resides.
weight float   The relative weight of the device in comparison to other
               devices. This usually corresponds directly to the amount of
               disk space the device has compared to other devices. For
               instance, a device with 1 terabyte of space might have a
               weight of 100.0 and another device with 2 terabytes of space
               might have a weight of 200.0. This weight can also be used to
               bring a device back into balance if it has ended up with more
               data than desired over time. A good average weight of 100.0
               allows flexibility in lowering the weight later if necessary.
ip     string  The IP address or hostname of the server containing the device.
port   int     The TCP port on which the server process listens to serve
               requests for the device.
device string  The on-disk name of the device on the server.
               For example: ``sdb1``
meta   string  A general-use field for storing additional information for the
               device. This information isn't used directly by the server
               processes, but can be useful in debugging. For example, the
               date and time of installation and hardware manufacturer could
               be stored here.
====== ======= ==============================================================

.. note::
    The list of devices may contain holes, or indexes set to ``None``, for
    devices that have been removed from the cluster. However, device ids are
    reused in order to avoid potentially running out of device id slots when
    there are available slots (from prior removal of devices). A consequence
    of this device id reuse is that the device id (integer value) does not
    necessarily correspond with the chronology of when the device was added to
    the ring. Also, some devices may be temporarily disabled by setting their
    weight to ``0.0``. To obtain a list of active devices (for uptime polling,
    for example) the Python code would look like::

        devices = list(self._iter_devs())

*************************
Partition Assignment List
*************************

The partition assignment list is known internally to the Ring class as
``_replica2part2dev_id``. This is a list of ``array('H')``\s, one for each
replica. Each ``array('H')`` has a length equal to the partition count for the
ring. Each integer in the ``array('H')`` is an index into the above list of
devices.

So, to create a list of device dictionaries assigned to a partition, the Python
code would look like::

    devices = [self.devs[part2dev_id[partition]]
               for part2dev_id in self._replica2part2dev_id]

``array('H')`` is used for memory conservation as there may be millions of
partitions.

*********************
Partition Shift Value
*********************

The partition shift value is known internally to the Ring class as
``_part_shift``. This value is used to shift an MD5 hash of an item's path to
calculate the partition on which the data for that item should reside. Only the
top four bytes of the hash are used in this process. For example, to compute
the partition for the path ``/account/container/object``, the Python code might
look like::

    objhash = md5('/account/container/object').digest()
    partition = struct.unpack_from('>I', objhash)[0] >> self._part_shift

For a ring generated with partition power ``P``, the partition shift value is
``32 - P``.

*******************
Fractional Replicas
*******************

A ring is not restricted to having an integer number of replicas. In order to
support the gradual changing of replica counts, the ring is able to have a real
number of replicas.

When the number of replicas is not an integer, the last element of
``_replica2part2dev_id`` will have a length that is less than the partition
count for the ring. This means that some partitions will have more replicas
than others. For example, if a ring has ``3.25`` replicas, then 25% of its
partitions will have four replicas, while the remaining 75% will have just
three.

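The resulting array lengths can be sketched for a small hypothetical ring with
``3.25`` replicas (the partition power of 4 is invented to keep the numbers
tiny):

```python
# With 3.25 replicas and a partition power of 4 (hypothetical values),
# the first three per-replica arrays are full length and the fourth
# covers only a quarter of the partitions.
part_count = 2 ** 4          # 16 partitions
replicas = 3.25

lengths = [part_count] * int(replicas)
fraction = replicas - int(replicas)
if fraction:
    lengths.append(int(part_count * fraction))

assert lengths == [16, 16, 16, 4]
```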
.. _ring_dispersion:

--------
Overload
--------

The ring builder tries to keep replicas as far apart as possible while
still respecting device weights. When it can't do both, the overload
factor determines what happens. Each device may take some extra
fraction of its desired partitions to allow for replica dispersion;
once that extra fraction is exhausted, replicas will be placed closer
together than is optimal for durability.

Essentially, the overload factor lets the operator trade off replica
dispersion (durability) against device balance (uniform disk usage).

The default overload factor is ``0``, so device weights will be strictly
followed.

With an overload factor of ``0.1``, each device will accept 10% more
partitions than it otherwise would, but only if needed to maintain
dispersion.

Example: Consider a 3-node cluster of machines with equal-size disks;
let node A have 12 disks, node B have 12 disks, and node C have only
11 disks. Let the ring have an overload factor of ``0.1`` (10%).

Without the overload, some partitions would end up with replicas only
on nodes A and B. However, with the overload, every device is willing
to accept up to 10% more partitions if doing so helps disperse replicas
across all three nodes.

number of "nodes" to which keys in the keyspace must be assigned. Swift calls
these ranges `partitions` - they are partitions of the total keyspace.

Each partition will have multiple replicas. Every replica of each partition
must be assigned to a device in the ring. When describing a specific replica
of a partition (like when it's assigned a device) it is described as a
`part-replica` in that it is a specific `replica` of the specific `partition`.
A single device will likely be assigned different replicas from many
partitions, but it may not be assigned multiple replicas of a single partition.

The total number of partitions in a ring is calculated as ``2 **
<part-power>``. The total number of part-replicas in a ring is calculated as
``<replica-count> * 2 ** <part-power>``.

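For instance, with a hypothetical partition power of ``16`` and a replica
count of ``3``:

```python
part_power = 16    # hypothetical
replica_count = 3  # hypothetical

total_partitions = 2 ** part_power
total_part_replicas = replica_count * 2 ** part_power

assert total_partitions == 65536
assert total_part_replicas == 196608
```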
When considering a device's `weight` it is useful to describe the number of
part-replicas it would like to be assigned. A single device, regardless of
weight, will never hold more than ``2 ** <part-power>`` part-replicas because
it can not have more than one replica of any partition assigned. The number of
part-replicas a device can take by weights is calculated as its `parts-wanted`.
The true number of part-replicas assigned to a device can be compared to its
parts-wanted similarly to a calculation of percentage error - this deviation in
the observed result from the idealized target is called a device's `balance`.

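As a sketch of the idea (weight-proportional parts-wanted and a
percentage-error style balance; the numbers are invented and this is not
RingBuilder's exact code):

```python
def parts_wanted(weight, total_weight, replicas, part_count):
    # a device's weight-proportional share of all part-replicas
    return replicas * part_count * weight / total_weight

def balance(assigned, wanted):
    # deviation of the observed assignment from the idealized target,
    # expressed like a percentage error
    return 100.0 * (assigned - wanted) / wanted

wanted = parts_wanted(100.0, 400.0, 3, 2 ** 10)
assert wanted == 768.0
assert abs(balance(780, wanted) - 1.5625) < 1e-9
```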
When considering a device's `failure domain` it is useful to describe the number
of part-replicas it would like to be assigned. The number of part-replicas
wanted in a failure domain of a tier is the sum of the part-replicas wanted in
the failure domains of its sub-tier. However, collectively when the total
number of part-replicas in a failure domain exceeds or is equal to ``2 **
<part-power>`` it is most obvious that it's no longer sufficient to consider
only the number of total part-replicas, but rather the fraction of each
replica's partitions. Consider for example a ring with 3 replicas and 3
servers: while dispersion requires that each server hold only ⅓ of the total
part-replicas, placement is additionally constrained to require ``1.0`` replica
of *each* partition per server. It would not be sufficient to satisfy
dispersion if two devices on one of the servers each held a replica of a single
partition, while another server held none. By considering a decimal fraction
of one replica's worth of partitions in a failure domain we can derive the
total part-replicas wanted in a failure domain (``1.0 * 2 ** <part-power>``).
Additionally we infer more about `which` part-replicas must go in the failure
domain. Consider a ring with three replicas and two zones, each with two
servers (four servers total). The three replicas worth of partitions will be
assigned into two failure domains at the zone tier. Each zone must hold more
than one replica of some partitions. We represent this improper fraction of a
replica's worth of partitions in decimal form as ``1.5`` (``3.0 / 2``). This
tells us not only the *number* of total partitions (``1.5 * 2 **
<part-power>``) but also that *each* partition must have `at least` one replica
in this failure domain (in fact ``0.5`` of the partitions will have 2
replicas). Within each zone the two servers will hold ``0.75`` of a replica's
worth of partitions - this is equal both to "the fraction of a replica's worth
of partitions assigned to each zone (``1.5``) divided evenly among the number
of failure domains in its sub-tier (2 servers in each zone, i.e. ``1.5 / 2``)"
but *also* "the total number of replicas (``3.0``) divided evenly among the
total number of failure domains in the server tier (2 servers × 2 zones = 4,
i.e. ``3.0 / 4``)". It is useful to consider that each server in this ring
will hold only ``0.75`` of a replica's worth of partitions, which tells us that
any server should have `at most` one replica of a given partition assigned. In
the interests of brevity, some variable names will often refer to the concept
representing the fraction of a replica's worth of partitions in decimal form as
*replicanths* - this is meant to invoke connotations similar to ordinal numbers
as applied to fractions, but generalized to a replica instead of a four\ *th*
or a fif\ *th*. The "n" was probably thrown in because of Blade Runner.

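The replicanth arithmetic from the two-zone example above can be checked
directly:

```python
replicas = 3.0
zones = 2
servers_per_zone = 2

# replicanths at the zone tier: 3 replicas spread across 2 zones
zone_replicanths = replicas / zones
assert zone_replicanths == 1.5

# replicanths at the server tier, computed both ways described above
via_zone = zone_replicanths / servers_per_zone        # 1.5 / 2
via_total = replicas / (zones * servers_per_zone)     # 3.0 / 4
assert via_zone == via_total == 0.75
```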
-----------------
Building the Ring
-----------------

First the ring builder calculates the replicanths wanted at each tier in the
ring's topology based on weight.

Then the ring builder calculates the replicanths wanted at each tier in the
ring's topology based on dispersion.

Then the ring builder calculates the maximum deviation on a single device
between its weighted replicanths and wanted replicanths.

Next we interpolate between the two replicanth values (weighted & wanted) at
each tier using the specified overload (up to the maximum required overload).
We calculate the intersection of the line with the desired overload. This
becomes the target.

From the target we calculate the minimum and maximum number of replicas any
partition may have in a tier. This becomes the `replica-plan`.

Finally, we calculate the number of partitions that should ideally be assigned
to each device based on the replica-plan.

On initial balance (i.e., the first time partitions are placed to generate a
ring) we must assign each replica of each partition to the device that desires
the most partitions, excluding any devices that already have their maximum
number of replicas of that partition assigned to some parent tier of that
device's failure domain.

When building a new ring based on an old ring, the desired number of
partitions each device wants is recalculated from the current replica-plan.
Next the partitions to be reassigned are gathered up. Any removed devices have
all their assigned partitions unassigned and added to the gathered list. Any
partition replicas that (due to the addition of new devices) can be spread out


Whenever a partition has a replica reassigned, the time of the reassignment is
recorded. This is taken into account when gathering partitions to reassign so
that no partition is moved twice in a configurable amount of time. This
configurable amount of time is known internally to the RingBuilder class as
``min_part_hours``. This restriction is ignored for replicas of partitions on
devices that have been removed, as device removal should only happen on device
failure and there's no choice but to make a reassignment.

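A toy model of that cooldown check (the names here are invented; Swift's real tracking lives in the RingBuilder's internal last-move bookkeeping):

```python
import time

MIN_PART_HOURS = 1  # hypothetical setting; in Swift this is configurable

def can_move(part, last_move_epoch, device_removed=False, now=None):
    """Return True if this partition's replica may be reassigned: either
    its device was removed (failed), or min_part_hours have elapsed since
    the partition last moved."""
    if device_removed:
        return True  # no choice but to reassign
    if now is None:
        now = time.time()
    return (now - last_move_epoch[part]) >= MIN_PART_HOURS * 3600
```

The ``device_removed`` early return mirrors the exception described above: a failed device forces a move regardless of the cooldown.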
The above processes don't always perfectly rebalance a ring due to the random
nature of gathering partitions for reassignment. To help reach a more balanced
ring, the rebalance process is repeated a fixed number of times until the
replica-plan is fulfilled or unable to be fulfilled (indicating we probably
can't get perfect balance due to too many partitions recently moved).
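That retry behavior might be sketched like this, with a toy builder standing in for the real ``RingBuilder`` (the class, method names, and round cap are all invented for illustration):

```python
class ToyBuilder:
    """Stand-in for RingBuilder: each 'rebalance pass' fixes up to
    batch_size misplaced replicas, mimicking partial progress."""
    def __init__(self, misplaced, batch_size=10):
        self.misplaced = misplaced
        self.batch_size = batch_size

    def reassign_gathered_parts(self):
        moved = min(self.misplaced, self.batch_size)
        self.misplaced -= moved
        return moved

    def replica_plan_fulfilled(self):
        return self.misplaced == 0

MAX_REBALANCE_ROUNDS = 5  # invented cap; the real limit differs

def rebalance(builder):
    for _ in range(MAX_REBALANCE_ROUNDS):
        moved = builder.reassign_gathered_parts()
        if builder.replica_plan_fulfilled() or moved == 0:
            break  # balanced, or stuck (e.g. too many recent moves)
    return builder.misplaced
```

The loop stops early once the plan is satisfied, and the round cap bounds the work when recently moved partitions prevent a perfect balance.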

.. _composite_rings:

Composite Rings
---------------

.. automodule:: swift.common.ring.composite_builder

---------------------
Ring Builder Analyzer
---------------------

.. automodule:: swift.cli.ring_builder_analyzer

-------
History
-------

A "live ring" option was considered where each server could maintain its own
copy of the ring and the servers would use a gossip protocol to communicate the
changes they made. This was discarded as too complex and error prone to code
correctly in the project timespan available. One bug could easily gossip bad
data out to the entire cluster and be difficult to recover from. Having an
externally managed ring simplifies the process, allows full validation of data
before it's shipped out to the servers, and guarantees each server is using a

like the current process but where servers could submit change requests to the
ring server to have a new ring built and shipped back out to the servers. This
was discarded due to project time constraints and because ring changes are
currently infrequent enough that manual control was sufficient. However, lack
of quick automatic ring changes did mean that other components of the system
had to be coded to handle devices being unavailable for a period of hours until
someone could manually update the ring.

The current ring process has each replica of a partition independently assigned
to a device. A version of the ring that used a third of the memory was tried,
where the first replica of a partition was directly assigned and the other two
were determined by "walking" the ring until finding additional devices in other
zones. This was discarded due to the loss of control over how many replicas for
a given partition moved at once. Keeping each replica independent allows for
moving only one partition replica within a given time window (except due to
device failures). Using the additional memory was deemed a good trade-off for
moving data around the cluster much less often.

add up. In the end, the memory savings wasn't that great and more processing
power was used, so the idea was discarded.

A completely non-partitioned ring was also tried but discarded as the
partitioning helps many other components of the system, especially replication.
Replication can be attempted and retried in a partition batch with the other
replicas rather than each data item independently attempted and retried. Hashes
of directory structures can be calculated and compared with other replicas to
reduce directory walking and network traffic.

Partitioning and independently assigning partition replicas also allowed for
the best-balanced cluster. The best of the other strategies tended to give
±10% variance on device balance with devices of equal weight and ±15% with
devices of varying weights. The current strategy allows us to get ±3% and ±8%
respectively.

Various hashing algorithms were tried. SHA offers better security, but the ring

didn't always get it. After that, overload was added to the ring builder so
that operators could choose a balance between dispersion and device weights.
In time the overload concept was improved and made more accurate.

For more background on consistent hashing rings, please see
:doc:`ring_background`.

Building a Consistent Hashing Ring

Authored by Greg Holt
---------------------

This is a compilation of five posts I made earlier discussing how to build
a consistent hashing ring. The posts seemed to be accessed quite frequently,
so I've gathered them all here on one page for easier reading.


be done by creating “virtual nodes” for each node. So 100 nodes might have

90423 ids moved, 0.90%

There we go, we added 1% capacity and only moved 0.9% of existing data.
The vnode_range_starts list seems a bit out of place though. Its values
are calculated and never change for the lifetime of the cluster, so let’s
optimize that out.

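The optimization being hinted at can be sketched like this: because every vnode range has the same fixed size, an id's hash can be mapped straight to a vnode with a modulo, and the vnode_range_starts list disappears entirely. (This snippet is a reconstruction in the spirit of the posts, not a verbatim quote; the node and vnode counts are arbitrary.)

```python
from hashlib import md5
from struct import unpack_from

NODE_COUNT = 100
VNODE_COUNT = 1000
# vnode2node maps each virtual node to a physical node; round-robin here.
vnode2node = [v % NODE_COUNT for v in range(VNODE_COUNT)]

def node_for(item_id):
    # Top 4 bytes of the MD5 digest, reduced modulo the vnode count;
    # no range-start list is needed when all ranges are equal-sized.
    hsh = unpack_from('>I', md5(str(item_id).encode()).digest())[0]
    return vnode2node[hsh % VNODE_COUNT]
```

Only ``vnode2node`` must persist for the lifetime of the cluster; the lookup itself is a hash plus a modulo.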

domains, but does not provide any guarantees such as placing at least one
replica of every partition into each region. Composite rings are intended to
provide operators with greater control over the dispersion of object replicas
or fragments across a cluster, in particular when there is a desire to
have strict guarantees that some replicas or fragments are placed in certain
failure domains. This is particularly important for policies with duplicated
erasure-coded fragments.

A composite ring comprises two or more component rings that are combined to
form a single ring with a replica count equal to the sum of replica counts
from the component rings. The component rings are built independently, using
distinct devices in distinct regions, which means that the dispersion of
replicas between the components can be guaranteed. The ``composite_builder``
utilities may then be used to combine components into a composite ring.

For example, consider a normal ring ``ring0`` with replica count of 4 and
devices in two regions ``r1`` and ``r2``. Despite the best efforts of the

For rings to be formed into a composite they must satisfy the following
requirements:

* All component rings must have the same part power (and therefore number of
  partitions)
* All component rings must have an integer replica count
* Each region may only be used in one component ring
* Each device may only be used in one component ring

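The requirements above could be pre-checked with something like the following sketch. This is a hypothetical helper, not Swift's API; the real validation lives inside ``swift.common.ring.composite_builder``, and the dict layout used here (``part_power``, ``replicas``, ``regions``, ``devices``) is an invented simplification.

```python
def check_composite_requirements(components):
    """Raise ValueError if the component descriptions violate any of the
    four requirements for forming a composite ring."""
    part_powers = {c['part_power'] for c in components}
    if len(part_powers) != 1:
        raise ValueError('component rings must share the same part power')
    seen_regions, seen_devices = set(), set()
    for c in components:
        if c['replicas'] != int(c['replicas']):
            raise ValueError('replica counts must be integers')
        if c['regions'] & seen_regions:
            raise ValueError('each region may only be used in one component')
        if c['devices'] & seen_devices:
            raise ValueError('each device may only be used in one component')
        seen_regions |= c['regions']
        seen_devices |= c['devices']
```

Running such a check before composing fails fast on overlapping regions or devices rather than producing a composite with broken dispersion guarantees.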
Under the hood, the composite ring has a ``_replica2part2dev_id`` table that is
the union of the tables from the component rings. Whenever the component rings
are rebalanced, the composite ring must be rebuilt. There is no dynamic
rebuilding of the composite ring.

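A toy illustration of that union: each component contributes its replica rows, so the composite's replica count is the sum of the components'. (Swift stores these as compact array rows on the private ``_replica2part2dev_id`` attribute; plain lists of lists are used here for readability.)

```python
def combine_tables(component_tables):
    """Each table holds one row per replica; row[p] is the device id
    assigned to that replica of partition p. The composite table is just
    every row from every component, so its replica count is the sum."""
    widths = {len(row) for table in component_tables for row in table}
    assert len(widths) == 1, 'components must have equal partition counts'
    composite = []
    for table in component_tables:
        composite.extend(table)
    return composite
```

The equal-width assertion is the code-level reason all component rings must share one part power: rows of different partition counts cannot be stacked into one table.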
.. note::
    The order in which component rings are combined into a composite ring is

The ``id`` of each component RingBuilder is therefore stored in metadata of
the composite and used to check for the component ordering when the same
composite ring is re-composed. RingBuilder ``id``\s are normally assigned
when a RingBuilder instance is first saved. Older RingBuilder instances
loaded from file may not have an ``id`` assigned and will need to be saved
before they can be used as components of a composite ring. This can be
achieved by, for example::

    swift-ring-builder <builder-file> rebalance --force


def pre_validate_all_builders(builders):
    regions_info = {}
    for builder in builders:
        regions_info[builder] = set(
            dev['region'] for dev in builder._iter_devs())
    for first_region_set, second_region_set in combinations(
            regions_info.values(), 2):
        inter = first_region_set & second_region_set