Ring doc cleanups

Change-Id: Ie51ea5c729341da793887e1e25c1e45301a96751
Tim Burke 2017-05-16 14:33:31 -07:00
parent 41c8f1330f
commit b5ee8c88d0
3 changed files with 199 additions and 188 deletions


@ -5,38 +5,40 @@ The Rings
The rings determine where data should reside in the cluster. There is a
separate ring for account databases, container databases, and individual
object storage policies but each ring works in the same way. These rings are
externally managed. The server processes themselves do not modify the
rings; they are instead given new rings modified by other tools.
The ring uses a configurable number of bits from the MD5 hash of an item's path
as a partition index that designates the device(s) on which that item should
be stored. The number of bits kept from the hash is known as the partition
power, and 2 to the partition power indicates the partition count. Partitioning
the full MD5 hash ring allows the cluster components to process resources in
batches. This ends up being either more efficient or at least less complex than
working with each item separately or the entire cluster all at once.
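
As a rough illustration (the partition power of ``16`` here is only an example
value, not a recommendation)::

    part_power = 16
    partition_count = 2 ** part_power    # 65536 partitions cover the hash space
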
Another configurable value is the replica count, which indicates how many
devices to assign for each partition in the ring. By having multiple devices
responsible for each partition, the cluster can recover from drive or network
failures.
Devices are added to the ring to describe the capacity available for
partition replica assignments. Devices are placed into failure domains
consisting of region, zone, and server. Regions can be used to describe
geographical systems characterized by lower bandwidth or higher latency between
machines in different regions. Many rings will consist of only a single
region. Zones can be used to group devices based on physical locations, power
separations, network separations, or any other attribute that would lessen
multiple replicas being unavailable at the same time.
Devices are given a weight which describes the relative storage capacity
contributed by the device in comparison to other devices.
When building a ring, replicas for each partition will be assigned to devices
according to the devices' weights. Additionally, each replica of a partition
will preferentially be assigned to a device whose failure domain does not
already have a replica for that partition. Only a single replica of a
partition may be assigned to each device - you must have at least as many
devices as replicas.
.. _ring_builder:
@ -45,24 +47,24 @@ Ring Builder
------------
The rings are built and managed manually by a utility called the ring-builder.
The ring-builder assigns partitions to devices and writes an optimized
structure to a gzipped, serialized file on disk for shipping out to the
servers. The server processes check the modification time of the file
occasionally and reload their in-memory copies of the ring structure as needed.
Because of how the ring-builder manages changes to the ring, using a slightly
older ring usually just means that for a subset of the partitions the device
for one of the replicas will be incorrect, which can be easily worked around.
The ring-builder also keeps a separate builder file which includes the ring
information as well as additional data required to build future rings. It is
very important to keep multiple backup copies of these builder files. One
option is to copy the builder files out to every server while copying the ring
files themselves. Another is to upload the builder files into the cluster
itself. Complete loss of a builder file will mean creating a new ring from
scratch, nearly all partitions will end up assigned to different devices, and
therefore nearly all data stored will have to be replicated to new locations.
So, recovery from a builder file loss is possible, but data will definitely be
unreachable for an extended time.
-------------------
Ring Data Structure
@ -77,13 +79,13 @@ to calculate the partition for the hash.
List of Devices
***************
The list of devices is known internally to the Ring class as ``devs``. Each
item in the list of devices is a dictionary with the following keys:
====== ======= ==============================================================
id     integer The index into the list of devices.
zone   integer The zone in which the device resides.
region integer The region in which the zone resides.
weight float   The relative weight of the device in comparison to other
               devices. This usually corresponds directly to the amount of
               disk space the device has compared to other devices. For
@ -94,10 +96,10 @@ weight float The relative weight of the device in comparison to other
               data than desired over time. A good average weight of 100.0
               allows flexibility in lowering the weight later if necessary.
ip     string  The IP address or hostname of the server containing the device.
port   int     The TCP port on which the server process listens to serve
               requests for the device.
device string  The on-disk name of the device on the server.
               For example: ``sdb1``
meta   string  A general-use field for storing additional information for the
               device. This information isn't used directly by the server
               processes, but can be useful in debugging. For example, the
@ -105,47 +107,54 @@ meta string A general-use field for storing additional information for the
               be stored here.
====== ======= ==============================================================
.. note::
    The list of devices may contain holes, or indexes set to ``None``, for
    devices that have been removed from the cluster. However, device ids are
    reused. Device ids are reused to avoid potentially running out of device id
    slots when there are available slots (from prior removal of devices). A
    consequence of this device id reuse is that the device id (integer value)
    does not necessarily correspond with the chronology of when the device was
    added to the ring. Also, some devices may be temporarily disabled by
    setting their weight to ``0.0``. To obtain a list of active devices (for
    uptime polling, for example) the Python code would look like::

        devices = list(self._iter_devs())
*************************
Partition Assignment List
*************************
The partition assignment list is known internally to the Ring class as
``_replica2part2dev_id``. This is a list of ``array('H')``\s, one for each
replica. Each ``array('H')`` has a length equal to the partition count for the
ring. Each integer in the ``array('H')`` is an index into the above list of
devices.
So, to create a list of device dictionaries assigned to a partition, the Python
code would look like::

    devices = [self.devs[part2dev_id[partition]]
               for part2dev_id in self._replica2part2dev_id]
``array('H')`` is used for memory conservation as there may be millions of
partitions.
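
For a sense of scale (the partition power and replica count below are only
illustrative values)::

    part_power = 20
    replica_count = 3
    # Each array('H') entry is 2 bytes, so the whole assignment table is about
    # replica_count * 2**part_power * 2 bytes -- roughly 6 MiB in this case.
    table_bytes = replica_count * (2 ** part_power) * 2
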
*********************
Partition Shift Value
*********************
The partition shift value is known internally to the Ring class as
``_part_shift``. This value is used to shift an MD5 hash of an item's path to
calculate the partition on which the data for that item should reside. Only the
top four bytes of the hash are used in this process. For example, to compute
the partition for the path ``/account/container/object``, the Python code might
look like::

    objhash = md5('/account/container/object').digest()
    partition = struct.unpack_from('>I', objhash)[0] >> self._part_shift

For a ring generated with partition power ``P``, the partition shift value is
``32 - P``.
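
For instance, with a hypothetical partition power of ``16``::

    part_power = 16
    part_shift = 32 - part_power    # 16
    # unpack_from('>I', ...) reads the top four bytes of the digest as a 32-bit
    # integer; shifting it right by 16 leaves a partition number in 0..65535.
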
*******************
Fractional Replicas
@ -155,11 +164,12 @@ A ring is not restricted to having an integer number of replicas. In order to
support the gradual changing of replica counts, the ring is able to have a real
number of replicas.
When the number of replicas is not an integer, the last element of
``_replica2part2dev_id`` will have a length that is less than the partition
count for the ring. This means that some partitions will have more replicas
than others. For example, if a ring has ``3.25`` replicas, then 25% of its
partitions will have four replicas, while the remaining 75% will have just
three.
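
A small sketch of the resulting table shapes (the partition power of ``2`` is
unrealistically tiny, just to keep the numbers readable)::

    replica_count = 3.25
    part_power = 2
    partition_count = 2 ** part_power                                  # 4

    full_rows = int(replica_count)                 # 3 arrays of length 4
    last_row_len = int(partition_count * (replica_count - full_rows))  # 1
    # Exactly 1 of the 4 partitions (25%) gets a fourth replica.
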
.. _ring_dispersion:
@ -186,24 +196,24 @@ Overload
The ring builder tries to keep replicas as far apart as possible while
still respecting device weights. When it can't do both, the overload
factor determines what happens. Each device may take some extra
fraction of its desired partitions to allow for replica dispersion;
once that extra fraction is exhausted, replicas will be placed closer
together than is optimal for durability.
Essentially, the overload factor lets the operator trade off replica
dispersion (durability) against device balance (uniform disk usage).
The default overload factor is ``0``, so device weights will be strictly
followed.
With an overload factor of ``0.1``, each device will accept 10% more
partitions than it otherwise would, but only if needed to maintain
dispersion.
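
The effect can be sketched roughly as follows; the names and values below are
made up for illustration and are not the actual ring-builder variables::

    overload = 0.1
    device_weight, total_weight = 100.0, 3200.0
    replica_count, part_power = 3, 16

    # A device's weighted share of the total part-replicas...
    desired_parts = (device_weight / total_weight
                     * replica_count * 2 ** part_power)
    # ...may be exceeded by up to the overload factor, but only when the
    # builder needs the extra room to keep replicas dispersed.
    max_parts_with_overload = desired_parts * (1 + overload)
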
Example: Consider a 3-node cluster of machines with equal-size disks;
let node A have 12 disks, node B have 12 disks, and node C have only
11 disks. Let the ring have an overload factor of ``0.1`` (10%).
Without the overload, some partitions would end up with replicas only
on nodes A and B. However, with the overload, every device is willing
@ -228,65 +238,63 @@ number of "nodes" to which keys in the keyspace must be assigned. Swift calls
these ranges `partitions` - they are partitions of the total keyspace.
Each partition will have multiple replicas. Every replica of each partition
must be assigned to a device in the ring. When describing a specific replica
of a partition (like when it's assigned a device) it is described as a
`part-replica` in that it is a specific `replica` of the specific `partition`.
A single device will likely be assigned different replicas from many
partitions, but it may not be assigned multiple replicas of a single partition.
The total number of partitions in a ring is calculated as ``2 **
<part-power>``. The total number of part-replicas in a ring is calculated as
``<replica-count> * 2 ** <part-power>``.
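
Plugging in some illustrative (not default) values::

    part_power = 16
    replica_count = 3

    total_partitions = 2 ** part_power                       # 65536
    total_part_replicas = replica_count * total_partitions   # 196608
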
When considering a device's `weight` it is useful to describe the number of
part-replicas it would like to be assigned. A single device, regardless of
weight, will never hold more than ``2 ** <part-power>`` part-replicas because
it can not have more than one replica of any partition assigned. The number of
part-replicas a device can take by weights is calculated as its `parts-wanted`.
The true number of part-replicas assigned to a device can be compared to its
parts-wanted similarly to a calculation of percentage error - this deviation in
the observed result from the idealized target is called a device's `balance`.
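
A rough sketch of the idea, again with made-up names and values (the real
RingBuilder tracks these quantities internally)::

    replica_count, part_power = 3, 16
    device_weight, total_weight = 100.0, 3200.0
    parts_assigned = 6300

    parts_wanted = (device_weight / total_weight
                    * replica_count * 2 ** part_power)                # 6144.0
    balance = 100.0 * (parts_assigned - parts_wanted) / parts_wanted  # ~2.5%
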
When considering a device's `failure domain` it is useful to describe the number
of part-replicas it would like to be assigned. The number of part-replicas
wanted in a failure domain of a tier is the sum of the part-replicas wanted in
the failure domains of its sub-tier. However, collectively when the total
number of part-replicas in a failure domain exceeds or is equal to ``2 **
<part-power>`` it is most obvious that it's no longer sufficient to consider
only the number of total part-replicas, but rather the fraction of each
replica's partitions. Consider for example a ring with 3 replicas and 3
servers: while dispersion requires that each server hold only ⅓ of the total
part-replicas, placement is additionally constrained to require ``1.0`` replica
of *each* partition per server. It would not be sufficient to satisfy
dispersion if two devices on one of the servers each held a replica of a single
partition, while another server held none. By considering a decimal fraction
of one replica's worth of partitions in a failure domain we can derive the
total part-replicas wanted in a failure domain (``1.0 * 2 ** <part-power>``).
Additionally we infer more about `which` part-replicas must go in the failure
domain. Consider a ring with three replicas and two zones, each with two
servers (four servers total). The three replicas worth of partitions will be
assigned into two failure domains at the zone tier. Each zone must hold more
than one replica of some partitions. We represent this improper fraction of a
replica's worth of partitions in decimal form as ``1.5`` (``3.0 / 2``). This
tells us not only the *number* of total partitions (``1.5 * 2 **
<part-power>``) but also that *each* partition must have `at least` one replica
in this failure domain (in fact ``0.5`` of the partitions will have 2
replicas). Within each zone the two servers will hold ``0.75`` of a replica's
worth of partitions - this is equal both to "the fraction of a replica's worth
of partitions assigned to each zone (``1.5``) divided evenly among the number
of failure domains in its sub-tier (2 servers in each zone, i.e. ``1.5 / 2``)"
but *also* "the total number of replicas (``3.0``) divided evenly among the
total number of failure domains in the server tier (2 servers × 2 zones = 4,
i.e. ``3.0 / 4``)". It is useful to consider that each server in this ring
will hold only ``0.75`` of a replica's worth of partitions, which tells us that
any server should have `at most` one replica of a given partition assigned. In
the interests of brevity, some variable names will often refer to the concept
representing the fraction of a replica's worth of partitions in decimal form as
*replicanths* - this is meant to invoke connotations similar to ordinal numbers
as applied to fractions, but generalized to a replica instead of a four\*th* or
a fif\*th*. The "n" was probably thrown in because of Blade Runner.
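
The arithmetic from the two-zone example above, written out (the partition
power is illustrative)::

    replica_count = 3.0
    zones = 2
    servers_per_zone = 2

    zone_replicanths = replica_count / zones                     # 1.5
    server_replicanths = zone_replicanths / servers_per_zone     # 0.75
    # equivalently: replica_count / (zones * servers_per_zone) == 3.0 / 4

    part_power = 16
    parts_wanted_per_zone = zone_replicanths * 2 ** part_power   # 98304.0
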
-----------------
Building the Ring
@ -298,8 +306,8 @@ ring's topology based on weight.
Then the ring builder calculates the replicanths wanted at each tier in the
ring's topology based on dispersion.
Then the ring builder calculates the maximum deviation on a single device
between its weighted replicanths and wanted replicanths.
Next we interpolate between the two replicanth values (weighted & wanted) at
each tier using the specified overload (up to the maximum required overload).
@ -309,19 +317,19 @@ calculate the intersection of the line with the desired overload. This
becomes the target.
From the target we calculate the minimum and maximum number of replicas any
partition may have in a tier. This becomes the `replica-plan`.
Finally, we calculate the number of partitions that should ideally be assigned
to each device based on the replica-plan.
On initial balance (i.e., the first time partitions are placed to generate a
ring) we must assign each replica of each partition to the device that desires
the most partitions, excluding any devices that already have their maximum
number of replicas of that partition assigned to some parent tier of that
device's failure domain.
When building a new ring based on an old ring, the desired number of
partitions each device wants is recalculated from the current replica-plan.
Next the partitions to be reassigned are gathered up. Any removed devices have
all their assigned partitions unassigned and added to the gathered list. Any
partition replicas that (due to the addition of new devices) can be spread out
@ -335,21 +343,16 @@ Whenever a partition has a replica reassigned, the time of the reassignment is
recorded. This is taken into account when gathering partitions to reassign so
that no partition is moved twice in a configurable amount of time. This
configurable amount of time is known internally to the RingBuilder class as
``min_part_hours``. This restriction is ignored for replicas of partitions on
devices that have been removed, as device removal should only happen on device
failure and there's no choice but to make a reassignment.
The above processes don't always perfectly rebalance a ring due to the random
nature of gathering partitions for reassignment. To help reach a more balanced
ring, the rebalance process is repeated a fixed number of times until the
replica-plan is fulfilled or unable to be fulfilled (indicating we probably
can't get perfect balance due to too many partitions recently moved).
.. _composite_rings:
@ -358,6 +361,11 @@ Composite Rings
---------------
.. automodule:: swift.common.ring.composite_builder
---------------------
Ring Builder Analyzer
---------------------
.. automodule:: swift.cli.ring_builder_analyzer
-------
History
-------
@ -371,7 +379,7 @@ discarded.
A "live ring" option was considered where each server could maintain its own
copy of the ring and the servers would use a gossip protocol to communicate the
changes they made. This was discarded as too complex and error prone to code
correctly in the project timespan available. One bug could easily gossip bad
data out to the entire cluster and be difficult to recover from. Having an
externally managed ring simplifies the process, allows full validation of data
before it's shipped out to the servers, and guarantees each server is using a
@ -385,16 +393,16 @@ like the current process but where servers could submit change requests to the
ring server to have a new ring built and shipped back out to the servers. This
was discarded due to project time constraints and because ring changes are
currently infrequent enough that manual control was sufficient. However, lack
of quick automatic ring changes did mean that other components of the system
had to be coded to handle devices being unavailable for a period of hours until
someone could manually update the ring.
The current ring process has each replica of a partition independently assigned
to a device. A version of the ring that used a third of the memory was tried,
where the first replica of a partition was directly assigned and the other two
were determined by "walking" the ring until finding additional devices in other
zones. This was discarded due to the loss of control over how many replicas for
a given partition moved at once. Keeping each replica independent allows for
moving only one partition replica within a given time window (except due to
device failures). Using the additional memory was deemed a good trade-off for
moving data around the cluster much less often.
@ -409,16 +417,16 @@ add up. In the end, the memory savings wasn't that great and more processing
power was used, so the idea was discarded.
A completely non-partitioned ring was also tried but discarded as the
partitioning helps many other components of the system, especially replication.
Replication can be attempted and retried in a partition batch with the other
replicas rather than each data item independently attempted and retried. Hashes
of directory structures can be calculated and compared with other replicas to
reduce directory walking and network traffic.
Partitioning and independently assigning partition replicas also allowed for
the best-balanced cluster. The best of the other strategies tended to give
±10% variance on device balance with devices of equal weight and ±15% with
devices of varying weights. The current strategy allows us to get ±3% and ±8%
respectively.
Various hashing algorithms were tried. SHA offers better security, but the ring
@ -441,4 +449,5 @@ didn't always get it. After that, overload was added to the ring builder so
that operators could choose a balance between dispersion and device weights.
In time the overload concept was improved and made more accurate.

For more background on consistent hashing rings, please see
:doc:`ring_background`.


@ -6,7 +6,7 @@ Building a Consistent Hashing Ring
Authored by Greg Holt
---------------------
This is a compilation of five posts I made earlier discussing how to build
a consistent hashing ring. The posts seemed to be accessed quite frequently,
so I've gathered them all here on one page for easier reading.
@ -227,7 +227,7 @@ be done by creating “virtual nodes” for each node. So 100 nodes might have
    90423 ids moved, 0.90%

There we go, we added 1% capacity and only moved 0.9% of existing data.
The vnode_range_starts list seems a bit out of place though. Its values
are calculated and never change for the lifetime of the cluster, so let's
optimize that out.


@ -19,15 +19,16 @@ domains, but does not provide any guarantees such as placing at least one
replica of every partition into each region. Composite rings are intended to
provide operators with greater control over the dispersion of object replicas
or fragments across a cluster, in particular when there is a desire to
have strict guarantees that some replicas or fragments are placed in certain
failure domains. This is particularly important for policies with duplicated
erasure-coded fragments.
A composite ring comprises two or more component rings that are combined to
form a single ring with a replica count equal to the sum of replica counts
from the component rings. The component rings are built independently, using
distinct devices in distinct regions, which means that the dispersion of
replicas between the components can be guaranteed. The ``composite_builder``
utilities may then be used to combine components into a composite ring.
For example, consider a normal ring ``ring0`` with replica count of 4 and
devices in two regions ``r1`` and ``r2``. Despite the best efforts of the
@ -56,15 +57,16 @@ composite ring.
For rings to be formed into a composite they must satisfy the following
requirements:
* All component rings must have the same part power (and therefore number of
  partitions)
* All component rings must have an integer replica count
* Each region may only be used in one component ring
* Each device may only be used in one component ring
Under the hood, the composite ring has a ``_replica2part2dev_id`` table that is
the union of the tables from the component rings. Whenever the component rings
are rebalanced, the composite ring must be rebuilt. There is no dynamic
rebuilding of the composite ring.
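
Conceptually, the combination looks something like the sketch below; the tables
are toy stand-ins, and the real ``composite_builder`` utilities additionally
merge the device lists and validate the components::

    # Toy assignment tables: each component has 2 replicas and 4 partitions,
    # with device ids that are already unique across the two components.
    ring1_table = [[0, 1, 2, 3], [3, 2, 1, 0]]
    ring2_table = [[4, 5, 6, 7], [7, 6, 5, 4]]

    composite_table = ring1_table + ring2_table
    # The composite has replica count 4: the sum of the component counts.
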
.. note::
    The order in which component rings are combined into a composite ring is
@ -77,11 +79,11 @@ of the composite ring.
    The ``id`` of each component RingBuilder is therefore stored in metadata of
    the composite and used to check for the component ordering when the same
    composite ring is re-composed. RingBuilder ``id``\s are normally assigned
    when a RingBuilder instance is first saved. Older RingBuilder instances
    loaded from file may not have an ``id`` assigned and will need to be saved
    before they can be used as components of a composite ring. This can be
    achieved by, for example::

        swift-ring-builder <builder-file> rebalance --force
@ -147,7 +149,7 @@ def pre_validate_all_builders(builders):
    regions_info = {}
    for builder in builders:
        regions_info[builder] = set(
            dev['region'] for dev in builder._iter_devs())
    for first_region_set, second_region_set in combinations(
            regions_info.values(), 2):
        inter = first_region_set & second_region_set