Adding overload concept information to ring builder Change-Id: Id801f587c02da2b8f083677ad96f61826d89579d Closes-bug: #1409127
9.6 KiB
Ring-builder
Use the swift-ring-builder utility to build and manage rings. This utility assigns partitions to devices and writes an optimized Python structure to a gzipped, serialized file on disk for transmission to the servers. The server processes occasionally check the modification time of the file and reload in-memory copies of the ring structure as needed. If you use a slightly older version of the ring, one of the three replicas for a partition subset will be incorrect because of the way the ring-builder manages changes to the ring. You can work around this issue.
The ring-builder also keeps its own builder file with the ring information and additional data required to build future rings. It is very important to keep multiple backup copies of these builder files. One option is to copy the builder files out to every server while copying the ring files themselves. Another is to upload the builder files into the cluster itself. If you lose the builder file, you have to create a new ring from scratch. Nearly all partitions would be assigned to different devices and, therefore, nearly all of the stored data would have to be replicated to new locations. So, recovery from a builder file loss is possible, but data would be unreachable for an extended time.
Ring data structure
The ring data structure consists of three top level fields: a list of devices in the cluster, a list of lists of device ids indicating partition to device assignments, and an integer indicating the number of bits to shift an MD5 hash to calculate the partition for the hash.
Partition assignment list
This is a list of array('H')
of devices ids. The
outermost list contains an array('H')
for each replica.
Each array('H')
has a length equal to the partition count
for the ring. Each integer in the array('H')
is an index
into the above list of devices. The partition list is known internally
to the Ring class as _replica2part2dev_id
.
So, to create a list of device dictionaries assigned to a partition, the Python code would look like:
devices = [self.devs[part2dev_id[partition]] for
part2dev_id in self._replica2part2dev_id]
That code is a little simplistic because it does not account for the removal of duplicate devices. If a ring has more replicas than devices, a partition will have more than one replica on a device.
array('H')
is used for memory conservation as there may
be millions of partitions.
Overload
The ring builder tries to keep replicas as far apart as possible while still respecting device weights. When it can not do both, the overload factor determines what happens. Each device takes an extra fraction of its desired partitions to allow for replica dispersion; after that extra fraction is exhausted, replicas are placed closer together than optimal.
The overload factor lets the operator trade off replica dispersion (durability) against data dispersion (uniform disk usage).
The default overload factor is 0, so device weights are strictly followed.
With an overload factor of 0.1, each device accepts 10% more partitions than it otherwise would, but only if it needs to maintain partition dispersion.
For example, consider a 3-node cluster of machines with equal-size disks; node A has 12 disks, node B has 12 disks, and node C has 11 disks. The ring has an overload factor of 0.1 (10%).
Without the overload, some partitions would end up with replicas only on nodes A and B. However, with the overload, every device can accept up to 10% more partitions for the sake of dispersion. The missing disk in C means there is one disk's worth of partitions to spread across the remaining 11 disks, which gives each disk in C an extra 9.09% load. Since this is less than the 10% overload, there is one replica of each partition on each node.
However, this does mean that the disks in node C have more data than the disks in nodes A and B. If 80% full is the warning threshold for the cluster, node C's disks reach 80% full while A and B's disks are only 72.7% full.
Replica counts
To support the gradual change in replica counts, a ring can have a real number of replicas and is not restricted to an integer number of replicas.
A fractional replica count is for the whole ring and not for individual partitions. It indicates the average number of replicas for each partition. For example, a replica count of 3.2 means that 20 percent of partitions have four replicas and 80 percent have three replicas.
The replica count is adjustable.
Example:
$ swift-ring-builder account.builder set_replicas 4
$ swift-ring-builder account.builder rebalance
You must rebalance the replica ring in globally distributed clusters. Operators of these clusters generally want an equal number of replicas and regions. Therefore, when an operator adds or removes a region, the operator adds or removes a replica. Removing unneeded replicas saves on the cost of disks.
You can gradually increase the replica count at a rate that does not adversely affect cluster performance.
For example:
$ swift-ring-builder object.builder set_replicas 3.01
$ swift-ring-builder object.builder rebalance
<distribute rings and wait>...
$ swift-ring-builder object.builder set_replicas 3.02
$ swift-ring-builder object.builder rebalance
<distribute rings and wait>...
Changes take effect after the ring is rebalanced. Therefore, if you intend to change from 3 replicas to 3.01 but you accidentally type 2.01, no data is lost.
Additionally, the swift-ring-builder X.builder create
command can now take a decimal argument for the number of replicas.
Partition shift value
The partition shift value is known internally to the Ring class as
_part_shift
. This value is used to shift an MD5 hash to
calculate the partition where the data for that hash should reside. Only
the top four bytes of the hash is used in this process. For example, to
compute the partition for the /account/container/object
path using Python:
partition = unpack_from('>I',
md5('/account/container/object').digest())[0] >>
self._part_shift
For a ring generated with part_power P, the partition shift value is
32 - P
.
Build the ring
The ring builder process includes these high-level steps:
The utility calculates the number of partitions to assign to each device based on the weight of the device. For example, for a partition at the power of 20, the ring has 1,048,576 partitions. One thousand devices of equal weight each want 1,048.576 partitions. The devices are sorted by the number of partitions they desire and kept in order throughout the initialization process.
Note
Each device is also assigned a random tiebreaker value that is used when two devices desire the same number of partitions. This tiebreaker is not stored on disk anywhere, and so two different rings created with the same parameters will have different partition assignments. For repeatable partition assignments,
RingBuilder.rebalance()
takes an optional seed value that seeds the Python pseudo-random number generator.The ring builder assigns each partition replica to the device that requires most partitions at that point while keeping it as far away as possible from other replicas. The ring builder prefers to assign a replica to a device in a region that does not already have a replica. If no such region is available, the ring builder searches for a device in a different zone, or on a different server. If it does not find one, it looks for a device with no replicas. Finally, if all options are exhausted, the ring builder assigns the replica to the device that has the fewest replicas already assigned.
Note
The ring builder assigns multiple replicas to one device only if the ring has fewer devices than it has replicas.
When building a new ring from an old ring, the ring builder recalculates the desired number of partitions that each device wants.
The ring builder unassigns partitions and gathers these partitions for reassignment, as follows:
- The ring builder unassigns any assigned partitions from any removed devices and adds these partitions to the gathered list.
- The ring builder unassigns any partition replicas that can be spread out for better durability and adds these partitions to the gathered list.
- The ring builder unassigns random partitions from any devices that have more partitions than they need and adds these partitions to the gathered list.
The ring builder reassigns the gathered partitions to devices by using a similar method to the one described previously.
When the ring builder reassigns a replica to a partition, the ring builder records the time of the reassignment. The ring builder uses this value when it gathers partitions for reassignment so that no partition is moved twice in a configurable amount of time. The RingBuilder class knows this configurable amount of time as
min_part_hours
. The ring builder ignores this restriction for replicas of partitions on removed devices because removal of a device happens on device failure only, and reassignment is the only choice.
These steps do not always perfectly rebalance a ring due to the random nature of gathering partitions for reassignment. To help reach a more balanced ring, the rebalance process is repeated until near perfect (less than 1 percent off) or when the balance does not improve by at least 1 percent (indicating we probably cannot get perfect balance due to wildly imbalanced zones or too many partitions recently moved).