openstack-manuals/doc/training-guide/module003-ch005-the-ring.xml

<?xml version="1.0" encoding="utf-8"?>
<chapter xmlns="http://docbook.org/ns/docbook"
    xmlns:xi="http://www.w3.org/2001/XInclude"
    xmlns:xlink="http://www.w3.org/1999/xlink"
    version="5.0"
    xml:id="module003-ch005-the-ring">
    <title>Ring Builder</title>
    <para>The rings are built and managed manually by a utility called
        the ring-builder. The ring-builder assigns partitions to
        devices and writes an optimized Python structure to a gzipped,
        serialized file on disk for shipping out to the servers. The
        server processes just check the modification time of the file
        occasionally and reload their in-memory copies of the ring
        structure as needed. Because of how the ring-builder manages
        changes to the ring, using a slightly older ring usually just
        means one of the three replicas for a subset of the partitions
        will be incorrect, which can be easily worked around.</para>
    <para>The ring-builder also keeps its own builder file with the
        ring information and additional data required to build future
        rings. It is very important to keep multiple backup copies of
        these builder files. One option is to copy the builder files
        out to every server while copying the ring files themselves.
        Another is to upload the builder files into the cluster
        itself. Complete loss of a builder file will mean creating a
        new ring from scratch, nearly all partitions will end up
        assigned to different devices, and therefore nearly all data
        stored will have to be replicated to new locations. So,
        recovery from a builder file loss is possible, but data will
        definitely be unreachable for an extended time.</para>
        <para><guilabel>Ring Data Structure</guilabel></para>
        <para>The ring data structure consists of three top level
            fields: a list of devices in the cluster, a list of lists
            of device ids indicating partition to device assignments,
            and an integer indicating the number of bits to shift an
            MD5 hash to calculate the partition for the hash.</para>
            <para><guilabel>Partition Assignment
        List</guilabel></para>
            <para>This is a list of array(‘H’) of devices ids. The
                outermost list contains an array(‘H’) for each
                replica. Each array(‘H’) has a length equal to the
                partition count for the ring. Each integer in the
                array(‘H’) is an index into the above list of devices.
                The partition list is known internally to the Ring
                class as _replica2part2dev_id.</para>
            <para>So, to create a list of device dictionaries assigned
                to a partition, the Python code would look like:
                devices = [self.devs[part2dev_id[partition]] for
                part2dev_id in self._replica2part2dev_id]</para>
            <para>That code is a little simplistic, as it does not
                account for the removal of duplicate devices. If a
                ring has more replicas than devices, then a partition
                will have more than one replica on one device; that’s
                simply the pigeonhole principle at work.</para>
            <para>array(‘H’) is used for memory conservation as there
                may be millions of partitions.</para>
            <para><guilabel>Fractional Replicas</guilabel></para>
            <para>A ring is not restricted to having an integer number
                of replicas. In order to support the gradual changing
                of replica counts, the ring is able to have a real
                number of replicas.</para>
            <para>When the number of replicas is not an integer, then
                the last element of _replica2part2dev_id will have a
                length that is less than the partition count for the
                ring. This means that some partitions will have more
                replicas than others. For example, if a ring has 3.25
                replicas, then 25% of its partitions will have four
                replicas, while the remaining 75% will have just
                three.</para>
            <para><guilabel>Partition Shift Value</guilabel></para>
            <para>The partition shift value is known internally to the
                Ring class as _part_shift. This value used to shift an
                MD5 hash to calculate the partition on which the data
                for that hash should reside. Only the top four bytes
                of the hash is used in this process. For example, to
                compute the partition for the path
                /account/container/object the Python code might look
                like: partition = unpack_from('&gt;I',
                md5('/account/container/object').digest())[0] &gt;&gt;
                self._part_shift</para>
            <para>For a ring generated with part_power P, the
                partition shift value is 32 - P.</para>
        <para><guilabel>Building the Ring</guilabel></para>
        <para>The initial building of the ring first calculates the
            number of partitions that should ideally be assigned to
            each device based the device’s weight. For example, given
            a partition power of 20, the ring will have 1,048,576
            partitions. If there are 1,000 devices of equal weight
            they will each desire 1,048.576 partitions. The devices
            are then sorted by the number of partitions they desire
            and kept in order throughout the initialization
            process.</para>
        <para>Note: each device is also assigned a random tiebreaker
            value that is used when two devices desire the same number
            of partitions. This tiebreaker is not stored on disk
            anywhere, and so two different rings created with the same
            parameters will have different partition assignments. For
            repeatable partition assignments, RingBuilder.rebalance()
            takes an optional seed value that will be used to seed
            Python’s pseudo-random number generator.</para>
        <para>Then, the ring builder assigns each replica of each
            partition to the device that desires the most partitions
            at that point while keeping it as far away as possible
            from other replicas. The ring builder prefers to assign a
            replica to a device in a regions that has no replicas
            already; should there be no such region available, the
            ring builder will try to find a device in a different
            zone; if not possible, it will look on a different server;
            failing that, it will just look for a device that has no
            replicas; finally, if all other options are exhausted, the
            ring builder will assign the replica to the device that
            has the fewest replicas already assigned. Note that
            assignment of multiple replicas to one device will only
            happen if the ring has fewer devices than it has
            replicas.</para>
        <para>When building a new ring based on an old ring, the
            desired number of partitions each device wants is
            recalculated. Next the partitions to be reassigned are
            gathered up. Any removed devices have all their assigned
            partitions unassigned and added to the gathered list. Any
            partition replicas that (due to the addition of new
            devices) can be spread out for better durability are
            unassigned and added to the gathered list. Any devices
            that have more partitions than they now desire have random
            partitions unassigned from them and added to the gathered
            list. Lastly, the gathered partitions are then reassigned
            to devices using a similar method as in the initial
            assignment described above.</para>
        <para>Whenever a partition has a replica reassigned, the time
            of the reassignment is recorded. This is taken into
            account when gathering partitions to reassign so that no
            partition is moved twice in a configurable amount of time.
            This configurable amount of time is known internally to
            the RingBuilder class as min_part_hours. This restriction
            is ignored for replicas of partitions on devices that have
            been removed, as removing a device only happens on device
            failure and there’s no choice but to make a
            reassignment.</para>
        <para>The above processes don’t always perfectly rebalance a
            ring due to the random nature of gathering partitions for
            reassignment. To help reach a more balanced ring, the
            rebalance process is repeated until near perfect (less 1%
            off) or when the balance doesn’t improve by at least 1%
            (indicating we probably can’t get perfect balance due to
            wildly imbalanced zones or too many partitions recently
            moved).</para>
</chapter>