<?xml version="1.0" encoding="utf-8"?>
<chapter xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="module003-ch004-swift-building-blocks">
<title>Building Blocks of Swift</title>
<para>The components that enable Swift to deliver high
availability, high durability and high concurrency
are:</para>
<itemizedlist>
<listitem>
<para><emphasis role="bold">Proxy
Servers:</emphasis>Handles all incoming API
requests.</para>
</listitem>
<listitem>
<para><emphasis role="bold">Rings:</emphasis>Maps
logical names of data to locations on particular
disks.</para>
</listitem>
<listitem>
<para><emphasis role="bold">Zones:</emphasis>Each Zone
isolates data from other Zones. A failure in one Zone
doesnt impact the rest of the cluster because data is
replicated across the Zones.</para>
</listitem>
<listitem>
<para><emphasis role="bold">Accounts &amp;
Containers:</emphasis>Each Account and Container
are individual databases that are distributed across
the cluster. An Account database contains the list of
Containers in that Account. A Container database
contains the list of Objects in that Container</para>
</listitem>
<listitem>
<para><emphasis role="bold">Objects:</emphasis>The
data itself.</para>
</listitem>
<listitem>
<para><emphasis role="bold">Partitions:</emphasis>A
Partition stores Objects, Account databases and
Container databases. Its an intermediate 'bucket'
that helps manage locations where data lives in the
cluster.</para>
</listitem>
</itemizedlist>
<figure>
<title>Building Blocks</title>
<mediaobject>
<imageobject>
<imagedata fileref="figures/image40.png"/>
</imageobject>
</mediaobject>
</figure>
<para><guilabel>Proxy Servers</guilabel></para>
<para>The Proxy Servers are the public face of Swift and
handle all incoming API requests. Once a Proxy Server
receives a request, it determines the storage nodes
based on the URL of the object, for example
https://swift.example.com/v1/account/container/object. The
Proxy Servers also coordinate responses, handle failures,
and coordinate timestamps.</para>
<para>Proxy servers use a shared-nothing architecture and can
be scaled as needed based on projected workloads. A
minimum of two Proxy Servers should be deployed for
redundancy. Should one proxy server fail, the others will
take over.</para>
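<para>As a rough illustration of this request routing, the
following Python sketch splits a Swift-style URL path into
its account, container, and object parts. The path layout
follows the example above; the function name and code are
illustrative only, not the actual proxy implementation.</para>
<programlisting language="python"># Minimal sketch: split a Swift-style request path into its parts.
def split_path(path):
    # e.g. "/v1/account/container/object"
    parts = path.lstrip('/').split('/', 3)
    version, account = parts[0], parts[1]
    container = parts[2] if len(parts) > 2 else None
    obj = parts[3] if len(parts) > 3 else None
    return version, account, container, obj

print(split_path("/v1/AUTH_test/photos/cat.jpg"))
# ('v1', 'AUTH_test', 'photos', 'cat.jpg')</programlisting>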
<para><guilabel>The Ring</guilabel></para>
<para>A ring represents a mapping between the names of entities
stored on disk and their physical location. There are separate
rings for accounts, containers, and objects. When other
components need to perform any operation on an object,
container, or account, they need to interact with the
appropriate ring to determine its location in the
cluster.</para>
<para>The Ring maintains this mapping using zones, devices,
partitions, and replicas. Each partition in the ring is
replicated, by default, 3 times across the cluster, and the
locations for a partition are stored in the mapping maintained
by the ring. The ring is also responsible for determining
which devices are used for hand off in failure
scenarios.</para>
<para>Data can be isolated with the concept of zones in the
ring. Each replica of a partition is guaranteed to reside
in a different zone. A zone could represent a drive, a
server, a cabinet, a switch, or even a data center.</para>
<para>The partitions of the ring are equally divided among all
the devices in the OpenStack Object Storage installation.
When partitions need to be moved around (for example if a
device is added to the cluster), the ring ensures that a
minimum number of partitions are moved at a time, and only
one replica of a partition is moved at a time.</para>
<para>Weights can be used to balance the distribution of
partitions on drives across the cluster. This can be
useful, for example, when different sized drives are used
in a cluster.</para>
<para>The ring is used by the Proxy server and several
background processes (like replication).</para>
<figure>
<title>The Lord of the <emphasis role="bold"
>Ring</emphasis>s</title>
<mediaobject>
<imageobject>
<imagedata fileref="figures/image41.png"/>
</imageobject>
</mediaobject>
</figure>
<para>The Ring maps partitions to physical locations on
disk.</para>
<para>The rings determine where data should reside in the
cluster. There is a separate ring for account databases,
container databases, and individual objects but each ring
works in the same way. These rings are externally managed,
in that the server processes themselves do not modify the
rings; they are instead given new rings modified by other
tools.</para>
<para>The ring uses a configurable number of bits from a
path's MD5 hash as a partition index that designates a
device. The number of bits kept from the hash is known as
the partition power, and 2 to the partition power
indicates the partition count. Partitioning the full MD5
hash ring allows other parts of the cluster to work on
batches of items at once, which is either more
efficient or at least less complex than working with each
item separately or the entire cluster all at once.</para>
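<para>A hedged sketch of that calculation: keep the top
"partition power" bits of the MD5 hash of the object's path.
Real deployments also mix a per-cluster hash prefix and
suffix into the path before hashing, which is omitted here;
the partition power value below is only an example.</para>
<programlisting language="python">import hashlib

PART_POWER = 10            # 2**10 = 1024 partitions (example value)

def get_partition(account, container=None, obj=None):
    path = '/' + '/'.join(p for p in (account, container, obj) if p)
    digest = hashlib.md5(path.encode('utf-8')).digest()
    # Take the top 4 bytes of the hash, keep only the top PART_POWER bits.
    top = int.from_bytes(digest[:4], 'big')
    return top >> (32 - PART_POWER)

print(get_partition('AUTH_test', 'photos', 'cat.jpg'))</programlisting>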
<para>Another configurable value is the replica count, which
indicates how many of the partition-&gt;device assignments
comprise a single ring. For a given partition number, each
replica's device will not be in the same zone as any other
replica's device. Zones can be used to group devices based on
physical locations, power separations, network separations, or
any other attribute that would lessen the chance of multiple
replicas being unavailable at the same time.</para>
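<para>The zone rule can be pictured with a small, purely
illustrative sketch: assign each replica of a partition to a
device, never reusing a zone. Real ring building also
accounts for device weights and minimizes partition movement;
only the "one replica per zone" constraint is shown.</para>
<programlisting language="python">REPLICAS = 3

devices = [
    {'id': 0, 'zone': 1}, {'id': 1, 'zone': 1},
    {'id': 2, 'zone': 2}, {'id': 3, 'zone': 3},
    {'id': 4, 'zone': 4},
]

def place_replicas(devices, replicas=REPLICAS):
    chosen, used_zones = [], set()
    for dev in devices:
        if dev['zone'] not in used_zones:
            chosen.append(dev['id'])
            used_zones.add(dev['zone'])
        if len(chosen) == replicas:
            break
    return chosen

print(place_replicas(devices))   # [0, 2, 3]: one device per zone</programlisting>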
<para><guilabel>Zones: Failure Boundaries</guilabel></para>
<para>Swift allows zones to be configured to isolate
failure boundaries. Each replica of the data resides
in a separate zone, if possible. At the smallest
level, a zone could be a single drive or a grouping of
a few drives. If there were five object storage
servers, then each server would represent its own
zone. Larger deployments would have an entire rack (or
multiple racks) of object servers, each representing a
zone. The goal of zones is to allow the cluster to
tolerate significant outages of storage servers
without losing all replicas of the data.</para>
<para>As we learned earlier, everything in Swift is
stored, by default, three times. Swift will place each
replica "as-uniquely-as-possible" to ensure both high
availability and high durability. This means that when
choosing a replica location, Swift will choose a server
in an unused zone before an unused server in a zone
that already has a replica of the data.</para>
<figure>
<title>Zones</title>
<mediaobject>
<imageobject>
<imagedata fileref="figures/image42.png"/>
</imageobject>
</mediaobject>
</figure>
<para>When a disk fails, replica data is automatically
distributed to the other zones to ensure there are
three copies of the data.</para>
<para><guilabel>Accounts &amp;
Containers</guilabel></para>
<para>Each account and container is an individual SQLite
database that is distributed across the cluster. An
account database contains the list of containers in
that account. A container database contains the list
of objects in that container.</para>
<figure>
<title>Accounts and Containers</title>
<mediaobject>
<imageobject>
<imagedata fileref="figures/image43.png"/>
</imageobject>
</mediaobject>
</figure>
<para>To keep track of object data location, each account
in the system has a database that references all its
containers, and each container database references
each object.</para>
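<para>A simplified way to picture these databases, using
Python's sqlite3 module: the schemas below are illustrative
only, as the real Swift schemas carry additional bookkeeping
columns such as timestamps, delete markers, and sync
points.</para>
<programlisting language="python">import sqlite3

account_db = sqlite3.connect(':memory:')    # one database per account
account_db.execute('CREATE TABLE container (name TEXT, object_count INTEGER)')

container_db = sqlite3.connect(':memory:')  # one database per container
container_db.execute('CREATE TABLE object (name TEXT, size INTEGER, etag TEXT)')

# The account database lists containers; the container database lists objects.
account_db.execute("INSERT INTO container VALUES ('photos', 1)")
container_db.execute("INSERT INTO object VALUES ('cat.jpg', 48211, 'abc123')")

print(account_db.execute('SELECT name FROM container').fetchall())</programlisting>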
<para><guilabel>Partitions</guilabel></para>
<para>A Partition is a collection of stored data,
including Account databases, Container databases, and
objects. Partitions are core to the replication
system.</para>
<para>Think of a Partition as a bin moving throughout a
fulfillment center warehouse. Individual orders get
thrown into the bin. The system treats that bin as a
cohesive entity as it moves throughout the system. A
bin full of things is easier to deal with than lots of
little things. It makes for fewer moving parts
throughout the system.</para>
<para>The system replicators and object uploads/downloads
operate on Partitions. As the system scales up,
behavior continues to be predictable because the number of
Partitions is fixed.</para>
<para>The implementation of a Partition is conceptually
simple -- a partition is just a directory sitting on a
disk with a corresponding hash table of what it
contains.</para>
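<para>On disk, this is what that "directory with a hash
table" can look like. The layout sketched below resembles a
typical object-server data directory, but the exact paths
are shown only as an illustration.</para>
<programlisting language="python">import os

def object_path(device_root, partition, name_hash, timestamp):
    # e.g. /srv/node/sdb1/objects/1024/5f5/f5dc...b5f5/1404116149.12345.data
    suffix = name_hash[-3:]           # last characters of the name hash
    return os.path.join(device_root, 'objects', str(partition),
                        suffix, name_hash, timestamp + '.data')

print(object_path('/srv/node/sdb1', 1024,
                  'f5dcafea9a0a5b0fabf0cbd4f5d6b5f5', '1404116149.12345'))</programlisting>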
<figure>
<title>Partitions</title>
<mediaobject>
<imageobject>
<imagedata fileref="figures/image44.png"/>
</imageobject>
</mediaobject>
</figure>
<para>Swift partitions contain all data in the
system.</para>
<para><guilabel>Replication</guilabel></para>
<para>In order to ensure that there are three copies of
the data everywhere, replicators continuously examine
each Partition. For each local Partition, the
replicator compares it against the replicated copies
in the other Zones to see if there are any
differences.</para>
<para>How does the replicator know if replication needs to
take place? It does this by examining hashes. A hash
file is created for each Partition, which contains
hashes of each directory in the Partition. For a given
Partition, the hash files for each of the Partition's
copies are compared. If the hashes are different, then
it is time to replicate, and the directory that needs
to be replicated is copied over.</para>
<para>This is where the Partitions come in handy. With
fewer "things" in the system, larger chunks of data
are transferred around (rather than lots of little TCP
connections, which is inefficient) and there is a
consistent number of hashes to compare.</para>
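<para>A minimal sketch of that comparison, with illustrative
names: each copy of a partition keeps a small file of
per-directory hashes, and only the directories whose hashes
differ need to be shipped to the other replicas.</para>
<programlisting language="python">def dirs_to_sync(local_hashes, remote_hashes):
    """Return the suffix directories whose contents differ."""
    out_of_sync = []
    for suffix, digest in local_hashes.items():
        if remote_hashes.get(suffix) != digest:
            out_of_sync.append(suffix)
    return out_of_sync

local  = {'5f5': 'aaa111', '0ab': 'bbb222'}
remote = {'5f5': 'aaa111', '0ab': 'ccc333'}
print(dirs_to_sync(local, remote))   # ['0ab'] needs to be replicated</programlisting>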
<para>The cluster has eventually consistent behavior where
the newest data wins.</para>
<figure>
<title>Replication</title>
<mediaobject>
<imageobject>
<imagedata fileref="figures/image45.png"/>
</imageobject>
</mediaobject>
</figure>
<para>If a zone goes down, one of the nodes containing a
replica notices and proactively copies data to a
handoff location.</para>
<para>To describe how these pieces all come together, let's walk
through a few scenarios and introduce the components.</para>
<para><guilabel>Bird's-eye View</guilabel></para>
<para><emphasis role="bold">Upload</emphasis></para>
<para>A client uses the REST API to make an HTTP request to PUT
an object into an existing Container. The cluster receives
the request. First, the system must figure out where the
data is going to go. To do this, the Account name,
Container name and Object name are all used to determine
the Partition where this object should live.</para>
<para>Then a lookup in the Ring figures out which storage
nodes contain the Partitions in question.</para>
<para>The data is then sent to each storage node, where it is
placed in the appropriate Partition. A quorum is required
-- at least two of the three writes must be successful
before the client is notified that the upload was
successful.</para>
<para>Next, the Container database is updated asynchronously
to reflect that there is a new object in it.</para>
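<para>The quorum rule can be sketched as follows; the
put_to_node callable is a hypothetical stand-in for the real
per-node request, and the node names are made up.</para>
<programlisting language="python">def upload(nodes, data, put_to_node):
    # Write to every node the ring returned; succeed on a majority.
    successes = sum(1 for node in nodes if put_to_node(node, data))
    quorum = len(nodes) // 2 + 1          # 2 of 3 with default replicas
    return successes >= quorum

# Example: two of three simulated nodes accept the write.
results = {'node1': True, 'node2': True, 'node3': False}
ok = upload(list(results), b'...', lambda node, data: results[node])
print(ok)   # True: the client is told the upload succeeded</programlisting>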
<figure>
<title>When End-User uses Swift</title>
<mediaobject>
<imageobject>
<imagedata fileref="figures/image46.png"/>
</imageobject>
</mediaobject>
</figure>
<para><emphasis role="bold">Download</emphasis></para>
<para>A request comes in for an Account/Container/object.
Using the same consistent hashing, the Partition name is
generated. A lookup in the Ring reveals which storage
nodes contain that Partition. A request is made to one of
the storage nodes to fetch the object and if that fails,
requests are made to the other nodes.</para>
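<para>The read path can be sketched the same way; the
get_from_node callable below is a hypothetical stand-in for
the real per-node request.</para>
<programlisting language="python">def download(nodes, get_from_node):
    # Try the primary nodes for the partition in turn.
    for node in nodes:
        obj = get_from_node(node)
        if obj is not None:
            return obj
    raise IOError('object unavailable on all primary nodes')

# Example: the first node fails, the second serves the object.
replies = {'node1': None, 'node2': b'object bytes', 'node3': b'object bytes'}
print(download(['node1', 'node2', 'node3'], lambda node: replies[node]))</programlisting>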
</chapter>