<?xml version="1.0" encoding="utf-8"?>
<chapter xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="module003-ch004-swift-building-blocks">
<title>Building Blocks of Swift</title>
<para>The components that enable Swift to deliver high
availability, high durability and high concurrency
are:</para>
<itemizedlist>
<listitem>
<para><emphasis role="bold">Proxy
Servers:</emphasis>Handles all incoming API
requests.</para>
</listitem>
<listitem>
<para><emphasis role="bold">Rings:</emphasis>Maps
logical names of data to locations on particular
disks.</para>
</listitem>
<listitem>
<para><emphasis role="bold">Zones:</emphasis>Each Zone
isolates data from other Zones. A failure in one Zone
doesnt impact the rest of the cluster because data is
replicated across the Zones.</para>
</listitem>
<listitem>
<para><emphasis role="bold">Accounts &amp;
Containers:</emphasis>Each Account and Container
are individual databases that are distributed across
the cluster. An Account database contains the list of
Containers in that Account. A Container database
contains the list of Objects in that Container</para>
</listitem>
<listitem>
<para><emphasis role="bold">Objects:</emphasis>The
data itself.</para>
</listitem>
<listitem>
<para><emphasis role="bold">Partitions:</emphasis>A
Partition stores Objects, Account databases and
Container databases. Its an intermediate 'bucket'
that helps manage locations where data lives in the
cluster.</para>
</listitem>
</itemizedlist>
<figure>
<title>Building Blocks</title>
<mediaobject>
<imageobject>
<imagedata fileref="figures/image40.png"/>
</imageobject>
</mediaobject>
</figure>
<para><guilabel>Proxy Servers</guilabel></para>
<para>The Proxy Servers are the public face of Swift and
handle all incoming API requests. Once a Proxy Server
receives a request, it determines the storage nodes
based on the URL of the object, for example
https://swift.example.com/v1/account/container/object. The
Proxy Servers also coordinate responses, handle failures,
and coordinate timestamps.</para>
<para>Proxy servers use a shared-nothing architecture and can
be scaled as needed based on projected workloads. A
minimum of two Proxy Servers should be deployed for
redundancy. Should one proxy server fail, the others will
take over.</para>
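<para>As a rough illustration of this request routing, the
following Python sketch splits a Swift-style URL path into
its account, container, and object parts. The path layout
follows the example above; the function name and code are
illustrative only, not the actual proxy implementation.</para>
<programlisting language="python"># Minimal sketch: split a Swift-style request path into its parts.
def split_path(path):
    # e.g. "/v1/account/container/object"
    parts = path.lstrip('/').split('/', 3)
    version, account = parts[0], parts[1]
    container = parts[2] if len(parts) > 2 else None
    obj = parts[3] if len(parts) > 3 else None
    return version, account, container, obj

print(split_path("/v1/AUTH_test/photos/cat.jpg"))
# ('v1', 'AUTH_test', 'photos', 'cat.jpg')</programlisting>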
<para><guilabel>The Ring</guilabel></para>
<para>A ring represents a mapping between the names of entities
stored on disk and their physical location. There are separate
rings for accounts, containers, and objects. When other
components need to perform any operation on an object,
container, or account, they need to interact with the
appropriate ring to determine its location in the
cluster.</para>
<para>The Ring maintains this mapping using zones, devices,
partitions, and replicas. Each partition in the ring is
replicated, by default, 3 times across the cluster, and the
locations for a partition are stored in the mapping maintained
by the ring. The ring is also responsible for determining
which devices are used for hand off in failure
scenarios.</para>
<para>Data can be isolated with the concept of zones in the
ring. Each replica of a partition is guaranteed to reside
in a different zone. A zone could represent a drive, a
server, a cabinet, a switch, or even a data center.</para>
<para>The partitions of the ring are equally divided among all
the devices in the OpenStack Object Storage installation.
When partitions need to be moved around (for example if a
device is added to the cluster), the ring ensures that a
minimum number of partitions are moved at a time, and only
one replica of a partition is moved at a time.</para>
<para>Weights can be used to balance the distribution of
partitions on drives across the cluster. This can be
useful, for example, when different sized drives are used
in a cluster.</para>
<para>The ring is used by the Proxy server and several
background processes (like replication).</para>
<figure>
<title>The Lord of the <emphasis role="bold"
>Ring</emphasis>s</title>
<mediaobject>
<imageobject>
<imagedata fileref="figures/image41.png"/>
</imageobject>
</mediaobject>
</figure>
<para>The Ring maps partitions to physical locations on
disk.</para>
<para>The rings determine where data should reside in the
cluster. There is a separate ring for account databases,
container databases, and individual objects but each ring
works in the same way. These rings are externally managed,
in that the server processes themselves do not modify the
rings; they are instead given new rings modified by other
tools.</para>
<para>The ring uses a configurable number of bits from a
path's MD5 hash as a partition index that designates a
device. The number of bits kept from the hash is known as
the partition power, and 2 to the partition power
indicates the partition count. Partitioning the full MD5
hash ring allows other parts of the cluster to work on
batches of items at once, which is either more
efficient or at least less complex than working with each
item separately or the entire cluster all at once.</para>
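<para>A hedged sketch of that calculation: keep the top
"partition power" bits of the MD5 hash of the object's path.
Real deployments also mix a per-cluster hash prefix and
suffix into the path before hashing, which is omitted here;
the partition power value below is only an example.</para>
<programlisting language="python">import hashlib

PART_POWER = 10            # 2**10 = 1024 partitions (example value)

def get_partition(account, container=None, obj=None):
    path = '/' + '/'.join(p for p in (account, container, obj) if p)
    digest = hashlib.md5(path.encode('utf-8')).digest()
    # Take the top 4 bytes of the hash, keep only the top PART_POWER bits.
    top = int.from_bytes(digest[:4], 'big')
    return top >> (32 - PART_POWER)

print(get_partition('AUTH_test', 'photos', 'cat.jpg'))</programlisting>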
<para>Another configurable value is the replica count, which
indicates how many of the partition-&gt;device assignments
comprise a single ring. For a given partition number, each
replica's device will not be in the same zone as any other
replica's device. Zones can be used to group devices based on
physical locations, power separations, network separations, or
any other attribute that would lessen the chance of multiple
replicas being unavailable at the same time.</para>
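<para>The zone rule can be pictured with a small, purely
illustrative sketch: assign each replica of a partition to a
device, never reusing a zone. Real ring building also
accounts for device weights and minimizes partition movement;
only the "one replica per zone" constraint is shown.</para>
<programlisting language="python">REPLICAS = 3

devices = [
    {'id': 0, 'zone': 1}, {'id': 1, 'zone': 1},
    {'id': 2, 'zone': 2}, {'id': 3, 'zone': 3},
    {'id': 4, 'zone': 4},
]

def place_replicas(devices, replicas=REPLICAS):
    chosen, used_zones = [], set()
    for dev in devices:
        if dev['zone'] not in used_zones:
            chosen.append(dev['id'])
            used_zones.add(dev['zone'])
        if len(chosen) == replicas:
            break
    return chosen

print(place_replicas(devices))   # [0, 2, 3]: one device per zone</programlisting>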
<para><guilabel>Zones: Failure Boundaries</guilabel></para>
<para>Swift allows zones to be configured to isolate
failure boundaries. Each replica of the data resides
in a separate zone, if possible. At the smallest
level, a zone could be a single drive or a grouping of
a few drives. If there were five object storage
servers, then each server would represent its own
zone. Larger deployments would have an entire rack (or
multiple racks) of object servers, each representing a
zone. The goal of zones is to allow the cluster to
tolerate significant outages of storage servers
without losing all replicas of the data.</para>
<para>As we learned earlier, everything in Swift is
stored, by default, three times. Swift will place each
replica "as-uniquely-as-possible" to ensure both high
availability and high durability. This means that when
choosing a replica location, Swift will choose a server
in an unused zone before an unused server in a zone
that already has a replica of the data.</para>
<figure>
<title>Zones</title>
<mediaobject>
<imageobject>
<imagedata fileref="figures/image42.png"/>
</imageobject>
</mediaobject>
</figure>
<para>When a disk fails, replica data is automatically
distributed to the other zones to ensure there are
three copies of the data.</para>
<para><guilabel>Accounts &amp;
Containers</guilabel></para>
<para>Each account and container is an individual SQLite
database that is distributed across the cluster. An
account database contains the list of containers in
that account. A container database contains the list
of objects in that container.</para>
<figure>
<title>Accounts and Containers</title>
<mediaobject>
<imageobject>
<imagedata fileref="figures/image43.png"/>
</imageobject>
</mediaobject>
</figure>
<para>To keep track of object data location, each account
in the system has a database that references all its
containers, and each container database references
each object.</para>
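<para>A simplified way to picture these databases, using
Python's sqlite3 module: the schemas below are illustrative
only, as the real Swift schemas carry additional bookkeeping
columns such as timestamps, delete markers, and sync
points.</para>
<programlisting language="python">import sqlite3

account_db = sqlite3.connect(':memory:')    # one database per account
account_db.execute('CREATE TABLE container (name TEXT, object_count INTEGER)')

container_db = sqlite3.connect(':memory:')  # one database per container
container_db.execute('CREATE TABLE object (name TEXT, size INTEGER, etag TEXT)')

# The account database lists containers; the container database lists objects.
account_db.execute("INSERT INTO container VALUES ('photos', 1)")
container_db.execute("INSERT INTO object VALUES ('cat.jpg', 48211, 'abc123')")

print(account_db.execute('SELECT name FROM container').fetchall())</programlisting>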
<para><guilabel>Partitions</guilabel></para>
<para>A Partition is a collection of stored data,
including Account databases, Container databases, and
objects. Partitions are core to the replication
system.</para>
<para>Think of a Partition as a bin moving throughout a
fulfillment center warehouse. Individual orders get
thrown into the bin. The system treats that bin as a
cohesive entity as it moves throughout the system. A
bin full of things is easier to deal with than lots of
little things. It makes for fewer moving parts
throughout the system.</para>
<para>The system replicators and object uploads/downloads
operate on Partitions. As the system scales up,
behavior continues to be predictable because the number of
Partitions is fixed.</para>
<para>The implementation of a Partition is conceptually
simple -- a partition is just a directory sitting on a
disk with a corresponding hash table of what it
contains.</para>
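<para>On disk, this is what that "directory with a hash
table" can look like. The layout sketched below resembles a
typical object-server data directory, but the exact paths
are shown only as an illustration.</para>
<programlisting language="python">import os

def object_path(device_root, partition, name_hash, timestamp):
    # e.g. /srv/node/sdb1/objects/1024/5f5/f5dc...b5f5/1404116149.12345.data
    suffix = name_hash[-3:]           # last characters of the name hash
    return os.path.join(device_root, 'objects', str(partition),
                        suffix, name_hash, timestamp + '.data')

print(object_path('/srv/node/sdb1', 1024,
                  'f5dcafea9a0a5b0fabf0cbd4f5d6b5f5', '1404116149.12345'))</programlisting>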
<figure>
<title>Partitions</title>
<mediaobject>
<imageobject>
<imagedata fileref="figures/image44.png"/>
</imageobject>
</mediaobject>
</figure>
<para>Swift partitions contain all data in the
system.</para>
<para><guilabel>Replication</guilabel></para>
<para>In order to ensure that there are three copies of
the data everywhere, replicators continuously examine
each Partition. For each local Partition, the
replicator compares it against the replicated copies
in the other Zones to see if there are any
differences.</para>
<para>How does the replicator know if replication needs to
take place? It does this by examining hashes. A hash
file is created for each Partition, which contains
hashes of each directory in the Partition. For a given
Partition, the hash files for each of the Partition's
copies are compared. If the hashes are different, then
it is time to replicate, and the directory that needs
to be replicated is copied over.</para>
<para>This is where the Partitions come in handy. With
fewer "things" in the system, larger chunks of data
are transferred around (rather than lots of little TCP
connections, which is inefficient) and there is a
consistent number of hashes to compare.</para>
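<para>A minimal sketch of that comparison, with illustrative
names: each copy of a partition keeps a small file of
per-directory hashes, and only the directories whose hashes
differ need to be shipped to the other replicas.</para>
<programlisting language="python">def dirs_to_sync(local_hashes, remote_hashes):
    """Return the suffix directories whose contents differ."""
    out_of_sync = []
    for suffix, digest in local_hashes.items():
        if remote_hashes.get(suffix) != digest:
            out_of_sync.append(suffix)
    return out_of_sync

local  = {'5f5': 'aaa111', '0ab': 'bbb222'}
remote = {'5f5': 'aaa111', '0ab': 'ccc333'}
print(dirs_to_sync(local, remote))   # ['0ab'] needs to be replicated</programlisting>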
<para>The cluster has eventually consistent behavior where
the newest data wins.</para>
<figure>
<title>Replication</title>
<mediaobject>
<imageobject>
<imagedata fileref="figures/image45.png"/>
</imageobject>
</mediaobject>
</figure>
<para>If a zone goes down, one of the nodes containing a
replica notices and proactively copies data to a
handoff location.</para>
<para>To describe how these pieces all come together, let's walk
through a few scenarios and introduce the components.</para>
<para><guilabel>Bird's-eye View</guilabel></para>
<para><emphasis role="bold">Upload</emphasis></para>
<para>A client uses the REST API to make an HTTP request to PUT
an object into an existing Container. The cluster receives
the request. First, the system must figure out where the
data is going to go. To do this, the Account name,
Container name and Object name are all used to determine
the Partition where this object should live.</para>
<para>Then a lookup in the Ring figures out which storage
nodes contain the Partitions in question.</para>
<para>The data is then sent to each storage node, where it is
placed in the appropriate Partition. A quorum is required
-- at least two of the three writes must be successful
before the client is notified that the upload was
successful.</para>
<para>Next, the Container database is updated asynchronously
to reflect that there is a new object in it.</para>
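<para>The quorum rule can be sketched as follows; the
put_to_node callable is a hypothetical stand-in for the real
per-node request, and the node names are made up.</para>
<programlisting language="python">def upload(nodes, data, put_to_node):
    # Write to every node the ring returned; succeed on a majority.
    successes = sum(1 for node in nodes if put_to_node(node, data))
    quorum = len(nodes) // 2 + 1          # 2 of 3 with default replicas
    return successes >= quorum

# Example: two of three simulated nodes accept the write.
results = {'node1': True, 'node2': True, 'node3': False}
ok = upload(list(results), b'...', lambda node, data: results[node])
print(ok)   # True: the client is told the upload succeeded</programlisting>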
<figure>
<title>When End-User uses Swift</title>
<mediaobject>
<imageobject>
<imagedata fileref="figures/image46.png"/>
</imageobject>
</mediaobject>
</figure>
<para><emphasis role="bold">Download</emphasis></para>
<para>A request comes in for an Account/Container/object.
Using the same consistent hashing, the Partition name is
generated. A lookup in the Ring reveals which storage
nodes contain that Partition. A request is made to one of
the storage nodes to fetch the object and if that fails,
requests are made to the other nodes.</para>
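<para>The read path can be sketched the same way; the
get_from_node callable below is a hypothetical stand-in for
the real per-node request.</para>
<programlisting language="python">def download(nodes, get_from_node):
    # Try the primary nodes for the partition in turn.
    for node in nodes:
        obj = get_from_node(node)
        if obj is not None:
            return obj
    raise IOError('object unavailable on all primary nodes')

# Example: the first node fails, the second serves the object.
replies = {'node1': None, 'node2': b'object bytes', 'node3': b'object bytes'}
print(download(['node1', 'node2', 'node3'], lambda node: replies[node]))</programlisting>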
</chapter>