fac23cb047
Since we have now a niceness gate, let's remove the known failures. Change-Id: I01684aee61e18e76f6523a466e28b8252e577f6f
296 lines
15 KiB
XML
296 lines
15 KiB
XML
<?xml version="1.0" encoding="utf-8"?>
|
||
<chapter xmlns="http://docbook.org/ns/docbook"
|
||
xmlns:xi="http://www.w3.org/2001/XInclude"
|
||
xmlns:xlink="http://www.w3.org/1999/xlink"
|
||
version="5.0"
|
||
xml:id="module003-ch004-swift-building-blocks">
|
||
<title>Building Blocks of Swift</title>
|
||
<para>The components that enable Swift to deliver high
|
||
availability, high durability and high concurrency
|
||
are:</para>
|
||
<itemizedlist>
|
||
<listitem>
|
||
<para><emphasis role="bold">Proxy
|
||
Servers:</emphasis>Handles all incoming API
|
||
requests.</para>
|
||
</listitem>
|
||
<listitem>
|
||
<para><emphasis role="bold">Rings:</emphasis>Maps
|
||
logical names of data to locations on particular
|
||
disks.</para>
|
||
</listitem>
|
||
<listitem>
|
||
<para><emphasis role="bold">Zones:</emphasis>Each Zone
|
||
isolates data from other Zones. A failure in one Zone
|
||
doesn’t impact the rest of the cluster because data is
|
||
replicated across the Zones.</para>
|
||
</listitem>
|
||
<listitem>
|
||
<para><emphasis role="bold">Accounts &
|
||
Containers:</emphasis>Each Account and Container
|
||
are individual databases that are distributed across
|
||
the cluster. An Account database contains the list of
|
||
Containers in that Account. A Container database
|
||
contains the list of Objects in that Container</para>
|
||
</listitem>
|
||
<listitem>
|
||
<para><emphasis role="bold">Objects:</emphasis>The
|
||
data itself.</para>
|
||
</listitem>
|
||
<listitem>
|
||
<para><emphasis role="bold">Partitions:</emphasis>A
|
||
Partition stores Objects, Account databases and
|
||
Container databases. It’s an intermediate 'bucket'
|
||
that helps manage locations where data lives in the
|
||
cluster.</para>
|
||
</listitem>
|
||
</itemizedlist>
|
||
<figure>
|
||
<title>Building Blocks</title>
|
||
<mediaobject>
|
||
<imageobject>
|
||
<imagedata fileref="figures/image40.png"/>
|
||
</imageobject>
|
||
</mediaobject>
|
||
</figure>
|
||
<para><guilabel>Proxy Servers</guilabel></para>
|
||
<para>The Proxy Servers are the public face of Swift and
|
||
handle all incoming API requests. Once a Proxy Server
|
||
receive a request, it will determine the storage node
|
||
based on the URL of the object, e.g.
|
||
https://swift.example.com/v1/account/container/object. The
|
||
Proxy Servers also coordinates responses, handles failures
|
||
and coordinates timestamps.</para>
|
||
<para>Proxy servers use a shared-nothing architecture and can
|
||
be scaled as needed based on projected workloads. A
|
||
minimum of two Proxy Servers should be deployed for
|
||
redundancy. Should one proxy server fail, the others will
|
||
take over.</para>
|
||
<para><guilabel>The Ring</guilabel></para>
|
||
<para>A ring represents a mapping between the names of entities
|
||
stored on disk and their physical location. There are separate
|
||
rings for accounts, containers, and objects. When other
|
||
components need to perform any operation on an object,
|
||
container, or account, they need to interact with the
|
||
appropriate ring to determine its location in the
|
||
cluster.</para>
|
||
<para>The Ring maintains this mapping using zones, devices,
|
||
partitions, and replicas. Each partition in the ring is
|
||
replicated, by default, 3 times across the cluster, and the
|
||
locations for a partition are stored in the mapping maintained
|
||
by the ring. The ring is also responsible for determining
|
||
which devices are used for hand off in failure
|
||
scenarios.</para>
|
||
<para>Data can be isolated with the concept of zones in the
|
||
ring. Each replica of a partition is guaranteed to reside
|
||
in a different zone. A zone could represent a drive, a
|
||
server, a cabinet, a switch, or even a data center.</para>
|
||
<para>The partitions of the ring are equally divided among all
|
||
the devices in the OpenStack Object Storage installation.
|
||
When partitions need to be moved around (for example if a
|
||
device is added to the cluster), the ring ensures that a
|
||
minimum number of partitions are moved at a time, and only
|
||
one replica of a partition is moved at a time.</para>
|
||
<para>Weights can be used to balance the distribution of
|
||
partitions on drives across the cluster. This can be
|
||
useful, for example, when different sized drives are used
|
||
in a cluster.</para>
|
||
<para>The ring is used by the Proxy server and several
|
||
background processes (like replication).</para>
|
||
<para>The Ring maps Partitions to physical locations on disk.
|
||
When other components need to perform any operation on an
|
||
object, container, or account, they need to interact with
|
||
the Ring to determine its location in the cluster.</para>
|
||
<para>The Ring maintains this mapping using zones, devices,
|
||
partitions, and replicas. Each partition in the Ring is
|
||
replicated three times by default across the cluster, and
|
||
the locations for a partition are stored in the mapping
|
||
maintained by the Ring. The Ring is also responsible for
|
||
determining which devices are used for handoff should a
|
||
failure occur.</para>
|
||
<figure>
|
||
<title>The Lord of the <emphasis role="bold"
|
||
>Ring</emphasis>s</title>
|
||
<mediaobject>
|
||
<imageobject>
|
||
<imagedata fileref="figures/image41.png"/>
|
||
</imageobject>
|
||
</mediaobject>
|
||
</figure>
|
||
<para>The Ring maps partitions to physical locations on
|
||
disk.</para>
|
||
<para>The rings determine where data should reside in the
|
||
cluster. There is a separate ring for account databases,
|
||
container databases, and individual objects but each ring
|
||
works in the same way. These rings are externally managed,
|
||
in that the server processes themselves do not modify the
|
||
rings, they are instead given new rings modified by other
|
||
tools.</para>
|
||
<para>The ring uses a configurable number of bits from a
|
||
path’s MD5 hash as a partition index that designates a
|
||
device. The number of bits kept from the hash is known as
|
||
the partition power, and 2 to the partition power
|
||
indicates the partition count. Partitioning the full MD5
|
||
hash ring allows other parts of the cluster to work in
|
||
batches of items at once which ends up either more
|
||
efficient or at least less complex than working with each
|
||
item separately or the entire cluster all at once.</para>
|
||
<para>Another configurable value is the replica count, which
|
||
indicates how many of the partition->device assignments
|
||
comprise a single ring. For a given partition number, each
|
||
replica’s device will not be in the same zone as any other
|
||
replica's device. Zones can be used to group devices based on
|
||
physical locations, power separations, network separations, or
|
||
any other attribute that would lessen multiple replicas being
|
||
unavailable at the same time.</para>
|
||
<para><guilabel>Zones: Failure Boundaries</guilabel></para>
|
||
<para>Swift allows zones to be configured to isolate
|
||
failure boundaries. Each replica of the data resides
|
||
in a separate zone, if possible. At the smallest
|
||
level, a zone could be a single drive or a grouping of
|
||
a few drives. If there were five object storage
|
||
servers, then each server would represent its own
|
||
zone. Larger deployments would have an entire rack (or
|
||
multiple racks) of object servers, each representing a
|
||
zone. The goal of zones is to allow the cluster to
|
||
tolerate significant outages of storage servers
|
||
without losing all replicas of the data.</para>
|
||
<para>As we learned earlier, everything in Swift is
|
||
stored, by default, three times. Swift will place each
|
||
replica "as-uniquely-as-possible" to ensure both high
|
||
availability and high durability. This means that when
|
||
chosing a replica location, Swift will choose a server
|
||
in an unused zone before an unused server in a zone
|
||
that already has a replica of the data.</para>
|
||
<figure>
|
||
<title>image33.png</title>
|
||
<mediaobject>
|
||
<imageobject>
|
||
<imagedata fileref="figures/image42.png"/>
|
||
</imageobject>
|
||
</mediaobject>
|
||
</figure>
|
||
<para>When a disk fails, replica data is automatically
|
||
distributed to the other zones to ensure there are
|
||
three copies of the data</para>
|
||
<para><guilabel>Accounts &
|
||
Containers</guilabel></para>
|
||
<para>Each account and container is an individual SQLite
|
||
database that is distributed across the cluster. An
|
||
account database contains the list of containers in
|
||
that account. A container database contains the list
|
||
of objects in that container.</para>
|
||
<figure>
|
||
<title>Accounts and Containers</title>
|
||
<mediaobject>
|
||
<imageobject>
|
||
<imagedata fileref="figures/image43.png"/>
|
||
</imageobject>
|
||
</mediaobject>
|
||
</figure>
|
||
<para>To keep track of object data location, each account
|
||
in the system has a database that references all its
|
||
containers, and each container database references
|
||
each object</para>
|
||
<para><guilabel>Partitions</guilabel></para>
|
||
<para>A Partition is a collection of stored data,
|
||
including Account databases, Container databases, and
|
||
objects. Partitions are core to the replication
|
||
system.</para>
|
||
<para>Think of a Partition as a bin moving throughout a
|
||
fulfillment center warehouse. Individual orders get
|
||
thrown into the bin. The system treats that bin as a
|
||
cohesive entity as it moves throughout the system. A
|
||
bin full of things is easier to deal with than lots of
|
||
little things. It makes for fewer moving parts
|
||
throughout the system.</para>
|
||
<para>The system replicators and object uploads/downloads
|
||
operate on Partitions. As the system scales up,
|
||
behavior continues to be predictable as the number of
|
||
Partitions is a fixed number.</para>
|
||
<para>The implementation of a Partition is conceptually
|
||
simple -- a partition is just a directory sitting on a
|
||
disk with a corresponding hash table of what it
|
||
contains.</para>
|
||
<figure>
|
||
<title>Partitions</title>
|
||
<mediaobject>
|
||
<imageobject>
|
||
<imagedata fileref="figures/image44.png"/>
|
||
</imageobject>
|
||
</mediaobject>
|
||
</figure>
|
||
<para>*Swift partitions contain all data in the
|
||
system.</para>
|
||
<para><guilabel>Replication</guilabel></para>
|
||
<para>In order to ensure that there are three copies of
|
||
the data everywhere, replicators continuously examine
|
||
each Partition. For each local Partition, the
|
||
replicator compares it against the replicated copies
|
||
in the other Zones to see if there are any
|
||
differences.</para>
|
||
<para>How does the replicator know if replication needs to
|
||
take place? It does this by examining hashes. A hash
|
||
file is created for each Partition, which contains
|
||
hashes of each directory in the Partition. Each of the
|
||
three hash files is compared. For a given Partition,
|
||
the hash files for each of the Partition's copies are
|
||
compared. If the hashes are different, then it is time
|
||
to replicate and the directory that needs to be
|
||
replicated is copied over.</para>
|
||
<para>This is where the Partitions come in handy. With
|
||
fewer "things" in the system, larger chunks of data
|
||
are transferred around (rather than lots of little TCP
|
||
connections, which is inefficient) and there are a
|
||
consistent number of hashes to compare.</para>
|
||
<para>The cluster has eventually consistent behavior where
|
||
the newest data wins.</para>
|
||
<figure>
|
||
<title>Replication</title>
|
||
<mediaobject>
|
||
<imageobject>
|
||
<imagedata fileref="figures/image45.png"/>
|
||
</imageobject>
|
||
</mediaobject>
|
||
</figure>
|
||
<para>*If a zone goes down, one of the nodes containing a
|
||
replica notices and proactively copies data to a
|
||
handoff location.</para>
|
||
<para>To describe how these pieces all come together, let's walk
|
||
through a few scenarios and introduce the components.</para>
|
||
<para><guilabel>Bird-eye View</guilabel></para>
|
||
<para><emphasis role="bold">Upload</emphasis></para>
|
||
|
||
<para>A client uses the REST API to make a HTTP request to PUT
|
||
an object into an existing Container. The cluster receives
|
||
the request. First, the system must figure out where the
|
||
data is going to go. To do this, the Account name,
|
||
Container name and Object name are all used to determine
|
||
the Partition where this object should live.</para>
|
||
<para>Then a lookup in the Ring figures out which storage
|
||
nodes contain the Partitions in question.</para>
|
||
<para>The data then is sent to each storage node where it is
|
||
placed in the appropriate Partition. A quorum is required
|
||
-- at least two of the three writes must be successful
|
||
before the client is notified that the upload was
|
||
successful.</para>
|
||
<para>Next, the Container database is updated asynchronously
|
||
to reflect that there is a new object in it.</para>
|
||
<figure>
|
||
<title>When End-User uses Swift</title>
|
||
<mediaobject>
|
||
<imageobject>
|
||
<imagedata fileref="figures/image46.png"/>
|
||
</imageobject>
|
||
</mediaobject>
|
||
</figure>
|
||
<para><emphasis role="bold">Download</emphasis></para>
|
||
<para>A request comes in for an Account/Container/object.
|
||
Using the same consistent hashing, the Partition name is
|
||
generated. A lookup in the Ring reveals which storage
|
||
nodes contain that Partition. A request is made to one of
|
||
the storage nodes to fetch the object and if that fails,
|
||
requests are made to the other nodes.</para>
|
||
|
||
</chapter>
|