ab93e636b8
Brief Summary: Added modules and lab sections of Aptira's existing
OpenStack training docs. Refer to the Full Summary for more details.
For reviewers who want to save time on building it, the content is
hosted at http://office.aptira.in. Please talk to Sean Roberts if you
are concerned about repetition of doc content or similar issues such
as short URLs; this is a rough patch, not the final version.

Full Summary: Added the following modules:
1. Module001 - Introduction to OpenStack: brief overview of OpenStack,
   basic concepts, and a detailed description of the core (Grizzly)
   projects under OpenStack, all but Networking and Swift.
2. Module002 - OpenStack Networking in detail.
3. Module003 - OpenStack Object Storage in detail.
4. Lab001 - OpenStack Control Node and Compute Node.
5. Lab002 - OpenStack Network Node.

I apologize for the size of this commit and will try not to commit
content this large again; the size was needed to meet the OpenStack
Training Sprint day.

bp/training-manuals

Change-Id: Ie3c44527992868b4d9571b66cc1c048e558ec669
<?xml version="1.0" encoding="utf-8"?>
<chapter xmlns="http://docbook.org/ns/docbook"
    xmlns:xi="http://www.w3.org/2001/XInclude"
    xmlns:xlink="http://www.w3.org/1999/xlink"
    version="5.0"
    xml:id="module003-ch004-swift-building-blocks">
    <title>Building Blocks of Swift</title>
    <para>The components that enable Swift to deliver high
        availability, high durability, and high concurrency
        are:</para>
    <itemizedlist>
        <listitem>
            <para><emphasis role="bold">Proxy
                Servers:</emphasis> Handle all incoming API
                requests.</para>
        </listitem>
        <listitem>
            <para><emphasis role="bold">Rings:</emphasis> Map
                logical names of data to locations on particular
                disks.</para>
        </listitem>
        <listitem>
            <para><emphasis role="bold">Zones:</emphasis> Each Zone
                isolates data from other Zones. A failure in one Zone
                doesn’t impact the rest of the cluster because data is
                replicated across the Zones.</para>
        </listitem>
        <listitem>
            <para><emphasis role="bold">Accounts &amp;
                Containers:</emphasis> Each Account and Container
                is an individual database that is distributed across
                the cluster. An Account database contains the list of
                Containers in that Account. A Container database
                contains the list of Objects in that Container.</para>
        </listitem>
        <listitem>
            <para><emphasis role="bold">Objects:</emphasis> The
                data itself.</para>
        </listitem>
        <listitem>
            <para><emphasis role="bold">Partitions:</emphasis> A
                Partition stores Objects, Account databases, and
                Container databases. It’s an intermediate 'bucket'
                that helps manage locations where data lives in the
                cluster.</para>
        </listitem>
    </itemizedlist>
    <figure>
        <title>Building Blocks</title>
        <mediaobject>
            <imageobject>
                <imagedata fileref="figures/image40.png"/>
            </imageobject>
        </mediaobject>
    </figure>
    <para><guilabel>Proxy Servers</guilabel></para>
    <para>The Proxy Servers are the public face of Swift and
        handle all incoming API requests. Once a Proxy Server
        receives a request, it determines the storage node
        based on the URL of the object, e.g.
        https://swift.example.com/v1/account/container/object. The
        Proxy Servers also coordinate responses, handle failures,
        and coordinate timestamps.</para>
    <para>Proxy servers use a shared-nothing architecture and can
        be scaled as needed based on projected workloads. A
        minimum of two Proxy Servers should be deployed for
        redundancy: should one proxy server fail, the others
        take over.</para>
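    <para>As a minimal illustrative sketch (not Swift's actual
        implementation), the path-splitting step a proxy performs
        on such a URL could look like this in Python; the function
        name and example path are assumptions:</para>
    <programlisting language="python"># Split a Swift-style request path such as
# /v1/account/container/object into its parts.
def split_path(path):
    segments = path.lstrip('/').split('/', 3)
    account = segments[1]   # segments[0] is the API version, "v1"
    container = segments[2] if len(segments) > 2 else None
    obj = segments[3] if len(segments) > 3 else None
    return account, container, obj

print(split_path('/v1/AUTH_test/photos/cat.jpg'))
# ('AUTH_test', 'photos', 'cat.jpg')</programlisting>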
    <para><guilabel>The Ring</guilabel></para>
    <para>A ring represents a mapping between the names of entities
        stored on disk and their physical locations. There are separate
        rings for accounts, containers, and objects. When other
        components need to perform any operation on an object,
        container, or account, they need to interact with the
        appropriate ring to determine its location in the
        cluster.</para>
    <para>The Ring maintains this mapping using zones, devices,
        partitions, and replicas. Each partition in the ring is
        replicated, by default, three times across the cluster, and the
        locations for a partition are stored in the mapping maintained
        by the ring. The ring is also responsible for determining
        which devices are used for handoff in failure
        scenarios.</para>
    <para>Data can be isolated with the concept of zones in the
        ring. Each replica of a partition is guaranteed to reside
        in a different zone. A zone could represent a drive, a
        server, a cabinet, a switch, or even a data center.</para>
    <para>The partitions of the ring are equally divided among all
        the devices in the OpenStack Object Storage installation.
        When partitions need to be moved around (for example, if a
        device is added to the cluster), the ring ensures that a
        minimum number of partitions are moved at a time, and only
        one replica of a partition is moved at a time.</para>
    <para>Weights can be used to balance the distribution of
        partitions on drives across the cluster. This can be
        useful, for example, when drives of different sizes are used
        in a cluster.</para>
    <para>The ring is used by the Proxy server and several
        background processes (such as replication).</para>
    <figure>
        <title>The Lord of the <emphasis role="bold"
            >Ring</emphasis>s</title>
        <mediaobject>
            <imageobject>
                <imagedata fileref="figures/image41.png"/>
            </imageobject>
        </mediaobject>
    </figure>
    <para>The rings determine where data should reside in the
        cluster. There is a separate ring for account databases,
        container databases, and individual objects, but each ring
        works in the same way. These rings are externally managed:
        the server processes themselves do not modify the rings;
        they are instead given new rings modified by other
        tools.</para>
    <para>The ring uses a configurable number of bits from a
        path’s MD5 hash as a partition index that designates a
        device. The number of bits kept from the hash is known as
        the partition power, and 2 to the partition power
        indicates the partition count. Partitioning the full MD5
        hash ring allows other parts of the cluster to work on
        batches of items at once, which ends up being more
        efficient, or at least less complex, than working with each
        item separately or with the entire cluster all at once.</para>
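    <para>A hedged Python sketch of that calculation (the partition
        power value is an assumed example, and this simplifies what
        the real ring code does):</para>
    <programlisting language="python">from hashlib import md5

PART_POWER = 18               # assumed example value
PART_SHIFT = 32 - PART_POWER  # keep the top PART_POWER bits

def get_partition(account, container=None, obj=None):
    # Hash the object path, then keep the top PART_POWER bits
    # of the first 32 bits as the partition index.
    path = '/'.join(p for p in (account, container, obj) if p)
    digest = md5(path.encode('utf-8')).digest()
    return int.from_bytes(digest[:4], 'big') >> PART_SHIFT

# The cluster has 2 ** PART_POWER partitions in total.
print(get_partition('AUTH_test', 'photos', 'cat.jpg'))</programlisting>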
    <para>Another configurable value is the replica count, which
        indicates how many of the partition-to-device assignments
        comprise a single ring. For a given partition number, each
        replica’s device will not be in the same zone as any other
        replica’s device. Zones can be used to group devices based on
        physical locations, power separations, network separations, or
        any other attribute that would lessen the chance of multiple
        replicas being unavailable at the same time.</para>
    <para><guilabel>Zones: Failure Boundaries</guilabel></para>
    <para>Swift allows zones to be configured to isolate
        failure boundaries. Each replica of the data resides
        in a separate zone, if possible. At the smallest
        level, a zone could be a single drive or a grouping of
        a few drives. If there were five object storage
        servers, then each server would represent its own
        zone. Larger deployments would have an entire rack (or
        multiple racks) of object servers, each representing a
        zone. The goal of zones is to allow the cluster to
        tolerate significant outages of storage servers
        without losing all replicas of the data.</para>
    <para>As we learned earlier, everything in Swift is
        stored, by default, three times. Swift will place each
        replica "as uniquely as possible" to ensure both high
        availability and high durability. This means that when
        choosing a replica location, Swift will choose a server
        in an unused zone before an unused server in a zone
        that already has a replica of the data, as the sketch
        below illustrates.</para>
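    <para>A toy sketch of that placement rule, with made-up server
        and zone names (not Swift's real placement code):</para>
    <programlisting language="python">def place_replicas(servers, replica_count=3):
    # servers: list of (server_name, zone_name) pairs
    placement, used_zones = [], set()
    for _ in range(replica_count):
        remaining = [s for s in servers if s not in placement]
        # Prefer servers whose zone holds no replica yet.
        fresh = [s for s in remaining if s[1] not in used_zones]
        choice = (fresh or remaining)[0]
        placement.append(choice)
        used_zones.add(choice[1])
    return placement

servers = [('obj1', 'z1'), ('obj2', 'z1'), ('obj3', 'z2'), ('obj4', 'z3')]
print(place_replicas(servers))
# [('obj1', 'z1'), ('obj3', 'z2'), ('obj4', 'z3')] -- one replica per zone</programlisting>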
    <figure>
        <title>Zones</title>
        <mediaobject>
            <imageobject>
                <imagedata fileref="figures/image42.png"/>
            </imageobject>
        </mediaobject>
    </figure>
    <para>When a disk fails, replica data is automatically
        distributed to the other zones to ensure there are
        three copies of the data.</para>
    <para><guilabel>Accounts &amp;
        Containers</guilabel></para>
    <para>Each account and container is an individual SQLite
        database that is distributed across the cluster. An
        account database contains the list of containers in
        that account. A container database contains the list
        of objects in that container.</para>
    <figure>
        <title>Accounts and Containers</title>
        <mediaobject>
            <imageobject>
                <imagedata fileref="figures/image43.png"/>
            </imageobject>
        </mediaobject>
    </figure>
    <para>To keep track of object data locations, each account
        in the system has a database that references all of its
        containers, and each container database references
        each object.</para>
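    <para>Because these are ordinary SQLite files, an account
        database can be inspected directly. The path and the
        'container' table name below are assumptions for
        illustration; check the schema on a real node before
        relying on them:</para>
    <programlisting language="python">import sqlite3

# List the containers recorded in an account database
# (hypothetical path on a storage node).
conn = sqlite3.connect('/srv/node/sdb1/accounts/example/account.db')
for (name,) in conn.execute('SELECT name FROM container ORDER BY name'):
    print(name)
conn.close()</programlisting>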
    <para><guilabel>Partitions</guilabel></para>
    <para>A Partition is a collection of stored data,
        including Account databases, Container databases, and
        objects. Partitions are core to the replication
        system.</para>
    <para>Think of a Partition as a bin moving through a
        fulfillment center warehouse. Individual orders get
        thrown into the bin. The system treats that bin as a
        cohesive entity as it moves throughout the system. A
        bin full of things is easier to deal with than lots of
        little things; it makes for fewer moving parts
        throughout the system.</para>
    <para>The system replicators and object uploads/downloads
        operate on Partitions. As the system scales up,
        behavior continues to be predictable because the number
        of Partitions is fixed.</para>
    <para>The implementation of a Partition is conceptually
        simple: a partition is just a directory sitting on a
        disk with a corresponding hash table of what it
        contains.</para>
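    <para>A simplified sketch of that idea: walk a partition
        directory and build a hash per suffix directory (a stand-in
        for the hashes file Swift keeps alongside each
        partition):</para>
    <programlisting language="python">import hashlib
import os

def hash_partition(partition_dir):
    # Map each suffix directory to a digest of the file
    # names it contains.
    hashes = {}
    for suffix in sorted(os.listdir(partition_dir)):
        digest = hashlib.md5()
        suffix_path = os.path.join(partition_dir, suffix)
        for _root, _dirs, files in os.walk(suffix_path):
            for name in sorted(files):
                digest.update(name.encode('utf-8'))
        hashes[suffix] = digest.hexdigest()
    return hashes</programlisting>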
    <figure>
        <title>Partitions</title>
        <mediaobject>
            <imageobject>
                <imagedata fileref="figures/image44.png"/>
            </imageobject>
        </mediaobject>
    </figure>
    <para>Swift partitions contain all data in the
        system.</para>
    <para><guilabel>Replication</guilabel></para>
    <para>To ensure that there are three copies of the data
        everywhere, replicators continuously examine each
        Partition. For each local Partition, the replicator
        compares it against the replicated copies in the other
        Zones to see if there are any differences.</para>
    <para>How does the replicator know if replication needs to
        take place? It does this by examining hashes. A hash
        file is created for each Partition, which contains
        hashes of each directory in the Partition. For a given
        Partition, the hash files for each of the Partition's
        copies are compared. If the hashes are different, then
        it is time to replicate, and the directory that needs
        to be replicated is copied over.</para>
    <para>This is where the Partitions come in handy. With
        fewer "things" in the system, larger chunks of data
        are transferred around (rather than lots of little TCP
        connections, which is inefficient) and there is a
        consistent number of hashes to compare. The sketch
        below shows the comparison step.</para>
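    <para>Continuing the earlier sketch, the comparison itself can
        be as simple as diffing two hash dictionaries (illustrative
        only, with made-up digests):</para>
    <programlisting language="python">def suffixes_to_sync(local_hashes, remote_hashes):
    # Any suffix whose digests differ (or which is missing
    # remotely) needs to be pushed to the remote copy.
    return sorted(
        suffix for suffix, digest in local_hashes.items()
        if remote_hashes.get(suffix) != digest
    )

local = {'3fa': 'aaa111', '7c2': 'bbb222'}
remote = {'3fa': 'aaa111', '7c2': 'stale0'}
print(suffixes_to_sync(local, remote))  # ['7c2']</programlisting>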
    <para>The cluster has eventually consistent behavior, where
        the newest data wins.</para>
    <figure>
        <title>Replication</title>
        <mediaobject>
            <imageobject>
                <imagedata fileref="figures/image45.png"/>
            </imageobject>
        </mediaobject>
    </figure>
    <para>If a zone goes down, one of the nodes containing a
        replica notices and proactively copies data to a
        handoff location.</para>
    <para>To describe how these pieces all come together, let's walk
        through a few scenarios and introduce the components.</para>
    <para><guilabel>Bird's-eye View</guilabel></para>
    <para><emphasis role="bold">Upload</emphasis></para>
    <para>A client uses the REST API to make an HTTP request to PUT
        an object into an existing Container. The cluster receives
        the request. First, the system must figure out where the
        data is going to go. To do this, the Account name,
        Container name, and Object name are all used to determine
        the Partition where this object should live.</para>
    <para>Then a lookup in the Ring figures out which storage
        nodes contain the Partition in question.</para>
    <para>The data is then sent to each storage node, where it is
        placed in the appropriate Partition. A quorum is required:
        at least two of the three writes must be successful
        before the client is notified that the upload was
        successful.</para>
    <para>Next, the Container database is updated asynchronously
        to reflect that there is a new object in it.</para>
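    <para>A hedged example of such an upload over the REST API; the
        endpoint and token are placeholders that would normally come
        from your authentication service:</para>
    <programlisting language="python">import requests

url = 'https://swift.example.com/v1/AUTH_test/photos/cat.jpg'
headers = {'X-Auth-Token': 'AUTH_tk_example'}  # placeholder token

with open('cat.jpg', 'rb') as f:
    resp = requests.put(url, headers=headers, data=f)

# 201 Created indicates the quorum (at least two of three
# writes) succeeded before the proxy answered.
print(resp.status_code)</programlisting>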
    <figure>
        <title>When an End User Uses Swift</title>
        <mediaobject>
            <imageobject>
                <imagedata fileref="figures/image46.png"/>
            </imageobject>
        </mediaobject>
    </figure>
    <para><emphasis role="bold">Download</emphasis></para>
    <para>A request comes in for an Account/Container/Object.
        Using the same consistent hashing, the Partition name is
        generated. A lookup in the Ring reveals which storage
        nodes contain that Partition. A request is made to one of
        the storage nodes to fetch the object; if that fails,
        requests are made to the other nodes.</para>
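    <para>The corresponding download, using the same placeholder
        endpoint and token as in the upload example:</para>
    <programlisting language="python">import requests

url = 'https://swift.example.com/v1/AUTH_test/photos/cat.jpg'
headers = {'X-Auth-Token': 'AUTH_tk_example'}  # placeholder token

resp = requests.get(url, headers=headers)
if resp.status_code == 200:
    with open('cat.jpg', 'wb') as f:
        f.write(resp.content)</programlisting>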
</chapter>