<?xml version="1.0" encoding="utf-8"?>
<section xmlns="http://docbook.org/ns/docbook" xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink" version="5.0"
xml:id="associate-storage-node-concept-swift">
<title>Conceptual Swift</title>
<section xml:id="intro-objectstore">
<title>Introduction to Object Storage</title>
<para>OpenStack Object Storage (code-named Swift) is open source
software for creating redundant, scalable data storage using
clusters of standardized servers to store petabytes of
accessible data. It is a long-term storage system for large
amounts of static data that can be retrieved, leveraged, and
updated. Object Storage uses a distributed architecture with
no central point of control, providing greater scalability,
redundancy and permanence. Objects are written to multiple
hardware devices, with the OpenStack software responsible for
ensuring data replication and integrity across the cluster.
Storage clusters scale horizontally by adding new nodes.
Should a node fail, OpenStack works to replicate its content
from other active nodes. Because OpenStack uses software logic
to ensure data replication and distribution across different
devices, inexpensive commodity hard drives and servers can be
used in lieu of more expensive equipment.</para>
<para>Object Storage is ideal for cost-effective, scale-out
storage. It provides a fully distributed, API-accessible
storage platform that can be integrated directly into
applications or used for backup, archiving and data retention.
Block Storage allows block devices to be exposed and connected
to compute instances for expanded storage, better performance
and integration with enterprise storage platforms, such as
NetApp, Nexenta and SolidFire.</para>
</section>
<section xml:id="features-benefits">
<title>Features and Benefits</title>
<para>
<informaltable class="c19">
<tbody>
<tr>
<th rowspan="1" colspan="1">Features</th>
<th rowspan="1" colspan="1">Benefits</th>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold"
>Leverages commodity
hardware</emphasis></td>
<td rowspan="1" colspan="1"
>No
lock-in, lower
price/GB</td>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold"
>HDD/node failure agnostic</emphasis></td>
<td rowspan="1" colspan="1"
>Self
healingReliability, data redundancy protecting
from
failures</td>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold"
>Unlimited storage</emphasis></td>
<td rowspan="1" colspan="1"
>Huge
&amp; flat namespace, highly scalable
read/write accessAbility to serve content
directly from storage
system</td>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold"
>Multi-dimensional scalability</emphasis>
(scale out architecture)Scale vertically and
horizontally-distributed storage</td>
<td rowspan="1" colspan="1"
>Backup
and archive large amounts of data with linear
performance</td>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold"
>Account/Container/Object
structure</emphasis>No nesting, not a
traditional file system</td>
<td rowspan="1" colspan="1"
>Optimized
for scaleScales to multiple petabytes,
billions of
objects</td>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold"
>Built-in replication3x+ data
redundancy</emphasis> compared to 2x on
RAID</td>
<td rowspan="1" colspan="1"
>Configurable
number of accounts, container and object
copies for high
availability</td>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold"
>Easily add capacity</emphasis> unlike
RAID resize</td>
<td rowspan="1" colspan="1"
>Elastic
data scaling with
ease</td>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold"
>No central database</emphasis></td>
<td rowspan="1" colspan="1"
>Higher
performance, no
bottlenecks</td>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold"
>RAID not required</emphasis></td>
<td rowspan="1" colspan="1"
>Handle
lots of small, random reads and writes
efficiently</td>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold"
>Built-in management
utilities</emphasis></td>
<td rowspan="1" colspan="1"
>Account
Management: Create, add, verify, delete
usersContainer Management: Upload, download,
verifyMonitoring: Capacity, host, network, log
trawling, cluster
health</td>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold"
>Drive auditing</emphasis></td>
<td rowspan="1" colspan="1"
>Detect
drive failures preempting data
corruption</td>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold"
>Expiring objects</emphasis></td>
<td rowspan="1" colspan="1"
>Users
can set an expiration time or a TTL on an
object to control
access</td>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold"
>Direct object access</emphasis></td>
<td rowspan="1" colspan="1"
>Enable
direct browser access to content, such as for
a control
panel</td>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold"
>Realtime visibility into client
requests</emphasis></td>
<td rowspan="1" colspan="1"
>Know
what users are
requesting</td>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold"
>Supports S3 API</emphasis></td>
<td rowspan="1" colspan="1"
>Utilize
tools that were designed for the popular S3
API</td>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold"
>Restrict containers per
account</emphasis></td>
<td rowspan="1" colspan="1"
>Limit
access to control usage by
user</td>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold"
>Support for NetApp, Nexenta,
SolidFire</emphasis></td>
<td rowspan="1" colspan="1"
>Unified
support for block volumes using a variety of
storage
systems</td>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold"
>Snapshot and backup API for block
volumes</emphasis></td>
<td rowspan="1" colspan="1"
>Data
protection and recovery for VM
data</td>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold"
>Standalone volume API
available</emphasis></td>
<td rowspan="1" colspan="1"
>Separate
endpoint and API for integration with other
compute
systems</td>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold"
>Integration with Compute</emphasis></td>
<td rowspan="1" colspan="1"
>Fully
integrated to Compute for attaching block
volumes and reporting on usage</td>
</tr>
</tbody>
</informaltable>
</para>
</section>
<section xml:id="obj-store-capabilities">
<title>Object Storage Capabilities</title>
<itemizedlist>
<listitem>
<para>OpenStack provides redundant, scalable object
storage using clusters of standardized servers capable
of storing petabytes of data</para>
</listitem>
<listitem>
<para>Object Storage is not a traditional file system, but
rather a distributed storage system for static data
such as virtual machine images, photo storage, email
storage, backups and archives. Having no central
"brain" or master point of control provides greater
scalability, redundancy and durability.</para>
</listitem>
<listitem>
<para>Objects and files are written to multiple disk
drives spread throughout servers in the data center,
with the OpenStack software responsible for ensuring
data replication and integrity across the
cluster.</para>
</listitem>
<listitem>
<para>Storage clusters scale horizontally simply by adding
new servers. Should a server or hard drive fail,
OpenStack replicates its content from other active
nodes to new locations in the cluster. Because
OpenStack uses software logic to ensure data
replication and distribution across different devices,
inexpensive commodity hard drives and servers can be
used in lieu of more expensive equipment.</para>
</listitem>
</itemizedlist>
<para><guilabel>Swift Characteristics</guilabel></para>
<para>The key characteristics of Swift include:</para>
<itemizedlist>
<listitem>
<para>All objects stored in Swift have a URL</para>
</listitem>
<listitem>
<para>All objects stored are replicated 3x in
as-unique-as-possible zones, which can be defined as a
group of drives, a node, a rack etc.</para>
</listitem>
<listitem>
<para>All objects have their own metadata</para>
</listitem>
<listitem>
<para>Developers interact with the object storage system
through a RESTful HTTP API</para>
</listitem>
<listitem>
<para>Object data can be located anywhere in the
cluster</para>
</listitem>
<listitem>
<para>The cluster scales by adding additional nodes
without sacrificing performance, which allows a more
cost-effective linear storage expansion than fork-lift
upgrades</para>
</listitem>
<listitem>
<para>Data doesn't have to be migrated to an entirely new
storage system</para>
</listitem>
<listitem>
<para>New nodes can be added to the cluster without
downtime</para>
</listitem>
<listitem>
<para>Failed nodes and disks can be swapped out with no
downtime</para>
</listitem>
<listitem>
<para>Runs on industry-standard hardware, such as Dell,
HP, Supermicro etc.</para>
</listitem>
</itemizedlist>
<figure>
<title>Object Storage (Swift)</title>
<mediaobject>
<imageobject>
<imagedata fileref="figures/image39.png"/>
</imageobject>
</mediaobject>
</figure>
<para>Developers can either write directly to the Swift API or use
one of the many client libraries that exist for all popular
programming languages, such as Java, Python, Ruby and C#.
Amazon S3 and Rackspace Cloud Files users should feel very
familiar with Swift. For users who have not used an object
storage system before, it will require a different approach
and mindset than using a traditional filesystem.</para>
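<para>As a small example of the client-library route, the
following sketch uses the python-swiftclient library to create a
container, upload an object, and read it back. The authentication
URL, credentials, and container and object names are placeholders,
not values from this guide.</para>
<programlisting language="python"># Minimal python-swiftclient sketch; all endpoint and credential
# values below are illustrative placeholders.
from swiftclient.client import Connection

conn = Connection(authurl='http://keystone.example.com:5000/v2.0',
                  user='demo', key='secretpassword',
                  tenant_name='demo', auth_version='2.0')

# Create a container, upload an object, then read it back.
conn.put_container('photos')
conn.put_object('photos', 'vacation.jpg',
                contents=open('vacation.jpg', 'rb'),
                content_type='image/jpeg')

headers, body = conn.get_object('photos', 'vacation.jpg')</programlisting>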
</section>
<section xml:id="swift-building-blocks">
<title>Building Blocks of Swift</title>
<para>The components that enable Swift to deliver high
availability, high durability and high concurrency
are:</para>
<itemizedlist>
<listitem>
<para><emphasis role="bold">Proxy
Servers:</emphasis>Handles all incoming API
requests.</para>
</listitem>
<listitem>
<para><emphasis role="bold">Rings:</emphasis>Maps
logical names of data to locations on particular
disks.</para>
</listitem>
<listitem>
<para><emphasis role="bold">Zones:</emphasis>Each Zone
isolates data from other Zones. A failure in one Zone
doesnt impact the rest of the cluster because data is
replicated across the Zones.</para>
</listitem>
<listitem>
<para><emphasis role="bold">Accounts &amp;
Containers:</emphasis>Each Account and Container
are individual databases that are distributed across
the cluster. An Account database contains the list of
Containers in that Account. A Container database
contains the list of Objects in that Container</para>
</listitem>
<listitem>
<para><emphasis role="bold">Objects:</emphasis>The
data itself.</para>
</listitem>
<listitem>
<para><emphasis role="bold">Partitions:</emphasis>A
Partition stores Objects, Account databases and
Container databases. Its an intermediate 'bucket'
that helps manage locations where data lives in the
cluster.</para>
</listitem>
</itemizedlist>
<figure>
<title>Building Blocks</title>
<mediaobject>
<imageobject>
<imagedata fileref="figures/image40.png"/>
</imageobject>
</mediaobject>
</figure>
<para><guilabel>Proxy Servers</guilabel></para>
<para>The Proxy Servers are the public face of Swift and
handle all incoming API requests. Once a Proxy Server
receives a request, it determines the storage node
based on the URL of the object, for example,
https://swift.example.com/v1/account/container/object. The
Proxy Servers also coordinate responses, handle failures,
and coordinate timestamps.</para>
<para>Proxy servers use a shared-nothing architecture and can
be scaled as needed based on projected workloads. A
minimum of two Proxy Servers should be deployed for
redundancy. Should one proxy server fail, the others will
take over.</para>
<para><guilabel>The Ring</guilabel></para>
<para>A ring represents a mapping between the names of entities
stored on disk and their physical location. There are separate
rings for accounts, containers, and objects. When other
components need to perform any operation on an object,
container, or account, they need to interact with the
appropriate ring to determine its location in the
cluster.</para>
<para>The Ring maintains this mapping using zones, devices,
partitions, and replicas. Each partition in the ring is
replicated, by default, 3 times across the cluster, and the
locations for a partition are stored in the mapping maintained
by the ring. The ring is also responsible for determining
which devices are used for hand off in failure
scenarios.</para>
<para>Data can be isolated with the concept of zones in the
ring. Each replica of a partition is guaranteed to reside
in a different zone. A zone could represent a drive, a
server, a cabinet, a switch, or even a data center.</para>
<para>The partitions of the ring are equally divided among all
the devices in the OpenStack Object Storage installation.
When partitions need to be moved around (for example if a
device is added to the cluster), the ring ensures that a
minimum number of partitions are moved at a time, and only
one replica of a partition is moved at a time.</para>
<para>Weights can be used to balance the distribution of
partitions on drives across the cluster. This can be
useful, for example, when different sized drives are used
in a cluster.</para>
<para>The ring is used by the Proxy server and several
background processes (like replication).</para>
<figure>
<title>The Lord of the <emphasis role="bold"
>Ring</emphasis>s</title>
<mediaobject>
<imageobject>
<imagedata fileref="figures/image41.png"/>
</imageobject>
</mediaobject>
</figure>
<para>The Ring maps partitions to physical locations on
disk.</para>
<para>The rings determine where data should reside in the
cluster. There is a separate ring for account databases,
container databases, and individual objects but each ring
works in the same way. These rings are externally managed,
in that the server processes themselves do not modify the
rings, they are instead given new rings modified by other
tools.</para>
<para>The ring uses a configurable number of bits from a
path's MD5 hash as a partition index that designates a
device. The number of bits kept from the hash is known as
the partition power, and 2 to the partition power
indicates the partition count. Partitioning the full MD5
hash ring allows other parts of the cluster to work in
batches of items at once, which ends up being either more
efficient or at least less complex than working with each
item separately or with the entire cluster all at once.</para>
<para>Another configurable value is the replica count, which
indicates how many of the partition-&gt;device assignments
comprise a single ring. For a given partition number, each
replica's device will not be in the same zone as any other
replica's device. Zones can be used to group devices based on
physical locations, power separations, network separations, or
any other attribute that would lessen the chance of multiple
replicas being unavailable at the same time.</para>
<para><guilabel>Zones: Failure Boundaries</guilabel></para>
<para>Swift allows zones to be configured to isolate
failure boundaries. Each replica of the data resides
in a separate zone, if possible. At the smallest
level, a zone could be a single drive or a grouping of
a few drives. If there were five object storage
servers, then each server would represent its own
zone. Larger deployments would have an entire rack (or
multiple racks) of object servers, each representing a
zone. The goal of zones is to allow the cluster to
tolerate significant outages of storage servers
without losing all replicas of the data.</para>
<para>As we learned earlier, everything in Swift is
stored, by default, three times. Swift will place each
replica "as-uniquely-as-possible" to ensure both high
availability and high durability. This means that when
choosing a replica location, Swift will choose a server
in an unused zone before an unused server in a zone
that already has a replica of the data.</para>
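<para>That placement preference can be shown with a small,
simplified sketch (this is not the actual ring-builder code; the
device list and zone numbers are invented): each replica of a
partition goes to a device in a zone that does not yet hold a
replica.</para>
<programlisting language="python"># Simplified illustration of "as-unique-as-possible" placement;
# the devices and zones are invented for this example.
devices = [
    {'id': 0, 'zone': 1}, {'id': 1, 'zone': 1},
    {'id': 2, 'zone': 2}, {'id': 3, 'zone': 3},
]

def place_replicas(devices, replica_count=3):
    chosen, used_zones = [], set()
    for dev in devices:
        if dev['zone'] not in used_zones:
            chosen.append(dev['id'])
            used_zones.add(dev['zone'])
        if len(chosen) == replica_count:
            break
    return chosen

print(place_replicas(devices))   # [0, 2, 3] -- one device per zone</programlisting>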
<figure>
<title>Zones</title>
<mediaobject>
<imageobject>
<imagedata fileref="figures/image42.png"/>
</imageobject>
</mediaobject>
</figure>
<para>When a disk fails, replica data is automatically
distributed to the other zones to ensure there are
three copies of the data.</para>
<para><guilabel>Accounts &amp;
Containers</guilabel></para>
<para>Each account and container is an individual SQLite
database that is distributed across the cluster. An
account database contains the list of containers in
that account. A container database contains the list
of objects in that container.</para>
<figure>
<title>Accounts and Containers</title>
<mediaobject>
<imageobject>
<imagedata fileref="figures/image43.png"/>
</imageobject>
</mediaobject>
</figure>
<para>To keep track of object data location, each account
in the system has a database that references all its
containers, and each container database references
each object.</para>
<para><guilabel>Partitions</guilabel></para>
<para>A Partition is a collection of stored data,
including Account databases, Container databases, and
objects. Partitions are core to the replication
system.</para>
<para>Think of a Partition as a bin moving throughout a
fulfillment center warehouse. Individual orders get
thrown into the bin. The system treats that bin as a
cohesive entity as it moves throughout the system. A
bin full of things is easier to deal with than lots of
little things. It makes for fewer moving parts
throughout the system.</para>
<para>The system replicators and object uploads/downloads
operate on Partitions. As the system scales up,
behavior continues to be predictable as the number of
Partitions is a fixed number.</para>
<para>The implementation of a Partition is conceptually
simple -- a partition is just a directory sitting on a
disk with a corresponding hash table of what it
contains.</para>
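<para>On disk, this translates to a simple directory tree. The
layout below is an illustrative sketch (the device name, partition
number, and hash are invented): a partition directory holds suffix
directories, each suffix directory holds one directory per object
hash, and a cached hashes file describes what the partition
contains.</para>
<programlisting>/srv/node/sdb1/objects/
    1024/                          # partition directory
        hashes.pkl                 # cached hashes of the suffix directories
        f00/                       # suffix directory
            3b1a2660d6eb0d40d5b4b6eb2d84ff00/   # object-hash directory
                1401254393.89313.data           # object contents, named by timestamp</programlisting>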
<figure>
<title>Partitions</title>
<mediaobject>
<imageobject>
<imagedata fileref="figures/image44.png"/>
</imageobject>
</mediaobject>
</figure>
<para>Swift partitions contain all data in the
system.</para>
<para><guilabel>Replication</guilabel></para>
<para>In order to ensure that there are three copies of
the data everywhere, replicators continuously examine
each Partition. For each local Partition, the
replicator compares it against the replicated copies
in the other Zones to see if there are any
differences.</para>
<para>How does the replicator know if replication needs to
take place? It does this by examining hashes. A hash
file is created for each Partition, which contains
hashes of each directory in the Partition. For a given
Partition, the hash files for each of the Partition's
copies are compared. If the hashes are different, then it
is time to replicate, and the directory that needs to be
replicated is copied over.</para>
<para>This is where the Partitions come in handy. With
fewer "things" in the system, larger chunks of data
are transferred around (rather than lots of little TCP
connections, which is inefficient) and there are a
consistent number of hashes to compare.</para>
<para>The cluster has eventually consistent behavior where
the newest data wins.</para>
<figure>
<title>Replication</title>
<mediaobject>
<imageobject>
<imagedata fileref="figures/image45.png"/>
</imageobject>
</mediaobject>
</figure>
<para>If a zone goes down, one of the nodes containing a
replica notices and proactively copies data to a
handoff location.</para>
<para>To describe how these pieces all come together, let's walk
through a few scenarios and introduce the components.</para>
<para><guilabel>Bird's-eye View</guilabel></para>
<para><emphasis role="bold">Upload</emphasis></para>
<para>A client uses the REST API to make an HTTP request to PUT
an object into an existing Container. The cluster receives
the request. First, the system must figure out where the
data is going to go. To do this, the Account name,
Container name and Object name are all used to determine
the Partition where this object should live.</para>
<para>Then a lookup in the Ring figures out which storage
nodes contain the Partitions in question.</para>
<para>The data then is sent to each storage node where it is
placed in the appropriate Partition. A quorum is required
-- at least two of the three writes must be successful
before the client is notified that the upload was
successful.</para>
<para>Next, the Container database is updated asynchronously
to reflect that there is a new object in it.</para>
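<para>The "at least two of the three" rule is a simple majority
quorum. A short illustrative sketch of the arithmetic (not Swift's
internal code):</para>
<programlisting language="python"># Illustrative quorum arithmetic for replicated writes.
def quorum_size(replica_count):
    return replica_count // 2 + 1

successful_writes = 2
if successful_writes >= quorum_size(3):
    print('Report success to the client')</programlisting>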
<figure>
<title>When End-User uses Swift</title>
<mediaobject>
<imageobject>
<imagedata fileref="figures/image46.png"/>
</imageobject>
</mediaobject>
</figure>
<para><emphasis role="bold">Download</emphasis></para>
<para>A request comes in for an Account/Container/object.
Using the same consistent hashing, the Partition name is
generated. A lookup in the Ring reveals which storage
nodes contain that Partition. A request is made to one of
the storage nodes to fetch the object and if that fails,
requests are made to the other nodes.</para>
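<para>The same upload and download flow can be driven directly
against the REST API. The sketch below uses the Python requests
library; the storage URL, auth token, and container and object
names are placeholders that would normally come from the
authentication service.</para>
<programlisting language="python"># Raw REST access; the URL and token below are placeholders.
import requests

storage_url = 'https://swift.example.com/v1/account'
headers = {'X-Auth-Token': 'AUTH_tk_example'}

# PUT an object into an existing container ...
requests.put(storage_url + '/container/object',
             headers=headers, data=b'hello swift')

# ... then GET it back.
resp = requests.get(storage_url + '/container/object', headers=headers)
print(resp.status_code, len(resp.content))</programlisting>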
</section>
<section xml:id="the-ring">
<title>Ring Builder</title>
<para>The rings are built and managed manually by a utility called
the ring-builder. The ring-builder assigns partitions to
devices and writes an optimized Python structure to a gzipped,
serialized file on disk for shipping out to the servers. The
server processes just check the modification time of the file
occasionally and reload their in-memory copies of the ring
structure as needed. Because of how the ring-builder manages
changes to the ring, using a slightly older ring usually just
means one of the three replicas for a subset of the partitions
will be incorrect, which can be easily worked around.</para>
<para>The ring-builder also keeps its own builder file with the
ring information and additional data required to build future
rings. It is very important to keep multiple backup copies of
these builder files. One option is to copy the builder files
out to every server while copying the ring files themselves.
Another is to upload the builder files into the cluster
itself. Complete loss of a builder file will mean creating a
new ring from scratch; nearly all partitions will end up
assigned to different devices, and therefore nearly all data
stored will have to be replicated to new locations. So,
recovery from a builder file loss is possible, but data will
definitely be unreachable for an extended time.</para>
<para><guilabel>Ring Data Structure</guilabel></para>
<para>The ring data structure consists of three top level
fields: a list of devices in the cluster, a list of lists
of device ids indicating partition to device assignments,
and an integer indicating the number of bits to shift an
MD5 hash to calculate the partition for the hash.</para>
<para><guilabel>Partition Assignment
List</guilabel></para>
<para>This is a list of array('H') arrays of device ids. The
outermost list contains an array('H') for each
replica. Each array('H') has a length equal to the
partition count for the ring. Each integer in the
array('H') is an index into the above list of devices.
The partition list is known internally to the Ring
class as _replica2part2dev_id.</para>
<para>So, to create a list of device dictionaries assigned
to a partition, the Python code would look like:</para>
<programlisting language="python">devices = [self.devs[part2dev_id[partition]]
           for part2dev_id in self._replica2part2dev_id]</programlisting>
<para>That code is a little simplistic, as it does not
account for the removal of duplicate devices. If a
ring has more replicas than devices, then a partition
will have more than one replica on one device; that's
simply the pigeonhole principle at work.</para>
<para>array('H') is used for memory conservation as there
may be millions of partitions.</para>
<para><guilabel>Fractional Replicas</guilabel></para>
<para>A ring is not restricted to having an integer number
of replicas. In order to support the gradual changing
of replica counts, the ring is able to have a real
number of replicas.</para>
<para>When the number of replicas is not an integer, then
the last element of _replica2part2dev_id will have a
length that is less than the partition count for the
ring. This means that some partitions will have more
replicas than others. For example, if a ring has 3.25
replicas, then 25% of its partitions will have four
replicas, while the remaining 75% will have just
three.</para>
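<para>A small illustrative calculation (the partition power and
replica count below are chosen only to keep the numbers
readable):</para>
<programlisting language="python"># Fractional replicas: with 3.25 replicas and 16 partitions, the
# last replica row covers only a quarter of the partitions.
replica_count = 3.25
partition_count = 2 ** 4          # partition power of 4

full_rows = int(replica_count)    # 3 full rows in _replica2part2dev_id
last_row_length = int(partition_count * (replica_count - full_rows))
print(last_row_length)            # 4 -- four partitions get a 4th replica</programlisting>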
<para><guilabel>Partition Shift Value</guilabel></para>
<para>The partition shift value is known internally to the
Ring class as _part_shift. This value is used to shift an
MD5 hash to calculate the partition on which the data
for that hash should reside. Only the top four bytes
of the hash are used in this process. For example, to
compute the partition for the path
/account/container/object, the Python code might look
like:</para>
<programlisting language="python">from hashlib import md5
from struct import unpack_from

partition = unpack_from('&gt;I',
    md5(b'/account/container/object').digest())[0] &gt;&gt; self._part_shift</programlisting>
<para>For a ring generated with part_power P, the
partition shift value is 32 - P.</para>
<para><guilabel>Building the Ring</guilabel></para>
<para>The initial building of the ring first calculates the
number of partitions that should ideally be assigned to
each device based on the device's weight. For example, given
a partition power of 20, the ring will have 1,048,576
partitions. If there are 1,000 devices of equal weight
they will each desire 1,048.576 partitions. The devices
are then sorted by the number of partitions they desire
and kept in order throughout the initialization
process.</para>
<para>Note: each device is also assigned a random tiebreaker
value that is used when two devices desire the same number
of partitions. This tiebreaker is not stored on disk
anywhere, and so two different rings created with the same
parameters will have different partition assignments. For
repeatable partition assignments, RingBuilder.rebalance()
takes an optional seed value that will be used to seed
Python's pseudo-random number generator.</para>
<para>Then, the ring builder assigns each replica of each
partition to the device that desires the most partitions
at that point while keeping it as far away as possible
from other replicas. The ring builder prefers to assign a
replica to a device in a region that has no replicas
already; should there be no such region available, the
ring builder will try to find a device in a different
zone; if not possible, it will look on a different server;
failing that, it will just look for a device that has no
replicas; finally, if all other options are exhausted, the
ring builder will assign the replica to the device that
has the fewest replicas already assigned. Note that
assignment of multiple replicas to one device will only
happen if the ring has fewer devices than it has
replicas.</para>
<para>When building a new ring based on an old ring, the
desired number of partitions each device wants is
recalculated. Next the partitions to be reassigned are
gathered up. Any removed devices have all their assigned
partitions unassigned and added to the gathered list. Any
partition replicas that (due to the addition of new
devices) can be spread out for better durability are
unassigned and added to the gathered list. Any devices
that have more partitions than they now desire have random
partitions unassigned from them and added to the gathered
list. Lastly, the gathered partitions are then reassigned
to devices using a similar method as in the initial
assignment described above.</para>
<para>Whenever a partition has a replica reassigned, the time
of the reassignment is recorded. This is taken into
account when gathering partitions to reassign so that no
partition is moved twice in a configurable amount of time.
This configurable amount of time is known internally to
the RingBuilder class as min_part_hours. This restriction
is ignored for replicas of partitions on devices that have
been removed, as removing a device only happens on device
failure and there's no choice but to make a
reassignment.</para>
<para>The above processes don't always perfectly rebalance a
ring due to the random nature of gathering partitions for
reassignment. To help reach a more balanced ring, the
rebalance process is repeated until nearly perfect (less than
1% off) or until the balance doesn't improve by at least 1%
(indicating we probably can't get perfect balance due to
wildly imbalanced zones or too many partitions recently
moved).</para>
</section>
<section xml:id="more-concepts">
<title>A Bit More On Swift</title>
<para><guilabel>Containers and Objects</guilabel></para>
<para>A container is a storage compartment for your data and
provides a way for you to organize your data. You can
think of a container as a folder in Windows or a
directory in UNIX. The primary difference between a
container and these other file system concepts is that
containers cannot be nested. You can, however, create an
unlimited number of containers within your account. Data
must be stored in a container so you must have at least
one container defined in your account prior to uploading
data.</para>
<para>The only restrictions on container names are that they
cannot contain a forward slash (/) or an ASCII null (%00)
and must be less than 257 bytes in length. Please note
that the length restriction applies to the name after it
has been URL encoded. For example, a container name of
Course Docs would be URL encoded as Course%20Docs and
therefore be 13 bytes in length rather than the expected
11.</para>
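<para>The length check can be verified with Python's standard
library (a quick illustration, not part of Swift itself):</para>
<programlisting language="python"># Container name length is measured after URL encoding.
from urllib.parse import quote

name = 'Course Docs'
encoded = quote(name)            # 'Course%20Docs'
print(encoded, len(encoded))     # 13 bytes, not the expected 11</programlisting>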
<para>An object is the basic storage entity and any optional
metadata that represents the files you store in the
OpenStack Object Storage system. When you upload data to
OpenStack Object Storage, the data is stored as-is (no
compression or encryption) and consists of a location
(container), the object's name, and any metadata
consisting of key/value pairs. For instance, you may choose
to store a backup of your digital photos and organize them
into albums. In this case, each object could be tagged
with metadata such as Album : Caribbean Cruise or Album :
Aspen Ski Trip.</para>
<para>The only restriction on object names is that they must
be less than 1024 bytes in length after URL encoding. For
example, an object name of C++final(v2).txt should be URL
encoded as C%2B%2Bfinal%28v2%29.txt and therefore be 24
bytes in length rather than the expected 16.</para>
<para>The maximum allowable size for a storage object upon
upload is 5 gigabytes (GB) and the minimum is zero bytes.
You can use the built-in large object support and the
swift utility to retrieve objects larger than 5 GB.</para>
<para>For metadata, you should not exceed 90 individual
key/value pairs for any one object and the total byte
length of all key/value pairs should not exceed 4KB (4096
bytes).</para>
<para><guilabel>Language-Specific API
Bindings</guilabel></para>
<para>A set of supported API bindings in several popular
languages are available from the Rackspace Cloud Files
product, which uses OpenStack Object Storage code for its
implementation. These bindings provide a layer of
abstraction on top of the base REST API, allowing
programmers to work with a container and object model
instead of working directly with HTTP requests and
responses. These bindings are free (as in beer and as in
speech) to download, use, and modify. They are all
licensed under the MIT License as described in the COPYING
file packaged with each binding. If you do make any
improvements to an API, you are encouraged (but not
required) to submit those changes back to us.</para>
<para>The API bindings for Rackspace Cloud Files are hosted
at <link xlink:href="http://github.com/rackspace"
>http://github.com/rackspace</link>. Feel free to
coordinate your changes through github or, if you prefer,
send your changes to cloudfiles@rackspacecloud.com. Just
make sure to indicate which language and version you
modified and send a unified diff.</para>
<para>Each binding includes its own documentation (either
HTML, PDF, or CHM). They also include code snippets and
examples to help you get started. The currently supported
API bindings for OpenStack Object Storage are:</para>
<itemizedlist>
<listitem>
<para>PHP (requires 5.x and the modules: cURL,
FileInfo, mbstring)</para>
</listitem>
<listitem>
<para>Python (requires 2.4 or newer)</para>
</listitem>
<listitem>
<para>Java (requires JRE v1.5 or newer)</para>
</listitem>
<listitem>
<para>C#/.NET (requires .NET Framework v3.5)</para>
</listitem>
<listitem>
<para>Ruby (requires 1.8 or newer and mime-tools
module)</para>
</listitem>
</itemizedlist>
<para>There are no other supported language-specific bindings
at this time. You are welcome to create your own language
API bindings and we can help answer any questions during
development, host your code if you like, and give you full
credit for your work.</para>
<para><guilabel>Proxy Server</guilabel></para>
<para>The Proxy Server is responsible for tying together
the rest of the OpenStack Object Storage architecture.
For each request, it will look up the location of the
account, container, or object in the ring (see below)
and route the request accordingly. The public API is
also exposed through the Proxy Server.</para>
<para>A large number of failures are also handled in the
Proxy Server. For example, if a server is unavailable
for an object PUT, it will ask the ring for a hand-off
server and route there instead.</para>
<para>When objects are streamed to or from an object
server, they are streamed directly through the proxy
server to or from the user; the proxy server does not
spool them.</para>
<para>You can use a proxy server with account management
enabled by configuring it in the proxy server
configuration file.</para>
<para><guilabel>Object Server</guilabel></para>
<para>The Object Server is a very simple blob storage
server that can store, retrieve and delete objects
stored on local devices. Objects are stored as binary
files on the filesystem with metadata stored in the
file's extended attributes (xattrs). This requires
that the underlying filesystem choice for object
servers support xattrs on files. Some filesystems,
like ext3, have xattrs turned off by default.</para>
<para>Each object is stored using a path derived from the
object name's hash and the operation's timestamp. Last
write always wins, and ensures that the latest object
version will be served. A deletion is also treated as
a version of the file (a 0 byte file ending with
“.ts”, which stands for tombstone). This ensures that
deleted files are replicated correctly and older
versions don't magically reappear due to failure
scenarios.</para>
<para><guilabel>Container Server</guilabel></para>
<para>The Container Server's primary job is to handle
listings of objects. It does not know where those
objects are, just what objects are in a specific
container. The listings are stored as sqlite database
files, and replicated across the cluster similar to
how objects are. Statistics are also tracked that
include the total number of objects, and total storage
usage for that container.</para>
<para><guilabel>Account Server</guilabel></para>
<para>The Account Server is very similar to the Container
Server, except that it is responsible for listings
of containers rather than objects.</para>
<para><guilabel>Replication</guilabel></para>
<para>Replication is designed to keep the system in a
consistent state in the face of temporary error
conditions like network outages or drive
failures.</para>
<para>The replication processes compare local data with
each remote copy to ensure they all contain the latest
version. Object replication uses a hash list to
quickly compare subsections of each partition, and
container and account replication use a combination of
hashes and shared high water marks.</para>
<para>Replication updates are push based. For object
replication, updating is just a matter of rsyncing
files to the peer. Account and container replication
push missing records over HTTP or rsync whole database
files.</para>
<para>The replicator also ensures that data is removed
from the system. When an item (object, container, or
account) is deleted, a tombstone is set as the latest
version of the item. The replicator will see the
tombstone and ensure that the item is removed from the
entire system.</para>
<para>To separate the cluster-internal replication traffic
from client traffic, separate replication servers can
be used. These replication servers are based on the
standard storage servers, but they listen on the
replication IP and only respond to REPLICATE requests.
Storage servers can serve REPLICATE requests, so an
operator can transition to using a separate
replication network with no cluster downtime.</para>
<para>Replication IP and port information is stored in the
ring on a per-node basis. These parameters will be
used if they are present, but they are not required.
If this information does not exist or is empty for a
particular node, the node's standard IP and port will
be used for replication.</para>
<para><guilabel>Updaters</guilabel></para>
<para>There are times when container or account data can
not be immediately updated. This usually occurs during
failure scenarios or periods of high load. If an
update fails, the update is queued locally on the file
system, and the updater will process the failed
updates. This is where an eventual consistency window
will most likely come in to play. For example, suppose
a container server is under load and a new object is
put in to the system. The object will be immediately
available for reads as soon as the proxy server
responds to the client with success. However, the
container server did not update the object listing,
and so the update would be queued for a later update.
Container listings, therefore, may not immediately
contain the object.</para>
<para>In practice, the consistency window is only as large
as the frequency at which the updater runs and may not
even be noticed as the proxy server will route listing
requests to the first container server which responds.
The server under load may not be the one that serves
subsequent listing requests; one of the other two
replicas may handle the listing.</para>
<para><guilabel>Auditors</guilabel></para>
<para>Auditors crawl the local server checking the
integrity of the objects, containers, and accounts. If
corruption is found (in the case of bit rot, for
example), the file is quarantined, and replication
will replace the bad file from another replica. If
other errors are found, they are logged (for example,
an object's listing can't be found on any container
server it should be on).</para>
</section>
<section xml:id="cluster-architecture">
<title>Cluster Architecture</title>
<para><guilabel>Access Tier</guilabel></para>
<figure>
<title>Swift Cluster Architecture</title>
<mediaobject>
<imageobject>
<imagedata fileref="figures/image47.png"/>
</imageobject>
</mediaobject>
</figure>
<para>Large-scale deployments segment off an "Access Tier".
This tier is the “Grand Central” of the Object Storage
system. It fields incoming API requests from clients and
moves data in and out of the system. This tier is composed
of front-end load balancers, SSL terminators,
authentication services, and it runs the (distributed)
brain of the object storage system — the proxy server
processes.</para>
<para>Having the access servers in their own tier enables
read/write access to be scaled out independently of
storage capacity. For example, if the cluster is on the
public Internet and requires SSL termination and has high
demand for data access, many access servers can be
provisioned. However, if the cluster is on a private
network and it is being used primarily for archival
purposes, fewer access servers are needed.</para>
<para>As this is an HTTP addressable storage service, a load
balancer can be incorporated into the access tier.</para>
<para>Typically, this tier comprises a collection of 1U
servers. These machines use a moderate amount of RAM and
are network I/O intensive. As these systems field each
incoming API request, it is wise to provision them with
two high-throughput (10GbE) interfaces. One interface is
used for 'front-end' incoming requests and the other for
'back-end' access to the object storage nodes to put and
fetch data.</para>
<para><guilabel>Factors to Consider</guilabel></para>
<para>For most publicly facing deployments as well as
private deployments available across a wide-reaching
corporate network, SSL will be used to encrypt traffic
to the client. SSL adds significant processing load to
establish sessions with clients, so more capacity in
the access layer will need to be provisioned. SSL may
not be required for private deployments on trusted
networks.</para>
<para><guilabel>Storage Nodes</guilabel></para>
<figure>
<title>Object Storage (Swift)</title>
<mediaobject>
<imageobject>
<imagedata fileref="figures/image48.png"/>
</imageobject>
</mediaobject>
</figure>
<para>The next component is the storage servers themselves.
Generally, most configurations should have each of the
five Zones with an equal amount of storage capacity.
Storage nodes use a reasonable amount of memory and CPU.
Metadata needs to be readily available to quickly return
objects. The object stores run services not only to field
incoming requests from the Access Tier, but to also run
replicators, auditors, and reapers. Object stores can be
provisioned with single gigabit or 10 gigabit network
interface depending on expected workload and desired
performance.</para>
<para>Currently 2TB or 3TB SATA disks deliver good
price/performance value. Desktop-grade drives can be used
where there are responsive remote hands in the datacenter,
and enterprise-grade drives can be used where this is not
the case.</para>
<para><guilabel>Factors to Consider</guilabel></para>
<para>Desired I/O performance for single-threaded requests
should be kept in mind. This system does not use RAID,
so each request for an object is handled by a single
disk. Disk performance impacts single-threaded
response rates.</para>
<para>To achieve apparent higher throughput, the object
storage system is designed with concurrent
uploads/downloads in mind. The network I/O capacity
(1GbE, bonded 1GbE pair, or 10GbE) should match your
desired concurrent throughput needs for reads and
writes.</para>
</section>
<section xml:id="account-reaper">
<title>Account Reaper</title>
<para>The Account Reaper removes data from deleted accounts in the
background.</para>
<para>An account is marked for deletion by a reseller issuing a
DELETE request on the account's storage URL. This simply puts
the value DELETED into the status column of the account_stat
table in the account database (and replicas), indicating the
data for the account should be deleted later.</para>
<para>There is normally no set retention time and no undelete; it
is assumed the reseller will implement such features and only
call DELETE on the account once it is truly desired the
account's data be removed. However, in order to protect the
Swift cluster accounts from an improper or mistaken delete
request, you can set a delay_reaping value in the
[account-reaper] section of the account-server.conf to delay
the actual deletion of data. At this time, there is no utility
to undelete an account; one would have to update the account
database replicas directly, setting the status column to an
empty string and updating the put_timestamp to be greater than
the delete_timestamp. (On the TODO list is writing a utility
to perform this task, preferably through a REST call.)</para>
<para>The account reaper runs on each account server and scans the
server occasionally for account databases marked for deletion.
It will only trigger on accounts for which that server is the
primary node, so that multiple account servers aren't all trying
to do the same work at the same time. Using multiple servers
to delete one account might improve deletion speed, but
requires coordination so they aren't duplicating effort. Speed
really isn't as much of a concern with data deletion, and large
accounts aren't deleted that often.</para>
<para>The deletion process for an account itself is pretty
straightforward. For each container in the account, each
object is deleted and then the container is deleted. Any
deletion requests that fail won't stop the overall process,
but will cause the overall process to fail eventually (for
example, if an object delete times out, the container won't be
able to be deleted later and therefore the account won't be
deleted either). The overall process continues even on a
failure so that it doesn't get hung up reclaiming cluster
space because of one troublesome spot. The account reaper will
keep trying to delete an account until it eventually becomes
empty, at which point the database reclaim process within the
db_replicator will eventually remove the database
files.</para>
<para>Sometimes a persistent error state can prevent some object
or container from being deleted. If this happens, you will see
a message such as “Account &lt;name&gt; has not been reaped
since &lt;date&gt;” in the log. You can control when this is
logged with the reap_warn_after value in the [account-reaper]
section of the account-server.conf file. By default this is 30
days.</para>
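<para>Both values mentioned above live in the [account-reaper]
section of account-server.conf. The snippet below is an
illustrative sketch; the numbers (in seconds) are examples, not
recommended settings.</para>
<programlisting>[account-reaper]
# Delay the actual deletion of data after an account is marked deleted.
delay_reaping = 604800
# Log a warning if an account has still not been reaped after this long.
reap_warn_after = 2592000</programlisting>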
</section>
<section xml:id="replication">
<title>Replication</title>
<para>Because each replica in swift functions independently, and
clients generally require only a simple majority of nodes
responding to consider an operation successful, transient
failures like network partitions can quickly cause replicas to
diverge. These differences are eventually reconciled by
asynchronous, peer-to-peer replicator processes. The
replicator processes traverse their local filesystems,
concurrently performing operations in a manner that balances
load across physical disks.</para>
<para>Replication uses a push model, with records and files
generally only being copied from local to remote replicas.
This is important because data on the node may not belong
there (as in the case of handoffs and ring changes), and a
replicator can't know what data exists elsewhere in the
cluster that it should pull in. It's the duty of any node that
contains data to ensure that data gets to where it belongs.
Replica placement is handled by the ring.</para>
<para>Every deleted record or file in the system is marked by a
tombstone, so that deletions can be replicated alongside
creations. The replication process cleans up tombstones after
a time period known as the consistency window. The consistency
window encompasses replication duration and how long transient
failure can remove a node from the cluster. Tombstone cleanup
must be tied to replication to reach replica
convergence.</para>
<para>If a replicator detects that a remote drive has failed, the
replicator uses the get_more_nodes interface for the ring to
choose an alternate node with which to synchronize. The
replicator can maintain desired levels of replication in the
face of disk failures, though some replicas may not be in an
immediately usable location. Note that the replicator doesn't
maintain desired levels of replication when other failures,
such as entire node failures, occur because most failures are
transient.</para>
<para>Replication is an area of active development, and likely
rife with potential improvements to speed and
correctness.</para>
<para>There are two major classes of replicator - the db
replicator, which replicates accounts and containers, and the
object replicator, which replicates object data.</para>
<para><guilabel>DB Replication</guilabel></para>
<para>The first step performed by db replication is a low-cost
hash comparison to determine whether two replicas already
match. Under normal operation, this check is able to
verify that most databases in the system are already
synchronized very quickly. If the hashes differ, the
replicator brings the databases in sync by sharing records
added since the last sync point.</para>
<para>This sync point is a high water mark noting the last
record at which two databases were known to be in sync,
and is stored in each database as a tuple of the remote
database id and record id. Database ids are unique amongst
all replicas of the database, and record ids are
monotonically increasing integers. After all new records
have been pushed to the remote database, the entire sync
table of the local database is pushed, so the remote
database can guarantee that it is in sync with everything
with which the local database has previously
synchronized.</para>
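<para>A simplified sketch of that incremental step (the record ids
and database id are invented; this is not the actual replicator
code): only the records added after the stored sync point are
pushed to the peer.</para>
<programlisting language="python"># Simplified illustration of pushing records added since the last sync.
local_records = [(1, 'obj-a'), (2, 'obj-b'), (3, 'obj-c'), (4, 'obj-d')]
sync_points = {'remote-db-id-1234': 2}   # last record id known to be in sync

def records_to_push(records, remote_id):
    last_synced = sync_points.get(remote_id, -1)
    return [record for record in records if record[0] > last_synced]

print(records_to_push(local_records, 'remote-db-id-1234'))
# [(3, 'obj-c'), (4, 'obj-d')]</programlisting>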
<para>If a replica is found to be missing entirely, the whole
local database file is transmitted to the peer using
rsync(1) and vested with a new unique id.</para>
<para>In practice, DB replication can process hundreds of
databases per concurrency setting per second (up to the
number of available CPUs or disks) and is bound by the
number of DB transactions that must be performed.</para>
<para><guilabel>Object Replication</guilabel></para>
<para>The initial implementation of object replication simply
performed an rsync to push data from a local partition to
all remote servers it was expected to exist on. While this
performed adequately at small scale, replication times
skyrocketed once directory structures could no longer be
held in RAM. We now use a modification of this scheme in
which a hash of the contents for each suffix directory is
saved to a per-partition hashes file. The hash for a
suffix directory is invalidated when the contents of that
suffix directory are modified.</para>
<para>The object replication process reads in these hash
files, calculating any invalidated hashes. It then
transmits the hashes to each remote server that should
hold the partition, and only suffix directories with
differing hashes on the remote server are rsynced. After
pushing files to the remote server, the replication
process notifies it to recalculate hashes for the rsynced
suffix directories.</para>
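<para>A simplified sketch of that comparison (the suffix names and
hash values are invented): only suffix directories whose hashes
differ on the remote server are rsynced.</para>
<programlisting language="python"># Compare per-suffix hashes and pick the suffix directories to rsync.
local_hashes  = {'a83': 'd41d8cd9', 'f00': '9e107d9d', '1c2': 'e4d909c2'}
remote_hashes = {'a83': 'd41d8cd9', 'f00': '00000000', '1c2': 'e4d909c2'}

suffixes_to_sync = [suffix for suffix, digest in local_hashes.items()
                    if remote_hashes.get(suffix) != digest]
print(suffixes_to_sync)   # ['f00'] -- only this suffix directory is rsynced</programlisting>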
<para>Performance of object replication is generally bound by
the number of uncached directories it has to traverse,
usually as a result of invalidated suffix directory
hashes. Using write volume and partition counts from our
running systems, it was designed so that around 2% of the
hash space on a normal node will be invalidated per day,
which has experimentally given us acceptable replication
speeds.</para>
</section>
</section>