<?xml version="1.0" encoding="utf-8"?>
<section xmlns="http://docbook.org/ns/docbook" xmlns:xi="http://www.w3.org/2001/XInclude"
    xmlns:xlink="http://www.w3.org/1999/xlink" version="5.0"
    xml:id="associate-storage-node-concept-swift">
    <title>Conceptual Swift</title>
    <section xml:id="intro-objectstore">
        <title>Introduction to Object Storage</title>
        <para>OpenStack Object Storage (code-named Swift) is open source software for
            creating redundant, scalable data storage using clusters of standardized
            servers to store petabytes of accessible data. It is a long-term storage
            system for large amounts of static data that can be retrieved, leveraged,
            and updated. Object Storage uses a distributed architecture with no central
            point of control, providing greater scalability, redundancy, and permanence.
            Objects are written to multiple hardware devices, with the OpenStack
            software responsible for ensuring data replication and integrity across the
            cluster. Storage clusters scale horizontally by adding new nodes. Should a
            node fail, OpenStack works to replicate its content from other active
            nodes. Because OpenStack uses software logic to ensure data replication and
            distribution across different devices, inexpensive commodity hard drives
            and servers can be used in lieu of more expensive equipment.</para>
        <para>Object Storage is ideal for cost-effective, scale-out storage. It
            provides a fully distributed, API-accessible storage platform that can be
            integrated directly into applications or used for backup, archiving, and
            data retention. Block Storage allows block devices to be exposed and
            connected to compute instances for expanded storage, better performance,
            and integration with enterprise storage platforms such as NetApp, Nexenta,
            and SolidFire.</para>
    </section>
    <section xml:id="features-benefits">
        <title>Features and Benefits</title>
        <para>
            <informaltable>
                <tbody>
                    <tr>
                        <th rowspan="1" colspan="1">Features</th>
                        <th rowspan="1" colspan="1">Benefits</th>
                    </tr>
                    <tr>
                        <td rowspan="1" colspan="1"><emphasis role="bold">Leverages
                            commodity hardware</emphasis></td>
                        <td rowspan="1" colspan="1">No lock-in, lower price/GB</td>
                    </tr>
                    <tr>
                        <td rowspan="1" colspan="1"><emphasis role="bold">HDD/node
                            failure agnostic</emphasis></td>
                        <td rowspan="1" colspan="1">Self-healing and reliable; data
                            redundancy protects from failures</td>
                    </tr>
                    <tr>
                        <td rowspan="1" colspan="1"><emphasis role="bold">Unlimited
                            storage</emphasis></td>
                        <td rowspan="1" colspan="1">Huge &amp; flat namespace, highly
                            scalable read/write access; ability to serve content
                            directly from the storage system</td>
                    </tr>
                    <tr>
                        <td rowspan="1" colspan="1"><emphasis role="bold"
                            >Multi-dimensional scalability</emphasis> (scale-out
                            architecture): scales vertically and horizontally
                            (distributed storage)</td>
                        <td rowspan="1" colspan="1">Back up and archive large amounts
                            of data with linear performance</td>
                    </tr>
                    <tr>
                        <td rowspan="1" colspan="1"><emphasis role="bold"
                            >Account/Container/Object structure</emphasis>: no nesting,
                            not a traditional file system</td>
                        <td rowspan="1" colspan="1">Optimized for scale; scales to
                            multiple petabytes and billions of objects</td>
                    </tr>
                    <tr>
                        <td rowspan="1" colspan="1"><emphasis role="bold">Built-in
                            replication</emphasis>: 3x+ data redundancy compared to 2x
                            on RAID</td>
                        <td rowspan="1" colspan="1">Configurable number of account,
                            container, and object copies for high availability</td>
                    </tr>
                    <tr>
                        <td rowspan="1" colspan="1"><emphasis role="bold">Easily add
                            capacity</emphasis>, unlike RAID resize</td>
                        <td rowspan="1" colspan="1">Elastic data scaling with ease</td>
                    </tr>
                    <tr>
                        <td rowspan="1" colspan="1"><emphasis role="bold">No central
                            database</emphasis></td>
                        <td rowspan="1" colspan="1">Higher performance, no
                            bottlenecks</td>
                    </tr>
                    <tr>
                        <td rowspan="1" colspan="1"><emphasis role="bold">RAID not
                            required</emphasis></td>
                        <td rowspan="1" colspan="1">Handles lots of small, random reads
                            and writes efficiently</td>
                    </tr>
                    <tr>
                        <td rowspan="1" colspan="1"><emphasis role="bold">Built-in
                            management utilities</emphasis></td>
                        <td rowspan="1" colspan="1">Account management: create, add,
                            verify, and delete users. Container management: upload,
                            download, and verify. Monitoring: capacity, host, network,
                            log trawling, and cluster health</td>
                    </tr>
                    <tr>
                        <td rowspan="1" colspan="1"><emphasis role="bold">Drive
                            auditing</emphasis></td>
                        <td rowspan="1" colspan="1">Detects drive failures, preempting
                            data corruption</td>
                    </tr>
                    <tr>
                        <td rowspan="1" colspan="1"><emphasis role="bold">Expiring
                            objects</emphasis></td>
                        <td rowspan="1" colspan="1">Users can set an expiration time or
                            a TTL on an object to control access</td>
                    </tr>
                    <tr>
                        <td rowspan="1" colspan="1"><emphasis role="bold">Direct object
                            access</emphasis></td>
                        <td rowspan="1" colspan="1">Enables direct browser access to
                            content, such as for a control panel</td>
                    </tr>
                    <tr>
                        <td rowspan="1" colspan="1"><emphasis role="bold">Realtime
                            visibility into client requests</emphasis></td>
                        <td rowspan="1" colspan="1">Know what users are requesting</td>
                    </tr>
                    <tr>
                        <td rowspan="1" colspan="1"><emphasis role="bold">Supports S3
                            API</emphasis></td>
                        <td rowspan="1" colspan="1">Utilize tools that were designed
                            for the popular S3 API</td>
                    </tr>
                    <tr>
                        <td rowspan="1" colspan="1"><emphasis role="bold">Restrict
                            containers per account</emphasis></td>
                        <td rowspan="1" colspan="1">Limit access to control usage by
                            user</td>
                    </tr>
                    <tr>
                        <td rowspan="1" colspan="1"><emphasis role="bold">Support for
                            NetApp, Nexenta, SolidFire</emphasis></td>
                        <td rowspan="1" colspan="1">Unified support for block volumes
                            using a variety of storage systems</td>
                    </tr>
                    <tr>
                        <td rowspan="1" colspan="1"><emphasis role="bold">Snapshot and
                            backup API for block volumes</emphasis></td>
                        <td rowspan="1" colspan="1">Data protection and recovery for VM
                            data</td>
                    </tr>
                    <tr>
                        <td rowspan="1" colspan="1"><emphasis role="bold">Standalone
                            volume API available</emphasis></td>
                        <td rowspan="1" colspan="1">Separate endpoint and API for
                            integration with other compute systems</td>
                    </tr>
                    <tr>
                        <td rowspan="1" colspan="1"><emphasis role="bold">Integration
                            with Compute</emphasis></td>
                        <td rowspan="1" colspan="1">Fully integrated with Compute for
                            attaching block volumes and reporting on usage</td>
                    </tr>
                </tbody>
            </informaltable>
        </para>
    </section>
    <section xml:id="obj-store-capabilities">
        <title>Object Storage Capabilities</title>
        <itemizedlist>
            <listitem>
                <para>OpenStack provides redundant, scalable object storage using
                    clusters of standardized servers capable of storing petabytes of
                    data.</para>
            </listitem>
            <listitem>
                <para>Object Storage is not a traditional file system, but rather a
                    distributed storage system for static data such as virtual machine
                    images, photo storage, email storage, backups, and archives. Having
                    no central "brain" or master point of control provides greater
                    scalability, redundancy, and durability.</para>
            </listitem>
            <listitem>
                <para>Objects and files are written to multiple disk drives spread
                    throughout servers in the data center, with the OpenStack software
                    responsible for ensuring data replication and integrity across the
                    cluster.</para>
            </listitem>
            <listitem>
                <para>Storage clusters scale horizontally simply by adding new servers.
                    Should a server or hard drive fail, OpenStack replicates its
                    content from other active nodes to new locations in the cluster.
                    Because OpenStack uses software logic to ensure data replication
                    and distribution across different devices, inexpensive commodity
                    hard drives and servers can be used in lieu of more expensive
                    equipment.</para>
            </listitem>
        </itemizedlist>
        <para><guilabel>Swift Characteristics</guilabel></para>
        <para>The key characteristics of Swift include:</para>
        <itemizedlist>
            <listitem>
                <para>All objects stored in Swift have a URL.</para>
            </listitem>
            <listitem>
                <para>All objects stored are replicated 3x in as-unique-as-possible
                    zones, which can be defined as a group of drives, a node, a rack,
                    and so on.</para>
            </listitem>
            <listitem>
                <para>All objects have their own metadata.</para>
            </listitem>
            <listitem>
                <para>Developers interact with the object storage system through a
                    RESTful HTTP API.</para>
            </listitem>
            <listitem>
                <para>Object data can be located anywhere in the cluster.</para>
            </listitem>
            <listitem>
                <para>The cluster scales by adding additional nodes without sacrificing
                    performance, which allows a more cost-effective linear storage
                    expansion than fork-lift upgrades.</para>
            </listitem>
            <listitem>
                <para>Data doesn't have to be migrated to an entirely new storage
                    system.</para>
            </listitem>
            <listitem>
                <para>New nodes can be added to the cluster without downtime.</para>
            </listitem>
            <listitem>
                <para>Failed nodes and disks can be swapped out with no
                    downtime.</para>
            </listitem>
            <listitem>
                <para>Runs on industry-standard hardware, such as Dell, HP, and
                    Supermicro.</para>
            </listitem>
        </itemizedlist>
        <figure>
            <title>Object Storage (Swift)</title>
            <mediaobject>
                <imageobject>
                    <imagedata fileref="figures/image39.png"/>
                </imageobject>
            </mediaobject>
        </figure>
        <para>Developers can either write directly to the Swift API or use one of the
            many client libraries that exist for all popular programming languages,
            such as Java, Python, Ruby, and C#. Amazon S3 and Rackspace Cloud Files
            users should feel very familiar with Swift. For users who have not used an
            object storage system before, it will require a different approach and
            mindset than using a traditional filesystem.</para>
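        <para>As a hedged illustration of that API, the following sketch uploads an
            object with plain HTTP using the Python <literal>requests</literal>
            library; the storage URL, token, container, and object names are
            placeholders, not values from this guide.</para>
        <programlisting language="python"># A minimal sketch of the object API, assuming a placeholder endpoint
# and auth token; a real deployment obtains both from its auth service.
import requests

storage_url = 'https://swift.example.com/v1/AUTH_test'   # placeholder
token = 'AUTH_tk_replace_me'                              # placeholder

# PUT an object into an existing container.
with open('vacation.jpg', 'rb') as body:
    resp = requests.put(
        storage_url + '/photos/vacation.jpg',
        headers={'X-Auth-Token': token, 'Content-Type': 'image/jpeg'},
        data=body,
    )
print(resp.status_code)   # 201 Created on success</programlisting>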
    </section>
    <section xml:id="swift-building-blocks">
        <title>Building Blocks of Swift</title>
        <para>The components that enable Swift to deliver high availability, high
            durability, and high concurrency are:</para>
        <itemizedlist>
            <listitem>
                <para><emphasis role="bold">Proxy Servers:</emphasis> Handle all
                    incoming API requests.</para>
            </listitem>
            <listitem>
                <para><emphasis role="bold">Rings:</emphasis> Map logical names of data
                    to locations on particular disks.</para>
            </listitem>
            <listitem>
                <para><emphasis role="bold">Zones:</emphasis> Each Zone isolates data
                    from other Zones. A failure in one Zone does not impact the rest of
                    the cluster because data is replicated across the Zones.</para>
            </listitem>
            <listitem>
                <para><emphasis role="bold">Accounts &amp; Containers:</emphasis> Each
                    Account and Container is an individual database that is distributed
                    across the cluster. An Account database contains the list of
                    Containers in that Account. A Container database contains the list
                    of Objects in that Container.</para>
            </listitem>
            <listitem>
                <para><emphasis role="bold">Objects:</emphasis> The data itself.</para>
            </listitem>
            <listitem>
                <para><emphasis role="bold">Partitions:</emphasis> A Partition stores
                    Objects, Account databases, and Container databases. It is an
                    intermediate 'bucket' that helps manage locations where data lives
                    in the cluster.</para>
            </listitem>
        </itemizedlist>
        <figure>
            <title>Building Blocks</title>
            <mediaobject>
                <imageobject>
                    <imagedata fileref="figures/image40.png"/>
                </imageobject>
            </mediaobject>
        </figure>
        <para><guilabel>Proxy Servers</guilabel></para>
        <para>The Proxy Servers are the public face of Swift and handle all incoming
            API requests. Once a Proxy Server receives a request, it determines the
            storage node based on the URL of the object, for example
            https://swift.example.com/v1/account/container/object. The Proxy Servers
            also coordinate responses, handle failures, and coordinate
            timestamps.</para>
        <para>Proxy servers use a shared-nothing architecture and can be scaled as
            needed based on projected workloads. A minimum of two Proxy Servers should
            be deployed for redundancy. Should one proxy server fail, the others will
            take over.</para>
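        <para>A minimal sketch of the URL-to-location naming scheme, assuming the
            /v1/account/container/object path form mentioned above (the example URL and
            names are placeholders):</para>
        <programlisting language="python"># Split a storage URL path of the form /v1/account/container/object;
# this only illustrates the naming scheme, not the proxy's actual code.
from urllib.parse import urlparse

url = 'https://swift.example.com/v1/AUTH_test/photos/cruise/day1.jpg'
version, account, container, obj = urlparse(url).path.lstrip('/').split('/', 3)
print(account, container, obj)   # AUTH_test photos cruise/day1.jpg</programlisting>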
        <para><guilabel>The Ring</guilabel></para>
        <para>A ring represents a mapping between the names of entities stored on disk
            and their physical locations. There are separate rings for accounts,
            containers, and objects. When other components need to perform any
            operation on an object, container, or account, they need to interact with
            the appropriate ring to determine its location in the cluster.</para>
        <para>The Ring maintains this mapping using zones, devices, partitions, and
            replicas. Each partition in the ring is replicated, by default, three times
            across the cluster, and the locations for a partition are stored in the
            mapping maintained by the ring. The ring is also responsible for
            determining which devices are used for handoff in failure
            scenarios.</para>
        <para>Data can be isolated with the concept of zones in the ring. Each replica
            of a partition is guaranteed to reside in a different zone. A zone could
            represent a drive, a server, a cabinet, a switch, or even a data
            center.</para>
        <para>The partitions of the ring are equally divided among all the devices in
            the OpenStack Object Storage installation. When partitions need to be moved
            around (for example, if a device is added to the cluster), the ring ensures
            that a minimum number of partitions are moved at a time, and only one
            replica of a partition is moved at a time.</para>
        <para>Weights can be used to balance the distribution of partitions on drives
            across the cluster. This can be useful, for example, when differently sized
            drives are used in a cluster.</para>
        <para>The ring is used by the Proxy server and several background processes
            (like replication).</para>
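        <para>As a hedged sketch of how a component consults a ring, the following
            assumes the swift Python package is installed and a built object ring
            exists under /etc/swift; the account, container, and object names are
            placeholders.</para>
        <programlisting language="python"># Look up the partition and primary nodes for an object (illustrative;
# requires a real /etc/swift/object.ring.gz built by the ring-builder).
from swift.common.ring import Ring

object_ring = Ring('/etc/swift', ring_name='object')
partition, nodes = object_ring.get_nodes('AUTH_test', 'photos', 'cruise/day1.jpg')

print(partition)
for node in nodes:          # one primary node per replica
    print(node['ip'], node['port'], node['device'])</programlisting>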
        <figure>
            <title>The Lord of the <emphasis role="bold">Ring</emphasis>s</title>
            <mediaobject>
                <imageobject>
                    <imagedata fileref="figures/image41.png"/>
                </imageobject>
            </mediaobject>
        </figure>
        <para>The Ring maps partitions to physical locations on disk.</para>
        <para>The rings determine where data should reside in the cluster. There is a
            separate ring for account databases, container databases, and individual
            objects, but each ring works in the same way. These rings are externally
            managed, in that the server processes themselves do not modify the rings;
            they are instead given new rings modified by other tools.</para>
        <para>The ring uses a configurable number of bits from a path's MD5 hash as a
            partition index that designates a device. The number of bits kept from the
            hash is known as the partition power, and 2 to the partition power
            indicates the partition count. Partitioning the full MD5 hash ring allows
            other parts of the cluster to work in batches of items at once, which ends
            up either more efficient or at least less complex than working with each
            item separately or the entire cluster all at once.</para>
        <para>Another configurable value is the replica count, which indicates how many
            of the partition-to-device assignments comprise a single ring. For a given
            partition number, each replica's device will not be in the same zone as any
            other replica's device. Zones can be used to group devices based on
            physical locations, power separations, network separations, or any other
            attribute that would lessen the chance of multiple replicas being
            unavailable at the same time.</para>
        <para><guilabel>Zones: Failure Boundaries</guilabel></para>
        <para>Swift allows zones to be configured to isolate failure boundaries. Each
            replica of the data resides in a separate zone, if possible. At the
            smallest level, a zone could be a single drive or a grouping of a few
            drives. If there were five object storage servers, then each server would
            represent its own zone. Larger deployments would have an entire rack (or
            multiple racks) of object servers, each representing a zone. The goal of
            zones is to allow the cluster to tolerate significant outages of storage
            servers without losing all replicas of the data.</para>
        <para>As we learned earlier, everything in Swift is stored, by default, three
            times. Swift will place each replica "as-uniquely-as-possible" to ensure
            both high availability and high durability. This means that when choosing a
            replica location, Swift will choose a server in an unused zone before an
            unused server in a zone that already has a replica of the data.</para>
        <figure>
            <title>Zones</title>
            <mediaobject>
                <imageobject>
                    <imagedata fileref="figures/image42.png"/>
                </imageobject>
            </mediaobject>
        </figure>
        <para>When a disk fails, replica data is automatically distributed to the other
            zones to ensure there are three copies of the data.</para>
        <para><guilabel>Accounts &amp; Containers</guilabel></para>
        <para>Each account and container is an individual SQLite database that is
            distributed across the cluster. An account database contains the list of
            containers in that account. A container database contains the list of
            objects in that container.</para>
        <figure>
            <title>Accounts and Containers</title>
            <mediaobject>
                <imageobject>
                    <imagedata fileref="figures/image43.png"/>
                </imageobject>
            </mediaobject>
        </figure>
        <para>To keep track of object data locations, each account in the system has a
            database that references all of its containers, and each container database
            references each of its objects.</para>
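        <para>A toy model of that idea, assuming nothing about Swift's real database
            schema: a container database is essentially a table of object rows.</para>
        <programlisting language="python"># Toy container database (illustrative only; Swift's actual schema differs).
import sqlite3

db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE object (name TEXT, created_at TEXT, size INTEGER)')
db.execute("INSERT INTO object VALUES ('cruise/day1.jpg', '1400000000.00000', 1048576)")
for row in db.execute('SELECT name, size FROM object'):
    print(row)</programlisting>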
        <para><guilabel>Partitions</guilabel></para>
        <para>A Partition is a collection of stored data, including Account databases,
            Container databases, and objects. Partitions are core to the replication
            system.</para>
        <para>Think of a Partition as a bin moving through a fulfillment center
            warehouse. Individual orders get thrown into the bin. The system treats
            that bin as a cohesive entity as it moves throughout the system. A bin full
            of things is easier to deal with than lots of little things. It makes for
            fewer moving parts throughout the system.</para>
        <para>The system replicators and object uploads/downloads operate on
            Partitions. As the system scales up, behavior continues to be predictable
            because the number of Partitions is fixed.</para>
        <para>The implementation of a Partition is conceptually simple: a partition is
            just a directory sitting on a disk with a corresponding hash table of what
            it contains.</para>
        <figure>
            <title>Partitions</title>
            <mediaobject>
                <imageobject>
                    <imagedata fileref="figures/image44.png"/>
                </imageobject>
            </mediaobject>
        </figure>
        <para>Swift partitions contain all data in the system.</para>
        <para><guilabel>Replication</guilabel></para>
        <para>In order to ensure that there are three copies of the data everywhere,
            replicators continuously examine each Partition. For each local Partition,
            the replicator compares it against the replicated copies in the other Zones
            to see if there are any differences.</para>
        <para>How does the replicator know if replication needs to take place? It does
            this by examining hashes. A hash file is created for each Partition, which
            contains hashes of each directory in the Partition. For a given Partition,
            the hash files for each of the Partition's three copies are compared. If
            the hashes are different, then it is time to replicate, and the directory
            that needs to be replicated is copied over.</para>
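        <para>A toy sketch of that hash comparison, assuming nothing about the
            replicator's real on-disk layout or hashing details:</para>
        <programlisting language="python"># Hash each suffix directory under a partition by the names it contains,
# then compare the per-directory hashes of two copies (illustrative only).
import hashlib
import os

def suffix_hashes(partition_path):
    hashes = {}
    for suffix in sorted(os.listdir(partition_path)):
        md5 = hashlib.md5()
        suffix_path = os.path.join(partition_path, suffix)
        for name in sorted(os.listdir(suffix_path)):
            md5.update(name.encode())
        hashes[suffix] = md5.hexdigest()
    return hashes

# Suffix directories whose hashes differ between the local and remote copy
# are the ones that need to be synced:
# local = suffix_hashes('/srv/node/sdb1/objects/1234')
# remote = suffix_hashes('/srv/node/sdc1/objects/1234')
# to_sync = [s for s in local if local[s] != remote.get(s)]</programlisting>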
        <para>This is where the Partitions come in handy. With fewer "things" in the
            system, larger chunks of data are transferred around (rather than lots of
            little TCP connections, which is inefficient) and there is a consistent
            number of hashes to compare.</para>
        <para>The cluster has eventually consistent behavior where the newest data
            wins.</para>
        <figure>
            <title>Replication</title>
            <mediaobject>
                <imageobject>
                    <imagedata fileref="figures/image45.png"/>
                </imageobject>
            </mediaobject>
        </figure>
        <para>If a zone goes down, one of the nodes containing a replica notices and
            proactively copies data to a handoff location.</para>
        <para>To describe how these pieces all come together, let's walk through a few
            scenarios and introduce the components.</para>
        <para><guilabel>Bird's-eye View</guilabel></para>
        <para><emphasis role="bold">Upload</emphasis></para>
        <para>A client uses the REST API to make an HTTP request to PUT an object into
            an existing Container. The cluster receives the request. First, the system
            must figure out where the data is going to go. To do this, the Account
            name, Container name, and Object name are all used to determine the
            Partition where this object should live.</para>
        <para>Then a lookup in the Ring figures out which storage nodes contain the
            Partition in question.</para>
        <para>The data is then sent to each storage node, where it is placed in the
            appropriate Partition. A quorum is required: at least two of the three
            writes must be successful before the client is notified that the upload was
            successful.</para>
        <para>Next, the Container database is updated asynchronously to reflect that
            there is a new object in it.</para>
        <figure>
            <title>When End-User uses Swift</title>
            <mediaobject>
                <imageobject>
                    <imagedata fileref="figures/image46.png"/>
                </imageobject>
            </mediaobject>
        </figure>
        <para><emphasis role="bold">Download</emphasis></para>
        <para>A request comes in for an Account/Container/object. Using the same
            consistent hashing, the Partition name is generated. A lookup in the Ring
            reveals which storage nodes contain that Partition. A request is made to
            one of the storage nodes to fetch the object; if that fails, requests are
            made to the other nodes.</para>
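        <para>For completeness, a hedged client-side view of the download path, using
            the same placeholder endpoint and token as the earlier upload sketch:</para>
        <programlisting language="python"># GET the object back over the HTTP API (placeholders as before).
import requests

resp = requests.get(
    'https://swift.example.com/v1/AUTH_test/photos/vacation.jpg',
    headers={'X-Auth-Token': 'AUTH_tk_replace_me'},
)
print(resp.status_code, len(resp.content))   # 200 and the object's size</programlisting>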
    </section>
    <section xml:id="the-ring">
        <title>Ring Builder</title>
        <para>The rings are built and managed manually by a utility called the
            ring-builder. The ring-builder assigns partitions to devices and writes an
            optimized Python structure to a gzipped, serialized file on disk for
            shipping out to the servers. The server processes just check the
            modification time of the file occasionally and reload their in-memory
            copies of the ring structure as needed. Because of how the ring-builder
            manages changes to the ring, using a slightly older ring usually just means
            one of the three replicas for a subset of the partitions will be incorrect,
            which can be easily worked around.</para>
        <para>The ring-builder also keeps its own builder file with the ring
            information and additional data required to build future rings. It is very
            important to keep multiple backup copies of these builder files. One option
            is to copy the builder files out to every server while copying the ring
            files themselves. Another is to upload the builder files into the cluster
            itself. Complete loss of a builder file will mean creating a new ring from
            scratch; nearly all partitions will end up assigned to different devices,
            and therefore nearly all of the stored data will have to be replicated to
            new locations. So, recovery from a builder file loss is possible, but data
            will definitely be unreachable for an extended time.</para>
        <para><guilabel>Ring Data Structure</guilabel></para>
        <para>The ring data structure consists of three top-level fields: a list of
            devices in the cluster, a list of lists of device ids indicating partition
            to device assignments, and an integer indicating the number of bits to
            shift an MD5 hash to calculate the partition for the hash.</para>
        <para><guilabel>Partition Assignment List</guilabel></para>
        <para>This is a list of array('H') of device ids. The outermost list contains
            an array('H') for each replica. Each array('H') has a length equal to the
            partition count for the ring. Each integer in the array('H') is an index
            into the above list of devices. The partition list is known internally to
            the Ring class as _replica2part2dev_id.</para>
        <para>So, to create a list of device dictionaries assigned to a partition, the
            Python code would look like:</para>
        <programlisting language="python">devices = [self.devs[part2dev_id[partition]]
           for part2dev_id in self._replica2part2dev_id]</programlisting>
        <para>That code is a little simplistic, as it does not account for the removal
            of duplicate devices. If a ring has more replicas than devices, then a
            partition will have more than one replica on one device; that is simply the
            pigeonhole principle at work.</para>
        <para>array('H') is used for memory conservation, as there may be millions of
            partitions.</para>
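        <para>A toy illustration of that structure, with made-up numbers (three
            replicas, four partitions, three devices) rather than anything taken from a
            real ring:</para>
        <programlisting language="python"># Illustrative partition assignment list: one array('H') per replica,
# indexed by partition number, holding device indexes.
from array import array

devs = [{'id': 0, 'zone': 1}, {'id': 1, 'zone': 2}, {'id': 2, 'zone': 3}]
replica2part2dev_id = [
    array('H', [0, 1, 2, 0]),   # replica 0: partition number -> device index
    array('H', [1, 2, 0, 1]),   # replica 1
    array('H', [2, 0, 1, 2]),   # replica 2
]

partition = 3
devices = [devs[part2dev_id[partition]]
           for part2dev_id in replica2part2dev_id]
print(devices)   # the three devices holding partition 3</programlisting>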
        <para><guilabel>Fractional Replicas</guilabel></para>
        <para>A ring is not restricted to having an integer number of replicas. In
            order to support the gradual changing of replica counts, the ring is able
            to have a real number of replicas.</para>
        <para>When the number of replicas is not an integer, the last element of
            _replica2part2dev_id will have a length that is less than the partition
            count for the ring. This means that some partitions will have more replicas
            than others. For example, if a ring has 3.25 replicas, then 25% of its
            partitions will have four replicas, while the remaining 75% will have just
            three.</para>
        <para><guilabel>Partition Shift Value</guilabel></para>
        <para>The partition shift value is known internally to the Ring class as
            _part_shift. This value is used to shift an MD5 hash to calculate the
            partition on which the data for that hash should reside. Only the top four
            bytes of the hash are used in this process. For example, to compute the
            partition for the path /account/container/object, the Python code might
            look like:</para>
        <programlisting language="python">partition = unpack_from('>I',
    md5('/account/container/object').digest())[0] >> self._part_shift</programlisting>
        <para>For a ring generated with part_power P, the partition shift value is
            32 - P.</para>
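        <para>The following standalone sketch shows the same computation end to end.
            The partition power of 16 is an assumed example value, and a real Swift
            cluster additionally mixes a per-cluster hash prefix/suffix into the path
            before hashing, which is omitted here.</para>
        <programlisting language="python"># Compute the partition for a path, given an assumed partition power.
from hashlib import md5
from struct import unpack_from

part_power = 16                     # assumed value for illustration
part_shift = 32 - part_power        # the partition shift value

path = '/account/container/object'
partition = unpack_from('>I', md5(path.encode()).digest())[0] >> part_shift
print(partition)                    # an integer in the range [0, 2**part_power)</programlisting>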
        <para><guilabel>Building the Ring</guilabel></para>
        <para>The initial building of the ring first calculates the number of
            partitions that should ideally be assigned to each device based on the
            device's weight. For example, given a partition power of 20, the ring will
            have 1,048,576 partitions. If there are 1,000 devices of equal weight, they
            will each desire 1,048.576 partitions. The devices are then sorted by the
            number of partitions they desire and kept in order throughout the
            initialization process.</para>
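        <para>The arithmetic behind that example, written out (the values follow the
            text above and the weights are assumed equal):</para>
        <programlisting language="python"># Desired partitions per device for equal weights.
part_power = 20
num_devices = 1000

total_partitions = 2 ** part_power
desired_per_device = total_partitions / num_devices
print(total_partitions, desired_per_device)   # 1048576 1048.576</programlisting>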
        <para>Note: each device is also assigned a random tiebreaker value that is used
            when two devices desire the same number of partitions. This tiebreaker is
            not stored on disk anywhere, and so two different rings created with the
            same parameters will have different partition assignments. For repeatable
            partition assignments, RingBuilder.rebalance() takes an optional seed value
            that will be used to seed Python's pseudo-random number generator.</para>
        <para>Then, the ring builder assigns each replica of each partition to the
            device that desires the most partitions at that point, while keeping it as
            far away as possible from other replicas. The ring builder prefers to
            assign a replica to a device in a region that has no replicas already;
            should there be no such region available, the ring builder will try to find
            a device in a different zone; if that is not possible, it will look on a
            different server; failing that, it will just look for a device that has no
            replicas; finally, if all other options are exhausted, the ring builder
            will assign the replica to the device that has the fewest replicas already
            assigned. Note that assignment of multiple replicas to one device will only
            happen if the ring has fewer devices than it has replicas.</para>
        <para>When building a new ring based on an old ring, the desired number of
            partitions each device wants is recalculated. Next, the partitions to be
            reassigned are gathered up. Any removed devices have all their assigned
            partitions unassigned and added to the gathered list. Any partition
            replicas that (due to the addition of new devices) can be spread out for
            better durability are unassigned and added to the gathered list. Any
            devices that have more partitions than they now desire have random
            partitions unassigned from them and added to the gathered list. Lastly, the
            gathered partitions are then reassigned to devices using a similar method
            as in the initial assignment described above.</para>
        <para>Whenever a partition has a replica reassigned, the time of the
            reassignment is recorded. This is taken into account when gathering
            partitions to reassign so that no partition is moved twice in a
            configurable amount of time. This configurable amount of time is known
            internally to the RingBuilder class as min_part_hours. This restriction is
            ignored for replicas of partitions on devices that have been removed, as
            removing a device only happens on device failure and there is no choice but
            to make a reassignment.</para>
        <para>The above processes do not always perfectly rebalance a ring due to the
            random nature of gathering partitions for reassignment. To help reach a
            more balanced ring, the rebalance process is repeated until near perfect
            (less than 1% off) or until the balance does not improve by at least 1%
            (indicating we probably cannot get a perfect balance due to wildly
            imbalanced zones or too many partitions recently moved).</para>
    </section>
    <section xml:id="more-concepts">
        <title>A Bit More On Swift</title>
        <para><guilabel>Containers and Objects</guilabel></para>
        <para>A container is a storage compartment for your data and provides a way for
            you to organize your data. You can think of a container as a folder in
            Windows or a directory in UNIX. The primary difference between a container
            and these other file system concepts is that containers cannot be nested.
            You can, however, create an unlimited number of containers within your
            account. Data must be stored in a container, so you must have at least one
            container defined in your account prior to uploading data.</para>
        <para>The only restrictions on container names are that they cannot contain a
            forward slash (/) or an ASCII null (%00) and must be less than 257 bytes in
            length. Please note that the length restriction applies to the name after
            it has been URL encoded. For example, a container name of Course Docs would
            be URL encoded as Course%20Docs and therefore be 13 bytes in length rather
            than the expected 11.</para>
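        <para>A small illustrative check of those rules (this is not Swift's own
            validation code, and the helper name is made up):</para>
        <programlisting language="python"># Check a container name against the rules described above.
from urllib.parse import quote

def container_name_ok(name):
    encoded = quote(name)     # the length limit applies after URL encoding
    return '/' not in name and '\x00' not in name and len(encoded) &lt; 257

print(quote('Course Docs'), container_name_ok('Course Docs'))   # Course%20Docs True</programlisting>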
        <para>An object is the basic storage entity, together with any optional
            metadata, that represents a file you store in the OpenStack Object Storage
            system. When you upload data to OpenStack Object Storage, the data is
            stored as-is (no compression or encryption) and consists of a location
            (container), the object's name, and any metadata consisting of key/value
            pairs. For instance, you may choose to store a backup of your digital
            photos and organize them into albums. In this case, each object could be
            tagged with metadata such as Album : Caribbean Cruise or Album : Aspen Ski
            Trip.</para>
        <para>The only restriction on object names is that they must be less than 1024
            bytes in length after URL encoding. For example, an object name of
            C++final(v2).txt should be URL encoded as C%2B%2Bfinal%28v2%29.txt and
            therefore be 24 bytes in length rather than the expected 16.</para>
        <para>The maximum allowable size for a storage object upon upload is 5
            gigabytes (GB) and the minimum is zero bytes. You can use the built-in
            large object support and the swift utility to retrieve objects larger than
            5 GB.</para>
        <para>For metadata, you should not exceed 90 individual key/value pairs for any
            one object, and the total byte length of all key/value pairs should not
            exceed 4 KB (4096 bytes).</para>
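        <para>As a hedged example of attaching such metadata over the HTTP API (the
            endpoint, token, and album name are placeholders), custom object metadata
            is carried in headers of the form X-Object-Meta-*:</para>
        <programlisting language="python"># POST to an object to set or update its metadata.
import requests

resp = requests.post(
    'https://swift.example.com/v1/AUTH_test/photos/beach.jpg',
    headers={
        'X-Auth-Token': 'AUTH_tk_replace_me',
        'X-Object-Meta-Album': 'Caribbean Cruise',
    },
)
print(resp.status_code)   # 202 Accepted when the metadata is updated</programlisting>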
        <para><guilabel>Language-Specific API Bindings</guilabel></para>
        <para>A set of supported API bindings in several popular languages are
            available from the Rackspace Cloud Files product, which uses OpenStack
            Object Storage code for its implementation. These bindings provide a layer
            of abstraction on top of the base REST API, allowing programmers to work
            with a container and object model instead of working directly with HTTP
            requests and responses. These bindings are free (as in beer and as in
            speech) to download, use, and modify. They are all licensed under the MIT
            License as described in the COPYING file packaged with each binding. If you
            do make any improvements to an API, you are encouraged (but not required)
            to submit those changes back to us.</para>
        <para>The API bindings for Rackspace Cloud Files are hosted at <link
            xlink:href="http://github.com/rackspace">http://github.com/rackspace</link>.
            Feel free to coordinate your changes through github or, if you prefer, send
            your changes to cloudfiles@rackspacecloud.com. Just make sure to indicate
            which language and version you modified and send a unified diff.</para>
        <para>Each binding includes its own documentation (either HTML, PDF, or CHM).
            They also include code snippets and examples to help you get started. The
            currently supported API bindings for OpenStack Object Storage are:</para>
        <itemizedlist>
            <listitem>
                <para>PHP (requires 5.x and the modules: cURL, FileInfo,
                    mbstring)</para>
            </listitem>
            <listitem>
                <para>Python (requires 2.4 or newer)</para>
            </listitem>
            <listitem>
                <para>Java (requires JRE v1.5 or newer)</para>
            </listitem>
            <listitem>
                <para>C#/.NET (requires .NET Framework v3.5)</para>
            </listitem>
            <listitem>
                <para>Ruby (requires 1.8 or newer and the mime-tools module)</para>
            </listitem>
        </itemizedlist>
        <para>There are no other supported language-specific bindings at this time. You
            are welcome to create your own language API bindings, and we can help
            answer any questions during development, host your code if you like, and
            give you full credit for your work.</para>
        <para><guilabel>Proxy Server</guilabel></para>
        <para>The Proxy Server is responsible for tying together the rest of the
            OpenStack Object Storage architecture. For each request, it will look up
            the location of the account, container, or object in the ring (see below)
            and route the request accordingly. The public API is also exposed through
            the Proxy Server.</para>
        <para>A large number of failures are also handled in the Proxy Server. For
            example, if a server is unavailable for an object PUT, it will ask the ring
            for a handoff server and route there instead.</para>
        <para>When objects are streamed to or from an object server, they are streamed
            directly through the proxy server to or from the user; the proxy server
            does not spool them.</para>
        <para>You can use a proxy server with account management enabled by configuring
            it in the proxy server configuration file.</para>
        <para><guilabel>Object Server</guilabel></para>
        <para>The Object Server is a very simple blob storage server that can store,
            retrieve, and delete objects stored on local devices. Objects are stored as
            binary files on the filesystem with metadata stored in the file's extended
            attributes (xattrs). This requires that the underlying filesystem choice
            for object servers support xattrs on files. Some filesystems, like ext3,
            have xattrs turned off by default.</para>
        <para>Each object is stored using a path derived from the object name's hash
            and the operation's timestamp. The last write always wins and ensures that
            the latest object version will be served. A deletion is also treated as a
            version of the file (a 0-byte file ending with ".ts", which stands for
            tombstone). This ensures that deleted files are replicated correctly and
            older versions do not magically reappear due to failure scenarios.</para>
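        <para>A minimal sketch of the xattr idea on Linux; os.setxattr and os.getxattr
            are Linux-only, the file path is a placeholder, and the key name and JSON
            encoding are illustrative rather than Swift's actual on-disk format.</para>
        <programlisting language="python"># Store and read back object metadata in an extended attribute.
import json
import os

path = '/tmp/example.data'
with open(path, 'wb') as f:
    f.write(b'object body')

meta = {'Content-Type': 'text/plain', 'X-Timestamp': '1400000000.00000'}
os.setxattr(path, 'user.example.metadata', json.dumps(meta).encode())

print(json.loads(os.getxattr(path, 'user.example.metadata')))</programlisting>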
        <para><guilabel>Container Server</guilabel></para>
        <para>The Container Server's primary job is to handle listings of objects. It
            does not know where those objects are, just what objects are in a specific
            container. The listings are stored as SQLite database files and replicated
            across the cluster similarly to how objects are. Statistics are also
            tracked, including the total number of objects and the total storage usage
            for that container.</para>
        <para><guilabel>Account Server</guilabel></para>
        <para>The Account Server is very similar to the Container Server, except that
            it is responsible for listings of containers rather than objects.</para>
        <para><guilabel>Replication</guilabel></para>
        <para>Replication is designed to keep the system in a consistent state in the
            face of temporary error conditions like network outages or drive
            failures.</para>
        <para>The replication processes compare local data with each remote copy to
            ensure they all contain the latest version. Object replication uses a hash
            list to quickly compare subsections of each partition, and container and
            account replication use a combination of hashes and shared high water
            marks.</para>
        <para>Replication updates are push based. For object replication, updating is
            just a matter of rsyncing files to the peer. Account and container
            replication push missing records over HTTP or rsync whole database
            files.</para>
        <para>The replicator also ensures that data is removed from the system. When an
            item (object, container, or account) is deleted, a tombstone is set as the
            latest version of the item. The replicator will see the tombstone and
            ensure that the item is removed from the entire system.</para>
        <para>To separate the cluster-internal replication traffic from client traffic,
            separate replication servers can be used. These replication servers are
            based on the standard storage servers, but they listen on the replication
            IP and only respond to REPLICATE requests. Storage servers can serve
            REPLICATE requests, so an operator can transition to using a separate
            replication network with no cluster downtime.</para>
        <para>Replication IP and port information is stored in the ring on a per-node
            basis. These parameters will be used if they are present, but they are not
            required. If this information does not exist or is empty for a particular
            node, the node's standard IP and port will be used for replication.</para>
        <para><guilabel>Updaters</guilabel></para>
        <para>There are times when container or account data cannot be immediately
            updated. This usually occurs during failure scenarios or periods of high
            load. If an update fails, the update is queued locally on the file system,
            and the updater will process the failed updates. This is where an eventual
            consistency window will most likely come into play. For example, suppose a
            container server is under load and a new object is put into the system. The
            object will be immediately available for reads as soon as the proxy server
            responds to the client with success. However, the container server did not
            update the object listing, and so the update would be queued for a later
            time. Container listings, therefore, may not immediately contain the
            object.</para>
        <para>In practice, the consistency window is only as large as the frequency at
            which the updater runs and may not even be noticed, as the proxy server
            will route listing requests to the first container server that responds.
            The server under load may not be the one that serves subsequent listing
            requests; one of the other two replicas may handle the listing.</para>
        <para><guilabel>Auditors</guilabel></para>
        <para>Auditors crawl the local server checking the integrity of the objects,
            containers, and accounts. If corruption is found (in the case of bit rot,
            for example), the file is quarantined, and replication will replace the bad
            file from another replica. If other errors are found, they are logged (for
            example, an object's listing cannot be found on any container server that
            it should be on).</para>
    </section>
    <section xml:id="cluster-architecture">
        <title>Cluster Architecture</title>
        <para><guilabel>Access Tier</guilabel></para>
        <figure>
            <title>Swift Cluster Architecture</title>
            <mediaobject>
                <imageobject>
                    <imagedata fileref="figures/image47.png"/>
                </imageobject>
            </mediaobject>
        </figure>
        <para>Large-scale deployments segment off an "Access Tier". This tier is the
            "Grand Central" of the Object Storage system. It fields incoming API
            requests from clients and moves data in and out of the system. This tier is
            composed of front-end load balancers, SSL terminators, and authentication
            services, and it runs the (distributed) brain of the object storage system:
            the proxy server processes.</para>
        <para>Having the access servers in their own tier enables read/write access to
            be scaled out independently of storage capacity. For example, if the
            cluster is on the public Internet, requires SSL termination, and has high
            demand for data access, many access servers can be provisioned. However, if
            the cluster is on a private network and is being used primarily for
            archival purposes, fewer access servers are needed.</para>
        <para>As this is an HTTP-addressable storage service, a load balancer can be
            incorporated into the access tier.</para>
        <para>Typically, this tier comprises a collection of 1U servers. These machines
            use a moderate amount of RAM and are network I/O intensive. As these
            systems field each incoming API request, it is wise to provision them with
            two high-throughput (10GbE) interfaces. One interface is used for
            'front-end' incoming requests and the other for 'back-end' access to the
            object storage nodes to put and fetch data.</para>
        <para><guilabel>Factors to Consider</guilabel></para>
        <para>For most publicly facing deployments, as well as private deployments
            available across a wide-reaching corporate network, SSL will be used to
            encrypt traffic to the client. SSL adds significant processing load to
            establish sessions between clients, so more capacity in the access layer
            will need to be provisioned. SSL may not be required for private
            deployments on trusted networks.</para>
        <para><guilabel>Storage Nodes</guilabel></para>
        <figure>
            <title>Object Storage (Swift)</title>
            <mediaobject>
                <imageobject>
                    <imagedata fileref="figures/image48.png"/>
                </imageobject>
            </mediaobject>
        </figure>
        <para>The next component is the storage servers themselves. Generally, most
            configurations should have each of the five Zones with an equal amount of
            storage capacity. Storage nodes use a reasonable amount of memory and CPU.
            Metadata needs to be readily available to quickly return objects. The
            object stores run services not only to field incoming requests from the
            Access Tier, but also to run replicators, auditors, and reapers. Object
            stores can be provisioned with single gigabit or 10 gigabit network
            interfaces depending on expected workload and desired performance.</para>
        <para>Currently, 2TB or 3TB SATA disks deliver good price/performance value.
            Desktop-grade drives can be used where there are responsive remote hands in
            the datacenter, and enterprise-grade drives can be used where this is not
            the case.</para>
        <para><guilabel>Factors to Consider</guilabel></para>
        <para>Desired I/O performance for single-threaded requests should be kept in
            mind. This system does not use RAID, so each request for an object is
            handled by a single disk. Disk performance impacts single-threaded response
            rates.</para>
        <para>To achieve apparent higher throughput, the object storage system is
            designed with concurrent uploads/downloads in mind. The network I/O
            capacity (1GbE, bonded 1GbE pair, or 10GbE) should match your desired
            concurrent throughput needs for reads and writes.</para>
    </section>
    <section xml:id="account-reaper">
        <title>Account Reaper</title>
        <para>The Account Reaper removes data from deleted accounts in the
            background.</para>
        <para>An account is marked for deletion by a reseller issuing a DELETE request
            on the account's storage URL. This simply puts the value DELETED into the
            status column of the account_stat table in the account database (and
            replicas), indicating the data for the account should be deleted
            later.</para>
        <para>There is normally no set retention time and no undelete; it is assumed
            the reseller will implement such features and only call DELETE on the
            account once it is truly desired that the account's data be removed.
            However, in order to protect the Swift cluster accounts from an improper or
            mistaken delete request, you can set a delay_reaping value in the
            [account-reaper] section of the account-server.conf to delay the actual
            deletion of data. At this time, there is no utility to undelete an account;
            one would have to update the account database replicas directly, setting
            the status column to an empty string and updating the put_timestamp to be
            greater than the delete_timestamp. (On the TODO list is writing a utility
            to perform this task, preferably through a REST call.)</para>
        <para>The account reaper runs on each account server and scans the server
            occasionally for account databases marked for deletion. It will only
            trigger on accounts that server is the primary node for, so that multiple
            account servers are not all trying to do the same work at the same time.
            Using multiple servers to delete one account might improve deletion speed,
            but requires coordination so they are not duplicating effort. Speed really
            is not as much of a concern with data deletion, and large accounts are not
            deleted that often.</para>
        <para>The deletion process for an account itself is pretty straightforward. For
            each container in the account, each object is deleted and then the
            container is deleted. Any deletion requests that fail will not stop the
            overall process, but will cause the overall process to fail eventually (for
            example, if an object delete times out, the container will not be able to
            be deleted later and therefore the account will not be deleted either). The
            overall process continues even on a failure so that it does not get hung up
            reclaiming cluster space because of one troublesome spot. The account
            reaper will keep trying to delete an account until it eventually becomes
            empty, at which point the database reclaim process within the db_replicator
            will eventually remove the database files.</para>
        <para>Sometimes a persistent error state can prevent some object or container
            from being deleted. If this happens, you will see a message such as
            "Account &lt;name&gt; has not been reaped since &lt;date&gt;" in the log.
            You can control when this is logged with the reap_warn_after value in the
            [account-reaper] section of the account-server.conf file. By default this
            is 30 days.</para>
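        <para>A hedged example of the two options mentioned above in
            account-server.conf; the numeric values are illustrative choices (both
            options take seconds), not recommendations from this guide:</para>
        <programlisting language="ini">[account-reaper]
# Wait three days after an account DELETE before actually reaping its data.
delay_reaping = 259200
# Warn in the log if an account still has not been reaped after 30 days.
reap_warn_after = 2592000</programlisting>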
    </section>
    <section xml:id="replication">
        <title>Replication</title>
        <para>Because each replica in Swift functions independently, and clients
            generally require only a simple majority of nodes responding to consider an
            operation successful, transient failures like network partitions can
            quickly cause replicas to diverge. These differences are eventually
            reconciled by asynchronous, peer-to-peer replicator processes. The
            replicator processes traverse their local filesystems, concurrently
            performing operations in a manner that balances load across physical
            disks.</para>
        <para>Replication uses a push model, with records and files generally only
            being copied from local to remote replicas. This is important because data
            on the node may not belong there (as in the case of handoffs and ring
            changes), and a replicator cannot know what data exists elsewhere in the
            cluster that it should pull in. It is the duty of any node that contains
            data to ensure that data gets to where it belongs. Replica placement is
            handled by the ring.</para>
        <para>Every deleted record or file in the system is marked by a tombstone, so
            that deletions can be replicated alongside creations. The replication
            process cleans up tombstones after a time period known as the consistency
            window. The consistency window encompasses replication duration and the
            length of time a transient failure can remove a node from the cluster.
            Tombstone cleanup must be tied to replication to reach replica
            convergence.</para>
        <para>If a replicator detects that a remote drive has failed, the replicator
            uses the get_more_nodes interface for the ring to choose an alternate node
            with which to synchronize. The replicator can maintain desired levels of
            replication in the face of disk failures, though some replicas may not be
            in an immediately usable location. Note that the replicator does not
            maintain desired levels of replication when other failures, such as entire
            node failures, occur, because most failures are transient.</para>
        <para>Replication is an area of active development, and likely rife with
            potential improvements to speed and correctness.</para>
        <para>There are two major classes of replicator: the db replicator, which
            replicates accounts and containers, and the object replicator, which
            replicates object data.</para>
        <para><guilabel>DB Replication</guilabel></para>
        <para>The first step performed by db replication is a low-cost hash comparison
            to determine whether two replicas already match. Under normal operation,
            this check is able to verify that most databases in the system are already
            synchronized very quickly. If the hashes differ, the replicator brings the
            databases in sync by sharing records added since the last sync
            point.</para>
        <para>This sync point is a high water mark noting the last record at which two
            databases were known to be in sync, and is stored in each database as a
            tuple of the remote database id and record id. Database ids are unique
            amongst all replicas of the database, and record ids are monotonically
            increasing integers. After all new records have been pushed to the remote
            database, the entire sync table of the local database is pushed, so the
            remote database can guarantee that it is in sync with everything with which
            the local database has previously synchronized.</para>
        <para>If a replica is found to be missing entirely, the whole local database
            file is transmitted to the peer using rsync(1) and vested with a new unique
            id.</para>
        <para>In practice, DB replication can process hundreds of databases per
            concurrency setting per second (up to the number of available CPUs or
            disks) and is bound by the number of DB transactions that must be
            performed.</para>
        <para><guilabel>Object Replication</guilabel></para>
        <para>The initial implementation of object replication simply performed an
            rsync to push data from a local partition to all remote servers it was
            expected to exist on. While this performed adequately at small scale,
            replication times skyrocketed once directory structures could no longer be
            held in RAM. We now use a modification of this scheme in which a hash of
            the contents for each suffix directory is saved to a per-partition hashes
            file. The hash for a suffix directory is invalidated when the contents of
            that suffix directory are modified.</para>
        <para>The object replication process reads in these hash files, calculating any
            invalidated hashes. It then transmits the hashes to each remote server that
            should hold the partition, and only suffix directories with differing
            hashes on the remote server are rsynced. After pushing files to the remote
            server, the replication process notifies it to recalculate hashes for the
            rsynced suffix directories.</para>
        <para>Performance of object replication is generally bound by the number of
            uncached directories it has to traverse, usually as a result of invalidated
            suffix directory hashes. Using write volume and partition counts from our
            running systems, it was designed so that around 2% of the hash space on a
            normal node will be invalidated per day, which has experimentally given us
            acceptable replication speeds.</para>
    </section>
</section>