b2235bf3fb
Excluded all XML files in the directory doc/common/tables because they are autogenerated. The XML root element of DocBook XML files should match the following format: <ELEMENT xmlns="http://docbook.org/ns/docbook" xmlns:xi="http://www.w3.org/2001/XInclude" xmlns:xlink="http://www.w3.org/1999/xlink" version="5.0" xml:id="THE_XML_ID_OF_THE_ELEMENT"> Change-Id: If12091be81ec8b2e6e53bfcb4c3a883a65e24736
236 lines
14 KiB
XML
<?xml version="1.0" encoding="UTF-8"?>
<section xmlns="http://docbook.org/ns/docbook"
    xmlns:xi="http://www.w3.org/2001/XInclude"
    xmlns:xlink="http://www.w3.org/1999/xlink"
    version="5.0"
    xml:id="section_objectstorage-components">
    <title>Components</title>
    <para>The components that enable Object Storage to deliver high availability, high
        durability, and high concurrency are:</para>
    <itemizedlist>
        <listitem>
            <para><emphasis role="bold">Proxy servers.</emphasis> Handle all of the incoming
                API requests.</para>
        </listitem>
        <listitem>
            <para><emphasis role="bold">Rings.</emphasis> Map logical names of data to
                locations on particular disks.</para>
        </listitem>
        <listitem>
            <para><emphasis role="bold">Zones.</emphasis> Isolate data from other zones. A
                failure in one zone doesn’t impact the rest of the cluster because data is
                replicated across zones.</para>
        </listitem>
        <listitem>
            <para><emphasis role="bold">Accounts and containers.</emphasis> Each account and
                container is an individual database that is distributed across the cluster. An
                account database contains the list of containers in that account. A container
                database contains the list of objects in that container.</para>
        </listitem>
        <listitem>
            <para><emphasis role="bold">Objects.</emphasis> The data itself.</para>
        </listitem>
        <listitem>
            <para><emphasis role="bold">Partitions.</emphasis> A partition stores objects,
                account databases, and container databases and helps manage locations where data
                lives in the cluster.</para>
        </listitem>
    </itemizedlist>
    <figure>
        <title>Object Storage building blocks</title>
        <mediaobject>
            <imageobject>
                <imagedata fileref="../common/figures/objectstorage-buildingblocks.png"/>
            </imageobject>
        </mediaobject>
    </figure>
    <section xml:id="section_proxy-servers">
        <title>Proxy servers</title>
        <para>Proxy servers are the public face of Object Storage and handle all of the incoming API
            requests. Once a proxy server receives a request, it determines the storage node based
            on the object's URL, for example, https://swift.example.com/v1/account/container/object.
            Proxy servers also coordinate responses, handle failures, and coordinate
            timestamps.</para>
        <para>Proxy servers use a shared-nothing architecture and can be scaled as needed based on
            projected workloads. A minimum of two proxy servers should be deployed for redundancy.
            If one proxy server fails, the others take over.</para>
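        <para>For illustration, the following is a minimal client-side sketch of an object PUT
            request through the proxy, written in Python with the requests library. The storage
            URL, token, and object names are placeholders; a real client obtains them from the
            authentication service.</para>
        <programlisting language="python"><![CDATA[
# Minimal sketch of an object upload through the proxy server.
# The storage URL and token are placeholders; a real client obtains
# them from the authentication service.
import requests

storage_url = "https://swift.example.com/v1/account"   # placeholder
token = "AUTH_tk_example"                               # placeholder

with open("photo.jpg", "rb") as f:
    resp = requests.put(
        storage_url + "/container/object",
        data=f,
        headers={"X-Auth-Token": token},
    )

# 201 Created indicates the object was stored successfully.
print(resp.status_code)
]]></programlisting>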
    </section>
    <section xml:id="section_ring">
        <title>Rings</title>
        <para>A ring represents a mapping between the names of entities stored on disk and their
            physical locations. There are separate rings for accounts, containers, and objects. When
            other components need to perform any operation on an object, container, or account, they
            need to interact with the appropriate ring to determine its location in the
            cluster.</para>
        <para>The ring maintains this mapping using zones, devices, partitions, and replicas. Each
            partition in the ring is replicated, by default, three times across the cluster, and
            partition locations are stored in the mapping maintained by the ring. The ring is also
            responsible for determining which devices are used for handoff in failure
            scenarios.</para>
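        <para>Conceptually, the mapping that the ring maintains can be pictured as a table from
            partition number to the devices that hold that partition's replicas, with no two
            replicas sharing a zone. The following Python sketch only illustrates that idea; the
            real ring is a compact structure built and rebalanced by the ring-builder tool, not a
            plain dictionary.</para>
        <programlisting language="python"><![CDATA[
# Toy illustration of the mapping a ring maintains: each partition is
# assigned to several devices, and no two of them share a zone.  The
# real ring is a compact binary structure managed by the ring builder.
REPLICA_COUNT = 3

ring = {
    0: [("zone1", "sdb1"), ("zone2", "sdb1"), ("zone3", "sdc1")],
    1: [("zone2", "sdc1"), ("zone3", "sdb1"), ("zone1", "sdc1")],
    # ... one entry per partition
}

def devices_for_partition(partition):
    """Return the (zone, device) pairs holding a partition's replicas."""
    return ring[partition]

print(devices_for_partition(0))
]]></programlisting>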
        <para>Data can be isolated into zones in the ring. Each partition replica is guaranteed to
            reside in a different zone. A zone could represent a drive, a server, a cabinet, a
            switch, or even a data center.</para>
        <para>The partitions of the ring are equally divided among all of the devices in the Object
            Storage installation. When partitions need to be moved around (for example, if a device
            is added to the cluster), the ring ensures that a minimum number of partitions are moved
            at a time, and only one replica of a partition is moved at a time.</para>
        <para>You can use weights to balance the distribution of partitions on drives across the
            cluster. This can be useful, for example, when differently sized drives are used in a
            cluster.</para>
        <para>The ring is used by the proxy server and several background processes (like
            replication).</para>
        <figure>
            <title>The <emphasis role="bold">ring</emphasis></title>
            <mediaobject>
                <imageobject>
                    <imagedata fileref="../common/figures/objectstorage-ring.png"/>
                </imageobject>
            </mediaobject>
        </figure>
        <para>These rings are externally managed, in that the server processes themselves do not
            modify the rings; instead, they are given new rings modified by other tools.</para>
        <para>The ring uses a configurable number of bits from an
            MD5 hash for a path as a partition index that designates a
            device. The number of bits kept from the hash is known as
            the partition power, and 2 to the partition power
            indicates the partition count. Partitioning the full MD5
            hash ring allows other parts of the cluster to work in
            batches of items at once, which is more efficient, or at
            least less complex, than working with each item separately
            or with the entire cluster all at once.</para>
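        <para>The following Python sketch shows roughly how a path maps to a partition under this
            scheme. It is illustrative only; among other details, the real implementation also
            mixes a per-cluster hash prefix and suffix into the path before hashing.</para>
        <programlisting language="python"><![CDATA[
# Rough sketch of mapping a path to a partition with a partition power.
# The real implementation also mixes a per-cluster hash prefix and
# suffix into the path before hashing, which is omitted here.
import hashlib
import struct

PART_POWER = 18                      # example value
PART_SHIFT = 32 - PART_POWER         # bits of the hash to discard
PARTITION_COUNT = 2 ** PART_POWER    # 262144 partitions

def get_partition(account, container=None, obj=None):
    path = "/" + "/".join(p for p in (account, container, obj) if p)
    digest = hashlib.md5(path.encode("utf-8")).digest()
    # Keep only the top PART_POWER bits of the hash as the partition index.
    return struct.unpack_from(">I", digest)[0] >> PART_SHIFT

print(get_partition("account", "container", "object"))
]]></programlisting>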
        <para>Another configurable value is the replica count, which indicates how many of the
            partition-device assignments make up a single ring. For a given partition number, each
            replica’s device will not be in the same zone as any other replica's device. Zones can
            be used to group devices based on physical locations, power separations, network
            separations, or any other attribute that would improve the availability of multiple
            replicas at the same time.</para>
    </section>
    <section xml:id="section_zones">
        <title>Zones</title>
        <para>Object Storage allows configuring zones in order to isolate failure boundaries.
            Each data replica resides in a separate zone, if possible. At the smallest level, a zone
            could be a single drive or a grouping of a few drives. If there were five object storage
            servers, then each server would represent its own zone. Larger deployments would have an
            entire rack (or multiple racks) of object servers, each representing a zone. The goal of
            zones is to allow the cluster to tolerate significant outages of storage servers without
            losing all replicas of the data.</para>
        <para>As mentioned earlier, everything in Object Storage is stored, by default, three
            times. Swift will place each replica "as-uniquely-as-possible" to ensure both high
            availability and high durability. This means that when choosing a replica location,
            Object Storage chooses a server in an unused zone before an unused server in a zone that
            already has a replica of the data.</para>
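        <para>A minimal sketch of that placement preference is shown below. The function and its
            arguments are hypothetical, and the real ring builder also accounts for device weights
            and regions when placing replicas.</para>
        <programlisting language="python"><![CDATA[
# Illustrative sketch of "as unique as possible" placement: prefer a
# device in a zone that does not yet hold a replica, and only then fall
# back to an unused device in an already-used zone.  The names here are
# hypothetical; the real ring builder also considers weights and regions.
def choose_device(candidates, used_zones, used_devices):
    """candidates: iterable of (zone, device) pairs."""
    for zone, device in candidates:
        if zone not in used_zones:
            return zone, device
    for zone, device in candidates:
        if device not in used_devices:
            return zone, device
    raise RuntimeError("no suitable device available")
]]></programlisting>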
        <figure>
            <title>Zones</title>
            <mediaobject>
                <imageobject>
                    <imagedata fileref="../common/figures/objectstorage-zones.png"/>
                </imageobject>
            </mediaobject>
        </figure>
        <para>When a disk fails, replica data is automatically distributed to the other zones to
            ensure there are three copies of the data.</para>
    </section>
    <section xml:id="section_accounts-containers">
        <title>Accounts and containers</title>
        <para>Each account and container is an individual SQLite
            database that is distributed across the cluster. An
            account database contains the list of containers in
            that account. A container database contains the list
            of objects in that container.</para>
        <figure>
            <title>Accounts and containers</title>
            <mediaobject>
                <imageobject>
                    <imagedata fileref="../common/figures/objectstorage-accountscontainers.png"/>
                </imageobject>
            </mediaobject>
        </figure>
        <para>To keep track of object data locations, each account in the system has a database
            that references all of its containers, and each container database references each
            object.</para>
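        <para>The sketch below illustrates this idea with Python's sqlite3 module. The schemas are
            deliberately reduced to a single column; the databases Swift actually maintains carry
            additional columns such as timestamps, object counts, and bytes used.</para>
        <programlisting language="python"><![CDATA[
# Reduced illustration of the two database types: an account database
# lists containers, and a container database lists objects.  Swift's
# real schemas carry additional columns (timestamps, counts, sizes).
import sqlite3

account_db = sqlite3.connect(":memory:")
account_db.execute("CREATE TABLE container (name TEXT PRIMARY KEY)")
account_db.execute("INSERT INTO container VALUES ('photos')")

container_db = sqlite3.connect(":memory:")
container_db.execute("CREATE TABLE object (name TEXT PRIMARY KEY)")
container_db.execute("INSERT INTO object VALUES ('photo.jpg')")

print(account_db.execute("SELECT name FROM container").fetchall())
print(container_db.execute("SELECT name FROM object").fetchall())
]]></programlisting>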
    </section>
    <section xml:id="section_partitions">
        <title>Partitions</title>
        <para>A partition is a collection of stored data, including account databases, container
            databases, and objects. Partitions are core to the replication system.</para>
        <para>Think of a partition as a bin moving throughout a fulfillment center warehouse.
            Individual orders get thrown into the bin. The system treats that bin as a cohesive
            entity as it moves throughout the system. A bin is easier to deal with than many little
            things. It makes for fewer moving parts throughout the system.</para>
        <para>System replicators and object uploads/downloads operate on partitions. As the
            system scales up, its behavior continues to be predictable because the number of
            partitions is fixed.</para>
        <para>Implementing a partition is conceptually simple: a partition is just a
            directory sitting on a disk with a corresponding hash table of what it contains.</para>
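        <para>As a hypothetical sketch, the location of a partition directory on a storage node can
            be pictured as follows; the mount point and path layout shown here are assumptions for
            illustration and may differ between deployments.</para>
        <programlisting language="python"><![CDATA[
# Hypothetical sketch of where a partition directory lives on a storage
# node.  The mount point and layout are assumptions for illustration.
import os

DEVICES_ROOT = "/srv/node"           # assumed mount point for drives

def object_partition_dir(device, partition):
    # A partition is simply a directory on the device; the hashes the
    # replicator compares are kept alongside its contents.
    return os.path.join(DEVICES_ROOT, device, "objects", str(partition))

print(object_partition_dir("sdb1", 1024))
# -> /srv/node/sdb1/objects/1024
]]></programlisting>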
        <figure>
            <title>Partitions</title>
            <mediaobject>
                <imageobject>
                    <imagedata fileref="../common/figures/objectstorage-partitions.png"/>
                </imageobject>
            </mediaobject>
        </figure>
    </section>
    <section xml:id="section_replicators">
        <title>Replicators</title>
        <para>In order to ensure that there are three copies of the data everywhere, replicators
            continuously examine each partition. For each local partition, the replicator compares
            it against the replicated copies in the other zones to see if there are any
            differences.</para>
        <para>The replicator knows if replication needs to take place by examining hashes. A hash
            file is created for each partition, which contains hashes of each directory in the
            partition. For a given partition, the hash files for each of the partition's copies are
            compared. If the hashes are different, then it is time to replicate, and the directory
            that needs to be replicated is copied over.</para>
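        <para>The comparison itself can be sketched as follows. The hash values and the helper are
            hypothetical, and the actual transfer of the differing directories (for example, with
            rsync) is not shown.</para>
        <programlisting language="python"><![CDATA[
# Minimal sketch of the per-partition hash comparison: compare the local
# hashes with those reported for a remote copy and return the
# sub-directories whose contents differ and therefore need replication.
def directories_to_sync(local_hashes, remote_hashes):
    """Both arguments map a sub-directory name to its content hash."""
    return [
        subdir
        for subdir, digest in local_hashes.items()
        if remote_hashes.get(subdir) != digest
    ]

local = {"a83": "1f3e", "b07": "9c42"}     # hypothetical hash values
remote = {"a83": "1f3e", "b07": "0000"}
print(directories_to_sync(local, remote))  # ['b07'] needs replication
]]></programlisting>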
        <para>This is where partitions come in handy. With fewer things in the system, larger
            chunks of data are transferred around (rather than over many small TCP connections,
            which would be inefficient) and there is a consistent number of hashes to compare.</para>
        <para>The cluster eventually reaches a consistent state in which the newest data takes
            priority.</para>
        <figure>
            <title>Replication</title>
            <mediaobject>
                <imageobject>
                    <imagedata fileref="../common/figures/objectstorage-replication.png"/>
                </imageobject>
            </mediaobject>
        </figure>
        <para>If a zone goes down, one of the nodes containing a replica notices and proactively
            copies data to a handoff location.</para>
    </section>
    <section xml:id="section_usecases">
        <title>Use cases</title>
        <para>The following sections show use cases for object uploads and downloads and illustrate
            how the components interact.</para>
        <section xml:id="upload">
            <title>Upload</title>
            <para>A client uses the REST API to make an HTTP request to PUT an object into an existing
                container. The cluster receives the request. First, the system must figure out where
                the data is going to go. To do this, the account name, container name, and object
                name are all used to determine the partition where this object should live.</para>
            <para>Then a lookup in the ring figures out which storage nodes contain the partitions in
                question.</para>
            <para>The data is then sent to each storage node where it is placed in the appropriate
                partition. At least two of the three writes must be successful before the client is
                notified that the upload was successful.</para>
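            <para>The write path described above can be summarized in the following sketch. The
                helper functions passed in are placeholders for the hashing, ring lookup, and
                storage-node requests covered earlier in this section.</para>
            <programlisting language="python"><![CDATA[
# Conceptual sketch of the write path: hash the path to a partition,
# look up that partition's storage nodes in the ring, send the data to
# each node, and require a quorum of successful writes.  The helper
# functions passed in are placeholders.
QUORUM = 2  # at least two of the three writes must succeed

def upload_object(path, data, get_partition, ring_lookup, write_to_node):
    partition = get_partition(path)
    nodes = ring_lookup(partition)            # the three storage nodes
    ok = sum(1 for node in nodes if write_to_node(node, partition, path, data))
    return "201 Created" if ok >= QUORUM else "503 Service Unavailable"
]]></programlisting>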
            <para>Next, the container database is updated asynchronously to reflect that there is a new
                object in it.</para>
            <figure>
                <title>Object Storage in use</title>
                <mediaobject>
                    <imageobject>
                        <imagedata fileref="../common/figures/objectstorage-usecase.png"/>
                    </imageobject>
                </mediaobject>
            </figure>
        </section>
        <section xml:id="section_swift-component-download">
            <title>Download</title>
            <para>A request comes in for an account/container/object. Using the same consistent hashing,
                the partition name is generated. A lookup in the ring reveals which storage nodes
                contain that partition. A request is made to one of the storage nodes to fetch the
                object and, if that fails, requests are made to the other nodes.</para>
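            <para>A corresponding sketch of the read path is shown below; as before, the helper
                functions are placeholders for the hashing, ring lookup, and storage-node
                request.</para>
            <programlisting language="python"><![CDATA[
# Conceptual sketch of the read path: the same hashing and ring lookup
# locate the partition's storage nodes, and the object is fetched from
# the first node that answers.  The helper functions are placeholders.
def download_object(path, get_partition, ring_lookup, fetch_from_node):
    partition = get_partition(path)
    for node in ring_lookup(partition):
        data = fetch_from_node(node, partition, path)
        if data is not None:
            return data
    raise IOError("no storage node could serve " + path)
]]></programlisting>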
        </section>
    </section>
</section>