Building Blocks of Swift

The components that enable Swift to deliver high availability, high durability, and high concurrency are:

Proxy Servers: Handle all incoming API requests.

Rings: Map logical names of data to locations on particular disks.

Zones: Each Zone isolates data from other Zones. A failure in one Zone doesn’t impact the rest of the cluster because data is replicated across the Zones.

Accounts & Containers: Each Account and Container is an individual database distributed across the cluster. An Account database contains the list of Containers in that Account. A Container database contains the list of Objects in that Container.

Objects: The data itself.

Partitions: A Partition stores Objects, Account databases, and Container databases. It’s an intermediate 'bucket' that helps manage locations where data lives in the cluster.

Proxy Servers

The Proxy Servers are the public face of Swift and
handle all incoming API requests. Once a Proxy Server receives a request, it determines the storage node based on the URL of the object, such as https://swift.example.com/v1/account/container/object. The Proxy Servers also coordinate responses, handle failures, and coordinate timestamps.

Proxy servers use a shared-nothing architecture and can
be scaled as needed based on projected workloads. A
minimum of two Proxy Servers should be deployed for
redundancy. Should one proxy server fail, the others will
take over.
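The proxy works out where data lives from the object URL shown above. As a small illustration of that URL structure, the account, container, and object names can be read straight off the request path; this is a hedged sketch, not the proxy's actual parsing code.

```python
from urllib.parse import urlsplit

def split_object_url(url):
    """Split a Swift object URL into (version, account, container, object)."""
    path = urlsplit(url).path                     # "/v1/account/container/object"
    version, account, container, obj = path.lstrip("/").split("/", 3)
    return version, account, container, obj

print(split_object_url("https://swift.example.com/v1/account/container/object"))
# ('v1', 'account', 'container', 'object')
```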
The Ring

A ring represents a mapping between the names of entities stored on disk and their physical location. There are separate
rings for accounts, containers, and objects. When other
components need to perform any operation on an object,
container, or account, they need to interact with the
appropriate ring to determine its location in the
cluster.

The Ring maintains this mapping using zones, devices,
partitions, and replicas. Each partition in the ring is
replicated, by default, 3 times across the cluster, and the
locations for a partition are stored in the mapping maintained
by the ring. The ring is also responsible for determining
which devices are used for handoff in failure scenarios.

Data can be isolated with the concept of zones in the
ring. Each replica of a partition is guaranteed to reside
in a different zone. A zone could represent a drive, a
server, a cabinet, a switch, or even a data center.

The partitions of the ring are equally divided among all
the devices in the OpenStack Object Storage installation.
When partitions need to be moved around, such as when a
device is added to the cluster, the ring ensures that a
minimum number of partitions are moved at a time, and only
one replica of a partition is moved at a time.

Weights can be used to balance the distribution of
partitions on drives across the cluster. This can be
useful, for example, when different sized drives are used
in a cluster.
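For instance, a drive given twice the weight should end up holding roughly twice as many partition replicas. The toy calculation below illustrates that proportion; the drive names, weights, and partition power are made up.

```python
# Toy illustration: partition replicas are assigned roughly in proportion to weight.
drives = {"sdb1": 100.0, "sdc1": 100.0, "sdd1": 200.0}   # e.g. sdd1 is a larger drive
total_partitions = 2 ** 16 * 3          # partition count times replica count
total_weight = sum(drives.values())

for name, weight in drives.items():
    share = round(total_partitions * weight / total_weight)
    print(f"{name}: ~{share} partition replicas")
```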
The ring is used by the Proxy server and several background processes (like replication).

The rings determine where data should reside in the
cluster. There is a separate ring for account databases,
container databases, and individual objects but each ring
works in the same way. These rings are externally managed,
in that the server processes themselves do not modify the
rings; they are instead given new rings modified by other tools.

The ring uses a configurable number of bits from a
path’s MD5 hash as a partition index that designates a
device. The number of bits kept from the hash is known as
the partition power, and 2 to the partition power
indicates the partition count. Partitioning the full MD5
hash ring allows other parts of the cluster to work in
batches of items at once, which ends up being either more efficient or at least less complex than working with each item separately or the entire cluster all at once.
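As a rough sketch of the idea, the snippet below maps a /account/container/object path to a partition index by keeping only the top bits of its MD5 hash. The partition power value is made up, and the per-cluster hash prefix and suffix that a real deployment mixes into the hash are omitted.

```python
import hashlib

PART_POWER = 16                       # illustrative partition power
PART_SHIFT = 32 - PART_POWER          # bits of the 32-bit hash prefix to discard

def get_partition(account, container=None, obj=None):
    """Map a /account[/container[/object]] path to a partition index."""
    path = "/" + "/".join(p for p in (account, container, obj) if p)
    digest = hashlib.md5(path.encode("utf-8")).digest()
    # Keep only the top PART_POWER bits of the hash as the partition index.
    return int.from_bytes(digest[:4], "big") >> PART_SHIFT

partition = get_partition("AUTH_test", "photos", "cat.jpg")
print(partition)                      # an integer in range(2 ** PART_POWER)
```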
Another configurable value is the replica count, which indicates how many of the partition->device assignments comprise a single ring. For a given partition number, each replica's device will not be in the same zone as any other replica's device. Zones can be used to group devices based on physical locations, power separations, network separations, or any other attribute that would lessen the chance of multiple replicas being unavailable at the same time.
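The sketch below shows the kind of lookup structure this describes: a hand-built table with one row per replica, mapping partition index to device, over hypothetical devices and zones. A real ring builder computes and rebalances these assignments itself.

```python
# Hypothetical devices; each belongs to a zone.
DEVS = [
    {"id": 0, "zone": 1, "ip": "10.0.1.10", "device": "sdb1"},
    {"id": 1, "zone": 2, "ip": "10.0.2.10", "device": "sdb1"},
    {"id": 2, "zone": 3, "ip": "10.0.3.10", "device": "sdb1"},
    {"id": 3, "zone": 4, "ip": "10.0.4.10", "device": "sdb1"},
]

# One row per replica; each row maps a partition index to a device id.
# No partition (column) maps two of its replicas to the same zone.
REPLICA2PART2DEV = [
    [0, 1, 2, 3],   # replica 0
    [1, 2, 3, 0],   # replica 1
    [2, 3, 0, 1],   # replica 2
]

def get_nodes(partition):
    """Return the devices holding each replica of a partition."""
    return [DEVS[row[partition]] for row in REPLICA2PART2DEV]

for dev in get_nodes(2):
    print(dev["zone"], dev["ip"], dev["device"])
```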
Zones: Failure Boundaries

Swift allows zones to be configured to isolate failure boundaries. Each replica of the data resides
in a separate zone, if possible. At the smallest
level, a zone could be a single drive or a grouping of
a few drives. If there were five object storage
servers, then each server would represent its own
zone. Larger deployments would have an entire rack (or
multiple racks) of object servers, each representing a
zone. The goal of zones is to allow the cluster to
tolerate significant outages of storage servers
without losing all replicas of the data.

As we learned earlier, everything in Swift is
stored, by default, three times. Swift will place each
replica "as-uniquely-as-possible" to ensure both high
availability and high durability. This means that when
choosing a replica location, Swift will choose a server
in an unused zone before an unused server in a zone
that already has a replica of the data.
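A minimal sketch of that preference order, using made-up device metadata: candidates in zones without a replica sort ahead of candidates on servers without one. This illustrates the stated preference, not Swift's actual placement code.

```python
def pick_replica_location(candidates, placed):
    """Choose the next device for a replica, preferring unused zones,
    then unused servers, from a list of candidate device dicts."""
    used_zones = {d["zone"] for d in placed}
    used_servers = {d["ip"] for d in placed}
    # False sorts before True, so devices in unused zones come first,
    # then devices on unused servers.
    ranked = sorted(
        candidates,
        key=lambda d: (d["zone"] in used_zones, d["ip"] in used_servers),
    )
    return ranked[0]

candidates = [
    {"zone": 1, "ip": "10.0.1.10", "device": "sdb1"},
    {"zone": 1, "ip": "10.0.1.11", "device": "sdb1"},
    {"zone": 2, "ip": "10.0.2.10", "device": "sdb1"},
]
placed = [candidates[0]]                              # first replica already in zone 1
print(pick_replica_location(candidates[1:], placed))  # picks the zone 2 device
```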
When a disk fails, replica data is automatically distributed to the other zones to ensure there are three copies of the data.

Accounts & Containers

Each account and container is an individual SQLite
database that is distributed across the cluster. An
account database contains the list of containers in
that account. A container database contains the list
of objects in that container.

To keep track of object data
in the system has a database that references all its
containers, and each container database references
each object.
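As an illustration of the idea rather than Swift's actual schema, a container database boils down to a small SQLite table that lists the objects in that container:

```python
import sqlite3

# Illustrative only: a toy "container database" with an object listing.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE object (
        name TEXT PRIMARY KEY,
        created_at TEXT,
        size INTEGER,
        content_type TEXT,
        etag TEXT
    )
""")
db.execute(
    "INSERT INTO object VALUES (?, ?, ?, ?, ?)",
    ("photos/cat.jpg", "1712345678.12345", 48213, "image/jpeg", "d41d8cd9"),
)

# A container listing is conceptually just a query like this one.
for (name,) in db.execute("SELECT name FROM object ORDER BY name"):
    print(name)
```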
Partitions

A Partition is a collection of stored data, including Account databases, Container databases, and objects. Partitions are core to the replication system.

Think of a Partition as a bin moving throughout a
fulfillment center warehouse. Individual orders get
thrown into the bin. The system treats that bin as a
cohesive entity as it moves throughout the system. A
bin full of things is easier to deal with than lots of
little things. It makes for fewer moving parts
throughout the system.

The system replicators and object uploads/downloads
operate on Partitions. As the system scales up,
behavior continues to be predictable as the number of
Partitions is a fixed number.

The implementation of a Partition is conceptually
simple -- a partition is just a directory sitting on a
disk with a corresponding hash table of what it
contains.

*Swift partitions contain all data in the system.
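To make the "directory on disk" idea concrete, here is a rough sketch of how an object's location inside a partition directory is conventionally laid out; the device name, partition number, hash, and timestamp are made up, and the exact layout can differ between Swift versions.

```python
import os

drive_root = "/srv/node/sdb1"                     # mount point of one storage device
partition = "107453"                              # partition directory name
obj_hash = "f4b2a9c0de0c8e5a1b2c3d4e5f607182"     # hash of the object's path
suffix = obj_hash[-3:]                            # short suffix directory derived from the hash
timestamp = "1712345678.12345"

# objects/<partition>/<suffix>/<hash>/<timestamp>.data holds the object's bytes;
# a hashes file at the partition root summarizes the suffix directories below it.
data_file = os.path.join(drive_root, "objects", partition, suffix,
                         obj_hash, timestamp + ".data")
hashes_file = os.path.join(drive_root, "objects", partition, "hashes.pkl")
print(data_file)
print(hashes_file)
```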
Replication

In order to ensure that there are three copies of the data everywhere, replicators continuously examine
each Partition. For each local Partition, the
replicator compares it against the replicated copies
in the other Zones to see if there are any
differences.

How does the replicator know if replication needs to take place? It does this by examining hashes. A hash file is created for each Partition, containing hashes of each directory in the Partition. For a given Partition, the hash files for each of the Partition's copies are compared. If the hashes differ, it is time to replicate, and the directory that needs to be replicated is copied over.
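A minimal sketch of that comparison, assuming each copy's hash file has already been loaded as a dict mapping directory name to hash (the hash values below are placeholders):

```python
def dirs_to_sync(local_hashes, remote_hashes):
    """Return the directories whose hashes differ between a local Partition
    copy and a remote copy, and therefore need to be pushed."""
    return [
        dirname
        for dirname, digest in local_hashes.items()
        if remote_hashes.get(dirname) != digest
    ]

local = {"a83": "9c1f", "b07": "2d4e", "c55": "77aa"}
remote = {"a83": "9c1f", "b07": "ffff", "c55": "77aa"}
print(dirs_to_sync(local, remote))   # prints ['b07']: only that directory is copied over
```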
fewer "things" in the system, larger chunks of data
are transferred around (rather than lots of little TCP
connections, which is inefficient) and there are a
consistent number of hashes to compare.The cluster has eventually consistent behavior where
the newest data wins.

*If a zone goes down, one of the nodes containing a
replica notices and proactively copies data to a
handoff location.

To describe how these pieces all come together, let's walk
through a few scenarios and introduce the components.

Bird's-eye View

Upload

A client uses the REST API to make an HTTP request to PUT
an object into an existing Container. The cluster receives
the request. First, the system must figure out where the
data is going to go. To do this, the Account name,
Container name and Object name are all used to determine
the Partition where this object should live.

Then a lookup in the Ring figures out which storage
nodes contain the Partitions in question.

The data is then sent to each storage node where it is
placed in the appropriate Partition. A quorum is required
-- at least two of the three writes must be successful
before the client is notified that the upload was
successful.

Next, the Container database is updated asynchronously
to reflect that there is a new object in it.
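Putting those steps together, here is a hedged sketch of the write path, with stand-in helpers in place of the real ring lookup and storage-node requests; the function names and values are illustrative only.

```python
# Illustrative only: the shape of the proxy's write path.

def get_partition(account, container, obj):
    return 42                                        # pretend ring hash result

def get_nodes(partition):
    return ["10.0.1.10", "10.0.2.10", "10.0.3.10"]   # one node per replica

def send_to_node(node, partition, data):
    return True                                      # pretend the PUT succeeded

def put_object(account, container, obj, data, quorum=2):
    partition = get_partition(account, container, obj)
    nodes = get_nodes(partition)
    successes = sum(send_to_node(n, partition, data) for n in nodes)
    # The client only hears "created" once a quorum of writes succeeded;
    # the container listing is updated asynchronously afterwards.
    return successes >= quorum

print(put_object("AUTH_test", "photos", "cat.jpg", b"object bytes"))
```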
Download

A request comes in for an Account/Container/object. Using the same consistent hashing, the Partition name is generated. A lookup in the Ring reveals which storage nodes contain that Partition. A request is made to one of the storage nodes to fetch the object and, if that fails, requests are made to the other nodes.
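And a matching sketch of the read path, again with stand-in helpers: the proxy tries one replica at a time and falls back to the others on failure.

```python
# Illustrative only: the shape of the proxy's read path.

def get_partition(account, container, obj):
    return 42                                        # pretend ring hash result

def get_nodes(partition):
    return ["10.0.1.10", "10.0.2.10", "10.0.3.10"]   # one node per replica

def fetch_from_node(node, partition, path):
    if node == "10.0.1.10":
        raise ConnectionError("node down")           # simulate one failed replica
    return b"object bytes"

def get_object(account, container, obj):
    partition = get_partition(account, container, obj)
    path = f"/{account}/{container}/{obj}"
    for node in get_nodes(partition):                # try replicas in turn
        try:
            return fetch_from_node(node, partition, path)
        except ConnectionError:
            continue
    raise RuntimeError("no replica could serve the object")

print(get_object("AUTH_test", "photos", "cat.jpg"))
```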