Building Blocks of Swift The components that enable Swift to deliver high availability, high durability and high concurrency are: Proxy Servers:Handles all incoming API requests. Rings:Maps logical names of data to locations on particular disks. Zones:Each Zone isolates data from other Zones. A failure in one Zone doesn’t impact the rest of the cluster because data is replicated across the Zones. Accounts & Containers:Each Account and Container are individual databases that are distributed across the cluster. An Account database contains the list of Containers in that Account. A Container database contains the list of Objects in that Container Objects:The data itself. Partitions:A Partition stores Objects, Account databases and Container databases. It’s an intermediate 'bucket' that helps manage locations where data lives in the cluster.
Building Blocks
Proxy Servers The Proxy Servers are the public face of Swift and handle all incoming API requests. Once a Proxy Server receive a request, it will determine the storage node based on the URL of the object, such as https://swift.example.com/v1/account/container/object . The Proxy Servers also coordinates responses, handles failures and coordinates timestamps. Proxy servers use a shared-nothing architecture and can be scaled as needed based on projected workloads. A minimum of two Proxy Servers should be deployed for redundancy. Should one proxy server fail, the others will take over. The Ring A ring represents a mapping between the names of entities stored on disk and their physical location. There are separate rings for accounts, containers, and objects. When other components need to perform any operation on an object, container, or account, they need to interact with the appropriate ring to determine its location in the cluster. The Ring maintains this mapping using zones, devices, partitions, and replicas. Each partition in the ring is replicated, by default, 3 times across the cluster, and the locations for a partition are stored in the mapping maintained by the ring. The ring is also responsible for determining which devices are used for hand off in failure scenarios. Data can be isolated with the concept of zones in the ring. Each replica of a partition is guaranteed to reside in a different zone. A zone could represent a drive, a server, a cabinet, a switch, or even a data center. The partitions of the ring are equally divided among all the devices in the OpenStack Object Storage installation. When partitions need to be moved around, such as when a device is added to the cluster, the ring ensures that a minimum number of partitions are moved at a time, and only one replica of a partition is moved at a time. Weights can be used to balance the distribution of partitions on drives across the cluster. This can be useful, for example, when different sized drives are used in a cluster. The ring is used by the Proxy server and several background processes (like replication). The Ring maps Partitions to physical locations on disk. When other components need to perform any operation on an object, container, or account, they need to interact with the Ring to determine its location in the cluster. The Ring maintains this mapping using zones, devices, partitions, and replicas. Each partition in the Ring is replicated three times by default across the cluster, and the locations for a partition are stored in the mapping maintained by the Ring. The Ring is also responsible for determining which devices are used for handoff should a failure occur.
The Lord of the <emphasis role="bold" >Ring</emphasis>s
The Ring maps partitions to physical locations on disk. The rings determine where data should reside in the cluster. There is a separate ring for account databases, container databases, and individual objects but each ring works in the same way. These rings are externally managed, in that the server processes themselves do not modify the rings, they are instead given new rings modified by other tools. The ring uses a configurable number of bits from a path’s MD5 hash as a partition index that designates a device. The number of bits kept from the hash is known as the partition power, and 2 to the partition power indicates the partition count. Partitioning the full MD5 hash ring allows other parts of the cluster to work in batches of items at once which ends up either more efficient or at least less complex than working with each item separately or the entire cluster all at once. Another configurable value is the replica count, which indicates how many of the partition->device assignments comprise a single ring. For a given partition number, each replica’s device will not be in the same zone as any other replica's device. Zones can be used to group devices based on physical locations, power separations, network separations, or any other attribute that would lessen multiple replicas being unavailable at the same time. Zones: Failure Boundaries Swift allows zones to be configured to isolate failure boundaries. Each replica of the data resides in a separate zone, if possible. At the smallest level, a zone could be a single drive or a grouping of a few drives. If there were five object storage servers, then each server would represent its own zone. Larger deployments would have an entire rack (or multiple racks) of object servers, each representing a zone. The goal of zones is to allow the cluster to tolerate significant outages of storage servers without losing all replicas of the data. As we learned earlier, everything in Swift is stored, by default, three times. Swift will place each replica "as-uniquely-as-possible" to ensure both high availability and high durability. This means that when choosing a replica location, Swift will choose a server in an unused zone before an unused server in a zone that already has a replica of the data.
image33.png
When a disk fails, replica data is automatically distributed to the other zones to ensure there are three copies of the data Accounts & Containers Each account and container is an individual SQLite database that is distributed across the cluster. An account database contains the list of containers in that account. A container database contains the list of objects in that container.
Accounts and Containers
To keep track of object data location, each account in the system has a database that references all its containers, and each container database references each object Partitions A Partition is a collection of stored data, including Account databases, Container databases, and objects. Partitions are core to the replication system. Think of a Partition as a bin moving throughout a fulfillment center warehouse. Individual orders get thrown into the bin. The system treats that bin as a cohesive entity as it moves throughout the system. A bin full of things is easier to deal with than lots of little things. It makes for fewer moving parts throughout the system. The system replicators and object uploads/downloads operate on Partitions. As the system scales up, behavior continues to be predictable as the number of Partitions is a fixed number. The implementation of a Partition is conceptually simple -- a partition is just a directory sitting on a disk with a corresponding hash table of what it contains.
Partitions
*Swift partitions contain all data in the system. Replication In order to ensure that there are three copies of the data everywhere, replicators continuously examine each Partition. For each local Partition, the replicator compares it against the replicated copies in the other Zones to see if there are any differences. How does the replicator know if replication needs to take place? It does this by examining hashes. A hash file is created for each Partition, which contains hashes of each directory in the Partition. Each of the three hash files is compared. For a given Partition, the hash files for each of the Partition's copies are compared. If the hashes are different, then it is time to replicate and the directory that needs to be replicated is copied over. This is where the Partitions come in handy. With fewer "things" in the system, larger chunks of data are transferred around (rather than lots of little TCP connections, which is inefficient) and there are a consistent number of hashes to compare. The cluster has eventually consistent behavior where the newest data wins.
Replication
*If a zone goes down, one of the nodes containing a replica notices and proactively copies data to a handoff location. To describe how these pieces all come together, let's walk through a few scenarios and introduce the components. Bird-eye View Upload A client uses the REST API to make a HTTP request to PUT an object into an existing Container. The cluster receives the request. First, the system must figure out where the data is going to go. To do this, the Account name, Container name and Object name are all used to determine the Partition where this object should live. Then a lookup in the Ring figures out which storage nodes contain the Partitions in question. The data then is sent to each storage node where it is placed in the appropriate Partition. A quorum is required -- at least two of the three writes must be successful before the client is notified that the upload was successful. Next, the Container database is updated asynchronously to reflect that there is a new object in it.
When End-User uses Swift
Download A request comes in for an Account/Container/object. Using the same consistent hashing, the Partition name is generated. A lookup in the Ring reveals which storage nodes contain that Partition. A request is made to one of the storage nodes to fetch the object and if that fails, requests are made to the other nodes.