ab93e636b8
Brief Summary: Added Modules and Lab Sections of Aptira's existing OpenStack Training Docs. Please refer to the Full Summary for more details. For those who want to review this and save some time on building it, I have hosted the content at http://office.aptira.in Please talk to Sean Roberts if you are concerned about repetition of doc content or similar issues such as short URLs; this is supposed to be a rough patch, not a final one. Full Summary: Added the following modules. 1. Module001 - Introduction To OpenStack. - Brief Overview of OpenStack. - Basic Concepts. - Detailed Description of Core Projects (Grizzly) under OpenStack - All But Networking and Swift. 2. Module002 - OpenStack Networking In Detail. 3. Module003 - OpenStack Object Storage In Detail. 4. Lab001 - OpenStack Control Node and Compute Node. 5. Lab002 - OpenStack Network Node. Full Summary added due to the size of the commit. I apologize for the size of this commit and will try not to commit such large content in future patches. The reason for the size of this commit is to meet the OpenStack Training Sprint day. bp/training-manuals Change-Id: Ie3c44527992868b4d9571b66cc1c048e558ec669
<?xml version="1.0" encoding="utf-8"?>
<chapter xmlns="http://docbook.org/ns/docbook"
    xmlns:xi="http://www.w3.org/2001/XInclude"
    xmlns:xlink="http://www.w3.org/1999/xlink"
    version="5.0"
    xml:id="module003-ch009-replication">
    <title>Replication</title>
    <para>Because each replica in swift functions independently, and
        clients generally require only a simple majority of nodes
        responding to consider an operation successful, transient
        failures like network partitions can quickly cause replicas to
        diverge. These differences are eventually reconciled by
        asynchronous, peer-to-peer replicator processes. The
        replicator processes traverse their local filesystems,
        concurrently performing operations in a manner that balances
        load across physical disks.</para>
    <para>Replication uses a push model, with records and files
        generally only being copied from local to remote replicas.
        This is important because data on the node may not belong
        there (as in the case of handoffs and ring changes), and a
        replicator can’t know what data exists elsewhere in the
        cluster that it should pull in. It is the duty of any node that
        contains data to ensure that data gets to where it belongs.
        Replica placement is handled by the ring.</para>
    <para>Every deleted record or file in the system is marked by a
        tombstone, so that deletions can be replicated alongside
        creations. The replication process cleans up tombstones after
        a time period known as the consistency window. The consistency
        window encompasses replication duration and the length of time
        a transient failure can remove a node from the cluster.
        Tombstone cleanup must be tied to replication to reach replica
        convergence.</para>
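    <para>To make the idea concrete, here is a minimal Python sketch
        of a tombstone sweep, not swift's actual code: it assumes
        deletions are marked by timestamp-named .ts files and that
        the reclaim_age threshold has been chosen to exceed the
        consistency window.</para>
    <programlisting language="python">import os
import time

def reclaim_tombstones(partition_dir, reclaim_age=604800):
    """Remove tombstone (.ts) files older than reclaim_age seconds.

    Sketch only: assumes the filename encodes the deletion
    timestamp, e.g. '1364925086.12345.ts'.
    """
    now = time.time()
    for dirpath, _dirs, files in os.walk(partition_dir):
        for name in files:
            if not name.endswith('.ts'):
                continue
            timestamp = float(name[:-3])  # strip the '.ts' suffix
            if now - timestamp > reclaim_age:
                os.unlink(os.path.join(dirpath, name))</programlisting>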
    <para>If a replicator detects that a remote drive has failed, the
        replicator uses the get_more_nodes interface for the ring to
        choose an alternate node with which to synchronize. The
        replicator can maintain desired levels of replication in the
        face of disk failures, though some replicas may not be in an
        immediately usable location. Note that the replicator doesn’t
        maintain desired levels of replication when other failures,
        such as entire node failures, occur, because most failures are
        transient.</para>
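    <para>A hedged sketch of how a replicator might consult the ring
        for alternates: get_nodes and get_more_nodes are the ring's
        lookup interfaces, while drive_has_failed is a hypothetical
        stand-in for the replicator's real error tracking.</para>
    <programlisting language="python">from swift.common.ring import Ring

def drive_has_failed(node):
    """Hypothetical check; the real replicator tracks rsync and
    HTTP errors per node."""
    return False

ring = Ring('/etc/swift', ring_name='object')
partition, primaries = ring.get_nodes('AUTH_test', 'cont', 'obj')

# get_more_nodes() yields handoff nodes one at a time, giving the
# replicator somewhere to push data when a primary drive is down
handoffs = ring.get_more_nodes(partition)
targets = [next(handoffs) if drive_has_failed(node) else node
           for node in primaries]</programlisting>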
    <para>Replication is an area of active development, and likely
        rife with potential improvements to speed and
        correctness.</para>
    <para>There are two major classes of replicator: the db
        replicator, which replicates accounts and containers, and the
        object replicator, which replicates object data.</para>
    <para><guilabel>DB Replication</guilabel></para>
    <para>The first step performed by db replication is a low-cost
        hash comparison to determine whether two replicas already
        match. Under normal operation, this check quickly verifies
        that most databases in the system are already synchronized.
        If the hashes differ, the replicator brings the databases in
        sync by sharing records added since the last sync point.</para>
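    <para>A simplified sketch of that fast path, assuming a container
        database with an object table; the real db replicator keeps a
        precomputed hash in the database rather than rescanning it.</para>
    <programlisting language="python">import hashlib
import sqlite3

def db_hash(db_path):
    """Summarize a database's records as a single hash (sketch)."""
    conn = sqlite3.connect(db_path)
    md5 = hashlib.md5()
    for (name,) in conn.execute(
            'SELECT name FROM object ORDER BY name'):
        md5.update(name.encode('utf-8'))
    return md5.hexdigest()

def needs_sync(local_db_path, remote_hash):
    """Cheap fast path: matching hashes mean nothing to share."""
    return db_hash(local_db_path) != remote_hash</programlisting>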
    <para>This sync point is a high water mark noting the last
        record at which two databases were known to be in sync,
        and is stored in each database as a tuple of the remote
        database id and record id. Database ids are unique amongst
        all replicas of the database, and record ids are
        monotonically increasing integers. After all new records
        have been pushed to the remote database, the entire sync
        table of the local database is pushed, so the remote
        database can guarantee that it is in sync with everything
        with which the local database has previously
        synchronized.</para>
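    <para>The bookkeeping can be pictured as follows; this is an
        illustrative sketch (swift stores sync points in sqlite
        tables, not a dict), with records represented as dicts keyed
        by a monotonically increasing ROWID.</para>
    <programlisting language="python"># sync table: remote database id -> highest record id already
# known to be held by that remote
sync_table = {'remote-db-uuid-1': 1041, 'remote-db-uuid-2': 987}

def records_to_push(records, remote_id, sync_table):
    """Yield only records past the high water mark, so each sync
    shares just the records added since the last sync point."""
    point = sync_table.get(remote_id, -1)
    for record in records:
        if record['ROWID'] > point:
            yield record</programlisting>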
    <para>If a replica is found to be missing entirely, the whole
        local database file is transmitted to the peer using
        rsync(1) and vested with a new unique id.</para>
    <para>In practice, DB replication can process hundreds of
        databases per concurrency setting per second (up to the
        number of available CPUs or disks) and is bound by the
        number of DB transactions that must be performed.</para>
    <para><guilabel>Object Replication</guilabel></para>
    <para>The initial implementation of object replication simply
        performed an rsync to push data from a local partition to
        all remote servers it was expected to exist on. While this
        performed adequately at small scale, replication times
        skyrocketed once directory structures could no longer be
        held in RAM. We now use a modification of this scheme in
        which a hash of the contents for each suffix directory is
        saved to a per-partition hashes file. The hash for a
        suffix directory is invalidated when the contents of that
        suffix directory are modified.</para>
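    <para>A simplified picture of that scheme, assuming a per-partition
        hashes.pkl file mapping suffix directory names to content
        hashes; the hashing here is illustrative, not swift's exact
        recipe.</para>
    <programlisting language="python">import hashlib
import os
import pickle

HASHES_FILE = 'hashes.pkl'  # per-partition cache of suffix hashes

def hash_suffix_dir(suffix_path):
    """Hash the filenames under one suffix directory (sketch)."""
    md5 = hashlib.md5()
    for dirpath, _dirs, files in sorted(os.walk(suffix_path)):
        for name in sorted(files):
            md5.update(name.encode('utf-8'))
    return md5.hexdigest()

def invalidate_suffix(partition_path, suffix):
    """Mark one suffix hash stale after its contents change."""
    pkl = os.path.join(partition_path, HASHES_FILE)
    try:
        with open(pkl, 'rb') as f:
            hashes = pickle.load(f)
    except IOError:
        hashes = {}
    hashes[suffix] = None  # None means: recalculate on next pass
    with open(pkl, 'wb') as f:
        pickle.dump(hashes, f)</programlisting>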
    <para>The object replication process reads in these hash
        files, calculating any invalidated hashes. It then
        transmits the hashes to each remote server that should
        hold the partition, and only suffix directories with
        differing hashes on the remote server are rsynced. After
        pushing files to the remote server, the replication
        process notifies it to recalculate hashes for the rsynced
        suffix directories.</para>
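    <para>The comparison step then reduces to a simple diff of the two
        hash maps, as in this self-contained sketch; only the suffix
        directories it returns need to be rsynced and then rehashed on
        the remote.</para>
    <programlisting language="python">def suffixes_to_sync(local_hashes, remote_hashes):
    """Return suffixes whose content hashes differ between the
    local and remote copies of a partition."""
    return sorted(suffix for suffix, digest in local_hashes.items()
                  if remote_hashes.get(suffix) != digest)

# Example: only suffix 'a83' differs, so only that directory is
# rsynced, and the remote is told to rehash it afterwards.
local = {'a83': 'd41d8cd9', 'f12': '9e107d9d'}
remote = {'a83': '0cc175b9', 'f12': '9e107d9d'}
print(suffixes_to_sync(local, remote))  # ['a83']</programlisting>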
    <para>Performance of object replication is generally bound by
        the number of uncached directories it has to traverse,
        usually as a result of invalidated suffix directory
        hashes. Using write volume and partition counts from our
        running systems, it was designed so that around 2% of the
        hash space on a normal node will be invalidated per day,
        which has experimentally given us acceptable replication
        speeds.</para>
</chapter>