<?xml version="1.0" encoding="utf-8"?>
<section xmlns="http://docbook.org/ns/docbook"
    xmlns:xi="http://www.w3.org/2001/XInclude"
    xmlns:xlink="http://www.w3.org/1999/xlink" version="5.0"
    xml:id="section_objectstorage-replication">
    <!-- ... Old module003-ch009-replication edited, renamed, and stored in doc/common for use by both Cloud Admin and Operator Training Guides... -->
    <title>Replication</title>
    <para>Because each replica in Object Storage functions
        independently and clients generally require only a simple
        majority of nodes to respond to consider an operation
        successful, transient failures like network partitions can
        quickly cause replicas to diverge. These differences are
        eventually reconciled by asynchronous, peer-to-peer replicator
        processes. The replicator processes traverse their local file
        systems and concurrently perform operations in a manner that
        balances load across physical disks.</para>
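    <para>For illustration, the majority rule can be sketched in
        Python as follows. The <literal>quorum_size</literal> helper
        below is a simplified stand-in for whatever Object Storage
        uses internally, not its actual implementation:</para>
    <programlisting language="python"># Simplified sketch: a request succeeds once a simple majority
# of replicas respond; the remaining replicas can lag until the
# replicators reconcile them.
def quorum_size(replica_count):
    """Smallest number of nodes that forms a simple majority."""
    return replica_count // 2 + 1

assert quorum_size(3) == 2   # 2 of 3 responses suffice
assert quorum_size(5) == 3   # 3 of 5 responses suffice</programlisting>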
    <para>Replication uses a push model, with records and files
        generally only being copied from local to remote replicas.
        This is important because data on the node might not belong
        there (as in the case of handoffs and ring changes), and a
        replicator cannot know which data it should pull in from
        elsewhere in the cluster. Any node that contains data must
        ensure that data gets to where it belongs. The ring handles
        replica placement.</para>
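    <para>A minimal sketch of the push model, assuming a ring object
        with the standard <literal>get_part_nodes</literal> call and a
        hypothetical <literal>push_to_node</literal> helper, might
        look like this:</para>
    <programlisting language="python"># Sketch only: each node pushes its local partitions outward to
# the other replicas named by the ring; nothing is ever pulled.
def push_partition(ring, partition, local_node, local_data):
    for node in ring.get_part_nodes(partition):
        if node['id'] == local_node['id']:
            continue  # skip ourselves; we only copy outward
        push_to_node(node, partition, local_data)  # e.g. via rsync</programlisting>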
    <para>To replicate deletions in addition to creations, every
        deleted record or file in the system is marked by a tombstone.
        The replication process cleans up tombstones after a time
        period known as the <emphasis role="italic">consistency
        window</emphasis>. This window encompasses the duration of
        replication and the length of time a transient failure can
        remove a node from the cluster. Tombstone cleanup must be tied
        to replication to reach replica convergence.</para>
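    <para>A simplified sketch of tombstone cleanup, assuming
        tombstones are files with a <literal>.ts</literal> suffix and
        that the consistency window is given in seconds (the real
        on-disk layout encodes timestamps differently):</para>
    <programlisting language="python">import os
import time

def reclaim_tombstones(suffix_dir, consistency_window):
    """Remove tombstones older than the consistency window."""
    now = time.time()
    for name in os.listdir(suffix_dir):
        if not name.endswith('.ts'):
            continue
        path = os.path.join(suffix_dir, name)
        if now - os.path.getmtime(path) > consistency_window:
            os.unlink(path)  # the deletion has fully replicated</programlisting>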
    <para>If a replicator detects that a remote drive has failed, the
        replicator uses the <literal>get_more_nodes</literal>
        interface of the ring to choose an alternate node with which
        to synchronize. The replicator can maintain desired levels of
        replication during disk failures, though some replicas might
        not be in an immediately usable location.</para>
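    <para>The failover path can be sketched as follows, using the
        ring's <literal>get_more_nodes</literal> generator of handoff
        nodes; the <literal>is_usable</literal> check is a
        hypothetical placeholder:</para>
    <programlisting language="python">def pick_sync_targets(ring, partition, is_usable):
    """Yield primary nodes, substituting handoffs for failed drives."""
    handoffs = ring.get_more_nodes(partition)
    for node in ring.get_part_nodes(partition):
        if is_usable(node):
            yield node
        else:
            # The drive failed: synchronize with an alternate
            # node instead, keeping the replica count intact.
            yield next(handoffs)</programlisting>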
    <note>
        <para>The replicator does not maintain desired levels of
            replication when failures such as entire node failures
            occur, because most failures are transient.</para>
    </note>
    <para>The main replication types are:</para>
    <itemizedlist>
        <listitem>
            <para><emphasis role="bold">Database
                replication</emphasis>. Replicates account and
                container databases.</para>
        </listitem>
        <listitem>
            <para><emphasis role="bold">Object replication</emphasis>.
                Replicates object data.</para>
        </listitem>
    </itemizedlist>
    <section xml:id="section_database-replication">
        <title>Database replication</title>
        <para>Database replication completes a low-cost hash
            comparison to determine whether two replicas already
            match. Normally, this check can quickly verify that most
            databases in the system are already synchronized. If the
            hashes differ, the replicator synchronizes the databases
            by sharing records added since the last synchronization
            point.</para>
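        <para>Conceptually, the check looks like the following
            sketch, in which <literal>db_hash</literal>,
            <literal>records_since</literal>, and the methods on the
            database objects are hypothetical helpers:</para>
        <programlisting language="python">def sync_databases(local_db, remote_db):
    # Cheap test first: matching hashes mean there is nothing
    # to do for this pair of replicas.
    if db_hash(local_db) == db_hash(remote_db):
        return
    # Otherwise push only the records added since the last
    # known synchronization point for this peer.
    point = local_db.get_sync_point(remote_db.id)
    remote_db.merge_items(records_since(local_db, point))</programlisting>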
        <para>This synchronization point is a high water mark that
            notes the last record at which two databases were known to
            be synchronized, and is stored in each database as a tuple
            of the remote database ID and record ID. Database IDs are
            unique across all replicas of the database, and record IDs
            are monotonically increasing integers. After all new
            records are pushed to the remote database, the entire
            synchronization table of the local database is pushed, so
            the remote database can guarantee that it is synchronized
            with everything with which the local database was
            previously synchronized.</para>
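        <para>The synchronization table can be pictured as a mapping
            from remote database ID to the highest synchronized
            record ID, as in this illustrative sketch (the IDs shown
            are made up):</para>
        <programlisting language="python"># Each database remembers, per peer, the highest record ID
# known to have been synchronized with that peer.
sync_table = {
    'db-uuid-remote-1': 1041,
    'db-uuid-remote-2': 987,
}

def update_sync_point(sync_table, remote_id, record_id):
    # Record IDs increase monotonically, so keep the maximum.
    current = sync_table.get(remote_id, -1)
    sync_table[remote_id] = max(current, record_id)</programlisting>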
        <para>If a replica is missing, the whole local database file
            is transmitted to the peer by using rsync(1) and is
            assigned a new unique ID.</para>
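        <para>A minimal sketch of this whole-database fallback, using
            rsync(1) directly (the remote side is assumed to assign
            the new ID when it adopts the file):</para>
        <programlisting language="python">import subprocess

def push_whole_db(local_path, remote_host, remote_path):
    # Missing replica: ship the entire database file; the copy
    # then receives its own unique database ID on the peer.
    subprocess.check_call(
        ['rsync', local_path, '%s:%s' % (remote_host, remote_path)])</programlisting>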
        <para>In practice, database replication can process hundreds
            of databases per concurrency setting per second (up to the
            number of available CPUs or disks) and is bound by the
            number of database transactions that must be
            performed.</para>
    </section>
    <section xml:id="section_object-replication">
        <title>Object replication</title>
        <para>The initial implementation of object replication
            performed an rsync to push data from a local partition to
            all remote servers where it was expected to reside. While
            this worked at small scale, replication times skyrocketed
            once directory structures could no longer be held in RAM.
            This scheme was modified to save a hash of the contents
            for each suffix directory to a per-partition hashes file.
            The hash for a suffix directory is invalidated when the
            contents of that suffix directory are modified.</para>
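        <para>The per-partition hashes file can be pictured as a
            mapping from suffix directory to content hash, as in this
            simplified sketch (the real hashing scheme differs in
            detail):</para>
        <programlisting language="python">import hashlib
import os

def hash_suffix_dir(suffix_dir):
    """Hash the file names found under one suffix directory."""
    md5 = hashlib.md5()
    for root, dirs, files in sorted(os.walk(suffix_dir)):
        for name in sorted(files):
            md5.update(name.encode('utf-8'))
    return md5.hexdigest()

def invalidate_suffix(hashes, suffix):
    # Modifying a suffix directory invalidates its cached hash;
    # it is recalculated on the next replication pass.
    hashes[suffix] = None</programlisting>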
        <para>The object replication process reads in hash files and
            calculates any invalidated hashes. Then, it transmits the
            hashes to each remote server that should hold the
            partition, and only suffix directories with differing
            hashes on the remote server are rsynced. After pushing
            files to the remote server, the replication process
            notifies it to recalculate hashes for the rsynced suffix
            directories.</para>
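        <para>One replication pass for a single partition can be
            condensed into the following sketch;
            <literal>get_hashes</literal>,
            <literal>rsync_suffixes</literal>, and
            <literal>send_replicate_request</literal> are hypothetical
            helpers standing in for the real machinery:</para>
        <programlisting language="python">def replicate_partition(partition, local_dev, remote_node):
    local_hashes = get_hashes(local_dev, partition)
    remote_hashes = send_replicate_request(remote_node, partition)
    # Only suffix directories whose hashes differ are rsynced.
    suffixes = [s for s, h in local_hashes.items()
                if h != remote_hashes.get(s)]
    if suffixes:
        rsync_suffixes(local_dev, remote_node, partition, suffixes)
        # Tell the remote end to recalculate the hashes for the
        # suffix directories that just changed.
        send_replicate_request(remote_node, partition, suffixes)</programlisting>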
        <para>The number of uncached directories that object
            replication must traverse, usually as a result of
            invalidated suffix directory hashes, impedes performance.
            To provide acceptable replication speeds, object
            replication is designed to invalidate around 2 percent of
            the hash space on a normal node each day.</para>
    </section>
</section>