From 51274c9fcb2381b53079f9bcbb00376cfc470fbd Mon Sep 17 00:00:00 2001 From: Sean Roberts Date: Thu, 7 Nov 2013 16:11:37 +0800 Subject: [PATCH] moved module-003 material into swift concepts chapter this is part of the training guide restructuring effort. implements bp/training-manual Change-Id: I28c9c5317f55da4cef075bae89b3ee518aaeae74 --- .../associate-storage-node-concept-swift.xml | 1202 ++++++++++++++++- 1 file changed, 1198 insertions(+), 4 deletions(-) diff --git a/doc/training-guide/associate-storage-node-concept-swift.xml b/doc/training-guide/associate-storage-node-concept-swift.xml index a385dd7742..bfaf38afd2 100644 --- a/doc/training-guide/associate-storage-node-concept-swift.xml +++ b/doc/training-guide/associate-storage-node-concept-swift.xml @@ -3,9 +3,1203 @@ xmlns:xlink="http://www.w3.org/1999/xlink" version="5.0" xml:id="associate-storage-node-concept-swift"> Conceptual Swift - foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar - foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar - foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar - foobar foobar foobar foobar foobar foobar foobar foobar foobar foobar +
+ Introduction to Object Storage + OpenStack Object Storage (code-named Swift) is open source + software for creating redundant, scalable data storage using + clusters of standardized servers to store petabytes of + accessible data. It is a long-term storage system for large + amounts of static data that can be retrieved, leveraged, and + updated. Object Storage uses a distributed architecture with + no central point of control, providing greater scalability, + redundancy and permanence. Objects are written to multiple + hardware devices, with the OpenStack software responsible for + ensuring data replication and integrity across the cluster. + Storage clusters scale horizontally by adding new nodes. + Should a node fail, OpenStack works to replicate its content + from other active nodes. Because OpenStack uses software logic + to ensure data replication and distribution across different + devices, inexpensive commodity hard drives and servers can be + used in lieu of more expensive equipment. + Object Storage is ideal for cost effective, scale-out + storage. It provides a fully distributed, API-accessible + storage platform that can be integrated directly into + applications or used for backup, archiving and data retention. + Block Storage allows block devices to be exposed and connected + to compute instances for expanded storage, better performance + and integration with enterprise storage platforms, such as + NetApp, Nexenta and SolidFire. +
+
+
    Features and Benefits

        Features -- Benefits

        Leverages commodity hardware -- No lock-in, lower price/GB

        HDD/node failure agnostic -- Self healing; reliability; data redundancy protecting from failures

        Unlimited storage -- Huge and flat namespace; highly scalable read/write access; ability to serve content directly from the storage system

        Multi-dimensional scalability (scale-out architecture); scale vertically and horizontally-distributed storage -- Back up and archive large amounts of data with linear performance

        Account/Container/Object structure; no nesting, not a traditional file system -- Optimized for scale; scales to multiple petabytes and billions of objects

        Built-in replication; 3x+ data redundancy compared to 2x on RAID -- Configurable number of account, container and object copies for high availability

        Easily add capacity, unlike RAID resize -- Elastic data scaling with ease

        No central database -- Higher performance, no bottlenecks

        RAID not required -- Handles lots of small, random reads and writes efficiently

        Built-in management utilities -- Account management (create, add, verify, delete users); container management (upload, download, verify); monitoring (capacity, host, network, log trawling, cluster health)

        Drive auditing -- Detects drive failures, preempting data corruption

        Expiring objects -- Users can set an expiration time or a TTL on an object to control access

        Direct object access -- Enables direct browser access to content, such as for a control panel

        Realtime visibility into client requests -- Know what users are requesting

        Supports S3 API -- Use tools that were designed for the popular S3 API

        Restrict containers per account -- Limit access to control usage by user

        Support for NetApp, Nexenta, SolidFire -- Unified support for block volumes using a variety of storage systems

        Snapshot and backup API for block volumes -- Data protection and recovery for VM data

        Standalone volume API available -- Separate endpoint and API for integration with other compute systems

        Integration with Compute -- Fully integrated with Compute for attaching block volumes and reporting on usage
+
+ Object Storage Capabilities + + + OpenStack provides redundant, scalable object + storage using clusters of standardized servers capable + of storing petabytes of data + + + Object Storage is not a traditional file system, but + rather a distributed storage system for static data + such as virtual machine images, photo storage, email + storage, backups and archives. Having no central + "brain" or master point of control provides greater + scalability, redundancy and durability. + + + Objects and files are written to multiple disk + drives spread throughout servers in the data center, + with the OpenStack software responsible for ensuring + data replication and integrity across the + cluster. + + + Storage clusters scale horizontally simply by adding + new servers. Should a server or hard drive fail, + OpenStack replicates its content from other active + nodes to new locations in the cluster. Because + OpenStack uses software logic to ensure data + replication and distribution across different devices, + inexpensive commodity hard drives and servers can be + used in lieu of more expensive equipment. + + + Swift Characteristics + The key characteristics of Swift include: + + + All objects stored in Swift have a URL + + + All objects stored are replicated 3x in + as-unique-as-possible zones, which can be defined as a + group of drives, a node, a rack etc. + + + All objects have their own metadata + + + Developers interact with the object storage system + through a RESTful HTTP API + + + Object data can be located anywhere in the + cluster + + + The cluster scales by adding additional nodes -- + without sacrificing performance, which allows a more + cost-effective linear storage expansion vs. fork-lift + upgrades + + + Data doesn’t have to be migrated to an entirely new + storage system + + + New nodes can be added to the cluster without + downtime + + + Failed nodes and disks can be swapped out with no + downtime + + + Runs on industry-standard hardware, such as Dell, + HP, Supermicro etc. + + +
+
                Object Storage (Swift)
+ Developers can either write directly to the Swift API or use one of the many client libraries that exist for all popular programming languages, such as Java, Python, Ruby and C#. Amazon S3 and Rackspace Cloud Files users should feel very familiar with Swift. For users who have not used an object storage system before, it will require a different approach and mindset than using a traditional filesystem.
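+
    As a concrete illustration, the following minimal sketch uses the Python requests library to create a container, upload an object and read it back through the REST API. The storage URL, token and file name are placeholders for whatever your authentication service and environment provide, not values from a real deployment.

    import requests

    STORAGE_URL = "https://swift.example.com/v1/AUTH_demo"   # placeholder account URL
    TOKEN = "AUTH_tk_replace_me"                             # placeholder auth token
    HEADERS = {"X-Auth-Token": TOKEN}

    # Create (or update) a container.
    requests.put(STORAGE_URL + "/photos", headers=HEADERS)

    # Upload an object into the container.
    with open("beach.jpg", "rb") as f:
        requests.put(STORAGE_URL + "/photos/beach.jpg", headers=HEADERS, data=f)

    # List the container, then download the object back.
    print(requests.get(STORAGE_URL + "/photos", headers=HEADERS).text)
    obj = requests.get(STORAGE_URL + "/photos/beach.jpg", headers=HEADERS)
    print(len(obj.content), "bytes retrieved")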
+
+
    Building Blocks of Swift
    The components that enable Swift to deliver high availability, high durability and high concurrency are:

        Proxy Servers: Handle all incoming API requests.

        Rings: Map logical names of data to locations on particular disks.

        Zones: Each Zone isolates data from other Zones. A failure in one Zone doesn't impact the rest of the cluster because data is replicated across the Zones.

        Accounts & Containers: Each Account and Container is an individual database that is distributed across the cluster. An Account database contains the list of Containers in that Account. A Container database contains the list of Objects in that Container.

        Objects: The data itself.

        Partitions: A Partition stores Objects, Account databases and Container databases. It's an intermediate 'bucket' that helps manage locations where data lives in the cluster.
+ Building Blocks + + + + + +
+
    Proxy Servers
    The Proxy Servers are the public face of Swift and handle all incoming API requests. Once a Proxy Server receives a request, it determines the storage node based on the URL of the object, e.g. https://swift.example.com/v1/account/container/object. The Proxy Server also coordinates responses, handles failures and coordinates timestamps.
    Proxy servers use a shared-nothing architecture and can be scaled as needed based on projected workloads. A minimum of two Proxy Servers should be deployed for redundancy. Should one proxy server fail, the others will take over.
    The Ring
    A ring represents a mapping between the names of entities stored on disk and their physical location. There are separate rings for accounts, containers, and objects. When other components need to perform any operation on an object, container, or account, they need to interact with the appropriate ring to determine its location in the cluster.
    The Ring maintains this mapping using zones, devices, partitions, and replicas. Each partition in the ring is replicated, by default, 3 times across the cluster, and the locations for a partition are stored in the mapping maintained by the ring. The ring is also responsible for determining which devices are used for handoff in failure scenarios.
    Data can be isolated with the concept of zones in the ring. Each replica of a partition is guaranteed to reside in a different zone. A zone could represent a drive, a server, a cabinet, a switch, or even a data center.
    The partitions of the ring are equally divided among all the devices in the OpenStack Object Storage installation. When partitions need to be moved around (for example, if a device is added to the cluster), the ring ensures that a minimum number of partitions are moved at a time, and only one replica of a partition is moved at a time.
    Weights can be used to balance the distribution of partitions on drives across the cluster. This can be useful, for example, when different sized drives are used in a cluster.
    The ring is used by the Proxy server and several background processes (like replication).
+ The Lord of the <emphasis role="bold" + >Ring</emphasis>s + + + + + +
+
    The Ring maps partitions to physical locations on disk.
    The rings determine where data should reside in the cluster. There is a separate ring for account databases, container databases, and individual objects, but each ring works in the same way. These rings are externally managed, in that the server processes themselves do not modify the rings; they are instead given new rings modified by other tools.
    The ring uses a configurable number of bits from a path's MD5 hash as a partition index that designates a device. The number of bits kept from the hash is known as the partition power, and 2 to the partition power indicates the partition count. Partitioning the full MD5 hash ring allows other parts of the cluster to work in batches of items at once, which ends up either more efficient or at least less complex than working with each item separately or the entire cluster all at once.
    Another configurable value is the replica count, which indicates how many of the partition->device assignments comprise a single ring. For a given partition number, each replica's device will not be in the same zone as any other replica's device. Zones can be used to group devices based on physical locations, power separations, network separations, or any other attribute that would lessen the chance of multiple replicas being unavailable at the same time.
    Zones: Failure Boundaries
    Swift allows zones to be configured to isolate failure boundaries. Each replica of the data resides in a separate zone, if possible. At the smallest level, a zone could be a single drive or a grouping of a few drives. If there were five object storage servers, then each server would represent its own zone. Larger deployments would have an entire rack (or multiple racks) of object servers, each representing a zone. The goal of zones is to allow the cluster to tolerate significant outages of storage servers without losing all replicas of the data.
    As we learned earlier, everything in Swift is stored, by default, three times. Swift will place each replica "as-uniquely-as-possible" to ensure both high availability and high durability. This means that when choosing a replica location, Swift will choose a server in an unused zone before an unused server in a zone that already has a replica of the data.
+
                Zones
+ When a disk fails, replica data is automatically + distributed to the other zones to ensure there are + three copies of the data + Accounts & + Containers + Each account and container is an individual SQLite + database that is distributed across the cluster. An + account database contains the list of containers in + that account. A container database contains the list + of objects in that container. +
+ Accounts and Containers + + + + + +
+ To keep track of object data location, each account + in the system has a database that references all its + containers, and each container database references + each object + Partitions + A Partition is a collection of stored data, + including Account databases, Container databases, and + objects. Partitions are core to the replication + system. + Think of a Partition as a bin moving throughout a + fulfillment center warehouse. Individual orders get + thrown into the bin. The system treats that bin as a + cohesive entity as it moves throughout the system. A + bin full of things is easier to deal with than lots of + little things. It makes for fewer moving parts + throughout the system. + The system replicators and object uploads/downloads + operate on Partitions. As the system scales up, + behavior continues to be predictable as the number of + Partitions is a fixed number. + The implementation of a Partition is conceptually + simple -- a partition is just a directory sitting on a + disk with a corresponding hash table of what it + contains. +
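+
    To make the "a partition is just a directory" point concrete, here is a small sketch of how an object's on-disk directory can be derived. The /srv/node mount point, the partition power and the hashing salt are assumptions modelled on a typical deployment, not values taken from this guide.

    from hashlib import md5
    from struct import unpack_from

    PART_POWER = 18                    # assumption: ring built with 2**18 partitions
    HASH_PATH_SUFFIX = b"changeme"     # assumption: per-cluster hashing salt

    def object_dir(device, account, container, obj):
        name = "/%s/%s/%s" % (account, container, obj)
        digest = md5(name.encode() + HASH_PATH_SUFFIX).digest()
        hexdigest = digest.hex()
        # Same top-four-bytes shift described in the Ring Builder section.
        partition = unpack_from(">I", digest)[0] >> (32 - PART_POWER)
        suffix = hexdigest[-3:]        # last three hex characters of the hash
        return "/srv/node/%s/objects/%d/%s/%s" % (device, partition, suffix, hexdigest)

    print(object_dir("sdb1", "AUTH_demo", "photos", "beach.jpg"))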
+ Partitions + + + + + +
+ *Swift partitions contain all data in the + system. + Replication + In order to ensure that there are three copies of + the data everywhere, replicators continuously examine + each Partition. For each local Partition, the + replicator compares it against the replicated copies + in the other Zones to see if there are any + differences. + How does the replicator know if replication needs to + take place? It does this by examining hashes. A hash + file is created for each Partition, which contains + hashes of each directory in the Partition. Each of the + three hash files is compared. For a given Partition, + the hash files for each of the Partition's copies are + compared. If the hashes are different, then it is time + to replicate and the directory that needs to be + replicated is copied over. + This is where the Partitions come in handy. With + fewer "things" in the system, larger chunks of data + are transferred around (rather than lots of little TCP + connections, which is inefficient) and there are a + consistent number of hashes to compare. + The cluster has eventually consistent behavior where + the newest data wins. +
+ Replication + + + + + +
+
    *If a zone goes down, one of the nodes containing a replica notices and proactively copies data to a handoff location.
    To describe how these pieces all come together, let's walk through a few scenarios and introduce the components.
    Bird's-eye View
    Upload

    A client uses the REST API to make an HTTP request to PUT an object into an existing Container. The cluster receives the request. First, the system must figure out where the data is going to go. To do this, the Account name, Container name and Object name are all used to determine the Partition where this object should live.
    Then a lookup in the Ring figures out which storage nodes contain the Partitions in question.
    The data then is sent to each storage node where it is placed in the appropriate Partition. A quorum is required -- at least two of the three writes must be successful before the client is notified that the upload was successful.
    Next, the Container database is updated asynchronously to reflect that there is a new object in it.
+ When End-User uses Swift + + + + + +
+ Download + A request comes in for an Account/Container/object. + Using the same consistent hashing, the Partition name is + generated. A lookup in the Ring reveals which storage + nodes contain that Partition. A request is made to one of + the storage nodes to fetch the object and if that fails, + requests are made to the other nodes. +
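+
    To make the quorum rule from the upload walkthrough concrete, here is a tiny sketch; the helper name and the HTTP status values are illustrative, not Swift's actual code.

    # Sketch of the write quorum described above: with three replicas, at least
    # two storage nodes must acknowledge the PUT before the client sees success.
    def put_succeeded(statuses, replicas=3):
        quorum = replicas // 2 + 1                      # 2 of 3 by default
        good = sum(1 for s in statuses if 200 <= s < 300)
        return good >= quorum

    print(put_succeeded([201, 201, 503]))   # True  -- two of three writes landed
    print(put_succeeded([201, 503, 503]))   # False -- only one write landed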
+
+
    Ring Builder
    The rings are built and managed manually by a utility called the ring-builder. The ring-builder assigns partitions to devices and writes an optimized Python structure to a gzipped, serialized file on disk for shipping out to the servers. The server processes just check the modification time of the file occasionally and reload their in-memory copies of the ring structure as needed. Because of how the ring-builder manages changes to the ring, using a slightly older ring usually just means one of the three replicas for a subset of the partitions will be incorrect, which can be easily worked around.
    The ring-builder also keeps its own builder file with the ring information and additional data required to build future rings. It is very important to keep multiple backup copies of these builder files. One option is to copy the builder files out to every server while copying the ring files themselves. Another is to upload the builder files into the cluster itself. Complete loss of a builder file will mean creating a new ring from scratch; nearly all partitions will end up assigned to different devices, and therefore nearly all data stored will have to be replicated to new locations. So, recovery from a builder file loss is possible, but data will definitely be unreachable for an extended time.
    Ring Data Structure
    The ring data structure consists of three top level fields: a list of devices in the cluster, a list of lists of device ids indicating partition to device assignments, and an integer indicating the number of bits to shift an MD5 hash to calculate the partition for the hash.
    Partition Assignment List
    This is a list of array('H') arrays of device ids. The outermost list contains an array('H') for each replica. Each array('H') has a length equal to the partition count for the ring. Each integer in the array('H') is an index into the above list of devices. The partition list is known internally to the Ring class as _replica2part2dev_id.
    So, to create a list of device dictionaries assigned to a partition, the Python code would look like: devices = [self.devs[part2dev_id[partition]] for part2dev_id in self._replica2part2dev_id]
    That code is a little simplistic, as it does not account for the removal of duplicate devices. If a ring has more replicas than devices, then a partition will have more than one replica on one device; that's simply the pigeonhole principle at work.
    array('H') is used for memory conservation as there may be millions of partitions.
    Fractional Replicas
    A ring is not restricted to having an integer number of replicas. In order to support the gradual changing of replica counts, the ring is able to have a real number of replicas.
    When the number of replicas is not an integer, then the last element of _replica2part2dev_id will have a length that is less than the partition count for the ring. This means that some partitions will have more replicas than others. For example, if a ring has 3.25 replicas, then 25% of its partitions will have four replicas, while the remaining 75% will have just three.
    Partition Shift Value
    The partition shift value is known internally to the Ring class as _part_shift. This value is used to shift an MD5 hash to calculate the partition on which the data for that hash should reside. Only the top four bytes of the hash are used in this process.
For example, to compute the partition for the path /account/container/object, the Python code might look like: partition = unpack_from('>I', md5('/account/container/object').digest())[0] >> self._part_shift
    For a ring generated with part_power P, the partition shift value is 32 - P.
    Building the Ring
    The initial building of the ring first calculates the number of partitions that should ideally be assigned to each device based on the device's weight. For example, given a partition power of 20, the ring will have 1,048,576 partitions. If there are 1,000 devices of equal weight they will each desire 1,048.576 partitions. The devices are then sorted by the number of partitions they desire and kept in order throughout the initialization process.
    Note: each device is also assigned a random tiebreaker value that is used when two devices desire the same number of partitions. This tiebreaker is not stored on disk anywhere, and so two different rings created with the same parameters will have different partition assignments. For repeatable partition assignments, RingBuilder.rebalance() takes an optional seed value that will be used to seed Python's pseudo-random number generator.
    Then, the ring builder assigns each replica of each partition to the device that desires the most partitions at that point while keeping it as far away as possible from other replicas. The ring builder prefers to assign a replica to a device in a region that has no replicas already; should there be no such region available, the ring builder will try to find a device in a different zone; if not possible, it will look on a different server; failing that, it will just look for a device that has no replicas; finally, if all other options are exhausted, the ring builder will assign the replica to the device that has the fewest replicas already assigned. Note that assignment of multiple replicas to one device will only happen if the ring has fewer devices than it has replicas.
    When building a new ring based on an old ring, the desired number of partitions each device wants is recalculated. Next, the partitions to be reassigned are gathered up. Any removed devices have all their assigned partitions unassigned and added to the gathered list. Any partition replicas that (due to the addition of new devices) can be spread out for better durability are unassigned and added to the gathered list. Any devices that have more partitions than they now desire have random partitions unassigned from them and added to the gathered list. Lastly, the gathered partitions are then reassigned to devices using a similar method as in the initial assignment described above.
    Whenever a partition has a replica reassigned, the time of the reassignment is recorded. This is taken into account when gathering partitions to reassign so that no partition is moved twice in a configurable amount of time. This configurable amount of time is known internally to the RingBuilder class as min_part_hours. This restriction is ignored for replicas of partitions on devices that have been removed, as removing a device only happens on device failure and there's no choice but to make a reassignment.
    The above processes don't always perfectly rebalance a ring due to the random nature of gathering partitions for reassignment.
To help reach a more balanced ring, the rebalance process is repeated until near perfect (less than 1% off) or when the balance doesn't improve by at least 1% (indicating we probably can't get perfect balance due to wildly imbalanced zones or too many partitions recently moved).
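+
    Putting the pieces above together, the following self-contained sketch mimics the lookup a ring performs: hash the object path, shift down to a partition number, then read one device per replica out of _replica2part2dev_id. The toy device table, partition power and replica assignments are made up for illustration, and a real ring also mixes a cluster-wide hash salt into the MD5.

    from array import array
    from hashlib import md5
    from struct import unpack_from

    PART_POWER = 2                      # toy ring: 2**2 = 4 partitions
    PART_SHIFT = 32 - PART_POWER

    devs = [                            # toy device table (normally read from the ring file)
        {"id": 0, "zone": 1, "ip": "10.0.0.1", "device": "sdb1"},
        {"id": 1, "zone": 2, "ip": "10.0.0.2", "device": "sdb1"},
        {"id": 2, "zone": 3, "ip": "10.0.0.3", "device": "sdb1"},
    ]

    # One array('H') per replica, each of length 2**PART_POWER (partition -> device id).
    replica2part2dev_id = [
        array("H", [0, 1, 2, 0]),
        array("H", [1, 2, 0, 1]),
        array("H", [2, 0, 1, 2]),
    ]

    def get_nodes(account, container, obj):
        path = ("/%s/%s/%s" % (account, container, obj)).encode()
        partition = unpack_from(">I", md5(path).digest())[0] >> PART_SHIFT
        return partition, [devs[r2p2d[partition]] for r2p2d in replica2part2dev_id]

    part, nodes = get_nodes("AUTH_demo", "photos", "beach.jpg")
    print(part, [d["ip"] for d in nodes])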
+
+
    A Bit More On Swift
    Containers and Objects
    A container is a storage compartment for your data and provides a way for you to organize your data. You can think of a container as a folder in Windows or a directory in UNIX. The primary difference between a container and these other file system concepts is that containers cannot be nested. You can, however, create an unlimited number of containers within your account. Data must be stored in a container, so you must have at least one container defined in your account prior to uploading data.
    The only restrictions on container names are that they cannot contain a forward slash (/) or an ASCII null (%00) and must be less than 257 bytes in length. Please note that the length restriction applies to the name after it has been URL encoded. For example, a container name of Course Docs would be URL encoded as Course%20Docs and therefore be 13 bytes in length rather than the expected 11.
    An object is the basic storage entity and any optional metadata that represents the files you store in the OpenStack Object Storage system. When you upload data to OpenStack Object Storage, the data is stored as-is (no compression or encryption) and consists of a location (container), the object's name, and any metadata consisting of key/value pairs. For instance, you may choose to store a backup of your digital photos and organize them into albums. In this case, each object could be tagged with metadata such as Album : Caribbean Cruise or Album : Aspen Ski Trip.
    The only restriction on object names is that they must be less than 1024 bytes in length after URL encoding. For example, an object name of C++final(v2).txt should be URL encoded as C%2B%2Bfinal%28v2%29.txt and therefore be 24 bytes in length rather than the expected 16.
    The maximum allowable size for a storage object upon upload is 5 gigabytes (GB) and the minimum is zero bytes. You can use the built-in large object support and the swift utility to retrieve objects larger than 5 GB.
    For metadata, you should not exceed 90 individual key/value pairs for any one object and the total byte length of all key/value pairs should not exceed 4KB (4096 bytes).
    Language-Specific API Bindings
    A set of supported API bindings in several popular languages is available from the Rackspace Cloud Files product, which uses OpenStack Object Storage code for its implementation. These bindings provide a layer of abstraction on top of the base REST API, allowing programmers to work with a container and object model instead of working directly with HTTP requests and responses. These bindings are free (as in beer and as in speech) to download, use, and modify. They are all licensed under the MIT License as described in the COPYING file packaged with each binding. If you do make any improvements to an API, you are encouraged (but not required) to submit those changes back to us.
    The API bindings for Rackspace Cloud Files are hosted at http://github.com/rackspace. Feel free to coordinate your changes through GitHub or, if you prefer, send your changes to cloudfiles@rackspacecloud.com. Just make sure to indicate which language and version you modified and send a unified diff.
    Each binding includes its own documentation (either HTML, PDF, or CHM). They also include code snippets and examples to help you get started.
The currently supported API bindings for OpenStack Object Storage are:

        PHP (requires 5.x and the modules: cURL, FileInfo, mbstring)

        Python (requires 2.4 or newer)

        Java (requires JRE v1.5 or newer)

        C#/.NET (requires .NET Framework v3.5)

        Ruby (requires 1.8 or newer and mime-tools module)

    There are no other supported language-specific bindings at this time. You are welcome to create your own language API bindings and we can help answer any questions during development, host your code if you like, and give you full credit for your work.
    Proxy Server
    The Proxy Server is responsible for tying together the rest of the OpenStack Object Storage architecture. For each request, it will look up the location of the account, container, or object in the ring (described earlier) and route the request accordingly. The public API is also exposed through the Proxy Server.
    A large number of failures are also handled in the Proxy Server. For example, if a server is unavailable for an object PUT, it will ask the ring for a handoff server and route there instead.
    When objects are streamed to or from an object server, they are streamed directly through the proxy server to or from the user -- the proxy server does not spool them.
    You can use a proxy server with account management enabled by configuring it in the proxy server configuration file.
    Object Server
    The Object Server is a very simple blob storage server that can store, retrieve and delete objects stored on local devices. Objects are stored as binary files on the filesystem with metadata stored in the file's extended attributes (xattrs). This requires that the underlying filesystem choice for object servers support xattrs on files. Some filesystems, like ext3, have xattrs turned off by default.
    Each object is stored using a path derived from the object name's hash and the operation's timestamp. Last write always wins, and ensures that the latest object version will be served. A deletion is also treated as a version of the file (a 0 byte file ending with ".ts", which stands for tombstone). This ensures that deleted files are replicated correctly and older versions don't magically reappear due to failure scenarios.
    Container Server
    The Container Server's primary job is to handle listings of objects. It does not know where those objects are, just what objects are in a specific container. The listings are stored as sqlite database files, and replicated across the cluster similar to how objects are. Statistics are also tracked that include the total number of objects, and total storage usage for that container.
    Account Server
    The Account Server is very similar to the Container Server, except that it is responsible for listings of containers rather than objects.
    Replication
    Replication is designed to keep the system in a consistent state in the face of temporary error conditions like network outages or drive failures.
    The replication processes compare local data with each remote copy to ensure they all contain the latest version. Object replication uses a hash list to quickly compare subsections of each partition, and container and account replication use a combination of hashes and shared high water marks.
    Replication updates are push based. For object replication, updating is just a matter of rsyncing files to the peer.
Account and container replication push missing records over HTTP or rsync whole database files.
    The replicator also ensures that data is removed from the system. When an item (object, container, or account) is deleted, a tombstone is set as the latest version of the item. The replicator will see the tombstone and ensure that the item is removed from the entire system.
    To separate the cluster-internal replication traffic from client traffic, separate replication servers can be used. These replication servers are based on the standard storage servers, but they listen on the replication IP and only respond to REPLICATE requests. Storage servers can serve REPLICATE requests, so an operator can transition to using a separate replication network with no cluster downtime.
    Replication IP and port information is stored in the ring on a per-node basis. These parameters will be used if they are present, but they are not required. If this information does not exist or is empty for a particular node, the node's standard IP and port will be used for replication.
    Updaters
    There are times when container or account data cannot be immediately updated. This usually occurs during failure scenarios or periods of high load. If an update fails, the update is queued locally on the file system, and the updater will process the failed updates. This is where an eventual consistency window will most likely come into play. For example, suppose a container server is under load and a new object is put into the system. The object will be immediately available for reads as soon as the proxy server responds to the client with success. However, the container server did not update the object listing, and so the update would be queued for a later update. Container listings, therefore, may not immediately contain the object.
    In practice, the consistency window is only as large as the frequency at which the updater runs and may not even be noticed as the proxy server will route listing requests to the first container server which responds. The server under load may not be the one that serves subsequent listing requests -- one of the other two replicas may handle the listing.
    Auditors
    Auditors crawl the local server checking the integrity of the objects, containers, and accounts. If corruption is found (in the case of bit rot, for example), the file is quarantined, and replication will replace the bad file from another replica. If other errors are found they are logged (for example, an object's listing can't be found on any container server it should be on).
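+
    As a conceptual sketch of the bit-rot check the auditors perform: re-hash the object file on disk and compare it against the checksum recorded when the object was written. Reading the recorded ETag as a plain argument and the quarantine path used here are simplifications; the real object server keeps the ETag in the file's extended attributes.

    from hashlib import md5
    import os
    import shutil

    QUARANTINE_DIR = "/srv/node/sdb1/quarantined/objects"   # illustrative path

    def audit_object(data_path, recorded_etag):
        hasher = md5()
        with open(data_path, "rb") as f:
            for chunk in iter(lambda: f.read(64 * 1024), b""):
                hasher.update(chunk)
        if hasher.hexdigest() != recorded_etag:
            # Corruption detected: move the whole object directory aside so
            # replication can restore a clean copy from another replica.
            os.makedirs(QUARANTINE_DIR, exist_ok=True)
            shutil.move(os.path.dirname(data_path), QUARANTINE_DIR)
            return False
        return True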
+
+
    Cluster Architecture
    Access Tier
+ Swift Cluster Architecture + + + + + +
+
    Large-scale deployments segment off an "Access Tier". This tier is the "Grand Central" of the Object Storage system. It fields incoming API requests from clients and moves data in and out of the system. This tier is composed of front-end load balancers, SSL terminators and authentication services, and it runs the (distributed) brain of the object storage system -- the proxy server processes.
    Having the access servers in their own tier enables read/write access to be scaled out independently of storage capacity. For example, if the cluster is on the public Internet and requires SSL termination and has high demand for data access, many access servers can be provisioned. However, if the cluster is on a private network and it is being used primarily for archival purposes, fewer access servers are needed.
    As this is an HTTP addressable storage service, a load balancer can be incorporated into the access tier.
    Typically, this tier comprises a collection of 1U servers. These machines use a moderate amount of RAM and are network I/O intensive. As these systems field each incoming API request, it is wise to provision them with two high-throughput (10GbE) interfaces. One interface is used for 'front-end' incoming requests and the other for 'back-end' access to the object storage nodes to put and fetch data.
    Factors to Consider
    For most publicly facing deployments, as well as private deployments available across a wide-reaching corporate network, SSL will be used to encrypt traffic to the client. SSL adds significant processing load to establish sessions between clients, so more capacity will need to be provisioned in the access layer. SSL may not be required for private deployments on trusted networks.
    Storage Nodes
+ Object Storage (Swift) + + + + + +
+ The next component is the storage servers themselves. + Generally, most configurations should have each of the + five Zones with an equal amount of storage capacity. + Storage nodes use a reasonable amount of memory and CPU. + Metadata needs to be readily available to quickly return + objects. The object stores run services not only to field + incoming requests from the Access Tier, but to also run + replicators, auditors, and reapers. Object stores can be + provisioned with single gigabit or 10 gigabit network + interface depending on expected workload and desired + performance. + Currently 2TB or 3TB SATA disks deliver good + price/performance value. Desktop-grade drives can be used + where there are responsive remote hands in the datacenter, + and enterprise-grade drives can be used where this is not + the case. + Factors to Consider + Desired I/O performance for single-threaded requests + should be kept in mind. This system does not use RAID, + so each request for an object is handled by a single + disk. Disk performance impacts single-threaded + response rates. + To achieve apparent higher throughput, the object + storage system is designed with concurrent + uploads/downloads in mind. The network I/O capacity + (1GbE, bonded 1GbE pair, or 10GbE) should match your + desired concurrent throughput needs for reads and + writes. +
+
+ Account Reaper + The Account Reaper removes data from deleted accounts in the + background. + An account is marked for deletion by a reseller issuing a + DELETE request on the account’s storage URL. This simply puts + the value DELETED into the status column of the account_stat + table in the account database (and replicas), indicating the + data for the account should be deleted later. + There is normally no set retention time and no undelete; it + is assumed the reseller will implement such features and only + call DELETE on the account once it is truly desired the + account’s data be removed. However, in order to protect the + Swift cluster accounts from an improper or mistaken delete + request, you can set a delay_reaping value in the + [account-reaper] section of the account-server.conf to delay + the actual deletion of data. At this time, there is no utility + to undelete an account; one would have to update the account + database replicas directly, setting the status column to an + empty string and updating the put_timestamp to be greater than + the delete_timestamp. (On the TODO list is writing a utility + to perform this task, preferably through a ReST call.) + The account reaper runs on each account server and scans the + server occasionally for account databases marked for deletion. + It will only trigger on accounts that server is the primary + node for, so that multiple account servers aren’t all trying + to do the same work at the same time. Using multiple servers + to delete one account might improve deletion speed, but + requires coordination so they aren’t duplicating effort. Speed + really isn’t as much of a concern with data deletion and large + accounts aren’t deleted that often. + The deletion process for an account itself is pretty + straightforward. For each container in the account, each + object is deleted and then the container is deleted. Any + deletion requests that fail won’t stop the overall process, + but will cause the overall process to fail eventually (for + example, if an object delete times out, the container won’t be + able to be deleted later and therefore the account won’t be + deleted either). The overall process continues even on a + failure so that it doesn’t get hung up reclaiming cluster + space because of one troublesome spot. The account reaper will + keep trying to delete an account until it eventually becomes + empty, at which point the database reclaim process within the + db_replicator will eventually remove the database + files. + Sometimes a persistent error state can prevent some object + or container from being deleted. If this happens, you will see + a message such as “Account <name> has not been reaped + since <date>” in the log. You can control when this is + logged with the reap_warn_after value in the [account-reaper] + section of the account-server.conf file. By default this is 30 + days. +
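+
    The two options mentioned above live in the [account-reaper] section of the account server's configuration file. The snippet below is an illustrative sketch with values in seconds, not a recommended default.

    [account-reaper]
    # Wait two days after an account is marked DELETED before reaping its data.
    # Illustrative value; the option defaults to no delay.
    delay_reaping = 172800
    # Warn in the log if an account still has not been fully reaped after
    # 30 days, the default mentioned above.
    reap_warn_after = 2592000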
+
+
    Replication
    Because each replica in Swift functions independently, and clients generally require only a simple majority of nodes responding to consider an operation successful, transient failures like network partitions can quickly cause replicas to diverge. These differences are eventually reconciled by asynchronous, peer-to-peer replicator processes. The replicator processes traverse their local filesystems, concurrently performing operations in a manner that balances load across physical disks.
    Replication uses a push model, with records and files generally only being copied from local to remote replicas. This is important because data on the node may not belong there (as in the case of handoffs and ring changes), and a replicator can't know what data exists elsewhere in the cluster that it should pull in. It's the duty of any node that contains data to ensure that data gets to where it belongs. Replica placement is handled by the ring.
    Every deleted record or file in the system is marked by a tombstone, so that deletions can be replicated alongside creations. The replication process cleans up tombstones after a time period known as the consistency window. The consistency window encompasses replication duration and how long transient failures can remove a node from the cluster. Tombstone cleanup must be tied to replication to reach replica convergence.
    If a replicator detects that a remote drive has failed, the replicator uses the get_more_nodes interface for the ring to choose an alternate node with which to synchronize. The replicator can maintain desired levels of replication in the face of disk failures, though some replicas may not be in an immediately usable location. Note that the replicator doesn't maintain desired levels of replication when other failures, such as entire node failures, occur because most failures are transient.
    Replication is an area of active development, and likely rife with potential improvements to speed and correctness.
    There are two major classes of replicator: the db replicator, which replicates accounts and containers, and the object replicator, which replicates object data.
    DB Replication
    The first step performed by db replication is a low-cost hash comparison to determine whether two replicas already match. Under normal operation, this check is able to verify that most databases in the system are already synchronized very quickly. If the hashes differ, the replicator brings the databases in sync by sharing records added since the last sync point.
    This sync point is a high water mark noting the last record at which two databases were known to be in sync, and is stored in each database as a tuple of the remote database id and record id. Database ids are unique amongst all replicas of the database, and record ids are monotonically increasing integers. After all new records have been pushed to the remote database, the entire sync table of the local database is pushed, so the remote database can guarantee that it is in sync with everything with which the local database has previously synchronized.
    If a replica is found to be missing entirely, the whole local database file is transmitted to the peer using rsync(1) and vested with a new unique id.
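+
    A simplified sketch of the sync-point bookkeeping just described is shown below. Real replicas are SQLite databases and the record push happens over HTTP; both are elided here, and the function and argument names are illustrative.

    # Each replica remembers, per remote database id, the highest record id it
    # has already pushed, and only ships newer records on the next pass.
    def push_new_records(local_records, sync_points, remote_db_id, send):
        """local_records: list of (record_id, row) with monotonically increasing ids.
        sync_points: dict mapping remote db id -> last record id known to be synced.
        send: callable that delivers a batch of rows to the remote replica."""
        last_synced = sync_points.get(remote_db_id, -1)
        new_records = [(rid, row) for rid, row in local_records if rid > last_synced]
        if new_records:
            send([row for _, row in new_records])
            sync_points[remote_db_id] = new_records[-1][0]   # advance the high water mark
        return len(new_records)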
+ In practice, DB replication can process hundreds of + databases per concurrency setting per second (up to the + number of available CPUs or disks) and is bound by the + number of DB transactions that must be performed. + Object Replication + The initial implementation of object replication simply + performed an rsync to push data from a local partition to + all remote servers it was expected to exist on. While this + performed adequately at small scale, replication times + skyrocketed once directory structures could no longer be + held in RAM. We now use a modification of this scheme in + which a hash of the contents for each suffix directory is + saved to a per-partition hashes file. The hash for a + suffix directory is invalidated when the contents of that + suffix directory are modified. + The object replication process reads in these hash + files, calculating any invalidated hashes. It then + transmits the hashes to each remote server that should + hold the partition, and only suffix directories with + differing hashes on the remote server are rsynced. After + pushing files to the remote server, the replication + process notifies it to recalculate hashes for the rsynced + suffix directories. + Performance of object replication is generally bound by + the number of uncached directories it has to traverse, + usually as a result of invalidated suffix directory + hashes. Using write volume and partition counts from our + running systems, it was designed so that around 2% of the + hash space on a normal node will be invalidated per day, + which has experimentally given us acceptable replication + speeds. +
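
    To close, here is a conceptual sketch of the suffix-hash comparison just described: each partition keeps a mapping of suffix directory to a hash of its contents, and only suffixes whose hashes differ from the remote copy need to be pushed. The helper names and the rsync invocation are illustrative, not Swift's actual code.

    import subprocess

    def suffixes_to_sync(local_hashes, remote_hashes):
        """Both arguments map suffix directory names to content hashes."""
        return [suffix for suffix, digest in local_hashes.items()
                if remote_hashes.get(suffix) != digest]

    def replicate_partition(partition_path, remote, local_hashes, remote_hashes):
        for suffix in suffixes_to_sync(local_hashes, remote_hashes):
            # Push only the differing suffix directory to the remote node.
            subprocess.call(["rsync", "-a",
                             partition_path + "/" + suffix,
                             remote + ":" + partition_path + "/"])
        # After pushing, the remote node would be asked (via a REPLICATE request)
        # to recalculate hashes for the suffixes that were just synced.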