Convert objectstorage files to RST

Cloud Admin Guide files converted:

objectstorage-account-reaper.rst
objectstorage-arch.rst
objectstorage-replication.rst
objectstorage-ringbuilder.rst
objectstorage-tenant-specific-image-storage.rst

Change-Id: I9d6416d4dfdf3bd71d59eef64b9ba3af07e15a8a
Implements: blueprint reorganise-user-guides
Brian Moss 2015-06-11 12:06:11 +10:00
parent eb85580cc4
commit 45fb0a6a86
8 changed files with 443 additions and 7 deletions

Binary file not shown. After: 56 KiB.

Binary file not shown. After: 58 KiB.


@@ -2,9 +2,6 @@
 Object Storage
 ==============

-Contents
-~~~~~~~~
-
 .. toctree::
    :maxdepth: 2

@@ -12,13 +9,13 @@ Contents
    objectstorage_features.rst
    objectstorage_characteristics.rst
    objectstorage_components.rst
-   objectstorage-monitoring.rst
-   objectstorage-admin.rst
-.. TODO (karenb)
    objectstorage_ringbuilder.rst
    objectstorage_arch.rst
    objectstorage_replication.rst
    objectstorage_account_reaper.rst
    objectstorage_tenant_specific_image_storage.rst
+   objectstorage-monitoring.rst
+   objectstorage-admin.rst
+.. TODO (karenb)
    objectstorage_troubleshoot.rst


@@ -0,0 +1,50 @@
==============
Account reaper
==============
In the background, the account reaper removes data from the deleted
accounts.
A reseller marks an account for deletion by issuing a ``DELETE`` request
on the account's storage URL. This action sets the ``status`` column of
the account\_stat table in the account database and replicas to
``DELETED``, marking the account's data for deletion.
Typically, no specific retention time or undelete feature is provided.
However, you can set a ``delay_reaping`` value in the
``[account-reaper]`` section of the :file:`account-server.conf` file to
delay the actual deletion of data. At this time, to undelete an account, you must update
the account database replicas directly, setting the status column to an
empty string and updating the put\_timestamp to be greater than the
delete\_timestamp.
.. note::
It is on the development to-do list to write a utility that performs
this task, preferably through a REST call.
The account reaper runs on each account server and periodically scans the
server for account databases marked for deletion. It acts only on the
accounts for which the server is the primary node, so that multiple
account servers do not try to reap the same account simultaneously. Using
multiple servers to delete one account might improve the deletion speed
but requires coordination to avoid duplication. Speed is not a major
concern with data deletion, and large accounts are not deleted often.
Deleting an account is simple. For each account container, all objects
are deleted and then the container is deleted. Deletion requests that
fail do not stop the overall process, but they cause it to fail
eventually (for example, if an object delete times out, you will not be
able to delete the container or the account). The
account reaper keeps trying to delete an account until it is empty, at
which point the database reclaim process within the db\_replicator will
remove the database files.
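The flow can be pictured with a short Python sketch. This is illustrative
only; the helper functions are hypothetical and do not reflect the Swift
reaper's actual internals::

   def reap_account(account):
       # Keep trying until the account is empty; only then can the database
       # reclaim process remove the account database files.
       while not account_is_empty(account):
           for container in list_containers(account):
               for obj in list_objects(account, container):
                   try:
                       delete_object(account, container, obj)
                   except Exception:
                       # A failed delete does not stop this pass, but the
                       # container (and account) deletion will keep failing
                       # until a later pass succeeds.
                       continue
               try:
                   delete_container(account, container)
               except Exception:
                   continue
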
A persistent error state may prevent the deletion of an object or
container. If this happens, you will see a message in the log, for
example::

   "Account <name> has not been reaped since <date>"

You can control when this is logged with the ``reap_warn_after`` value in the
``[account-reaper]`` section of the :file:`account-server.conf` file.
The default value is 30 days.
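For example, a minimal ``[account-reaper]`` section might look like this
(the values are illustrative, not recommendations)::

   [account-reaper]
   # Wait two days (in seconds) after an account is marked DELETED
   # before reaping its data.
   delay_reaping = 172800
   # Warn in the log if an account has not been reaped after 30 days.
   reap_warn_after = 2592000
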


@@ -0,0 +1,81 @@
====================
Cluster architecture
====================
Access tier
~~~~~~~~~~~
Large-scale deployments segment off an access tier, which is considered
the Object Storage system's central hub. The access tier fields the
incoming API requests from clients and moves data in and out of the
system. This tier consists of front-end load balancers, SSL terminators,
and authentication services. It runs the (distributed) brain of the
Object Storage system: the proxy server processes.
**Object Storage architecture**
|
.. image:: figures/objectstorage-arch.png
|
Because access servers are collocated in their own tier, you can scale
out read/write access regardless of the storage capacity. For example,
if a cluster is on the public Internet, requires SSL termination, and
has a high demand for data access, you can provision many access
servers. However, if the cluster is on a private network and used
primarily for archival purposes, you need fewer access servers.
Since this is an HTTP addressable storage service, you may incorporate a
load balancer into the access tier.
Typically, the tier consists of a collection of 1U servers. These
machines use a moderate amount of RAM and are network I/O intensive.
Since these systems field each incoming API request, you should
provision them with two high-throughput (10GbE) interfaces - one for the
incoming "front-end" requests and the other for the "back-end" access to
the object storage nodes to put and fetch data.
Factors to consider
-------------------
For most publicly facing deployments as well as private deployments
available across a wide-reaching corporate network, you use SSL to
encrypt traffic to the client. SSL adds significant processing load to
establish sessions between clients, which is why you have to provision
more capacity in the access layer. SSL may not be required for private
deployments on trusted networks.
Storage nodes
~~~~~~~~~~~~~
In most configurations, each of the five zones should have an equal
amount of storage capacity. Storage nodes use a reasonable amount of
memory and CPU. Metadata needs to be readily available to return objects
quickly. The object stores run services not only to field incoming
requests from the access tier, but also to run replicators, auditors,
and reapers. You can provision object stores with a single gigabit or
10 gigabit network interface, depending on the expected workload and
desired performance.
**Object Storage (swift)**
|
.. image:: figures/objectstorage-nodes.png
|
Currently, a 2 TB or 3 TB SATA disk delivers good performance for the
price. You can use desktop-grade drives if you have responsive remote
hands in the datacenter and enterprise-grade drives if you don't.
Factors to consider
-------------------
Keep in mind the desired I/O performance for single-threaded
requests. This system does not use RAID, so a single disk handles each
request for an object. Disk performance impacts single-threaded response
rates.
To achieve higher apparent throughput, the object storage system is
designed to handle concurrent uploads/downloads. The network I/O
capacity (1GbE, bonded 1GbE pair, or 10GbE) should match your desired
concurrent throughput needs for reads and writes.


@@ -0,0 +1,96 @@
===========
Replication
===========
Because each replica in Object Storage functions independently and
clients generally require only a simple majority of nodes to respond to
consider an operation successful, transient failures like network
partitions can quickly cause replicas to diverge. These differences are
eventually reconciled by asynchronous, peer-to-peer replicator
processes. The replicator processes traverse their local file systems
and concurrently perform operations in a manner that balances load
across physical disks.
Replication uses a push model, with records and files generally only
being copied from local to remote replicas. This is important because
data on the node might not belong there (as in the case of handoffs and
ring changes), and a replicator cannot know which data it should pull in
from elsewhere in the cluster. Any node that contains data must ensure
that data gets to where it belongs. The ring handles replica placement.
To replicate deletions in addition to creations, every deleted record or
file in the system is marked by a tombstone. The replication process
cleans up tombstones after a time period known as the *consistency
window*. This window defines the duration of replication and how long
a transient failure can remove a node from the cluster. Tombstone
cleanup must be tied to replication to reach replica convergence.
If a replicator detects that a remote drive has failed, the replicator
uses the ``get_more_nodes`` interface for the ring to choose an
alternate node with which to synchronize. The replicator can maintain
desired levels of replication during disk failures, though some replicas
might not be in an immediately usable location.
.. note::
The replicator does not maintain desired levels of replication when
failures such as entire node failures occur; most failures are
transient.
The main replication types are:
- Database replication
Replicates containers and objects.
- Object replication
Replicates object data.
Database replication
~~~~~~~~~~~~~~~~~~~~
Database replication completes a low-cost hash comparison to determine
whether two replicas already match. Normally, this check can quickly
verify that most databases in the system are already synchronized. If
the hashes differ, the replicator synchronizes the databases by sharing
records added since the last synchronization point.
This synchronization point is a high water mark that notes the last
record at which two databases were known to be synchronized, and is
stored in each database as a tuple of the remote database ID and record
ID. Database IDs are unique across all replicas of the database, and
record IDs are monotonically increasing integers. After all new records
are pushed to the remote database, the entire synchronization table of
the local database is pushed, so the remote database can guarantee that
it is synchronized with everything with which the local database was
previously synchronized.
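A simplified sketch of this flow follows. It is illustrative pseudocode;
the database objects and their methods are hypothetical, not Swift's
internal API::

   def replicate_db(local_db, remote_db):
       # Cheap hash comparison first: identical hashes mean no work is needed.
       if local_db.content_hash() == remote_db.content_hash():
           return
       # High water mark: the last local record the remote is known to have.
       sync_point = remote_db.get_sync_point(local_db.id)
       # Push only the records added since that point.
       remote_db.merge_records(local_db.records_since(sync_point))
       # Push the local sync table so the remote also learns what the local
       # database has already synchronized with other peers.
       remote_db.merge_sync_table(local_db.sync_table())
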
If a replica is missing, the whole local database file is transmitted to
the peer by using rsync(1) and is assigned a new unique ID.
In practice, database replication can process hundreds of databases per
concurrency setting per second (up to the number of available CPUs or
disks) and is bound by the number of database transactions that must be
performed.
Object replication
~~~~~~~~~~~~~~~~~~
The initial implementation of object replication performed an rsync to
push data from a local partition to all remote servers where it was
expected to reside. While this worked at small scale, replication times
skyrocketed once directory structures could no longer be held in RAM.
This scheme was modified to save a hash of the contents for each suffix
directory to a per-partition hashes file. The hash for a suffix
directory is no longer valid when the contents of that suffix directory
are modified.
The object replication process reads in hash files and calculates any
invalidated hashes. Then, it transmits the hashes to each remote server
that should hold the partition, and only suffix directories with
differing hashes on the remote server are rsynced. After pushing files
to the remote server, the replication process notifies it to recalculate
hashes for the rsynced suffix directories.
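A simplified sketch of one pass over a partition follows. The helper
functions are hypothetical and stand in for Swift's replicator and rsync
plumbing::

   def replicate_partition(partition_path, local_hashes, remote_node):
       # Ask the remote node for its per-suffix hashes of this partition.
       remote_hashes = get_remote_hashes(remote_node, partition_path)
       pushed = []
       for suffix, local_hash in local_hashes.items():
           if remote_hashes.get(suffix) != local_hash:
               # Only suffix directories whose hashes differ are rsynced.
               rsync_suffix(partition_path, suffix, remote_node)
               pushed.append(suffix)
       if pushed:
           # Tell the remote node to recalculate hashes for what was pushed.
           recalculate_remote_hashes(remote_node, partition_path, pushed)
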
The number of uncached directories that object replication must
traverse, usually as a result of invalidated suffix directory hashes,
impedes performance. To provide acceptable replication speeds, object
replication is designed to invalidate around 2 percent of the hash space
on a normal node each day.


@@ -0,0 +1,181 @@
============
Ring-builder
============
Use the ``swift-ring-builder`` utility to build and manage rings. This
utility assigns partitions to devices and writes an optimized Python
structure to a gzipped, serialized file on disk for transmission to the
servers. The server processes occasionally check the modification time
of the file and reload in-memory copies of the ring structure as needed.
If you use a slightly older version of the ring, one of the three
replicas for a partition subset will be incorrect because of the way the
ring-builder manages changes to the ring. You can work around this
issue.
The ring-builder also keeps its own builder file with the ring
information and additional data required to build future rings. It is
very important to keep multiple backup copies of these builder files.
One option is to copy the builder files out to every server while
copying the ring files themselves. Another is to upload the builder
files into the cluster itself. If you lose the builder file, you have to
create a new ring from scratch. Nearly all partitions would be assigned
to different devices and, therefore, nearly all of the stored data would
have to be replicated to new locations. So, recovery from a builder file
loss is possible, but data would be unreachable for an extended time.
Ring data structure
~~~~~~~~~~~~~~~~~~~
The ring data structure consists of three top-level fields: a list of
devices in the cluster, a list of lists of device ids indicating
partition to device assignments, and an integer indicating the number of
bits to shift an MD5 hash to calculate the partition for the hash.
Partition assignment list
~~~~~~~~~~~~~~~~~~~~~~~~~
This is a list of ``array('H')`` of device ids. The outermost list
contains an ``array('H')`` for each replica. Each ``array('H')`` has a
length equal to the partition count for the ring. Each integer in the
``array('H')`` is an index into the above list of devices. The partition
list is known internally to the Ring class as ``_replica2part2dev_id``.
So, to create a list of device dictionaries assigned to a partition, the
Python code would look like::
   devices = [self.devs[part2dev_id[partition]] for
              part2dev_id in self._replica2part2dev_id]
That code is a little simplistic because it does not account for the
removal of duplicate devices. If a ring has more replicas than devices,
a partition will have more than one replica on a device.
``array('H')`` is used for memory conservation as there may be millions
of partitions.
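Because ``'H'`` entries are two bytes on typical platforms, the size of the
partition assignment list is easy to estimate. The figures below are
illustrative::

   part_power = 20
   replicas = 3
   # One two-byte entry per partition per replica:
   table_bytes = (2 ** part_power) * 2 * replicas   # 6,291,456 bytes, about 6 MiB
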
Replica counts
~~~~~~~~~~~~~~
To support the gradual change in replica counts, a ring can have a real
number of replicas and is not restricted to an integer number of
replicas.
A fractional replica count is for the whole ring and not for individual
partitions. It indicates the average number of replicas for each
partition. For example, a replica count of 3.2 means that 20 percent of
partitions have four replicas and 80 percent have three replicas.
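As a quick back-of-the-envelope check (the partition count is arbitrary,
and the exact rounding Swift applies may differ)::

   part_count = 1024
   replica_count = 3.2
   extra = round((replica_count % 1) * part_count)   # about 205 partitions get a 4th replica
   base = part_count - extra                         # about 819 partitions keep 3 replicas
   average = (4 * extra + 3 * base) / float(part_count)   # roughly 3.2
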
The replica count is adjustable.
Example::
$ swift-ring-builder account.builder set_replicas 4
$ swift-ring-builder account.builder rebalance
You must rebalance the replica ring in globally distributed clusters.
Operators of these clusters generally want an equal number of replicas
and regions. Therefore, when an operator adds or removes a region, the
operator adds or removes a replica. Removing unneeded replicas saves on
the cost of disks.
You can gradually increase the replica count at a rate that does not
adversely affect cluster performance.
For example::
$ swift-ring-builder object.builder set_replicas 3.01
$ swift-ring-builder object.builder rebalance
<distribute rings and wait>...
$ swift-ring-builder object.builder set_replicas 3.02
$ swift-ring-builder object.builder rebalance
<distribute rings and wait>...
Changes take effect after the ring is rebalanced. Therefore, if you
intend to change from 3 replicas to 3.01 but you accidentally type
2.01, no data is lost.
Additionally, the ``swift-ring-builder X.builder create`` command can now
take a decimal argument for the number of replicas.
Partition shift value
~~~~~~~~~~~~~~~~~~~~~
The partition shift value is known internally to the Ring class as
``_part_shift``. This value is used to shift an MD5 hash to calculate
the partition where the data for that hash should reside. Only the top
four bytes of the hash are used in this process. For example, to compute
the partition for the :file:`/account/container/object` path using Python::
   from hashlib import md5
   from struct import unpack_from

   partition = unpack_from(
       '>I', md5('/account/container/object').digest())[0] >> self._part_shift
For a ring generated with part\_power P, the partition shift value is
``32 - P``.
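For example, with illustrative values::

   part_power = 16
   part_shift = 32 - part_power       # 16
   partition_count = 2 ** part_power  # 65,536 partitions
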
Build the ring
~~~~~~~~~~~~~~
The ring builder process includes these high-level steps:
#. The utility calculates the number of partitions to assign to each
device based on the weight of the device. For example, for a
partition power of 20, the ring has 1,048,576 partitions. One
thousand devices of equal weight each want 1,048.576 partitions. The
devices are sorted by the number of partitions they desire and kept
in order throughout the initialization process.
.. note::
Each device is also assigned a random tiebreaker value that is
used when two devices desire the same number of partitions. This
tiebreaker is not stored on disk anywhere, and so two different
rings created with the same parameters will have different
partition assignments. For repeatable partition assignments,
``RingBuilder.rebalance()`` takes an optional seed value that
seeds the Python pseudo-random number generator.
#. The ring builder assigns each partition replica to the device that
requires the most partitions at that point while keeping it as far away
as possible from other replicas. The ring builder prefers to assign a
replica to a device in a region that does not already have a replica.
If no such region is available, the ring builder searches for a
device in a different zone, or on a different server. If it does not
find one, it looks for a device with no replicas. Finally, if all
options are exhausted, the ring builder assigns the replica to the
device that has the fewest replicas already assigned.
.. note::
The ring builder assigns multiple replicas to one device only if
the ring has fewer devices than it has replicas.
#. When building a new ring from an old ring, the ring builder
recalculates the desired number of partitions that each device wants.
#. The ring builder unassigns partitions and gathers these partitions
for reassignment, as follows:
- The ring builder unassigns any assigned partitions from any
removed devices and adds these partitions to the gathered list.
- The ring builder unassigns any partition replicas that can be
spread out for better durability and adds these partitions to the
gathered list.
- The ring builder unassigns random partitions from any devices that
have more partitions than they need and adds these partitions to
the gathered list.
#. The ring builder reassigns the gathered partitions to devices by
using a similar method to the one described previously.
#. When the ring builder reassigns a replica to a partition, the ring
builder records the time of the reassignment. The ring builder uses
this value when it gathers partitions for reassignment so that no
partition is moved twice in a configurable amount of time. The
RingBuilder class knows this configurable amount of time as
``min_part_hours``. The ring builder ignores this restriction for
replicas of partitions on removed devices because removal of a device
happens on device failure only, and reassignment is the only choice.
These steps do not always perfectly rebalance a ring due to the random
nature of gathering partitions for reassignment. To help reach a more
balanced ring, the rebalance process is repeated until near perfect
(less than 1 percent off) or when the balance does not improve by at
least 1 percent (indicating we probably cannot get perfect balance due
to wildly imbalanced zones or too many partitions recently moved).
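The convergence loop can be sketched as follows. This is illustrative only;
it assumes a builder object with ``rebalance()`` and ``get_balance()``
methods and is not the swift-ring-builder implementation::

   def rebalance_until_stable(builder, seed=None):
       last_balance = builder.get_balance()
       while True:
           builder.rebalance(seed=seed)
           balance = builder.get_balance()
           # Stop at near-perfect balance, or when a pass improves the
           # balance by less than one percent.
           if balance < 1 or last_balance - balance < 1:
               break
           last_balance = balance
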


@@ -0,0 +1,31 @@
=============================================================
Configure tenant-specific image locations with Object Storage
=============================================================
For some deployers, it is not ideal to store all images in one place to
enable all tenants and users to access them. You can configure the Image
service to store image data in tenant-specific image locations. Then,
only the following tenants can use the Image service to access the
created image:
- The tenant who owns the image
- Tenants that are defined in ``swift_store_admin_tenants`` and that
have admin-level accounts
**To configure tenant-specific image locations**
#. Configure swift as your ``default_store`` in the :file:`glance-api.conf` file.
#. Set these configuration options in the :file:`glance-api.conf` file:
- swift_store_multi_tenant
Set to ``True`` to enable tenant-specific storage locations.
Default is ``False``.
- swift_store_admin_tenants
Specify a list of tenant IDs that can grant read and write access to all
Object Storage containers that are created by the Image service.
With this configuration, images are stored in an Object Storage service
(swift) endpoint that is pulled from the service catalog for the
authenticated user.
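Putting it together, the relevant settings in the :file:`glance-api.conf`
file might look like the following sketch. The tenant IDs are placeholders,
and the section in which these options live depends on the Glance release::

   default_store = swift
   swift_store_multi_tenant = True
   swift_store_admin_tenants = <tenant-id-1>,<tenant-id-2>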