Convert objectstorage files to RST

Cloud Admin Guide files converted:

  objectstorage-account-reaper.rst
  objectstorage-arch.rst
  objectstorage-replication.rst
  objectstorage-ringbuilder.rst
  objectstorage-tenant-specific-image-storage.rst

Change-Id: I9d6416d4dfdf3bd71d59eef64b9ba3af07e15a8a
Implements: blueprint reorganise-user-guides

parent eb85580cc4
commit 45fb0a6a86
BIN  doc/admin-guide-cloud-rst/source/figures/objectstorage-arch.png   (new file, 56 KiB, binary file not shown)
BIN  doc/admin-guide-cloud-rst/source/figures/objectstorage-nodes.png  (new file, 58 KiB, binary file not shown)
@@ -2,9 +2,6 @@
 Object Storage
 ==============

-Contents
-~~~~~~~~
-
 .. toctree::
    :maxdepth: 2

@@ -12,13 +9,13 @@ Contents
    objectstorage_features.rst
    objectstorage_characteristics.rst
    objectstorage_components.rst
-   objectstorage-monitoring.rst
-   objectstorage-admin.rst
-
-.. TODO (karenb)
    objectstorage_ringbuilder.rst
    objectstorage_arch.rst
    objectstorage_replication.rst
    objectstorage_account_reaper.rst
    objectstorage_tenant_specific_image_storage.rst
+   objectstorage-monitoring.rst
+   objectstorage-admin.rst
+
+.. TODO (karenb)
    objectstorage_troubleshoot.rst
@@ -0,0 +1,50 @@
==============
Account reaper
==============

In the background, the account reaper removes data from the deleted
accounts.

A reseller marks an account for deletion by issuing a ``DELETE`` request
on the account's storage URL. This action sets the ``status`` column of
the account\_stat table in the account database and replicas to
``DELETED``, marking the account's data for deletion.

Typically, no specific retention time or undelete feature is provided.
However, you can set a ``delay_reaping`` value in the
``[account-reaper]`` section of the :file:`account-server.conf` file to
delay the actual deletion of data. At this time, to undelete you have to
update the account database replicas directly, setting the status column
to an empty string and updating the put\_timestamp to be greater than
the delete\_timestamp.

.. note::

   It is on the development to-do list to write a utility that performs
   this task, preferably through a REST call.

The account reaper runs on each account server and occasionally scans
the server for account databases marked for deletion. It only fires up
on the accounts for which the server is the primary node, so that
multiple account servers are not trying to reap the same account
simultaneously. Using multiple servers to delete one account might
improve the deletion speed but requires coordination to avoid
duplication. Speed is not a major concern with data deletion, and large
accounts are not deleted often.

Deleting an account is simple: for each container in the account, all
objects are deleted and then the container is deleted. Deletion requests
that fail do not stop the overall process but cause the overall process
to fail eventually (for example, if an object delete times out, you
cannot delete the container or the account). The account reaper keeps
trying to delete an account until it is empty, at which point the
database reclaim process within the db\_replicator removes the database
files.
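
The overall flow can be pictured with a short sketch. This is schematic
Python only, not Swift's actual reaper code; ``account_client`` and its
methods are hypothetical stand-ins for the real internal calls::

    def reap_account(account_client):
        """Empty one deleted account: objects first, then containers,
        then the account itself (schematic only)."""
        for container in account_client.containers():
            for obj in account_client.objects(container):
                # A failed or timed-out delete does not stop this pass;
                # the reaper simply retries on a later run.
                account_client.delete_object(container, obj)
            account_client.delete_container(container)
        # Succeeds only once the account is completely empty.
        account_client.delete_account()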

A persistent error state may prevent the deletion of an object or
container. If this happens, you will see a message in the log, for
example::

    "Account <name> has not been reaped since <date>"

You can control when this is logged with the ``reap_warn_after`` value
in the ``[account-reaper]`` section of the :file:`account-server.conf`
file. The default value is 30 days.
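
Both options live in the ``[account-reaper]`` section of
:file:`account-server.conf` and are expressed in seconds. A minimal
sketch, with illustrative values rather than recommendations::

    [account-reaper]
    # Wait one week after an account is marked DELETED before reaping it.
    delay_reaping = 604800
    # Warn in the log if an account still has not been reaped 30 days
    # after its deletion was requested (the default).
    reap_warn_after = 2592000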

81   doc/admin-guide-cloud-rst/source/objectstorage_arch.rst   (new file)
@@ -0,0 +1,81 @@
====================
Cluster architecture
====================

Access tier
~~~~~~~~~~~

Large-scale deployments segment off an access tier, which is considered
the Object Storage system's central hub. The access tier fields the
incoming API requests from clients and moves data in and out of the
system. This tier consists of front-end load balancers, SSL terminators,
and authentication services. It runs the (distributed) brain of the
Object Storage system: the proxy server processes.

**Object Storage architecture**

.. image:: figures/objectstorage-arch.png

Because access servers are collocated in their own tier, you can scale
out read/write access regardless of the storage capacity. For example,
if a cluster is on the public Internet, requires SSL termination, and
has a high demand for data access, you can provision many access
servers. However, if the cluster is on a private network and used
primarily for archival purposes, you need fewer access servers.

Since this is an HTTP-addressable storage service, you may incorporate a
load balancer into the access tier.

Typically, the tier consists of a collection of 1U servers. These
machines use a moderate amount of RAM and are network I/O intensive.
Since these systems field each incoming API request, you should
provision them with two high-throughput (10GbE) interfaces: one for the
incoming "front-end" requests and the other for the "back-end" access to
the object storage nodes to put and fetch data.

Factors to consider
-------------------

For most publicly facing deployments, as well as private deployments
available across a wide-reaching corporate network, you use SSL to
encrypt traffic to the client. SSL adds significant processing load to
establish sessions between clients, which is why you have to provision
more capacity in the access layer. SSL may not be required for private
deployments on trusted networks.

Storage nodes
~~~~~~~~~~~~~

In most configurations, each of the five zones should have an equal
amount of storage capacity. Storage nodes use a reasonable amount of
memory and CPU. Metadata needs to be readily available to return objects
quickly. The object stores run services not only to field incoming
requests from the access tier, but also to run replicators, auditors,
and reapers. You can provision object stores with a single-gigabit or
10-gigabit network interface depending on the expected workload and
desired performance.

**Object Storage (swift)**

.. image:: figures/objectstorage-nodes.png

Currently, a 2 TB or 3 TB SATA disk delivers good performance for the
price. You can use desktop-grade drives if you have responsive remote
hands in the datacenter and enterprise-grade drives if you don't.

Factors to consider
-------------------

You should keep in mind the desired I/O performance for single-threaded
requests. This system does not use RAID, so a single disk handles each
request for an object. Disk performance impacts single-threaded response
rates.

To achieve apparent higher throughput, the object storage system is
designed to handle concurrent uploads/downloads. The network I/O
capacity (1GbE, bonded 1GbE pair, or 10GbE) should match your desired
concurrent throughput needs for reads and writes.

@@ -0,0 +1,96 @@
===========
Replication
===========

Because each replica in Object Storage functions independently and
clients generally require only a simple majority of nodes to respond to
consider an operation successful, transient failures like network
partitions can quickly cause replicas to diverge. These differences are
eventually reconciled by asynchronous, peer-to-peer replicator
processes. The replicator processes traverse their local file systems
and concurrently perform operations in a manner that balances load
across physical disks.

Replication uses a push model, with records and files generally only
being copied from local to remote replicas. This is important because
data on the node might not belong there (as in the case of handoffs and
ring changes), and a replicator cannot know which data it should pull in
from elsewhere in the cluster. Any node that contains data must ensure
that data gets to where it belongs. The ring handles replica placement.

To replicate deletions in addition to creations, every deleted record or
file in the system is marked by a tombstone. The replication process
cleans up tombstones after a time period known as the *consistency
window*. This window defines the duration of the replication and how
long a transient failure can remove a node from the cluster. Tombstone
cleanup must be tied to replication to reach replica convergence.

If a replicator detects that a remote drive has failed, the replicator
uses the ``get_more_nodes`` interface of the ring to choose an
alternate node with which to synchronize. The replicator can maintain
desired levels of replication during disk failures, though some replicas
might not be in an immediately usable location.

.. note::

   The replicator does not maintain desired levels of replication when
   failures such as entire node failures occur; most failures are
   transient.

The main replication types are:

- Database replication
  Replicates containers and objects.

- Object replication
  Replicates object data.

Database replication
~~~~~~~~~~~~~~~~~~~~

Database replication completes a low-cost hash comparison to determine
whether two replicas already match. Normally, this check can quickly
verify that most databases in the system are already synchronized. If
the hashes differ, the replicator synchronizes the databases by sharing
records added since the last synchronization point.

This synchronization point is a high water mark that notes the last
record at which two databases were known to be synchronized, and is
stored in each database as a tuple of the remote database ID and record
ID. Database IDs are unique across all replicas of the database, and
record IDs are monotonically increasing integers. After all new records
are pushed to the remote database, the entire synchronization table of
the local database is pushed, so the remote database can guarantee that
it is synchronized with everything with which the local database was
previously synchronized.
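
The mechanism can be summarized with a short sketch. This is schematic
Python only, not Swift's actual database replicator; the methods on
``local`` and ``remote`` are hypothetical stand-ins for the real broker
calls::

    def replicate_db(local, remote):
        """One replication pass from a local database replica to a peer."""
        if local.content_hash() == remote.content_hash():
            return  # replicas already match; nothing to push

        # High-water mark: the last local record id the remote is known
        # to have merged.
        sync_point = remote.get_sync_point(local.db_id)

        # Push only the records added since that synchronization point.
        remote.merge_records(local.records_since(sync_point))

        # Push the local sync table too, so the remote also knows about
        # everything the local replica has synchronized with other peers.
        remote.merge_syncs(local.sync_table())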

If a replica is missing, the whole local database file is transmitted to
the peer by using rsync(1) and is assigned a new unique ID.

In practice, database replication can process hundreds of databases per
concurrency setting per second (up to the number of available CPUs or
disks) and is bound by the number of database transactions that must be
performed.

Object replication
~~~~~~~~~~~~~~~~~~

The initial implementation of object replication performed an rsync to
push data from a local partition to all remote servers where it was
expected to reside. While this worked at small scale, replication times
skyrocketed once directory structures could no longer be held in RAM.
This scheme was modified to save a hash of the contents of each suffix
directory to a per-partition hashes file. The hash for a suffix
directory is no longer valid when the contents of that suffix directory
are modified.

The object replication process reads in hash files and calculates any
invalidated hashes. Then, it transmits the hashes to each remote server
that should hold the partition, and only suffix directories with
differing hashes on the remote server are rsynced. After pushing files
to the remote server, the replication process notifies it to recalculate
hashes for the rsynced suffix directories.
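
The per-partition comparison can be sketched as follows. This is
schematic Python only, not Swift's actual object replicator; the
``remote`` object and ``rsync_suffixes`` callable are hypothetical
stand-ins for the real replicator plumbing::

    def replicate_partition(partition, local_hashes, remote, rsync_suffixes):
        """Push only the suffix directories that differ on one remote node.

        ``local_hashes`` maps suffix directory -> content hash.
        """
        remote_hashes = remote.get_hashes(partition)

        out_of_sync = [suffix for suffix, digest in local_hashes.items()
                       if remote_hashes.get(suffix) != digest]
        if not out_of_sync:
            return  # this node already agrees with the remote copy

        rsync_suffixes(partition, out_of_sync, remote)

        # Ask the remote node to recalculate the hashes for the suffix
        # directories that were just pushed.
        remote.recalculate_hashes(partition, out_of_sync)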

The number of uncached directories that object replication must
traverse, usually as a result of invalidated suffix directory hashes,
impedes performance. To provide acceptable replication speeds, object
replication is designed to invalidate around 2 percent of the hash space
on a normal node each day.

181  doc/admin-guide-cloud-rst/source/objectstorage_ringbuilder.rst   (new file)
@@ -0,0 +1,181 @@
============
Ring-builder
============

Use the swift-ring-builder utility to build and manage rings. This
utility assigns partitions to devices and writes an optimized Python
structure to a gzipped, serialized file on disk for transmission to the
servers. The server processes occasionally check the modification time
of the file and reload in-memory copies of the ring structure as needed.
If you use a slightly older version of the ring, one of the three
replicas for a partition subset will be incorrect because of the way the
ring-builder manages changes to the ring. You can work around this
issue.

The ring-builder also keeps its own builder file with the ring
information and additional data required to build future rings. It is
very important to keep multiple backup copies of these builder files.
One option is to copy the builder files out to every server while
copying the ring files themselves. Another is to upload the builder
files into the cluster itself. If you lose the builder file, you have to
create a new ring from scratch. Nearly all partitions would be assigned
to different devices and, therefore, nearly all of the stored data would
have to be replicated to new locations. So, recovery from a builder file
loss is possible, but data would be unreachable for an extended time.

Ring data structure
~~~~~~~~~~~~~~~~~~~

The ring data structure consists of three top level fields: a list of
devices in the cluster, a list of lists of device ids indicating
partition-to-device assignments, and an integer indicating the number of
bits to shift an MD5 hash to calculate the partition for the hash.

Partition assignment list
~~~~~~~~~~~~~~~~~~~~~~~~~

This is a list of ``array('H')`` of device ids. The outermost list
contains an ``array('H')`` for each replica. Each ``array('H')`` has a
length equal to the partition count for the ring. Each integer in the
``array('H')`` is an index into the above list of devices. The partition
list is known internally to the Ring class as ``_replica2part2dev_id``.

So, to create a list of device dictionaries assigned to a partition, the
Python code would look like::

    devices = [self.devs[part2dev_id[partition]]
               for part2dev_id in self._replica2part2dev_id]

That code is a little simplistic because it does not account for the
removal of duplicate devices. If a ring has more replicas than devices,
a partition will have more than one replica on a device.

``array('H')`` is used for memory conservation, as there may be millions
of partitions.

Replica counts
~~~~~~~~~~~~~~

To support the gradual change in replica counts, a ring can have a real
number of replicas and is not restricted to an integer number of
replicas.

A fractional replica count applies to the whole ring, not to individual
partitions. It indicates the average number of replicas for each
partition. For example, a replica count of 3.2 means that 20 percent of
partitions have four replicas and 80 percent have three replicas.

The replica count is adjustable.

Example::

    $ swift-ring-builder account.builder set_replicas 4
    $ swift-ring-builder account.builder rebalance

You must rebalance the replica ring in globally distributed clusters.
Operators of these clusters generally want an equal number of replicas
and regions. Therefore, when an operator adds or removes a region, the
operator adds or removes a replica. Removing unneeded replicas saves on
the cost of disks.

You can gradually increase the replica count at a rate that does not
adversely affect cluster performance.

For example::

    $ swift-ring-builder object.builder set_replicas 3.01
    $ swift-ring-builder object.builder rebalance
    <distribute rings and wait>...

    $ swift-ring-builder object.builder set_replicas 3.02
    $ swift-ring-builder object.builder rebalance
    <distribute rings and wait>...

Changes take effect after the ring is rebalanced. Therefore, if you
intend to change from 3 replicas to 3.01 but you accidentally type
2.01, no data is lost.

Additionally, the ``swift-ring-builder X.builder create`` command can now
take a decimal argument for the number of replicas.
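
For example, to create a hypothetical object ring with a partition power
of 20, 3.5 replicas, and a ``min_part_hours`` of 1 (illustrative values,
not recommendations)::

    $ swift-ring-builder object.builder create 20 3.5 1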

Partition shift value
~~~~~~~~~~~~~~~~~~~~~

The partition shift value is known internally to the Ring class as
``_part_shift``. This value is used to shift an MD5 hash to calculate
the partition where the data for that hash should reside. Only the top
four bytes of the hash are used in this process. For example, to compute
the partition for the :file:`/account/container/object` path using
Python::

    partition = unpack_from('>I',
        md5('/account/container/object').digest())[0] >> self._part_shift

For a ring generated with part\_power P, the partition shift value is
``32 - P``.
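
As a worked example (values chosen only for illustration), a ring built
with a partition power of 20 has ``2**20 = 1,048,576`` partitions and a
partition shift value of ``32 - 20 = 12``, so the top 20 bits of the
hash select the partition.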

Build the ring
~~~~~~~~~~~~~~

The ring builder process includes these high-level steps:

#. The utility calculates the number of partitions to assign to each
   device based on the weight of the device. For example, for a
   partition power of 20, the ring has 1,048,576 partitions. One
   thousand devices of equal weight each want 1,048.576 partitions
   (1,048,576 / 1,000). The devices are sorted by the number of
   partitions they desire and kept in order throughout the
   initialization process.

   .. note::

      Each device is also assigned a random tiebreaker value that is
      used when two devices desire the same number of partitions. This
      tiebreaker is not stored on disk anywhere, and so two different
      rings created with the same parameters will have different
      partition assignments. For repeatable partition assignments,
      ``RingBuilder.rebalance()`` takes an optional seed value that
      seeds the Python pseudo-random number generator.

#. The ring builder assigns each partition replica to the device that
   requires the most partitions at that point while keeping it as far
   away as possible from other replicas. The ring builder prefers to
   assign a replica to a device in a region that does not already have a
   replica. If no such region is available, the ring builder searches
   for a device in a different zone, or on a different server. If it
   does not find one, it looks for a device with no replicas. Finally,
   if all options are exhausted, the ring builder assigns the replica to
   the device that has the fewest replicas already assigned.

   .. note::

      The ring builder assigns multiple replicas to one device only if
      the ring has fewer devices than it has replicas.

#. When building a new ring from an old ring, the ring builder
   recalculates the desired number of partitions that each device wants.

#. The ring builder unassigns partitions and gathers these partitions
   for reassignment, as follows:

   - The ring builder unassigns any assigned partitions from any
     removed devices and adds these partitions to the gathered list.
   - The ring builder unassigns any partition replicas that can be
     spread out for better durability and adds these partitions to the
     gathered list.
   - The ring builder unassigns random partitions from any devices that
     have more partitions than they need and adds these partitions to
     the gathered list.

#. The ring builder reassigns the gathered partitions to devices by
   using a similar method to the one described previously.

#. When the ring builder reassigns a replica to a partition, the ring
   builder records the time of the reassignment. The ring builder uses
   this value when it gathers partitions for reassignment so that no
   partition is moved twice in a configurable amount of time. The
   RingBuilder class knows this configurable amount of time as
   ``min_part_hours``. The ring builder ignores this restriction for
   replicas of partitions on removed devices because removal of a device
   happens only on device failure, and reassignment is the only choice.

These steps do not always perfectly rebalance a ring due to the random
nature of gathering partitions for reassignment. To help reach a more
balanced ring, the rebalance process is repeated until it is nearly
perfect (less than 1 percent off) or until the balance does not improve
by at least 1 percent (indicating that a perfect balance probably cannot
be reached due to wildly imbalanced zones or too many recently moved
partitions).

@@ -0,0 +1,31 @@
=============================================================
Configure tenant-specific image locations with Object Storage
=============================================================

For some deployers, it is not ideal to store all images in one place to
enable all tenants and users to access them. You can configure the Image
service to store image data in tenant-specific image locations. Then,
only the following tenants can use the Image service to access the
created image:

- The tenant who owns the image
- Tenants that are defined in ``swift_store_admin_tenants`` and that
  have admin-level accounts

**To configure tenant-specific image locations**

#. Configure ``swift`` as your ``default_store`` in the
   :file:`glance-api.conf` file.

#. Set these configuration options in the :file:`glance-api.conf` file:

   - ``swift_store_multi_tenant``
     Set to ``True`` to enable tenant-specific storage locations.
     Default is ``False``.

   - ``swift_store_admin_tenants``
     Specify a list of tenant IDs that can grant read and write access
     to all Object Storage containers that are created by the Image
     service.

With this configuration, images are stored in an Object Storage service
(swift) endpoint that is pulled from the service catalog for the
authenticated user.
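
Putting the two steps together, a minimal sketch of the relevant
:file:`glance-api.conf` settings might look like the following. The
tenant IDs are placeholders, and the configuration section that these
options belong to depends on your Glance release, so it is omitted
here::

    default_store = swift
    # Store each tenant's images in that tenant's own Object Storage account.
    swift_store_multi_tenant = True
    # Tenants whose members are granted admin access to the image
    # containers created by the Image service.
    swift_store_admin_tenants = <tenant-id-1>,<tenant-id-2>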