Convert objectstorage files to RST

Cloud Admin Guide files converted:

objectstorage-account-reaper.rst
objectstorage-arch.rst
objectstorage-replication.rst
objectstorage-ringbuilder.rst
objectstorage-tenant-specific-image-storage.rst

Change-Id: I9d6416d4dfdf3bd71d59eef64b9ba3af07e15a8a
Implements: blueprint reorganise-user-guides
Brian Moss 2015-06-11 12:06:11 +10:00
parent eb85580cc4
commit 45fb0a6a86
8 changed files with 443 additions and 7 deletions

Two binary image files added (56 KiB and 58 KiB); previews not shown.


@@ -2,9 +2,6 @@
Object Storage
==============
Contents
~~~~~~~~
.. toctree::
:maxdepth: 2
@@ -12,13 +9,13 @@ Contents
objectstorage_features.rst
objectstorage_characteristics.rst
objectstorage_components.rst
objectstorage-monitoring.rst
objectstorage-admin.rst
.. TODO (karenb)
objectstorage_ringbuilder.rst
objectstorage_arch.rst
objectstorage_replication.rst
objectstorage_account_reaper.rst
objectstorage_tenant_specific_image_storage.rst
objectstorage-monitoring.rst
objectstorage-admin.rst
.. TODO (karenb)
objectstorage_troubleshoot.rst


@@ -0,0 +1,50 @@
==============
Account reaper
==============
The account reaper removes data from deleted accounts in the background.
A reseller marks an account for deletion by issuing a ``DELETE`` request
on the account's storage URL. This action sets the ``status`` column of
the account\_stat table in the account database and replicas to
``DELETED``, marking the account's data for deletion.
Typically, no specific retention time or undelete capability is provided.
However, you can set a ``delay_reaping`` value in the
``[account-reaper]`` section of the :file:`account-server.conf` file to
delay the actual deletion of data. At this time, to undelete an account you
must update its account database replicas directly, setting the ``status``
column to an empty string and setting the put\_timestamp to a value greater
than the delete\_timestamp.
.. note::
It is on the development to-do list to write a utility that performs
this task, preferably through a REST call.
The account reaper runs on each account server and periodically scans the
server for account databases marked for deletion. It processes only the
accounts for which the server is the primary node, so that multiple account
servers do not try to reap the same account simultaneously. Using multiple
servers to delete one account might improve deletion speed but would require
coordination to avoid duplication. Speed is not a major concern with data
deletion, and large accounts are not deleted often.
Deleting an account is simple. For each account container, all objects
are deleted and then the container is deleted. Deletion requests that
fail do not stop the overall process, but they cause the overall
process to fail eventually (for example, if an object delete times out,
the container and the account cannot be deleted). The
account reaper keeps trying to delete an account until it is empty, at
which point the database reclaim process within the db\_replicator will
remove the database files.
A persistent error state may prevent the deletion of an object or
container. If this happens, you will see a message in the log, for example::
"Account <name> has not been reaped since <date>"
You can control when this is logged with the ``reap_warn_after`` value in the
``[account-reaper]`` section of the :file:`account-server.conf` file.
The default value is 30 days.
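For example, a minimal ``[account-reaper]`` section of :file:`account-server.conf` that delays reaping by two days might look like the following sketch. Both values are in seconds, and the numbers shown are illustrative::

   [account-reaper]
   # Wait two days after an account is marked DELETED before reaping it.
   delay_reaping = 172800
   # Warn in the log if an account has not been reaped after 30 days (the default).
   reap_warn_after = 2592000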


@@ -0,0 +1,81 @@
====================
Cluster architecture
====================
Access tier
~~~~~~~~~~~
Large-scale deployments segment off an access tier, which is considered
the Object Storage system's central hub. The access tier fields the
incoming API requests from clients and moves data in and out of the
system. This tier consists of front-end load balancers, SSL terminators,
and authentication services. It runs the (distributed) brain of the
Object Storage system: the proxy server processes.
**Object Storage architecture**
|
.. image:: figures/objectstorage-arch.png
|
Because access servers are collocated in their own tier, you can scale
out read/write access regardless of the storage capacity. For example,
if a cluster is on the public Internet, requires SSL termination, and
has a high demand for data access, you can provision many access
servers. However, if the cluster is on a private network and used
primarily for archival purposes, you need fewer access servers.
Since this is an HTTP addressable storage service, you may incorporate a
load balancer into the access tier.
Typically, the tier consists of a collection of 1U servers. These
machines use a moderate amount of RAM and are network I/O intensive.
Since these systems field each incoming API request, you should
provision them with two high-throughput (10GbE) interfaces - one for the
incoming "front-end" requests and the other for the "back-end" access to
the object storage nodes to put and fetch data.
Factors to consider
-------------------
For most publicly facing deployments as well as private deployments
available across a wide-reaching corporate network, you use SSL to
encrypt traffic to the client. SSL adds significant processing load to
establish sessions between clients, which is why you have to provision
more capacity in the access layer. SSL may not be required for private
deployments on trusted networks.
Storage nodes
~~~~~~~~~~~~~
In most configurations, each of the five zones should have an equal
amount of storage capacity. Storage nodes use a reasonable amount of
memory and CPU. Metadata needs to be readily available to return objects
quickly. The object stores run services not only to field incoming
requests from the access tier, but to also run replicators, auditors,
and reapers. You can provision object stores with a single gigabit or
10 gigabit network interface, depending on the expected workload and
desired performance.
**Object Storage (swift)**
|
.. image:: figures/objectstorage-nodes.png
|
Currently, a 2 TB or 3 TB SATA disk delivers good performance for the
price. You can use desktop-grade drives if you have responsive remote
hands in the datacenter and enterprise-grade drives if you don't.
Factors to consider
-------------------
You should keep in mind the desired I/O performance for single-threaded
requests. This system does not use RAID, so a single disk handles each
request for an object. Disk performance impacts single-threaded response
rates.
To achieve apparent higher throughput, the object storage system is
designed to handle concurrent uploads/downloads. The network I/O
capacity (1GbE, bonded 1GbE pair, or 10GbE) should match your desired
concurrent throughput needs for reads and writes.


@@ -0,0 +1,96 @@
===========
Replication
===========
Because each replica in Object Storage functions independently and
clients generally require only a simple majority of nodes to respond to
consider an operation successful, transient failures like network
partitions can quickly cause replicas to diverge. These differences are
eventually reconciled by asynchronous, peer-to-peer replicator
processes. The replicator processes traverse their local file systems
and concurrently perform operations in a manner that balances load
across physical disks.
Replication uses a push model, with records and files generally only
being copied from local to remote replicas. This is important because
data on the node might not belong there (as in the case of handoffs and
ring changes), and a replicator cannot know which data it should pull in
from elsewhere in the cluster. Any node that contains data must ensure
that data gets to where it belongs. The ring handles replica placement.
To replicate deletions in addition to creations, every deleted record or
file in the system is marked by a tombstone. The replication process
cleans up tombstones after a time period known as the *consistency
window*. This window encompasses the replication duration and how long
a transient failure can remove a node from the cluster. Tombstone
cleanup must be tied to replication to reach replica convergence.
If a replicator detects that a remote drive has failed, the replicator
uses the ``get_more_nodes`` interface for the ring to choose an
alternate node with which to synchronize. The replicator can maintain
desired levels of replication during disk failures, though some replicas
might not be in an immediately usable location.
.. note::
The replicator does not maintain desired levels of replication when
failures such as entire node failures occur; most failures are
transient.
The main replication types are:
- Database replication
Replicates containers and objects.
- Object replication
Replicates object data.
Database replication
~~~~~~~~~~~~~~~~~~~~
Database replication completes a low-cost hash comparison to determine
whether two replicas already match. Normally, this check can quickly
verify that most databases in the system are already synchronized. If
the hashes differ, the replicator synchronizes the databases by sharing
records added since the last synchronization point.
This synchronization point is a high water mark that notes the last
record at which two databases were known to be synchronized, and is
stored in each database as a tuple of the remote database ID and record
ID. Database IDs are unique across all replicas of the database, and
record IDs are monotonically increasing integers. After all new records
are pushed to the remote database, the entire synchronization table of
the local database is pushed, so the remote database can guarantee that
it is synchronized with everything with which the local database was
previously synchronized.
If a replica is missing, the whole local database file is transmitted to
the peer by using rsync(1) and is assigned a new unique ID.
In practice, database replication can process hundreds of databases per
concurrency setting per second (up to the number of available CPUs or
disks) and is bound by the number of database transactions that must be
performed.
Object replication
~~~~~~~~~~~~~~~~~~
The initial implementation of object replication performed an rsync to
push data from a local partition to all remote servers where it was
expected to reside. While this worked at small scale, replication times
skyrocketed once directory structures could no longer be held in RAM.
This scheme was modified to save a hash of the contents for each suffix
directory to a per-partition hashes file. The hash for a suffix
directory is no longer valid when the contents of that suffix directory
are modified.
The object replication process reads in hash files and calculates any
invalidated hashes. Then, it transmits the hashes to each remote server
that should hold the partition, and only suffix directories with
differing hashes on the remote server are rsynced. After pushing files
to the remote server, the replication process notifies it to recalculate
hashes for the rsynced suffix directories.
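The following simplified Python sketch illustrates the idea of hashing each suffix directory and selecting the suffixes to synchronize. It is not the actual swift replicator code; the directory layout and function names are illustrative only::

   import hashlib
   import os

   def get_suffix_hashes(partition_path):
       """Hash the file names found under each suffix directory of a partition."""
       hashes = {}
       for suffix in sorted(os.listdir(partition_path)):
           suffix_path = os.path.join(partition_path, suffix)
           if not os.path.isdir(suffix_path):
               continue
           md5 = hashlib.md5()
           for hash_dir in sorted(os.listdir(suffix_path)):
               object_dir = os.path.join(suffix_path, hash_dir)
               if not os.path.isdir(object_dir):
                   continue
               for file_name in sorted(os.listdir(object_dir)):
                   md5.update(file_name.encode('utf-8'))
           hashes[suffix] = md5.hexdigest()
       return hashes

   def suffixes_to_sync(local_hashes, remote_hashes):
       """Return the suffix directories whose hashes differ and need an rsync."""
       return [suffix for suffix, digest in local_hashes.items()
               if remote_hashes.get(suffix) != digest]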
The number of uncached directories that object replication must
traverse, usually as a result of invalidated suffix directory hashes,
impedes performance. To provide acceptable replication speeds, object
replication is designed to invalidate around 2 percent of the hash space
on a normal node each day.


@@ -0,0 +1,181 @@
============
Ring-builder
============
Use the swift-ring-builder utility to build and manage rings. This
utility assigns partitions to devices and writes an optimized Python
structure to a gzipped, serialized file on disk for transmission to the
servers. The server processes occasionally check the modification time
of the file and reload in-memory copies of the ring structure as needed.
If you use a slightly older version of the ring, one of the three
replicas for a partition subset will be incorrect because of the way the
ring-builder manages changes to the ring. You can work around this
issue.
The ring-builder also keeps its own builder file with the ring
information and additional data required to build future rings. It is
very important to keep multiple backup copies of these builder files.
One option is to copy the builder files out to every server while
copying the ring files themselves. Another is to upload the builder
files into the cluster itself. If you lose the builder file, you have to
create a new ring from scratch. Nearly all partitions would be assigned
to different devices and, therefore, nearly all of the stored data would
have to be replicated to new locations. So, recovery from a builder file
loss is possible, but data would be unreachable for an extended time.
Ring data structure
~~~~~~~~~~~~~~~~~~~
The ring data structure consists of three top-level fields: a list of
devices in the cluster, a list of lists of device IDs indicating
partition to device assignments, and an integer indicating the number of
bits to shift an MD5 hash to calculate the partition for the hash.
Partition assignment list
~~~~~~~~~~~~~~~~~~~~~~~~~
This is a list of ``array('H')`` of device IDs. The outermost list
contains an ``array('H')`` for each replica. Each ``array('H')`` has a
length equal to the partition count for the ring. Each integer in the
``array('H')`` is an index into the above list of devices. The partition
list is known internally to the Ring class as ``_replica2part2dev_id``.
So, to create a list of device dictionaries assigned to a partition, the
Python code would look like::
devices = [self.devs[part2dev_id[partition]] for
part2dev_id in self._replica2part2dev_id]
That code is a little simplistic because it does not account for the
removal of duplicate devices. If a ring has more replicas than devices,
a partition will have more than one replica on a device.
``array('H')`` is used for memory conservation as there may be millions
of partitions.
Replica counts
~~~~~~~~~~~~~~
To support the gradual change in replica counts, a ring can have a real
number of replicas and is not restricted to an integer number of
replicas.
A fractional replica count is for the whole ring and not for individual
partitions. It indicates the average number of replicas for each
partition. For example, a replica count of 3.2 means that 20 percent of
partitions have four replicas and 80 percent have three replicas.
The replica count is adjustable.
Example::
$ swift-ring-builder account.builder set_replicas 4
$ swift-ring-builder account.builder rebalance
You must rebalance the replica ring in globally distributed clusters.
Operators of these clusters generally want an equal number of replicas
and regions. Therefore, when an operator adds or removes a region, the
operator adds or removes a replica. Removing unneeded replicas saves on
the cost of disks.
You can gradually increase the replica count at a rate that does not
adversely affect cluster performance.
For example::
$ swift-ring-builder object.builder set_replicas 3.01
$ swift-ring-builder object.builder rebalance
<distribute rings and wait>...
$ swift-ring-builder object.builder set_replicas 3.02
$ swift-ring-builder object.builder rebalance
<distribute rings and wait>...
Changes take effect after the ring is rebalanced. Therefore, if you
intend to change from 3 replicas to 3.01 but you accidentally type
2.01, no data is lost.
Additionally, the ``swift-ring-builder X.builder create`` command can now
take a decimal argument for the number of replicas.
Partition shift value
~~~~~~~~~~~~~~~~~~~~~
The partition shift value is known internally to the Ring class as
``_part_shift``. This value is used to shift an MD5 hash to calculate
the partition where the data for that hash should reside. Only the top
four bytes of the hash are used in this process. For example, to compute
the partition for the :file:`/account/container/object` path using Python::
from struct import unpack_from
from hashlib import md5

partition = (unpack_from('>I',
    md5('/account/container/object').digest())[0] >>
    self._part_shift)
For a ring generated with part\_power P, the partition shift value is
``32 - P``.
Build the ring
~~~~~~~~~~~~~~
The ring builder process includes these high-level steps:
#. The utility calculates the number of partitions to assign to each
device based on the weight of the device. For example, for a
partition power of 20, the ring has 1,048,576 partitions. One
thousand devices of equal weight each want 1,048.576 partitions. The
devices are sorted by the number of partitions they desire and kept
in order throughout the initialization process.
.. note::
Each device is also assigned a random tiebreaker value that is
used when two devices desire the same number of partitions. This
tiebreaker is not stored on disk anywhere, and so two different
rings created with the same parameters will have different
partition assignments. For repeatable partition assignments,
``RingBuilder.rebalance()`` takes an optional seed value that
seeds the Python pseudo-random number generator.
#. The ring builder assigns each partition replica to the device that
requires the most partitions at that point, while keeping it as far away
as possible from other replicas. The ring builder prefers to assign a
replica to a device in a region that does not already have a replica.
If no such region is available, the ring builder searches for a
device in a different zone, or on a different server. If it does not
find one, it looks for a device with no replicas. Finally, if all
options are exhausted, the ring builder assigns the replica to the
device that has the fewest replicas already assigned.
.. note::
The ring builder assigns multiple replicas to one device only if
the ring has fewer devices than it has replicas.
#. When building a new ring from an old ring, the ring builder
recalculates the desired number of partitions that each device wants.
#. The ring builder unassigns partitions and gathers these partitions
for reassignment, as follows:
- The ring builder unassigns any assigned partitions from any
removed devices and adds these partitions to the gathered list.
- The ring builder unassigns any partition replicas that can be
spread out for better durability and adds these partitions to the
gathered list.
- The ring builder unassigns random partitions from any devices that
have more partitions than they need and adds these partitions to
the gathered list.
#. The ring builder reassigns the gathered partitions to devices by
using a similar method to the one described previously.
#. When the ring builder reassigns a replica to a partition, the ring
builder records the time of the reassignment. The ring builder uses
this value when it gathers partitions for reassignment so that no
partition is moved twice in a configurable amount of time. The
RingBuilder class knows this configurable amount of time as
``min_part_hours``. The ring builder ignores this restriction for
replicas of partitions on removed devices because removal of a device
happens on device failure only, and reassignment is the only choice.
These steps do not always perfectly rebalance a ring due to the random
nature of gathering partitions for reassignment. To help reach a more
balanced ring, the rebalance process is repeated until the ring is nearly
perfect (less than 1 percent off) or until the balance does not improve by at
least 1 percent (indicating we probably cannot get perfect balance due
to wildly imbalanced zones or too many partitions recently moved).
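As a simplified illustration of the weight calculation in step 1 above, the following Python sketch shows how the desired partition count per device could be derived. It is not the actual RingBuilder code, and the device dictionaries are a hypothetical structure::

   def partitions_wanted(devices, part_power):
       """Share of the ring's partitions each device wants, proportional to weight."""
       total_partitions = 2 ** part_power              # 2 ** 20 = 1,048,576
       total_weight = sum(dev['weight'] for dev in devices)
       for dev in devices:
           dev['parts_wanted'] = total_partitions * dev['weight'] / total_weight
       # Keep devices sorted by how many partitions they desire.
       return sorted(devices, key=lambda dev: dev['parts_wanted'], reverse=True)

   # 1,000 devices of equal weight each want 1,048,576 / 1,000 = 1,048.576 partitions.
   devices = [{'id': i, 'weight': 100.0} for i in range(1000)]
   print(partitions_wanted(devices, part_power=20)[0]['parts_wanted'])  # 1048.576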


@@ -0,0 +1,31 @@
=============================================================
Configure tenant-specific image locations with Object Storage
=============================================================
For some deployers, it is not ideal to store all images in one place to
enable all tenants and users to access them. You can configure the Image
service to store image data in tenant-specific image locations. Then,
only the following tenants can use the Image service to access the
created image:
- The tenant who owns the image
- Tenants that are defined in ``swift_store_admin_tenants`` and that
have admin-level accounts
**To configure tenant-specific image locations**
#. Configure swift as your ``default_store`` in the :file:`glance-api.conf` file.
#. Set these configuration options in the :file:`glance-api.conf` file:
- ``swift_store_multi_tenant``
Set to ``True`` to enable tenant-specific storage locations.
Default is ``False``.
- ``swift_store_admin_tenants``
Specify a list of tenant IDs that can grant read and write access to all
Object Storage containers that are created by the Image service.
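For example, the relevant excerpt of :file:`glance-api.conf` might look like the following sketch. Depending on your release, these options belong in the ``[glance_store]`` section or in ``[DEFAULT]``, and the tenant ID shown is a placeholder::

   [glance_store]
   default_store = swift
   # Store each tenant's images in that tenant's own Object Storage containers.
   swift_store_multi_tenant = True
   # Placeholder value; replace with the IDs of admin tenants (comma-separated).
   swift_store_admin_tenants = 7e2f9a3c51b94c6b8a1d4e5f6a7b8c9d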
With this configuration, images are stored in an Object Storage service
(swift) endpoint that is pulled from the service catalog for the
authenticated user.