Convert objectstorage files to RST

Cloud Admin Guide files converted:

  objectstorage-account-reaper.rst
  objectstorage-arch.rst
  objectstorage-replication.rst
  objectstorage-ringbuilder.rst
  objectstorage-tenant-specific-image-storage.rst

Change-Id: I9d6416d4dfdf3bd71d59eef64b9ba3af07e15a8a
Implements: blueprint reorganise-user-guides

parent eb85580cc4
commit 45fb0a6a86
BIN  doc/admin-guide-cloud-rst/source/figures/objectstorage-arch.png   (new file, 56 KiB, binary file not shown)
BIN  doc/admin-guide-cloud-rst/source/figures/objectstorage-nodes.png  (new file, 58 KiB, binary file not shown)
@@ -2,9 +2,6 @@
 Object Storage
 ==============

-Contents
-~~~~~~~~
-
 .. toctree::
    :maxdepth: 2

@@ -12,13 +9,13 @@ Contents
    objectstorage_features.rst
    objectstorage_characteristics.rst
    objectstorage_components.rst
-   objectstorage-monitoring.rst
-   objectstorage-admin.rst
-
-.. TODO (karenb)
    objectstorage_ringbuilder.rst
    objectstorage_arch.rst
    objectstorage_replication.rst
    objectstorage_account_reaper.rst
    objectstorage_tenant_specific_image_storage.rst
+   objectstorage-monitoring.rst
+   objectstorage-admin.rst
+
+.. TODO (karenb)
    objectstorage_troubleshoot.rst
@@ -0,0 +1,50 @@
==============
Account reaper
==============

In the background, the account reaper removes data from the deleted
accounts.

A reseller marks an account for deletion by issuing a ``DELETE`` request
on the account's storage URL. This action sets the ``status`` column of
the account\_stat table in the account database and replicas to
``DELETED``, marking the account's data for deletion.

Typically, no specific retention time or undelete feature is provided.
However, you can set a ``delay_reaping`` value in the
``[account-reaper]`` section of the :file:`account-server.conf` file to
delay the actual deletion of data. At this time, to undelete you have to
update the account database replicas directly, setting the status column
to an empty string and updating the put\_timestamp to be greater than
the delete\_timestamp.

.. note::

   It is on the development to-do list to write a utility that performs
   this task, preferably through a REST call.

The account reaper runs on each account server and occasionally scans
the server for account databases marked for deletion. It only fires up
on the accounts for which the server is the primary node, so that
multiple account servers are not trying to reap the same account
simultaneously. Using multiple servers to delete one account might
improve the deletion speed but requires coordination to avoid
duplication. Speed is not a major concern with data deletion, and large
accounts are not deleted often.

Deleting an account is simple: for each container in the account, all
objects are deleted and then the container is deleted. Deletion requests
that fail do not stop the overall process but cause the overall process
to fail eventually (for example, if an object delete times out, you
cannot delete the container or the account). The account reaper keeps
trying to delete an account until it is empty, at which point the
database reclaim process within the db\_replicator removes the database
files.
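
The overall flow can be pictured with a short sketch. This is schematic
Python only, not Swift's actual reaper code; ``account_client`` and its
methods are hypothetical stand-ins for the real internal calls::

    def reap_account(account_client):
        """Empty one deleted account: objects first, then containers,
        then the account itself (schematic only)."""
        for container in account_client.containers():
            for obj in account_client.objects(container):
                # A failed or timed-out delete does not stop this pass;
                # the reaper simply retries on a later run.
                account_client.delete_object(container, obj)
            account_client.delete_container(container)
        # Succeeds only once the account is completely empty.
        account_client.delete_account()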

A persistent error state may prevent the deletion of an object or
container. If this happens, you will see a message in the log, for
example::

    "Account <name> has not been reaped since <date>"

You can control when this is logged with the ``reap_warn_after`` value
in the ``[account-reaper]`` section of the :file:`account-server.conf`
file. The default value is 30 days.
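
Both options live in the ``[account-reaper]`` section of
:file:`account-server.conf` and are expressed in seconds. A minimal
sketch, with illustrative values rather than recommendations::

    [account-reaper]
    # Wait one week after an account is marked DELETED before reaping it.
    delay_reaping = 604800
    # Warn in the log if an account still has not been reaped 30 days
    # after its deletion was requested (the default).
    reap_warn_after = 2592000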

81   doc/admin-guide-cloud-rst/source/objectstorage_arch.rst   (new file)
@@ -0,0 +1,81 @@
====================
Cluster architecture
====================

Access tier
~~~~~~~~~~~

Large-scale deployments segment off an access tier, which is considered
the Object Storage system's central hub. The access tier fields the
incoming API requests from clients and moves data in and out of the
system. This tier consists of front-end load balancers, SSL terminators,
and authentication services. It runs the (distributed) brain of the
Object Storage system: the proxy server processes.

**Object Storage architecture**

.. image:: figures/objectstorage-arch.png

Because access servers are collocated in their own tier, you can scale
out read/write access regardless of the storage capacity. For example,
if a cluster is on the public Internet, requires SSL termination, and
has a high demand for data access, you can provision many access
servers. However, if the cluster is on a private network and used
primarily for archival purposes, you need fewer access servers.

Since this is an HTTP-addressable storage service, you may incorporate a
load balancer into the access tier.

Typically, the tier consists of a collection of 1U servers. These
machines use a moderate amount of RAM and are network I/O intensive.
Since these systems field each incoming API request, you should
provision them with two high-throughput (10GbE) interfaces: one for the
incoming "front-end" requests and the other for the "back-end" access to
the object storage nodes to put and fetch data.

Factors to consider
-------------------

For most publicly facing deployments, as well as private deployments
available across a wide-reaching corporate network, you use SSL to
encrypt traffic to the client. SSL adds significant processing load to
establish sessions between clients, which is why you have to provision
more capacity in the access layer. SSL may not be required for private
deployments on trusted networks.

Storage nodes
~~~~~~~~~~~~~

In most configurations, each of the five zones should have an equal
amount of storage capacity. Storage nodes use a reasonable amount of
memory and CPU. Metadata needs to be readily available to return objects
quickly. The object stores run services not only to field incoming
requests from the access tier, but also to run replicators, auditors,
and reapers. You can provision object stores with a single-gigabit or
10-gigabit network interface depending on the expected workload and
desired performance.

**Object Storage (swift)**

.. image:: figures/objectstorage-nodes.png

Currently, a 2 TB or 3 TB SATA disk delivers good performance for the
price. You can use desktop-grade drives if you have responsive remote
hands in the datacenter and enterprise-grade drives if you don't.

Factors to consider
-------------------

You should keep in mind the desired I/O performance for single-threaded
requests. This system does not use RAID, so a single disk handles each
request for an object. Disk performance impacts single-threaded response
rates.

To achieve apparent higher throughput, the object storage system is
designed to handle concurrent uploads/downloads. The network I/O
capacity (1GbE, bonded 1GbE pair, or 10GbE) should match your desired
concurrent throughput needs for reads and writes.

@@ -0,0 +1,96 @@
===========
Replication
===========

Because each replica in Object Storage functions independently and
clients generally require only a simple majority of nodes to respond to
consider an operation successful, transient failures like network
partitions can quickly cause replicas to diverge. These differences are
eventually reconciled by asynchronous, peer-to-peer replicator
processes. The replicator processes traverse their local file systems
and concurrently perform operations in a manner that balances load
across physical disks.

Replication uses a push model, with records and files generally only
being copied from local to remote replicas. This is important because
data on the node might not belong there (as in the case of handoffs and
ring changes), and a replicator cannot know which data it should pull in
from elsewhere in the cluster. Any node that contains data must ensure
that data gets to where it belongs. The ring handles replica placement.

To replicate deletions in addition to creations, every deleted record or
file in the system is marked by a tombstone. The replication process
cleans up tombstones after a time period known as the *consistency
window*. This window defines the duration of the replication and how
long a transient failure can remove a node from the cluster. Tombstone
cleanup must be tied to replication to reach replica convergence.

If a replicator detects that a remote drive has failed, the replicator
uses the ``get_more_nodes`` interface of the ring to choose an
alternate node with which to synchronize. The replicator can maintain
desired levels of replication during disk failures, though some replicas
might not be in an immediately usable location.

.. note::

   The replicator does not maintain desired levels of replication when
   failures such as entire node failures occur; most failures are
   transient.

The main replication types are:

- Database replication
  Replicates containers and objects.

- Object replication
  Replicates object data.

Database replication
~~~~~~~~~~~~~~~~~~~~

Database replication completes a low-cost hash comparison to determine
whether two replicas already match. Normally, this check can quickly
verify that most databases in the system are already synchronized. If
the hashes differ, the replicator synchronizes the databases by sharing
records added since the last synchronization point.

This synchronization point is a high water mark that notes the last
record at which two databases were known to be synchronized, and is
stored in each database as a tuple of the remote database ID and record
ID. Database IDs are unique across all replicas of the database, and
record IDs are monotonically increasing integers. After all new records
are pushed to the remote database, the entire synchronization table of
the local database is pushed, so the remote database can guarantee that
it is synchronized with everything with which the local database was
previously synchronized.
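
The mechanism can be summarized with a short sketch. This is schematic
Python only, not Swift's actual database replicator; the methods on
``local`` and ``remote`` are hypothetical stand-ins for the real broker
calls::

    def replicate_db(local, remote):
        """One replication pass from a local database replica to a peer."""
        if local.content_hash() == remote.content_hash():
            return  # replicas already match; nothing to push

        # High-water mark: the last local record id the remote is known
        # to have merged.
        sync_point = remote.get_sync_point(local.db_id)

        # Push only the records added since that synchronization point.
        remote.merge_records(local.records_since(sync_point))

        # Push the local sync table too, so the remote also knows about
        # everything the local replica has synchronized with other peers.
        remote.merge_syncs(local.sync_table())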

If a replica is missing, the whole local database file is transmitted to
the peer by using rsync(1) and is assigned a new unique ID.

In practice, database replication can process hundreds of databases per
concurrency setting per second (up to the number of available CPUs or
disks) and is bound by the number of database transactions that must be
performed.

Object replication
~~~~~~~~~~~~~~~~~~

The initial implementation of object replication performed an rsync to
push data from a local partition to all remote servers where it was
expected to reside. While this worked at small scale, replication times
skyrocketed once directory structures could no longer be held in RAM.
This scheme was modified to save a hash of the contents of each suffix
directory to a per-partition hashes file. The hash for a suffix
directory is no longer valid when the contents of that suffix directory
are modified.

The object replication process reads in hash files and calculates any
invalidated hashes. Then, it transmits the hashes to each remote server
that should hold the partition, and only suffix directories with
differing hashes on the remote server are rsynced. After pushing files
to the remote server, the replication process notifies it to recalculate
hashes for the rsynced suffix directories.
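
The per-partition comparison can be sketched as follows. This is
schematic Python only, not Swift's actual object replicator; the
``remote`` object and ``rsync_suffixes`` callable are hypothetical
stand-ins for the real replicator plumbing::

    def replicate_partition(partition, local_hashes, remote, rsync_suffixes):
        """Push only the suffix directories that differ on one remote node.

        ``local_hashes`` maps suffix directory -> content hash.
        """
        remote_hashes = remote.get_hashes(partition)

        out_of_sync = [suffix for suffix, digest in local_hashes.items()
                       if remote_hashes.get(suffix) != digest]
        if not out_of_sync:
            return  # this node already agrees with the remote copy

        rsync_suffixes(partition, out_of_sync, remote)

        # Ask the remote node to recalculate the hashes for the suffix
        # directories that were just pushed.
        remote.recalculate_hashes(partition, out_of_sync)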

The number of uncached directories that object replication must
traverse, usually as a result of invalidated suffix directory hashes,
impedes performance. To provide acceptable replication speeds, object
replication is designed to invalidate around 2 percent of the hash space
on a normal node each day.

181  doc/admin-guide-cloud-rst/source/objectstorage_ringbuilder.rst   (new file)
@@ -0,0 +1,181 @@
============
Ring-builder
============

Use the swift-ring-builder utility to build and manage rings. This
utility assigns partitions to devices and writes an optimized Python
structure to a gzipped, serialized file on disk for transmission to the
servers. The server processes occasionally check the modification time
of the file and reload in-memory copies of the ring structure as needed.
If you use a slightly older version of the ring, one of the three
replicas for a partition subset will be incorrect because of the way the
ring-builder manages changes to the ring. You can work around this
issue.

The ring-builder also keeps its own builder file with the ring
information and additional data required to build future rings. It is
very important to keep multiple backup copies of these builder files.
One option is to copy the builder files out to every server while
copying the ring files themselves. Another is to upload the builder
files into the cluster itself. If you lose the builder file, you have to
create a new ring from scratch. Nearly all partitions would be assigned
to different devices and, therefore, nearly all of the stored data would
have to be replicated to new locations. So, recovery from a builder file
loss is possible, but data would be unreachable for an extended time.

Ring data structure
~~~~~~~~~~~~~~~~~~~

The ring data structure consists of three top level fields: a list of
devices in the cluster, a list of lists of device ids indicating
partition-to-device assignments, and an integer indicating the number of
bits to shift an MD5 hash to calculate the partition for the hash.

Partition assignment list
~~~~~~~~~~~~~~~~~~~~~~~~~

This is a list of ``array('H')`` of device ids. The outermost list
contains an ``array('H')`` for each replica. Each ``array('H')`` has a
length equal to the partition count for the ring. Each integer in the
``array('H')`` is an index into the above list of devices. The partition
list is known internally to the Ring class as ``_replica2part2dev_id``.

So, to create a list of device dictionaries assigned to a partition, the
Python code would look like::

    devices = [self.devs[part2dev_id[partition]]
               for part2dev_id in self._replica2part2dev_id]

That code is a little simplistic because it does not account for the
removal of duplicate devices. If a ring has more replicas than devices,
a partition will have more than one replica on a device.

``array('H')`` is used for memory conservation, as there may be millions
of partitions.

Replica counts
~~~~~~~~~~~~~~

To support the gradual change in replica counts, a ring can have a real
number of replicas and is not restricted to an integer number of
replicas.

A fractional replica count applies to the whole ring, not to individual
partitions. It indicates the average number of replicas for each
partition. For example, a replica count of 3.2 means that 20 percent of
partitions have four replicas and 80 percent have three replicas.

The replica count is adjustable.

Example::

    $ swift-ring-builder account.builder set_replicas 4
    $ swift-ring-builder account.builder rebalance

You must rebalance the replica ring in globally distributed clusters.
Operators of these clusters generally want an equal number of replicas
and regions. Therefore, when an operator adds or removes a region, the
operator adds or removes a replica. Removing unneeded replicas saves on
the cost of disks.

You can gradually increase the replica count at a rate that does not
adversely affect cluster performance.

For example::

    $ swift-ring-builder object.builder set_replicas 3.01
    $ swift-ring-builder object.builder rebalance
    <distribute rings and wait>...

    $ swift-ring-builder object.builder set_replicas 3.02
    $ swift-ring-builder object.builder rebalance
    <distribute rings and wait>...

Changes take effect after the ring is rebalanced. Therefore, if you
intend to change from 3 replicas to 3.01 but you accidentally type
2.01, no data is lost.

Additionally, the ``swift-ring-builder X.builder create`` command can now
take a decimal argument for the number of replicas.
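
For example, to create a hypothetical object ring with a partition power
of 20, 3.5 replicas, and a ``min_part_hours`` of 1 (illustrative values,
not recommendations)::

    $ swift-ring-builder object.builder create 20 3.5 1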

Partition shift value
~~~~~~~~~~~~~~~~~~~~~

The partition shift value is known internally to the Ring class as
``_part_shift``. This value is used to shift an MD5 hash to calculate
the partition where the data for that hash should reside. Only the top
four bytes of the hash are used in this process. For example, to compute
the partition for the :file:`/account/container/object` path using
Python::

    partition = unpack_from('>I',
        md5('/account/container/object').digest())[0] >> self._part_shift

For a ring generated with part\_power P, the partition shift value is
``32 - P``.
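
As a worked example (values chosen only for illustration), a ring built
with a partition power of 20 has ``2**20 = 1,048,576`` partitions and a
partition shift value of ``32 - 20 = 12``, so the top 20 bits of the
hash select the partition.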

Build the ring
~~~~~~~~~~~~~~

The ring builder process includes these high-level steps:

#. The utility calculates the number of partitions to assign to each
   device based on the weight of the device. For example, for a
   partition power of 20, the ring has 1,048,576 partitions. One
   thousand devices of equal weight each want 1,048.576 partitions
   (1,048,576 / 1,000). The devices are sorted by the number of
   partitions they desire and kept in order throughout the
   initialization process.

   .. note::

      Each device is also assigned a random tiebreaker value that is
      used when two devices desire the same number of partitions. This
      tiebreaker is not stored on disk anywhere, and so two different
      rings created with the same parameters will have different
      partition assignments. For repeatable partition assignments,
      ``RingBuilder.rebalance()`` takes an optional seed value that
      seeds the Python pseudo-random number generator.

#. The ring builder assigns each partition replica to the device that
   requires the most partitions at that point while keeping it as far
   away as possible from other replicas. The ring builder prefers to
   assign a replica to a device in a region that does not already have a
   replica. If no such region is available, the ring builder searches
   for a device in a different zone, or on a different server. If it
   does not find one, it looks for a device with no replicas. Finally,
   if all options are exhausted, the ring builder assigns the replica to
   the device that has the fewest replicas already assigned.

   .. note::

      The ring builder assigns multiple replicas to one device only if
      the ring has fewer devices than it has replicas.

#. When building a new ring from an old ring, the ring builder
   recalculates the desired number of partitions that each device wants.

#. The ring builder unassigns partitions and gathers these partitions
   for reassignment, as follows:

   - The ring builder unassigns any assigned partitions from any
     removed devices and adds these partitions to the gathered list.
   - The ring builder unassigns any partition replicas that can be
     spread out for better durability and adds these partitions to the
     gathered list.
   - The ring builder unassigns random partitions from any devices that
     have more partitions than they need and adds these partitions to
     the gathered list.

#. The ring builder reassigns the gathered partitions to devices by
   using a similar method to the one described previously.

#. When the ring builder reassigns a replica to a partition, the ring
   builder records the time of the reassignment. The ring builder uses
   this value when it gathers partitions for reassignment so that no
   partition is moved twice in a configurable amount of time. The
   RingBuilder class knows this configurable amount of time as
   ``min_part_hours``. The ring builder ignores this restriction for
   replicas of partitions on removed devices because removal of a device
   happens only on device failure, and reassignment is the only choice.

These steps do not always perfectly rebalance a ring due to the random
nature of gathering partitions for reassignment. To help reach a more
balanced ring, the rebalance process is repeated until it is nearly
perfect (less than 1 percent off) or until the balance does not improve
by at least 1 percent (indicating that a perfect balance probably cannot
be reached due to wildly imbalanced zones or too many recently moved
partitions).

@@ -0,0 +1,31 @@
=============================================================
Configure tenant-specific image locations with Object Storage
=============================================================

For some deployers, it is not ideal to store all images in one place to
enable all tenants and users to access them. You can configure the Image
service to store image data in tenant-specific image locations. Then,
only the following tenants can use the Image service to access the
created image:

- The tenant who owns the image
- Tenants that are defined in ``swift_store_admin_tenants`` and that
  have admin-level accounts

**To configure tenant-specific image locations**

#. Configure ``swift`` as your ``default_store`` in the
   :file:`glance-api.conf` file.

#. Set these configuration options in the :file:`glance-api.conf` file:

   - ``swift_store_multi_tenant``
     Set to ``True`` to enable tenant-specific storage locations.
     Default is ``False``.

   - ``swift_store_admin_tenants``
     Specify a list of tenant IDs that can grant read and write access
     to all Object Storage containers that are created by the Image
     service.

With this configuration, images are stored in an Object Storage service
(swift) endpoint that is pulled from the service catalog for the
authenticated user.
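
Putting the two steps together, a minimal sketch of the relevant
:file:`glance-api.conf` settings might look like the following. The
tenant IDs are placeholders, and the configuration section that these
options belong to depends on your Glance release, so it is omitted
here::

    default_store = swift
    # Store each tenant's images in that tenant's own Object Storage account.
    swift_store_multi_tenant = True
    # Tenants whose members are granted admin access to the image
    # containers created by the Image service.
    swift_store_admin_tenants = <tenant-id-1>,<tenant-id-2>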