=====================
Administrator's Guide
=====================

-------------------------
Defining Storage Policies
-------------------------

Defining your Storage Policies is very easy to do with Swift. It is important
that the administrator understand the concepts behind Storage Policies
before actually creating and using them in order to get the most benefit out
of the feature and, more importantly, to avoid having to make unnecessary
changes once a set of policies has been deployed to a cluster.

It is highly recommended that the reader fully read and comprehend
:doc:`overview_policies` before proceeding with administration of
policies. Plan carefully and it is suggested that experimentation be
done first on a non-production cluster to be certain that the desired
configuration meets the needs of the users. See :ref:`upgrade-policy`
before planning the upgrade of your existing deployment.

Following is a high level view of the very few steps it takes to configure
policies once you have decided what you want to do:

#. Define your policies in ``/etc/swift/swift.conf``
#. Create the corresponding object rings
#. Communicate the names of the Storage Policies to cluster users

For a specific example that takes you through these steps, please see
:doc:`policies_saio`

------------------
Managing the Rings
------------------

You may build the storage rings on any server with the appropriate
version of Swift installed. Once built or changed (rebalanced), you
must distribute the rings to all the servers in the cluster. Storage
rings contain information about all the Swift storage partitions and
how they are distributed between the different nodes and disks.

Swift 1.6.0 is the last version to use a Python pickle format.
Subsequent versions use a different serialization format. **Rings
generated by Swift versions 1.6.0 and earlier may be read by any
version, but rings generated after 1.6.0 may only be read by Swift
versions greater than 1.6.0.** So when upgrading from version 1.6.0 or
earlier to a version greater than 1.6.0, either upgrade Swift on your
ring building server **last** after all Swift nodes have been successfully
upgraded, or refrain from generating rings until all Swift nodes have
been successfully upgraded.

If you need to downgrade from a version of Swift greater than 1.6.0 to
a version less than or equal to 1.6.0, first downgrade your ring-building
server, generate new rings, push them out, then continue with the rest
of the downgrade.

For more information see :doc:`overview_ring`.

.. highlight:: none

Removing a device from the ring::

    swift-ring-builder <builder-file> remove <ip_address>/<device_name>

Removing a server from the ring::

    swift-ring-builder <builder-file> remove <ip_address>

Adding devices to the ring:

See :ref:`ring-preparing`

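For example, a new device is typically added with a command of the following
general form (see :ref:`ring-preparing` for the full syntax and for guidance
on choosing regions, zones and weights)::

    swift-ring-builder <builder-file> add r<region>z<zone>-<ip_address>:<port>/<device_name> <weight>
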
See what devices for a server are in the ring::

    swift-ring-builder <builder-file> search <ip_address>

Once you are done with all changes to the ring, the changes need to be
"committed"::

    swift-ring-builder <builder-file> rebalance

Once the new rings are built, they should be pushed out to all the servers
in the cluster.

Optionally, if invoked as 'swift-ring-builder-safe' the directory containing
the specified builder file will be locked (via a .lock file in the parent
directory). This provides a basic safeguard against multiple instances of
swift-ring-builder (or other utilities that observe this lock) attempting to
write to or read the builder/ring files while operations are in progress.
This can be useful in environments where ring management has been automated
but the operator still needs to interact with the rings manually.

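For example, to rebalance while holding the lock, invoke the safe wrapper with
the same arguments you would normally pass to swift-ring-builder::

    swift-ring-builder-safe <builder-file> rebalance
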
If the ring builder is not producing the balances that you are
expecting, you can gain visibility into what it's doing with the
``--debug`` flag::

    swift-ring-builder <builder-file> rebalance --debug

This produces a great deal of output that is mostly useful if you are
either (a) attempting to fix the ring builder, or (b) filing a bug
against the ring builder.

You may notice in the rebalance output a 'dispersion' number. What this
number means is explained in :ref:`ring_dispersion` but in essence it is
the percentage of partitions in the ring that have too many replicas
within a particular failure domain. You can ask 'swift-ring-builder' what
the dispersion is with::

    swift-ring-builder <builder-file> dispersion

This will give you the percentage again; if you want a detailed view of
the dispersion simply add a ``--verbose``::

    swift-ring-builder <builder-file> dispersion --verbose

This will not only display the percentage but will also display a dispersion
table that lists partition dispersion by tier. You can use this table to figure
out where you need to add capacity or to help tune an :ref:`ring_overload` value.

Now let's take an example with 1 region, 3 zones and 4 devices. Each device has
the same weight, and the ``dispersion --verbose`` might show the following::

    Dispersion is 16.666667, Balance is 0.000000, Overload is 0.00%
    Required overload is 33.333333%
    Worst tier is 33.333333 (r1z3)
    --------------------------------------------------------------------------
    Tier                   Parts      %   Max   0    1    2    3
    --------------------------------------------------------------------------
    r1                       768   0.00     3   0    0    0  256
    r1z1                     192   0.00     1  64  192    0    0
    r1z1-127.0.0.1           192   0.00     1  64  192    0    0
    r1z1-127.0.0.1/sda       192   0.00     1  64  192    0    0
    r1z2                     192   0.00     1  64  192    0    0
    r1z2-127.0.0.2           192   0.00     1  64  192    0    0
    r1z2-127.0.0.2/sda       192   0.00     1  64  192    0    0
    r1z3                     384  33.33     1   0  128  128    0
    r1z3-127.0.0.3           384  33.33     1   0  128  128    0
    r1z3-127.0.0.3/sda       192   0.00     1  64  192    0    0
    r1z3-127.0.0.3/sdb       192   0.00     1  64  192    0    0

The first line reports that there are 256 partitions with 3 copies in region 1;
this is the expected output in this case (a single region with 3 replicas), as
reported by the "Max" value.

However, there is some imbalance in the cluster, more precisely in zone 3. The
"Max" reports a maximum of 1 copy in this zone; however 50.00% of the partitions
are storing 2 replicas in this zone (which is somewhat expected, because there
are more disks in this zone).

You can now either add more capacity to the other zones, decrease the total
weight in zone 3 or set the overload to a value `greater than` 33.333333% -
only as much overload as needed will be used.

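For example, one way to apply the last of those options is to set the overload
on the builder file and rebalance; the value used here (0.334, i.e. 33.4%) is
only illustrative, and the right value depends on your topology::

    swift-ring-builder <builder-file> set_overload 0.334
    swift-ring-builder <builder-file> rebalance
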
-----------------------
Scripting Ring Creation
-----------------------

You can create scripts to create the account and container rings and
rebalance. Here's an example script for the Account ring. Use similar
commands to create a make-container-ring.sh script on the proxy server node.

1. Create a script file called make-account-ring.sh on the proxy
   server node with the following content::

        #!/bin/bash
        cd /etc/swift
        rm -f account.builder account.ring.gz backups/account.builder backups/account.ring.gz
        swift-ring-builder account.builder create 18 3 1
        swift-ring-builder account.builder add r1z1-<account-server-1>:6202/sdb1 1
        swift-ring-builder account.builder add r1z2-<account-server-2>:6202/sdb1 1
        swift-ring-builder account.builder rebalance

   You need to replace the values of <account-server-1>,
   <account-server-2>, etc. with the IP addresses of the account
   servers used in your setup. You can have as many account servers as
   you need. All account servers are assumed to be listening on port
   6202, and have a storage device called "sdb1" (this is a directory
   name created under /drives when we set up the account server). The
   "z1", "z2", etc. designate zones, and you can choose whether you
   put devices in the same or different zones. The "r1" designates
   the region, with different regions specified as "r1", "r2", etc.

2. Make the script file executable and run it to create the account ring file::

        chmod +x make-account-ring.sh
        sudo ./make-account-ring.sh

3. Copy the resulting ring file /etc/swift/account.ring.gz to all the
   account server nodes in your Swift environment, and put them in the
   /etc/swift directory on these nodes. Make sure that every time you
   change the account ring configuration, you copy the resulting ring
   file to all the account nodes.

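For example, the ring could be distributed from the ring-building node with a
small loop over your account servers; the hostnames below are placeholders,
and any remote copy mechanism you already use is equally suitable::

    for node in <account-server-1> <account-server-2>; do
        scp /etc/swift/account.ring.gz ${node}:/etc/swift/
    done
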
-----------------------
Handling System Updates
-----------------------

It is recommended that system updates and reboots are done a zone at a time.
This allows the update to happen, and for the Swift cluster to stay available
and responsive to requests. It is also advisable, when updating a zone, to let
it run for a while before updating the other zones to make sure the update
doesn't have any adverse effects.

----------------------
Handling Drive Failure
----------------------

In the event that a drive has failed, the first step is to make sure the drive
is unmounted. This will make it easier for Swift to work around the failure
until it has been resolved. If the drive is going to be replaced immediately,
then it is just best to replace the drive, format it, remount it, and let
replication fill it up.

After the drive is unmounted, make sure the mount point is owned by root
(root:root 755). This ensures that rsync will not try to replicate into the
root drive once the failed drive is unmounted.

If the drive can't be replaced immediately, then it is best to leave it
unmounted, and set the device weight to 0. This will allow all the
replicas that were on that drive to be replicated elsewhere until the drive
is replaced. Once the drive is replaced, the device weight can be increased
again. Setting the device weight to 0 instead of removing the drive from the
ring gives Swift the chance to replicate data from the failing disk too (in case
it is still possible to read some of the data).

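For example, using the device id reported by the ``search`` command shown
earlier (``d42`` is only a placeholder), the weight can be zeroed and the
change pushed out with a rebalance; remember to distribute the updated ring
afterwards::

    swift-ring-builder object.builder set_weight d42 0
    swift-ring-builder object.builder rebalance
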
Setting the device weight to 0 (or removing a failed drive from the ring) has
another benefit: all partitions that were stored on the failed drive are
distributed over the remaining disks in the cluster, and each disk only needs to
store a few new partitions. This is much faster compared to replicating all
partitions to a single, new disk. It decreases the time to recover from a
degraded number of replicas significantly, and becomes more and more important
with bigger disks.

-----------------------
Handling Server Failure
-----------------------

If a server is having hardware issues, it is a good idea to make sure the
Swift services are not running. This will allow Swift to work around the
failure while you troubleshoot.

If the server just needs a reboot, or a small amount of work that should
only last a couple of hours, then it is probably best to let Swift work
around the failure and get the machine fixed and back online. When the
machine comes back online, replication will make sure that anything that is
missing during the downtime will get updated.

If the server has more serious issues, then it is probably best to remove
all of the server's devices from the ring. Once the server has been repaired
and is back online, the server's devices can be added back into the ring.
It is important that the devices are reformatted before putting them back
into the ring as they are likely to be responsible for a different set of
partitions than before.

-----------------------
Detecting Failed Drives
-----------------------

It has been our experience that when a drive is about to fail, error messages
will spew into `/var/log/kern.log`. There is a script called
`swift-drive-audit` that can be run via cron to watch for bad drives. If
errors are detected, it will unmount the bad drive, so that Swift can
work around it. The script takes a configuration file with the following
settings:

``[drive-audit]``

================== ============== ===========================================
Option             Default        Description
------------------ -------------- -------------------------------------------
user               swift          Drop privileges to this user for non-root
                                  tasks
log_facility       LOG_LOCAL0     Syslog log facility
log_level          INFO           Log level
device_dir         /srv/node      Directory devices are mounted under
minutes            60             Number of minutes to look back in
                                  `/var/log/kern.log`
error_limit        1              Number of errors to find before a device
                                  is unmounted
log_file_pattern   /var/log/kern* Location of the log file with globbing
                                  pattern to check against device errors
regex_pattern_X    (see below)    Regular expression patterns to be used to
                                  locate device blocks with errors in the
                                  log file
================== ============== ===========================================

The default regex patterns used to locate device blocks with errors are
`\berror\b.*\b(sd[a-z]{1,2}\d?)\b` and `\b(sd[a-z]{1,2}\d?)\b.*\berror\b`.
You can override the defaults above by providing new expressions
using the format `regex_pattern_X = regex_expression`, where `X` is a number.

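For instance, a hypothetical extra pattern that also matches device-mapper
block devices (the expression itself is purely illustrative) could be added to
the configuration file like this::

    [drive-audit]
    regex_pattern_1 = \berror\b.*\b(dm-[0-9]{1,2}\d?)\b
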
This script has been tested on Ubuntu 10.04 and Ubuntu 12.04, so if you are
using a different distro or OS, some care should be taken before using in
production.

------------------------------
Preventing Disk Full Scenarios
------------------------------

.. highlight:: cfg

Prevent disk full scenarios by ensuring that the ``proxy-server`` blocks PUT
requests and rsync prevents replication to the specific drives.

You can prevent `proxy-server` PUT requests to low space disks by
ensuring ``fallocate_reserve`` is set in ``account-server.conf``,
``container-server.conf``, and ``object-server.conf``. By default,
``fallocate_reserve`` is set to 1%. In the object server, this blocks
PUT requests that would leave the free disk space below 1% of the
disk. In the account and container servers, this blocks operations
that will increase account or container database size once the free
disk space falls below 1%.

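For example, to reserve 2% of each disk instead of the default 1%, you might
set the following in ``object-server.conf`` (the same option can be set in the
account and container server configuration files, and the value may also be
given as an absolute number of bytes):

.. code::

    [DEFAULT]
    fallocate_reserve = 2%
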
Setting ``fallocate_reserve`` is highly recommended to avoid filling
disks to 100%. When Swift's disks are completely full, all requests
involving those disks will fail, including DELETE requests that would
otherwise free up space. This is because object deletion includes the
creation of a zero-byte tombstone (.ts) to record the time of the
deletion for replication purposes; this happens prior to deletion of
the object's data. On a completely-full filesystem, that zero-byte .ts
file cannot be created, so the DELETE request will fail and the disk
will remain completely full. If ``fallocate_reserve`` is set, then the
filesystem will have enough space to create the zero-byte .ts file,
and thus the deletion of the object will succeed and free up some
space.

In order to prevent rsync replication to specific drives, first set up a
``rsync_module`` per disk in your ``object-replicator``.
Set this in ``object-server.conf``:

.. code::

    [object-replicator]
    rsync_module = {replication_ip}::object_{device}

Set the individual drives in ``rsync.conf``. For example:

.. code::

    [object_sda]
    max connections = 4
    lock file = /var/lock/object_sda.lock

    [object_sdb]
    max connections = 4
    lock file = /var/lock/object_sdb.lock

Finally, monitor the disk space of each disk and adjust the rsync
``max connections`` per drive to ``-1``. We recommend utilising your existing
monitoring solution to achieve this. The following is an example script:

.. code-block:: python

    #!/usr/bin/env python
    import os
    import errno

    RESERVE = 500 * 2 ** 20  # 500 MiB

    DEVICES = '/srv/node1'

    path_template = '/etc/rsync.d/disable_%s.conf'
    config_template = '''
    [object_%s]
    max connections = -1
    '''

    def disable_rsync(device):
        with open(path_template % device, 'w') as f:
            f.write(config_template.lstrip() % device)


    def enable_rsync(device):
        try:
            os.unlink(path_template % device)
        except OSError as e:
            # ignore file does not exist
            if e.errno != errno.ENOENT:
                raise


    for device in os.listdir(DEVICES):
        path = os.path.join(DEVICES, device)
        st = os.statvfs(path)
        free = st.f_bavail * st.f_frsize
        if free < RESERVE:
            disable_rsync(device)
        else:
            enable_rsync(device)

For the above script to work, ensure ``/etc/rsync.d/`` conf files are
included, by specifying ``&include`` in your ``rsync.conf`` file:

.. code::

    &include /etc/rsync.d

Use this in conjunction with a cron job to periodically run the script, for
example:

.. highlight:: none

.. code::

    # /etc/cron.d/devicecheck
    * * * * * root /some/path/to/disable_rsync.py

.. _dispersion_report:

-----------------
Dispersion Report
-----------------

There is a swift-dispersion-report tool for measuring overall cluster health.
This is accomplished by checking if a set of deliberately distributed
containers and objects are currently in their proper places within the cluster.

For instance, a common deployment has three replicas of each object. The health
of that object can be measured by checking if each replica is in its proper
place. If only 2 of the 3 are in place, the object's health can be said to be
at 66.66%, where 100% would be perfect.

A single object's health, especially an older object, usually reflects the
health of the entire partition the object is in. If we make enough objects on
a distinct percentage of the partitions in the cluster, we can get a pretty
valid estimate of the overall cluster health. In practice, about 1% partition
coverage seems to balance well between accuracy and the amount of time it takes
to gather results.

The first thing that needs to be done to provide this health value is to create
a new account solely for this usage. Next, we need to place the containers and
objects throughout the system so that they are on distinct partitions. The
swift-dispersion-populate tool does this by making up random container and
object names until they fall on distinct partitions. Last, and repeatedly for
the life of the cluster, we need to run the swift-dispersion-report tool to
check the health of each of these containers and objects.

.. highlight:: cfg

These tools need direct access to the entire cluster and to the ring files
(installing them on a proxy server will probably do). Both
swift-dispersion-populate and swift-dispersion-report use the same
configuration file, /etc/swift/dispersion.conf. Example conf file::

    [dispersion]
    auth_url = http://localhost:8080/auth/v1.0
    auth_user = test:tester
    auth_key = testing
    endpoint_type = internalURL

.. highlight:: none

There are also options for the conf file for specifying the dispersion coverage
(defaults to 1%), retries, concurrency, etc. though usually the defaults are
fine. If you want to use keystone v3 for authentication there are options like
auth_version, user_domain_name, project_domain_name and project_name.

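.. highlight:: cfg

For example, a keystone v3 configuration might look like the following; the
URL, domain and project values here are only placeholders for your own
deployment::

    [dispersion]
    auth_version = 3
    auth_url = http://localhost:5000/v3/
    auth_user = tester
    auth_key = testing
    user_domain_name = Default
    project_domain_name = Default
    project_name = test
    endpoint_type = internalURL

.. highlight:: none
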
Once the configuration is in place, run `swift-dispersion-populate` to populate
the containers and objects throughout the cluster.

Now that those containers and objects are in place, you can run
`swift-dispersion-report` to get a dispersion report, or the overall health of
the cluster. Here is an example of a cluster in perfect health::

    $ swift-dispersion-report
    Queried 2621 containers for dispersion reporting, 19s, 0 retries
    100.00% of container copies found (7863 of 7863)
    Sample represents 1.00% of the container partition space

    Queried 2619 objects for dispersion reporting, 7s, 0 retries
    100.00% of object copies found (7857 of 7857)
    Sample represents 1.00% of the object partition space

Now I'll deliberately double the weight of a device in the object ring (with
replication turned off) and rerun the dispersion report to show what impact
that has::

    $ swift-ring-builder object.builder set_weight d0 200
    $ swift-ring-builder object.builder rebalance
    ...
    $ swift-dispersion-report
    Queried 2621 containers for dispersion reporting, 8s, 0 retries
    100.00% of container copies found (7863 of 7863)
    Sample represents 1.00% of the container partition space

    Queried 2619 objects for dispersion reporting, 7s, 0 retries
    There were 1763 partitions missing one copy.
    77.56% of object copies found (6094 of 7857)
    Sample represents 1.00% of the object partition space

You can see the health of the objects in the cluster has gone down
significantly. Of course, I only have four devices in this test environment;
in a production environment with many, many devices the impact of one device
change is much less. Next, I'll run the replicators to get everything put back
into place and then rerun the dispersion report::

    ... start object replicators and monitor logs until they're caught up ...
    $ swift-dispersion-report
    Queried 2621 containers for dispersion reporting, 17s, 0 retries
    100.00% of container copies found (7863 of 7863)
    Sample represents 1.00% of the container partition space

    Queried 2619 objects for dispersion reporting, 7s, 0 retries
    100.00% of object copies found (7857 of 7857)
    Sample represents 1.00% of the object partition space

You can also run the report for only containers or objects::

    $ swift-dispersion-report --container-only
    Queried 2621 containers for dispersion reporting, 17s, 0 retries
    100.00% of container copies found (7863 of 7863)
    Sample represents 1.00% of the container partition space

    $ swift-dispersion-report --object-only
    Queried 2619 objects for dispersion reporting, 7s, 0 retries
    100.00% of object copies found (7857 of 7857)
    Sample represents 1.00% of the object partition space

Alternatively, the dispersion report can also be output in JSON format. This
allows it to be more easily consumed by third party utilities::

    $ swift-dispersion-report -j
    {"object": {"retries:": 0, "missing_two": 0, "copies_found": 7863, "missing_one": 0, "copies_expected": 7863, "pct_found": 100.0, "overlapping": 0, "missing_all": 0}, "container": {"retries:": 0, "missing_two": 0, "copies_found": 12534, "missing_one": 0, "copies_expected": 12534, "pct_found": 100.0, "overlapping": 15, "missing_all": 0}}

Note that you may select which storage policy to use by setting the option
'--policy-name silver' or '-P silver' (silver is the example policy name here).
If no policy is specified, the default will be used per the swift.conf file.
When you specify a policy the containers created also include the policy index,
so even when running a container-only report, you will need to specify the
policy if you are not using the default.

-----------------------------------------------
Geographically Distributed Swift Considerations
-----------------------------------------------

Swift provides two features that may be used to distribute replicas of objects
across multiple geographically distributed data-centers: with
:doc:`overview_global_cluster` object replicas may be dispersed across devices
from different data-centers by using `regions` in ring device descriptors; with
:doc:`overview_container_sync` objects may be copied between independent Swift
clusters in each data-center. The operation and configuration of each are
described in their respective documentation. The following points should be
considered when selecting the feature that is most appropriate for a particular
use case:

#. Global Clusters allows the distribution of object replicas across
   data-centers to be controlled by the cluster operator on a per-policy basis,
   since the distribution is determined by the assignment of devices from
   each data-center in each policy's ring file. With Container Sync the end
   user controls the distribution of objects across clusters on a
   per-container basis.

#. Global Clusters requires an operator to coordinate ring deployments across
   multiple data-centers. Container Sync allows for independent management of
   separate Swift clusters in each data-center, and for existing Swift
   clusters to be used as peers in Container Sync relationships without
   deploying new policies/rings.

#. Global Clusters seamlessly supports features that may rely on
   cross-container operations such as large objects and versioned writes.
   Container Sync requires the end user to ensure that all required
   containers are sync'd for these features to work in all data-centers.

#. Global Clusters makes objects available for GET or HEAD requests in both
   data-centers even if a replica of the object has not yet been
   asynchronously migrated between data-centers, by forwarding requests
   between data-centers. Container Sync is unable to serve requests for an
   object in a particular data-center until the asynchronous sync process has
   copied the object to that data-center.

#. Global Clusters may require less storage capacity than Container Sync to
   achieve equivalent durability of objects in each data-center. Global
   Clusters can restore replicas that are lost or corrupted in one
   data-center using replicas from other data-centers. Container Sync
   requires each data-center to independently manage the durability of
   objects, which may result in each data-center storing more replicas than
   with Global Clusters.

#. Global Clusters execute all account/container metadata updates
   synchronously to account/container replicas in all data-centers, which may
   incur delays when making updates across WANs. Container Sync only copies
   objects between data-centers and all Swift internal traffic is
   confined to each data-center.

#. Global Clusters does not yet guarantee the availability of objects stored
   in Erasure Coded policies when one data-center is offline. With Container
   Sync the availability of objects in each data-center is independent of the
   state of other data-centers once objects have been synced. Container Sync
   also allows objects to be stored using different policy types in different
   data-centers.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Checking handoff partition distribution
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can check if handoff partitions are piling up on a server by
comparing the expected number of partitions with the actual number on
your disks. First get the number of partitions that are currently
assigned to a server using the ``dispersion`` command from
``swift-ring-builder``::

    swift-ring-builder sample.builder dispersion --verbose
    Dispersion is 0.000000, Balance is 0.000000, Overload is 0.00%
    Required overload is 0.000000%
    --------------------------------------------------------------------------
    Tier                     Parts      %   Max     0     1     2     3
    --------------------------------------------------------------------------
    r1                        8192   0.00     2     0     0  8192     0
    r1z1                      4096   0.00     1  4096  4096     0     0
    r1z1-172.16.10.1          4096   0.00     1  4096  4096     0     0
    r1z1-172.16.10.1/sda1     4096   0.00     1  4096  4096     0     0
    r1z2                      4096   0.00     1  4096  4096     0     0
    r1z2-172.16.10.2          4096   0.00     1  4096  4096     0     0
    r1z2-172.16.10.2/sda1     4096   0.00     1  4096  4096     0     0
    r1z3                      4096   0.00     1  4096  4096     0     0
    r1z3-172.16.10.3          4096   0.00     1  4096  4096     0     0
    r1z3-172.16.10.3/sda1     4096   0.00     1  4096  4096     0     0
    r1z4                      4096   0.00     1  4096  4096     0     0
    r1z4-172.16.20.4          4096   0.00     1  4096  4096     0     0
    r1z4-172.16.20.4/sda1     4096   0.00     1  4096  4096     0     0
    r2                        8192   0.00     2     0  8192     0     0
    r2z1                      4096   0.00     1  4096  4096     0     0
    r2z1-172.16.20.1          4096   0.00     1  4096  4096     0     0
    r2z1-172.16.20.1/sda1     4096   0.00     1  4096  4096     0     0
    r2z2                      4096   0.00     1  4096  4096     0     0
    r2z2-172.16.20.2          4096   0.00     1  4096  4096     0     0
    r2z2-172.16.20.2/sda1     4096   0.00     1  4096  4096     0     0

As you can see from the output, each server should store 4096 partitions, and
each region should store 8192 partitions. This example used a partition power
of 13 and 3 replicas.

With write_affinity enabled it is expected to have a higher number of
partitions on disk compared to the value reported by the
swift-ring-builder dispersion command. The number of additional (handoff)
partitions in region r1 depends on your cluster size, the amount
of incoming data as well as the replication speed.

Let's use the example from above with 6 nodes in 2 regions, and write_affinity
configured to write to region r1 first. `swift-ring-builder` reported that
each node should store 4096 partitions::

    Expected partitions for region r2: 8192
    Handoffs stored across 4 nodes in region r1: 8192 / 4 = 2048
    Maximum number of partitions on each server in region r1: 2048 + 4096 = 6144

Worst case is that handoff partitions in region 1 are populated with new
object replicas faster than replication is able to move them to region 2.
In that case you will see ~ 6144 partitions per
server in region r1. Your actual number should be lower and
between 4096 and 6144 partitions (preferably on the lower side).

Now count the number of object partitions on a given server in region 1,
for example on 172.16.10.1. Note that the pathnames might be
different; `/srv/node/` is the default mount location, and `objects`
applies only to storage policy 0 (storage policy 1 would use
`objects-1` and so on)::

    find -L /srv/node/ -maxdepth 3 -type d -wholename "*objects/*" | wc -l

If this number is always on the upper end of the expected partition
number range (4096 to 6144) or increasing, you should check your
replication speed and maybe even disable write_affinity.
Please refer to the next section on how to collect metrics from Swift, and
especially to :ref:`swift-recon -r <recon-replication>` for how to check
replication stats.


.. _cluster_telemetry_and_monitoring:

--------------------------------
Cluster Telemetry and Monitoring
--------------------------------

Various metrics and telemetry can be obtained from the account, container, and
object servers using the recon server middleware and the swift-recon cli. To do
so, update your account, container, or object servers pipelines to include
recon and add the associated filter config.

.. highlight:: cfg

object-server.conf sample::

    [pipeline:main]
    pipeline = recon object-server

    [filter:recon]
    use = egg:swift#recon
    recon_cache_path = /var/cache/swift

container-server.conf sample::

    [pipeline:main]
    pipeline = recon container-server

    [filter:recon]
    use = egg:swift#recon
    recon_cache_path = /var/cache/swift

account-server.conf sample::

    [pipeline:main]
    pipeline = recon account-server

    [filter:recon]
    use = egg:swift#recon
    recon_cache_path = /var/cache/swift

.. highlight:: none

The recon_cache_path simply sets the directory where stats for a few items will
be stored. Depending on the method of deployment you may need to create this
directory manually and ensure that Swift has read/write access.

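For example, on an installation where the Swift services run as the ``swift``
user (adjust the user and group to match your deployment), the cache directory
could be created like this::

    mkdir -p /var/cache/swift
    chown swift:swift /var/cache/swift
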
Finally, if you also wish to track asynchronous pendings on your object
servers you will need to set up a cron job to run the swift-recon-cron script
periodically on your object servers::

    */5 * * * * swift /usr/bin/swift-recon-cron /etc/swift/object-server.conf

Once the recon middleware is enabled, a GET request for
"/recon/<metric>" to the backend object server will return a
JSON-formatted response::

    fhines@ubuntu:~$ curl -i http://localhost:6030/recon/async
    HTTP/1.1 200 OK
    Content-Type: application/json
    Content-Length: 20
    Date: Tue, 18 Oct 2011 21:03:01 GMT

    {"async_pending": 0}


Note that the default port for the object server is 6200, except on a
Swift All-In-One installation, which uses 6010, 6020, 6030, and 6040.

The following metrics and telemetry are currently exposed:

========================= ========================================================================================
Request URI               Description
------------------------- ----------------------------------------------------------------------------------------
/recon/load               returns 1, 5, and 15 minute load average
/recon/mem                returns /proc/meminfo
/recon/mounted            returns *ALL* currently mounted filesystems
/recon/unmounted          returns all unmounted drives if mount_check = True
/recon/diskusage          returns disk utilization for storage devices
/recon/driveaudit         returns # of drive audit errors
/recon/ringmd5            returns object/container/account ring md5sums
/recon/swiftconfmd5       returns swift.conf md5sum
/recon/quarantined        returns # of quarantined objects/accounts/containers
/recon/sockstat           returns consumable info from /proc/net/sockstat|6
/recon/devices            returns list of devices and devices dir i.e. /srv/node
/recon/async              returns count of async pending
/recon/replication        returns object replication info (for backward compatibility)
/recon/replication/<type> returns replication info for given type (account, container, object)
/recon/auditor/<type>     returns auditor stats on last reported scan for given type (account, container, object)
/recon/updater/<type>     returns last updater sweep times for given type (container, object)
/recon/expirer/object     returns time elapsed and number of objects deleted during last object expirer sweep
/recon/version            returns Swift version
/recon/time               returns node time
========================= ========================================================================================

Note that 'object_replication_last' and 'object_replication_time' in object
replication info are considered to be transitional and will be removed in
the subsequent releases. Use 'replication_last' and 'replication_time' instead.

This information can also be queried via the swift-recon command line utility::

    fhines@ubuntu:~$ swift-recon -h
    Usage:
        usage: swift-recon <server_type> [-v] [--suppress] [-a] [-r] [-u] [-d]
        [-l] [-T] [--md5] [--auditor] [--updater] [--expirer] [--sockstat]

        <server_type>  account|container|object
        Defaults to object server.

        ex: swift-recon container -l --auditor

    Options:
      -h, --help            show this help message and exit
      -v, --verbose         Print verbose info
      --suppress            Suppress most connection related errors
      -a, --async           Get async stats
      -r, --replication     Get replication stats
      --auditor             Get auditor stats
      --updater             Get updater stats
      --expirer             Get expirer stats
      -u, --unmounted       Check cluster for unmounted devices
      -d, --diskusage       Get disk usage stats
      -l, --loadstats       Get cluster load average stats
      -q, --quarantined     Get cluster quarantine stats
      --md5                 Get md5sum of servers ring and compare to local copy
      --sockstat            Get cluster socket usage stats
      -T, --time            Check time synchronization
      --all                 Perform all checks. Equal to
                            -arudlqT --md5 --sockstat --auditor --updater
                            --expirer --driveaudit --validate-servers
      -z ZONE, --zone=ZONE  Only query servers in specified zone
      -t SECONDS, --timeout=SECONDS
                            Time to wait for a response from a server
      --swiftdir=SWIFTDIR   Default = /etc/swift

.. _recon-replication:

For example, to obtain container replication info from all hosts in zone "3"::

    fhines@ubuntu:~$ swift-recon container -r --zone 3
    ===============================================================================
    --> Starting reconnaissance on 1 hosts
    ===============================================================================
    [2012-04-02 02:45:48] Checking on replication
    [failure] low: 0.000, high: 0.000, avg: 0.000, reported: 1
    [success] low: 486.000, high: 486.000, avg: 486.000, reported: 1
    [replication_time] low: 20.853, high: 20.853, avg: 20.853, reported: 1
    [attempted] low: 243.000, high: 243.000, avg: 243.000, reported: 1

---------------------------
Reporting Metrics to StatsD
---------------------------

.. highlight:: cfg

If you have a StatsD_ server running, Swift may be configured to send it
real-time operational metrics. To enable this, set the following
configuration entries (see the sample configuration files)::

    log_statsd_host = localhost
    log_statsd_port = 8125
    log_statsd_default_sample_rate = 1.0
    log_statsd_sample_rate_factor = 1.0
    log_statsd_metric_prefix = [empty-string]

If `log_statsd_host` is not set, this feature is disabled. The default values
for the other settings are given above. The `log_statsd_host` can be a
hostname, an IPv4 address, or an IPv6 address (not surrounded with brackets, as
this is unnecessary since the port is specified separately). If a hostname
resolves to an IPv4 address, an IPv4 socket will be used to send StatsD UDP
packets, even if the hostname would also resolve to an IPv6 address.

.. _StatsD: http://codeascraft.etsy.com/2011/02/15/measure-anything-measure-everything/
.. _Graphite: http://graphiteapp.org/
.. _Ganglia: http://ganglia.sourceforge.net/

The sample rate is a real number between 0 and 1 which defines the
probability of sending a sample for any given event or timing measurement.
This sample rate is sent with each sample to StatsD and used to
multiply the value. For example, with a sample rate of 0.5, StatsD will
multiply that counter's value by 2 when flushing the metric to an upstream
monitoring system (Graphite_, Ganglia_, etc.).

Some relatively high-frequency metrics have a default sample rate less than
one. If you want to override the default sample rate for all metrics whose
default sample rate is not specified in the Swift source, you may set
`log_statsd_default_sample_rate` to a value less than one. This is NOT
recommended (see next paragraph). A better way to reduce StatsD load is to
adjust `log_statsd_sample_rate_factor` to a value less than one. The
`log_statsd_sample_rate_factor` is multiplied to any sample rate (either the
global default or one specified by the actual metric logging call in the Swift
source) prior to handling. In other words, this one tunable can lower the
frequency of all StatsD logging by a proportional amount.

To get the best data, start with the default `log_statsd_default_sample_rate`
and `log_statsd_sample_rate_factor` values of 1 and only lower
`log_statsd_sample_rate_factor` if needed. The
`log_statsd_default_sample_rate` should not be used and remains for backward
compatibility only.

The metric prefix will be prepended to every metric sent to the StatsD server.
For example, with::

    log_statsd_metric_prefix = proxy01

the metric `proxy-server.errors` would be sent to StatsD as
`proxy01.proxy-server.errors`. This is useful for differentiating different
servers when sending statistics to a central StatsD server. If you run a local
StatsD server per node, you could configure a per-node metrics prefix there and
leave `log_statsd_metric_prefix` blank.

Note that metrics reported to StatsD are counters or timing data (which are
sent in units of milliseconds). StatsD usually expands timing data out to min,
max, avg, count, and 90th percentile per timing metric, but the details of
this behavior will depend on the configuration of your StatsD server. Some
important "gauge" metrics may still need to be collected using another method.
For example, the `object-server.async_pendings` StatsD metric counts the
generation of async_pendings in real-time, but will not tell you the current
number of async_pending container updates on disk at any point in time.

Note also that the set of metrics collected, their names, and their semantics
are not locked down and will change over time.

Metrics for `account-auditor`:

========================== =========================================================
Metric Name                Description
-------------------------- ---------------------------------------------------------
`account-auditor.errors`   Count of audit runs (across all account databases) which
                           caught an Exception.
`account-auditor.passes`   Count of individual account databases which passed audit.
`account-auditor.failures` Count of individual account databases which failed audit.
`account-auditor.timing`   Timing data for individual account database audits.
========================== =========================================================

Metrics for `account-reaper`:

============================================== ====================================================
Metric Name                                    Description
---------------------------------------------- ----------------------------------------------------
`account-reaper.errors`                        Count of devices failing the mount check.
`account-reaper.timing`                        Timing data for each reap_account() call.
`account-reaper.return_codes.X`                Count of HTTP return codes from various operations
                                               (e.g. object listing, container deletion, etc.). The
                                               value for X is the first digit of the return code
                                               (2 for 201, 4 for 404, etc.).
`account-reaper.containers_failures`           Count of failures to delete a container.
`account-reaper.containers_deleted`            Count of containers successfully deleted.
`account-reaper.containers_remaining`          Count of containers which failed to delete with
                                               zero successes.
`account-reaper.containers_possibly_remaining` Count of containers which failed to delete with
                                               at least one success.
`account-reaper.objects_failures`              Count of failures to delete an object.
`account-reaper.objects_deleted`               Count of objects successfully deleted.
`account-reaper.objects_remaining`             Count of objects which failed to delete with zero
                                               successes.
`account-reaper.objects_possibly_remaining`    Count of objects which failed to delete with at
                                               least one success.
============================================== ====================================================

Metrics for `account-server` ("Not Found" is not considered an error and requests
which increment `errors` are not included in the timing data):

======================================== =======================================================
Metric Name                              Description
---------------------------------------- -------------------------------------------------------
`account-server.DELETE.errors.timing`    Timing data for each DELETE request resulting in an
                                         error: bad request, not mounted, missing timestamp.
`account-server.DELETE.timing`           Timing data for each DELETE request not resulting in
                                         an error.
`account-server.PUT.errors.timing`       Timing data for each PUT request resulting in an error:
                                         bad request, not mounted, conflict, recently-deleted.
`account-server.PUT.timing`              Timing data for each PUT request not resulting in an
                                         error.
`account-server.HEAD.errors.timing`      Timing data for each HEAD request resulting in an
                                         error: bad request, not mounted.
`account-server.HEAD.timing`             Timing data for each HEAD request not resulting in
                                         an error.
`account-server.GET.errors.timing`       Timing data for each GET request resulting in an
                                         error: bad request, not mounted, bad delimiter,
                                         account listing limit too high, bad accept header.
`account-server.GET.timing`              Timing data for each GET request not resulting in
                                         an error.
`account-server.REPLICATE.errors.timing` Timing data for each REPLICATE request resulting in an
                                         error: bad request, not mounted.
`account-server.REPLICATE.timing`        Timing data for each REPLICATE request not resulting
                                         in an error.
`account-server.POST.errors.timing`      Timing data for each POST request resulting in an
                                         error: bad request, bad or missing timestamp, not
                                         mounted.
`account-server.POST.timing`             Timing data for each POST request not resulting in
                                         an error.
======================================== =======================================================

Metrics for `account-replicator`:

===================================== ====================================================
Metric Name                           Description
------------------------------------- ----------------------------------------------------
`account-replicator.diffs`            Count of syncs handled by sending differing rows.
`account-replicator.diff_caps`        Count of "diffs" operations which failed because
                                      "max_diffs" was hit.
`account-replicator.no_changes`       Count of accounts found to be in sync.
`account-replicator.hashmatches`      Count of accounts found to be in sync via hash
                                      comparison (`broker.merge_syncs` was called).
`account-replicator.rsyncs`           Count of completely missing accounts which were sent
                                      via rsync.
`account-replicator.remote_merges`    Count of syncs handled by sending entire database
                                      via rsync.
`account-replicator.attempts`         Count of database replication attempts.
`account-replicator.failures`         Count of database replication attempts which failed
                                      due to corruption (quarantined) or inability to read
                                      as well as attempts to individual nodes which
                                      failed.
`account-replicator.removes.<device>` Count of databases on <device> deleted because the
                                      delete_timestamp was greater than the put_timestamp
                                      and the database had no rows or because it was
                                      successfully sync'ed to other locations and doesn't
                                      belong here anymore.
`account-replicator.successes`        Count of replication attempts to an individual node
                                      which were successful.
`account-replicator.timing`           Timing data for each database replication attempt
                                      not resulting in a failure.
===================================== ====================================================

Metrics for `container-auditor`:

============================ ====================================================
Metric Name                  Description
---------------------------- ----------------------------------------------------
`container-auditor.errors`   Incremented when an Exception is caught in an audit
                             pass (only once per pass, max).
`container-auditor.passes`   Count of individual containers passing an audit.
`container-auditor.failures` Count of individual containers failing an audit.
`container-auditor.timing`   Timing data for each container audit.
============================ ====================================================

Metrics for `container-replicator`:

======================================= ====================================================
Metric Name                             Description
--------------------------------------- ----------------------------------------------------
`container-replicator.diffs`            Count of syncs handled by sending differing rows.
`container-replicator.diff_caps`        Count of "diffs" operations which failed because
                                        "max_diffs" was hit.
`container-replicator.no_changes`       Count of containers found to be in sync.
`container-replicator.hashmatches`      Count of containers found to be in sync via hash
                                        comparison (`broker.merge_syncs` was called).
`container-replicator.rsyncs`           Count of completely missing containers which were
                                        sent via rsync.
`container-replicator.remote_merges`    Count of syncs handled by sending entire database
                                        via rsync.
`container-replicator.attempts`         Count of database replication attempts.
`container-replicator.failures`         Count of database replication attempts which failed
                                        due to corruption (quarantined) or inability to read
                                        as well as attempts to individual nodes which
                                        failed.
`container-replicator.removes.<device>` Count of databases deleted on <device> because the
                                        delete_timestamp was greater than the put_timestamp
                                        and the database had no rows or because it was
                                        successfully sync'ed to other locations and doesn't
                                        belong here anymore.
`container-replicator.successes`        Count of replication attempts to an individual node
                                        which were successful.
`container-replicator.timing`           Timing data for each database replication attempt
                                        not resulting in a failure.
======================================= ====================================================

Metrics for `container-server` ("Not Found" is not considered an error and requests
which increment `errors` are not included in the timing data):

========================================== ====================================================
Metric Name                                Description
------------------------------------------ ----------------------------------------------------
`container-server.DELETE.errors.timing`    Timing data for DELETE request errors: bad request,
                                           not mounted, missing timestamp, conflict.
`container-server.DELETE.timing`           Timing data for each DELETE request not resulting in
                                           an error.
`container-server.PUT.errors.timing`       Timing data for PUT request errors: bad request,
                                           missing timestamp, not mounted, conflict.
`container-server.PUT.timing`              Timing data for each PUT request not resulting in an
                                           error.
`container-server.HEAD.errors.timing`      Timing data for HEAD request errors: bad request,
                                           not mounted.
`container-server.HEAD.timing`             Timing data for each HEAD request not resulting in
                                           an error.
`container-server.GET.errors.timing`       Timing data for GET request errors: bad request,
                                           not mounted, parameters not utf8, bad accept header.
`container-server.GET.timing`              Timing data for each GET request not resulting in
                                           an error.
`container-server.REPLICATE.errors.timing` Timing data for REPLICATE request errors: bad
                                           request, not mounted.
`container-server.REPLICATE.timing`        Timing data for each REPLICATE request not resulting
                                           in an error.
`container-server.POST.errors.timing`      Timing data for POST request errors: bad request,
                                           bad x-container-sync-to, not mounted.
`container-server.POST.timing`             Timing data for each POST request not resulting in
                                           an error.
========================================== ====================================================

Metrics for `container-sync`:

=============================== ====================================================
Metric Name                     Description
------------------------------- ----------------------------------------------------
`container-sync.skips`          Count of containers skipped because they don't have
                                sync'ing enabled.
`container-sync.failures`       Count of failures sync'ing of individual containers.
`container-sync.syncs`          Count of individual containers sync'ed successfully.
`container-sync.deletes`        Count of container database rows sync'ed by
                                deletion.
`container-sync.deletes.timing` Timing data for each container database row
                                synchronization via deletion.
`container-sync.puts`           Count of container database rows sync'ed by Putting.
`container-sync.puts.timing`    Timing data for each container database row
                                synchronization via Putting.
=============================== ====================================================

Metrics for `container-updater`:

============================== ====================================================
Metric Name                    Description
------------------------------ ----------------------------------------------------
`container-updater.successes`  Count of containers which successfully updated their
                               account.
`container-updater.failures`   Count of containers which failed to update their
                               account.
`container-updater.no_changes` Count of containers which didn't need to update
                               their account.
`container-updater.timing`     Timing data for processing a container; only
                               includes timing for containers which needed to
                               update their accounts (i.e. "successes" and
                               "failures" but not "no_changes").
============================== ====================================================

Metrics for `object-auditor`:

============================ ====================================================
Metric Name                  Description
---------------------------- ----------------------------------------------------
`object-auditor.quarantines` Count of objects failing audit and quarantined.
`object-auditor.errors`      Count of errors encountered while auditing objects.
`object-auditor.timing`      Timing data for each object audit (does not include
                             any rate-limiting sleep time for
                             max_files_per_second, but does include rate-limiting
                             sleep time for max_bytes_per_second).
============================ ====================================================

Metrics for `object-expirer`:

======================== ====================================================
Metric Name              Description
------------------------ ----------------------------------------------------
`object-expirer.objects` Count of objects expired.
`object-expirer.errors`  Count of errors encountered while attempting to
                         expire an object.
`object-expirer.timing`  Timing data for each object expiration attempt,
                         including ones resulting in an error.
======================== ====================================================


Metrics for `object-reconstructor`:

====================================================== ======================================================
Metric Name                                            Description
------------------------------------------------------ ------------------------------------------------------
`object-reconstructor.partition.delete.count.<device>` A count of partitions on <device> which were
                                                       reconstructed and synced to another node because they
                                                       didn't belong on this node. This metric is tracked
                                                       per-device to allow for "quiescence detection" for
                                                       object reconstruction activity on each device.
`object-reconstructor.partition.delete.timing`         Timing data for partitions reconstructed and synced to
                                                       another node because they didn't belong on this node.
                                                       This metric is not tracked per device.
`object-reconstructor.partition.update.count.<device>` A count of partitions on <device> which were
                                                       reconstructed and synced to another node, but also
                                                       belong on this node. As with delete.count, this metric
                                                       is tracked per-device.
`object-reconstructor.partition.update.timing`         Timing data for partitions reconstructed which also
                                                       belong on this node. This metric is not tracked
                                                       per-device.
`object-reconstructor.suffix.hashes`                   Count of suffix directories whose hash (of filenames)
                                                       was recalculated.
`object-reconstructor.suffix.syncs`                    Count of suffix directories reconstructed with ssync.
====================================================== ======================================================

Metrics for `object-replicator`:

=================================================== ====================================================
Metric Name                                         Description
--------------------------------------------------- ----------------------------------------------------
`object-replicator.partition.delete.count.<device>` A count of partitions on <device> which were
                                                    replicated to another node because they didn't
                                                    belong on this node. This metric is tracked
                                                    per-device to allow for "quiescence detection" for
                                                    object replication activity on each device.
`object-replicator.partition.delete.timing`         Timing data for partitions replicated to another
                                                    node because they didn't belong on this node. This
                                                    metric is not tracked per device.
`object-replicator.partition.update.count.<device>` A count of partitions on <device> which were
                                                    replicated to another node, but also belong on this
                                                    node. As with delete.count, this metric is tracked
                                                    per-device.
`object-replicator.partition.update.timing`         Timing data for partitions replicated which also
                                                    belong on this node. This metric is not tracked
                                                    per-device.
`object-replicator.suffix.hashes`                   Count of suffix directories whose hash (of filenames)
                                                    was recalculated.
`object-replicator.suffix.syncs`                    Count of suffix directories replicated with rsync.
=================================================== ====================================================

Metrics for `object-server`:

======================================= ====================================================
Metric Name                             Description
--------------------------------------- ----------------------------------------------------
`object-server.quarantines`             Count of objects (files) found bad and moved to
                                        quarantine.
`object-server.async_pendings`          Count of container updates saved as async_pendings
                                        (may result from PUT or DELETE requests).
`object-server.POST.errors.timing`      Timing data for POST request errors: bad request,
                                        missing timestamp, delete-at in past, not mounted.
`object-server.POST.timing`             Timing data for each POST request not resulting in
                                        an error.
`object-server.PUT.errors.timing`       Timing data for PUT request errors: bad request,
                                        not mounted, missing timestamp, object creation
                                        constraint violation, delete-at in past.
`object-server.PUT.timeouts`            Count of object PUTs which exceeded max_upload_time.
`object-server.PUT.timing`              Timing data for each PUT request not resulting in an
                                        error.
`object-server.PUT.<device>.timing`     Timing data per kB transferred (ms/kB) for each
                                        non-zero-byte PUT request on each device. Useful
                                        for monitoring problematic devices; higher is worse.
`object-server.GET.errors.timing`       Timing data for GET request errors: bad request,
                                        not mounted, header timestamps before the epoch,
                                        precondition failed. File errors resulting in a
                                        quarantine are not counted here.
`object-server.GET.timing`              Timing data for each GET request not resulting in an
                                        error. Includes requests which couldn't find the
                                        object (including disk errors resulting in file
                                        quarantine).
`object-server.HEAD.errors.timing`      Timing data for HEAD request errors: bad request,
                                        not mounted.
`object-server.HEAD.timing`             Timing data for each HEAD request not resulting in
                                        an error. Includes requests which couldn't find the
                                        object (including disk errors resulting in file
                                        quarantine).
`object-server.DELETE.errors.timing`    Timing data for DELETE request errors: bad request,
                                        missing timestamp, not mounted, precondition
                                        failed. Includes requests which couldn't find or
                                        match the object.
`object-server.DELETE.timing`           Timing data for each DELETE request not resulting
                                        in an error.
`object-server.REPLICATE.errors.timing` Timing data for REPLICATE request errors: bad
                                        request, not mounted.
`object-server.REPLICATE.timing`        Timing data for each REPLICATE request not resulting
                                        in an error.
======================================= ====================================================

Metrics for `object-updater`:

============================ ====================================================
Metric Name                  Description
---------------------------- ----------------------------------------------------
`object-updater.errors`      Count of drives not mounted or async_pending files
                             with an unexpected name.
`object-updater.timing`      Timing data for object sweeps to flush async_pending
                             container updates. Does not include object sweeps
                             which did not find an existing async_pending storage
                             directory.
`object-updater.quarantines` Count of async_pending container updates which were
                             corrupted and moved to quarantine.
`object-updater.successes`   Count of successful container updates.
`object-updater.failures`    Count of failed container updates.
`object-updater.unlinks`     Count of async_pending files unlinked. An
                             async_pending file is unlinked either when it is
                             successfully processed or when the replicator sees
                             that there is a newer async_pending file for the
                             same object.
============================ ====================================================

Metrics for `proxy-server` (in the table, `<type>` is the proxy-server
controller responsible for the request and will be one of "account",
"container", or "object"):

======================================== ====================================================
Metric Name                              Description
---------------------------------------- ----------------------------------------------------
`proxy-server.errors`                    Count of errors encountered while serving requests
                                         before the controller type is determined. Includes
                                         invalid Content-Length, errors finding the internal
                                         controller to handle the request, invalid utf8, and
                                         bad URLs.
`proxy-server.<type>.handoff_count`      Count of node hand-offs; only tracked if log_handoffs
                                         is set in the proxy-server config.
`proxy-server.<type>.handoff_all_count`  Count of times *only* hand-off locations were
                                         utilized; only tracked if log_handoffs is set in the
                                         proxy-server config.
`proxy-server.<type>.client_timeouts`    Count of client timeouts (client did not read within
                                         `client_timeout` seconds during a GET or did not
                                         supply data within `client_timeout` seconds during
                                         a PUT).
`proxy-server.<type>.client_disconnects` Count of detected client disconnects during PUT
                                         operations (does NOT include caught Exceptions in
                                         the proxy-server which caused a client disconnect).
======================================== ====================================================
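
The hand-off metrics above are only emitted when hand-off logging is enabled.
As a minimal sketch (the section layout follows a typical ``proxy-server.conf``;
adjust to your deployment), the relevant option looks like::

    [app:proxy-server]
    use = egg:swift#proxy
    # emit proxy-server.<type>.handoff_count / handoff_all_count metrics
    log_handoffs = true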

Metrics for `proxy-logging` middleware (in the table, `<type>` is the
proxy-server controller responsible for the request ("account", "container",
or "object"), or the string "SOS" if the request came from the
`Swift Origin Server`_ middleware. The `<verb>` portion will be one of "GET",
"HEAD", "POST", "PUT", "DELETE", "COPY", "OPTIONS", or "BAD_METHOD". The list
of valid HTTP methods is configurable via the `log_statsd_valid_http_methods`
config variable, and the default setting yields the above behavior):

.. _Swift Origin Server: https://github.com/dpgoetz/sos

==================================================== ============================================
Metric Name                                          Description
---------------------------------------------------- --------------------------------------------
`proxy-server.<type>.<verb>.<status>.timing`         Timing data for requests, start to finish.
                                                     The <status> portion is the numeric HTTP
                                                     status code for the request (e.g. "200" or
                                                     "404").
`proxy-server.<type>.GET.<status>.first-byte.timing` Timing data up to completion of sending the
                                                     response headers (only for GET requests).
                                                     <status> and <type> are as for the main
                                                     timing metric.
`proxy-server.<type>.<verb>.<status>.xfer`           This counter metric is the sum of bytes
                                                     transferred in (from clients) and out (to
                                                     clients) for requests. The <type>, <verb>,
                                                     and <status> portions of the metric are just
                                                     like the main timing metric.
==================================================== ============================================
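
These metrics are emitted as standard StatsD counters and timers over UDP. A
minimal sketch of the related settings (shown with illustrative values; check
your configuration samples for the authoritative option placement)::

    [DEFAULT]
    # where the StatsD daemon listens
    log_statsd_host = 127.0.0.1
    log_statsd_port = 8125
    # verbs outside this list are reported as BAD_METHOD
    log_statsd_valid_http_methods = GET,HEAD,POST,PUT,DELETE,COPY,OPTIONS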

The `proxy-logging` middleware also groups these metrics by policy (in the
table, `<policy-index>` is the policy index):

========================================================================== =====================================
Metric Name                                                                Description
-------------------------------------------------------------------------- -------------------------------------
`proxy-server.object.policy.<policy-index>.<verb>.<status>.timing`         Timing data for requests, aggregated
                                                                           by policy index.
`proxy-server.object.policy.<policy-index>.GET.<status>.first-byte.timing` Timing data up to completion of
                                                                           sending the response headers,
                                                                           aggregated by policy index.
`proxy-server.object.policy.<policy-index>.<verb>.<status>.xfer`           Sum of bytes transferred in and out,
                                                                           aggregated by policy index.
========================================================================== =====================================

Metrics for `tempauth` middleware (in the table, `<reseller_prefix>` represents
the actual configured reseller_prefix or "`NONE`" if the reseller_prefix is the
empty string):

========================================= ====================================================
Metric Name                               Description
----------------------------------------- ----------------------------------------------------
`tempauth.<reseller_prefix>.unauthorized` Count of regular requests which were denied with
                                          HTTPUnauthorized.
`tempauth.<reseller_prefix>.forbidden`    Count of regular requests which were denied with
                                          HTTPForbidden.
`tempauth.<reseller_prefix>.token_denied` Count of token requests which were denied.
`tempauth.<reseller_prefix>.errors`       Count of errors.
========================================= ====================================================
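
The `<reseller_prefix>` portion mirrors whatever prefix tempauth was configured
with. A minimal sketch (the prefix value here is only illustrative)::

    [filter:tempauth]
    use = egg:swift#tempauth
    reseller_prefix = AUTH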

------------------------
Debugging Tips and Tools
------------------------

When a request is made to Swift, it is given a unique transaction id. This
id should be in every log line that has to do with that request. This can
be useful when looking at all the services that are hit by a single request.

If you need to know where a specific account, container or object is in the
cluster, `swift-get-nodes` will show the location where each replica should be.
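
For example, assuming the rings live in ``/etc/swift``, a lookup for a
hypothetical object (the account, container and object names below are
placeholders) would look something like::

    swift-get-nodes /etc/swift/object.ring.gz AUTH_test my_container my_object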

If you are looking at an object on the server and need more info,
`swift-object-info` will display the account, container, replica locations
and metadata of the object.

If you are looking at a container on the server and need more info,
`swift-container-info` will display information such as the account,
container, replica locations and metadata of the container.

If you are looking at an account on the server and need more info,
`swift-account-info` will display the account, replica locations
and metadata of the account.

If you want to audit the data for an account, `swift-account-audit` can be
used to crawl the account, checking that all containers and objects can be
found.

-----------------
Managing Services
-----------------

Swift services are generally managed with ``swift-init``. The general usage is
``swift-init <service> <command>``, where service is the Swift service to
manage (for example object, container, account, proxy) and command is one of:

========== ===============================================
Command    Description
---------- -----------------------------------------------
start      Start the service
stop       Stop the service
restart    Restart the service
shutdown   Attempt to gracefully shut down the service
reload     Attempt to gracefully restart the service
========== ===============================================

A graceful shutdown or reload will finish any current requests before
completely stopping the old service. There is also a special case of
``swift-init all <command>``, which will run the command for all Swift
services.
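
For example, typical invocations (using the service names listed above) look
like::

    swift-init proxy-server reload
    swift-init object-server restart
    swift-init all start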

In cases where there are multiple configs for a service, a specific config
can be managed with ``swift-init <service>.<config> <command>``.
For example, when a separate replication network is used, there might be
``/etc/swift/object-server/public.conf`` for the object server and
``/etc/swift/object-server/replication.conf`` for the replication services.
In this case, the replication services could be restarted with
``swift-init object-server.replication restart``.

--------------
Object Auditor
--------------

On system failures, the XFS file system can sometimes truncate files it's
trying to write and produce zero-byte files. The object-auditor will catch
these problems but in the case of a system crash it would be advisable to run
an extra, less rate-limited sweep to check for these specific files. You can
run this command as follows::

    swift-object-auditor /path/to/object-server/config/file.conf once -z 1000

``-z`` means to only check for zero-byte files, at 1000 files per second.

At times it is useful to be able to run the object auditor on a specific
device or set of devices. You can run the object-auditor as follows::

    swift-object-auditor /path/to/object-server/config/file.conf once --devices=sda,sdb

This will run the object auditor on only the sda and sdb devices. This
parameter accepts a comma-separated list of values.

-----------------
Object Replicator
-----------------

At times it is useful to be able to run the object replicator on a specific
device or partition. You can run the object-replicator as follows::

    swift-object-replicator /path/to/object-server/config/file.conf once --devices=sda,sdb

This will run the object replicator on only the sda and sdb devices. You can
likewise run that command with ``--partitions``. Both parameters accept a
comma-separated list of values. If both are specified, they will be ANDed
together. These can only be run in "once" mode.
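
For example, to limit a run to two disks and a specific set of partitions
(the partition numbers here are purely illustrative), the two options can be
combined::

    swift-object-replicator /path/to/object-server/config/file.conf once --devices=sda,sdb --partitions=11288,32145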

-------------
Swift Orphans
-------------

Swift Orphans are processes left over after a reload of a Swift server.

For example, when upgrading a proxy server you would probably finish
with a ``swift-init proxy-server reload`` or ``/etc/init.d/swift-proxy
reload``. This kills the parent proxy server process and leaves the
child processes running to finish processing whatever requests they
might be handling at the time. It then starts up a new parent proxy
server process and its children to handle new incoming requests. This
allows zero-downtime upgrades with no impact on existing requests.

The orphaned child processes may take a while to exit, depending on
the length of the requests they were handling. However, sometimes an
old process can be hung up due to some bug or hardware issue. In these
cases, these orphaned processes will hang around
forever. ``swift-orphans`` can be used to find and kill these orphans.

``swift-orphans`` with no arguments will just list the orphans it finds
that were started more than 24 hours ago. You shouldn't really check
for orphans until 24 hours after you perform a reload, as some
requests can take a long time to process. ``swift-orphans -k TERM`` will
send the SIGTERM signal to the orphaned processes, or you can ``kill
-TERM`` the pids yourself if you prefer.

You can run ``swift-orphans --help`` for more options.

------------
Swift Oldies
------------

Swift Oldies are processes that have just been around for a long
time. There's nothing necessarily wrong with this, but it might
indicate a hung process if you regularly upgrade and reload/restart
services. You might have so many servers that you don't notice when a
reload/restart fails; ``swift-oldies`` can help with this.

For example, if you upgraded and reloaded/restarted everything 2 days
ago, and you've already cleaned up any orphans with ``swift-orphans``,
you can run ``swift-oldies -a 48`` to find any Swift processes still
around that were started more than 2 days ago and then investigate
them accordingly.

-------------------
Custom Log Handlers
-------------------

Swift supports setting up custom log handlers for services by specifying a
comma-separated list of functions to invoke when logging is set up. It does so
via the ``log_custom_handlers`` configuration option. Logger hooks invoked are
passed the same arguments as Swift's get_logger function (as well as the
getLogger and LogAdapter objects):

============== ===============================================
Name           Description
-------------- -----------------------------------------------
conf           Configuration dict to read settings from
name           Name of the logger received
log_to_console (optional) Write log messages to console on stderr
log_route      Route for the logging received
fmt            Override log format received
logger         The logging.getLogger object
adapted_logger The LogAdapter object
============== ===============================================

A basic example that sets up a custom logger might look like the
following:

.. code-block:: python

    def my_logger(conf, name, log_to_console, log_route, fmt, logger,
                  adapted_logger):
        my_conf_opt = conf.get('some_custom_setting')
        my_handler = third_party_logstore_handler(my_conf_opt)
        logger.addHandler(my_handler)
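
To activate a hook, list its dotted path in the ``log_custom_handlers`` option
of the server that should use it; the module path below is a placeholder for
wherever the function is actually installed::

    [DEFAULT]
    # comma-separated list of logging hook functions
    log_custom_handlers = mypackage.logging_hooks.my_logger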

See :ref:`custom-logger-hooks-label` for sample use cases.

------------------------
Securing OpenStack Swift
------------------------

Please refer to the security guide at https://docs.openstack.org/security-guide
and in particular the `Object Storage
<https://docs.openstack.org/security-guide/object-storage.html>`__ section.