=====================
Administrator's Guide
=====================

-------------------------
Defining Storage Policies
-------------------------

Defining your Storage Policies is very easy to do with Swift. It is important
that the administrator understand the concepts behind Storage Policies
before actually creating and using them in order to get the most benefit out
of the feature and, more importantly, to avoid having to make unnecessary changes
once a set of policies has been deployed to a cluster.

It is highly recommended that the reader fully read and comprehend
:doc:`overview_policies` before proceeding with administration of
policies. Plan carefully, and experiment first on a non-production cluster
to be certain that the desired configuration meets the needs of the users.
See :ref:`upgrade-policy` before planning the upgrade of your existing
deployment.

Following is a high-level view of the very few steps it takes to configure
policies once you have decided what you want to do:

#. Define your policies in ``/etc/swift/swift.conf``
#. Create the corresponding object rings
#. Communicate the names of the Storage Policies to cluster users

For a specific example that takes you through these steps, please see
:doc:`policies_saio`

------------------
Managing the Rings
------------------

You may build the storage rings on any server with the appropriate
version of Swift installed. Once built or changed (rebalanced), you
must distribute the rings to all the servers in the cluster. Storage
rings contain information about all the Swift storage partitions and
how they are distributed between the different nodes and disks.

Swift 1.6.0 is the last version to use a Python pickle format.
Subsequent versions use a different serialization format. **Rings
generated by Swift versions 1.6.0 and earlier may be read by any
version, but rings generated after 1.6.0 may only be read by Swift
versions greater than 1.6.0.** So when upgrading from version 1.6.0 or
earlier to a version greater than 1.6.0, either upgrade Swift on your
ring building server **last** after all Swift nodes have been successfully
upgraded, or refrain from generating rings until all Swift nodes have
been successfully upgraded.

If you need to downgrade from a version of Swift greater than 1.6.0 to
a version less than or equal to 1.6.0, first downgrade your ring-building
server, generate new rings, push them out, then continue with the rest
of the downgrade.

For more information see :doc:`overview_ring`.

Removing a device from the ring::

    swift-ring-builder <builder-file> remove <ip_address>/<device_name>

Removing a server from the ring::

    swift-ring-builder <builder-file> remove <ip_address>

Adding devices to the ring:

See :ref:`ring-preparing`

See what devices for a server are in the ring::

    swift-ring-builder <builder-file> search <ip_address>

Once you are done with all changes to the ring, the changes need to be
"committed"::

    swift-ring-builder <builder-file> rebalance

Once the new rings are built, they should be pushed out to all the servers
in the cluster.

Optionally, if invoked as ``swift-ring-builder-safe``, the directory containing
the specified builder file will be locked (via a .lock file in the parent
directory). This provides a basic safeguard against multiple instances
of swift-ring-builder (or other utilities that observe this lock)
attempting to write to or read the builder/ring files while operations are in
progress. This can be useful in environments where ring management has been
automated but the operator still needs to interact with the rings manually.

If the ring builder is not producing the balances that you are
expecting, you can gain visibility into what it's doing with the
``--debug`` flag::

    swift-ring-builder <builder-file> rebalance --debug

This produces a great deal of output that is mostly useful if you are
either (a) attempting to fix the ring builder, or (b) filing a bug
against the ring builder.

You may notice in the rebalance output a 'dispersion' number. What this
number means is explained in :ref:`ring_dispersion`, but in essence it
is the percentage of partitions in the ring that have too many replicas
within a particular failure domain. You can ask 'swift-ring-builder' what
the dispersion is with::

    swift-ring-builder <builder-file> dispersion

This will give you the percentage again. If you want a detailed view of
the dispersion, simply add ``--verbose``::

    swift-ring-builder <builder-file> dispersion --verbose

This will not only display the percentage but will also display a dispersion
table that lists partition dispersion by tier. You can use this table to figure
out where you need to add capacity or to help tune an :ref:`ring_overload` value.

Now let's take an example with 1 region, 3 zones and 4 devices. Each device has
the same weight, and the ``dispersion --verbose`` might show the following::

    Dispersion is 50.000000, Balance is 0.000000, Overload is 0.00%
    Required overload is 33.333333%
    Worst tier is 50.000000 (r1z3)
    --------------------------------------------------------------------------
    Tier                           Parts      %    Max     0     1     2     3
    --------------------------------------------------------------------------
    r1                               256   0.00      3     0     0     0   256
    r1z1                             192   0.00      1    64   192     0     0
    r1z1-127.0.0.1                   192   0.00      1    64   192     0     0
    r1z1-127.0.0.1/sda               192   0.00      1    64   192     0     0
    r1z2                             192   0.00      1    64   192     0     0
    r1z2-127.0.0.2                   192   0.00      1    64   192     0     0
    r1z2-127.0.0.2/sda               192   0.00      1    64   192     0     0
    r1z3                             256  50.00      1     0   128   128     0
    r1z3-127.0.0.3                   256  50.00      1     0   128   128     0
    r1z3-127.0.0.3/sda               192   0.00      1    64   192     0     0
    r1z3-127.0.0.3/sdb               192   0.00      1    64   192     0     0

The first line reports that there are 256 partitions with 3 copies in region 1.
This is the expected output in this case (single region with 3 replicas), as
reported by the "Max" value.

However, there is some imbalance in the cluster, more precisely in zone 3. The
"Max" column reports a maximum of 1 copy in this zone; however, 50.00% of the
partitions are storing 2 replicas in this zone (which is somewhat expected,
because there are more disks in this zone).

You can now either add more capacity to the other zones, decrease the total
weight in zone 3, or set the overload to a value *greater than* 33.333333%.
Only as much overload as needed will be used.
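
The two headline figures in the example can be reproduced by hand. The
following is a minimal sketch, not the ring builder's own algorithm: it uses
the ``r1z3`` row of the table above and assumes, as the example states, four
equal-weight devices and three replicas. The variable names are illustrative
only.

.. code-block:: python

    # Values copied from the r1z3 row of the example table.
    total_parts = 256
    max_replicas = 1
    parts_by_replica_count = {0: 0, 1: 128, 2: 128, 3: 0}

    # Dispersion: share of partitions holding more replicas than "Max" allows.
    over_limit = sum(parts for count, parts in parts_by_replica_count.items()
                     if count > max_replicas)
    dispersion = 100.0 * over_limit / total_parts

    # Required overload: by weight alone, each of the 4 devices is assigned
    # 3 * 256 / 4 = 192 partition replicas. To keep zone 3 at one replica per
    # partition, the single devices in zones 1 and 2 must each hold one
    # replica of all 256 partitions.
    replicas, devices = 3, 4
    by_weight = replicas * total_parts / devices
    needed = total_parts
    required_overload = 100.0 * (needed / by_weight - 1)

    print("dispersion: %.2f%%" % dispersion)
    print("required overload: %.6f%%" % required_overload)

Read this way, the required overload is simply how far the zone 1 and zone 2
devices must exceed their weight-based share before zone 3 can drop back to a
single replica per partition.
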

-----------------------
Scripting Ring Creation
-----------------------

You can create scripts to create the account and container rings and rebalance
them. Here's an example script for the account ring. Use similar commands to
create a make-container-ring.sh script on the proxy server node.

1. Create a script file called make-account-ring.sh on the proxy
   server node with the following content::

       #!/bin/bash
       cd /etc/swift
       rm -f account.builder account.ring.gz backups/account.builder backups/account.ring.gz
       swift-ring-builder account.builder create 18 3 1
       swift-ring-builder account.builder add r1z1-<account-server-1>:6202/sdb1 1
       swift-ring-builder account.builder add r1z2-<account-server-2>:6202/sdb1 1
       swift-ring-builder account.builder rebalance

   You need to replace the values of <account-server-1>,
   <account-server-2>, etc. with the IP addresses of the account
   servers used in your setup. You can have as many account servers as
   you need. All account servers are assumed to be listening on port
   6202, and have a storage device called "sdb1" (this is a directory
   name created under /drives when we set up the account server). The
   "z1", "z2", etc. designate zones, and you can choose whether you
   put devices in the same or different zones. The "r1" designates
   the region, with different regions specified as "r1", "r2", etc.

2. Make the script file executable and run it to create the account ring file::

       chmod +x make-account-ring.sh
       sudo ./make-account-ring.sh

3. Copy the resulting ring file /etc/swift/account.ring.gz to all the
   account server nodes in your Swift environment, and put it in the
   /etc/swift directory on these nodes. Make sure that every time you
   change the account ring configuration, you copy the resulting ring
   file to all the account nodes.
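
When the list of account servers is long, the body of the script in step 1
can be generated rather than typed. The sketch below is one possible way to
do that in Python; the function name, its parameters and the sample IP
addresses are hypothetical, and it places each server in its own zone of
region 1, which is only one of the layouts the text above allows. The three
numbers passed to ``create`` are the partition power, the replica count and
``min_part_hours``, so ``18`` yields ``2 ** 18`` partitions.

.. code-block:: python

    def account_ring_commands(servers, part_power=18, replicas=3,
                              min_part_hours=1, port=6202, device="sdb1",
                              weight=1):
        """Return the swift-ring-builder lines for make-account-ring.sh."""
        builder = "account.builder"
        commands = ["swift-ring-builder %s create %d %d %d"
                    % (builder, part_power, replicas, min_part_hours)]
        # Assumption: one zone per server, all in region 1.
        for zone, ip in enumerate(servers, start=1):
            commands.append("swift-ring-builder %s add r1z%d-%s:%d/%s %s"
                            % (builder, zone, ip, port, device, weight))
        commands.append("swift-ring-builder %s rebalance" % builder)
        return commands


    # Illustrative addresses only.
    for line in account_ring_commands(["10.0.0.1", "10.0.0.2"]):
        print(line)

    print("partitions:", 2 ** 18)
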

-----------------------
Handling System Updates
-----------------------

It is recommended that system updates and reboots are done a zone at a time.
This allows the update to happen while the Swift cluster stays available
and responsive to requests. It is also advisable, after updating a zone, to let
it run for a while before updating the other zones to make sure the update
doesn't have any adverse effects.

----------------------
Handling Drive Failure
----------------------

In the event that a drive has failed, the first step is to make sure the drive
is unmounted. This will make it easier for Swift to work around the failure
until it has been resolved. If the drive is going to be replaced immediately,
then it is best to simply replace the drive, format it, remount it, and let
replication fill it up.

After the drive is unmounted, make sure the mount point is owned by root
(root:root 755). This ensures that rsync will not try to replicate into the
root drive once the failed drive is unmounted.

If the drive can't be replaced immediately, then it is best to leave it
unmounted, and set the device weight to 0. This will allow all the
replicas that were on that drive to be replicated elsewhere until the drive
is replaced. Once the drive is replaced, the device weight can be increased
again. Setting the device weight to 0 instead of removing the drive from the
ring gives Swift the chance to replicate data from the failing disk too (in case
it is still possible to read some of the data).

Setting the device weight to 0 (or removing a failed drive from the ring) has
another benefit: all partitions that were stored on the failed drive are
distributed over the remaining disks in the cluster, and each disk only needs to
store a few new partitions. This is much faster compared to replicating all
partitions to a single, new disk. It decreases the time to recover from a
degraded number of replicas significantly, and becomes more and more important
with bigger disks.
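
The weight change described above comes down to two ``swift-ring-builder``
calls. The following sketch only assembles and prints them; the helper name,
the builder file and the device address are made up for illustration, and it
assumes that ``set_weight`` accepts the same ``<ip_address>/<device_name>``
search value that ``remove`` takes earlier in this guide.

.. code-block:: python

    def drain_device_commands(builder_file, search_value):
        """Commands that set a device's weight to 0 and commit the change."""
        return [
            ["swift-ring-builder", builder_file, "set_weight", search_value, "0"],
            ["swift-ring-builder", builder_file, "rebalance"],
        ]


    # Hypothetical builder file and device; print instead of executing so the
    # commands can be reviewed before they touch a real ring.
    for argv in drain_device_commands("object.builder", "10.0.0.5/sdb1"):
        print(" ".join(argv))

As with any ring change, the rebalanced ring still has to be pushed out to all
servers before the new weight takes effect.
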

-----------------------
Handling Server Failure
-----------------------

If a server is having hardware issues, it is a good idea to make sure the
Swift services are not running. This will allow Swift to work around the
failure while you troubleshoot.

If the server just needs a reboot, or a small amount of work that should
only last a couple of hours, then it is probably best to let Swift work
around the failure and get the machine fixed and back online. When the
machine comes back online, replication will make sure that anything that was
missed during the downtime gets updated.

If the server has more serious issues, then it is probably best to remove
all of the server's devices from the ring. Once the server has been repaired
and is back online, the server's devices can be added back into the ring.
It is important that the devices are reformatted before putting them back
into the ring, as they are likely to be responsible for a different set of
partitions than before.

-----------------------
Detecting Failed Drives
-----------------------

It has been our experience that when a drive is about to fail, error messages
will spew into ``/var/log/kern.log``. There is a script called
``swift-drive-audit`` that can be run via cron to watch for bad drives. If
errors are detected, it will unmount the bad drive, so that Swift can
work around it. The script takes a configuration file with the following
settings:

``[drive-audit]``

==================  ==============  ===========================================
Option              Default         Description
------------------  --------------  -------------------------------------------
log_facility        LOG_LOCAL0      Syslog log facility
log_level           INFO            Log level
device_dir          /srv/node       Directory devices are mounted under
minutes             60              Number of minutes to look back in
                                    ``/var/log/kern.log``
error_limit         1               Number of errors to find before a device
                                    is unmounted
log_file_pattern    /var/log/kern*  Location of the log file with globbing
                                    pattern to check against device errors
regex_pattern_X     (see below)     Regular expression patterns to be used to
                                    locate device blocks with errors in the
                                    log file
==================  ==============  ===========================================

The default regex patterns used to locate device blocks with errors are
``\berror\b.*\b(sd[a-z]{1,2}\d?)\b`` and ``\b(sd[a-z]{1,2}\d?)\b.*\berror\b``.
You can override these defaults by providing new expressions
using the format ``regex_pattern_X = regex_expression``, where ``X`` is a number.

This script has been tested on Ubuntu 10.04 and Ubuntu 12.04, so if you are
using a different distro or OS, some care should be taken before using it in
production.
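
Before relying on the default patterns on another distribution, it can help to
try them against a few lines from your own kernel log. The sketch below applies
the two default expressions quoted above with Python's ``re`` module. It is
not the ``swift-drive-audit`` implementation: the helper name and the sample
log lines are invented for illustration.

.. code-block:: python

    import re

    # The two default expressions quoted in this section.
    DEFAULT_PATTERNS = [
        re.compile(r'\berror\b.*\b(sd[a-z]{1,2}\d?)\b'),
        re.compile(r'\b(sd[a-z]{1,2}\d?)\b.*\berror\b'),
    ]


    def devices_with_errors(log_lines):
        """Count, per device name, the lines matched by either pattern."""
        counts = {}
        for line in log_lines:
            for pattern in DEFAULT_PATTERNS:
                match = pattern.search(line)
                if match:
                    device = match.group(1)
                    counts[device] = counts.get(device, 0) + 1
                    break  # count each line at most once
        return counts


    # Made-up log lines, not real kernel output.
    sample = [
        "kernel: end_request: I/O error, dev sdb, sector 1234",
        "kernel: sdc1: rw=0, want=8, limit=4 error",
        "kernel: EXT4-fs (sda1): mounted filesystem",
    ]
    print(devices_with_errors(sample))

A device whose count reaches ``error_limit`` within the look-back window is the
kind of device the audit script would unmount.
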

------------------------------
Preventing Disk Full Scenarios
------------------------------

Prevent disk full scenarios by ensuring that the ``proxy-server`` blocks PUT
requests and that rsync refuses replication to the affected drives.

You can prevent ``proxy-server`` PUT requests to low-space disks by ensuring
``fallocate_reserve`` is set in the ``object-server.conf``. By default,
``fallocate_reserve`` is set to 1%. This blocks PUT requests that would leave
the free disk space below 1% of the disk.
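
The rule can be illustrated with a little arithmetic. The function below is a
simplified model of the check as described above, not Swift's own code; its
name and parameters are hypothetical, and how the exact boundary case is
treated is an assumption.

.. code-block:: python

    def put_allowed(free_bytes, total_bytes, object_size, reserve_percent=1.0):
        """Return True if writing object_size bytes keeps the reserve intact."""
        free_after = free_bytes - object_size
        # Assumption: landing exactly on the reserve is still accepted.
        return 100.0 * free_after / total_bytes >= reserve_percent


    GiB = 2 ** 30
    # 1000 GiB disk with 12 GiB free: a 1 GiB object fits, a 5 GiB one does not.
    print(put_allowed(12 * GiB, 1000 * GiB, 1 * GiB))
    print(put_allowed(12 * GiB, 1000 * GiB, 5 * GiB))
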

In order to prevent rsync replication to specific drives, first
set up ``rsync_module`` per disk in your ``object-replicator``.
Set this in ``object-server.conf``:

.. code::

    [object-replicator]
    rsync_module = {replication_ip}::object_{device}

Set the individual drives in ``rsync.conf``. For example:

.. code::

    [object_sda]
    max connections = 4
    lock file = /var/lock/object_sda.lock

    [object_sdb]
    max connections = 4
    lock file = /var/lock/object_sdb.lock

Finally, monitor the free space of each disk and set the rsync
``max connections`` of any drive that is running low to ``-1``. We recommend
utilising your existing monitoring solution to achieve this. The following is
an example script:

.. code-block:: python

    #!/usr/bin/env python
    import os
    import errno

    RESERVE = 500 * 2 ** 20  # 500 MiB

    # Directory under which the devices are mounted
    DEVICES = '/srv/node1'

    path_template = '/etc/rsync.d/disable_%s.conf'
    config_template = '''
    [object_%s]
    max connections = -1
    '''

    def disable_rsync(device):
        # Write an override that makes rsync refuse connections for this device
        with open(path_template % device, 'w') as f:
            f.write(config_template.lstrip() % device)


    def enable_rsync(device):
        # Remove the override so the settings from rsync.conf apply again
        try:
            os.unlink(path_template % device)
        except OSError as e:
            # ignore file does not exist
            if e.errno != errno.ENOENT:
                raise


    for device in os.listdir(DEVICES):
        path = os.path.join(DEVICES, device)
        st = os.statvfs(path)
        # Space available to unprivileged users, in bytes
        free = st.f_bavail * st.f_frsize
        if free < RESERVE:
            disable_rsync(device)
        else:
            enable_rsync(device)

For the above script to work, ensure ``/etc/rsync.d/`` conf files are
included, by specifying ``&include`` in your ``rsync.conf`` file:

.. code::

    &include /etc/rsync.d

Use this in conjunction with a cron job to periodically run the script, for example:

.. code::

    # /etc/cron.d/devicecheck
    * * * * * root /some/path/to/disable_rsync.py

.. _dispersion_report:

-----------------
Dispersion Report
-----------------

There is a swift-dispersion-report tool for measuring overall cluster health.
This is accomplished by checking if a set of deliberately distributed
containers and objects are currently in their proper places within the cluster.

For instance, a common deployment has three replicas of each object. The health
of that object can be measured by checking if each replica is in its proper
place. If only 2 of the 3 are in place, the object's health can be said to be at
66.66%, where 100% would be perfect.

A single object's health, especially an older object, usually reflects the
health of the entire partition the object is in. If we make enough objects on
a distinct percentage of the partitions in the cluster, we can get a fairly
good estimate of the overall cluster health. In practice, about 1% partition
coverage seems to balance well between accuracy and the amount of time it takes
to gather results.

The first thing that needs to be done to provide this health value is to create a
new account solely for this usage. Next, we need to place the containers and
objects throughout the system so that they are on distinct partitions. The
swift-dispersion-populate tool does this by making up random container and
object names until they fall on distinct partitions. Last, and repeatedly for
the life of the cluster, we need to run the swift-dispersion-report tool to
check the health of each of these containers and objects.

These tools need direct access to the entire cluster and to the ring files
(installing them on a proxy server will probably do). Both
swift-dispersion-populate and swift-dispersion-report use the same
configuration file, /etc/swift/dispersion.conf. Example conf file::

    [dispersion]
    auth_url = http://localhost:8080/auth/v1.0
    auth_user = test:tester
    auth_key = testing
    endpoint_type = internalURL

There are also options for the conf file for specifying the dispersion coverage
(defaults to 1%), retries, concurrency, etc., though usually the defaults are
fine. If you want to use keystone v3 for authentication there are options like
auth_version, user_domain_name, project_domain_name and project_name.

Once the configuration is in place, run `swift-dispersion-populate` to populate
the containers and objects throughout the cluster.

Now that those containers and objects are in place, you can run
`swift-dispersion-report` to get a dispersion report, or the overall health of
the cluster. Here is an example of a cluster in perfect health::

    $ swift-dispersion-report
    Queried 2621 containers for dispersion reporting, 19s, 0 retries
    100.00% of container copies found (7863 of 7863)
    Sample represents 1.00% of the container partition space

    Queried 2619 objects for dispersion reporting, 7s, 0 retries
    100.00% of object copies found (7857 of 7857)
    Sample represents 1.00% of the object partition space

Now I'll deliberately double the weight of a device in the object ring (with
replication turned off) and rerun the dispersion report to show what impact
that has::

    $ swift-ring-builder object.builder set_weight d0 200
    $ swift-ring-builder object.builder rebalance
    ...
    $ swift-dispersion-report
    Queried 2621 containers for dispersion reporting, 8s, 0 retries
    100.00% of container copies found (7863 of 7863)
    Sample represents 1.00% of the container partition space

    Queried 2619 objects for dispersion reporting, 7s, 0 retries
    There were 1763 partitions missing one copy.
    77.56% of object copies found (6094 of 7857)
    Sample represents 1.00% of the object partition space
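
The object figures in this report follow directly from the counts it prints.
As a quick check, assuming three replicas per object (which matches 7857
expected copies for 2619 objects), the percentage can be recomputed as follows;
the variable names are illustrative.

.. code-block:: python

    replicas = 3
    objects_queried = 2619
    missing_one = 1763  # partitions missing exactly one copy

    copies_expected = objects_queried * replicas
    copies_found = copies_expected - missing_one
    pct_found = 100.0 * copies_found / copies_expected

    print("%.2f%% of object copies found (%d of %d)"
          % (pct_found, copies_found, copies_expected))
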

You can see the health of the objects in the cluster has gone down
significantly. Of course, I only have four devices in this test environment; in
a production environment with many devices, the impact of one device change
is much less. Next, I'll run the replicators to get everything put back into
place and then rerun the dispersion report::

    ... start object replicators and monitor logs until they're caught up ...
    $ swift-dispersion-report
    Queried 2621 containers for dispersion reporting, 17s, 0 retries
    100.00% of container copies found (7863 of 7863)
    Sample represents 1.00% of the container partition space

    Queried 2619 objects for dispersion reporting, 7s, 0 retries
    100.00% of object copies found (7857 of 7857)
    Sample represents 1.00% of the object partition space

You can also run the report for only containers or objects::

    $ swift-dispersion-report --container-only
    Queried 2621 containers for dispersion reporting, 17s, 0 retries
    100.00% of container copies found (7863 of 7863)
    Sample represents 1.00% of the container partition space

    $ swift-dispersion-report --object-only
    Queried 2619 objects for dispersion reporting, 7s, 0 retries
    100.00% of object copies found (7857 of 7857)
    Sample represents 1.00% of the object partition space

Alternatively, the dispersion report can also be output in JSON format. This
allows it to be more easily consumed by third-party utilities::

    $ swift-dispersion-report -j
    {"object": {"retries:": 0, "missing_two": 0, "copies_found": 7863, "missing_one": 0, "copies_expected": 7863, "pct_found": 100.0, "overlapping": 0, "missing_all": 0}, "container": {"retries:": 0, "missing_two": 0, "copies_found": 12534, "missing_one": 0, "copies_expected": 12534, "pct_found": 100.0, "overlapping": 15, "missing_all": 0}}
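
A monitoring check might consume the JSON output along these lines. This is a
sketch only: the function name and threshold are hypothetical, and the sample
string is an abbreviated, made-up report that reuses the ``pct_found``,
``copies_found`` and ``copies_expected`` keys shown above.

.. code-block:: python

    import json


    def unhealthy_sections(report_json, threshold=100.0):
        """Return the report sections whose pct_found is below threshold."""
        report = json.loads(report_json)
        return sorted(name for name, stats in report.items()
                      if stats["pct_found"] < threshold)


    # Abbreviated, illustrative report; real output carries more keys.
    sample = (
        '{"object": {"copies_found": 6094, "copies_expected": 7857,'
        ' "pct_found": 77.56},'
        ' "container": {"copies_found": 7863, "copies_expected": 7863,'
        ' "pct_found": 100.0}}'
    )
    print(unhealthy_sections(sample))
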

Note that you may select which storage policy to use by setting the option
``--policy-name silver`` or ``-P silver`` (silver is the example policy name here).
If no policy is specified, the default will be used per the swift.conf file.
When you specify a policy, the containers created also include the policy index;
thus, even when running a container-only report, you will need to specify the
policy if you are not using the default.

-----------------------------------
Geographically Distributed Clusters
-----------------------------------

Swift's default configuration is currently designed to work in a
single region, where a region is defined as a group of machines with
high-bandwidth, low-latency links between them. However, configuration
options exist that make running a performant multi-region Swift
cluster possible.

For the rest of this section, we will assume a two-region Swift
cluster: region 1 in San Francisco (SF), and region 2 in New York
(NY). Each region shall contain within it 3 zones, numbered 1, 2, and
3, for a total of 6 zones.

~~~~~~~~~~~~~
read_affinity
~~~~~~~~~~~~~

This setting, combined with the sorting_method setting, makes the proxy
server prefer local backend servers for GET and HEAD requests over
non-local ones. For example, it is
preferable for an SF proxy server to service object GET requests
by talking to SF object servers, as the client will receive lower
latency and higher throughput.

By default, Swift randomly chooses one of the three replicas to give
to the client, thereby spreading the load evenly. In the case of a
geographically-distributed cluster, the administrator is likely to
prioritize keeping traffic local over even distribution of results.
This is where the read_affinity setting comes in.

Example::

    [app:proxy-server]
    sorting_method = affinity
    read_affinity = r1=100

This will make the proxy attempt to service GET and HEAD requests from
backends in region 1 before contacting any backends in region 2.
However, if no region 1 backends are available (due to replica
placement, failed hardware, or other reasons), then the proxy will
fall back to backend servers in other regions.

Example::

    [app:proxy-server]
    sorting_method = affinity
    read_affinity = r1z1=100, r1=200

This will make the proxy attempt to service GET and HEAD requests from
backends in region 1 zone 1, then backends in region 1, then any other
backends. If a proxy is physically close to a particular zone or
zones, this can provide bandwidth savings. For example, if a zone
corresponds to servers in a particular rack, and the proxy server is
in that same rack, then setting read_affinity to prefer reads from
within the rack will result in less traffic between the top-of-rack
switches.

The read_affinity setting may contain any number of region/zone
specifiers; the priority number (after the equals sign) determines the
ordering in which backend servers will be contacted. A lower number
means higher priority.

Note that read_affinity only affects the ordering of primary nodes
(see ring docs for definition of primary node), not the ordering of
handoff nodes.
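
The ordering rule can be modelled in a few lines. This is an illustrative
sketch rather than the proxy server's implementation: the function names are
hypothetical, and nodes are represented as plain dictionaries with ``region``
and ``zone`` keys, which is an assumption made for the example.

.. code-block:: python

    def parse_read_affinity(value):
        """Turn 'r1z1=100, r1=200' into {'r1z1': 100, 'r1': 200}."""
        priorities = {}
        for item in value.split(","):
            spec, _, priority = item.strip().partition("=")
            priorities[spec] = int(priority)
        return priorities


    def sort_by_affinity(nodes, priorities):
        """Order nodes by the lowest matching priority; unmatched nodes last."""
        def key(node):
            region = "r%d" % node["region"]
            region_zone = "%sz%d" % (region, node["zone"])
            return min(priorities.get(region_zone, float("inf")),
                       priorities.get(region, float("inf")))
        # sorted() is stable, so nodes with equal priority keep their order.
        return sorted(nodes, key=key)


    priorities = parse_read_affinity("r1z1=100, r1=200")
    nodes = [
        {"region": 2, "zone": 1},
        {"region": 1, "zone": 2},
        {"region": 1, "zone": 1},
    ]
    for node in sort_by_affinity(nodes, priorities):
        print("r%(region)dz%(zone)d" % node)

With the second example configuration, the region 1 zone 1 node is tried first,
then the other region 1 node, and the region 2 node only as a fallback.
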

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
write_affinity and write_affinity_node_count
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This setting makes the proxy server prefer local backend servers for
object PUT requests over non-local ones. For example, it may be
preferable for an SF proxy server to service object PUT requests
by talking to SF object servers, as the client will receive lower
latency and higher throughput. However, if this setting is used, note
that a NY proxy server handling a GET request for an object that was
PUT using write affinity may have to fetch it across the WAN link, as
the object won't immediately have any replicas in NY. However,
replication will move the object's replicas to their proper homes in
both SF and NY.

Note that only object PUT requests are affected by the write_affinity
setting; POST, GET, HEAD, DELETE, OPTIONS, and account/container PUT
requests are not affected.

This setting lets you trade data distribution for throughput. If
write_affinity is enabled, then object replicas will initially be
stored all within a particular region or zone, thereby decreasing the
quality of the data distribution, but the replicas will be written
over fast local links rather than the WAN, giving higher throughput to
clients. Note that the replicators will eventually move objects to their
proper, well-distributed homes.

The write_affinity setting is useful only when you don't typically
read objects immediately after writing them. For example, consider a
workload of mainly backups: if you have a bunch of machines in NY that
periodically write backups to Swift, then odds are that you don't then
immediately read those backups in SF. If your workload doesn't look
like that, then you probably shouldn't use write_affinity.

The write_affinity_node_count setting is only useful in conjunction
with write_affinity; it governs how many local object servers will be
tried before falling back to non-local ones.

Example::

    [app:proxy-server]
    write_affinity = r1
    write_affinity_node_count = 2 * replicas

Assuming 3 replicas, this configuration will make object PUTs try
storing the object's replicas on up to 6 disks ("2 * replicas") in
region 1 ("r1"). The proxy server tries to find 3 devices for storing the
object. If a device is unavailable, it queries the ring for the 4th
device, and so on up to the 6th device. If the 6th disk is still unavailable,
the last replica will be sent to another region. This does not mean there
will be 6 replicas in region 1.
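
The arithmetic behind "2 * replicas" is simple, but it is easy to misread as a
replica count. The helper below is a hypothetical illustration of how such a
value resolves to a number of local devices to try. It assumes the setting is
either a plain integer or of the form ``N * replicas``, and it is not the proxy
server's own parser.

.. code-block:: python

    def local_node_count(setting, replicas):
        """Resolve 'N' or 'N * replicas' to a number of local devices to try."""
        parts = setting.split()
        if len(parts) == 1:
            return int(parts[0])
        if len(parts) == 3 and parts[1] == "*" and parts[2] == "replicas":
            return int(parts[0]) * replicas
        raise ValueError("unsupported value: %r" % setting)


    replicas = 3
    limit = local_node_count("2 * replicas", replicas)
    print("local devices tried: up to %d, replicas written: %d"
          % (limit, replicas))
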

You should be aware that, if you have data coming into SF faster than
your link to NY can transfer it, then your cluster's data distribution
will get worse and worse over time as objects pile up in SF. If this
happens, it is recommended to disable write_affinity and simply let
object PUTs traverse the WAN link, as that will naturally limit the
object growth rate to what your WAN link can handle.

--------------------------------
Cluster Telemetry and Monitoring
--------------------------------

Various metrics and telemetry can be obtained from the account, container, and
object servers using the recon server middleware and the swift-recon CLI. To do
so, update your account, container, or object server pipelines to include recon
and add the associated filter config.

object-server.conf sample::

    [pipeline:main]
    pipeline = recon object-server

    [filter:recon]
    use = egg:swift#recon
    recon_cache_path = /var/cache/swift

container-server.conf sample::

    [pipeline:main]
    pipeline = recon container-server

    [filter:recon]
    use = egg:swift#recon
    recon_cache_path = /var/cache/swift

account-server.conf sample::

    [pipeline:main]
    pipeline = recon account-server

    [filter:recon]
    use = egg:swift#recon
    recon_cache_path = /var/cache/swift

The recon_cache_path simply sets the directory where stats for a few items will
be stored. Depending on the method of deployment you may need to create this
directory manually and ensure that Swift has read/write access.

Finally, if you also wish to track asynchronous pendings on your object
servers, you will need to set up a cron job to run the swift-recon-cron script
periodically on your object servers::

    */5 * * * * swift /usr/bin/swift-recon-cron /etc/swift/object-server.conf

Once the recon middleware is enabled, a GET request for
"/recon/<metric>" to the backend object server will return a
JSON-formatted response::

    fhines@ubuntu:~$ curl -i http://localhost:6030/recon/async
    HTTP/1.1 200 OK
    Content-Type: application/json
    Content-Length: 20
    Date: Tue, 18 Oct 2011 21:03:01 GMT

    {"async_pending": 0}

Note that the default port for the object server is 6200, except on a
Swift All-In-One installation, which uses 6010, 6020, 6030, and 6040.
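
The same request can be made from a script. The following sketch uses only the
Python standard library; the helper names are hypothetical, and the host, port
and timeout are placeholders to adapt to your deployment.

.. code-block:: python

    import json
    from urllib.request import urlopen


    def recon_url(host, port, metric):
        """Build the recon URL for a metric on a backend server."""
        return "http://%s:%d/recon/%s" % (host, port, metric)


    def parse_async_pending(body):
        """Extract the async pending count from a /recon/async response body."""
        return json.loads(body)["async_pending"]


    def fetch_async_pending(host, port=6200, timeout=5):
        """Query a running object server; requires recon to be enabled."""
        with urlopen(recon_url(host, port, "async"), timeout=timeout) as resp:
            return parse_async_pending(resp.read())


    # Parsing the response body shown in the curl example above.
    print(parse_async_pending('{"async_pending": 0}'))
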
|
|
|
|
The following metrics and telemetry are currently exposed:

========================= ========================================================================================
Request URI               Description
------------------------- ----------------------------------------------------------------------------------------
/recon/load               returns 1, 5, and 15 minute load averages
/recon/mem                returns /proc/meminfo
/recon/mounted            returns *ALL* currently mounted filesystems
/recon/unmounted          returns all unmounted drives if mount_check = True
/recon/diskusage          returns disk utilization for storage devices
/recon/ringmd5            returns object/container/account ring md5sums
/recon/quarantined        returns # of quarantined objects/accounts/containers
/recon/sockstat           returns consumable info from /proc/net/sockstat|6
/recon/devices            returns list of devices and devices dir i.e. /srv/node
/recon/async              returns count of async pending
/recon/replication        returns object replication info (for backward compatibility)
/recon/replication/<type> returns replication info for given type (account, container, object)
/recon/auditor/<type>     returns auditor stats on last reported scan for given type (account, container, object)
/recon/updater/<type>     returns last updater sweep times for given type (container, object)
========================= ========================================================================================

Note that 'object_replication_last' and 'object_replication_time' in object
replication info are considered to be transitional and will be removed in
a subsequent release. Use 'replication_last' and 'replication_time' instead.

This information can also be queried via the swift-recon command line utility::

    fhines@ubuntu:~$ swift-recon -h
    Usage:
            usage: swift-recon <server_type> [-v] [--suppress] [-a] [-r] [-u] [-d]
            [-l] [-T] [--md5] [--auditor] [--updater] [--expirer] [--sockstat]

            <server_type>   account|container|object
            Defaults to object server.

            ex: swift-recon container -l --auditor


    Options:
      -h, --help            show this help message and exit
      -v, --verbose         Print verbose info
      --suppress            Suppress most connection related errors
      -a, --async           Get async stats
      -r, --replication     Get replication stats
      --auditor             Get auditor stats
      --updater             Get updater stats
      --expirer             Get expirer stats
      -u, --unmounted       Check cluster for unmounted devices
      -d, --diskusage       Get disk usage stats
      -l, --loadstats       Get cluster load average stats
      -q, --quarantined     Get cluster quarantine stats
      --md5                 Get md5sum of servers ring and compare to local copy
      --sockstat            Get cluster socket usage stats
      -T, --time            Check time synchronization
      --all                 Perform all checks. Equal to
                            -arudlqT --md5 --sockstat --auditor --updater
                            --expirer --driveaudit --validate-servers
      -z ZONE, --zone=ZONE  Only query servers in specified zone
      -t SECONDS, --timeout=SECONDS
                            Time to wait for a response from a server
      --swiftdir=SWIFTDIR   Default = /etc/swift

For example, to obtain container replication info from all hosts in zone "3"::

    fhines@ubuntu:~$ swift-recon container -r --zone 3
    ===============================================================================
    --> Starting reconnaissance on 1 hosts
    ===============================================================================
    [2012-04-02 02:45:48] Checking on replication
    [failure] low: 0.000, high: 0.000, avg: 0.000, reported: 1
    [success] low: 486.000, high: 486.000, avg: 486.000, reported: 1
    [replication_time] low: 20.853, high: 20.853, avg: 20.853, reported: 1
    [attempted] low: 243.000, high: 243.000, avg: 243.000, reported: 1

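If the summary lines need to be consumed by a script, they can be picked
apart with a regular expression. The sketch below assumes the line layout
shown in the sample output above; that layout is not a documented interface
and may differ between releases. ``parse_summary`` is a hypothetical helper,
not something shipped with Swift.

.. code-block:: python

    import re

    # Matches summary lines such as:
    # [success] low: 486.000, high: 486.000, avg: 486.000, reported: 1
    SUMMARY_RE = re.compile(
        r"^\[(?P<name>[^\]]+)\] low: (?P<low>[\d.]+), "
        r"high: (?P<high>[\d.]+), avg: (?P<avg>[\d.]+), "
        r"reported: (?P<reported>\d+)$"
    )


    def parse_summary(line):
        """Return (name, stats) for a summary line, or None otherwise."""
        match = SUMMARY_RE.match(line.strip())
        if match is None:
            return None
        stats = {key: float(match.group(key))
                 for key in ("low", "high", "avg")}
        stats["reported"] = int(match.group("reported"))
        return match.group("name"), stats

Lines that are not summaries, such as the timestamped "Checking on
replication" line, simply yield ``None`` and can be skipped by the caller.
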
---------------------------
Reporting Metrics to StatsD
---------------------------

If you have a StatsD_ server running, Swift may be configured to send it
real-time operational metrics. To enable this, set the following
configuration entries (see the sample configuration files)::

    log_statsd_host = localhost
    log_statsd_port = 8125
    log_statsd_default_sample_rate = 1.0
    log_statsd_sample_rate_factor = 1.0
    log_statsd_metric_prefix = [empty-string]

If `log_statsd_host` is not set, this feature is disabled. The default values
for the other settings are given above. The `log_statsd_host` can be a
hostname, an IPv4 address, or an IPv6 address (not surrounded with brackets, as
this is unnecessary since the port is specified separately). If a hostname
resolves to an IPv4 address, an IPv4 socket will be used to send StatsD UDP
packets, even if the hostname would also resolve to an IPv6 address.

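The IPv4 preference described above can be pictured with a short Python
sketch. This is an illustration of the behavior, not Swift's actual logging
code; the function names are invented for the example, and the payload uses
the standard StatsD counter line format (``name:value|c``).

.. code-block:: python

    import socket


    def pick_statsd_family(host, port):
        """Prefer IPv4; fall back to IPv6 only if no IPv4 address resolves."""
        try:
            socket.getaddrinfo(host, port, socket.AF_INET, socket.SOCK_DGRAM)
            return socket.AF_INET
        except socket.gaierror:
            socket.getaddrinfo(host, port, socket.AF_INET6, socket.SOCK_DGRAM)
            return socket.AF_INET6


    def send_counter(host, port, name, value=1):
        """Send one StatsD counter sample over UDP (fire and forget)."""
        family = pick_statsd_family(host, port)
        payload = ("%s:%d|c" % (name, value)).encode("utf-8")
        with socket.socket(family, socket.SOCK_DGRAM) as sock:
            sock.sendto(payload, (host, port))

Because UDP is connectionless, ``send_counter`` succeeds even when nothing is
listening on the target port, which is why a stopped StatsD server does not
disturb the sending service.
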
.. _StatsD: http://codeascraft.etsy.com/2011/02/15/measure-anything-measure-everything/
.. _Graphite: http://graphite.wikidot.com/
.. _Ganglia: http://ganglia.sourceforge.net/

The sample rate is a real number between 0 and 1 which defines the
probability of sending a sample for any given event or timing measurement.
This sample rate is sent with each sample to StatsD, which uses it to
scale the received value back up. For example, with a sample rate of 0.5,
StatsD will multiply that counter's value by 2 when flushing the metric to an
upstream monitoring system (Graphite_, Ganglia_, etc.).

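A minimal sketch of this sampling arithmetic follows. The functions are
hypothetical helpers written for this guide; the ``|@<rate>`` suffix is the
usual StatsD convention for carrying the sample rate alongside a counter.

.. code-block:: python

    import random


    def should_send(sample_rate, rng=random.random):
        """Decide whether this event is sampled (rng is injectable)."""
        return sample_rate >= 1 or rng() < sample_rate


    def format_counter(name, value, sample_rate):
        """Format a counter line; the rate is appended when below 1."""
        if sample_rate < 1:
            return "%s:%d|c|@%s" % (name, value, sample_rate)
        return "%s:%d|c" % (name, value)


    def scaled_value(value, sample_rate):
        """What the receiver adds to the counter for one received sample."""
        return value / sample_rate

With a sample rate of 0.5, roughly half of the events are sent, and each one
that arrives is counted as ``scaled_value(1, 0.5)``, that is, 2.
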
Some relatively high-frequency metrics have a default sample rate less than
one. If you want to override the default sample rate for all metrics whose
default sample rate is not specified in the Swift source, you may set
`log_statsd_default_sample_rate` to a value less than one. This is NOT
recommended (see below). A better way to reduce StatsD load is to
adjust `log_statsd_sample_rate_factor` to a value less than one. The
`log_statsd_sample_rate_factor` is multiplied by any sample rate (either the
global default or one specified by the actual metric logging call in the Swift
source) prior to handling. In other words, this one tunable can lower the
frequency of all StatsD logging by a proportional amount.

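The way the two settings combine can be written as a one-line calculation.
The function below is an illustrative sketch with invented names; it only
mirrors the rule stated above (factor times either the per-metric rate or the
global default).

.. code-block:: python

    def effective_sample_rate(metric_rate, default_rate=1.0, factor=1.0):
        """Combine the per-metric (or default) rate with the global factor.

        metric_rate is None when the logging call specifies no rate.
        """
        base = default_rate if metric_rate is None else metric_rate
        return base * factor

For instance, with ``log_statsd_sample_rate_factor = 0.5`` a metric that
specifies no rate of its own is sampled at 0.5, while a metric that already
specifies 0.5 is sampled at 0.25.
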
To get the best data, start with the default `log_statsd_default_sample_rate`
and `log_statsd_sample_rate_factor` values of 1 and only lower
`log_statsd_sample_rate_factor` if needed. The
`log_statsd_default_sample_rate` should not be used and remains for backward
compatibility only.

The metric prefix will be prepended to every metric sent to the StatsD server.
For example, with::

    log_statsd_metric_prefix = proxy01

the metric `proxy-server.errors` would be sent to StatsD as
`proxy01.proxy-server.errors`. This is useful for distinguishing between
servers when sending statistics to a central StatsD server. If you run a local
StatsD server per node, you could configure a per-node metrics prefix there and
leave `log_statsd_metric_prefix` blank.

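Expressed as code, the prefixing rule is simply a conditional join. The
helper below is hypothetical and only restates the behavior described above.

.. code-block:: python

    def prefixed_metric(prefix, metric):
        """Prepend the configured prefix, if any, to a metric name."""
        return "%s.%s" % (prefix, metric) if prefix else metric

An empty prefix leaves the metric name untouched, which matches the default
configuration.
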
Note that metrics reported to StatsD are counters or timing data (which are
sent in units of milliseconds). StatsD usually expands timing data out to min,
max, avg, count, and 90th percentile per timing metric, but the details of
this behavior will depend on the configuration of your StatsD server. Some
important "gauge" metrics may still need to be collected using another method.
For example, the `object-server.async_pendings` StatsD metric counts the generation
of async_pendings in real-time, but will not tell you the current number of
async_pending container updates on disk at any point in time.

Note also that the set of metrics collected, their names, and their semantics
are not locked down and will change over time.

Metrics for `account-auditor`:

========================== =========================================================
Metric Name                Description
-------------------------- ---------------------------------------------------------
`account-auditor.errors`   Count of audit runs (across all account databases) which
                           caught an Exception.
`account-auditor.passes`   Count of individual account databases which passed audit.
`account-auditor.failures` Count of individual account databases which failed audit.
`account-auditor.timing`   Timing data for individual account database audits.
========================== =========================================================

Metrics for `account-reaper`:

============================================== ====================================================
Metric Name                                    Description
---------------------------------------------- ----------------------------------------------------
`account-reaper.errors`                        Count of devices failing the mount check.
`account-reaper.timing`                        Timing data for each reap_account() call.
`account-reaper.return_codes.X`                Count of HTTP return codes from various operations
                                               (e.g. object listing, container deletion, etc.). The
                                               value for X is the first digit of the return code
                                               (2 for 201, 4 for 404, etc.).
`account-reaper.containers_failures`           Count of failures to delete a container.
`account-reaper.containers_deleted`            Count of containers successfully deleted.
`account-reaper.containers_remaining`          Count of containers which failed to delete with
                                               zero successes.
`account-reaper.containers_possibly_remaining` Count of containers which failed to delete with
                                               at least one success.
`account-reaper.objects_failures`              Count of failures to delete an object.
`account-reaper.objects_deleted`               Count of objects successfully deleted.
`account-reaper.objects_remaining`             Count of objects which failed to delete with zero
                                               successes.
`account-reaper.objects_possibly_remaining`    Count of objects which failed to delete with at
                                               least one success.
============================================== ====================================================

Metrics for `account-server` ("Not Found" is not considered an error and requests
which increment `errors` are not included in the timing data):

======================================== =======================================================
Metric Name                              Description
---------------------------------------- -------------------------------------------------------
`account-server.DELETE.errors.timing`    Timing data for each DELETE request resulting in an
                                         error: bad request, not mounted, missing timestamp.
`account-server.DELETE.timing`           Timing data for each DELETE request not resulting in
                                         an error.
`account-server.PUT.errors.timing`       Timing data for each PUT request resulting in an error:
                                         bad request, not mounted, conflict, recently-deleted.
`account-server.PUT.timing`              Timing data for each PUT request not resulting in an
                                         error.
`account-server.HEAD.errors.timing`      Timing data for each HEAD request resulting in an
                                         error: bad request, not mounted.
`account-server.HEAD.timing`             Timing data for each HEAD request not resulting in
                                         an error.
`account-server.GET.errors.timing`       Timing data for each GET request resulting in an
                                         error: bad request, not mounted, bad delimiter,
                                         account listing limit too high, bad accept header.
`account-server.GET.timing`              Timing data for each GET request not resulting in
                                         an error.
`account-server.REPLICATE.errors.timing` Timing data for each REPLICATE request resulting in an
                                         error: bad request, not mounted.
`account-server.REPLICATE.timing`        Timing data for each REPLICATE request not resulting
                                         in an error.
`account-server.POST.errors.timing`      Timing data for each POST request resulting in an
                                         error: bad request, bad or missing timestamp, not
                                         mounted.
`account-server.POST.timing`             Timing data for each POST request not resulting in
                                         an error.
======================================== =======================================================

Metrics for `account-replicator`:

===================================== ====================================================
Metric Name                           Description
------------------------------------- ----------------------------------------------------
`account-replicator.diffs`            Count of syncs handled by sending differing rows.
`account-replicator.diff_caps`        Count of "diffs" operations which failed because
                                      "max_diffs" was hit.
`account-replicator.no_changes`       Count of accounts found to be in sync.
`account-replicator.hashmatches`      Count of accounts found to be in sync via hash
                                      comparison (`broker.merge_syncs` was called).
`account-replicator.rsyncs`           Count of completely missing accounts which were sent
                                      via rsync.
`account-replicator.remote_merges`    Count of syncs handled by sending entire database
                                      via rsync.
`account-replicator.attempts`         Count of database replication attempts.
`account-replicator.failures`         Count of database replication attempts which failed
                                      due to corruption (quarantined) or inability to read
                                      as well as attempts to individual nodes which
                                      failed.
`account-replicator.removes.<device>` Count of databases on <device> deleted because the
                                      delete_timestamp was greater than the put_timestamp
                                      and the database had no rows or because it was
                                      successfully sync'ed to other locations and doesn't
                                      belong here anymore.
`account-replicator.successes`        Count of replication attempts to an individual node
                                      which were successful.
`account-replicator.timing`           Timing data for each database replication attempt
                                      not resulting in a failure.
===================================== ====================================================

Metrics for `container-auditor`:

============================ ====================================================
Metric Name                  Description
---------------------------- ----------------------------------------------------
`container-auditor.errors`   Incremented when an Exception is caught in an audit
                             pass (only once per pass, max).
`container-auditor.passes`   Count of individual containers passing an audit.
`container-auditor.failures` Count of individual containers failing an audit.
`container-auditor.timing`   Timing data for each container audit.
============================ ====================================================

Metrics for `container-replicator`:

======================================= ====================================================
Metric Name                             Description
--------------------------------------- ----------------------------------------------------
`container-replicator.diffs`            Count of syncs handled by sending differing rows.
`container-replicator.diff_caps`        Count of "diffs" operations which failed because
                                        "max_diffs" was hit.
`container-replicator.no_changes`       Count of containers found to be in sync.
`container-replicator.hashmatches`      Count of containers found to be in sync via hash
                                        comparison (`broker.merge_syncs` was called).
`container-replicator.rsyncs`           Count of completely missing containers which were sent
                                        via rsync.
`container-replicator.remote_merges`    Count of syncs handled by sending entire database
                                        via rsync.
`container-replicator.attempts`         Count of database replication attempts.
`container-replicator.failures`         Count of database replication attempts which failed
                                        due to corruption (quarantined) or inability to read
                                        as well as attempts to individual nodes which
                                        failed.
`container-replicator.removes.<device>` Count of databases deleted on <device> because the
                                        delete_timestamp was greater than the put_timestamp
                                        and the database had no rows or because it was
                                        successfully sync'ed to other locations and doesn't
                                        belong here anymore.
`container-replicator.successes`        Count of replication attempts to an individual node
                                        which were successful.
`container-replicator.timing`           Timing data for each database replication attempt
                                        not resulting in a failure.
======================================= ====================================================

Metrics for `container-server` ("Not Found" is not considered an error and requests
which increment `errors` are not included in the timing data):

========================================== ====================================================
Metric Name                                Description
------------------------------------------ ----------------------------------------------------
`container-server.DELETE.errors.timing`    Timing data for DELETE request errors: bad request,
                                           not mounted, missing timestamp, conflict.
`container-server.DELETE.timing`           Timing data for each DELETE request not resulting in
                                           an error.
`container-server.PUT.errors.timing`       Timing data for PUT request errors: bad request,
                                           missing timestamp, not mounted, conflict.
`container-server.PUT.timing`              Timing data for each PUT request not resulting in an
                                           error.
`container-server.HEAD.errors.timing`      Timing data for HEAD request errors: bad request,
                                           not mounted.
`container-server.HEAD.timing`             Timing data for each HEAD request not resulting in
                                           an error.
`container-server.GET.errors.timing`       Timing data for GET request errors: bad request,
                                           not mounted, parameters not utf8, bad accept header.
`container-server.GET.timing`              Timing data for each GET request not resulting in
                                           an error.
`container-server.REPLICATE.errors.timing` Timing data for REPLICATE request errors: bad
                                           request, not mounted.
`container-server.REPLICATE.timing`        Timing data for each REPLICATE request not resulting
                                           in an error.
`container-server.POST.errors.timing`      Timing data for POST request errors: bad request,
                                           bad x-container-sync-to, not mounted.
`container-server.POST.timing`             Timing data for each POST request not resulting in
                                           an error.
========================================== ====================================================

Metrics for `container-sync`:

=============================== ====================================================
Metric Name                     Description
------------------------------- ----------------------------------------------------
`container-sync.skips`          Count of containers skipped because they don't have
                                sync'ing enabled.
`container-sync.failures`       Count of failures sync'ing individual containers.
`container-sync.syncs`          Count of individual containers sync'ed successfully.
`container-sync.deletes`        Count of container database rows sync'ed by
                                deletion.
`container-sync.deletes.timing` Timing data for each container database row
                                synchronization via deletion.
`container-sync.puts`           Count of container database rows sync'ed by PUTing.
`container-sync.puts.timing`    Timing data for each container database row
                                synchronization via PUTing.
=============================== ====================================================

Metrics for `container-updater`:

============================== ====================================================
Metric Name                    Description
------------------------------ ----------------------------------------------------
`container-updater.successes`  Count of containers which successfully updated their
                               account.
`container-updater.failures`   Count of containers which failed to update their
                               account.
`container-updater.no_changes` Count of containers which didn't need to update
                               their account.
`container-updater.timing`     Timing data for processing a container; only
                               includes timing for containers which needed to
                               update their accounts (i.e. "successes" and
                               "failures" but not "no_changes").
============================== ====================================================

Metrics for `object-auditor`:

============================ ====================================================
Metric Name                  Description
---------------------------- ----------------------------------------------------
`object-auditor.quarantines` Count of objects failing audit and quarantined.
`object-auditor.errors`      Count of errors encountered while auditing objects.
`object-auditor.timing`      Timing data for each object audit (does not include
                             any rate-limiting sleep time for
                             max_files_per_second, but does include rate-limiting
                             sleep time for max_bytes_per_second).
============================ ====================================================

Metrics for `object-expirer`:

======================== ====================================================
Metric Name              Description
------------------------ ----------------------------------------------------
`object-expirer.objects` Count of objects expired.
`object-expirer.errors`  Count of errors encountered while attempting to
                         expire an object.
`object-expirer.timing`  Timing data for each object expiration attempt,
                         including ones resulting in an error.
======================== ====================================================

Metrics for `object-reconstructor`:

====================================================== ======================================================
Metric Name                                            Description
------------------------------------------------------ ------------------------------------------------------
`object-reconstructor.partition.delete.count.<device>` A count of partitions on <device> which were
                                                       reconstructed and synced to another node because they
                                                       didn't belong on this node. This metric is tracked
                                                       per-device to allow for "quiescence detection" for
                                                       object reconstruction activity on each device.
`object-reconstructor.partition.delete.timing`         Timing data for partitions reconstructed and synced to
                                                       another node because they didn't belong on this node.
                                                       This metric is not tracked per device.
`object-reconstructor.partition.update.count.<device>` A count of partitions on <device> which were
                                                       reconstructed and synced to another node, but also
                                                       belong on this node. As with delete.count, this metric
                                                       is tracked per-device.
`object-reconstructor.partition.update.timing`         Timing data for partitions reconstructed which also
                                                       belong on this node. This metric is not tracked
                                                       per-device.
`object-reconstructor.suffix.hashes`                   Count of suffix directories whose hash (of filenames)
                                                       was recalculated.
`object-reconstructor.suffix.syncs`                    Count of suffix directories reconstructed with ssync.
====================================================== ======================================================

Metrics for `object-replicator`:

=================================================== ====================================================
Metric Name                                         Description
--------------------------------------------------- ----------------------------------------------------
`object-replicator.partition.delete.count.<device>` A count of partitions on <device> which were
                                                    replicated to another node because they didn't
                                                    belong on this node. This metric is tracked
                                                    per-device to allow for "quiescence detection" for
                                                    object replication activity on each device.
`object-replicator.partition.delete.timing`         Timing data for partitions replicated to another
                                                    node because they didn't belong on this node. This
                                                    metric is not tracked per device.
`object-replicator.partition.update.count.<device>` A count of partitions on <device> which were
                                                    replicated to another node, but also belong on this
                                                    node. As with delete.count, this metric is tracked
                                                    per-device.
`object-replicator.partition.update.timing`         Timing data for partitions replicated which also
                                                    belong on this node. This metric is not tracked
                                                    per-device.
`object-replicator.suffix.hashes`                   Count of suffix directories whose hash (of filenames)
                                                    was recalculated.
`object-replicator.suffix.syncs`                    Count of suffix directories replicated with rsync.
=================================================== ====================================================

Metrics for `object-server`:

======================================= ====================================================
Metric Name                             Description
--------------------------------------- ----------------------------------------------------
`object-server.quarantines`             Count of objects (files) found bad and moved to
                                        quarantine.
`object-server.async_pendings`          Count of container updates saved as async_pendings
                                        (may result from PUT or DELETE requests).
`object-server.POST.errors.timing`      Timing data for POST request errors: bad request,
                                        missing timestamp, delete-at in past, not mounted.
`object-server.POST.timing`             Timing data for each POST request not resulting in
                                        an error.
`object-server.PUT.errors.timing`       Timing data for PUT request errors: bad request,
                                        not mounted, missing timestamp, object creation
                                        constraint violation, delete-at in past.
`object-server.PUT.timeouts`            Count of object PUTs which exceeded max_upload_time.
`object-server.PUT.timing`              Timing data for each PUT request not resulting in an
                                        error.
`object-server.PUT.<device>.timing`     Timing data per kB transferred (ms/kB) for each
                                        non-zero-byte PUT request on each device.
                                        Useful for monitoring problematic devices; higher
                                        values are worse.
`object-server.GET.errors.timing`       Timing data for GET request errors: bad request,
                                        not mounted, header timestamps before the epoch,
                                        precondition failed.
                                        File errors resulting in a quarantine are not
                                        counted here.
`object-server.GET.timing`              Timing data for each GET request not resulting in an
                                        error. Includes requests which couldn't find the
                                        object (including disk errors resulting in file
                                        quarantine).
`object-server.HEAD.errors.timing`      Timing data for HEAD request errors: bad request,
                                        not mounted.
`object-server.HEAD.timing`             Timing data for each HEAD request not resulting in
                                        an error. Includes requests which couldn't find the
                                        object (including disk errors resulting in file
                                        quarantine).
`object-server.DELETE.errors.timing`    Timing data for DELETE request errors: bad request,
                                        missing timestamp, not mounted, precondition
                                        failed. Includes requests which couldn't find or
                                        match the object.
`object-server.DELETE.timing`           Timing data for each DELETE request not resulting
                                        in an error.
`object-server.REPLICATE.errors.timing` Timing data for REPLICATE request errors: bad
                                        request, not mounted.
`object-server.REPLICATE.timing`        Timing data for each REPLICATE request not resulting
                                        in an error.
======================================= ====================================================

Metrics for `object-updater`:

============================ ====================================================
Metric Name                  Description
---------------------------- ----------------------------------------------------
`object-updater.errors`      Count of drives not mounted or async_pending files
                             with an unexpected name.
`object-updater.timing`      Timing data for object sweeps to flush async_pending
                             container updates. Does not include object sweeps
                             which did not find an existing async_pending storage
                             directory.
`object-updater.quarantines` Count of async_pending container updates which were
                             corrupted and moved to quarantine.
`object-updater.successes`   Count of successful container updates.
`object-updater.failures`    Count of failed container updates.
`object-updater.unlinks`     Count of async_pending files unlinked. An
                             async_pending file is unlinked either when it is
                             successfully processed or when the replicator sees
                             that there is a newer async_pending file for the
                             same object.
============================ ====================================================

Metrics for `proxy-server` (in the table, `<type>` is the proxy-server
controller responsible for the request and will be one of "account",
"container", or "object"):

======================================== ====================================================
Metric Name                              Description
---------------------------------------- ----------------------------------------------------
`proxy-server.errors`                    Count of errors encountered while serving requests
                                         before the controller type is determined. Includes
                                         invalid Content-Length, errors finding the internal
                                         controller to handle the request, invalid utf8, and
                                         bad URLs.
`proxy-server.<type>.handoff_count`      Count of node hand-offs; only tracked if log_handoffs
                                         is set in the proxy-server config.
`proxy-server.<type>.handoff_all_count`  Count of times *only* hand-off locations were
                                         utilized; only tracked if log_handoffs is set in the
                                         proxy-server config.
`proxy-server.<type>.client_timeouts`    Count of client timeouts (client did not read within
                                         `client_timeout` seconds during a GET or did not
                                         supply data within `client_timeout` seconds during
                                         a PUT).
`proxy-server.<type>.client_disconnects` Count of detected client disconnects during PUT
                                         operations (does NOT include caught Exceptions in
                                         the proxy-server which caused a client disconnect).
======================================== ====================================================

Metrics for `proxy-logging` middleware (in the table, `<type>` is either the
proxy-server controller responsible for the request: "account", "container",
"object", or the string "SOS" if the request came from the `Swift Origin Server`_
middleware. The `<verb>` portion will be one of "GET", "HEAD", "POST", "PUT",
"DELETE", "COPY", "OPTIONS", or "BAD_METHOD". The list of valid HTTP methods
is configurable via the `log_statsd_valid_http_methods` config variable and
the default setting yields the above behavior):

.. _Swift Origin Server: https://github.com/dpgoetz/sos

==================================================== ============================================
Metric Name                                          Description
---------------------------------------------------- --------------------------------------------
`proxy-server.<type>.<verb>.<status>.timing`         Timing data for requests, start to finish.
                                                     The <status> portion is the numeric HTTP
                                                     status code for the request (e.g. "200" or
                                                     "404").
`proxy-server.<type>.GET.<status>.first-byte.timing` Timing data up to completion of sending the
                                                     response headers (only for GET requests).
                                                     <status> and <type> are as for the main
                                                     timing metric.
`proxy-server.<type>.<verb>.<status>.xfer`           This counter metric is the sum of bytes
                                                     transferred in (from clients) and out (to
                                                     clients) for requests. The <type>, <verb>,
                                                     and <status> portions of the metric are just
                                                     like the main timing metric.
==================================================== ============================================

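The naming scheme for the main timing metric can be sketched as follows. This
is an illustration of the rule described above, not the middleware's own
code; ``timing_metric_name`` is a hypothetical helper, and the tuple of valid
methods simply restates the default list given in the text.

.. code-block:: python

    DEFAULT_VALID_METHODS = (
        "GET", "HEAD", "POST", "PUT", "DELETE", "COPY", "OPTIONS",
    )


    def timing_metric_name(req_type, method, status,
                           valid_methods=DEFAULT_VALID_METHODS):
        """Build proxy-server.<type>.<verb>.<status>.timing."""
        verb = method.upper()
        if verb not in valid_methods:
            # Unknown methods are folded into a single bucket.
            verb = "BAD_METHOD"
        return "proxy-server.%s.%s.%d.timing" % (req_type, verb, status)

Folding unrecognized methods into ``BAD_METHOD`` keeps the number of distinct
metric names bounded even when clients send arbitrary verbs.
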
The `proxy-logging` middleware also groups these metrics by policy (the
`<policy-index>` portion represents a policy index):

========================================================================== =====================================
Metric Name                                                                Description
-------------------------------------------------------------------------- -------------------------------------
`proxy-server.object.policy.<policy-index>.<verb>.<status>.timing`         Timing data for requests, aggregated
                                                                           by policy index.
`proxy-server.object.policy.<policy-index>.GET.<status>.first-byte.timing` Timing data up to completion of
                                                                           sending the response headers,
                                                                           aggregated by policy index.
`proxy-server.object.policy.<policy-index>.<verb>.<status>.xfer`           Sum of bytes transferred in and out,
                                                                           aggregated by policy index.
========================================================================== =====================================

Metrics for `tempauth` middleware (in the table, `<reseller_prefix>` represents
the actual configured reseller_prefix or "`NONE`" if the reseller_prefix is the
empty string):

========================================= ====================================================
Metric Name                               Description
----------------------------------------- ----------------------------------------------------
`tempauth.<reseller_prefix>.unauthorized` Count of regular requests which were denied with
                                          HTTPUnauthorized.
`tempauth.<reseller_prefix>.forbidden`    Count of regular requests which were denied with
                                          HTTPForbidden.
`tempauth.<reseller_prefix>.token_denied` Count of token requests which were denied.
`tempauth.<reseller_prefix>.errors`       Count of errors.
========================================= ====================================================


------------------------
Debugging Tips and Tools
------------------------

When a request is made to Swift, it is given a unique transaction id. This
|
|
id should be in every log line that has to do with that request. This can
|
|
be useful when looking at all the services that are hit by a single request.
|
|
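As a minimal sketch of that kind of tracing, the following filters a set of
log lines by transaction id. The log lines and ids shown are made up for
illustration; real Swift log formats differ by service and configuration.

.. code-block:: python

    def lines_for_transaction(log_lines, txn_id):
        """Return the log lines that mention the given transaction id."""
        return [line for line in log_lines if txn_id in line]


    # Hypothetical log lines, not actual Swift output.
    sample_log = [
        "proxy-server: GET /v1/AUTH_test/c/o 200 txn: tx0001",
        "object-server: GET /sda/123/AUTH_test/c/o 200 txn: tx0001",
        "proxy-server: PUT /v1/AUTH_test/c/other 201 txn: tx0002",
    ]

    matches = lines_for_transaction(sample_log, "tx0001")

Here ``matches`` holds the proxy and object server lines for the same
request, which is the view you usually want when following one request
through the cluster.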
|
|
If you need to know where a specific account, container or object is in the
cluster, `swift-get-nodes` will show the location where each replica should be.

If you are looking at an object on the server and need more info,
`swift-object-info` will display the account, container, replica locations
and metadata of the object.

If you are looking at a container on the server and need more info,
`swift-container-info` will display the account, container, replica
locations and metadata of the container.

If you are looking at an account on the server and need more info,
`swift-account-info` will display the account, replica locations
and metadata of the account.

If you want to audit the data for an account, `swift-account-audit` can be
used to crawl the account, checking that all containers and objects can be
found.
|
|
|
|
-----------------
Managing Services
-----------------

Swift services are generally managed with `swift-init`. The general usage is
``swift-init <service> <command>``, where service is the Swift service to
manage (for example object, container, account, proxy) and command is one of:

========== ===============================================
Command    Description
---------- -----------------------------------------------
start      Start the service
stop       Stop the service
restart    Restart the service
shutdown   Attempt to gracefully shut down the service
reload     Attempt to gracefully restart the service
========== ===============================================

A graceful shutdown or reload will finish any current requests before
completely stopping the old service. There is also a special case of
``swift-init all <command>``, which will run the command for all Swift services.

In cases where there are multiple configs for a service, a specific config
can be managed with ``swift-init <service>.<config> <command>``.
For example, when a separate replication network is used, there might be
`/etc/swift/object-server/public.conf` for the object server and
`/etc/swift/object-server/replication.conf` for the replication services.
In this case, the replication services could be restarted with
``swift-init object-server.replication restart``.
|
|
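The command forms above can be sketched as a small helper that builds the
argument list for `swift-init`. This is only an illustration:
``swift_init_argv`` and ``TABLE_COMMANDS`` are hypothetical names, and the
helper only knows the commands listed in the table above.

.. code-block:: python

    # Commands taken from the table in this section.
    TABLE_COMMANDS = {"start", "stop", "restart", "shutdown", "reload"}


    def swift_init_argv(service, command, config=None):
        """Build the argv for ``swift-init <service>[.<config>] <command>``."""
        if command not in TABLE_COMMANDS:
            raise ValueError("unknown command: %s" % command)
        # A specific config is addressed as <service>.<config>.
        target = "%s.%s" % (service, config) if config else service
        return ["swift-init", target, command]


    argv = swift_init_argv("object-server", "restart", config="replication")

The resulting list could be passed to ``subprocess.run`` on a node where
`swift-init` is installed. Building the list separately makes it easy to
check the command before anything is executed.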
|
|
--------------
Object Auditor
--------------

On system failures, the XFS file system can sometimes truncate files it's
trying to write and produce zero-byte files. The object-auditor will catch
these problems, but after a system crash it is advisable to run an extra,
less rate-limited sweep to check for these specific files. You can run this
command as follows::

    swift-object-auditor /path/to/object-server/config/file.conf once -z 1000

``-z`` means to only check for zero-byte files, here at 1000 files per second.
|
|
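To make the idea of a zero-byte sweep concrete, here is a minimal sketch
that walks a directory tree and reports empty files. It is not the
auditor's implementation and does no rate limiting; ``find_zero_byte_files``
and the file names are assumptions made for the example.

.. code-block:: python

    import os
    import tempfile


    def find_zero_byte_files(root):
        """Return sorted paths of regular files under root with size zero."""
        empty = []
        for dirpath, _dirnames, filenames in os.walk(root):
            for filename in filenames:
                path = os.path.join(dirpath, filename)
                if os.path.isfile(path) and os.path.getsize(path) == 0:
                    empty.append(path)
        return sorted(empty)


    # Demonstrate on a throwaway directory with one empty and one full file.
    with tempfile.TemporaryDirectory() as root:
        open(os.path.join(root, "empty.data"), "w").close()
        with open(os.path.join(root, "full.data"), "w") as f:
            f.write("payload")
        found = [os.path.basename(p) for p in find_zero_byte_files(root)]

On a real cluster, prefer the auditor itself, since it knows how to handle
the files it finds.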
|
|
At times it is useful to be able to run the object auditor on a specific
device or set of devices. You can run the object-auditor as follows::

    swift-object-auditor /path/to/object-server/config/file.conf once --devices=sda,sdb

This will run the object auditor on only the sda and sdb devices. This
parameter accepts a comma-separated list of values.
|
|
|
|
-----------------
Object Replicator
-----------------

At times it is useful to be able to run the object replicator on a specific
device or partition. You can run the object-replicator as follows::

    swift-object-replicator /path/to/object-server/config/file.conf once --devices=sda,sdb

This will run the object replicator on only the sda and sdb devices. You can
likewise run that command with ``--partitions``. Both parameters accept a
comma-separated list of values. If both are specified, they are ANDed
together. These options can only be used in "once" mode.
|
|
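The ANDing of the two filters can be sketched as follows. This is an
illustration of the selection rule only, not the replicator's code;
``select_jobs`` and the ``(device, partition)`` pairs are hypothetical.

.. code-block:: python

    def select_jobs(jobs, devices=None, partitions=None):
        """Keep (device, partition) pairs that match every filter given."""
        selected = []
        for device, partition in jobs:
            if devices is not None and device not in devices:
                continue
            if partitions is not None and partition not in partitions:
                continue
            selected.append((device, partition))
        return selected


    jobs = [("sda", 1), ("sda", 2), ("sdb", 1), ("sdc", 1)]
    both = select_jobs(jobs, devices={"sda", "sdb"}, partitions={1})

With both filters set, only partition 1 on sda and sdb is selected. Dropping
either filter widens the selection along that dimension.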
|
|
-------------
Swift Orphans
-------------

Swift Orphans are processes left over after a reload of a Swift server.

For example, when upgrading a proxy server you would probably finish
with a `swift-init proxy-server reload` or `/etc/init.d/swift-proxy
reload`. This kills the parent proxy server process and leaves the
child processes running to finish processing whatever requests they
might be handling at the time. It then starts up a new parent proxy
server process and its children to handle new incoming requests. This
allows zero-downtime upgrades with no impact to existing requests.

The orphaned child processes may take a while to exit, depending on
the length of the requests they were handling. However, sometimes an
old process can hang due to a bug or hardware issue. In these cases,
the orphaned processes will stay around indefinitely. `swift-orphans`
can be used to find and kill these orphans.

`swift-orphans` with no arguments will just list the orphans it finds
that were started more than 24 hours ago. You shouldn't really check
for orphans until 24 hours after you perform a reload, as some
requests can take a long time to process. `swift-orphans -k TERM` will
send the SIGTERM signal to the orphan processes, or you can `kill
-TERM` the pids yourself if you prefer.

You can run `swift-orphans --help` for more options.
|
|
|
|
|
|
------------
Swift Oldies
------------

Swift Oldies are processes that have just been around for a long
time. There's nothing necessarily wrong with this, but it might
indicate a hung process if you regularly upgrade and reload/restart
services. You might have so many servers that you don't notice when a
reload/restart fails; `swift-oldies` can help with this.

For example, if you upgraded and reloaded/restarted everything 2 days
ago, and you've already cleaned up any orphans with `swift-orphans`,
you can run `swift-oldies -a 48` to find any Swift processes still
around that were started more than 2 days ago and then investigate
them accordingly.
|
|
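The age check in that example amounts to comparing each process start time
against a cutoff of 48 hours before now. A minimal sketch, with made-up pids
and start times and a hypothetical ``older_than`` helper, might look like
this:

.. code-block:: python

    from datetime import datetime, timedelta


    def older_than(processes, hours, now):
        """Return pids whose start time is more than ``hours`` before now."""
        cutoff = now - timedelta(hours=hours)
        return [pid for pid, started in processes if started < cutoff]


    # Hypothetical process table: (pid, start time).
    now = datetime(2024, 1, 10, 12, 0)
    processes = [
        (101, datetime(2024, 1, 7, 9, 0)),   # started before the upgrade
        (202, datetime(2024, 1, 9, 18, 0)),  # started after the reload
    ]
    stale = older_than(processes, 48, now)

Only pid 101 predates the cutoff, so it is the one worth investigating.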
|
|
|
|
|
|
-------------------
Custom Log Handlers
-------------------

Swift supports setting up custom log handlers for services by specifying a
comma-separated list of functions to invoke when logging is set up. It does so
via the `log_custom_handlers` configuration option. The logger hooks invoked
are passed the same arguments as Swift's get_logger function (as well as the
getLogger and LogAdapter objects):

============== ===============================================
Name           Description
-------------- -----------------------------------------------
conf           Configuration dict to read settings from
name           Name of the logger received
log_to_console (optional) Write log messages to console on stderr
log_route      Route for the logging received
fmt            Override log format received
logger         The logging.getLogger object
adapted_logger The LogAdapter object
============== ===============================================

A basic example that sets up a custom logger might look like the
following:

.. code-block:: python

    def my_logger(conf, name, log_to_console, log_route, fmt, logger,
                  adapted_logger):
        # 'some_custom_setting' and third_party_logstore_handler are
        # placeholders for your own option and handler factory.
        my_conf_opt = conf.get('some_custom_setting')
        my_handler = third_party_logstore_handler(my_conf_opt)
        logger.addHandler(my_handler)
|
|
|
|
See :ref:`custom-logger-hooks-label` for sample use cases.
|
|
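For a version of the hook that can be run as is, the sketch below swaps the
third-party handler for the standard library's
``logging.handlers.BufferingHandler``. The option name
``my_buffer_capacity`` and the function name ``buffer_log_hook`` are
invented for this example, and the hook returns the handler only so the
demonstration can inspect it.

.. code-block:: python

    import logging
    import logging.handlers


    def buffer_log_hook(conf, name, log_to_console, log_route, fmt, logger,
                        adapted_logger):
        """Attach an in-memory buffering handler to the given logger."""
        # 'my_buffer_capacity' is a made-up option used for illustration.
        capacity = int(conf.get('my_buffer_capacity', 100))
        handler = logging.handlers.BufferingHandler(capacity)
        if fmt:
            handler.setFormatter(logging.Formatter(fmt))
        logger.addHandler(handler)
        return handler


    # Call the hook by hand with the arguments listed in the table above.
    demo_logger = logging.getLogger('hook-demo')
    demo_handler = buffer_log_hook({'my_buffer_capacity': '10'}, 'hook-demo',
                                   False, 'hook-demo', None, demo_logger,
                                   None)
    demo_logger.warning('disk %s is unmounted', 'sdb')

After the call, ``demo_handler.buffer`` holds the warning record. In a real
deployment the handler would forward records to whatever log store you use
instead of keeping them in memory.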
|
|
------------------------
Securing OpenStack Swift
------------------------

Please refer to the security guides at:

* http://docs.openstack.org/sec/
* http://docs.openstack.org/security-guide/content/object-storage.html
|