Operational procedures guide

This is the operational procedures guide that HPE used to operate and monitor
its public Swift systems. It has been made publicly available.

Change-Id: Iefb484893056d28beb69265d99ba30c3c84add2b

parent 30624a866a
commit 3c61ab4678

@@ -86,6 +86,7 @@ Administrator Documentation

     admin_guide
     replication_network
     logs
+    ops_runbook/index

 Object Storage v1 REST API Documentation
 ========================================

doc/source/ops_runbook/diagnose.rst (new file, 1031 lines)
File diff suppressed because it is too large.

doc/source/ops_runbook/general.rst (new file, 36 lines)

@@ -0,0 +1,36 @@

==================
General Procedures
==================

Getting a swift account's stats
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. note::

   ``swift-direct`` is specific to the HPE Helion Public Cloud. Use
   ``swiftly`` as an alternative; the following is simply an example.

This procedure describes how you determine the swift usage for a given
swift account, that is, the number of containers, the number of objects
and the total bytes used. To do this you will need the project ID.

Log onto one of the swift proxy servers.

Use swift-direct to show this account's usage:

.. code::

   $ sudo -u swift /opt/hp/swift/bin/swift-direct show AUTH_redacted-9a11-45f8-aa1c-9e7b1c7904c8
   Status: 200
   Content-Length: 0
   Accept-Ranges: bytes
   X-Timestamp: 1379698586.88364
   X-Account-Bytes-Used: 67440225625994
   X-Account-Container-Count: 1
   Content-Type: text/plain; charset=utf-8
   X-Account-Object-Count: 8436776
   Status: 200
   name: my_container count: 8436776 bytes: 67440225625994

This account has 1 container. That container has 8436776 objects. The
total bytes used is 67440225625994.
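
If ``swift-direct`` is not available in your environment, a similar check can
be made directly against one of the account servers. The sketch below is only
an illustration: it uses ``swift-get-nodes`` (shipped with Swift) to locate
the account, then a ``curl`` HEAD against one of the primary account servers
listed in its output; the ring path, port and account name are placeholders.

.. code::

   # Locate the account database for the project (placeholder account name).
   $ sudo swift-get-nodes /etc/swift/account.ring.gz AUTH_<project-id>

   # HEAD one of the primary account servers from the output above; the
   # X-Account-* headers carry the same container/object/byte counts.
   $ curl -I -XHEAD "http://<account-server-ip>:6002/<device>/<partition>/AUTH_<project-id>"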

doc/source/ops_runbook/index.rst (new file, 79 lines)

@@ -0,0 +1,79 @@

=================
Swift Ops Runbook
=================

This document contains operational procedures that Hewlett Packard
Enterprise (HPE) uses to operate and monitor the Swift system within the
HPE Helion Public Cloud. This document is an excerpt of a larger
product-specific handbook. As such, the material may appear incomplete.
The suggestions and recommendations made in this document are for our
particular environment, and may not be suitable for your environment or
situation. We make no representations concerning the accuracy, adequacy,
completeness or suitability of the information, suggestions or
recommendations. This document is provided for reference only. We are
not responsible for your use of any information, suggestions or
recommendations contained herein.

This document also contains references to certain tools that we use to
operate the Swift system within the HPE Helion Public Cloud.
Descriptions of these tools are provided for reference only, as the
tools themselves are not publicly available at this time.

- ``swift-direct``: This is similar to the ``swiftly`` tool.


.. toctree::
   :maxdepth: 2

   general.rst
   diagnose.rst
   procedures.rst
   maintenance.rst
   troubleshooting.rst

Is the system up?
~~~~~~~~~~~~~~~~~

If you have a report that Swift is down, perform the following basic checks:

#. Run swift functional tests.

#. From a server in your data center, use ``curl`` to check
   ``/healthcheck`` (see the example after this list).

#. If you have a monitoring system, check your monitoring system.

#. Check your hardware load balancer infrastructure.

#. Run swift-recon on a proxy node.
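
For the ``curl`` check, something like the following is usually enough (the
address is a placeholder for a proxy or its VIP); the healthcheck middleware
returns ``200 OK`` when the proxy is healthy:

.. code::

   $ curl -i http://<proxy-or-vip-address>/healthcheck
   HTTP/1.1 200 OK
   ...
   OK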

Run swift functional tests
--------------------------

We recommend that you set up your functional tests against your production
system.

A script for running the functional tests is located in ``swift/.functests``.


External monitoring
-------------------

- We use pingdom.com to monitor the external Swift API. We suggest the
  following:

  - Do a GET on ``/healthcheck``

  - Create a container, make it public (x-container-read:
    .r\*,.rlistings), create a small file in the container; do a GET
    on the object (see the sketch after this list)
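
A minimal sketch of that container/object probe, assuming the ``swift``
command-line client from ``python-swiftclient`` and placeholder names; adapt
the ACL value, names and endpoint to your own tooling:

.. code::

   # Make a monitoring container world-readable, upload a tiny object,
   # then fetch it anonymously the way the external monitor would.
   $ swift post -r '.r:*,.rlistings' monitor_container
   $ echo "ping" > ping.txt
   $ swift upload monitor_container ping.txt
   $ curl -s http://<swift-endpoint>/v1/AUTH_<project-id>/monitor_container/ping.txt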

Reference information
~~~~~~~~~~~~~~~~~~~~~

Reference: Swift startup/shutdown
---------------------------------

- Use reload, not stop/start/restart.

- Try to roll sets of servers (especially proxy) in groups of less
  than 20% of your servers.
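
For example, to reload the proxy services on one node (a sketch; the same
pattern applies to the other Swift services):

.. code::

   $ sudo swift-init proxy reload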

doc/source/ops_runbook/maintenance.rst (new file, 322 lines)

@@ -0,0 +1,322 @@

==================
Server maintenance
==================

General assumptions
~~~~~~~~~~~~~~~~~~~

- It is assumed that anyone attempting to replace hardware components
  will have already read and understood the appropriate maintenance and
  service guides.

- It is assumed that where servers need to be taken off-line for
  hardware replacement, this will be done in series, bringing each
  server back on-line before taking the next off-line.

- It is assumed that the operations directed procedure will be used for
  identifying hardware for replacement.

Assessing the health of swift
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can run the swift-recon tool on a Swift proxy node to get a quick
check of how Swift is doing. Please note that the numbers below are
necessarily somewhat subjective. Sometimes parameters for which we
say 'low values are good' will have pretty high values for a time. Often
if you wait a while things get better.

For example:

.. code::

   sudo swift-recon -rla
   ===============================================================================
   [2012-03-10 12:57:21] Checking async pendings on 384 hosts...
   Async stats: low: 0, high: 1, avg: 0, total: 1
   ===============================================================================

   [2012-03-10 12:57:22] Checking replication times on 384 hosts...
   [Replication Times] shortest: 1.4113877813, longest: 36.8293570836, avg: 4.86278064749
   ===============================================================================

   [2012-03-10 12:57:22] Checking load avg's on 384 hosts...
   [5m load average] lowest: 2.22, highest: 9.5, avg: 4.59578125
   [15m load average] lowest: 2.36, highest: 9.45, avg: 4.62622395833
   [1m load average] lowest: 1.84, highest: 9.57, avg: 4.5696875
   ===============================================================================

In the example above we ask for information on replication times (-r),
load averages (-l) and async pendings (-a). This is a healthy Swift
system. Rules-of-thumb for 'good' recon output are:

- Nodes that respond are up and running Swift. If all nodes respond,
  that is a good sign. But some nodes may time out. For example:

  .. code::

     -> [http://<redacted>.29:6000/recon/load:] <urlopen error [Errno 111] ECONNREFUSED>
     -> [http://<redacted>.31:6000/recon/load:] <urlopen error timed out>

  That could be okay or could require investigation.

- Low values (say < 10 for high and average) for async pendings are
  good. Higher values occur when disks are down and/or when the system
  is heavily loaded. Many simultaneous PUTs to the same container can
  drive async pendings up. This may be normal, and may resolve itself
  after a while. If it persists, one way to track down the problem is
  to find a node with high async pendings (with ``swift-recon -av | sort
  -n -k4``), then check its Swift logs. Often async pendings are high
  because a node cannot write to a container on another node. Often
  this is because the node or disk is offline or bad. This may be okay
  if we know about it.

- Low values for replication times are good. These values rise when new
  rings are pushed, and when nodes and devices are brought back on
  line.

- Our 'high' load average values are typically in the 9-15 range. If
  they are a lot bigger it is worth having a look at the systems
  pushing the average up. Run ``swift-recon -av`` to get the individual
  averages. To sort the entries with the highest at the end,
  run ``swift-recon -av | sort -n -k4``.

For comparison here is the recon output for the same system above when
two entire racks of Swift are down:

.. code::

   [2012-03-10 16:56:33] Checking async pendings on 384 hosts...
   -> http://<redacted>.22:6000/recon/async: <urlopen error timed out>
   -> http://<redacted>.18:6000/recon/async: <urlopen error timed out>
   -> http://<redacted>.16:6000/recon/async: <urlopen error timed out>
   -> http://<redacted>.13:6000/recon/async: <urlopen error timed out>
   -> http://<redacted>.30:6000/recon/async: <urlopen error timed out>
   -> http://<redacted>.6:6000/recon/async: <urlopen error timed out>
   .........
   -> http://<redacted>.5:6000/recon/async: <urlopen error timed out>
   -> http://<redacted>.15:6000/recon/async: <urlopen error timed out>
   -> http://<redacted>.9:6000/recon/async: <urlopen error timed out>
   -> http://<redacted>.27:6000/recon/async: <urlopen error timed out>
   -> http://<redacted>.4:6000/recon/async: <urlopen error timed out>
   -> http://<redacted>.8:6000/recon/async: <urlopen error timed out>
   Async stats: low: 243, high: 659, avg: 413, total: 132275
   ===============================================================================
   [2012-03-10 16:57:48] Checking replication times on 384 hosts...
   -> http://<redacted>.22:6000/recon/replication: <urlopen error timed out>
   -> http://<redacted>.18:6000/recon/replication: <urlopen error timed out>
   -> http://<redacted>.16:6000/recon/replication: <urlopen error timed out>
   -> http://<redacted>.13:6000/recon/replication: <urlopen error timed out>
   -> http://<redacted>.30:6000/recon/replication: <urlopen error timed out>
   -> http://<redacted>.6:6000/recon/replication: <urlopen error timed out>
   ............
   -> http://<redacted>.5:6000/recon/replication: <urlopen error timed out>
   -> http://<redacted>.15:6000/recon/replication: <urlopen error timed out>
   -> http://<redacted>.9:6000/recon/replication: <urlopen error timed out>
   -> http://<redacted>.27:6000/recon/replication: <urlopen error timed out>
   -> http://<redacted>.4:6000/recon/replication: <urlopen error timed out>
   -> http://<redacted>.8:6000/recon/replication: <urlopen error timed out>
   [Replication Times] shortest: 1.38144306739, longest: 112.620954418, avg: 10.2859475361
   ===============================================================================
   [2012-03-10 16:59:03] Checking load avg's on 384 hosts...
   -> http://<redacted>.22:6000/recon/load: <urlopen error timed out>
   -> http://<redacted>.18:6000/recon/load: <urlopen error timed out>
   -> http://<redacted>.16:6000/recon/load: <urlopen error timed out>
   -> http://<redacted>.13:6000/recon/load: <urlopen error timed out>
   -> http://<redacted>.30:6000/recon/load: <urlopen error timed out>
   -> http://<redacted>.6:6000/recon/load: <urlopen error timed out>
   ............
   -> http://<redacted>.15:6000/recon/load: <urlopen error timed out>
   -> http://<redacted>.9:6000/recon/load: <urlopen error timed out>
   -> http://<redacted>.27:6000/recon/load: <urlopen error timed out>
   -> http://<redacted>.4:6000/recon/load: <urlopen error timed out>
   -> http://<redacted>.8:6000/recon/load: <urlopen error timed out>
   [5m load average] lowest: 1.71, highest: 4.91, avg: 2.486375
   [15m load average] lowest: 1.79, highest: 5.04, avg: 2.506125
   [1m load average] lowest: 1.46, highest: 4.55, avg: 2.4929375
   ===============================================================================

.. note::

   The replication times and load averages are within reasonable
   parameters, even with 80 object stores down. Async pendings, however,
   are quite high. This is due to the fact that the containers on the
   servers which are down cannot be updated. When those servers come back
   up, async pendings should drop. If async pendings were at this level
   without an explanation, we would have a problem.

Recon examples
~~~~~~~~~~~~~~

Here is an example of noting and tracking down a problem with recon.

Running recon shows some async pendings:

.. code::

   bob@notso:~/swift-1.4.4/swift$ ssh -q <redacted>.132.7 sudo swift-recon -alr
   ===============================================================================
   [2012-03-14 17:25:55] Checking async pendings on 384 hosts...
   Async stats: low: 0, high: 23, avg: 8, total: 3356
   ===============================================================================
   [2012-03-14 17:25:55] Checking replication times on 384 hosts...
   [Replication Times] shortest: 1.49303831657, longest: 39.6982825994, avg: 4.2418222066
   ===============================================================================
   [2012-03-14 17:25:56] Checking load avg's on 384 hosts...
   [5m load average] lowest: 2.35, highest: 8.88, avg: 4.45911458333
   [15m load average] lowest: 2.41, highest: 9.11, avg: 4.504765625
   [1m load average] lowest: 1.95, highest: 8.56, avg: 4.40588541667
   ===============================================================================

Why? Running recon again with ``-av`` (not shown here) tells us that
the node with the highest async pendings (23) is <redacted>.72.61.
Looking at the log files on <redacted>.72.61 we see:

.. code::

   souzab@<redacted>:~$ sudo tail -f /var/log/swift/background.log | grep -i ERROR
   Mar 14 17:28:06 <redacted> container-replicator ERROR Remote drive not mounted
   {'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.119', 'id': 5481, 'meta': '', 'device': 'disk6', 'port': 6001}
   Mar 14 17:28:06 <redacted> container-replicator ERROR Remote drive not mounted
   {'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.119', 'id': 5481, 'meta': '', 'device': 'disk6', 'port': 6001}
   Mar 14 17:28:09 <redacted> container-replicator ERROR Remote drive not mounted
   {'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6001}
   Mar 14 17:28:11 <redacted> container-replicator ERROR Remote drive not mounted
   {'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6001}
   Mar 14 17:28:13 <redacted> container-replicator ERROR Remote drive not mounted
   {'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.119', 'id': 5481, 'meta': '', 'device': 'disk6', 'port': 6001}
   Mar 14 17:28:13 <redacted> container-replicator ERROR Remote drive not mounted
   {'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.119', 'id': 5481, 'meta': '', 'device': 'disk6', 'port': 6001}
   Mar 14 17:28:15 <redacted> container-replicator ERROR Remote drive not mounted
   {'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6001}
   Mar 14 17:28:15 <redacted> container-replicator ERROR Remote drive not mounted
   {'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6001}
   Mar 14 17:28:19 <redacted> container-replicator ERROR Remote drive not mounted
   {'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6001}
   Mar 14 17:28:19 <redacted> container-replicator ERROR Remote drive not mounted
   {'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6001}
   Mar 14 17:28:20 <redacted> container-replicator ERROR Remote drive not mounted
   {'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.119', 'id': 5481, 'meta': '', 'device': 'disk6', 'port': 6001}
   Mar 14 17:28:21 <redacted> container-replicator ERROR Remote drive not mounted
   {'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6001}
   Mar 14 17:28:21 <redacted> container-replicator ERROR Remote drive not mounted
   {'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6001}
   Mar 14 17:28:22 <redacted> container-replicator ERROR Remote drive not mounted
   {'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6001}

That is why this node has a lot of async pendings: a bunch of disks that
are not mounted on <redacted> and <redacted>. There may be other issues,
but clearing this up will likely drop the async pendings a fair bit, as
other nodes will be having the same problem.

Assessing the availability risk when multiple storage servers are down
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. note::

   This procedure will tell you if you have a problem. In practice,
   however, you will find that you do not need to use it frequently.

If three storage nodes (or, more precisely, three disks on three
different storage nodes) are down, there is a small but nonzero
probability that user objects, containers, or accounts will not be
available.

Procedure
---------

.. note::

   Swift has three rings: one each for objects, containers and accounts.
   This procedure should be run three times, each time specifying the
   appropriate ``*.builder`` file.

#. Determine whether all three nodes are in different Swift zones by
   running the ring builder on a proxy node to determine which zones
   the storage nodes are in. For example:

   .. code::

      % sudo swift-ring-builder /etc/swift/object.builder
      /etc/swift/object.builder, build version 1467
      2097152 partitions, 3 replicas, 5 zones, 1320 devices, 0.02 balance
      The minimum number of hours before a partition can be reassigned is 24
      Devices:  id  zone  ip address    port  name   weight   partitions  balance  meta
                 0     1  <redacted>.4  6000  disk0  1708.00        4259    -0.00
                 1     1  <redacted>.4  6000  disk1  1708.00        4260     0.02
                 2     1  <redacted>.4  6000  disk2  1952.00        4868     0.01
                 3     1  <redacted>.4  6000  disk3  1952.00        4868     0.01
                 4     1  <redacted>.4  6000  disk4  1952.00        4867    -0.01

#. Here, node <redacted>.4 is in zone 1. If two or more of the three
   nodes under consideration are in the same Swift zone, they do not
   have any ring partitions in common; there is little/no data
   availability risk if all three nodes are down.

#. If the nodes are in three distinct Swift zones, it is necessary to
   check whether the nodes have ring partitions in common. Run
   ``swift-ring-builder`` again, this time with the ``list_parts`` option,
   and specify the nodes under consideration. For example (all on one line):

   .. code::

      % sudo swift-ring-builder /etc/swift/object.builder list_parts <redacted>.8 <redacted>.15 <redacted>.72.2
      Partition   Matches
      91          2
      729         2
      3754        2
      3769        2
      3947        2
      5818        2
      7918        2
      8733        2
      9509        2
      10233       2

#. The ``list_parts`` option to the ring builder indicates how many ring
   partitions the nodes have in common. If, as in this case, the
   first entry in the list has a 'Matches' column of 2 or less, there
   is no data availability risk if all three nodes are down.

#. If the 'Matches' column has entries equal to 3, there is some data
   availability risk if all three nodes are down. The risk is generally
   small, and is proportional to the number of entries that have a 3 in
   the Matches column. For example:

   .. code::

      Partition   Matches
      26865       3
      362367      3
      745940      3
      778715      3
      797559      3
      820295      3
      822118      3
      839603      3
      852332      3
      855965      3
      858016      3

#. A quick way to count the number of rows with 3 matches is:

   .. code::

      % sudo swift-ring-builder /etc/swift/object.builder list_parts <redacted>.8 <redacted>.15 <redacted>.72.2 | grep "3$" | wc -l

      30

#. In this case the nodes have 30 out of a total of 2097152 partitions
   in common; about 0.001%. In this case the risk is small, but nonzero.
   Recall that a partition is simply a portion of the ring mapping
   space, not actual data. So having partitions in common is a necessary
   but not sufficient condition for data unavailability. A quick way to
   compute the percentage is shown after this list.
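
As a quick sanity check of that percentage (plain arithmetic, here using
``bc``):

.. code::

   $ echo "scale=6; 30 / 2097152 * 100" | bc
   .001400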

.. note::

   We should not bring down a node for repair if it shows Matches entries
   of 3 with other nodes that are also down.

If three nodes that have 3 partitions in common are all down, there is
a nonzero probability that data are unavailable and we should work to
bring some or all of the nodes up ASAP.

doc/source/ops_runbook/procedures.rst (new file, 367 lines)

@@ -0,0 +1,367 @@

=================================
Software configuration procedures
=================================

Fix broken GPT table (broken disk partition)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- If a GPT table is broken, a message like the following is observed when
  the following command is run:

  .. code::

     $ sudo parted -l

  .. code::

     ...
     Error: The backup GPT table is corrupt, but the primary appears OK, so that will
     be used.
     OK/Cancel?

#. To fix this, first install the ``gdisk`` program:

   .. code::

      $ sudo aptitude install gdisk

#. Run ``gdisk`` for the particular drive with the damaged partition
   (``/dev/sd#`` is a placeholder for the device):

   .. code::

      $ sudo gdisk /dev/sd#
      GPT fdisk (gdisk) version 0.6.14

      Caution: invalid backup GPT header, but valid main header; regenerating
      backup header from main header.

      Warning! One or more CRCs don't match. You should repair the disk!

      Partition table scan:
        MBR: protective
        BSD: not present
        APM: not present
        GPT: damaged

      /dev/sd#
      *****************************************************************************
      Caution: Found protective or hybrid MBR and corrupt GPT. Using GPT, but disk
      verification and recovery are STRONGLY recommended.
      *****************************************************************************

#. On the command prompt, type ``r`` (recovery and transformation
   options), followed by ``d`` (use main GPT header), ``v`` (verify disk)
   and finally ``w`` (write table to disk and exit). You will also need to
   enter ``Y`` when prompted in order to confirm actions.

   .. code::

      Command (? for help): r

      Recovery/transformation command (? for help): d

      Recovery/transformation command (? for help): v

      Caution: The CRC for the backup partition table is invalid. This table may
      be corrupt. This program will automatically create a new backup partition
      table when you save your partitions.

      Caution: Partition 1 doesn't begin on a 8-sector boundary. This may
      result in degraded performance on some modern (2009 and later) hard disks.

      Caution: Partition 2 doesn't begin on a 8-sector boundary. This may
      result in degraded performance on some modern (2009 and later) hard disks.

      Caution: Partition 3 doesn't begin on a 8-sector boundary. This may
      result in degraded performance on some modern (2009 and later) hard disks.

      Identified 1 problems!

      Recovery/transformation command (? for help): w

      Final checks complete. About to write GPT data. THIS WILL OVERWRITE EXISTING
      PARTITIONS!!

      Do you want to proceed, possibly destroying your data? (Y/N): Y

      OK; writing new GUID partition table (GPT).
      The operation has completed successfully.

#. Running the following command should now show that the partition is
   recovered and healthy again:

   .. code::

      $ sudo parted /dev/sd#

#. Finally, uninstall ``gdisk`` from the node:

   .. code::

      $ sudo aptitude remove gdisk

Procedure: Fix broken XFS filesystem
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

#. A filesystem may be corrupt or broken if the following output is
   observed when checking its label:

   .. code::

      $ sudo xfs_admin -l /dev/sd#
        cache_node_purge: refcount was 1, not zero (node=0x25d5ee0)
        xfs_admin: cannot read root inode (117)
        cache_node_purge: refcount was 1, not zero (node=0x25d92b0)
        xfs_admin: cannot read realtime bitmap inode (117)
        bad sb magic # 0 in AG 1
        failed to read label in AG 1

#. Run the following commands to remove the broken/corrupt filesystem and
   replace it (this example uses the filesystem ``/dev/sdb2``). First,
   replace the partition:

   .. code::

      $ sudo parted
      GNU Parted 2.3
      Using /dev/sda
      Welcome to GNU Parted! Type 'help' to view a list of commands.
      (parted) select /dev/sdb
      Using /dev/sdb
      (parted) p
      Model: HP LOGICAL VOLUME (scsi)
      Disk /dev/sdb: 2000GB
      Sector size (logical/physical): 512B/512B
      Partition Table: gpt

      Number  Start   End     Size    File system  Name                        Flags
       1      17.4kB  1024MB  1024MB  ext3                                     boot
       2      1024MB  1751GB  1750GB  xfs          sw-aw2az1-object045-disk1
       3      1751GB  2000GB  249GB                                            lvm

      (parted) rm 2
      (parted) mkpart primary 2 -1
      Warning: You requested a partition from 2000kB to 2000GB.
      The closest location we can manage is 1024MB to 1751GB.
      Is this still acceptable to you?
      Yes/No? Yes
      Warning: The resulting partition is not properly aligned for best performance.
      Ignore/Cancel? Ignore
      (parted) p
      Model: HP LOGICAL VOLUME (scsi)
      Disk /dev/sdb: 2000GB
      Sector size (logical/physical): 512B/512B
      Partition Table: gpt

      Number  Start   End     Size    File system  Name     Flags
       1      17.4kB  1024MB  1024MB  ext3                  boot
       2      1024MB  1751GB  1750GB  xfs          primary
       3      1751GB  2000GB  249GB                         lvm

      (parted) quit

#. The next step is to scrub the filesystem and format it:

   .. code::

      $ sudo dd if=/dev/zero of=/dev/sdb2 bs=$((1024*1024)) count=1
      1+0 records in
      1+0 records out
      1048576 bytes (1.0 MB) copied, 0.00480617 s, 218 MB/s
      $ sudo /sbin/mkfs.xfs -f -i size=1024 /dev/sdb2
      meta-data=/dev/sdb2          isize=1024   agcount=4, agsize=106811524 blks
               =                   sectsz=512   attr=2, projid32bit=0
      data     =                   bsize=4096   blocks=427246093, imaxpct=5
               =                   sunit=0      swidth=0 blks
      naming   =version 2          bsize=4096   ascii-ci=0
      log      =internal log       bsize=4096   blocks=208616, version=2
               =                   sectsz=512   sunit=0 blks, lazy-count=1
      realtime =none               extsz=4096   blocks=0, rtextents=0

#. You should now label and mount your filesystem. A sketch of this step
   follows; the label shown is illustrative only (use your own naming
   convention) and the mount point is a placeholder.
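
   .. code::

      # Re-apply a filesystem label (example name only) and mount the disk
      # at its usual mount point (placeholder; assumes an fstab entry exists).
      $ sudo xfs_admin -L sw-aw2az1-object045-disk1 /dev/sdb2
      $ sudo mount /srv/node/<disk>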

#. You can now check whether the filesystem is mounted using the command:

   .. code::

      $ mount

Procedure: Checking if an account is okay
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. note::

   ``swift-direct`` is only available in the HPE Helion Public Cloud.
   Use ``swiftly`` as an alternative.

If you have a tenant ID, you can check that the account is okay as follows
from a proxy.

.. code::

   $ sudo -u swift /opt/hp/swift/bin/swift-direct show <Api-Auth-Hash-or-TenantId>

The response will either be similar to a swift list of the account
containers, or an error indicating that the resource could not be found.

In the latter case you can establish whether a backend database exists for
the tenant ID by running the following on a proxy:

.. code::

   $ sudo -u swift swift-get-nodes /etc/swift/account.ring.gz <Api-Auth-Hash-or-TenantId>

The response will list ssh commands that will list the replicated
account databases, if they exist.

Procedure: Revive a deleted account
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Swift accounts are normally not recreated. If a tenant unsubscribes from
Swift, the account is deleted. To re-subscribe to Swift, you can create
a new tenant (new tenant ID), and subscribe to Swift. This creates a
new Swift account with the new tenant ID.

However, until the unsubscribe/new-tenant process is supported, you may
hit a situation where a Swift account is deleted and the user is locked
out of Swift.

Deleting the account database files
-----------------------------------

Here is one possible solution. The containers and objects may be lost
forever. The solution is to delete the account database files and
re-create the account. This may only be done once the containers and
objects are completely deleted. This process is untested, but could
work as follows:

#. Use swift-get-nodes to locate the account's database file (on three
   servers).

#. Rename the database files (on three servers).

#. Use ``swiftly`` to create the account (using the original name).

Renaming account database so it can be revived
----------------------------------------------

Get the locations of the database files that hold the account data.

.. code::

   sudo swift-get-nodes /etc/swift/account.ring.gz AUTH_redacted-1856-44ae-97db-31242f7ad7a1

   Account    AUTH_redacted-1856-44ae-97db-31242f7ad7a1
   Container  None
   Object     None

   Partition  18914
   Hash       93c41ef56dd69173a9524193ab813e78

   Server:Port Device  15.184.9.126:6002 disk7
   Server:Port Device  15.184.9.94:6002 disk11
   Server:Port Device  15.184.9.103:6002 disk10
   Server:Port Device  15.184.9.80:6002 disk2   [Handoff]
   Server:Port Device  15.184.9.120:6002 disk2  [Handoff]
   Server:Port Device  15.184.9.98:6002 disk2   [Handoff]

   curl -I -XHEAD "http://15.184.9.126:6002/disk7/18914/AUTH_redacted-1856-44ae-97db-31242f7ad7a1"
   curl -I -XHEAD "http://15.184.9.94:6002/disk11/18914/AUTH_redacted-1856-44ae-97db-31242f7ad7a1"
   curl -I -XHEAD "http://15.184.9.103:6002/disk10/18914/AUTH_redacted-1856-44ae-97db-31242f7ad7a1"
   curl -I -XHEAD "http://15.184.9.80:6002/disk2/18914/AUTH_redacted-1856-44ae-97db-31242f7ad7a1"   # [Handoff]
   curl -I -XHEAD "http://15.184.9.120:6002/disk2/18914/AUTH_redacted-1856-44ae-97db-31242f7ad7a1"  # [Handoff]
   curl -I -XHEAD "http://15.184.9.98:6002/disk2/18914/AUTH_redacted-1856-44ae-97db-31242f7ad7a1"   # [Handoff]

   ssh 15.184.9.126 "ls -lah /srv/node/disk7/accounts/18914/e78/93c41ef56dd69173a9524193ab813e78/"
   ssh 15.184.9.94 "ls -lah /srv/node/disk11/accounts/18914/e78/93c41ef56dd69173a9524193ab813e78/"
   ssh 15.184.9.103 "ls -lah /srv/node/disk10/accounts/18914/e78/93c41ef56dd69173a9524193ab813e78/"
   ssh 15.184.9.80 "ls -lah /srv/node/disk2/accounts/18914/e78/93c41ef56dd69173a9524193ab813e78/"   # [Handoff]
   ssh 15.184.9.120 "ls -lah /srv/node/disk2/accounts/18914/e78/93c41ef56dd69173a9524193ab813e78/"  # [Handoff]
   ssh 15.184.9.98 "ls -lah /srv/node/disk2/accounts/18914/e78/93c41ef56dd69173a9524193ab813e78/"   # [Handoff]

Check that the handoff nodes do not have account databases:

.. code::

   $ ssh 15.184.9.80 "ls -lah /srv/node/disk2/accounts/18914/e78/93c41ef56dd69173a9524193ab813e78/"
   ls: cannot access /srv/node/disk2/accounts/18914/e78/93c41ef56dd69173a9524193ab813e78/: No such file or directory

If the handoff node has a database, wait for rebalancing to occur.
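
The rename itself is not shown in the original handbook excerpt. A sketch of
what it could look like on one primary server, reusing the paths from the
``swift-get-nodes`` output above (the ``.bak`` suffix is arbitrary); repeat on
the other two primary servers:

.. code::

   # On each primary account server (here 15.184.9.126, disk7), rename the
   # account database files out of the way.
   $ cd /srv/node/disk7/accounts/18914/e78/93c41ef56dd69173a9524193ab813e78/
   $ sudo mv 93c41ef56dd69173a9524193ab813e78.db 93c41ef56dd69173a9524193ab813e78.db.bak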

Procedure: Temporarily stop load balancers from directing traffic to a proxy server
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can stop the load balancers sending requests to a proxy server as
follows. This can be useful when a proxy is misbehaving but you need
Swift running to help diagnose the problem. By removing it from the load
balancers, customers are not impacted by the misbehaving proxy.

#. Ensure that in ``proxy-server.conf`` the ``disable_path`` variable is set to
   ``/etc/swift/disabled-by-file``.

#. Log onto the proxy node.

#. Shut down Swift as follows:

   .. code::

      sudo swift-init proxy shutdown

   .. note::

      Shutdown, not stop.

#. Create the ``/etc/swift/disabled-by-file`` file. For example:

   .. code::

      sudo touch /etc/swift/disabled-by-file

#. Optionally, restart Swift:

   .. code::

      sudo swift-init proxy start

This works because the healthcheck middleware looks for this file. If it
finds it, it returns a 503 error instead of 200/OK. This means the load
balancer should stop sending traffic to the proxy.

``/healthcheck`` will report ``FAIL: disabled by file`` if the
``disabled-by-file`` file exists.
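
A quick way to confirm, sketched below with a placeholder address (the body
text comes from the healthcheck middleware; the exact headers will vary):

.. code::

   $ curl -i http://<proxy-ip-address>/healthcheck
   HTTP/1.1 503 Service Unavailable
   ...
   FAIL: disabled by file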

Procedure: Ad-Hoc disk performance test
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can get an idea of whether a disk drive is performing well as follows:

.. code::

   sudo dd bs=1M count=256 if=/dev/zero conv=fdatasync of=/srv/node/disk11/remember-to-delete-this-later

You can expect ~600MB/sec. If you get a low number, repeat the test many
times, as Swift itself may also be reading or writing to the disk, hence
giving a lower number.
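
When you are finished, remember to remove the test file, for example:

.. code::

   sudo rm /srv/node/disk11/remember-to-delete-this-later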

doc/source/ops_runbook/sec-furtherdiagnose.rst (new file, 177 lines)

@@ -0,0 +1,177 @@

==============================
Further issues and resolutions
==============================

.. note::

   The urgency level in each **Action** column indicates whether you
   need to take immediate action, or whether the problem can be worked
   on during business hours.

.. list-table::
   :widths: 33 33 33
   :header-rows: 1

   * - **Scenario**
     - **Description**
     - **Action**
   * - ``/healthcheck`` latency is high.
     - The ``/healthcheck`` test does not tax the proxy very much, so any rise
       in latency is probably related to network issues, rather than the
       proxies being very busy. A very slow proxy might impact the average
       number, but it would need to be very slow to shift the number that much.
     - Check networks. Do a ``curl https://<ip-address>/healthcheck``, where
       ``<ip-address>`` is the IP address of an individual proxy, to see if you
       can pinpoint a problem in the network.

       Urgency: If there are other indications that your system is slow, you should treat
       this as an urgent problem.
   * - A swift process is not running.
     - You can use ``swift-init status`` to check if swift processes are running on any
       given server.
     - Run this command:

       .. code::

          sudo swift-init all start

       Examine messages in the swift log files to see if there are any
       error messages related to any of the swift processes since the time you
       ran the ``swift-init`` command.

       Take any corrective actions that seem necessary.

       Urgency: If this only affects one server, and you have more than one,
       identifying and fixing the problem can wait until business hours.
       If this same problem affects many servers, then you need to take corrective
       action immediately.
   * - ntpd is not running.
     - NTP is not running.
     - Configure and start NTP.

       Urgency: For proxy servers, this is vital.

   * - Host clock is not synced to an NTP server.
     - The node's time setting does not match the NTP server time.
       This may take some time to sync after a reboot.
     - Assuming NTP is configured and running, you have to wait until the times sync.
   * - A swift process has hundreds to thousands of open file descriptors.
     - May happen to any of the swift processes.
       Known to have happened with an ``rsyslogd`` restart and where ``/tmp`` was hanging.

     - Restart the swift processes on the affected node:

       .. code::

          % sudo swift-init all reload

       Urgency:
       If known performance problem: Immediate

       If system seems fine: Medium
   * - A swift process is not owned by the swift user.
     - If the UID of the swift user has changed, then the processes might not be
       owned by that UID.
     - Urgency: If this only affects one server, and you have more than one,
       identifying and fixing the problem can wait until business hours.
       If this same problem affects many servers, then you need to take corrective
       action immediately.
   * - Object, account or container files are not owned by swift.
     - This typically happens if, during a reinstall or a re-image of a server, the UID
       of the swift user was changed. The data files in the object, account and container
       directories are owned by the original swift UID. As a result, the current swift
       user does not own these files.
     - Correct the UID of the swift user to reflect that of the original UID. An alternate
       action is to change the ownership of every file on all file systems. This alternate
       action is often impractical and will take considerable time.

       Urgency: If this only affects one server, and you have more than one,
       identifying and fixing the problem can wait until business hours.
       If this same problem affects many servers, then you need to take corrective
       action immediately.
   * - A disk drive has a high IO wait or service time.
     - If high IO wait times are seen for a single disk, then the disk drive is the problem.
       If most/all devices are slow, the controller is probably the source of the problem.
       The controller cache may also be misconfigured, which will cause similar long
       wait or service times.
     - As a first step, if your controllers have a cache, check that it is enabled and
       that its battery/capacitor is working.

       Second, reboot the server.
       If the problem persists, file a DC ticket to have the drive or controller replaced.
       See `Diagnose: Slow disk devices` on how to check the drive wait or service times.

       Urgency: Medium
   * - The network interface is not up.
     - Use the ``ifconfig`` and ``ethtool`` commands to determine the network state
       (see the example after this table).
     - You can try restarting the interface. However, generally the interface
       (or cable) is probably broken, especially if the interface is flapping.

       Urgency: If this only affects one server, and you have more than one,
       identifying and fixing the problem can wait until business hours.
       If this same problem affects many servers, then you need to take corrective
       action immediately.
   * - Network interface card (NIC) is not operating at the expected speed.
     - The NIC is running at a slower speed than its nominal rated speed.
       For example, it is running at 100 Mb/s and the NIC is a 1 GbE NIC.
     - 1. Try resetting the interface with:

          .. code::

             sudo ethtool -s eth0 speed 1000

          ... and then run:

          .. code::

             sudo lshw -class network

          See if the reported ``size`` goes to the expected speed. Failing
          that, check the hardware (NIC cable/switch port).

       2. If persistent, consider shutting down the server (especially if a proxy)
          until the problem is identified and resolved. If you leave this server
          running it can have a large impact on overall performance.

       Urgency: High
   * - The interface RX/TX error count is non-zero.
     - A value of 0 is typical, but counts of 1 or 2 do not indicate a problem.
     - 1. For low numbers (for example, 1 or 2), you can simply ignore them. Numbers
          in the range 3-30 probably indicate that the error count has crept up slowly
          over a long time. Consider rebooting the server to remove the report from
          the noise.

          Typically, when a cable or interface is bad, the error count goes to 400+;
          that is, it stands out. There may be other symptoms such as the interface
          going up and down or not running at the correct speed. A server with a high
          error count should be watched.

       2. If the error count continues to climb, consider taking the server down until
          it can be properly investigated. In any case, a reboot should be done to clear
          the error count.

       Urgency: High, if the error count is increasing.

   * - In a swift log you see a message that a process has not replicated in over 24 hours.
     - The replicator has not successfully completed a run in the last 24 hours.
       This indicates that the replicator has probably hung.
     - Use ``swift-init`` to stop and then restart the replicator process.

       Urgency: Low. However, if you recently added or replaced disk drives,
       then you should treat this urgently.
   * - Container Updater has not run in 4 hour(s).
     - The service may appear to be running; however, it may be hung. Examine the swift
       logs to see if there are any error messages relating to the container updater.
       This may potentially explain why the updater is not running.
     - Urgency: Medium
       This may have been triggered by a recent restart of the rsyslog daemon.
       Restart the service with:

       .. code::

          sudo swift-init <service> reload
   * - Object replicator: Reports the remaining time and that time is more than 100 hours.
     - Each replication cycle the object replicator writes a log message to its log
       reporting statistics about the current cycle. This includes an estimate for the
       remaining time needed to replicate all objects. If this time is longer than
       100 hours, there is a problem with the replication process.
     - Urgency: Medium
       Restart the service with:

       .. code::

          sudo swift-init object-replicator reload

       Check that the remaining replication time is going down.
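
As referenced in the network-interface rows above, a quick sketch for checking
link state, negotiated speed and error counters on one interface (the
interface name is a placeholder):

.. code::

   $ sudo ethtool eth0 | egrep "Speed|Link detected"
   $ ifconfig eth0 | egrep "errors|dropped"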

doc/source/ops_runbook/troubleshooting.rst (new file, 264 lines)

@@ -0,0 +1,264 @@

====================
Troubleshooting tips
====================

Diagnose: Customer complains they receive a HTTP status 500 when trying to browse containers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This entry is prompted by a real customer issue and is exclusively focused on
how that problem was identified.
There are many reasons why an HTTP status of 500 could be returned. If
there are no obvious problems with the swift object store, then it may
be necessary to take a closer look at the user's transactions.
After finding the user's swift account, you can
search the swift proxy logs on each swift proxy server for
transactions from this user. The linux ``bzgrep`` command can be used to
search all the proxy log files on a node, including the ``.bz2`` compressed
files. For example:

.. code::

   $ PDSH_SSH_ARGS_APPEND="-o StrictHostKeyChecking=no" pdsh -l <yourusername> -R ssh \
     -w <redacted>.68.[4-11,132-139],<redacted>.132.[4-11,132-139] \
     'sudo bzgrep -w AUTH_redacted-4962-4692-98fb-52ddda82a5af /var/log/swift/proxy.log*' | dshbak -c
   .
   .
   ----------------
   <redacted>.132.6
   ----------------
   Feb 29 08:51:57 sw-aw2az2-proxy011 proxy-server <redacted>.16.132 <redacted>.66.8 29/Feb/2012/08/51/57 GET /v1.0/AUTH_redacted-4962-4692-98fb-52ddda82a5af/%3Fformat%3Djson HTTP/1.0 404 - - <REDACTED>_4f4d50c5e4b064d88bd7ab82 - - - tx429fc3be354f434ab7f9c6c4206c1dc3 - 0.0130

This shows a ``GET`` operation on the user's account.

.. note::

   The HTTP status returned is 404, Not Found, rather than 500 as reported
   by the user.

Using the transaction ID, ``tx429fc3be354f434ab7f9c6c4206c1dc3``, you can
search the log files on the swift object servers for this transaction ID:

.. code::

   $ PDSH_SSH_ARGS_APPEND="-o StrictHostKeyChecking=no" pdsh -l <yourusername> -R ssh \
     -w <redacted>.72.[4-67],<redacted>.[4-67],<redacted>.[4-67],<redacted>.204.[4-131] \
     'sudo bzgrep tx429fc3be354f434ab7f9c6c4206c1dc3 /var/log/swift/server.log*' | dshbak -c
   .
   .
   ----------------
   <redacted>.72.16
   ----------------
   Feb 29 08:51:57 sw-aw2az1-object013 account-server <redacted>.132.6 - - [29/Feb/2012:08:51:57 +0000] "GET /disk9/198875/AUTH_redacted-4962-4692-98fb-52ddda82a5af" 404 - "tx429fc3be354f434ab7f9c6c4206c1dc3" "-" "-" 0.0016 ""
   ----------------
   <redacted>.31
   ----------------
   Feb 29 08:51:57 node-az2-object060 account-server <redacted>.132.6 - - [29/Feb/2012:08:51:57 +0000] "GET /disk6/198875/AUTH_redacted-4962-4692-98fb-52ddda82a5af" 404 - "tx429fc3be354f434ab7f9c6c4206c1dc3" "-" "-" 0.0011 ""
   ----------------
   <redacted>.204.70
   ----------------
   Feb 29 08:51:57 sw-aw2az3-object0067 account-server <redacted>.132.6 - - [29/Feb/2012:08:51:57 +0000] "GET /disk6/198875/AUTH_redacted-4962-4692-98fb-52ddda82a5af" 404 - "tx429fc3be354f434ab7f9c6c4206c1dc3" "-" "-" 0.0014 ""

.. note::

   There are three ``GET`` operations to three different object servers that
   hold the three replicas of this user's account. Each ``GET`` returns an
   HTTP status of 404, Not Found.

Next, use the ``swift-get-nodes`` command to determine exactly where the
user's account data is stored:

.. code::

   $ sudo swift-get-nodes /etc/swift/account.ring.gz AUTH_redacted-4962-4692-98fb-52ddda82a5af
   Account    AUTH_redacted-4962-4692-98fb-52ddda82a5af
   Container  None
   Object     None

   Partition  198875
   Hash       1846d99185f8a0edaf65cfbf37439696

   Server:Port Device  <redacted>.31:6002 disk6
   Server:Port Device  <redacted>.204.70:6002 disk6
   Server:Port Device  <redacted>.72.16:6002 disk9
   Server:Port Device  <redacted>.204.64:6002 disk11  [Handoff]
   Server:Port Device  <redacted>.26:6002 disk11      [Handoff]
   Server:Port Device  <redacted>.72.27:6002 disk11   [Handoff]

   curl -I -XHEAD "http://<redacted>.31:6002/disk6/198875/AUTH_redacted-4962-4692-98fb-52ddda82a5af"
   curl -I -XHEAD "http://<redacted>.204.70:6002/disk6/198875/AUTH_redacted-4962-4692-98fb-52ddda82a5af"
   curl -I -XHEAD "http://<redacted>.72.16:6002/disk9/198875/AUTH_redacted-4962-4692-98fb-52ddda82a5af"
   curl -I -XHEAD "http://<redacted>.204.64:6002/disk11/198875/AUTH_redacted-4962-4692-98fb-52ddda82a5af"  # [Handoff]
   curl -I -XHEAD "http://<redacted>.26:6002/disk11/198875/AUTH_redacted-4962-4692-98fb-52ddda82a5af"      # [Handoff]
   curl -I -XHEAD "http://<redacted>.72.27:6002/disk11/198875/AUTH_redacted-4962-4692-98fb-52ddda82a5af"   # [Handoff]

   ssh <redacted>.31 "ls -lah /srv/node/disk6/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/"
   ssh <redacted>.204.70 "ls -lah /srv/node/disk6/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/"
   ssh <redacted>.72.16 "ls -lah /srv/node/disk9/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/"
   ssh <redacted>.204.64 "ls -lah /srv/node/disk11/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/"  # [Handoff]
   ssh <redacted>.26 "ls -lah /srv/node/disk11/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/"      # [Handoff]
   ssh <redacted>.72.27 "ls -lah /srv/node/disk11/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/"   # [Handoff]

Check each of the primary servers, <redacted>.31, <redacted>.204.70 and <redacted>.72.16, for
this user's account. For example, on <redacted>.72.16:

.. code::

   $ ls -lah /srv/node/disk9/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/
   total 1.0M
   drwxrwxrwx 2 swift swift   98 2012-02-23 14:49 .
   drwxrwxrwx 3 swift swift   45 2012-02-03 23:28 ..
   -rw------- 1 swift swift  15K 2012-02-23 14:49 1846d99185f8a0edaf65cfbf37439696.db
   -rw-rw-rw- 1 swift swift    0 2012-02-23 14:49 1846d99185f8a0edaf65cfbf37439696.db.pending

So this user's account db, an sqlite db, is present. Use sqlite3 to
check out the account:

.. code::

   $ sudo cp /srv/node/disk9/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/1846d99185f8a0edaf65cfbf37439696.db /tmp
   $ sudo sqlite3 /tmp/1846d99185f8a0edaf65cfbf37439696.db
   sqlite> .mode line
   sqlite> select * from account_stat;
   account = AUTH_redacted-4962-4692-98fb-52ddda82a5af
   created_at = 1328311738.42190
   put_timestamp = 1330000873.61411
   delete_timestamp = 1330001026.00514
   container_count = 0
   object_count = 0
   bytes_used = 0
   hash = eb7e5d0ea3544d9def940b19114e8b43
   id = 2de8c8a8-cef9-4a94-a421-2f845802fe90
   status = DELETED
   status_changed_at = 1330001026.00514
   metadata =

.. note::

   The status is ``DELETED``. So this account was deleted. This explains
   why the GET operations are returning 404, Not Found. Check the account
   delete date/time:

.. code::

   $ python
   >>> import time
   >>> time.ctime(1330001026.00514)
   'Thu Feb 23 12:43:46 2012'

Next, try to find the ``DELETE`` operation for this account in the proxy
server logs:

.. code::

   $ PDSH_SSH_ARGS_APPEND="-o StrictHostKeyChecking=no" pdsh -l <yourusername> -R ssh \
     -w <redacted>.68.[4-11,132-139],<redacted>.132.[4-11,132-139] \
     'sudo bzgrep AUTH_redacted-4962-4692-98fb-52ddda82a5af /var/log/swift/proxy.log* | grep -w DELETE | awk "{print \$3,\$10,\$12}"' | dshbak -c
   .
   .
   Feb 23 12:43:46 sw-aw2az2-proxy001 proxy-server 15.203.233.76 <redacted>.66.7 23/Feb/2012/12/43/46 DELETE /v1.0/AUTH_redacted-4962-4692-98fb-52ddda82a5af/ HTTP/1.0 204 - Apache-HttpClient/4.1.2%20%28java%201.5%29 <REDACTED>_4f458ee4e4b02a869c3aad02 - - - tx4471188b0b87406899973d297c55ab53 - 0.0086

From this you can see the operation that resulted in the account being deleted.

Procedure: Deleting objects
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Simple case - deleting small number of objects and containers
--------------------------------------------------------------

.. note::

   ``swift-direct`` is specific to the Hewlett Packard Enterprise Helion Public Cloud.
   Use ``swiftly`` as an alternative.

.. note::

   Object and container names are in UTF8. ``swift-direct`` accepts UTF8
   directly, not URL-encoded UTF8 (the REST API expects UTF8 that is then
   URL-encoded). In practice, cutting and pasting foreign-language strings
   into a terminal window will produce the right result.

Hint: Use the ``head`` command before any destructive commands.

To delete a small number of objects, log into any proxy node and proceed
as follows:

Examine the object in question:

.. code::

   $ sudo -u swift /opt/hp/swift/bin/swift-direct head 132345678912345 container_name obj_name

If ``X-Object-Manifest`` or ``X-Static-Large-Object`` is set, then this is
a manifest object and the segment objects may be in another container.

If the ``X-Object-Manifest`` attribute is set, the object is a DLO (dynamic
large object) and you need to find the names of the segment objects. For
example, if ``X-Object-Manifest`` is ``container2/seg-blah``, list the
contents of the container ``container2`` as follows:

.. code::

   $ sudo -u swift /opt/hp/swift/bin/swift-direct show 132345678912345 container2

Pick out the objects whose names start with ``seg-blah``.
Delete the segment objects as follows:

.. code::

   $ sudo -u swift /opt/hp/swift/bin/swift-direct delete 132345678912345 container2 seg-blah01
   $ sudo -u swift /opt/hp/swift/bin/swift-direct delete 132345678912345 container2 seg-blah02
   etc

If ``X-Static-Large-Object`` is set, the object is an SLO (static large
object) and you need to read its manifest contents. Do this by:

- Using swift-get-nodes to get the details of the object's location.
- Changing the ``-X HEAD`` to ``-X GET`` and running ``curl`` against one copy
  (see the sketch after this list).
- This returns a json body listing the containers and object names of the segments.
- Deleting the segment objects as described above for DLO segments.
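
A sketch of that manifest-fetch step; the server, device and partition values
are placeholders that would come from the ``swift-get-nodes`` output for the
manifest object, and ``python -m json.tool`` is used only to pretty-print the
returned JSON:

.. code::

   $ curl -s -X GET "http://<object-server-ip>:6000/<device>/<partition>/<account>/<container>/<manifest-object>" | python -m json.tool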

Once the segments are deleted, you can delete the object using
``swift-direct`` as described above.

Finally, use ``swift-direct`` to delete the container.

Procedure: Decommissioning swift nodes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Should Swift nodes need to be decommissioned (for example, where they are
being re-purposed), it is very important to follow these steps:

#. In the case of object servers, follow the procedure for removing
   the node from the rings (a sketch is shown after this list).
#. In the case of swift proxy servers, have the network team remove
   the node from the load balancers.
#. Open a network ticket to have the node removed from network
   firewalls.
#. Make sure that you remove the ``/etc/swift`` directory and everything in it.
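
For the object-server case, the ring change in step 1 often looks something
like the sketch below; the builder path and the search value (zone, IP, port,
device) are placeholders, and you should check the ``swift-ring-builder`` help
for the exact search-value syntax in your Swift version. Many operators prefer
to reduce device weights gradually before the final remove.

.. code::

   $ sudo swift-ring-builder /etc/swift/object.builder remove z<zone>-<ip>:<port>/<device>
   $ sudo swift-ring-builder /etc/swift/object.builder rebalance
   # Repeat for the container and account builder files, then distribute the
   # updated ring files to all nodes.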