Operational procedures guide

This is the operational procedures guide that HPE used to operate and monitor
its public Swift systems. It has been made publicly available.

Change-Id: Iefb484893056d28beb69265d99ba30c3c84add2b

parent 30624a866a
commit 3c61ab4678

@@ -86,6 +86,7 @@ Administrator Documentation

     admin_guide
     replication_network
     logs
+    ops_runbook/index

 Object Storage v1 REST API Documentation
 ========================================

doc/source/ops_runbook/diagnose.rst (new file, 1031 lines)
File diff suppressed because it is too large.

doc/source/ops_runbook/general.rst (new file, 36 lines)

@@ -0,0 +1,36 @@

==================
General Procedures
==================

Getting a swift account's stats
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. note::

   ``swift-direct`` is specific to the HPE Helion Public Cloud. Use
   ``swiftly`` as an alternative; the following is simply an example.

This procedure describes how you determine the swift usage for a given
swift account, that is, the number of containers, the number of objects
and the total bytes used. To do this you will need the project ID.

Log onto one of the swift proxy servers.

Use swift-direct to show this account's usage:

.. code::

   $ sudo -u swift /opt/hp/swift/bin/swift-direct show AUTH_redacted-9a11-45f8-aa1c-9e7b1c7904c8
   Status: 200
   Content-Length: 0
   Accept-Ranges: bytes
   X-Timestamp: 1379698586.88364
   X-Account-Bytes-Used: 67440225625994
   X-Account-Container-Count: 1
   Content-Type: text/plain; charset=utf-8
   X-Account-Object-Count: 8436776
   Status: 200
   name: my_container count: 8436776 bytes: 67440225625994

This account has 1 container. That container has 8436776 objects. The
total bytes used is 67440225625994.
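
If ``swift-direct`` is not available in your environment, a similar check can
be made directly against one of the account servers. The sketch below is only
an illustration: it uses ``swift-get-nodes`` (shipped with Swift) to locate
the account, then a ``curl`` HEAD against one of the primary account servers
listed in its output; the ring path, port and account name are placeholders.

.. code::

   # Locate the account database for the project (placeholder account name).
   $ sudo swift-get-nodes /etc/swift/account.ring.gz AUTH_<project-id>

   # HEAD one of the primary account servers from the output above; the
   # X-Account-* headers carry the same container/object/byte counts.
   $ curl -I -XHEAD "http://<account-server-ip>:6002/<device>/<partition>/AUTH_<project-id>"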

doc/source/ops_runbook/index.rst (new file, 79 lines)

@@ -0,0 +1,79 @@

=================
Swift Ops Runbook
=================

This document contains operational procedures that Hewlett Packard
Enterprise (HPE) uses to operate and monitor the Swift system within the
HPE Helion Public Cloud. This document is an excerpt of a larger
product-specific handbook. As such, the material may appear incomplete.
The suggestions and recommendations made in this document are for our
particular environment, and may not be suitable for your environment or
situation. We make no representations concerning the accuracy, adequacy,
completeness or suitability of the information, suggestions or
recommendations. This document is provided for reference only. We are
not responsible for your use of any information, suggestions or
recommendations contained herein.

This document also contains references to certain tools that we use to
operate the Swift system within the HPE Helion Public Cloud.
Descriptions of these tools are provided for reference only, as the
tools themselves are not publicly available at this time.

- ``swift-direct``: This is similar to the ``swiftly`` tool.


.. toctree::
   :maxdepth: 2

   general.rst
   diagnose.rst
   procedures.rst
   maintenance.rst
   troubleshooting.rst

Is the system up?
~~~~~~~~~~~~~~~~~

If you have a report that Swift is down, perform the following basic checks:

#. Run swift functional tests.

#. From a server in your data center, use ``curl`` to check
   ``/healthcheck`` (see the example after this list).

#. If you have a monitoring system, check your monitoring system.

#. Check your hardware load balancer infrastructure.

#. Run swift-recon on a proxy node.
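
For the ``curl`` check, something like the following is usually enough (the
address is a placeholder for a proxy or its VIP); the healthcheck middleware
returns ``200 OK`` when the proxy is healthy:

.. code::

   $ curl -i http://<proxy-or-vip-address>/healthcheck
   HTTP/1.1 200 OK
   ...
   OK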

Run swift functional tests
--------------------------

We recommend that you set up your functional tests against your production
system.

A script for running the functional tests is located in ``swift/.functests``.


External monitoring
-------------------

- We use pingdom.com to monitor the external Swift API. We suggest the
  following:

  - Do a GET on ``/healthcheck``

  - Create a container, make it public (x-container-read:
    .r\*,.rlistings), create a small file in the container; do a GET
    on the object (see the sketch after this list)
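
A minimal sketch of that container/object probe, assuming the ``swift``
command-line client from ``python-swiftclient`` and placeholder names; adapt
the ACL value, names and endpoint to your own tooling:

.. code::

   # Make a monitoring container world-readable, upload a tiny object,
   # then fetch it anonymously the way the external monitor would.
   $ swift post -r '.r:*,.rlistings' monitor_container
   $ echo "ping" > ping.txt
   $ swift upload monitor_container ping.txt
   $ curl -s http://<swift-endpoint>/v1/AUTH_<project-id>/monitor_container/ping.txt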

Reference information
~~~~~~~~~~~~~~~~~~~~~

Reference: Swift startup/shutdown
---------------------------------

- Use reload, not stop/start/restart.

- Try to roll sets of servers (especially proxy) in groups of less
  than 20% of your servers.
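
For example, to reload the proxy services on one node (a sketch; the same
pattern applies to the other Swift services):

.. code::

   $ sudo swift-init proxy reload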

doc/source/ops_runbook/maintenance.rst (new file, 322 lines)

@@ -0,0 +1,322 @@

==================
Server maintenance
==================

General assumptions
~~~~~~~~~~~~~~~~~~~

- It is assumed that anyone attempting to replace hardware components
  will have already read and understood the appropriate maintenance and
  service guides.

- It is assumed that where servers need to be taken off-line for
  hardware replacement, this will be done in series, bringing each
  server back on-line before taking the next off-line.

- It is assumed that the operations directed procedure will be used for
  identifying hardware for replacement.

Assessing the health of swift
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can run the swift-recon tool on a Swift proxy node to get a quick
check of how Swift is doing. Please note that the numbers below are
necessarily somewhat subjective. Sometimes parameters for which we
say 'low values are good' will have pretty high values for a time. Often
if you wait a while things get better.

For example:

.. code::

   sudo swift-recon -rla
   ===============================================================================
   [2012-03-10 12:57:21] Checking async pendings on 384 hosts...
   Async stats: low: 0, high: 1, avg: 0, total: 1
   ===============================================================================

   [2012-03-10 12:57:22] Checking replication times on 384 hosts...
   [Replication Times] shortest: 1.4113877813, longest: 36.8293570836, avg: 4.86278064749
   ===============================================================================

   [2012-03-10 12:57:22] Checking load avg's on 384 hosts...
   [5m load average] lowest: 2.22, highest: 9.5, avg: 4.59578125
   [15m load average] lowest: 2.36, highest: 9.45, avg: 4.62622395833
   [1m load average] lowest: 1.84, highest: 9.57, avg: 4.5696875
   ===============================================================================

In the example above we ask for information on replication times (-r),
load averages (-l) and async pendings (-a). This is a healthy Swift
system. Rules-of-thumb for 'good' recon output are:

- Nodes that respond are up and running Swift. If all nodes respond,
  that is a good sign. But some nodes may time out. For example:

  .. code::

     -> [http://<redacted>.29:6000/recon/load:] <urlopen error [Errno 111] ECONNREFUSED>
     -> [http://<redacted>.31:6000/recon/load:] <urlopen error timed out>

  That could be okay or could require investigation.

- Low values (say < 10 for high and average) for async pendings are
  good. Higher values occur when disks are down and/or when the system
  is heavily loaded. Many simultaneous PUTs to the same container can
  drive async pendings up. This may be normal, and may resolve itself
  after a while. If it persists, one way to track down the problem is
  to find a node with high async pendings (with ``swift-recon -av | sort
  -n -k4``), then check its Swift logs. Often async pendings are high
  because a node cannot write to a container on another node. Often
  this is because the node or disk is offline or bad. This may be okay
  if we know about it.

- Low values for replication times are good. These values rise when new
  rings are pushed, and when nodes and devices are brought back on
  line.

- Our 'high' load average values are typically in the 9-15 range. If
  they are a lot bigger it is worth having a look at the systems
  pushing the average up. Run ``swift-recon -av`` to get the individual
  averages. To sort the entries with the highest at the end,
  run ``swift-recon -av | sort -n -k4``.

For comparison here is the recon output for the same system above when
two entire racks of Swift are down:

.. code::

   [2012-03-10 16:56:33] Checking async pendings on 384 hosts...
   -> http://<redacted>.22:6000/recon/async: <urlopen error timed out>
   -> http://<redacted>.18:6000/recon/async: <urlopen error timed out>
   -> http://<redacted>.16:6000/recon/async: <urlopen error timed out>
   -> http://<redacted>.13:6000/recon/async: <urlopen error timed out>
   -> http://<redacted>.30:6000/recon/async: <urlopen error timed out>
   -> http://<redacted>.6:6000/recon/async: <urlopen error timed out>
   .........
   -> http://<redacted>.5:6000/recon/async: <urlopen error timed out>
   -> http://<redacted>.15:6000/recon/async: <urlopen error timed out>
   -> http://<redacted>.9:6000/recon/async: <urlopen error timed out>
   -> http://<redacted>.27:6000/recon/async: <urlopen error timed out>
   -> http://<redacted>.4:6000/recon/async: <urlopen error timed out>
   -> http://<redacted>.8:6000/recon/async: <urlopen error timed out>
   Async stats: low: 243, high: 659, avg: 413, total: 132275
   ===============================================================================
   [2012-03-10 16:57:48] Checking replication times on 384 hosts...
   -> http://<redacted>.22:6000/recon/replication: <urlopen error timed out>
   -> http://<redacted>.18:6000/recon/replication: <urlopen error timed out>
   -> http://<redacted>.16:6000/recon/replication: <urlopen error timed out>
   -> http://<redacted>.13:6000/recon/replication: <urlopen error timed out>
   -> http://<redacted>.30:6000/recon/replication: <urlopen error timed out>
   -> http://<redacted>.6:6000/recon/replication: <urlopen error timed out>
   ............
   -> http://<redacted>.5:6000/recon/replication: <urlopen error timed out>
   -> http://<redacted>.15:6000/recon/replication: <urlopen error timed out>
   -> http://<redacted>.9:6000/recon/replication: <urlopen error timed out>
   -> http://<redacted>.27:6000/recon/replication: <urlopen error timed out>
   -> http://<redacted>.4:6000/recon/replication: <urlopen error timed out>
   -> http://<redacted>.8:6000/recon/replication: <urlopen error timed out>
   [Replication Times] shortest: 1.38144306739, longest: 112.620954418, avg: 10.2859475361
   ===============================================================================
   [2012-03-10 16:59:03] Checking load avg's on 384 hosts...
   -> http://<redacted>.22:6000/recon/load: <urlopen error timed out>
   -> http://<redacted>.18:6000/recon/load: <urlopen error timed out>
   -> http://<redacted>.16:6000/recon/load: <urlopen error timed out>
   -> http://<redacted>.13:6000/recon/load: <urlopen error timed out>
   -> http://<redacted>.30:6000/recon/load: <urlopen error timed out>
   -> http://<redacted>.6:6000/recon/load: <urlopen error timed out>
   ............
   -> http://<redacted>.15:6000/recon/load: <urlopen error timed out>
   -> http://<redacted>.9:6000/recon/load: <urlopen error timed out>
   -> http://<redacted>.27:6000/recon/load: <urlopen error timed out>
   -> http://<redacted>.4:6000/recon/load: <urlopen error timed out>
   -> http://<redacted>.8:6000/recon/load: <urlopen error timed out>
   [5m load average] lowest: 1.71, highest: 4.91, avg: 2.486375
   [15m load average] lowest: 1.79, highest: 5.04, avg: 2.506125
   [1m load average] lowest: 1.46, highest: 4.55, avg: 2.4929375
   ===============================================================================

.. note::

   The replication times and load averages are within reasonable
   parameters, even with 80 object stores down. Async pendings, however,
   are quite high. This is due to the fact that the containers on the
   servers which are down cannot be updated. When those servers come back
   up, async pendings should drop. If async pendings were at this level
   without an explanation, we would have a problem.

Recon examples
~~~~~~~~~~~~~~

Here is an example of noting and tracking down a problem with recon.

Running recon shows some async pendings:

.. code::

   bob@notso:~/swift-1.4.4/swift$ ssh -q <redacted>.132.7 sudo swift-recon -alr
   ===============================================================================
   [2012-03-14 17:25:55] Checking async pendings on 384 hosts...
   Async stats: low: 0, high: 23, avg: 8, total: 3356
   ===============================================================================
   [2012-03-14 17:25:55] Checking replication times on 384 hosts...
   [Replication Times] shortest: 1.49303831657, longest: 39.6982825994, avg: 4.2418222066
   ===============================================================================
   [2012-03-14 17:25:56] Checking load avg's on 384 hosts...
   [5m load average] lowest: 2.35, highest: 8.88, avg: 4.45911458333
   [15m load average] lowest: 2.41, highest: 9.11, avg: 4.504765625
   [1m load average] lowest: 1.95, highest: 8.56, avg: 4.40588541667
   ===============================================================================

Why? Running recon again with ``-av`` (not shown here) tells us that
the node with the highest async pendings (23) is <redacted>.72.61.
Looking at the log files on <redacted>.72.61 we see:

.. code::

   souzab@<redacted>:~$ sudo tail -f /var/log/swift/background.log | grep -i ERROR
   Mar 14 17:28:06 <redacted> container-replicator ERROR Remote drive not mounted
   {'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.119', 'id': 5481, 'meta': '', 'device': 'disk6', 'port': 6001}
   Mar 14 17:28:06 <redacted> container-replicator ERROR Remote drive not mounted
   {'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.119', 'id': 5481, 'meta': '', 'device': 'disk6', 'port': 6001}
   Mar 14 17:28:09 <redacted> container-replicator ERROR Remote drive not mounted
   {'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6001}
   Mar 14 17:28:11 <redacted> container-replicator ERROR Remote drive not mounted
   {'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6001}
   Mar 14 17:28:13 <redacted> container-replicator ERROR Remote drive not mounted
   {'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.119', 'id': 5481, 'meta': '', 'device': 'disk6', 'port': 6001}
   Mar 14 17:28:13 <redacted> container-replicator ERROR Remote drive not mounted
   {'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.119', 'id': 5481, 'meta': '', 'device': 'disk6', 'port': 6001}
   Mar 14 17:28:15 <redacted> container-replicator ERROR Remote drive not mounted
   {'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6001}
   Mar 14 17:28:15 <redacted> container-replicator ERROR Remote drive not mounted
   {'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6001}
   Mar 14 17:28:19 <redacted> container-replicator ERROR Remote drive not mounted
   {'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6001}
   Mar 14 17:28:19 <redacted> container-replicator ERROR Remote drive not mounted
   {'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6001}
   Mar 14 17:28:20 <redacted> container-replicator ERROR Remote drive not mounted
   {'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.119', 'id': 5481, 'meta': '', 'device': 'disk6', 'port': 6001}
   Mar 14 17:28:21 <redacted> container-replicator ERROR Remote drive not mounted
   {'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6001}
   Mar 14 17:28:21 <redacted> container-replicator ERROR Remote drive not mounted
   {'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6001}
   Mar 14 17:28:22 <redacted> container-replicator ERROR Remote drive not mounted
   {'zone': 5, 'weight': 1952.0, 'ip': '<redacted>.204.20', 'id': 2311, 'meta': '', 'device': 'disk5', 'port': 6001}

That is why this node has a lot of async pendings: a bunch of disks that
are not mounted on <redacted> and <redacted>. There may be other issues,
but clearing this up will likely drop the async pendings a fair bit, as
other nodes will be having the same problem.

Assessing the availability risk when multiple storage servers are down
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. note::

   This procedure will tell you if you have a problem. In practice,
   however, you will find that you do not need to use it frequently.

If three storage nodes (or, more precisely, three disks on three
different storage nodes) are down, there is a small but nonzero
probability that user objects, containers, or accounts will not be
available.

Procedure
---------

.. note::

   Swift has three rings: one each for objects, containers and accounts.
   This procedure should be run three times, each time specifying the
   appropriate ``*.builder`` file.

#. Determine whether all three nodes are in different Swift zones by
   running the ring builder on a proxy node to determine which zones
   the storage nodes are in. For example:

   .. code::

      % sudo swift-ring-builder /etc/swift/object.builder
      /etc/swift/object.builder, build version 1467
      2097152 partitions, 3 replicas, 5 zones, 1320 devices, 0.02 balance
      The minimum number of hours before a partition can be reassigned is 24
      Devices:  id  zone  ip address    port  name   weight   partitions  balance  meta
                 0     1  <redacted>.4  6000  disk0  1708.00        4259    -0.00
                 1     1  <redacted>.4  6000  disk1  1708.00        4260     0.02
                 2     1  <redacted>.4  6000  disk2  1952.00        4868     0.01
                 3     1  <redacted>.4  6000  disk3  1952.00        4868     0.01
                 4     1  <redacted>.4  6000  disk4  1952.00        4867    -0.01

#. Here, node <redacted>.4 is in zone 1. If two or more of the three
   nodes under consideration are in the same Swift zone, they do not
   have any ring partitions in common; there is little/no data
   availability risk if all three nodes are down.

#. If the nodes are in three distinct Swift zones, it is necessary to
   check whether the nodes have ring partitions in common. Run
   ``swift-ring-builder`` again, this time with the ``list_parts`` option,
   and specify the nodes under consideration. For example (all on one line):

   .. code::

      % sudo swift-ring-builder /etc/swift/object.builder list_parts <redacted>.8 <redacted>.15 <redacted>.72.2
      Partition   Matches
      91          2
      729         2
      3754        2
      3769        2
      3947        2
      5818        2
      7918        2
      8733        2
      9509        2
      10233       2

#. The ``list_parts`` option to the ring builder indicates how many ring
   partitions the nodes have in common. If, as in this case, the
   first entry in the list has a 'Matches' column of 2 or less, there
   is no data availability risk if all three nodes are down.

#. If the 'Matches' column has entries equal to 3, there is some data
   availability risk if all three nodes are down. The risk is generally
   small, and is proportional to the number of entries that have a 3 in
   the Matches column. For example:

   .. code::

      Partition   Matches
      26865       3
      362367      3
      745940      3
      778715      3
      797559      3
      820295      3
      822118      3
      839603      3
      852332      3
      855965      3
      858016      3

#. A quick way to count the number of rows with 3 matches is:

   .. code::

      % sudo swift-ring-builder /etc/swift/object.builder list_parts <redacted>.8 <redacted>.15 <redacted>.72.2 | grep "3$" | wc -l

      30

#. In this case the nodes have 30 out of a total of 2097152 partitions
   in common; about 0.001%. In this case the risk is small, but nonzero.
   Recall that a partition is simply a portion of the ring mapping
   space, not actual data. So having partitions in common is a necessary
   but not sufficient condition for data unavailability. A quick way to
   compute the percentage is shown after this list.
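
As a quick sanity check of that percentage (plain arithmetic, here using
``bc``):

.. code::

   $ echo "scale=6; 30 / 2097152 * 100" | bc
   .001400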

.. note::

   We should not bring down a node for repair if it shows Matches entries
   of 3 with other nodes that are also down.

If three nodes that have 3 partitions in common are all down, there is
a nonzero probability that data are unavailable and we should work to
bring some or all of the nodes up ASAP.

doc/source/ops_runbook/procedures.rst (new file, 367 lines)

@@ -0,0 +1,367 @@

=================================
Software configuration procedures
=================================

Fix broken GPT table (broken disk partition)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- If a GPT table is broken, a message like the following is observed when
  the following command is run:

  .. code::

     $ sudo parted -l

  .. code::

     ...
     Error: The backup GPT table is corrupt, but the primary appears OK, so that will
     be used.
     OK/Cancel?

#. To fix this, first install the ``gdisk`` program:

   .. code::

      $ sudo aptitude install gdisk

#. Run ``gdisk`` for the particular drive with the damaged partition
   (``/dev/sd#`` is a placeholder for the device):

   .. code::

      $ sudo gdisk /dev/sd#
      GPT fdisk (gdisk) version 0.6.14

      Caution: invalid backup GPT header, but valid main header; regenerating
      backup header from main header.

      Warning! One or more CRCs don't match. You should repair the disk!

      Partition table scan:
        MBR: protective
        BSD: not present
        APM: not present
        GPT: damaged

      /dev/sd#
      *****************************************************************************
      Caution: Found protective or hybrid MBR and corrupt GPT. Using GPT, but disk
      verification and recovery are STRONGLY recommended.
      *****************************************************************************

#. On the command prompt, type ``r`` (recovery and transformation
   options), followed by ``d`` (use main GPT header), ``v`` (verify disk)
   and finally ``w`` (write table to disk and exit). You will also need to
   enter ``Y`` when prompted in order to confirm actions.

   .. code::

      Command (? for help): r

      Recovery/transformation command (? for help): d

      Recovery/transformation command (? for help): v

      Caution: The CRC for the backup partition table is invalid. This table may
      be corrupt. This program will automatically create a new backup partition
      table when you save your partitions.

      Caution: Partition 1 doesn't begin on a 8-sector boundary. This may
      result in degraded performance on some modern (2009 and later) hard disks.

      Caution: Partition 2 doesn't begin on a 8-sector boundary. This may
      result in degraded performance on some modern (2009 and later) hard disks.

      Caution: Partition 3 doesn't begin on a 8-sector boundary. This may
      result in degraded performance on some modern (2009 and later) hard disks.

      Identified 1 problems!

      Recovery/transformation command (? for help): w

      Final checks complete. About to write GPT data. THIS WILL OVERWRITE EXISTING
      PARTITIONS!!

      Do you want to proceed, possibly destroying your data? (Y/N): Y

      OK; writing new GUID partition table (GPT).
      The operation has completed successfully.

#. Running the following command should now show that the partition is
   recovered and healthy again:

   .. code::

      $ sudo parted /dev/sd#

#. Finally, uninstall ``gdisk`` from the node:

   .. code::

      $ sudo aptitude remove gdisk

Procedure: Fix broken XFS filesystem
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

#. A filesystem may be corrupt or broken if the following output is
   observed when checking its label:

   .. code::

      $ sudo xfs_admin -l /dev/sd#
        cache_node_purge: refcount was 1, not zero (node=0x25d5ee0)
        xfs_admin: cannot read root inode (117)
        cache_node_purge: refcount was 1, not zero (node=0x25d92b0)
        xfs_admin: cannot read realtime bitmap inode (117)
        bad sb magic # 0 in AG 1
        failed to read label in AG 1

#. Run the following commands to remove the broken/corrupt filesystem and
   replace it (this example uses the filesystem ``/dev/sdb2``). First,
   replace the partition:

   .. code::

      $ sudo parted
      GNU Parted 2.3
      Using /dev/sda
      Welcome to GNU Parted! Type 'help' to view a list of commands.
      (parted) select /dev/sdb
      Using /dev/sdb
      (parted) p
      Model: HP LOGICAL VOLUME (scsi)
      Disk /dev/sdb: 2000GB
      Sector size (logical/physical): 512B/512B
      Partition Table: gpt

      Number  Start   End     Size    File system  Name                        Flags
       1      17.4kB  1024MB  1024MB  ext3                                     boot
       2      1024MB  1751GB  1750GB  xfs          sw-aw2az1-object045-disk1
       3      1751GB  2000GB  249GB                                            lvm

      (parted) rm 2
      (parted) mkpart primary 2 -1
      Warning: You requested a partition from 2000kB to 2000GB.
      The closest location we can manage is 1024MB to 1751GB.
      Is this still acceptable to you?
      Yes/No? Yes
      Warning: The resulting partition is not properly aligned for best performance.
      Ignore/Cancel? Ignore
      (parted) p
      Model: HP LOGICAL VOLUME (scsi)
      Disk /dev/sdb: 2000GB
      Sector size (logical/physical): 512B/512B
      Partition Table: gpt

      Number  Start   End     Size    File system  Name     Flags
       1      17.4kB  1024MB  1024MB  ext3                  boot
       2      1024MB  1751GB  1750GB  xfs          primary
       3      1751GB  2000GB  249GB                         lvm

      (parted) quit

#. The next step is to scrub the filesystem and format it:

   .. code::

      $ sudo dd if=/dev/zero of=/dev/sdb2 bs=$((1024*1024)) count=1
      1+0 records in
      1+0 records out
      1048576 bytes (1.0 MB) copied, 0.00480617 s, 218 MB/s
      $ sudo /sbin/mkfs.xfs -f -i size=1024 /dev/sdb2
      meta-data=/dev/sdb2          isize=1024   agcount=4, agsize=106811524 blks
               =                   sectsz=512   attr=2, projid32bit=0
      data     =                   bsize=4096   blocks=427246093, imaxpct=5
               =                   sunit=0      swidth=0 blks
      naming   =version 2          bsize=4096   ascii-ci=0
      log      =internal log       bsize=4096   blocks=208616, version=2
               =                   sectsz=512   sunit=0 blks, lazy-count=1
      realtime =none               extsz=4096   blocks=0, rtextents=0

#. You should now label and mount your filesystem. A sketch of this step
   follows; the label shown is illustrative only (use your own naming
   convention) and the mount point is a placeholder.
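
   .. code::

      # Re-apply a filesystem label (example name only) and mount the disk
      # at its usual mount point (placeholder; assumes an fstab entry exists).
      $ sudo xfs_admin -L sw-aw2az1-object045-disk1 /dev/sdb2
      $ sudo mount /srv/node/<disk>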

#. You can now check whether the filesystem is mounted using the command:

   .. code::

      $ mount

Procedure: Checking if an account is okay
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. note::

   ``swift-direct`` is only available in the HPE Helion Public Cloud.
   Use ``swiftly`` as an alternative.

If you have a tenant ID, you can check that the account is okay as follows
from a proxy.

.. code::

   $ sudo -u swift /opt/hp/swift/bin/swift-direct show <Api-Auth-Hash-or-TenantId>

The response will either be similar to a swift list of the account
containers, or an error indicating that the resource could not be found.

In the latter case you can establish whether a backend database exists for
the tenant ID by running the following on a proxy:

.. code::

   $ sudo -u swift swift-get-nodes /etc/swift/account.ring.gz <Api-Auth-Hash-or-TenantId>

The response will list ssh commands that will list the replicated
account databases, if they exist.

Procedure: Revive a deleted account
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Swift accounts are normally not recreated. If a tenant unsubscribes from
Swift, the account is deleted. To re-subscribe to Swift, you can create
a new tenant (new tenant ID), and subscribe to Swift. This creates a
new Swift account with the new tenant ID.

However, until the unsubscribe/new-tenant process is supported, you may
hit a situation where a Swift account is deleted and the user is locked
out of Swift.

Deleting the account database files
-----------------------------------

Here is one possible solution. The containers and objects may be lost
forever. The solution is to delete the account database files and
re-create the account. This may only be done once the containers and
objects are completely deleted. This process is untested, but could
work as follows:

#. Use swift-get-nodes to locate the account's database file (on three
   servers).

#. Rename the database files (on three servers).

#. Use ``swiftly`` to create the account (using the original name).

Renaming account database so it can be revived
----------------------------------------------

Get the locations of the database files that hold the account data.

.. code::

   sudo swift-get-nodes /etc/swift/account.ring.gz AUTH_redacted-1856-44ae-97db-31242f7ad7a1

   Account    AUTH_redacted-1856-44ae-97db-31242f7ad7a1
   Container  None
   Object     None

   Partition  18914
   Hash       93c41ef56dd69173a9524193ab813e78

   Server:Port Device  15.184.9.126:6002 disk7
   Server:Port Device  15.184.9.94:6002 disk11
   Server:Port Device  15.184.9.103:6002 disk10
   Server:Port Device  15.184.9.80:6002 disk2   [Handoff]
   Server:Port Device  15.184.9.120:6002 disk2  [Handoff]
   Server:Port Device  15.184.9.98:6002 disk2   [Handoff]

   curl -I -XHEAD "http://15.184.9.126:6002/disk7/18914/AUTH_redacted-1856-44ae-97db-31242f7ad7a1"
   curl -I -XHEAD "http://15.184.9.94:6002/disk11/18914/AUTH_redacted-1856-44ae-97db-31242f7ad7a1"
   curl -I -XHEAD "http://15.184.9.103:6002/disk10/18914/AUTH_redacted-1856-44ae-97db-31242f7ad7a1"
   curl -I -XHEAD "http://15.184.9.80:6002/disk2/18914/AUTH_redacted-1856-44ae-97db-31242f7ad7a1"   # [Handoff]
   curl -I -XHEAD "http://15.184.9.120:6002/disk2/18914/AUTH_redacted-1856-44ae-97db-31242f7ad7a1"  # [Handoff]
   curl -I -XHEAD "http://15.184.9.98:6002/disk2/18914/AUTH_redacted-1856-44ae-97db-31242f7ad7a1"   # [Handoff]

   ssh 15.184.9.126 "ls -lah /srv/node/disk7/accounts/18914/e78/93c41ef56dd69173a9524193ab813e78/"
   ssh 15.184.9.94 "ls -lah /srv/node/disk11/accounts/18914/e78/93c41ef56dd69173a9524193ab813e78/"
   ssh 15.184.9.103 "ls -lah /srv/node/disk10/accounts/18914/e78/93c41ef56dd69173a9524193ab813e78/"
   ssh 15.184.9.80 "ls -lah /srv/node/disk2/accounts/18914/e78/93c41ef56dd69173a9524193ab813e78/"   # [Handoff]
   ssh 15.184.9.120 "ls -lah /srv/node/disk2/accounts/18914/e78/93c41ef56dd69173a9524193ab813e78/"  # [Handoff]
   ssh 15.184.9.98 "ls -lah /srv/node/disk2/accounts/18914/e78/93c41ef56dd69173a9524193ab813e78/"   # [Handoff]

Check that the handoff nodes do not have account databases:

.. code::

   $ ssh 15.184.9.80 "ls -lah /srv/node/disk2/accounts/18914/e78/93c41ef56dd69173a9524193ab813e78/"
   ls: cannot access /srv/node/disk2/accounts/18914/e78/93c41ef56dd69173a9524193ab813e78/: No such file or directory

If the handoff node has a database, wait for rebalancing to occur.
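
The rename itself is not shown in the original handbook excerpt. A sketch of
what it could look like on one primary server, reusing the paths from the
``swift-get-nodes`` output above (the ``.bak`` suffix is arbitrary); repeat on
the other two primary servers:

.. code::

   # On each primary account server (here 15.184.9.126, disk7), rename the
   # account database files out of the way.
   $ cd /srv/node/disk7/accounts/18914/e78/93c41ef56dd69173a9524193ab813e78/
   $ sudo mv 93c41ef56dd69173a9524193ab813e78.db 93c41ef56dd69173a9524193ab813e78.db.bak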

Procedure: Temporarily stop load balancers from directing traffic to a proxy server
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can stop the load balancers sending requests to a proxy server as
follows. This can be useful when a proxy is misbehaving but you need
Swift running to help diagnose the problem. By removing it from the load
balancers, customers are not impacted by the misbehaving proxy.

#. Ensure that in ``proxy-server.conf`` the ``disable_path`` variable is set to
   ``/etc/swift/disabled-by-file``.

#. Log onto the proxy node.

#. Shut down Swift as follows:

   .. code::

      sudo swift-init proxy shutdown

   .. note::

      Shutdown, not stop.

#. Create the ``/etc/swift/disabled-by-file`` file. For example:

   .. code::

      sudo touch /etc/swift/disabled-by-file

#. Optionally, restart Swift:

   .. code::

      sudo swift-init proxy start

This works because the healthcheck middleware looks for this file. If it
finds it, it returns a 503 error instead of 200/OK. This means the load
balancer should stop sending traffic to the proxy.

``/healthcheck`` will report ``FAIL: disabled by file`` if the
``disabled-by-file`` file exists.
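
A quick way to confirm, sketched below with a placeholder address (the body
text comes from the healthcheck middleware; the exact headers will vary):

.. code::

   $ curl -i http://<proxy-ip-address>/healthcheck
   HTTP/1.1 503 Service Unavailable
   ...
   FAIL: disabled by file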

Procedure: Ad-Hoc disk performance test
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can get an idea of whether a disk drive is performing well as follows:

.. code::

   sudo dd bs=1M count=256 if=/dev/zero conv=fdatasync of=/srv/node/disk11/remember-to-delete-this-later

You can expect ~600MB/sec. If you get a low number, repeat the test many
times, as Swift itself may also be reading or writing to the disk, hence
giving a lower number.
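
When you are finished, remember to remove the test file, for example:

.. code::

   sudo rm /srv/node/disk11/remember-to-delete-this-later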

doc/source/ops_runbook/sec-furtherdiagnose.rst (new file, 177 lines)

@@ -0,0 +1,177 @@

==============================
Further issues and resolutions
==============================

.. note::

   The urgency level in each **Action** column indicates whether you
   need to take immediate action, or whether the problem can be worked
   on during business hours.

.. list-table::
   :widths: 33 33 33
   :header-rows: 1

   * - **Scenario**
     - **Description**
     - **Action**
   * - ``/healthcheck`` latency is high.
     - The ``/healthcheck`` test does not tax the proxy very much, so any rise
       in latency is probably related to network issues, rather than the
       proxies being very busy. A very slow proxy might impact the average
       number, but it would need to be very slow to shift the number that much.
     - Check networks. Do a ``curl https://<ip-address>/healthcheck``, where
       ``<ip-address>`` is the IP address of an individual proxy, to see if you
       can pinpoint a problem in the network.

       Urgency: If there are other indications that your system is slow, you should treat
       this as an urgent problem.
   * - A swift process is not running.
     - You can use ``swift-init status`` to check if swift processes are running on any
       given server.
     - Run this command:

       .. code::

          sudo swift-init all start

       Examine messages in the swift log files to see if there are any
       error messages related to any of the swift processes since the time you
       ran the ``swift-init`` command.

       Take any corrective actions that seem necessary.

       Urgency: If this only affects one server, and you have more than one,
       identifying and fixing the problem can wait until business hours.
       If this same problem affects many servers, then you need to take corrective
       action immediately.
   * - ntpd is not running.
     - NTP is not running.
     - Configure and start NTP.

       Urgency: For proxy servers, this is vital.

   * - Host clock is not synced to an NTP server.
     - The node's time setting does not match the NTP server time.
       This may take some time to sync after a reboot.
     - Assuming NTP is configured and running, you have to wait until the times sync.
   * - A swift process has hundreds to thousands of open file descriptors.
     - May happen to any of the swift processes.
       Known to have happened with an ``rsyslogd`` restart and where ``/tmp`` was hanging.

     - Restart the swift processes on the affected node:

       .. code::

          % sudo swift-init all reload

       Urgency:
       If known performance problem: Immediate

       If system seems fine: Medium
   * - A swift process is not owned by the swift user.
     - If the UID of the swift user has changed, then the processes might not be
       owned by that UID.
     - Urgency: If this only affects one server, and you have more than one,
       identifying and fixing the problem can wait until business hours.
       If this same problem affects many servers, then you need to take corrective
       action immediately.
   * - Object, account or container files are not owned by swift.
     - This typically happens if, during a reinstall or a re-image of a server, the UID
       of the swift user was changed. The data files in the object, account and container
       directories are owned by the original swift UID. As a result, the current swift
       user does not own these files.
     - Correct the UID of the swift user to reflect that of the original UID. An alternate
       action is to change the ownership of every file on all file systems. This alternate
       action is often impractical and will take considerable time.

       Urgency: If this only affects one server, and you have more than one,
       identifying and fixing the problem can wait until business hours.
       If this same problem affects many servers, then you need to take corrective
       action immediately.
   * - A disk drive has a high IO wait or service time.
     - If high IO wait times are seen for a single disk, then the disk drive is the problem.
       If most/all devices are slow, the controller is probably the source of the problem.
       The controller cache may also be misconfigured, which will cause similar long
       wait or service times.
     - As a first step, if your controllers have a cache, check that it is enabled and
       that its battery/capacitor is working.

       Second, reboot the server.
       If the problem persists, file a DC ticket to have the drive or controller replaced.
       See `Diagnose: Slow disk devices` on how to check the drive wait or service times.

       Urgency: Medium
   * - The network interface is not up.
     - Use the ``ifconfig`` and ``ethtool`` commands to determine the network state
       (see the example after this table).
     - You can try restarting the interface. However, generally the interface
       (or cable) is probably broken, especially if the interface is flapping.

       Urgency: If this only affects one server, and you have more than one,
       identifying and fixing the problem can wait until business hours.
       If this same problem affects many servers, then you need to take corrective
       action immediately.
   * - Network interface card (NIC) is not operating at the expected speed.
     - The NIC is running at a slower speed than its nominal rated speed.
       For example, it is running at 100 Mb/s and the NIC is a 1 GbE NIC.
     - 1. Try resetting the interface with:

          .. code::

             sudo ethtool -s eth0 speed 1000

          ... and then run:

          .. code::

             sudo lshw -class network

          See if the reported ``size`` goes to the expected speed. Failing
          that, check the hardware (NIC cable/switch port).

       2. If persistent, consider shutting down the server (especially if a proxy)
          until the problem is identified and resolved. If you leave this server
          running it can have a large impact on overall performance.

       Urgency: High
   * - The interface RX/TX error count is non-zero.
     - A value of 0 is typical, but counts of 1 or 2 do not indicate a problem.
     - 1. For low numbers (for example, 1 or 2), you can simply ignore them. Numbers
          in the range 3-30 probably indicate that the error count has crept up slowly
          over a long time. Consider rebooting the server to remove the report from
          the noise.

          Typically, when a cable or interface is bad, the error count goes to 400+;
          that is, it stands out. There may be other symptoms such as the interface
          going up and down or not running at the correct speed. A server with a high
          error count should be watched.

       2. If the error count continues to climb, consider taking the server down until
          it can be properly investigated. In any case, a reboot should be done to clear
          the error count.

       Urgency: High, if the error count is increasing.

   * - In a swift log you see a message that a process has not replicated in over 24 hours.
     - The replicator has not successfully completed a run in the last 24 hours.
       This indicates that the replicator has probably hung.
     - Use ``swift-init`` to stop and then restart the replicator process.

       Urgency: Low. However, if you recently added or replaced disk drives,
       then you should treat this urgently.
   * - Container Updater has not run in 4 hour(s).
     - The service may appear to be running; however, it may be hung. Examine the swift
       logs to see if there are any error messages relating to the container updater.
       This may potentially explain why the updater is not running.
     - Urgency: Medium
       This may have been triggered by a recent restart of the rsyslog daemon.
       Restart the service with:

       .. code::

          sudo swift-init <service> reload
   * - Object replicator: Reports the remaining time and that time is more than 100 hours.
     - Each replication cycle the object replicator writes a log message to its log
       reporting statistics about the current cycle. This includes an estimate for the
       remaining time needed to replicate all objects. If this time is longer than
       100 hours, there is a problem with the replication process.
     - Urgency: Medium
       Restart the service with:

       .. code::

          sudo swift-init object-replicator reload

       Check that the remaining replication time is going down.
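
As referenced in the network-interface rows above, a quick sketch for checking
link state, negotiated speed and error counters on one interface (the
interface name is a placeholder):

.. code::

   $ sudo ethtool eth0 | egrep "Speed|Link detected"
   $ ifconfig eth0 | egrep "errors|dropped"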

doc/source/ops_runbook/troubleshooting.rst (new file, 264 lines)

@@ -0,0 +1,264 @@

====================
Troubleshooting tips
====================

Diagnose: Customer complains they receive a HTTP status 500 when trying to browse containers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This entry is prompted by a real customer issue and is exclusively focused on
how that problem was identified.
There are many reasons why an HTTP status of 500 could be returned. If
there are no obvious problems with the swift object store, then it may
be necessary to take a closer look at the user's transactions.
After finding the user's swift account, you can
search the swift proxy logs on each swift proxy server for
transactions from this user. The linux ``bzgrep`` command can be used to
search all the proxy log files on a node, including the ``.bz2`` compressed
files. For example:

.. code::

   $ PDSH_SSH_ARGS_APPEND="-o StrictHostKeyChecking=no" pdsh -l <yourusername> -R ssh \
     -w <redacted>.68.[4-11,132-139],<redacted>.132.[4-11,132-139] \
     'sudo bzgrep -w AUTH_redacted-4962-4692-98fb-52ddda82a5af /var/log/swift/proxy.log*' | dshbak -c
   .
   .
   ----------------
   <redacted>.132.6
   ----------------
   Feb 29 08:51:57 sw-aw2az2-proxy011 proxy-server <redacted>.16.132 <redacted>.66.8 29/Feb/2012/08/51/57 GET /v1.0/AUTH_redacted-4962-4692-98fb-52ddda82a5af/%3Fformat%3Djson HTTP/1.0 404 - - <REDACTED>_4f4d50c5e4b064d88bd7ab82 - - - tx429fc3be354f434ab7f9c6c4206c1dc3 - 0.0130

This shows a ``GET`` operation on the user's account.

.. note::

   The HTTP status returned is 404, Not Found, rather than 500 as reported
   by the user.

Using the transaction ID, ``tx429fc3be354f434ab7f9c6c4206c1dc3``, you can
search the log files on the swift object servers for this transaction ID:

.. code::

   $ PDSH_SSH_ARGS_APPEND="-o StrictHostKeyChecking=no" pdsh -l <yourusername> -R ssh \
     -w <redacted>.72.[4-67],<redacted>.[4-67],<redacted>.[4-67],<redacted>.204.[4-131] \
     'sudo bzgrep tx429fc3be354f434ab7f9c6c4206c1dc3 /var/log/swift/server.log*' | dshbak -c
   .
   .
   ----------------
   <redacted>.72.16
   ----------------
   Feb 29 08:51:57 sw-aw2az1-object013 account-server <redacted>.132.6 - - [29/Feb/2012:08:51:57 +0000] "GET /disk9/198875/AUTH_redacted-4962-4692-98fb-52ddda82a5af" 404 - "tx429fc3be354f434ab7f9c6c4206c1dc3" "-" "-" 0.0016 ""
   ----------------
   <redacted>.31
   ----------------
   Feb 29 08:51:57 node-az2-object060 account-server <redacted>.132.6 - - [29/Feb/2012:08:51:57 +0000] "GET /disk6/198875/AUTH_redacted-4962-4692-98fb-52ddda82a5af" 404 - "tx429fc3be354f434ab7f9c6c4206c1dc3" "-" "-" 0.0011 ""
   ----------------
   <redacted>.204.70
   ----------------
   Feb 29 08:51:57 sw-aw2az3-object0067 account-server <redacted>.132.6 - - [29/Feb/2012:08:51:57 +0000] "GET /disk6/198875/AUTH_redacted-4962-4692-98fb-52ddda82a5af" 404 - "tx429fc3be354f434ab7f9c6c4206c1dc3" "-" "-" 0.0014 ""

.. note::

   There are three ``GET`` operations to three different object servers that
   hold the three replicas of this user's account. Each ``GET`` returns an
   HTTP status of 404, Not Found.

Next, use the ``swift-get-nodes`` command to determine exactly where the
user's account data is stored:

.. code::

   $ sudo swift-get-nodes /etc/swift/account.ring.gz AUTH_redacted-4962-4692-98fb-52ddda82a5af
   Account    AUTH_redacted-4962-4692-98fb-52ddda82a5af
   Container  None
   Object     None

   Partition  198875
   Hash       1846d99185f8a0edaf65cfbf37439696

   Server:Port Device  <redacted>.31:6002 disk6
   Server:Port Device  <redacted>.204.70:6002 disk6
   Server:Port Device  <redacted>.72.16:6002 disk9
   Server:Port Device  <redacted>.204.64:6002 disk11  [Handoff]
   Server:Port Device  <redacted>.26:6002 disk11      [Handoff]
   Server:Port Device  <redacted>.72.27:6002 disk11   [Handoff]

   curl -I -XHEAD "http://<redacted>.31:6002/disk6/198875/AUTH_redacted-4962-4692-98fb-52ddda82a5af"
   curl -I -XHEAD "http://<redacted>.204.70:6002/disk6/198875/AUTH_redacted-4962-4692-98fb-52ddda82a5af"
   curl -I -XHEAD "http://<redacted>.72.16:6002/disk9/198875/AUTH_redacted-4962-4692-98fb-52ddda82a5af"
   curl -I -XHEAD "http://<redacted>.204.64:6002/disk11/198875/AUTH_redacted-4962-4692-98fb-52ddda82a5af"  # [Handoff]
   curl -I -XHEAD "http://<redacted>.26:6002/disk11/198875/AUTH_redacted-4962-4692-98fb-52ddda82a5af"      # [Handoff]
   curl -I -XHEAD "http://<redacted>.72.27:6002/disk11/198875/AUTH_redacted-4962-4692-98fb-52ddda82a5af"   # [Handoff]

   ssh <redacted>.31 "ls -lah /srv/node/disk6/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/"
   ssh <redacted>.204.70 "ls -lah /srv/node/disk6/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/"
   ssh <redacted>.72.16 "ls -lah /srv/node/disk9/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/"
   ssh <redacted>.204.64 "ls -lah /srv/node/disk11/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/"  # [Handoff]
   ssh <redacted>.26 "ls -lah /srv/node/disk11/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/"      # [Handoff]
   ssh <redacted>.72.27 "ls -lah /srv/node/disk11/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/"   # [Handoff]

Check each of the primary servers, <redacted>.31, <redacted>.204.70 and <redacted>.72.16, for
this user's account. For example, on <redacted>.72.16:

.. code::

   $ ls -lah /srv/node/disk9/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/
   total 1.0M
   drwxrwxrwx 2 swift swift   98 2012-02-23 14:49 .
   drwxrwxrwx 3 swift swift   45 2012-02-03 23:28 ..
   -rw------- 1 swift swift  15K 2012-02-23 14:49 1846d99185f8a0edaf65cfbf37439696.db
   -rw-rw-rw- 1 swift swift    0 2012-02-23 14:49 1846d99185f8a0edaf65cfbf37439696.db.pending

So this user's account db, an sqlite db, is present. Use sqlite3 to
check out the account:

.. code::

   $ sudo cp /srv/node/disk9/accounts/198875/696/1846d99185f8a0edaf65cfbf37439696/1846d99185f8a0edaf65cfbf37439696.db /tmp
   $ sudo sqlite3 /tmp/1846d99185f8a0edaf65cfbf37439696.db
   sqlite> .mode line
   sqlite> select * from account_stat;
   account = AUTH_redacted-4962-4692-98fb-52ddda82a5af
   created_at = 1328311738.42190
   put_timestamp = 1330000873.61411
   delete_timestamp = 1330001026.00514
   container_count = 0
   object_count = 0
   bytes_used = 0
   hash = eb7e5d0ea3544d9def940b19114e8b43
   id = 2de8c8a8-cef9-4a94-a421-2f845802fe90
   status = DELETED
   status_changed_at = 1330001026.00514
   metadata =

.. note::

   The status is ``DELETED``. So this account was deleted. This explains
   why the GET operations are returning 404, Not Found. Check the account
   delete date/time:

.. code::

   $ python
   >>> import time
   >>> time.ctime(1330001026.00514)
   'Thu Feb 23 12:43:46 2012'

Next, try to find the ``DELETE`` operation for this account in the proxy
server logs:

.. code::

   $ PDSH_SSH_ARGS_APPEND="-o StrictHostKeyChecking=no" pdsh -l <yourusername> -R ssh \
     -w <redacted>.68.[4-11,132-139],<redacted>.132.[4-11,132-139] \
     'sudo bzgrep AUTH_redacted-4962-4692-98fb-52ddda82a5af /var/log/swift/proxy.log* | grep -w DELETE | awk "{print \$3,\$10,\$12}"' | dshbak -c
   .
   .
   Feb 23 12:43:46 sw-aw2az2-proxy001 proxy-server 15.203.233.76 <redacted>.66.7 23/Feb/2012/12/43/46 DELETE /v1.0/AUTH_redacted-4962-4692-98fb-52ddda82a5af/ HTTP/1.0 204 - Apache-HttpClient/4.1.2%20%28java%201.5%29 <REDACTED>_4f458ee4e4b02a869c3aad02 - - - tx4471188b0b87406899973d297c55ab53 - 0.0086

From this you can see the operation that resulted in the account being deleted.

Procedure: Deleting objects
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Simple case - deleting small number of objects and containers
--------------------------------------------------------------

.. note::

   ``swift-direct`` is specific to the Hewlett Packard Enterprise Helion Public Cloud.
   Use ``swiftly`` as an alternative.

.. note::

   Object and container names are in UTF8. ``swift-direct`` accepts UTF8
   directly, not URL-encoded UTF8 (the REST API expects UTF8 that is then
   URL-encoded). In practice, cutting and pasting foreign-language strings
   into a terminal window will produce the right result.

Hint: Use the ``head`` command before any destructive commands.

To delete a small number of objects, log into any proxy node and proceed
as follows:

Examine the object in question:

.. code::

   $ sudo -u swift /opt/hp/swift/bin/swift-direct head 132345678912345 container_name obj_name

If ``X-Object-Manifest`` or ``X-Static-Large-Object`` is set, then this is
a manifest object and the segment objects may be in another container.

If the ``X-Object-Manifest`` attribute is set, the object is a DLO (dynamic
large object) and you need to find the names of the segment objects. For
example, if ``X-Object-Manifest`` is ``container2/seg-blah``, list the
contents of the container ``container2`` as follows:

.. code::

   $ sudo -u swift /opt/hp/swift/bin/swift-direct show 132345678912345 container2

Pick out the objects whose names start with ``seg-blah``.
Delete the segment objects as follows:

.. code::

   $ sudo -u swift /opt/hp/swift/bin/swift-direct delete 132345678912345 container2 seg-blah01
   $ sudo -u swift /opt/hp/swift/bin/swift-direct delete 132345678912345 container2 seg-blah02
   etc

If ``X-Static-Large-Object`` is set, the object is an SLO (static large
object) and you need to read its manifest contents. Do this by:

- Using swift-get-nodes to get the details of the object's location.
- Changing the ``-X HEAD`` to ``-X GET`` and running ``curl`` against one copy
  (see the sketch after this list).
- This returns a json body listing the containers and object names of the segments.
- Deleting the segment objects as described above for DLO segments.
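
A sketch of that manifest-fetch step; the server, device and partition values
are placeholders that would come from the ``swift-get-nodes`` output for the
manifest object, and ``python -m json.tool`` is used only to pretty-print the
returned JSON:

.. code::

   $ curl -s -X GET "http://<object-server-ip>:6000/<device>/<partition>/<account>/<container>/<manifest-object>" | python -m json.tool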

Once the segments are deleted, you can delete the object using
``swift-direct`` as described above.

Finally, use ``swift-direct`` to delete the container.

Procedure: Decommissioning swift nodes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Should Swift nodes need to be decommissioned (for example, where they are
being re-purposed), it is very important to follow these steps:

#. In the case of object servers, follow the procedure for removing
   the node from the rings (a sketch is shown after this list).
#. In the case of swift proxy servers, have the network team remove
   the node from the load balancers.
#. Open a network ticket to have the node removed from network
   firewalls.
#. Make sure that you remove the ``/etc/swift`` directory and everything in it.
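
For the object-server case, the ring change in step 1 often looks something
like the sketch below; the builder path and the search value (zone, IP, port,
device) are placeholders, and you should check the ``swift-ring-builder`` help
for the exact search-value syntax in your Swift version. Many operators prefer
to reduce device weights gradually before the final remove.

.. code::

   $ sudo swift-ring-builder /etc/swift/object.builder remove z<zone>-<ip>:<port>/<device>
   $ sudo swift-ring-builder /etc/swift/object.builder rebalance
   # Repeat for the container and account builder files, then distribute the
   # updated ring files to all nodes.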