[ops-guide] Add RabbitMQ troubleshooting information

Add troubleshooting information from enterprise ops documentation and reviewers

Change-Id: I6017ce4e798b034fff6f2430e43807a73acff885
Implements: blueprint improve-ops-guide
2016-07-12 17:26:17 +10:00 · 2016-07-12 17:26:17 +10:00 · a29d03742b
commit a29d03742b
parent 41cfb639ad
2 changed files with 148 additions and 0 deletions
--- a/doc/ops-guide/source/ops_maintenance.rst
+++ b/doc/ops-guide/source/ops_maintenance.rst
@@ -12,6 +12,7 @@ Maintenance, Failures, and Debugging
   ops_maintenance_configuration.rst
   ops_maintenance_hardware.rst
   ops_maintenance_database.rst
+   ops_maintenance_rabbitmq.rst
   ops_maintenance_hdmwy.rst
   ops_maintenance_determine.rst
   ops_maintenance_slow.rst
--- a/doc/ops-guide/source/ops_maintenance_rabbitmq.rst
+++ b/doc/ops-guide/source/ops_maintenance_rabbitmq.rst
@@ -0,0 +1,147 @@
+========================
+RabbitMQ troubleshooting
+========================
+
+This section provides tips on resolving common RabbitMQ issues.
+
+RabbitMQ service hangs
+~~~~~~~~~~~~~~~~~~~~~~
+
+The RabbitMQ service often hangs when it is restarted or stopped. For
+this reason, it is highly recommended that you restart RabbitMQ manually
+on each controller node.
+
+.. note::
+
+   The RabbitMQ service name may vary depending on your operating
+   system or the vendor that supplies your RabbitMQ service.
+
+#. Restart the RabbitMQ service on the first controller node. The
+   :command:`service rabbitmq-server restart` command may not work
+   in certain situations, so it is best to use:
+
+   .. code-block:: console
+
+      # service rabbitmq-server stop
+      # service rabbitmq-server start
+
+
+#. If the service refuses to stop, run the :command:`pkill` command
+   to kill its processes, then start the service:
+
+   .. code-block:: console
+
+      # pkill -KILL -u rabbitmq
+      # service rabbitmq-server start
+
+#. Verify RabbitMQ processes are running:
+
+   .. code-block:: console
+
+      # ps -ef | grep rabbitmq
+      # rabbitmqctl list_queues
+      # rabbitmqctl list_queues 2>&1 | grep -i error
+
+#. If there are errors, run the :command:`rabbitmqctl cluster_status`
+   command to make sure there are no partitions:
+
+   .. code-block:: console
+
+      # rabbitmqctl cluster_status
+
+   For more information, see https://www.rabbitmq.com/partitions.html.
+
+#. Go back to the first step and try restarting the RabbitMQ service again. If
+   you still have errors, remove the contents of the
+   ``/var/lib/rabbitmq/mnesia/`` directory between stopping and starting the
+   RabbitMQ service.
+
+#. If there are no errors, restart the RabbitMQ service on the next controller
+   node.
+
+Since the Liberty release, OpenStack services automatically recover from
+a RabbitMQ outage. Consider restarting OpenStack services only after
+confirming that RabbitMQ heartbeat functionality is enabled and that OpenStack
+services are still not picking up messages from RabbitMQ queues.
+
+RabbitMQ alerts
+~~~~~~~~~~~~~~~
+
+If you receive alerts for RabbitMQ, take the following steps to troubleshoot
+and resolve the issue:
+
+#. Determine which servers the RabbitMQ alarms are coming from.
+#. Attempt to boot a nova instance in the affected environment.
+#. If you cannot launch an instance, continue to troubleshoot the issue.
+#. Log in to each of the controller nodes for the affected environment, and
+   check the :file:`/var/log/rabbitmq` log files for any reported issues.
+#. Look for connection issues identified in the log files.
+#. For each controller node in your environment, view the ``/etc/init.d``
+   directory to check whether it contains ``nova*``, ``cinder*``,
+   ``neutron*``, or ``glance*`` service scripts. Also check for RabbitMQ
+   message queues that are growing without being consumed, which indicates
+   which OpenStack service is affected. Restart the affected OpenStack service.
+#. For each compute node in your environment, view the ``/etc/init.d``
+   directory to check whether it contains ``nova*``, ``cinder*``,
+   ``neutron*``, or ``glance*`` service scripts. Also check for RabbitMQ
+   message queues that are growing without being consumed, which indicates
+   which OpenStack services are affected. Restart the affected OpenStack
+   services.
+#. Open OpenStack Dashboard and launch an instance. If the instance launches,
+   the issue is resolved.
+#. If you cannot launch an instance, check the :file:`/var/log/rabbitmq` log
+   files for reported connection issues.
+#. Restart the RabbitMQ service on all of the controller nodes:
+
+   .. code-block:: console
+
+      # service rabbitmq-server stop
+      # service rabbitmq-server start
+
+   .. note::
+
+      This step applies if you have already restarted only the OpenStack components, and
+      cannot connect to the RabbitMQ service.
+
+#. Repeat steps 7-8.
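+
+The check for queues that grow without being consumed can be sketched as
+follows. This is a minimal example, not a definitive procedure: the column
+selection and the :command:`awk` filter are illustrative assumptions, and
+the output format may differ between RabbitMQ versions. The first command
+lists each queue with its message and consumer counts. The second keeps only
+queues that hold messages but have no consumers:
+
+.. code-block:: console
+
+   # rabbitmqctl list_queues name messages consumers
+   # rabbitmqctl list_queues name messages consumers | awk '$2 > 0 && $3 == 0'
+
+Queue names often hint at the owning OpenStack service, which can help
+identify the service to restart.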
+
+Excessive database management memory consumption
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Since the Liberty release, OpenStack with RabbitMQ 3.4.x or 3.6.x has an issue
+with the management database consuming the memory allocated to RabbitMQ.
+This is caused by statistics collection and processing. When a single node
+with RabbitMQ reaches its memory threshold, all exchange and queue processing
+is halted until the memory alarm recovers.
+
+To address this issue:
+
+#. Check memory consumption:
+
+   .. code-block:: console
+
+      # rabbitmqctl status
+
+#. Edit the :file:`/etc/rabbitmq/rabbitmq.config` configuration file, and set
+   the ``collect_statistics_interval`` parameter to a value between 30000
+   and 60000 milliseconds. Alternatively, you can turn off statistics
+   collection by setting the ``collect_statistics`` parameter to ``none``.
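+
+As an illustration of the second step, the fragment below shows how the
+setting might look in the classic Erlang-term configuration format. It is a
+sketch only: the value 30000 is simply the lower bound of the range given
+above, and any settings already present in your file must be kept and merged
+with it.
+
+.. code-block:: erlang
+
+   %% Illustrative fragment; merge with your existing settings.
+   [
+     {rabbit, [
+       %% Statistics collection interval, in milliseconds.
+       {collect_statistics_interval, 30000}
+     ]}
+   ].
+
+To turn off statistics collection instead, the corresponding entry would be
+``{collect_statistics, none}``. A restart of the RabbitMQ service is
+typically needed for changes to this file to take effect.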
+
+File descriptor limits when scaling a cloud environment
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+A cloud environment that is scaled to a certain size will require the file
+descriptor limits to be adjusted.
+
+Run the :command:`rabbitmqctl status` command to view the current file
+descriptor limits:
+
+.. code-block:: console
+
+   "{file_descriptors,
+        [{total_limit,3996},
+         {total_used,135},
+         {sockets_limit,3594},
+         {sockets_used,133}]},"
+
+Adjust the appropriate limits in the
+:file:`/etc/security/limits.conf` configuration file.
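+
+For example, entries of the following form raise the open file limit for
+the ``rabbitmq`` user. This is a sketch only: the value 65536 is an
+illustrative assumption rather than a recommendation, and the user name
+may differ depending on how RabbitMQ is packaged on your system.
+
+.. code-block:: none
+
+   # <domain>  <type>  <item>  <value>
+   rabbitmq    soft    nofile  65536
+   rabbitmq    hard    nofile  65536
+
+After restarting the RabbitMQ service, rerun :command:`rabbitmqctl status`
+to check whether ``total_limit`` reflects the new value.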