From a29d03742bcae7e56df75adf15a01749657ba664 Mon Sep 17 00:00:00 2001
From: daz
Date: Tue, 12 Jul 2016 17:26:17 +1000
Subject: [PATCH] [ops-guide] Add RabbitMQ troubleshooting information

Add troubleshooting information from enterprise ops documentation
and reviewers

Change-Id: I6017ce4e798b034fff6f2430e43807a73acff885
Implements: blueprint improve-ops-guide
---
 doc/ops-guide/source/ops_maintenance.rst |   1 +
 .../source/ops_maintenance_rabbitmq.rst  | 147 ++++++++++++++++++
 2 files changed, 148 insertions(+)
 create mode 100644 doc/ops-guide/source/ops_maintenance_rabbitmq.rst

diff --git a/doc/ops-guide/source/ops_maintenance.rst b/doc/ops-guide/source/ops_maintenance.rst
index df937796c4..feef8e98a7 100644
--- a/doc/ops-guide/source/ops_maintenance.rst
+++ b/doc/ops-guide/source/ops_maintenance.rst
@@ -12,6 +12,7 @@ Maintenance, Failures, and Debugging
    ops_maintenance_configuration.rst
    ops_maintenance_hardware.rst
    ops_maintenance_database.rst
+   ops_maintenance_rabbitmq.rst
    ops_maintenance_hdmwy.rst
    ops_maintenance_determine.rst
    ops_maintenance_slow.rst
diff --git a/doc/ops-guide/source/ops_maintenance_rabbitmq.rst b/doc/ops-guide/source/ops_maintenance_rabbitmq.rst
new file mode 100644
index 0000000000..3f817d35b5
--- /dev/null
+++ b/doc/ops-guide/source/ops_maintenance_rabbitmq.rst
@@ -0,0 +1,147 @@
+========================
+RabbitMQ troubleshooting
+========================
+
+This section provides tips on resolving common RabbitMQ issues.
+
+RabbitMQ service hangs
+~~~~~~~~~~~~~~~~~~~~~~
+
+It is quite common for the RabbitMQ service to hang when it is
+restarted or stopped. Therefore, it is highly recommended that
+you manually restart RabbitMQ on each controller node.
+
+.. note::
+
+   The RabbitMQ service name may vary depending on your operating
+   system or the vendor that supplies your RabbitMQ service.
+
+#. Restart the RabbitMQ service on the first controller node.
+   The :command:`service rabbitmq-server restart` command may not work
+   in certain situations, so it is best to use:
+
+   .. code-block:: console
+
+      # service rabbitmq-server stop
+      # service rabbitmq-server start
+
+
+#. If the service refuses to stop, run the :command:`pkill` command to kill
+   the RabbitMQ processes, then start the service:
+
+   .. code-block:: console
+
+      # pkill -KILL -u rabbitmq
+      # service rabbitmq-server start
+
+#. Verify that the RabbitMQ processes are running:
+
+   .. code-block:: console
+
+      # ps -ef | grep rabbitmq
+      # rabbitmqctl list_queues
+      # rabbitmqctl list_queues 2>&1 | grep -i error
+
+#. If there are errors, run the :command:`rabbitmqctl cluster_status` command
+   to make sure there are no partitions:
+
+   .. code-block:: console
+
+      # rabbitmqctl cluster_status
+
+   For more information, see https://www.rabbitmq.com/partitions.html.
+
+#. Go back to the first step and try restarting the RabbitMQ service again. If
+   you still have errors, remove the contents of the
+   ``/var/lib/rabbitmq/mnesia/`` directory between stopping and starting the
+   RabbitMQ service.
+
+#. If there are no errors, restart the RabbitMQ service on the next controller
+   node.
+
+Since the Liberty release, OpenStack services automatically recover from
+a RabbitMQ outage. Consider restarting OpenStack services only after
+confirming that RabbitMQ heartbeat functionality is enabled and that OpenStack
+services are still not picking up messages from RabbitMQ queues.
+
+RabbitMQ alerts
+~~~~~~~~~~~~~~~
+
+If you receive alerts for RabbitMQ, take the following steps to troubleshoot
+and resolve the issue:
+
+#. Determine which servers the RabbitMQ alarms are coming from.
+#. Attempt to boot a nova instance in the affected environment.
+#. If you cannot launch an instance, continue to troubleshoot the issue.
+#. Log in to each of the controller nodes for the affected environment, and
+   check the :file:`/var/log/rabbitmq` log files for any reported issues.
+#. Look for connection issues identified in the log files.
+#. For each controller node in your environment, view the ``/etc/init.d``
+   directory to check if it contains ``nova*``, ``cinder*``, ``neutron*``,
+   or ``glance*`` entries. Also check for RabbitMQ message queues that are
+   growing without being consumed, which indicates which OpenStack service is
+   affected. Restart the affected OpenStack service.
+#. For each compute node in your environment, view the ``/etc/init.d``
+   directory to check if it contains ``nova*``, ``cinder*``, ``neutron*``,
+   or ``glance*`` entries. Also check for RabbitMQ message queues that are
+   growing without being consumed, which indicates which OpenStack services
+   are affected. Restart the affected OpenStack services.
+#. Open OpenStack Dashboard and launch an instance. If the instance launches,
+   the issue is resolved.
+#. If you cannot launch an instance, check the :file:`/var/log/rabbitmq` log
+   files for reported connection issues.
+#. Restart the RabbitMQ service on all of the controller nodes:
+
+   .. code-block:: console
+
+      # service rabbitmq-server stop
+      # service rabbitmq-server start
+
+   .. note::
+
+      This step applies if you have already restarted only the OpenStack
+      components and still cannot connect to the RabbitMQ service.
+
+#. Repeat steps 7-8.
+
+Excessive management database memory consumption
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Since the Liberty release, OpenStack with RabbitMQ 3.4.x or 3.6.x has an issue
+with the management database consuming the memory allocated to RabbitMQ.
+This is caused by statistics collection and processing. When a single node
+with RabbitMQ reaches its memory threshold, all exchange and queue processing
+is halted until the memory alarm clears.
+
+To address this issue:
+
+#. Check memory consumption:
+
+   .. code-block:: console
+
+      # rabbitmqctl status
+
+#. Edit the :file:`/etc/rabbitmq/rabbitmq.config` configuration file, and set
+   ``collect_statistics_interval`` to between 30000 and 60000 milliseconds.
+   Alternatively, you can turn off statistics collection by setting the
+   ``collect_statistics`` parameter to ``none``.
+
+File descriptor limits when scaling a cloud environment
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+As a cloud environment grows, RabbitMQ handles more connections, and its file
+descriptor limits may need to be raised.
+
+Run the :command:`rabbitmqctl status` command to view the current file
+descriptor limits:
+
+.. code-block:: console
+
+   {file_descriptors,
+       [{total_limit,3996},
+        {total_used,135},
+        {sockets_limit,3594},
+        {sockets_used,133}]},
+
+Adjust the appropriate limits in the
+:file:`/etc/security/limits.conf` configuration file.
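
As a sketch of the file descriptor adjustment described above, entries like the
following could be added to :file:`/etc/security/limits.conf`. The ``rabbitmq``
account name is an assumption (it matches the user targeted by the
:command:`pkill` command earlier in this guide), and ``65536`` is only an
illustrative value, not a recommendation. Depending on how the service is
started, the init system may apply its own limits instead, so run
:command:`rabbitmqctl status` again after restarting RabbitMQ to confirm the
new ``total_limit``.

.. code-block:: none

   # Hypothetical values: raise the open-file limit for the rabbitmq user
   rabbitmq soft nofile 65536
   rabbitmq hard nofile 65536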