[ops-guide] Add RabbitMQ troubleshooting information
Add troubleshooting information from enterprise ops documentation and
reviewers

Change-Id: I6017ce4e798b034fff6f2430e43807a73acff885
Implements: blueprint improve-ops-guide
parent 41cfb639ad
commit a29d03742b
@ -12,6 +12,7 @@ Maintenance, Failures, and Debugging
   ops_maintenance_configuration.rst
   ops_maintenance_hardware.rst
   ops_maintenance_database.rst
   ops_maintenance_rabbitmq.rst
   ops_maintenance_hdmwy.rst
   ops_maintenance_determine.rst
   ops_maintenance_slow.rst
147
doc/ops-guide/source/ops_maintenance_rabbitmq.rst
Normal file
@ -0,0 +1,147 @@
========================
RabbitMQ troubleshooting
========================

This section provides tips on resolving common RabbitMQ issues.

RabbitMQ service hangs
~~~~~~~~~~~~~~~~~~~~~~

The RabbitMQ service commonly hangs when it is restarted or stopped.
Therefore, restart RabbitMQ manually on each controller node, one node
at a time.

.. note::

   The RabbitMQ service name may vary depending on your operating
   system or the vendor that supplies your RabbitMQ service.

#. Restart the RabbitMQ service on the first controller node. The
   :command:`service rabbitmq-server restart` command may not work
   in certain situations, so it is best to stop and start the service
   as two separate commands:

   .. code-block:: console

      # service rabbitmq-server stop
      # service rabbitmq-server start

#. If the service refuses to stop, run the :command:`pkill` command
   to kill all processes owned by the ``rabbitmq`` user, then start the
   service:

   .. code-block:: console

      # pkill -KILL -u rabbitmq
      # service rabbitmq-server start

#. Verify that the RabbitMQ processes are running and that the broker
   answers queries without errors:

   .. code-block:: console

      # ps -ef | grep rabbitmq
      # rabbitmqctl list_queues
      # rabbitmqctl list_queues 2>&1 | grep -i error

#. If there are errors, run the :command:`rabbitmqctl cluster_status`
   command to make sure there are no partitions:

   .. code-block:: console

      # rabbitmqctl cluster_status

   For more information, see https://www.rabbitmq.com/partitions.html.

#. Go back to the first step and try restarting the RabbitMQ service again. If
   you still have errors, remove the contents of the
   ``/var/lib/rabbitmq/mnesia/`` directory between stopping and starting the
   RabbitMQ service.

#. If there are no errors, restart the RabbitMQ service on the next controller
   node.
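
The partition check in the procedure above lends itself to scripting. The
following is a minimal sketch and is not part of the original procedure. It
assumes the Erlang-term output that RabbitMQ 3.x prints for
:command:`rabbitmqctl cluster_status`, in which a healthy cluster reports
``{partitions,[]}``. The ``has_partitions`` helper name is hypothetical.

.. code-block:: bash

   # Sketch: detect a network partition in `rabbitmqctl cluster_status` output.
   # Assumption: RabbitMQ 3.x Erlang-term output, where a healthy cluster
   # prints the empty list {partitions,[]}.

   # has_partitions (hypothetical helper): reads cluster_status output on stdin.
   # Returns 0 if a non-empty partitions list is reported,
   #         1 if the partitions list is empty,
   #         2 if no partitions entry is present at all.
   has_partitions() {
       hp_status=$(cat)
       if ! printf '%s\n' "$hp_status" | grep -qF '{partitions,'; then
           return 2
       fi
       if printf '%s\n' "$hp_status" | grep -qF '{partitions,[]}'; then
           return 1
       fi
       return 0
   }

For example, ``rabbitmqctl cluster_status | has_partitions && echo "partition
detected"`` prints a message only when the node reports a partition. Treating
a missing entry as a separate return code keeps an unexpected output format
from being mistaken for a healthy cluster.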

Since the Liberty release, OpenStack services automatically recover from
a RabbitMQ outage. Consider restarting OpenStack services only after you
have confirmed that RabbitMQ heartbeat functionality is enabled and that
the OpenStack services are still not picking up messages from the RabbitMQ
queues.

RabbitMQ alerts
~~~~~~~~~~~~~~~

If you receive alerts for RabbitMQ, take the following steps to troubleshoot
and resolve the issue:

#. Determine which servers the RabbitMQ alarms are coming from.
#. Attempt to boot a nova instance in the affected environment.
#. If you cannot launch an instance, continue to troubleshoot the issue.
#. Log in to each of the controller nodes for the affected environment, and
   check the :file:`/var/log/rabbitmq` log files for any reported issues.
#. Look for connection issues identified in the log files.
#. For each controller node in your environment, view the ``/etc/init.d``
   directory to check whether it contains ``nova*``, ``cinder*``,
   ``neutron*``, or ``glance*`` entries. Also check for RabbitMQ message
   queues that are growing without being consumed, which indicates which
   OpenStack service is affected. Restart the affected OpenStack service.
#. For each compute node in your environment, view the ``/etc/init.d``
   directory to check whether it contains ``nova*``, ``cinder*``,
   ``neutron*``, or ``glance*`` entries. Also check for RabbitMQ message
   queues that are growing without being consumed, which indicates which
   OpenStack services are affected. Restart the affected OpenStack services.
#. Open the OpenStack Dashboard and launch an instance. If the instance
   launches, the issue is resolved.
#. If you cannot launch an instance, check the :file:`/var/log/rabbitmq` log
   files for reported connection issues.
#. Restart the RabbitMQ service on all of the controller nodes:

   .. code-block:: console

      # service rabbitmq-server stop
      # service rabbitmq-server start

   .. note::

      This step applies if you have already restarted only the OpenStack
      components, and still cannot connect to the RabbitMQ service.

#. Repeat steps 7-8.
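
Steps 6 and 7 above depend on spotting queues that back up without being
consumed. The following is a rough sketch of one way to do that, not the
method prescribed by this guide. It filters the tab-separated output of
:command:`rabbitmqctl -q list_queues name messages consumers`. A single
snapshot cannot show growth, so the sketch uses "no consumers and a backlog
above a threshold" as a proxy. The ``stalled_queues`` helper name and the
default threshold of 100 messages are assumptions made for illustration.

.. code-block:: bash

   # Sketch: list queues that hold a backlog and have no consumers.
   # Assumption: input rows are tab-separated "name<TAB>messages<TAB>consumers",
   # as printed by `rabbitmqctl -q list_queues name messages consumers`.

   # stalled_queues (hypothetical helper): reads queue rows on stdin and
   # prints "name<TAB>messages" for every queue with at least MIN messages
   # (first argument, default 100) and zero consumers.
   stalled_queues() {
       sq_min="${1:-100}"
       awk -F'\t' -v min="$sq_min" '
           # Ignore banner lines and anything without two numeric columns.
           NF == 3 && $2 ~ /^[0-9]+$/ && $3 ~ /^[0-9]+$/ {
               if ($2 + 0 >= min + 0 && $3 + 0 == 0) print $1 "\t" $2
           }'
   }

For example, ``rabbitmqctl -q list_queues name messages consumers |
stalled_queues 500`` prints only the queues with 500 or more messages and no
consumer attached. The queue names usually point to the OpenStack service
that needs a restart.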

Excessive database management memory consumption
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Since the Liberty release, OpenStack with RabbitMQ 3.4.x or 3.6.x has an issue
with the management database consuming the memory allocated to RabbitMQ.
This is caused by statistics collection and processing. When a single node
with RabbitMQ reaches its memory threshold, all exchange and queue processing
is halted until the memory alarm clears.

To address this issue:

#. Check memory consumption:

   .. code-block:: console

      # rabbitmqctl status

#. Edit the :file:`/etc/rabbitmq/rabbitmq.config` configuration file, and set
   the ``collect_statistics_interval`` parameter to a value between 30000 and
   60000 milliseconds. Alternatively, you can turn off statistics collection
   by setting the ``collect_statistics`` parameter to ``none``.
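
As an illustration of the change described above, the following fragment
shows how the setting might look in the classic Erlang-term configuration
format. It is a sketch only: it assumes both parameters belong to the
``rabbit`` application section, and an existing configuration file will
normally contain other settings that must be kept and merged with these.

.. code-block:: erlang

   %% /etc/rabbitmq/rabbitmq.config (sketch; merge with existing settings)
   [
     {rabbit, [
       %% Collect statistics every 30 seconds (value in milliseconds).
       {collect_statistics_interval, 30000}
       %% Alternative from the step above: turn statistics collection off.
       %% To use it, replace the line above with:
       %% {collect_statistics, none}
     ]}
   ].

Restart the RabbitMQ service after you edit the file so that the new value
takes effect.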

File descriptor limits when scaling a cloud environment
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When a cloud environment scales beyond a certain size, the file descriptor
limits for RabbitMQ need to be adjusted.

Run the :command:`rabbitmqctl status` command to view the current file
descriptor limits:

.. code-block:: console

   "{file_descriptors,
   [{total_limit,3996},
   {total_used,135},
   {sockets_limit,3594},
   {sockets_used,133}]},"

Adjust the appropriate limits in the
:file:`/etc/security/limits.conf` configuration file.
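
To judge how close a node is to its limit, the two counters in the output
above can be turned into a percentage. The following is a minimal sketch that
assumes the ``{total_limit,N}`` and ``{total_used,N}`` entries appear exactly
as shown in the sample output. The ``fd_usage_percent`` helper name is
hypothetical.

.. code-block:: bash

   # Sketch: compute file descriptor usage from `rabbitmqctl status` output.
   # Assumption: the output contains {total_limit,N} and {total_used,N}
   # entries in the Erlang-term format shown in the sample above.

   # fd_usage_percent (hypothetical helper): reads status output on stdin and
   # prints used/limit as a whole percentage (integer division, rounded down).
   # Returns 1 if either counter is missing or the limit is zero.
   fd_usage_percent() {
       fd_status=$(cat)
       fd_limit=$(printf '%s\n' "$fd_status" |
           sed -n 's/.*{total_limit,\([0-9][0-9]*\)}.*/\1/p' | head -n 1)
       fd_used=$(printf '%s\n' "$fd_status" |
           sed -n 's/.*{total_used,\([0-9][0-9]*\)}.*/\1/p' | head -n 1)
       if [ -z "$fd_limit" ] || [ -z "$fd_used" ] || [ "$fd_limit" -eq 0 ]; then
           echo "file descriptor counters not found" >&2
           return 1
       fi
       echo $(( fd_used * 100 / fd_limit ))
   }

For example, ``rabbitmqctl status | fd_usage_percent`` prints a single number
that a monitoring check can compare against a threshold of your choosing.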