============================
The keepalived architecture
============================

High availability strategies
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The following diagram shows a very simplified view of the different
strategies used to achieve high availability for the OpenStack
services:

.. image:: /figures/keepalived-arch.jpg
   :width: 100%

Depending on the method used to communicate with the service, one of
the following availability strategies is used:

- Keepalived, for the HAProxy instances (a sketch of this pairing
  follows the list).
- Access via an HAProxy virtual IP, for services such as HTTPd that
  are accessed via a TCP socket that can be load balanced.
- Built-in application clustering, when available from the application.
  Galera is one example of this.
- Starting up one instance of the service on several controller nodes,
  when they can coexist and coordinate by other means. RPC in
  ``nova-conductor`` is one example of this.
- No high availability, when the service can only work in
  active/passive mode.
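
As an illustration of the first two strategies, the following is a
minimal sketch of how keepalived and HAProxy can be paired. The
interface name (``eth0``), virtual IP (``10.0.0.100``), and back-end
addresses are placeholders rather than values from a reference
deployment:

.. code-block:: none

   # /etc/keepalived/keepalived.conf (sketch)
   vrrp_instance VI_1 {
       # Use state BACKUP and a lower priority on the other controllers.
       state MASTER
       interface eth0
       virtual_router_id 51
       priority 101
       advert_int 1
       virtual_ipaddress {
           10.0.0.100
       }
   }

HAProxy, running on every controller, binds the virtual IP and load
balances the TCP-accessible services behind it, for example:

.. code-block:: none

   # /etc/haproxy/haproxy.cfg (sketch)
   listen dashboard
       bind 10.0.0.100:80
       balance source
       server controller1 10.0.0.11:80 check inter 2000 rise 2 fall 5
       server controller2 10.0.0.12:80 check inter 2000 rise 2 fall 5

Because HAProxy on the backup nodes binds an address that is not yet
present locally, such deployments typically enable the
``net.ipv4.ip_nonlocal_bind`` sysctl.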

Known issues with cinder-volume make it advisable to run that service
in active-passive mode for now. See:
https://blueprints.launchpad.net/cinder/+spec/cinder-volume-active-active-support

While there will be multiple neutron LBaaS agents running, each agent
manages a set of load balancers that cannot be failed over to another
node.

Architecture limitations
~~~~~~~~~~~~~~~~~~~~~~~~

This architecture has some inherent limitations that should be kept in
mind during deployment and daily operations.
The following sections describe these limitations.

#. Keepalived and network partitions

   In the case of a network partition, there is a chance that two or
   more nodes running keepalived claim to hold the same VIP, which may
   lead to undesired behaviour. Since keepalived uses VRRP over
   multicast to elect a master (VIP owner), a network partition in
   which keepalived nodes cannot communicate will result in the VIPs
   existing on two nodes. When the network partition is resolved, the
   duplicate VIPs should also be resolved. Note that this network
   partition problem with VRRP is a known limitation of this
   architecture. (A quick check for duplicate VIPs is sketched after
   this list.)

#. Cinder-volume as a single point of failure

   There are currently concerns over the ability of the cinder-volume
   service to run as a fully active-active service. During the Mitaka
   timeframe, this is being worked on; see:
   https://blueprints.launchpad.net/cinder/+spec/cinder-volume-active-active-support
   Thus, cinder-volume will only be running on one of the controller
   nodes, even though it is configured on all nodes. In case of a
   failure of the node running cinder-volume, the service should be
   started on a surviving controller node (see the sketch after this
   list).

#. Neutron-lbaas-agent as a single point of failure

   The current design of the neutron LBaaS agent using the HAProxy
   driver does not allow high availability for the project load
   balancers. The neutron-lbaas-agent service will be enabled and
   running on all controllers, allowing for load balancers to be
   distributed across all nodes. However, a controller node failure
   will stop all load balancers running on that node until the service
   is recovered or the load balancer is manually removed and created
   again.

#. Service monitoring and recovery required

   An external service monitoring infrastructure is required to check
   the OpenStack service health and to notify operators in case of any
   failure. This architecture does not provide any facility for that,
   so it is necessary to integrate the OpenStack deployment with an
   existing monitoring environment.

#. Manual recovery after a full cluster restart

   Some support services used by RDO or RHEL OSP use their own form of
   application clustering. Usually, these services maintain a cluster
   quorum that may be lost in the case of a simultaneous restart of
   all cluster nodes, for example during a power outage. Each service
   requires its own procedure to regain quorum, as sketched after this
   list.
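
To spot the duplicate-VIP condition described in the first limitation,
one simple check is to list the addresses on each controller and
confirm that exactly one node holds the virtual IP. The interface and
address below are the same placeholders used earlier:

.. code-block:: console

   # ip -4 addr show dev eth0 | grep 10.0.0.100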
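
For the cinder-volume limitation, a manual failover amounts to starting
the service on a surviving controller and checking that it is reported
as up. The unit name below assumes RDO packaging:

.. code-block:: console

   # systemctl start openstack-cinder-volume
   # cinder service-list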
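
As an example of the per-service quorum procedures mentioned in the
last limitation, a Galera cluster is usually bootstrapped again from
the node with the most advanced state, assuming a MariaDB package that
ships the ``galera_new_cluster`` helper:

.. code-block:: console

   # galera_new_cluster

MariaDB can then be started normally on the remaining database nodes.
RabbitMQ nodes that all went down together can be told not to wait for
each other before starting:

.. code-block:: console

   # rabbitmqctl force_boot
   # systemctl start rabbitmq-server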

If you find any or all of these limitations concerning, you are
encouraged to refer to the
:doc:`Pacemaker HA architecture <intro-ha-arch-pacemaker>` instead.