[[ch-pacemaker]]
=== The Pacemaker Cluster Stack

OpenStack infrastructure high availability relies on the
http://www.clusterlabs.org[Pacemaker] cluster stack, the
state-of-the-art high availability and load balancing stack for the
Linux platform. Pacemaker is storage and application-agnostic, and is
in no way specific to OpenStack.

Pacemaker relies on the http://www.corosync.org[Corosync] messaging
layer for reliable cluster communications. Corosync implements the
Totem single-ring ordering and membership protocol. It also provides
UDP and InfiniBand based messaging, quorum, and cluster membership to
Pacemaker.

Pacemaker interacts with applications through _resource agents_ (RAs),
of which it supports over 70 natively. Pacemaker can also easily use
third-party RAs. An OpenStack high-availability configuration uses
existing native Pacemaker RAs (such as those managing MySQL databases
or virtual IP addresses), existing third-party RAs (such as for
RabbitMQ), and native OpenStack RAs (such as those managing the
OpenStack Identity and Image Services).

==== Installing Packages

On any host that is meant to be part of a Pacemaker cluster, you must
first establish cluster communications through the Corosync messaging
layer. This involves installing the following packages (and their
dependencies, which your package manager will normally install
automatically):

* +pacemaker+ (note that the crm shell ships separately, as the
  +crmsh+ package below)
* +crmsh+
* +corosync+
* +cluster-glue+
* +fence-agents+ (Fedora only; all other distributions use fencing
  agents from +cluster-glue+)
* +resource-agents+
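
The exact installation command depends on your distribution's package
manager. As a rough sketch, assuming a Debian- or Ubuntu-based system
on which these package names are available, the installation might
look like this:

----
# Debian/Ubuntu sketch; on Fedora, use yum and add fence-agents.
# Package names may differ slightly between distributions and releases.
apt-get install pacemaker crmsh corosync cluster-glue resource-agents
----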
==== Setting up Corosync

Besides installing the +corosync+ package, you will also have to
create a configuration file, stored in +/etc/corosync/corosync.conf+.
Most distributions ship an example configuration file
(+corosync.conf.example+) as part of the documentation bundled with
the +corosync+ package. An example Corosync configuration file is
shown below:

.Corosync configuration file (+corosync.conf+)
----
include::includes/corosync.conf[]
----

<1> The +token+ value specifies the time, in milliseconds, during
    which the Corosync token is expected to be transmitted around the
    ring. When this timeout expires, the token is declared lost, and
    after +token_retransmits_before_loss_const+ lost tokens the
    non-responding _processor_ (cluster node) is declared dead. In
    other words, +token+ × +token_retransmits_before_loss_const+ is
    the maximum time a node is allowed to not respond to cluster
    messages before being considered dead. The default for +token+ is
    1000 (1 second), with 4 allowed retransmits. These defaults are
    intended to minimize failover times, but can cause frequent
    "false alarms" and unintended failovers in case of short network
    interruptions. The values used here are safer, albeit with
    slightly extended failover times.
<2> With +secauth+ enabled, Corosync nodes mutually authenticate using
    a 128-byte shared secret stored in +/etc/corosync/authkey+, which
    may be generated with the +corosync-keygen+ utility. When using
    +secauth+, cluster communications are also encrypted.
<3> In Corosync configurations using redundant networking (with more
    than one +interface+), you must select a Redundant Ring Protocol
    (RRP) mode other than +none+. +active+ is the recommended RRP
    mode.
<4> There are several things to note about the recommended interface
    configuration:
+
* The +ringnumber+ must differ between all configured interfaces,
  starting with 0.
* The +bindnetaddr+ is the _network_ address of the interfaces to bind
  to. The example uses two network addresses of +/24+ IPv4 subnets.
* Multicast groups (+mcastaddr+) _must not_ be reused across cluster
  boundaries. In other words, no two distinct clusters should ever use
  the same multicast group. Be sure to select multicast addresses
  compliant with
  http://www.ietf.org/rfc/rfc2365.txt[RFC 2365, "Administratively Scoped IP Multicast"].
* For firewall configurations, note that Corosync communicates over
  UDP only, and uses +mcastport+ (for receives) and +mcastport - 1+
  (for sends).
<5> The +service+ declaration for the +pacemaker+ service may be
    placed in the +corosync.conf+ file directly, or in its own
    separate file, +/etc/corosync/service.d/pacemaker+.

Once created, the +corosync.conf+ file (and the +authkey+ file if the
+secauth+ option is enabled) must be synchronized across all cluster
nodes.
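
The +corosync.conf+ file pulled in by the include above is not
reproduced in this guide. As a rough sketch only, a configuration
consistent with the callouts above might look like the following; the
addresses, multicast groups, timeout values, and the +ver+ setting are
illustrative assumptions, not the contents of the included file:

----
# Illustrative corosync.conf sketch -- adjust values for your environment
totem {
        version: 2

        # Longer token timeout and more retransmits than the defaults,
        # as discussed in callout <1> (example values only)
        token: 10000
        token_retransmits_before_loss_const: 10

        # Mutual authentication and encryption (callout <2>);
        # requires /etc/corosync/authkey on every node
        secauth: on

        # Redundant Ring Protocol mode (callout <3>)
        rrp_mode: active

        # Two redundant rings (callout <4>); bindnetaddr is the
        # network address of each /24 subnet
        interface {
                ringnumber: 0
                bindnetaddr: 192.168.42.0
                mcastaddr: 239.255.42.1
                mcastport: 5405
        }
        interface {
                ringnumber: 1
                bindnetaddr: 10.0.42.0
                mcastaddr: 239.255.42.2
                mcastport: 5405
        }
}

# Pacemaker service declaration (callout <5>); may instead be placed
# in /etc/corosync/service.d/pacemaker. ver: 1 means Pacemaker is
# started separately, as described below.
service {
        name: pacemaker
        ver: 1
}
----

Because the sketch above enables +secauth+, the shared key must be
generated once and then copied, together with the configuration file,
to every node. Assuming a second node reachable as +node2+ (a
placeholder hostname), this could be done with:

----
# Generate /etc/corosync/authkey on one node (gathers entropy, may take a while)
corosync-keygen

# Copy configuration and key to the other node(s)
scp -p /etc/corosync/corosync.conf /etc/corosync/authkey node2:/etc/corosync/
----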
==== Starting Corosync

Corosync is started as a regular system service. Depending on your
distribution, it may ship with an LSB (System V style) init script, an
upstart job, or a systemd unit file. Either way, the service is
usually named +corosync+:

* +/etc/init.d/corosync start+ (LSB)
* +service corosync start+ (LSB, alternate)
* +start corosync+ (upstart)
* +systemctl start corosync+ (systemd)

You can now check Corosync connectivity with two tools. The
+corosync-cfgtool+ utility, when invoked with the +-s+ option, gives a
summary of the health of the communication rings:

----
# corosync-cfgtool -s
Printing ring status.
Local node ID 435324542
RING ID 0
        id      = 192.168.42.82
        status  = ring 0 active with no faults
RING ID 1
        id      = 10.0.42.100
        status  = ring 1 active with no faults
----

The +corosync-objctl+ utility can be used to dump the Corosync cluster
member list:

----
# corosync-objctl runtime.totem.pg.mrp.srp.members
runtime.totem.pg.mrp.srp.435324542.ip=r(0) ip(192.168.42.82) r(1) ip(10.0.42.100)
runtime.totem.pg.mrp.srp.435324542.join_count=1
runtime.totem.pg.mrp.srp.435324542.status=joined
runtime.totem.pg.mrp.srp.983895584.ip=r(0) ip(192.168.42.87) r(1) ip(10.0.42.254)
runtime.totem.pg.mrp.srp.983895584.join_count=1
runtime.totem.pg.mrp.srp.983895584.status=joined
----

You should see a +status=joined+ entry for each of your constituent
cluster nodes.

==== Starting Pacemaker

Once the Corosync services have been started, and you have established
that the cluster is communicating properly, it is safe to start
+pacemakerd+, the Pacemaker master control process:

* +/etc/init.d/pacemaker start+ (LSB)
* +service pacemaker start+ (LSB, alternate)
* +start pacemaker+ (upstart)
* +systemctl start pacemaker+ (systemd)

Once the Pacemaker services have started, Pacemaker will create a
default, empty cluster configuration with no resources. You may
observe Pacemaker's status with the +crm_mon+ utility:

----
============
Last updated: Sun Oct 7 21:07:52 2012
Last change: Sun Oct 7 20:46:00 2012 via cibadmin on node2
Stack: openais
Current DC: node2 - partition with quorum
Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
2 Nodes configured, 2 expected votes
0 Resources configured.
============

Online: [ node2 node1 ]
----

==== Setting basic cluster properties

Once your Pacemaker cluster is set up, it is recommended to set a few
basic cluster properties. To do so, start the +crm+ shell and change
into the configuration menu by entering +configure+. Alternatively,
you may jump straight into the Pacemaker configuration menu by typing
+crm configure+ directly from a shell prompt.

Then, set the following properties:

----
include::includes/pacemaker-properties.crm[]
----

<1> Setting +no-quorum-policy="ignore"+ is required in 2-node
    Pacemaker clusters for the following reason: if quorum enforcement
    is enabled, and one of the two nodes fails, then the remaining
    node cannot establish a _majority_ of quorum votes necessary to
    run services, and thus it is unable to take over any resources.
    The appropriate workaround is to ignore loss of quorum in the
    cluster. This is safe and necessary _only_ in 2-node clusters. Do
    not set this property in Pacemaker clusters with more than two
    nodes.
<2> Setting +pe-warn-series-max+, +pe-input-series-max+ and
    +pe-error-series-max+ to 1000 instructs Pacemaker to keep a longer
    history of the inputs processed, and errors and warnings
    generated, by its Policy Engine. This history is typically useful
    in case cluster troubleshooting becomes necessary.
<3> Pacemaker uses an event-driven approach to cluster state
    processing. However, certain Pacemaker actions occur at a
    configurable interval, +cluster-recheck-interval+, which defaults
    to 15 minutes. It is usually prudent to reduce this to a shorter
    interval, such as 5 or 3 minutes.

Once you have made these changes, you may +commit+ the updated
configuration.
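
The +pacemaker-properties.crm+ file included above is not reproduced
in this guide. As a sketch only, a set of property definitions
consistent with the callouts above might look like the following when
entered at the +crm configure+ prompt (the 5-minute recheck interval
is just one of the values suggested above):

----
# Illustrative property settings -- enter at the "crm configure" prompt
property no-quorum-policy="ignore" \
         pe-warn-series-max="1000" \
         pe-input-series-max="1000" \
         pe-error-series-max="1000" \
         cluster-recheck-interval="5min"
commit
----

You can inspect the resulting configuration afterwards with
+crm configure show+.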