[[ch-intro]]
== Introduction to OpenStack High Availability

High-availability systems, fundamentally, seek to minimize two things:

* *System downtime* -- the unavailability of a _user-facing_ service
  beyond a specified maximum amount of time, and
* *Data loss* -- the accidental deletion or destruction of data.

It is important to understand that most high-availability systems can
_guarantee_ protection against these issues only in the face of a
single failure event. They are also expected to protect against
cascading failures, where an originally singular failure deteriorates
into a series of consequential failures.

A crucial aspect of high availability is thus the elimination of
single points of failure (SPOFs). A SPOF is an individual piece of
equipment or software whose failure can cause system downtime or
data loss. Eliminating SPOFs typically includes:

* Redundancy of network components, such as switches and routers,
* Redundancy of applications and automatic service migration,
* Redundancy of storage components,
* Redundancy of facility services such as power, air conditioning,
  fire protection, and others.

In contrast, in the face of multiple _independent_ (non-interrelated)
failures, high availability becomes a best-effort affair. In such an
event, most high-availability systems tend to favor protecting data
over maintaining availability.

High-availability systems typically achieve uptime of 99.99% or more,
meaning less than an hour (about 53 minutes) of cumulative downtime per
year. From this, it follows that highly available systems are
generally expected to keep recovery times in the face of a failure on
the order of 1-2 minutes, sometimes significantly less.
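
The downtime figure above follows directly from the uptime percentage. The following is a minimal sketch of that arithmetic in Python. The function names are illustrative and not part of OpenStack, and a 365-day year is assumed.

```python
# Minutes in a 365-day year (an assumption; a leap year adds one more day).
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_budget_minutes(availability_percent: float) -> float:
    """Return the yearly downtime, in minutes, allowed by an uptime target."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return MINUTES_PER_YEAR * unavailable_fraction


def max_outages_per_year(availability_percent: float,
                         recovery_minutes: float) -> int:
    """Return how many outages of the given length fit within the budget."""
    budget = downtime_budget_minutes(availability_percent)
    return int(budget // recovery_minutes)


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        minutes = downtime_budget_minutes(target)
        print(f"{target}% uptime allows about {minutes:.1f} minutes per year")
```

Under these assumptions, a 99.99% target leaves room for only a couple of dozen two-minute recoveries per year, which is why recovery times need to stay short.
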

OpenStack can currently meet such availability requirements for its
own infrastructure services, meaning that an uptime of 99.99% is
feasible for the OpenStack infrastructure proper. At this time,
OpenStack _cannot_ guarantee 99.99% availability for individual guest
instances.

[[stateless-vs-stateful]]
=== Stateless vs. Stateful Services

How you prevent a single point of failure depends partly on whether a service is stateless. For example, the nova-scheduler service is stateless: you make a request, it provides a response, and no further attention is required; subsequent requests do not depend on the results of the first. All you need to do to make nova-scheduler highly available is to provide redundant instances and load balance them. OpenStack services that are stateless include nova-api, nova-conductor, glance-api, keystone-api, neutron-api, and nova-scheduler.
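
As a rough illustration of why redundancy plus load balancing is enough for a stateless service, the sketch below rotates requests across interchangeable workers and skips any worker that is down. The `StatelessWorker` and `RoundRobinBalancer` classes are hypothetical stand-ins written for this example; they are not OpenStack or HAProxy code.

```python
from typing import List


class StatelessWorker:
    """Hypothetical stateless service: a reply depends only on the request."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.alive = True

    def __call__(self, request: str) -> str:
        if not self.alive:
            raise ConnectionError(f"{self.name} is unreachable")
        return f"{self.name} handled {request}"


class RoundRobinBalancer:
    """Hand each request to the next worker in turn, skipping failed ones."""

    def __init__(self, workers: List[StatelessWorker]) -> None:
        if not workers:
            raise ValueError("at least one worker is required")
        self._workers = list(workers)
        self._next = 0

    def dispatch(self, request: str) -> str:
        # Try each worker at most once, starting where the last call stopped.
        for _ in range(len(self._workers)):
            worker = self._workers[self._next]
            self._next = (self._next + 1) % len(self._workers)
            try:
                return worker(request)
            except ConnectionError:
                continue  # this worker is down; any other can answer
        raise RuntimeError("no healthy worker available")


if __name__ == "__main__":
    first = StatelessWorker("scheduler-1")
    second = StatelessWorker("scheduler-2")
    balancer = RoundRobinBalancer([first, second])
    print(balancer.dispatch("request-a"))
    first.alive = False  # simulate a failure of one instance
    print(balancer.dispatch("request-b"))
    print(balancer.dispatch("request-c"))
```

Because no worker keeps anything between requests, the caller cannot tell which one answered, so losing one instance only reduces capacity.
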

Stateful services, on the other hand, are more difficult to manage; a single action typically involves more than one request, so simply providing additional instances and load balancing will not solve the problem. One user-facing example of a stateful OpenStack service would be Horizon; if the UI reset itself every time the user went to a new page, it wouldn't be very useful. On a more basic level, the OpenStack database and message queue are also stateful, and must be managed accordingly to provide high availability.

How you manage stateful services depends partly on whether you choose an active/passive or active/active configuration.

[[ap-intro]]
=== Active/Passive

In an active/passive configuration, systems are set up to prevent availability and data loss problems by ensuring that, in the event of a problem, additional resources can be brought online to replace those that have failed. For example, OpenStack would write to the main database while maintaining a disaster recovery database that can be brought online in the event that the main database fails.

Typically, an active/passive installation looks like this:

* Redundant instances of stateless services are available, and requests are load balanced using a virtual IP address and a load balancer such as HAProxy.
* Stateful services are managed in such a way that in the event of a system failure, a replacement resource can be brought online. A separate application (such as Pacemaker/Corosync) monitors these services, bringing the backup online as necessary.
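
The monitoring role described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not how Pacemaker/Corosync is implemented: the `Node` and `FailoverMonitor` names are hypothetical, and health is a plain flag rather than a real health check.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical host for a stateful service."""
    name: str
    healthy: bool = True
    online: bool = False  # only the active node serves requests


class FailoverMonitor:
    """Keep one healthy node online, promoting a standby on failure."""

    def __init__(self, nodes: List[Node]) -> None:
        if not nodes:
            raise ValueError("at least one node is required")
        self.nodes = nodes
        self.active: Optional[Node] = None
        self.check()

    def check(self) -> Optional[Node]:
        """Run one monitoring pass and return the node that is now active."""
        if self.active is not None and self.active.healthy:
            return self.active  # nothing to do
        if self.active is not None:
            self.active.online = False  # take the failed node out of service
        # Promote the first healthy node, if any remains.
        self.active = next((n for n in self.nodes if n.healthy), None)
        if self.active is not None:
            self.active.online = True
        return self.active


if __name__ == "__main__":
    primary = Node("db-primary")
    standby = Node("db-standby")
    monitor = FailoverMonitor([primary, standby])
    print("active:", monitor.active.name)
    primary.healthy = False  # simulate a failure of the main database
    print("active after failure:", monitor.check().name)
```

The standby does no work until `check()` notices the failure, so the time between monitoring passes contributes directly to recovery time.
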

[[aa-intro]]
=== Active/Active

In an active/active configuration, the system is also set up with a backup, but rather than bringing it online when there's a problem, the system manages both the main and redundant systems concurrently. This way, if there's a problem, the user is unlikely to even notice; the backup system is already online and takes on the increased load while the problem system is dealt with.

Typically, an active/active installation looks like this:

* Redundant instances of stateless services are available, and requests are load balanced using a virtual IP address and a load balancer such as HAProxy.
* Stateful services are managed in such a way that services are redundant, and that all instances have an identical state. For example, updates to one instance of a database would also update all other instances. This way a request to one instance is the same as a request to any other. A load balancer manages the traffic to these systems, ensuring that working systems always handle the request.
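
To make the "identical state" idea concrete, here is a minimal Python sketch in which every write is applied to all live replicas and reads are served by whichever live replica is next in line. The `Replica` and `ActiveActiveStore` classes are hypothetical, and the sketch deliberately ignores how a failed replica catches up before rejoining.

```python
from typing import Dict, List


class Replica:
    """Hypothetical copy of a stateful service's data."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.alive = True
        self.data: Dict[str, str] = {}


class ActiveActiveStore:
    """Replicate writes to all live replicas and balance reads across them."""

    def __init__(self, replicas: List[Replica]) -> None:
        if not replicas:
            raise ValueError("at least one replica is required")
        self.replicas = list(replicas)
        self._next = 0

    def write(self, key: str, value: str) -> None:
        """Apply the update to every live replica so their states match."""
        live = [r for r in self.replicas if r.alive]
        if not live:
            raise RuntimeError("no live replica available")
        for replica in live:
            replica.data[key] = value

    def read(self, key: str) -> str:
        """Serve the read from the next live replica in round-robin order."""
        for _ in range(len(self.replicas)):
            replica = self.replicas[self._next]
            self._next = (self._next + 1) % len(self.replicas)
            if replica.alive:
                return replica.data[key]
        raise RuntimeError("no live replica available")


if __name__ == "__main__":
    first = Replica("db-1")
    second = Replica("db-2")
    store = ActiveActiveStore([first, second])
    store.write("instance-42", "ACTIVE")
    first.alive = False  # simulate a failure; reads continue from db-2
    print(store.read("instance-42"))
```

Unlike the active/passive case, no promotion step is needed here: the surviving replica is already serving traffic when its peer fails.
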

This document discusses some of the more common ways to implement these architectures, but they are by no means the only ways to do so. The important thing is to make sure that your services are redundant and available; how you achieve that is up to you. Let's look at some options for making that happen.