Merge "[arch-design-draft] Reorganise operators requirements chapter"
commit dc5eeb2cb5
@ -0,0 +1,26 @@
=================
The bleeding edge
=================

The number of organizations running at massive scale is a small proportion of
the OpenStack community; it is therefore important to share related issues
with the community and be a vocal advocate for resolving them. Some issues
only manifest when operating at large scale, and the number of organizations
able to duplicate and validate an issue is small, so it is important to
document and dedicate resources to their resolution.

In some cases, the resolution to the problem is ultimately to deploy a more
recent version of OpenStack. Alternatively, when you must resolve an issue in
a production environment where rebuilding the entire environment is not an
option, it is sometimes possible to deploy updates to specific underlying
components in order to resolve issues or gain significant performance
improvements. Although this may appear to expose the deployment to increased
risk and instability, in many cases the issue being worked around may simply
be one that has not yet been discovered upstream.

We recommend building a development and operations organization that is
responsible for creating desired features, diagnosing and resolving issues,
and building the infrastructure for large scale continuous integration tests
and continuous deployment. This helps catch bugs early and makes deployments
faster and easier. In addition to development resources, we also recommend the
recruitment of experts in the fields of message queues, databases, distributed
systems, networking, cloud, and storage.
@ -0,0 +1,61 @@
=================
Capacity planning
=================

An important consideration in running a cloud over time is projecting growth
and utilization trends in order to plan capital expenditures for the short and
long term. Gather utilization meters for compute, network, and storage, along
with historical records of these meters. While securing major anchor tenants
can lead to rapid jumps in the utilization rates of all resources, the steady
adoption of the cloud inside an organization or by consumers in a public
offering also creates a steady trend of increased utilization.

Capacity constraints for a general purpose cloud environment include:

* Compute limits
* Storage limits

A relationship exists between the size of the compute environment and the
number of OpenStack infrastructure controller nodes required to support it.

Increasing the size of the supporting compute environment increases the
network traffic and messages, adding load to the controller or
networking nodes. Effective monitoring of the environment will help with
capacity decisions on scaling.

Compute nodes automatically attach to OpenStack clouds, resulting in a
horizontally scaling process when adding extra compute capacity to an
OpenStack cloud. Additional processes are required to place nodes into
appropriate availability zones and host aggregates. When adding
additional compute nodes to environments, ensure identical or functionally
compatible CPUs are used; otherwise, live migration features will break.
It is necessary to add rack capacity or network switches as scaling out
compute hosts directly affects network and data center resources.

Compute host components can also be upgraded to account for increases in
demand; this is known as vertical scaling. Upgrading CPUs with more
cores, or increasing the overall server memory, can add extra needed
capacity depending on whether the running applications are more CPU
intensive or memory intensive.

Another option is to assess the average workloads and increase the
number of instances that can run within the compute environment by
adjusting the overcommit ratio.

.. note::

   It is important to remember that changing the CPU overcommit ratio
   can have a detrimental effect and increase the potential for noisy
   neighbor issues.
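
The overcommit ratios are Compute configuration options. As a hedged
illustration only, the following console snippet shows how the allocation
ratios might appear on a node; the file location and the example values are
assumptions that vary by distribution and release:

.. code-block:: console

   # Assumed location of the Compute configuration file; the values shown
   # are the classic defaults, and raising cpu_allocation_ratio permits
   # more instances per host at the cost of the noisy neighbor risk above.
   $ grep allocation_ratio /etc/nova/nova.conf
   cpu_allocation_ratio = 16.0
   ram_allocation_ratio = 1.5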

Insufficient disk capacity could also have a negative effect on overall
performance, including CPU and memory usage. Depending on the back-end
architecture of the OpenStack Block Storage layer, capacity includes
adding disk shelves to enterprise storage systems or installing
additional Block Storage nodes. Upgrading directly attached storage
installed in compute hosts, and adding capacity to the shared storage
for additional ephemeral storage to instances, may be necessary.

For a deeper discussion on many of these topics, refer to the `OpenStack
Operations Guide <http://docs.openstack.org/ops>`_.
@ -0,0 +1,5 @@
=============================
Integration with external IDP
=============================

.. TODO
@ -0,0 +1,449 @@
==================
Hardware selection
==================

Hardware selection involves three key areas:

* Network

* Compute

* Storage

Network hardware selection
~~~~~~~~~~~~~~~~~~~~~~~~~~

The network architecture determines which network hardware will be
used. Networking software is determined by the selected networking
hardware.

There are more subtle design impacts that need to be considered. The
selection of certain networking hardware (and the networking software)
affects the management tools that can be used. There are exceptions to
this; the rise of *open* networking software that supports a range of
networking hardware means there are instances where the relationship
between networking hardware and networking software is not as tightly
defined.

For a compute-focused architecture, we recommend designing the network
architecture using a scalable network model that makes it easy to add
capacity and bandwidth. A good example of such a model is the leaf-spine
model. In this type of network design, it is possible to easily add additional
bandwidth as well as scale out to additional racks of gear. It is important to
select network hardware that supports the required port count, port speed, and
port density while also allowing for future growth as workload demands
increase. It is also important to evaluate where in the network architecture
it is valuable to provide redundancy.

Some of the key considerations that should be included in the selection
of networking hardware include:

Port count
  The design will require networking hardware that has the requisite
  port count.

Port density
  The network design will be affected by the physical space that is
  required to provide the requisite port count. A higher port density
  is preferred, as it leaves more rack space for compute or storage
  components that may be required by the design. This can also lead
  into considerations about fault domains and power density. Higher
  density switches are more expensive; therefore, it is important not
  to over-design the network.

Port speed
  The networking hardware must support the proposed network speed, for
  example: 1 GbE, 10 GbE, or 40 GbE (or even 100 GbE).

Redundancy
  User requirements for high availability and cost considerations
  influence the required level of network hardware redundancy.
  Network redundancy can be achieved by adding redundant power
  supplies or paired switches.

  .. note::

     If this is a requirement, the hardware must support this
     configuration. User requirements determine if a completely
     redundant network infrastructure is required.

Power requirements
  Ensure that the physical data center provides the necessary power
  for the selected network hardware.

  .. note::

     This is not an issue for top of rack (ToR) switches. This may be an
     issue for spine switches in a leaf and spine fabric, or end of row
     (EoR) switches.

Protocol support
  It is possible to gain more performance out of a single storage
  system by using specialized network technologies such as RDMA, SRP,
  iSER, and SCST. The specifics of using these technologies are beyond
  the scope of this book.

There is no single best practice architecture for the networking
hardware supporting an OpenStack cloud that will apply to all implementations.
Some of the key factors that will have a major influence on selection of
networking hardware include:

Connectivity
  All nodes within an OpenStack cloud require network connectivity. In
  some cases, nodes require access to more than one network segment.
  The design must encompass sufficient network capacity and bandwidth
  to ensure that all communications within the cloud, both north-south
  and east-west traffic, have sufficient resources available.

Scalability
  The network design should encompass a physical and logical network
  design that can be easily expanded upon. Network hardware should
  offer the appropriate types of interfaces and speeds that are
  required by the hardware nodes.

Availability
  To ensure access to nodes within the cloud is not interrupted,
  we recommend that the network architecture identify any single
  points of failure and provide some level of redundancy or fault
  tolerance. The network infrastructure often involves use of
  networking protocols such as LACP, VRRP, or others to achieve a highly
  available network connection. It is also important to consider the
  networking implications on API availability. We recommend designing a load
  balancing solution within the network architecture to ensure that the APIs,
  and potentially other services in the cloud, are highly available.

Compute (server) hardware selection
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Consider the following factors when selecting compute (server) hardware:

* Server density
    A measure of how many servers can fit into a given measure of
    physical space, such as a rack unit [U].

* Resource capacity
    The number of CPU cores, how much RAM, or how much storage a given
    server delivers.

* Expandability
    The number of additional resources you can add to a server before it
    reaches capacity.

* Cost
    The relative cost of the hardware weighed against the level of
    design effort needed to build the system.

Weigh these considerations against each other to determine the best
design for the desired purpose. For example, increasing server density
means sacrificing resource capacity or expandability. Increasing resource
capacity and expandability can increase cost but decrease server density.
Decreasing cost often means decreasing supportability, server density,
resource capacity, and expandability.

Compute capacity (CPU cores and RAM capacity) is a secondary
consideration for selecting server hardware. The required
server hardware must supply adequate CPU sockets, additional CPU cores,
and more RAM; network connectivity and storage capacity are not as
critical. The hardware needs to provide enough network connectivity and
storage capacity to meet the user requirements.

For a compute-focused cloud, emphasis should be on server
hardware that can offer more CPU sockets, more CPU cores, and more RAM.
Network connectivity and storage capacity are less critical.

When designing an OpenStack cloud architecture, you must
consider whether you intend to scale up or scale out. Selecting a
smaller number of larger hosts, or a larger number of smaller hosts,
depends on a combination of factors: cost, power, cooling, physical rack
and floor space, support-warranty, and manageability.

Consider the following when selecting a server hardware form factor suited to
your OpenStack design architecture:

* Most blade servers can support dual-socket multi-core CPUs. To avoid
  this CPU limit, select ``full width`` or ``full height`` blades. Be
  aware, however, that this also decreases server density. For example,
  high density blade servers such as HP BladeSystem or Dell PowerEdge
  M1000e support up to 16 servers in only ten rack units. Using
  half-height blades is twice as dense as using full-height blades,
  which results in only eight servers per ten rack units.

* 1U rack-mounted servers have the ability to offer greater server density
  than a blade server solution, but are often limited to dual-socket,
  multi-core CPU configurations. It is possible to place forty 1U servers
  in a rack, providing space for the top of rack (ToR) switches, compared
  to 32 full width blade servers.

  To obtain greater than dual-socket support in a 1U rack-mount form
  factor, customers need to buy their systems from Original Design
  Manufacturers (ODMs) or second-tier manufacturers.

  .. warning::

     This may cause issues for organizations that have preferred
     vendor policies or concerns with support and hardware warranties
     of non-tier 1 vendors.

* 2U rack-mounted servers provide quad-socket, multi-core CPU support,
  but with a corresponding decrease in server density (half the density
  that 1U rack-mounted servers offer).

* Larger rack-mounted servers, such as 4U servers, often provide even
  greater CPU capacity, commonly supporting four or even eight CPU
  sockets. These servers have greater expandability, but such servers
  have much lower server density and are often more expensive.

* ``Sled servers`` are rack-mounted servers that support multiple
  independent servers in a single 2U or 3U enclosure. These deliver
  higher density as compared to typical 1U or 2U rack-mounted servers.
  For example, many sled servers offer four independent dual-socket
  nodes in 2U for a total of eight CPU sockets in 2U.

Other factors that influence server hardware selection for an OpenStack
design architecture include:

Instance density
  More hosts are required to support the anticipated scale
  if the design architecture uses dual-socket hardware designs
  (a worked sizing example appears after this list).

  For a general purpose OpenStack cloud, sizing is an important consideration.
  The expected or anticipated number of instances that each hypervisor can
  host is a common meter used in sizing the deployment. The selected server
  hardware needs to support the expected or anticipated instance density.

Host density
  Another option to address the higher host count is to use a
  quad-socket platform. Taking this approach decreases host density
  which also increases rack count. This configuration affects the
  number of power connections and also impacts network and cooling
  requirements.

  Physical data centers have limited physical space, power, and
  cooling. The number of hosts (or hypervisors) that can be fitted
  into a given metric (rack, rack unit, or floor tile) is another
  important method of sizing. Floor weight is an often overlooked
  consideration. The data center floor must be able to support the
  weight of the proposed number of hosts within a rack or set of
  racks. These factors need to be applied as part of the host density
  calculation and server hardware selection.

Power and cooling density
  The power and cooling density requirements might be lower than with
  blade, sled, or 1U server designs due to lower host density (by
  using 2U, 3U or even 4U server designs). For data centers with older
  infrastructure, this might be a desirable feature.

  Data centers have a specified amount of power fed to a given rack or
  set of racks. Older data centers may have a power density as low as
  20 amps per rack, while more recent data centers can be architected
  to support power densities as high as 120 amps per rack. The selected
  server hardware must take power density into account.

Network connectivity
  The selected server hardware must have the appropriate number of
  network connections, as well as the right type of network
  connections, in order to support the proposed architecture. Ensure
  that, at a minimum, there are at least two diverse network
  connections coming into each rack.
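
To make the instance density trade-off concrete, the following
back-of-the-envelope calculation (referenced from the Instance density item
above) estimates the number of dual-socket hosts required; every figure in it
is an illustrative assumption, not a recommendation:

.. code-block:: console

   # Assumptions: 1000 instances of 4 vCPUs each, dual-socket hosts with
   # 2 x 12 physical cores, and a CPU overcommit ratio of 4.
   $ echo $(( (1000 * 4) / (2 * 12 * 4) ))
   41

Rounding up, roughly 42 such hosts would be needed before adding any headroom
for failure domains or growth; a quad-socket platform with the same cores per
socket would halve that count, with the power, cooling, and rack implications
described under Host density.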

The selection of form factors or architectures affects the selection of
server hardware. Ensure that the selected server hardware is configured
to support enough storage capacity (or storage expandability) to match
the requirements of the selected scale-out storage solution. Similarly, the
network architecture impacts the server hardware selection and vice
versa.

Hardware for general purpose OpenStack cloud
--------------------------------------------

Hardware for a general purpose OpenStack cloud should reflect a cloud
with no pre-defined usage model, designed to run a wide variety of
applications with varying resource usage requirements. These
applications include any of the following:

* RAM-intensive

* CPU-intensive

* Storage-intensive

Certain hardware form factors may better suit a general purpose
OpenStack cloud due to the requirement for an equal (or nearly equal)
balance of resources. Server hardware must provide the following:

* Equal (or nearly equal) balance of compute capacity (RAM and CPU)

* Network capacity (number and speed of links)

* Storage capacity (gigabytes or terabytes as well as Input/Output
  Operations Per Second (:term:`IOPS`))

The best form factor for server hardware supporting a general purpose
OpenStack cloud is driven by outside business and cost factors. No
single reference architecture applies to all implementations; the
decision must flow from user requirements, technical considerations, and
operational considerations.

Selecting storage hardware
~~~~~~~~~~~~~~~~~~~~~~~~~~

Storage hardware architecture is determined by selecting a specific storage
architecture. Determine the selection of storage architecture by
evaluating possible solutions against the critical factors: the user
requirements, technical considerations, and operational considerations.
Consider the following factors when selecting storage hardware:

Cost
  Storage can be a significant portion of the overall system cost. For
  an organization that is concerned with vendor support, a commercial
  storage solution is advisable, although it comes with a higher price
  tag. If minimizing initial capital expenditure is a priority, designing
  a system based on commodity hardware would apply. The trade-off is
  potentially higher support costs and a greater risk of
  incompatibility and interoperability issues.

Performance
  The latency of storage I/O requests indicates performance. Performance
  requirements affect which solution you choose.

Scalability
  Scalability, along with expandability, is a major consideration in a
  general purpose OpenStack cloud. It might be difficult to predict
  the final intended size of the implementation as there are no
  established usage patterns for a general purpose cloud. It might
  become necessary to expand the initial deployment in order to
  accommodate growth and user demand.

Expandability
  Expandability is a major architecture factor for storage solutions
  with a general purpose OpenStack cloud. A storage solution that
  expands to 50 PB is considered more expandable than a solution that
  only scales to 10 PB. This meter is related to scalability, which is
  the measure of a solution's performance as it expands.

General purpose cloud storage requirements
------------------------------------------

Using a scale-out storage solution with direct-attached storage (DAS) in
the servers is well suited for a general purpose OpenStack cloud. Cloud
services requirements determine your choice of scale-out solution. You
need to determine if a single, highly expandable and highly vertically
scalable, centralized storage array is suitable for your design. After
determining an approach, select the storage hardware based on these
criteria.

This list expands upon the potential impacts of including a particular
storage architecture (and corresponding storage hardware) in the
design for a general purpose OpenStack cloud:

Connectivity
  If storage protocols other than Ethernet are part of the storage solution,
  ensure the appropriate hardware has been selected. If a centralized storage
  array is selected, ensure that the hypervisor will be able to connect to
  that storage array for image storage.

Usage
  How the particular storage architecture will be used is critical for
  determining the architecture. Some of the configurations that will
  influence the architecture include whether it will be used by the
  hypervisors for ephemeral instance storage, or if OpenStack Object
  Storage will use it for object storage.

Instance and image locations
  Where instances and images will be stored will influence the
  architecture.

Server hardware
  If the solution is a scale-out storage architecture that includes
  DAS, it will affect the server hardware selection. This could ripple
  into the decisions that affect host density, instance density, power
  density, OS-hypervisor, management tools, and others.

A general purpose OpenStack cloud has multiple options. The key factors
that will have an influence on selection of storage hardware for a
general purpose OpenStack cloud are as follows:

Capacity
  Hardware resources selected for the resource nodes should be capable
  of supporting enough storage for the cloud services. Defining the
  initial requirements and ensuring the design can support adding
  capacity is important. Hardware nodes selected for object storage
  should be capable of supporting a large number of inexpensive disks
  with no reliance on RAID controller cards. Hardware nodes selected
  for block storage should be capable of supporting high speed storage
  solutions and RAID controller cards to provide performance and
  redundancy to storage at a hardware level. Selecting hardware RAID
  controllers that automatically repair damaged arrays will assist
  with the replacement and repair of degraded or deleted storage
  devices.

Performance
  Disks selected for object storage services do not need to be fast
  performing disks. We recommend that object storage nodes take
  advantage of the best cost per terabyte available for storage.
  Contrastingly, disks chosen for block storage services should take
  advantage of performance boosting features that may entail the use
  of SSDs or flash storage to provide high performance block storage
  pools. Storage performance of ephemeral disks used for instances
  should also be taken into consideration.

Fault tolerance
  Object storage resource nodes have no requirements for hardware
  fault tolerance or RAID controllers. It is not necessary to plan for
  fault tolerance within the object storage hardware because the
  object storage service provides replication between zones as a
  feature of the service. Block storage nodes, compute nodes, and
  cloud controllers should all have fault tolerance built in at the
  hardware level by making use of hardware RAID controllers and
  varying levels of RAID configuration. The level of RAID chosen
  should be consistent with the performance and availability
  requirements of the cloud.

Storage-focused cloud storage requirements
------------------------------------------

Storage-focused OpenStack clouds must address I/O intensive workloads.
These workloads are not CPU intensive, nor are they consistently network
intensive. The network may be heavily utilized to transfer storage data, but
these workloads are not otherwise network intensive.

The selection of storage hardware determines the overall performance and
scalability of a storage-focused OpenStack design architecture. Several
factors impact the design process, including:

Latency is a key consideration in a storage-focused OpenStack cloud.
Using solid-state disks (SSDs) to minimize latency and to reduce CPU
delays caused by waiting for the storage increases performance. Use
RAID controller cards in compute hosts to improve the performance of the
underlying disk subsystem.

Depending on the storage architecture, you can adopt a scale-out
solution, or use a highly expandable and scalable centralized storage
array. If a centralized storage array meets your requirements, then the
array vendor determines the hardware selection. It is possible to build
a storage array using commodity hardware with open source software, but
doing so requires people with expertise to build such a system.

On the other hand, a scale-out storage solution that uses
direct-attached storage (DAS) in the servers may be an appropriate
choice. This requires configuration of the server hardware to support
the storage solution.

Considerations affecting the storage architecture (and corresponding storage
hardware) of a storage-focused OpenStack cloud include:

Connectivity
  Ensure the connectivity matches the storage solution requirements. We
  recommend confirming that the network characteristics minimize latency
  to boost the overall performance of the design.

Latency
  Determine if the use case has consistent or highly variable latency.

Throughput
  Ensure that the storage solution throughput is optimized for your
  application requirements.

Server hardware
  Use of DAS impacts the server hardware choice and affects host
  density, instance density, power density, OS-hypervisor, and
  management tools.
@ -0,0 +1,33 @@
=========
Licensing
=========

The many different forms of license agreements for software are often written
with the use of dedicated hardware in mind. This model is relevant for the
cloud platform itself, including the hypervisor operating system and supporting
software for items such as database, RPC, backup, and so on. Consideration
must be given when offering Compute service instances and applications to end
users of the cloud, since the license terms for that software may need some
adjustment to be able to operate economically in the cloud.

Multi-site OpenStack deployments present additional licensing
considerations over and above regular OpenStack clouds, particularly
where site licenses are in use to provide cost-efficient access to
software licenses. The licensing for host operating systems, guest
operating systems, OpenStack distributions (if applicable),
software-defined infrastructure including network controllers and
storage systems, and even individual applications needs to be evaluated.

Topics to consider include:

* The definition of what constitutes a site in the relevant licenses,
  as the term does not necessarily denote a geographic or otherwise
  physically isolated location.

* Differentiations between "hot" (active) and "cold" (inactive) sites,
  where significant savings may be made in situations where one site is
  a cold standby for disaster recovery purposes only.

* Certain locations might require local vendors to provide support and
  services for each site, which may vary with the licensing agreement in
  place.
@ -0,0 +1,27 @@
======================
Logging and monitoring
======================

OpenStack clouds require appropriate monitoring platforms to catch and
manage errors.

.. note::

   We recommend leveraging existing monitoring systems to see if they
   are able to effectively monitor an OpenStack environment.

Specific meters that are critically important to capture include:

* Image disk utilization

* Response time to the Compute API
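
As a simple illustration of the second meter, the response time of the Compute
API can be sampled from any host with the client installed; the command below
is an assumed example, and any authenticated read-only request would serve
equally well:

.. code-block:: console

   # Time an authenticated request against the Compute API.
   $ time openstack server list --limit 1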

Logging and monitoring do not significantly differ for a multi-site OpenStack
cloud. The tools described in the `Logging and monitoring chapter
<http://docs.openstack.org/openstack-ops/content/logging_monitoring.html>`__ of
the Operations Guide remain applicable. Logging and monitoring can be provided
on a per-site basis, as well as in a common centralized location.

When attempting to deploy logging and monitoring facilities to a centralized
location, care must be taken with the load placed on the inter-site networking
links.
@ -0,0 +1,59 @@
==============
Network design
==============

The network design for an OpenStack cluster includes decisions regarding
the interconnect needs within the cluster, the need to allow clients to
access their resources, and the need for operators to access the cluster for
maintenance. The bandwidth, latency, and reliability of these networks need
consideration.

Make additional design decisions about monitoring and alarming. This can
be an internal responsibility or the responsibility of the external
provider. In the case of using an external provider, service level
agreements (SLAs) likely apply. In addition, other operational
considerations such as bandwidth, latency, and jitter can be part of an
SLA.

Consider the ability to upgrade the infrastructure. As demand for
network resources increases, operators add additional IP address blocks
and add additional bandwidth capacity. In addition, consider managing
hardware and software life cycle events, for example upgrades,
decommissioning, and outages, while avoiding service interruptions for
tenants.

Factor maintainability into the overall network design. This includes
the ability to manage and maintain IP addresses as well as the use of
overlay identifiers including VLAN tag IDs, GRE tunnel IDs, and MPLS
labels. As an example, if you need to change all of the IP addresses
on a network, a process known as renumbering, then the design must
support this function.

Address network-focused applications when considering certain
operational realities. For example, consider the impending exhaustion of
IPv4 addresses, the migration to IPv6, and the use of private networks
to segregate different types of traffic that an application receives or
generates. In the case of IPv4 to IPv6 migrations, applications should
follow best practices for storing IP addresses. We recommend you avoid
relying on IPv4 features that did not carry over to the IPv6 protocol or
have differences in implementation.

To segregate traffic, allow applications to create a private tenant
network for database and storage network traffic. Use a public network
for services that require direct client access from the internet. Upon
segregating the traffic, consider quality of service (QoS) and security
to ensure each network has the required level of service.
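
As a hedged example of such segregation, a tenant can keep database and
storage replication traffic on its own network while exposing the application
on a public network. The names and subnet range below are illustrative
assumptions, and older client versions may require the equivalent ``neutron``
commands instead:

.. code-block:: console

   # Create a private tenant network and subnet for database and storage
   # traffic.
   $ openstack network create database-net
   $ openstack subnet create --network database-net \
     --subnet-range 192.168.10.0/24 database-subnet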

Finally, consider the routing of network traffic. For some applications,
develop a complex policy framework for routing. To create a routing
policy that satisfies business requirements, consider the economic cost
of transmitting traffic over expensive links versus cheaper links, in
addition to bandwidth, latency, and jitter requirements.

Additionally, consider how to respond to network events. As an example,
how load transfers from one link to another during a failure scenario
could be a factor in the design. If you do not plan network capacity
correctly, failover traffic could overwhelm other ports or network links
and create a cascading failure scenario. In this case, traffic that
fails over to one link overwhelms that link and then moves to the
subsequent links until all network traffic stops.
@ -0,0 +1,20 @@
==========================
Operator access to systems
==========================

As more and more applications are migrated into a cloud-based environment, we
get to a position where systems that are critical for cloud operations are
hosted within the cloud that is being operated. Consideration must be given to
the ability of operators to access the systems and tools required
in order to resolve a major incident.

If a significant portion of the cloud is on externally managed systems,
prepare for situations where it may not be possible to make changes.
Additionally, providers may differ on how infrastructure must be managed and
exposed. This can lead to delays in root cause analysis where each party
insists the blame lies with the other provider.

Ensure that the network structure connects all clouds to form an integrated
system, keeping in mind the state of handoffs. These handoffs must both be as
reliable as possible and include as little latency as possible to ensure the
best performance of the overall system.
@ -0,0 +1,14 @@
=================
Policy management
=================

OpenStack provides a default set of Role Based Access Control (RBAC)
policies, defined in a ``policy.json`` file, for each service. Operators
edit these files to customize the policies for their OpenStack
installation. If the application of consistent RBAC policies across
sites is a requirement, then it is necessary to ensure proper
synchronization of the ``policy.json`` files to all installations.

This must be done using system administration tools such as rsync, as
functionality for synchronizing policies across regions is not currently
provided within OpenStack.
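
A minimal sketch of such synchronization, assuming a controller at a second
site reachable over SSH and using the Compute policy file as an example (the
hostname and path are placeholders), might look like the following:

.. code-block:: console

   # Push the local Compute policy file to the controller at a second site.
   $ rsync -avz /etc/nova/policy.json \
     controller.site2.example.com:/etc/nova/policy.json

The same approach applies to the ``policy.json`` file of any other service
whose policies must stay consistent, followed by whatever service reload the
deployment requires.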
@ -0,0 +1,24 @@
================
Quota management
================

Quotas are used to set operational limits to prevent system capacities
from being exhausted without notification. They are currently enforced
at the tenant (or project) level rather than at the user level.

Quotas are defined on a per-region basis. Operators can define identical
quotas for tenants in each region of the cloud to provide a consistent
experience, or even create a process for synchronizing allocated quotas
across regions. It is important to note that only the operational limits
imposed by the quotas will be aligned; consumption of quotas by users
will not be reflected between regions.

For example, given a cloud with two regions, if the operator grants a
user a quota of 25 instances in each region then that user may launch a
total of 50 instances spread across both regions. They may not, however,
launch more than 25 instances in any single region.
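
Because quotas are set independently in each region, the operator applies the
same value once per region. The following is a hedged sketch using the unified
command-line client; the project name and region names are assumptions:

.. code-block:: console

   # Set the instance quota to 25 for the same project in two regions.
   $ openstack --os-region-name RegionOne quota set --instances 25 demo-project
   $ openstack --os-region-name RegionTwo quota set --instances 25 demo-project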

For more information on managing quotas refer to the `Managing projects
and users chapter
<http://docs.openstack.org/openstack-ops/content/projects_users.html>`__
of the OpenStack Operators Guide.
@ -0,0 +1,12 @@
===================
Skills and training
===================

Projecting growth for storage, networking, and compute is only one aspect of a
growth plan for running OpenStack at massive scale. Growing and nurturing
development and operational staff is an additional consideration. Sending team
members to OpenStack conferences and meetup events, and encouraging active
participation in the mailing lists and committees, is a very important way to
maintain skills and forge relationships in the community. For a list of
OpenStack training providers in the marketplace, see:
http://www.openstack.org/marketplace/training/.
doc/arch-design-draft/source/operator-requirements-sla.rst
@ -0,0 +1,89 @@
==================
SLA considerations
==================

Service-level agreements (SLAs) are contractual obligations that ensure the
availability of a service. When designing an OpenStack cloud, factoring in
promises of availability implies a certain level of redundancy and resiliency.

Expectations set by the SLAs directly affect when and where you should
implement redundancy and high availability. SLAs define the levels of
availability that drive the technical design, often with penalties for not
meeting contractual obligations.

SLA terms that affect the design include:

* API availability guarantees implying multiple infrastructure services
  and highly available load balancers.

* Network uptime guarantees affecting switch design, which might
  require redundant switching and power.

* Networking security policy requirements that need to be factored in to
  deployments.

In any environment larger than just a few hosts, it is important to note that
there are two separate areas that might be subject to an SLA. Firstly, the
services that provide actual virtualization, networking, and storage are
subject to an SLA that customers of the environment are most likely to want to
be continuously available. This is often referred to as the 'Data Plane'.

Secondly, there are the ancillary services such as API endpoints, and the
various services that control CRUD operations. These are often referred to as
the 'Control Plane'. The services in this category are usually subject to a
different SLA expectation and therefore may be better suited to separate
hardware or at least separate containers from the Data Plane services.

To effectively run cloud installations, initial downtime planning
includes creating processes and architectures that support the
following:

* Planned (maintenance)
* Unplanned (system faults)

It is important to determine as part of the SLA negotiation which party is
responsible for monitoring and starting up Compute service instances should an
outage occur which shuts them down.

Resiliency of the overall system and of individual components is going to be
dictated by the requirements of the SLA, meaning that designing for
:term:`high availability (HA)` can have cost ramifications.

Upgrading, patching, and changing configuration items may require
downtime for some services. In these cases, stopping services that form the
Control Plane may leave the Data Plane unaffected, while actions such as
live-migration of Compute instances may be required in order to perform any
actions that require downtime to Data Plane components whilst still meeting SLA
expectations.

Note that there are many services that are outside the realms of pure OpenStack
code which affect the ability of any given design to meet the SLA, including:

* Database services, such as ``MySQL`` or ``PostgreSQL``.
* Services providing RPC, such as ``RabbitMQ``.
* External network attachments.
* Physical constraints such as power, rack space, network cabling, etc.
* Shared storage including SAN based arrays, storage clusters such as ``Ceph``,
  and/or NFS services.

Depending on the design, some Network service functions may fall into both the
Control and Data Plane categories. For example, the neutron L3 agent service
may be considered a Control Plane component, but the routers themselves would
be Data Plane.

It may be that a given set of requirements could dictate an SLA that suggests
some services need HA, and some do not.

In a design with multiple regions, the SLA would also need to take into
consideration the use of shared services such as the Identity service,
Dashboard, and so on.

Any SLA negotiation must also take into account the reliance on third parties
for critical aspects of the design. For example, if there is an existing SLA
on a component such as a storage system, the cloud SLA must take into account
this limitation. If the required SLA for the cloud exceeds the agreed uptime
levels of the components comprising that cloud, additional redundancy would be
required. This consideration is critical to review in a hybrid cloud design,
where there are multiple third parties involved.
@ -0,0 +1,226 @@
==================
Software selection
==================

Software selection, particularly for a general purpose OpenStack architecture
design, involves three areas:

* Operating system (OS) and hypervisor

* OpenStack components

* Supplemental software

Operating system and hypervisor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The operating system (OS) and hypervisor have a significant impact on
the overall design. Selecting a particular operating system and
hypervisor can directly affect server hardware selection. Make sure the
storage hardware and topology support the selected operating system and
hypervisor combination. Also ensure the networking hardware selection
and topology will work with the chosen operating system and hypervisor
combination.

Some areas that could be impacted by the selection of OS and hypervisor
include:

Cost
  Selecting a commercially supported hypervisor, such as Microsoft
  Hyper-V, will result in a different cost model than
  community-supported open source hypervisors such as
  :term:`KVM<kernel-based VM (KVM)>` or :term:`Xen`. When
  comparing open source OS solutions, choosing Ubuntu over Red Hat
  (or vice versa) will have an impact on cost due to support
  contracts.

Support
  Depending on the selected hypervisor, staff should have the
  appropriate training and knowledge to support the selected OS and
  hypervisor combination. If they do not, training will need to be
  provided, which could have a cost impact on the design.

Management tools
  The management tools used for Ubuntu and KVM differ from the
  management tools for VMware vSphere. Although both OS and hypervisor
  combinations are supported by OpenStack, there will be very
  different impacts to the rest of the design as a result of the
  selection of one combination versus the other.

Scale and performance
  Ensure that selected OS and hypervisor combinations meet the
  appropriate scale and performance requirements. The chosen
  architecture will need to meet the targeted instance-host ratios
  with the selected OS-hypervisor combinations.

Security
  Ensure that the design can accommodate regular periodic
  installations of application security patches while maintaining
  required workloads. The frequency of security patches for the
  proposed OS-hypervisor combination will have an impact on
  performance and the patch installation process could affect
  maintenance windows.

Supported features
  Determine which OpenStack features are required. This will often
  determine the selection of the OS-hypervisor combination. Some
  features are only available with specific operating systems or
  hypervisors.

Interoperability
  You will need to consider how the OS and hypervisor combination
  interacts with other operating systems and hypervisors, including
  other software solutions. Operational troubleshooting tools for one
  OS-hypervisor combination may differ from the tools used for another
  OS-hypervisor combination and, as a result, the design will need to
  address whether the two sets of tools need to interoperate.

OpenStack components
~~~~~~~~~~~~~~~~~~~~

Selecting which OpenStack components are included in the overall design
is important. Some OpenStack components, like Compute and the Image service,
are required in every architecture. Other components, like
Orchestration, are not always required.

A compute-focused OpenStack design architecture may contain the following
components:

* Identity (keystone)

* Dashboard (horizon)

* Compute (nova)

* Object Storage (swift)

* Image (glance)

* Networking (neutron)

* Orchestration (heat)

.. note::

   A compute-focused design is less likely to include OpenStack Block
   Storage. However, there may be some situations where the need for
   performance requires a block storage component to improve data I/O.

Excluding certain OpenStack components can limit or constrain the
functionality of other components. For example, if the architecture
includes Orchestration but excludes Telemetry, then the design will not
be able to take advantage of Orchestration's auto-scaling functionality.
It is important to research the component interdependencies in
conjunction with the technical requirements before deciding on the final
architecture.

Networking software
~~~~~~~~~~~~~~~~~~~

OpenStack Networking (neutron) provides a wide variety of networking
services for instances. There are many additional networking software
packages that can be useful when managing OpenStack components. Some
examples include:

* Software to provide load balancing

* Network redundancy protocols

* Routing daemons

Some of these software packages are described in more detail in the
`OpenStack network nodes chapter
<http://docs.openstack.org/ha-guide/networking-ha.html>`_ in the OpenStack
High Availability Guide.

For a general purpose OpenStack cloud, the OpenStack infrastructure
components need to be highly available. If the design does not include
hardware load balancing, networking software packages like HAProxy will
need to be included.

For a compute-focused OpenStack cloud, the OpenStack infrastructure
components must be highly available. If the design does not include
hardware load balancing, you must add networking software packages, for
example, HAProxy.

Management software
~~~~~~~~~~~~~~~~~~~

Management software includes software for providing:

* Clustering

* Logging

* Monitoring

* Alerting

.. important::

   The factors for determining which software packages in this category
   to select are outside the scope of this design guide.

The selected supplemental software solution impacts and affects the overall
OpenStack cloud design. This includes software for providing clustering,
logging, monitoring, and alerting.

The inclusion of clustering software, such as Corosync or Pacemaker, is
primarily determined by the availability of the cloud infrastructure and
the complexity of supporting the configuration after it is deployed. The
`OpenStack High Availability Guide <http://docs.openstack.org/ha-guide/>`_
provides more details on the installation and configuration of Corosync
and Pacemaker, should these packages need to be included in the design.

Operational considerations determine the requirements for logging,
monitoring, and alerting. Each of these sub-categories includes various
options.

For example, in the logging sub-category you could select Logstash,
Splunk, Log Insight, or another log aggregation-consolidation tool.
Store logs in a centralized location to facilitate performing analytics
against the data. Log data analytics engines can also provide automation
and issue notification by providing a mechanism to both alert and
automatically attempt to remediate some of the more commonly known
issues.

If these software packages are required, the design must account for the
additional resource consumption (CPU, RAM, storage, and network
bandwidth). Some other potential design impacts include:

* OS-hypervisor combination
    Ensure that the selected logging, monitoring, or alerting tools support
    the proposed OS-hypervisor combination.

* Network hardware
    The network hardware selection needs to be supported by the logging,
    monitoring, and alerting software.

Database software
~~~~~~~~~~~~~~~~~

Most OpenStack components require access to back-end database services
to store state and configuration information. Choose an appropriate
back-end database which satisfies the availability and fault tolerance
requirements of the OpenStack services.

MySQL is the default database for OpenStack, but other compatible
databases are available.

.. note::

   Telemetry uses MongoDB.

The chosen high availability database solution changes according to the
selected database. MySQL, for example, provides several options. Use a
replication technology such as Galera for active-active clustering. For
active-passive clustering, use some form of shared storage. Each of these
potential solutions has an impact on the design:

* Solutions that employ Galera/MariaDB require at least three MySQL
  nodes (a simple membership check is sketched after this list).

* MongoDB has its own design considerations for high availability.

* OpenStack design, generally, does not include shared storage.
  However, for some high availability designs, certain components might
  require it depending on the specific implementation.
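
As referenced in the first bullet above, a quick way to confirm that a
Galera/MariaDB deployment actually has the expected number of members is to
query the cluster size on any node; the check below assumes local database
administrator access:

.. code-block:: console

   # A three-node Galera cluster should report a cluster size of 3.
   $ mysql -u root -p -e "SHOW STATUS LIKE 'wsrep_cluster_size';"
   +--------------------+-------+
   | Variable_name      | Value |
   +--------------------+-------+
   | wsrep_cluster_size | 3     |
   +--------------------+-------+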
@ -0,0 +1,17 @@
=======================
Support and maintenance
=======================

To be able to support and maintain an installation, OpenStack cloud
management requires operations staff to understand the design
architecture content. The operations and engineering staff skill level,
and level of separation, depend on the size and purpose of the
installation. Large cloud service providers, or telecom providers, are
more likely to be managed by specially trained, dedicated operations
organizations. Smaller implementations are more likely to rely on
support staff that need to take on combined engineering, design, and
operations functions.

Maintaining OpenStack installations requires a variety of technical
skills. You may want to consider using a third-party management company
with special expertise in managing OpenStack deployments.
@ -0,0 +1,50 @@
========
Upgrades
========

Running OpenStack with a focus on availability requires striking a balance
between stability and features. For example, it might be tempting to run an
older stable release branch of OpenStack to make deployments easier. However,
known issues that may be of some concern or only have minimal impact in smaller
deployments could become pain points as scale increases. Recent releases may
address well-known issues. The OpenStack community can help resolve reported
issues by applying the collective expertise of the OpenStack developers.

In multi-site OpenStack clouds deployed using regions, sites are
independent OpenStack installations which are linked together using
shared centralized services such as OpenStack Identity. At a high level
the recommended order of operations to upgrade an individual OpenStack
environment is (see the `Upgrades
chapter <http://docs.openstack.org/openstack-ops/content/ops_upgrades-general-steps.html>`_
of the Operations Guide for details):

#. Upgrade the OpenStack Identity service (keystone).

#. Upgrade the OpenStack Image service (glance).

#. Upgrade OpenStack Compute (nova), including networking components.

#. Upgrade OpenStack Block Storage (cinder).

#. Upgrade the OpenStack dashboard (horizon).
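
As an illustration of what a single step in this sequence can involve, the
hedged sketch below upgrades the Identity service packages and applies its
database migrations on one controller; the package name, init system, and the
web server hosting keystone are assumptions that vary by distribution and
deployment tooling:

.. code-block:: console

   # Upgrade the Identity service packages, then run its database migrations
   # and restart the web server that fronts keystone.
   $ sudo apt-get update && sudo apt-get install --only-upgrade keystone
   $ sudo keystone-manage db_sync
   $ sudo systemctl restart apache2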

The process for upgrading a multi-site environment is not significantly
different:

#. Upgrade the shared OpenStack Identity service (keystone) deployment.

#. Upgrade the OpenStack Image service (glance) at each site.

#. Upgrade OpenStack Compute (nova), including networking components, at
   each site.

#. Upgrade OpenStack Block Storage (cinder) at each site.

#. Upgrade the OpenStack dashboard (horizon), at each site or in the
   single central location if it is shared.

Compute upgrades within each site can also be performed in a rolling
fashion. Compute controller services (API, Scheduler, and Conductor) can
be upgraded prior to upgrading of individual compute nodes. This allows
operations staff to keep a site operational for users of Compute
services while performing an upgrade.
@ -5,9 +5,21 @@ Operator requirements
.. toctree::
   :maxdepth: 2


Introduction
~~~~~~~~~~~~
   operator-requirements-sla.rst
   operator-requirements-logging-monitoring.rst
   operator-requirements-network-design.rst
   operator-requirements-licensing.rst
   operator-requirements-support-maintenance.rst
   operator-requirements-ops-access.rst
   operator-requirements-capacity-planning.rst
   operator-requirements-quota-management.rst
   operator-requirements-policy-management.rst
   operator-requirements-hardware-selection.rst
   operator-requirements-software-selection.rst
   operator-requirements-external-idp.rst
   operator-requirements-upgrades.rst
   operator-requirements-bleeding-edge.rst
   operator-requirements-skills-training.rst

Several operational factors affect the design choices for a general
purpose cloud. Operations staff receive tasks regarding the maintenance
@ -51,450 +63,3 @@ teams can work more efficiently because fewer resources are required for these
common tasks. Administrators are then free to tackle tasks that are not easy
to automate and that have longer-term impacts on the business, for example,
capacity planning.
SLA Considerations
~~~~~~~~~~~~~~~~~~

Service-level agreements (SLAs) are contractual obligations that ensure the
availability of a service. When designing an OpenStack cloud, factoring in
promises of availability implies a certain level of redundancy and resiliency.

Expectations set by the SLA directly affect when and where you should
implement redundancy and high availability. SLAs define the levels of
availability that drive the technical design, often with penalties for not
meeting contractual obligations.

SLA terms that affect the design include:

* API availability guarantees, implying multiple infrastructure services
  and highly available load balancers.

* Network uptime guarantees, affecting switch design and possibly requiring
  redundant switching and power.

* Networking security policy requirements, which must be factored into
  deployments.

In any environment larger than just a few hosts, there are two separate areas
that might be subject to an SLA. First, the services that provide the actual
virtualization, networking, and storage are the ones that customers of the
environment most want to be continuously available. This is often referred to
as the 'Data Plane'.

Second, there are the ancillary services such as API endpoints and the
various services that control CRUD operations. These are often referred to as
the 'Control Plane'. The services in this category are usually subject to a
different SLA expectation and may therefore be better placed on separate
hardware, or at least in separate containers, from the Data Plane services.

To effectively run cloud installations, initial downtime planning
includes creating processes and architectures that support both:

* Planned downtime (maintenance)
* Unplanned downtime (system faults)

It is important to determine, as part of the SLA negotiation, which party is
responsible for monitoring and restarting Compute service instances should an
outage occur that shuts them down.

The resiliency of the overall system, and of its individual components, is
dictated by the requirements of the SLA, meaning that designing for
:term:`high availability (HA)` can have cost ramifications.

Upgrading, patching, and changing configuration items may require downtime for
some services. In these cases, stopping services that form the Control Plane
may leave the Data Plane unaffected, while actions such as live migration of
Compute instances may be required before performing any actions that need
downtime to Data Plane components, whilst still meeting SLA expectations.

Note that many services outside of the OpenStack code itself affect the
ability of any given design to meet the SLA, including:

* Database services, such as ``MySQL`` or ``PostgreSQL``.
* Services providing RPC, such as ``RabbitMQ``.
* External network attachments.
* Physical constraints such as power, rack space, network cabling, etc.
* Shared storage including SAN based arrays, storage clusters such as ``Ceph``,
  and/or NFS services.

Depending on the design, some Network service functions may fall into both the
Control and Data Plane categories. For example, the neutron L3 agent may be
considered a Control Plane component, but the routers themselves are Data
Plane.

A given set of requirements may dictate an SLA under which some services need
HA and others do not.

In a design with multiple regions, the SLA also needs to take into
consideration the use of shared services such as the Identity service,
Dashboard, and so on.

Any SLA negotiation must also take into account the reliance on third parties
for critical aspects of the design. For example, if there is an existing SLA
on a component such as a storage system, the cloud SLA must take this
limitation into account. If the required SLA for the cloud exceeds the agreed
uptime levels of the components comprising that cloud, additional redundancy
is required. This consideration is critical to review in a hybrid cloud
design, where multiple third parties are involved.
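As a rough illustration of why component uptime levels matter, the following
sketch (with purely illustrative availability figures, not measured values)
shows how the availability of serially dependent components compounds, and how
quickly it can fall below a target SLA without redundancy:

.. code-block:: python

   # Assumed availability figures for serially dependent components.
   component_availability = {
       "power_and_network": 0.9995,
       "shared_storage": 0.999,
       "database_and_rpc": 0.999,
       "openstack_control_plane": 0.999,
   }

   target_sla = 0.999  # e.g. "three nines" promised to tenants

   # With no redundancy, every component must be up for the service to be up,
   # so the composite availability is the product of the individual figures.
   composite = 1.0
   for availability in component_availability.values():
       composite *= availability

   print(f"Composite availability: {composite:.5f}")
   print("Meets target" if composite >= target_sla else "Redundancy required")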
Logging and Monitoring
~~~~~~~~~~~~~~~~~~~~~~

OpenStack clouds require appropriate monitoring platforms to catch and
manage errors.

.. note::

   We recommend leveraging existing monitoring systems to see if they
   are able to effectively monitor an OpenStack environment.

Specific meters that are critically important to capture include:

* Image disk utilization

* Response time to the Compute API

Logging and monitoring does not significantly differ for a multi-site OpenStack
cloud. The tools described in the `Logging and monitoring chapter
<http://docs.openstack.org/openstack-ops/content/logging_monitoring.html>`__ of
the Operations Guide remain applicable. Logging and monitoring can be provided
on a per-site basis, and in a common centralized location.

When attempting to deploy logging and monitoring facilities to a centralized
location, care must be taken with the load placed on the inter-site networking
links.
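One of the meters listed above, Compute API response time, can be sampled with
a very small probe. The following is a minimal sketch using the ``requests``
library; the endpoint URL and token are placeholders and must be substituted
with values from your own cloud:

.. code-block:: python

   import time

   import requests

   # Placeholder values; substitute the endpoint and token for your own cloud.
   COMPUTE_ENDPOINT = "https://compute.example.com:8774/v2.1/"
   AUTH_TOKEN = "<keystone-token>"

   start = time.monotonic()
   response = requests.get(
       COMPUTE_ENDPOINT,
       headers={"X-Auth-Token": AUTH_TOKEN},
       timeout=10,
   )
   elapsed = time.monotonic() - start

   # Feed the sample into whatever monitoring system is already in place.
   print(f"Compute API returned {response.status_code} in {elapsed:.3f} seconds")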
Network
~~~~~~~

The network design for an OpenStack cluster includes decisions regarding
the interconnect needs within the cluster, plus the need to allow clients to
access their resources, and for operators to access the cluster for
maintenance. The bandwidth, latency, and reliability of these networks need
consideration.

Make additional design decisions about monitoring and alarming. This can
be an internal responsibility or the responsibility of the external
provider. In the case of using an external provider, service level
agreements (SLAs) likely apply. In addition, other operational
considerations such as bandwidth, latency, and jitter can be part of an
SLA.

Consider the ability to upgrade the infrastructure. As demand for
network resources increases, operators add additional IP address blocks
and additional bandwidth capacity. In addition, consider managing
hardware and software life cycle events, for example upgrades,
decommissioning, and outages, while avoiding service interruptions for
tenants.

Factor maintainability into the overall network design. This includes
the ability to manage and maintain IP addresses as well as the use of
overlay identifiers such as VLAN tag IDs, GRE tunnel IDs, and MPLS
tags. As an example, if you need to change all of the IP addresses
on a network, a process known as renumbering, the design must
support this function.
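IP address management of this kind can be prototyped with the Python standard
library. A small sketch, using an illustrative address block, that carves a
supernet into per-rack subnets, checks the remaining headroom, and builds a
renumbering map:

.. code-block:: python

   import ipaddress

   # Illustrative address plan: one /16 carved into /24s, one per rack.
   supernet = ipaddress.ip_network("10.20.0.0/16")
   rack_subnets = list(supernet.subnets(new_prefix=24))

   racks_in_use = 40
   print(f"{len(rack_subnets)} rack subnets available, {racks_in_use} in use")
   print(f"Headroom for {len(rack_subnets) - racks_in_use} additional racks")

   # Renumbering support: map an old subnet onto a new block of the same size,
   # host by host, so the translation can be scripted rather than done by hand.
   old = ipaddress.ip_network("10.20.5.0/24")
   new = ipaddress.ip_network("10.30.5.0/24")
   mapping = dict(zip(old.hosts(), new.hosts()))

   old_addr = ipaddress.ip_address("10.20.5.1")
   print(f"{len(mapping)} addresses to renumber, e.g. {old_addr} -> {mapping[old_addr]}")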
Address network-focused applications when considering certain
operational realities. For example, consider the impending exhaustion of
IPv4 addresses, the migration to IPv6, and the use of private networks
to segregate different types of traffic that an application receives or
generates. In the case of IPv4 to IPv6 migrations, applications should
follow best practices for storing IP addresses. We recommend you avoid
relying on IPv4 features that did not carry over to the IPv6 protocol or
differ in implementation.

To segregate traffic, allow applications to create a private tenant
network for database and storage network traffic. Use a public network
for services that require direct client access from the internet. Upon
segregating the traffic, consider quality of service (QoS) and security
to ensure each network has the required level of service.
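Tenants can create this kind of private network themselves through the
Networking API. A minimal sketch using the ``openstacksdk`` library, assuming
a ``clouds.yaml`` entry named ``mycloud``; the network names and CIDR are
illustrative:

.. code-block:: python

   import openstack

   # Assumes credentials are defined under the "mycloud" entry in clouds.yaml.
   conn = openstack.connect(cloud="mycloud")

   # Private tenant network reserved for database and storage traffic.
   network = conn.network.create_network(name="db-storage-net")
   conn.network.create_subnet(
       name="db-storage-subnet",
       network_id=network.id,
       ip_version=4,
       cidr="192.168.50.0/24",
   )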
Finally, consider the routing of network traffic. For some applications,
develop a complex policy framework for routing. To create a routing
policy that satisfies business requirements, consider the economic cost
of transmitting traffic over expensive links versus cheaper links, in
addition to bandwidth, latency, and jitter requirements.

Additionally, consider how to respond to network events. As an example,
how load transfers from one link to another during a failure scenario
could be a factor in the design. If you do not plan network capacity
correctly, failover traffic could overwhelm other ports or network links
and create a cascading failure scenario. In this case, traffic that
fails over to one link overwhelms that link and then moves to the
subsequent links until all network traffic stops.
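The cascading-failure scenario above is easy to reason about numerically. A
simplified sketch, with made-up link capacities and loads, that checks whether
the surviving links can absorb the traffic from a failed link:

.. code-block:: python

   # Illustrative link capacities and current loads, in Gbit/s.
   links = {"link-a": (40, 20), "link-b": (40, 35), "link-c": (40, 20)}

   def survives_failure(links, failed):
       """Return True if the remaining links can absorb the failed link's load."""
       _, failed_load = links[failed]
       survivors = {k: v for k, v in links.items() if k != failed}
       share = failed_load / len(survivors)  # naive even redistribution
       return all(load + share <= capacity for capacity, load in survivors.values())

   for name in links:
       status = "ok" if survives_failure(links, name) else "overload -> cascading risk"
       print(f"failure of {name}: {status}")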
Licensing
~~~~~~~~~

The many different forms of license agreements for software are often written
with the use of dedicated hardware in mind. This model is relevant for the
cloud platform itself, including the hypervisor operating system and supporting
software for items such as the database, RPC, backup, and so on. Consideration
must be given when offering Compute service instances and applications to end
users of the cloud, since the license terms for that software may need some
adjustment to be able to operate economically in the cloud.

Multi-site OpenStack deployments present additional licensing
considerations over and above regular OpenStack clouds, particularly
where site licenses are in use to provide cost efficient access to
software licenses. The licensing for host operating systems, guest
operating systems, OpenStack distributions (if applicable),
software-defined infrastructure including network controllers and
storage systems, and even individual applications needs to be evaluated.

Topics to consider include:

* The definition of what constitutes a site in the relevant licenses,
  as the term does not necessarily denote a geographic or otherwise
  physically isolated location.

* Differentiation between "hot" (active) and "cold" (inactive) sites,
  where significant savings may be made in situations where one site is
  a cold standby for disaster recovery purposes only.

* Certain locations might require local vendors to provide support and
  services for each site, which may vary with the licensing agreement in
  place.
Support and maintainability
~~~~~~~~~~~~~~~~~~~~~~~~~~~

To be able to support and maintain an installation, OpenStack cloud
management requires operations staff to understand the design
architecture content. The skill level of the operations and engineering
staff, and the separation between those roles, depend on the size and
purpose of the installation. Large cloud service providers, or telecom
providers, are more likely to be managed by specially trained, dedicated
operations organizations. Smaller implementations are more likely to rely
on support staff that take on combined engineering, design, and
operations functions.

Maintaining OpenStack installations requires a variety of technical
skills. You may want to consider using a third-party management company
with specific expertise in managing OpenStack deployments.
Operator access to systems
~~~~~~~~~~~~~~~~~~~~~~~~~~

As more applications are migrated into a cloud-based environment, we can reach
a position where systems that are critical for cloud operations are hosted
within the cloud being operated. Consideration must be given to how operators
can access the systems and tools required to resolve a major incident in that
situation.

If a significant portion of the cloud is on externally managed systems,
prepare for situations where it may not be possible to make changes.
Additionally, providers may differ on how infrastructure must be managed and
exposed. This can lead to delays in root cause analysis where each provider
insists the blame lies with the other.

Ensure that the network structure connects all clouds to form an integrated
system, keeping in mind the state of handoffs. These handoffs must be as
reliable as possible and introduce as little latency as possible to ensure the
best performance of the overall system.
Capacity planning
~~~~~~~~~~~~~~~~~

An important consideration in running a cloud over time is projecting growth
and utilization trends in order to plan capital expenditures for the short and
long term. Gather utilization meters for compute, network, and storage, along
with historical records of these meters. While securing major anchor tenants
can lead to rapid jumps in the utilization rates of all resources, the steady
adoption of the cloud inside an organization or by consumers in a public
offering also creates a steady trend of increased utilization.

Capacity constraints for a general purpose cloud environment include:

* Compute limits
* Storage limits

A relationship exists between the size of the compute environment and
the supporting OpenStack infrastructure controller nodes requiring
support.

Increasing the size of the supporting compute environment increases the
network traffic and messages, adding load to the controller or
networking nodes. Effective monitoring of the environment will help with
capacity decisions on scaling.

Compute nodes automatically attach to OpenStack clouds, resulting in a
horizontally scaling process when adding extra compute capacity to an
OpenStack cloud. Additional processes are required to place nodes into
appropriate availability zones and host aggregates. When adding
additional compute nodes to environments, ensure identical or functional
compatible CPUs are used, otherwise live migration features will break.
It is necessary to add rack capacity or network switches as scaling out
compute hosts directly affects network and datacenter resources.

Compute host components can also be upgraded to account for increases in
demand; this is known as vertical scaling. Upgrading CPUs with more
cores, or increasing the overall server memory, can add extra needed
capacity depending on whether the running applications are more CPU
intensive or memory intensive.

Another option is to assess the average workloads and increase the
number of instances that can run within the compute environment by
adjusting the overcommit ratio.

.. note::

   It is important to remember that changing the CPU overcommit ratio
   can have a detrimental effect and cause a potential increase in
   noisy neighbor problems.
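To illustrate the effect of the overcommit ratio described above, a quick
arithmetic sketch with illustrative host and flavor figures (the ratio and
counts are assumptions, not recommended values):

.. code-block:: python

   # Illustrative figures for one compute host and a typical flavor.
   physical_cores = 48
   cpu_overcommit_ratio = 4.0   # assumed ratio; actual values are deployment specific
   vcpus_per_instance = 4

   schedulable_vcpus = physical_cores * cpu_overcommit_ratio
   instances_per_host = int(schedulable_vcpus // vcpus_per_instance)

   print(f"{schedulable_vcpus:.0f} schedulable vCPUs -> {instances_per_host} instances per host")
   # Raising the ratio increases instance density, but every extra instance
   # competes for the same physical cores, which is the noisy-neighbor risk.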
Insufficient disk capacity could also have a negative effect on overall
performance, including CPU and memory usage. Depending on the back-end
architecture of the OpenStack Block Storage layer, adding capacity can mean
adding disk shelves to enterprise storage systems or installing
additional block storage nodes. It may also be necessary to upgrade directly
attached storage installed in compute hosts, and to add capacity to the shared
storage that provides additional ephemeral storage to instances.

For a deeper discussion on many of these topics, refer to the `OpenStack
Operations Guide <http://docs.openstack.org/ops>`_.
Quota management
~~~~~~~~~~~~~~~~

Quotas are used to set operational limits to prevent system capacities
from being exhausted without notification. They are currently enforced
at the tenant (or project) level rather than at the user level.

Quotas are defined on a per-region basis. Operators can define identical
quotas for tenants in each region of the cloud to provide a consistent
experience, or even create a process for synchronizing allocated quotas
across regions. It is important to note that only the operational limits
imposed by the quotas will be aligned; consumption of quotas by users
will not be reflected between regions.

For example, given a cloud with two regions, if the operator grants a
user a quota of 25 instances in each region then that user may launch a
total of 50 instances spread across both regions. They may not, however,
launch more than 25 instances in any single region.
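The per-region behaviour in the example above can be expressed as a small
check. A sketch with the same illustrative numbers; the running instance
counts are assumptions:

.. code-block:: python

   regions = ["region-one", "region-two"]
   per_region_quota = 25

   # Assumed current consumption per region.
   running = {"region-one": 25, "region-two": 10}

   global_ceiling = per_region_quota * len(regions)   # 50 instances overall
   total_running = sum(running.values())

   for region in regions:
       headroom = per_region_quota - running[region]
       print(f"{region}: {headroom} more instances may be launched")

   print(f"{global_ceiling - total_running} instances remain across the cloud")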
For more information on managing quotas refer to the `Managing projects and
users chapter
<http://docs.openstack.org/openstack-ops/content/projects_users.html>`__
of the OpenStack Operators Guide.
Policy management
~~~~~~~~~~~~~~~~~

OpenStack provides a default set of Role Based Access Control (RBAC)
policies, defined in a ``policy.json`` file, for each service. Operators
edit these files to customize the policies for their OpenStack
installation. If the application of consistent RBAC policies across
sites is a requirement, then it is necessary to ensure proper
synchronization of the ``policy.json`` files to all installations.

This must be done using system administration tools such as rsync, as
functionality for synchronizing policies across regions is not currently
provided within OpenStack.
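A minimal sketch of such a synchronization step, driving ``rsync`` from
Python; the host names, service list, and file paths are illustrative
assumptions, and in practice this would normally live in whatever
configuration management tool is already in use:

.. code-block:: python

   import subprocess

   # Illustrative controller hosts, one per site/region.
   SITE_CONTROLLERS = ["controller.site1.example.com", "controller.site2.example.com"]

   # Illustrative policy files kept under version control on the admin host.
   POLICY_FILES = {
       "nova": "policies/nova/policy.json",
       "cinder": "policies/cinder/policy.json",
   }

   for host in SITE_CONTROLLERS:
       for service, local_path in POLICY_FILES.items():
           # Push the canonical policy.json to the corresponding service directory.
           subprocess.run(
               ["rsync", "-az", local_path, f"root@{host}:/etc/{service}/policy.json"],
               check=True,
           )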
Selecting Hardware
~~~~~~~~~~~~~~~~~~


Integration with external IDP
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Upgrades
~~~~~~~~

Running OpenStack with a focus on availability requires striking a balance
between stability and features. For example, it might be tempting to run an
older stable release branch of OpenStack to make deployments easier. However,
known issues that may be of some concern or only have minimal impact in smaller
deployments could become pain points as scale increases. Recent releases may
address well-known issues. The OpenStack community can help resolve reported
issues by applying the collective expertise of the OpenStack developers.

In multi-site OpenStack clouds deployed using regions, sites are
independent OpenStack installations which are linked together using
shared centralized services such as OpenStack Identity. At a high level,
the recommended order of operations to upgrade an individual OpenStack
environment is (see the `Upgrades
chapter <http://docs.openstack.org/openstack-ops/content/ops_upgrades-general-steps.html>`__
of the Operations Guide for details):

#. Upgrade the OpenStack Identity service (keystone).

#. Upgrade the OpenStack Image service (glance).

#. Upgrade OpenStack Compute (nova), including networking components.

#. Upgrade OpenStack Block Storage (cinder).

#. Upgrade the OpenStack dashboard (horizon).

The process for upgrading a multi-site environment is not significantly
different:

#. Upgrade the shared OpenStack Identity service (keystone) deployment.

#. Upgrade the OpenStack Image service (glance) at each site.

#. Upgrade OpenStack Compute (nova), including networking components, at
   each site.

#. Upgrade OpenStack Block Storage (cinder) at each site.

#. Upgrade the OpenStack dashboard (horizon), at each site or in the
   single central location if it is shared.

Compute upgrades within each site can also be performed in a rolling
fashion. Compute controller services (API, Scheduler, and Conductor) can
be upgraded prior to upgrading the individual compute nodes. This allows
operations staff to keep a site operational for users of Compute
services while performing an upgrade.
The bleeding edge
-----------------

The number of organizations running at massive scales is a small proportion of
the OpenStack community, therefore it is important to share related issues
with the community and be a vocal advocate for resolving them. Some issues
only manifest when operating at large scale, and the number of organizations
able to duplicate and validate an issue is small, so it is important to
document and dedicate resources to their resolution.

In some cases, the resolution to the problem is ultimately to deploy a more
recent version of OpenStack. Alternatively, when you must resolve an issue in
a production environment where rebuilding the entire environment is not an
option, it is sometimes possible to deploy updates to specific underlying
components in order to resolve issues or gain significant performance
improvements. Although this may appear to expose the deployment to increased
risk and instability, in many cases it could be an undiscovered issue.

We recommend building a development and operations organization that is
responsible for creating desired features, diagnosing and resolving issues,
and building the infrastructure for large scale continuous integration tests
and continuous deployment. This helps catch bugs early and makes deployments
faster and easier. In addition to development resources, we also recommend the
recruitment of experts in the fields of message queues, databases, distributed
systems, networking, cloud, and storage.
Skills and training
~~~~~~~~~~~~~~~~~~~

Projecting growth for storage, networking, and compute is only one aspect of a
growth plan for running OpenStack at massive scale. Growing and nurturing
development and operational staff is an additional consideration. Sending team
members to OpenStack conferences and meetup events, and encouraging active
participation in the mailing lists and committees, are important ways to
maintain skills and forge relationships in the community. For a list of
OpenStack training providers in the marketplace, see:
http://www.openstack.org/marketplace/training/.