<?xml version="1.0" encoding="UTF-8"?>
<section xmlns="http://docbook.org/ns/docbook"
    xmlns:xi="http://www.w3.org/2001/XInclude"
    xmlns:xlink="http://www.w3.org/1999/xlink"
    version="5.0"
    xml:id="operational-considerations-general-purpose">
    <?dbhtml stop-chunking?>
    <title>Operational considerations</title>
    <para>Many operational factors will affect general purpose cloud
        design choices. In larger installations, it is not uncommon
        for operations staff to be tasked with maintaining cloud
        environments, and this team often differs from the staff
        responsible for building or designing the infrastructure. It
        is important to include the operations function in the
        planning and design phases of the build-out.</para>
    <para>Service Level Agreements (SLAs) are contractual obligations
        that provide assurances for service availability. SLAs define
        levels of availability that drive the technical design, often
        with penalties for not meeting the contractual obligations.
        The strictness of the SLA dictates the level of redundancy and
        resiliency in the OpenStack cloud design. Knowing when and
        where to implement redundancy and high availability is
        directly affected by expectations set by the terms of the SLA.
        Some of the SLA terms that will affect the design
        include:</para>
    <itemizedlist>
        <listitem>
            <para>Guarantees for API availability imply multiple
                infrastructure services combined with highly available
                load balancers, as sketched below.</para>
        </listitem>
        <listitem>
            <para>Network uptime guarantees will affect the switch
                design and might require redundant switching and
                power.</para>
        </listitem>
        <listitem>
            <para>Network security policy requirements need to be
                factored into deployments.</para>
        </listitem>
    </itemizedlist>
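    <para>As an illustration only, the following minimal HAProxy
        fragment sketches one way to meet an API availability guarantee
        by load balancing the Compute API across two controllers. The
        virtual IP, controller addresses, and health-check settings are
        assumptions for the example, not values prescribed by this
        guide.</para>
    <programlisting># Minimal sketch: front the Compute API (port 8774) with HAProxy.
# The virtual IP and controller addresses below are placeholders.
frontend nova-api
    mode http
    bind 192.0.2.10:8774
    default_backend nova-api-servers

backend nova-api-servers
    mode http
    balance roundrobin
    option httpchk GET /
    server controller1 192.0.2.11:8774 check inter 2000 rise 2 fall 5
    server controller2 192.0.2.12:8774 check inter 2000 rise 2 fall 5</programlisting>
    <para>A similar pattern applies to the other OpenStack API
        endpoints; keeping the load balancers themselves redundant, for
        example as a failover pair, is what makes the front end highly
        available.</para>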
<section xml:id="support-and-maintainability-general-purpose">
|
|
<title>Support and maintainability</title>
|
|
        <para>OpenStack cloud management requires operations staff to
            understand the architecture design on some level. The
            required skill level and the degree of separation between
            the operations and engineering staff depend on the size and
            purpose of the installation. A large cloud service provider
            or a telecom provider is more likely to be managed by a
            specially trained, dedicated operations organization. A
            smaller implementation is more likely to rely on a smaller
            support staff that might need to take on the combined
            engineering, design, and operations functions.</para>
        <para>Furthermore, maintaining OpenStack installations requires
            a variety of technical skills. Some of these skills may
            include the ability to debug Python log output to a basic
            level and an understanding of networking concepts.</para>
        <para>Consider incorporating features into the architecture and
            design that reduce the operations burden, for example by
            automating some of the operations functions. In some cases
            it may be beneficial to use a third-party management
            company with special expertise in managing OpenStack
            deployments.</para>
    </section>
<section xml:id="monitoring-general-purpose">
|
|
<title>Monitoring</title>
|
|
        <para>Like any other infrastructure deployment, OpenStack
            clouds need an appropriate monitoring platform to ensure
            errors are caught and managed appropriately. Consider
            leveraging any existing monitoring system to see if it will
            be able to effectively monitor an OpenStack environment.
            While there are many aspects that need to be monitored,
            specific metrics that are critically important to capture
            include image disk utilization and response time to the
            Compute API.</para>
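        <para>As a minimal sketch only, the following Python script
            shows one way an external monitoring system could time a
            Compute API call. The endpoint URL, token, and two-second
            threshold are hypothetical placeholders, not values defined
            by this guide.</para>
        <programlisting language="python">#!/usr/bin/env python
# Minimal monitoring sketch: measure Compute API response time.
# The endpoint, token, and threshold below are assumptions.
import sys
import time

import requests

COMPUTE_ENDPOINT = "http://controller:8774/v2/TENANT_ID/servers"  # hypothetical
TOKEN = "REPLACE_WITH_KEYSTONE_TOKEN"  # hypothetical
THRESHOLD_SECONDS = 2.0  # hypothetical

start = time.time()
response = requests.get(COMPUTE_ENDPOINT, headers={"X-Auth-Token": TOKEN})
elapsed = time.time() - start

if response.status_code != 200 or elapsed > THRESHOLD_SECONDS:
    # A non-zero exit code signals the monitoring system that the check failed.
    print("CRITICAL: status %s in %.2f seconds" % (response.status_code, elapsed))
    sys.exit(2)
print("OK: Compute API responded in %.2f seconds" % elapsed)
sys.exit(0)</programlisting>
    </section>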
<section xml:id="downtime-general-purpose">
|
|
<title>Downtime</title>
|
|
        <para>No matter how robust the architecture is, at some point
            components will fail. Designing for high availability (HA)
            can have significant cost ramifications; therefore, the
            resiliency of the overall system and of the individual
            components will be dictated by the requirements of the SLA.
            Downtime planning includes creating processes and
            architectures that support both planned (maintenance) and
            unplanned (system faults) downtime.</para>
        <para>An example of an operational consideration is the
            recovery of a failed compute host. This might mean
            restoring instances from a snapshot or respawning an
            instance on another available compute host, which can have
            consequences for the overall application design. A general
            purpose cloud should not need to provide an ability to
            migrate instances from one host to another. However, if the
            expectation is that applications cannot be designed to
            tolerate failure, additional considerations need to be made
            around supporting instance migration. In this scenario,
            extra supporting services, including shared storage
            attached to compute hosts, might need to be
            deployed.</para>
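        <para>As a hedged illustration, the following command sketches
            how an instance could be respawned on another compute host
            after a host failure, assuming shared storage is available
            and that the evacuate workflow is in use; the instance and
            host names are placeholders.</para>
        <screen><prompt>$</prompt> <userinput>nova evacuate --on-shared-storage INSTANCE_UUID TARGET_COMPUTE_HOST</userinput></screen>
    </section>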
<section xml:id="capacity-planning">
|
|
<title>Capacity planning</title>
|
|
        <para>Capacity planning for future growth is a critically
            important and often overlooked consideration. Capacity
            constraints in a general purpose cloud environment include
            compute and storage limits. There is a relationship between
            the size of the compute environment and the number of
            OpenStack infrastructure controller nodes required to
            support it. As the compute environment grows, network
            traffic and messaging increase, which adds load to the
            controller or networking nodes. While no hard and fast rule
            exists, effective monitoring of the environment will help
            with capacity decisions about when to scale the back-end
            infrastructure as part of scaling the compute
            resources.</para>
        <para>Adding extra compute capacity to an OpenStack cloud is a
            horizontal scaling process: consistently configured compute
            nodes automatically attach to the cloud. Be mindful of any
            additional work that is needed to place the nodes into
            appropriate availability zones and host aggregates, as
            sketched below. Make sure to use identical or functionally
            compatible CPUs when adding compute nodes to the
            environment; otherwise, live migration features will break.
            Scaling out compute hosts will directly affect network and
            other data center resources, so it may be necessary to add
            rack capacity or network switches.</para>
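        <para>As an illustrative sketch only, the following commands
            show one way a newly added node could be placed into a host
            aggregate that maps to an availability zone; the aggregate,
            zone, and host names are hypothetical.</para>
        <screen><prompt>$</prompt> <userinput>nova aggregate-create rack42-aggregate az-rack42</userinput>
<prompt>$</prompt> <userinput>nova aggregate-add-host rack42-aggregate new-compute-01</userinput></screen>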
        <para>Another option is to assess the average workloads and
            increase the number of instances that can run within the
            compute environment by adjusting the overcommit ratio.
            While only appropriate in some environments, it is
            important to remember that changing the CPU overcommit
            ratio can have a detrimental effect and cause a potential
            increase in noisy neighbor issues. The added risk of
            increasing the overcommit ratio is that more instances will
            fail when a compute host fails.</para>
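        <para>As a hedged sketch, the overcommit ratio is typically
            controlled through the allocation ratio options in
            <filename>nova.conf</filename>; the values shown here are
            illustrative assumptions, not recommendations.</para>
        <programlisting># /etc/nova/nova.conf
# Illustrative values only; tune to the measured workloads.
[DEFAULT]
cpu_allocation_ratio = 4.0
ram_allocation_ratio = 1.5</programlisting>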
        <para>Compute host components can also be upgraded to account
            for increases in demand; this is known as vertical scaling.
            Upgrading CPUs with more cores, or increasing the overall
            server memory, can add needed capacity depending on whether
            the running applications are more CPU intensive or memory
            intensive.</para>
        <para>Insufficient disk capacity could also have a negative
            effect on overall performance, including CPU and memory
            usage. Depending on the back-end architecture of the
            OpenStack Block Storage layer, adding capacity might mean
            adding disk shelves to enterprise storage systems or
            installing additional block storage nodes. It may also be
            necessary to upgrade directly attached storage installed in
            compute hosts, or to add capacity to the shared storage, to
            provide additional ephemeral storage to instances.</para>
        <para>For a deeper discussion on many of these topics, refer to
            the <link xlink:href="http://docs.openstack.org/ops"><citetitle>OpenStack
            Operations Guide</citetitle></link>.</para>
    </section>
</section>