<?xml version="1.0" encoding="UTF-8"?>
<section xmlns="http://docbook.org/ns/docbook"
    xmlns:xi="http://www.w3.org/2001/XInclude"
    xmlns:xlink="http://www.w3.org/1999/xlink"
    version="5.0"
    xml:id="operational-considerations-massive-scale">
    <?dbhtml stop-chunking?>
    <title>Operational considerations</title>
    <para>In order to run efficiently at massive scale, automate as
        many of the operational processes as possible. Automation
        includes the configuration of provisioning, monitoring, and
        alerting systems. Part of the automation process includes the
        capability to determine when human intervention is required
        and who should act. The objective is to increase the ratio of
        running systems to operational staff as much as possible in
        order to reduce maintenance costs. In a massively scaled
        environment, it is impossible for staff to give each system
        individual care.</para>
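    <para>The hand-off point between automation and human operators
        can itself be encoded as policy. The following is a minimal
        sketch of such a triage rule in Python; the alert types,
        thresholds, and team names are hypothetical assumptions, not
        the interface of any particular monitoring system:</para>
    <programlisting language="python"><![CDATA[
# Hypothetical triage rule: decide whether an alert can be remediated
# automatically or requires a human operator. Alert fields, types, and
# thresholds are illustrative assumptions.
AUTO_REMEDIABLE = {"compute-node-down", "disk-full", "service-hung"}

def triage(alert):
    """Return the action to take for an incoming alert dictionary."""
    if alert["type"] in AUTO_REMEDIABLE and alert["failures_last_hour"] < 3:
        # A known failure mode that is not repeating: restart or
        # replace the system without paging anyone.
        return {"action": "auto-remediate", "runbook": alert["type"]}
    # Unknown or repeating failures suggest a systemic problem, so a
    # person must decide what happens next.
    return {"action": "page-oncall", "team": "cloud-ops"}

print(triage({"type": "disk-full", "failures_last_hour": 1}))
print(triage({"type": "api-latency", "failures_last_hour": 5}))
]]></programlisting>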
    <para>Configuration management tools such as Puppet and Chef
        enable operations staff to categorize systems into groups
        based on their roles and thus create configurations and
        system states that the provisioning system enforces. Systems
        that fall out of the defined state due to errors or failures
        are quickly removed from the pool of active nodes and
        replaced.</para>
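    <para>The following Python sketch illustrates the underlying idea
        of role-based desired state and drift detection; the roles,
        services, and host names are invented for the example and are
        not the interface of any configuration management tool:</para>
    <programlisting language="python"><![CDATA[
# Illustrative desired state per role, in the spirit of tools such as
# Puppet or Chef. All names here are hypothetical.
DESIRED_STATE = {
    "compute": {"nova-compute", "neutron-agent"},
    "controller": {"nova-api", "keystone", "glance-api"},
}

def drifted_hosts(inventory, running_services):
    """Return hosts whose running services no longer match their role."""
    drifted = []
    for host, role in inventory.items():
        expected = DESIRED_STATE[role]
        if not expected.issubset(running_services.get(host, set())):
            # Out of the defined state: flag for removal from the
            # active pool and automatic replacement.
            drifted.append(host)
    return drifted

inventory = {"node-1": "compute", "node-2": "controller"}
running = {
    "node-1": {"nova-compute"},  # neutron-agent has died
    "node-2": {"nova-api", "keystone", "glance-api"},
}
print(drifted_hosts(inventory, running))  # ['node-1']
]]></programlisting>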
    <para>At large scale the resource cost of diagnosing failed
        individual systems is far greater than the cost of
        replacement. It is more economical to replace the failed
        system with a new system, provisioning and configuring it
        automatically, and then quickly adding it to the pool of
        active nodes. By automating tasks that are labor-intensive,
        repetitive, and critical to operations, cloud operations
        teams can work more efficiently because fewer resources are
        required for these common tasks. Administrators are then free
        to tackle tasks that are not easy to automate and that have
        longer-term impacts on the business, for example, capacity
        planning.</para>
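    <para>As a sketch of the replace-rather-than-repair approach
        described above, the following hypothetical workflow shows
        the automated steps; the provisioning functions are
        placeholders for whatever deployment tooling is in use:</para>
    <programlisting language="python"><![CDATA[
# Hypothetical replace-not-repair workflow; each step stands in for a
# call into real provisioning and inventory tooling.
def remove_from_pool(host):
    print(f"removing failed {host} from the active pool")

def provision_replacement(role):
    new_host = f"{role}-new"
    print(f"provisioning and configuring {new_host} automatically")
    return new_host

def add_to_pool(host):
    print(f"adding {host} to the active pool")

def replace_failed_node(host, role):
    """Swap a failed system for a freshly provisioned one; no staff
    time is spent diagnosing the old system."""
    remove_from_pool(host)
    new_host = provision_replacement(role)
    add_to_pool(new_host)
    return new_host

replace_failed_node("node-17", "compute")
]]></programlisting>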
    <section xml:id="the-bleeding-edge">
        <title>The bleeding edge</title>
        <para>Running OpenStack at massive scale requires striking a
            balance between stability and features. For example, it
            might be tempting to run an older stable release branch
            of OpenStack to make deployments easier. However, when
            running at massive scale, known issues that may be of
            some concern or have only minimal impact in smaller
            deployments can become pain points. More recent releases
            may already address these well-known issues, and the
            OpenStack community can help resolve reported issues by
            applying the collective expertise of the OpenStack
            developers.</para>
        <para>The number of organizations running at massive scale is
            a small proportion of the OpenStack community; therefore,
            it is important to share related issues with the
            community and be a vocal advocate for resolving them.
            Some issues only manifest when operating at large scale,
            and the number of organizations able to duplicate and
            validate an issue is small, so it is important to
            document and dedicate resources to their
            resolution.</para>
        <para>In some cases, the resolution to the problem is
            ultimately to deploy a more recent version of OpenStack.
            Alternatively, when you must resolve an issue in a
            production environment where rebuilding the entire
            environment is not an option, it is sometimes possible to
            deploy updates to specific underlying components in order
            to resolve issues or gain significant performance
            improvements. Although this may appear to expose the
            deployment to increased risk and instability, in many
            cases the behavior being fixed is an issue that had
            simply gone undiscovered before the deployment reached
            this scale.</para>
        <para>We recommend building a development and operations
            organization that is responsible for creating desired
            features, diagnosing and resolving issues, and building
            the infrastructure for large-scale continuous integration
            tests and continuous deployment. This helps catch bugs
            early and makes deployments faster and easier. In
            addition to development resources, we also recommend
            recruiting experts in the fields of message queues,
            databases, distributed systems, networking, cloud, and
            storage.</para>
    </section>
    <section xml:id="growth-and-capacity-planning">
        <title>Growth and capacity planning</title>
        <para>An important consideration in running at massive scale
            is projecting growth and utilization trends in order to
            plan capital expenditures for the short and long term.
            Gather utilization meters for compute, network, and
            storage, along with historical records of these meters.
            While securing major anchor tenants can lead to rapid
            jumps in the utilization rates of all resources, the
            steady adoption of the cloud inside an organization or by
            consumers in a public offering also produces a steady
            upward trend in utilization.</para>
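        <para>As a simple illustration of trend projection, the
            following Python sketch fits a least-squares line to
            invented monthly utilization samples and estimates when
            more capacity is needed; real planning would draw on
            actual meter data and more sophisticated models:</para>
        <programlisting language="python"><![CDATA[
# Fit a least-squares line to monthly utilization samples (fraction of
# capacity in use) and extrapolate. The numbers are invented.
def fit_line(samples):
    """Return (slope, intercept) of an ordinary least-squares fit."""
    n = len(samples)
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    var = sum((x - mean_x) ** 2 for x in range(n))
    slope = cov / var
    return slope, mean_y - slope * mean_x

history = [0.42, 0.45, 0.50, 0.52, 0.57, 0.61]  # last six months
slope, intercept = fit_line(history)

# Project forward to the month where utilization crosses 80%.
month = len(history)
while slope > 0 and intercept + slope * month < 0.80:
    month += 1
print(f"plan a capacity expansion around month {month}")
]]></programlisting>
    </section>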
    <section xml:id="skills-and-training">
        <title>Skills and training</title>
        <para>Projecting growth for storage, networking, and compute
            is only one aspect of a growth plan for running OpenStack
            at massive scale. Growing and nurturing development and
            operational staff is an additional consideration. Sending
            team members to OpenStack conferences and meetup events,
            and encouraging active participation in the mailing lists
            and committees, are important ways to maintain skills and
            forge relationships in the community. For a list of
            OpenStack training providers in the marketplace, see:
            <link xlink:href="http://www.openstack.org/marketplace/training/">http://www.openstack.org/marketplace/training/</link>.
        </para>
    </section>
</section>