Merge "[arch-design] Convert massively scalable to RST"
commit cafd893d04
doc/arch-design-rst/source
@@ -7,6 +7,7 @@ Massively scalable
user-requirements-massively-scalable.rst
tech-considerations-massively-scalable.rst
operational-considerations-massively-scalable.rst

A massively scalable architecture is a cloud implementation
that is either a very large deployment, such as a commercial
@@ -0,0 +1,84 @@
Operational considerations
~~~~~~~~~~~~~~~~~~~~~~~~~~

In order to run efficiently at massive scale, automate as many operational
processes as possible. Automation includes the configuration of provisioning,
monitoring, and alerting systems. Part of the automation process includes the
capability to determine when human intervention is required and who should
act. The objective is to increase the ratio of running systems to operational
staff as much as possible in order to reduce maintenance costs. In a massively
scaled environment, it is very difficult for staff to give each system
individual care.

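The decision about when a human must intervene can itself be automated as part
of the alerting pipeline. The following is a minimal sketch, not drawn from any
particular OpenStack project, of an alert-routing rule that separates events an
automated remediation job can handle from events that should page an on-call
operator; the ``Alert`` fields, severity scale, and action names are
hypothetical.

.. code-block:: python

   # Hypothetical alert-routing sketch: decide whether an event can be
   # remediated automatically or must page an on-call operator.
   from dataclasses import dataclass

   @dataclass
   class Alert:
       source: str            # e.g. a service on a specific host
       severity: int          # 1 (info) .. 5 (critical)
       auto_remediable: bool  # set by the monitoring rule that raised it

   def route(alert: Alert) -> str:
       """Return the action the automation layer should take."""
       if alert.auto_remediable and alert.severity < 4:
           # Known failure mode: hand off to the provisioning system,
           # for example remove the node from the pool and rebuild it.
           return "auto-remediate"
       if alert.severity >= 4:
           # Unknown or critical condition: a human decides who acts.
           return "page-on-call"
       return "ticket-only"

   print(route(Alert("nova-compute@host42", 2, True)))   # auto-remediate
   print(route(Alert("rabbitmq-cluster", 5, False)))     # page-on-call
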
Configuration management tools such as Puppet and Chef enable operations staff
to categorize systems into groups based on their roles and thus create
configurations and system states that the provisioning system enforces.
Systems that fall out of the defined state due to errors or failures are
quickly removed from the pool of active nodes and replaced.

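The following sketch is illustrative only and is not Puppet or Chef syntax; it
shows the underlying idea of declaring a desired state per role and flagging
nodes that have drifted from it. The role names and state keys are invented.

.. code-block:: python

   # Hypothetical desired-state sketch: group nodes by role and flag any
   # node whose reported state has drifted from the role's definition.
   DESIRED_STATE = {
       "compute": {"nova-compute": "running", "kernel": "3.10"},
       "controller": {"nova-api": "running", "rabbitmq": "clustered"},
   }

   def drifted(role: str, reported: dict) -> bool:
       """True if the node no longer matches its role's desired state."""
       expected = DESIRED_STATE[role]
       return any(reported.get(k) != v for k, v in expected.items())

   # A node reporting a stopped service would be pulled from the active
   # pool and rebuilt rather than repaired in place.
   print(drifted("compute", {"nova-compute": "stopped", "kernel": "3.10"}))  # True
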
At large scale, the resource cost of diagnosing individual failed systems is
far greater than the cost of replacement. It is more economical to replace a
failed system with a new system, provisioning and configuring it automatically
and adding it to the pool of active nodes. By automating tasks that are
labor-intensive, repetitive, and critical to operations, cloud operations
teams can work more efficiently because fewer resources are required for these
common tasks. Administrators are then free to tackle tasks that are not easy
to automate and that have longer-term impacts on the business, for example,
capacity planning.

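A minimal sketch of the replace-rather-than-repair loop described above,
assuming hypothetical ``deprovision`` and ``provision`` helpers in place of
whatever provisioning system a deployment actually uses:

.. code-block:: python

   # Hypothetical replace-rather-than-repair loop. The helpers below stand
   # in for the deployment's actual provisioning system.
   def deprovision(node_id: str) -> None:
       print(f"removing {node_id} from service and wiping it")

   def provision(role: str) -> str:
       new_id = f"{role}-replacement"
       print(f"imaging and configuring {new_id} for the {role} role")
       return new_id

   def replace_failed_node(node_id: str, role: str, active_pool: set) -> None:
       """Swap a failed node for a freshly provisioned one, with no diagnosis."""
       active_pool.discard(node_id)
       deprovision(node_id)
       active_pool.add(provision(role))

   pool = {"compute-017", "compute-018"}
   replace_failed_node("compute-017", "compute", pool)
   print(pool)  # {'compute-018', 'compute-replacement'}
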
The bleeding edge
-----------------

Running OpenStack at massive scale requires striking a balance between
stability and features. For example, it might be tempting to run an older
stable release branch of OpenStack to make deployments easier. However, when
running at massive scale, known issues that may be of some concern or only
have minimal impact in smaller deployments could become pain points. Recent
releases may address well-known issues. The OpenStack community can help
resolve reported issues by applying the collective expertise of the OpenStack
developers.

Organizations running at massive scale are a small proportion of the OpenStack
community; therefore, it is important to share related issues with the
community and be a vocal advocate for resolving them. Some issues only
manifest when operating at large scale, and because the number of
organizations able to duplicate and validate such an issue is small, it is
important to document it and dedicate resources to its resolution.

In some cases, the resolution to the problem is ultimately to deploy a more
recent version of OpenStack. Alternatively, when you must resolve an issue in
a production environment where rebuilding the entire environment is not an
option, it is sometimes possible to deploy updates to specific underlying
components in order to resolve issues or gain significant performance
improvements. Although this may appear to expose the deployment to increased
risk and instability, in many cases the issue it resolves is one that had
simply not yet been discovered in the existing environment.

We recommend building a development and operations organization that is
responsible for creating desired features, diagnosing and resolving issues,
and building the infrastructure for large-scale continuous integration tests
and continuous deployment. This helps catch bugs early and makes deployments
faster and easier. In addition to development resources, we also recommend the
recruitment of experts in the fields of message queues, databases, distributed
systems, networking, cloud, and storage.

Growth and capacity planning
----------------------------

An important consideration in running at massive scale is projecting growth
and utilization trends in order to plan capital expenditures for the short and
long term. Gather utilization meters for compute, network, and storage, along
with historical records of these meters. While securing major anchor tenants
can lead to rapid jumps in the utilization rates of all resources, the steady
adoption of the cloud inside an organization or by consumers in a public
offering also creates a steady trend of increased utilization.

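As a simple illustration of turning historical meter data into a projection,
the following sketch fits a linear trend to monthly compute-utilization
samples and extrapolates a few months ahead; the sample values are invented,
and a real deployment would draw them from its own telemetry store.

.. code-block:: python

   # Hypothetical capacity projection: fit a linear trend to monthly
   # utilization samples and extrapolate a few months ahead.
   samples = [0.42, 0.45, 0.49, 0.52, 0.58, 0.61]  # fraction of compute capacity used

   n = len(samples)
   xs = range(n)
   x_mean = sum(xs) / n
   y_mean = sum(samples) / n
   num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, samples))
   den = sum((x - x_mean) ** 2 for x in xs)
   slope = num / den
   intercept = y_mean - slope * x_mean

   for months_ahead in (3, 6):
       projected = intercept + slope * (n - 1 + months_ahead)
       print(f"+{months_ahead} months: ~{projected:.0%} of current compute capacity")
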
Skills and training
-------------------

Projecting growth for storage, networking, and compute is only one aspect of a
growth plan for running OpenStack at massive scale. Growing and nurturing
development and operational staff is an additional consideration. Sending team
members to OpenStack conferences and meetup events, and encouraging active
participation in the mailing lists and committees, is a very important way to
maintain skills and forge relationships in the community. For a list of
OpenStack training providers in the marketplace, see
http://www.openstack.org/marketplace/training/.