diff --git a/doc/arch-design-rst/source/massively-scalable.rst b/doc/arch-design-rst/source/massively-scalable.rst
index 6f63a2a1ad..5b7de7444a 100644
--- a/doc/arch-design-rst/source/massively-scalable.rst
+++ b/doc/arch-design-rst/source/massively-scalable.rst
@@ -7,6 +7,7 @@ Massively scalable
 
    user-requirements-massively-scalable.rst
    tech-considerations-massively-scalable.rst
+   operational-considerations-massively-scalable.rst
 
 A massively scalable architecture is a cloud
 implementation that is either a very large deployment, such as a commercial
diff --git a/doc/arch-design-rst/source/operational-considerations-massively-scalable.rst b/doc/arch-design-rst/source/operational-considerations-massively-scalable.rst
new file mode 100644
index 0000000000..9992608f4b
--- /dev/null
+++ b/doc/arch-design-rst/source/operational-considerations-massively-scalable.rst
@@ -0,0 +1,84 @@
+Operational considerations
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+To run efficiently at massive scale, automate as many of the operational
+processes as possible. Automation includes the configuration of
+provisioning, monitoring, and alerting systems. Part of the automation
+process includes the capability to determine when human intervention is
+required and who should act. The objective is to increase the ratio of
+running systems to operational staff as much as possible in order to reduce
+maintenance costs. In a massively scaled environment, it is very difficult
+for staff to give each system individual care.
+
+Configuration management tools such as Puppet and Chef enable operations
+staff to categorize systems into groups based on their roles and thus create
+configurations and system states that the provisioning system enforces.
+Systems that fall out of the defined state due to errors or failures are
+quickly removed from the pool of active nodes and replaced.
+
+At large scale, the resource cost of diagnosing individual failed systems is
+far greater than the cost of replacement. It is more economical to replace a
+failed system with a new one, provisioning and configuring it automatically
+and adding it to the pool of active nodes. By automating tasks that are
+labor-intensive, repetitive, and critical to operations, cloud operations
+teams can work more efficiently because fewer resources are required for
+these common tasks. Administrators are then free to tackle tasks that are
+not easy to automate and that have longer-term impacts on the business, for
+example, capacity planning.
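+
+The sketch below outlines how such a replace-rather-than-repair policy might
+be automated. It is illustrative only: the inventory data and the
+``remove_from_pool`` and ``request_replacement`` hooks are hypothetical
+placeholders for whatever inventory, monitoring, and provisioning systems a
+deployment actually uses.
+
+.. code-block:: python
+
+   import logging
+
+   log = logging.getLogger('pool-reconciler')
+
+   # Desired state for one role; a real deployment would derive this from
+   # its configuration management system (for example Puppet or Chef).
+   DESIRED_STATE = {'role': 'compute', 'config_version': '42'}
+
+   def list_active_nodes():
+       # Hypothetical inventory: node name plus the state last reported by
+       # monitoring. Real data would come from the deployment's own tooling.
+       return [
+           {'name': 'compute-001', 'config_version': '42', 'healthy': True},
+           {'name': 'compute-002', 'config_version': '41', 'healthy': True},
+           {'name': 'compute-003', 'config_version': '42', 'healthy': False},
+       ]
+
+   def matches_desired_state(node):
+       return (node['healthy'] and
+               node['config_version'] == DESIRED_STATE['config_version'])
+
+   def remove_from_pool(node):
+       # Placeholder: drain the node and drop it from the active pool.
+       log.warning('removing %s from the active pool', node['name'])
+
+   def request_replacement(role):
+       # Placeholder: ask the provisioning system for a freshly built node.
+       log.warning('requesting a replacement %s node', role)
+
+   def reconcile_once():
+       # Replace, rather than repair, nodes out of the desired state.
+       for node in list_active_nodes():
+           if not matches_desired_state(node):
+               remove_from_pool(node)
+               request_replacement(DESIRED_STATE['role'])
+
+   if __name__ == '__main__':
+       logging.basicConfig(level=logging.WARNING)
+       reconcile_once()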
+
+The bleeding edge
+-----------------
+
+Running OpenStack at massive scale requires striking a balance between
+stability and features. For example, it might be tempting to run an older
+stable release branch of OpenStack to make deployments easier. However,
+known issues that are only a minor concern or have minimal impact in smaller
+deployments can become serious pain points at massive scale. Recent releases
+may already address well-known issues. The OpenStack community can help
+resolve reported issues by applying the collective expertise of the
+OpenStack developers.
+
+Organizations running at massive scale make up a small proportion of the
+OpenStack community, so it is important to share related issues with the
+community and to be a vocal advocate for resolving them. Some issues only
+manifest when operating at large scale, and few organizations are able to
+duplicate and validate them, so it is important to document such issues and
+to dedicate resources to their resolution.
+
+In some cases, the resolution to the problem is ultimately to deploy a more
+recent version of OpenStack. Alternatively, when you must resolve an issue
+in a production environment where rebuilding the entire environment is not
+an option, it is sometimes possible to deploy updates to specific underlying
+components in order to resolve issues or gain significant performance
+improvements. Although this may appear to expose the deployment to increased
+risk and instability, in many cases it could be an undiscovered issue.
+
+We recommend building a development and operations organization that is
+responsible for creating desired features, diagnosing and resolving issues,
+and building the infrastructure for large-scale continuous integration
+testing and continuous deployment. This helps catch bugs early and makes
+deployments faster and easier. In addition to development resources, we also
+recommend recruiting experts in the fields of message queues, databases,
+distributed systems, networking, cloud, and storage.
+
+Growth and capacity planning
+----------------------------
+
+An important consideration in running at massive scale is projecting growth
+and utilization trends in order to plan capital expenditures for the short
+and long term. Gather utilization meters for compute, network, and storage,
+along with historical records of these meters. While securing major anchor
+tenants can lead to rapid jumps in the utilization of all resources, the
+gradual adoption of the cloud inside an organization, or by consumers in a
+public offering, creates a steadier trend of increasing utilization. A
+minimal sketch of projecting such a trend appears at the end of this
+document.
+
+Skills and training
+-------------------
+
+Projecting growth for storage, networking, and compute is only one aspect of
+a growth plan for running OpenStack at massive scale. Growing and nurturing
+development and operational staff is an additional consideration. Sending
+team members to OpenStack conferences and meetup events, and encouraging
+active participation in the mailing lists and committees, are important ways
+to maintain skills and forge relationships in the community. For a list of
+OpenStack training providers in the marketplace, see:
+http://www.openstack.org/marketplace/training/.
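+
+Projecting the growth trend described above can begin with something very
+simple. The following sketch fits a straight-line trend to historical
+utilization samples and estimates when existing capacity will be exhausted.
+The sample figures, the monthly granularity, and the total-capacity value
+are illustrative assumptions only, not measurements or recommendations.
+
+.. code-block:: python
+
+   # (month index, vCPUs in use), gathered once per month from utilization
+   # meters; these values are made up for the purpose of the example.
+   samples = [(0, 1200), (1, 1350), (2, 1480),
+              (3, 1700), (4, 1950), (5, 2250)]
+   capacity = 5000  # total vCPUs currently deployed (assumed figure)
+
+   n = len(samples)
+   mean_x = sum(x for x, _ in samples) / float(n)
+   mean_y = sum(y for _, y in samples) / float(n)
+
+   # Ordinary least-squares fit: used = slope * month + intercept.
+   num = sum((x - mean_x) * (y - mean_y) for x, y in samples)
+   den = sum((x - mean_x) ** 2 for x, _ in samples)
+   slope = num / den
+   intercept = mean_y - slope * mean_x
+
+   months_left = (capacity - intercept) / slope - samples[-1][0]
+   print('Utilization grows by roughly %.0f vCPUs per month' % slope)
+   print('Capacity is exhausted in roughly %.1f months' % months_left)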