From bc05b757da48e94712a472177ed2ddd6938e48fb Mon Sep 17 00:00:00 2001 From: daz Date: Tue, 22 Dec 2015 17:06:59 +1100 Subject: [PATCH] [arch-design-draft] Reorganise operators requirements chapter 1. Split sections into separate files 2. Consolidate hardware content from the current guide 3. Consolidate software content from the current guide Change-Id: I5edf204d88ea5622c656fe58699260e31fb9b7d3 Implements: blueprint arch-guide-mitaka-reorg --- .../operator-requirements-bleeding-edge.rst | 26 + ...perator-requirements-capacity-planning.rst | 61 +++ .../operator-requirements-external-idp.rst | 5 + ...erator-requirements-hardware-selection.rst | 449 +++++++++++++++++ .../operator-requirements-licensing.rst | 33 ++ ...erator-requirements-logging-monitoring.rst | 27 + .../operator-requirements-network-design.rst | 59 +++ .../operator-requirements-ops-access.rst | 20 + ...perator-requirements-policy-management.rst | 14 + ...operator-requirements-quota-management.rst | 24 + .../operator-requirements-skills-training.rst | 12 + .../source/operator-requirements-sla.rst | 89 ++++ ...erator-requirements-software-selection.rst | 226 +++++++++ ...rator-requirements-support-maintenance.rst | 17 + .../source/operator-requirements-upgrades.rst | 50 ++ .../source/operator-requirements.rst | 465 +----------------- 16 files changed, 1127 insertions(+), 450 deletions(-) create mode 100644 doc/arch-design-draft/source/operator-requirements-bleeding-edge.rst create mode 100644 doc/arch-design-draft/source/operator-requirements-capacity-planning.rst create mode 100644 doc/arch-design-draft/source/operator-requirements-external-idp.rst create mode 100644 doc/arch-design-draft/source/operator-requirements-hardware-selection.rst create mode 100644 doc/arch-design-draft/source/operator-requirements-licensing.rst create mode 100644 doc/arch-design-draft/source/operator-requirements-logging-monitoring.rst create mode 100644 doc/arch-design-draft/source/operator-requirements-network-design.rst create mode 100644 doc/arch-design-draft/source/operator-requirements-ops-access.rst create mode 100644 doc/arch-design-draft/source/operator-requirements-policy-management.rst create mode 100644 doc/arch-design-draft/source/operator-requirements-quota-management.rst create mode 100644 doc/arch-design-draft/source/operator-requirements-skills-training.rst create mode 100644 doc/arch-design-draft/source/operator-requirements-sla.rst create mode 100644 doc/arch-design-draft/source/operator-requirements-software-selection.rst create mode 100644 doc/arch-design-draft/source/operator-requirements-support-maintenance.rst create mode 100644 doc/arch-design-draft/source/operator-requirements-upgrades.rst diff --git a/doc/arch-design-draft/source/operator-requirements-bleeding-edge.rst b/doc/arch-design-draft/source/operator-requirements-bleeding-edge.rst new file mode 100644 index 0000000000..bd097412a1 --- /dev/null +++ b/doc/arch-design-draft/source/operator-requirements-bleeding-edge.rst @@ -0,0 +1,26 @@ +================= +The bleeding edge +================= + +The number of organizations running at massive scales is a small proportion of +the OpenStack community, therefore it is important to share related issues +with the community and be a vocal advocate for resolving them. Some issues +only manifest when operating at large scale, and the number of organizations +able to duplicate and validate an issue is small, so it is important to +document and dedicate resources to their resolution. 
In some cases, the resolution to the problem is ultimately to deploy a more
recent version of OpenStack. Alternatively, when you must resolve an issue in
a production environment where rebuilding the entire environment is not an
option, it is sometimes possible to deploy updates to specific underlying
components in order to resolve issues or gain significant performance
improvements. Although this may appear to expose the deployment to increased
risk and instability, in many cases the issue being worked around is simply
one that the wider community has not yet discovered.

We recommend building a development and operations organization that is
responsible for creating desired features, diagnosing and resolving issues,
and building the infrastructure for large scale continuous integration tests
and continuous deployment. This helps catch bugs early and makes deployments
faster and easier. In addition to development resources, we also recommend the
recruitment of experts in the fields of message queues, databases, distributed
systems, networking, cloud, and storage.

diff --git a/doc/arch-design-draft/source/operator-requirements-capacity-planning.rst b/doc/arch-design-draft/source/operator-requirements-capacity-planning.rst
new file mode 100644
index 0000000000..92638dc30f
--- /dev/null
+++ b/doc/arch-design-draft/source/operator-requirements-capacity-planning.rst
@@ -0,0 +1,61 @@

=================
Capacity planning
=================

An important consideration in running a cloud over time is projecting growth
and utilization trends in order to plan capital expenditures for the short and
long term. Gather utilization meters for compute, network, and storage, along
with historical records of these meters. While securing major anchor tenants
can lead to rapid jumps in the utilization rates of all resources, the steady
adoption of the cloud inside an organization or by consumers in a public
offering also creates a steady trend of increased utilization.

Capacity constraints for a general purpose cloud environment include:

* Compute limits
* Storage limits

A relationship exists between the size of the compute environment and the
OpenStack infrastructure controller nodes required to support it.

Increasing the size of the supporting compute environment increases the
network traffic and messages, adding load to the controller or networking
nodes. Effective monitoring of the environment will help with capacity
decisions on scaling.

Compute nodes automatically attach to OpenStack clouds, resulting in a
horizontally scaling process when adding extra compute capacity to an
OpenStack cloud. Additional processes are required to place nodes into
appropriate availability zones and host aggregates. When adding additional
compute nodes to environments, ensure identical or functionally compatible
CPUs are used, otherwise live migration features will break. It is necessary
to add rack capacity or network switches as scaling out compute hosts directly
affects network and data center resources.

Compute host components can also be upgraded to account for increases in
demand; this is known as vertical scaling. Upgrading CPUs with more cores, or
increasing the overall server memory, can add extra needed capacity depending
on whether the running applications are more CPU intensive or memory
intensive.

Another option is to assess the average workloads and increase the number of
instances that can run within the compute environment by adjusting the
overcommit ratio.

.. note::

   It is important to remember that changing the CPU overcommit ratio can
   have a detrimental effect and cause a potential increase in noisy neighbor
   issues.
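As a rough illustration of the overcommit arithmetic, the sketch below
computes the schedulable capacity of a single compute node. The 16:1 vCPU and
1.5:1 RAM ratios mirror nova's long-standing defaults, and every figure here
is an illustrative placeholder rather than a recommendation.

.. code-block:: python

   # Rough capacity sketch for one compute node under overcommit.
   # The 16:1 vCPU and 1.5:1 RAM ratios mirror nova's historical
   # defaults; all figures are illustrative placeholders.

   def effective_capacity(physical_cores, ram_gb,
                          cpu_ratio=16.0, ram_ratio=1.5):
       """Return (schedulable vCPUs, schedulable RAM in GB)."""
       return physical_cores * cpu_ratio, ram_gb * ram_ratio

   vcpus, ram = effective_capacity(physical_cores=24, ram_gb=256)
   print("Schedulable: %.0f vCPUs, %.0f GB RAM" % (vcpus, ram))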
Insufficient disk capacity could also have a negative effect on overall
performance, including CPU and memory usage. Depending on the back-end
architecture of the OpenStack Block Storage layer, capacity includes adding
disk shelves to enterprise storage systems or installing additional block
storage nodes. Upgrading directly attached storage installed in compute hosts,
and adding capacity to the shared storage to provide additional ephemeral
storage to instances, may be necessary.

For a deeper discussion on many of these topics, refer to the `OpenStack
Operations Guide `_.

diff --git a/doc/arch-design-draft/source/operator-requirements-external-idp.rst b/doc/arch-design-draft/source/operator-requirements-external-idp.rst
new file mode 100644
index 0000000000..c1f8bd114f
--- /dev/null
+++ b/doc/arch-design-draft/source/operator-requirements-external-idp.rst
@@ -0,0 +1,5 @@

=============================
Integration with external IDP
=============================

.. TODO

diff --git a/doc/arch-design-draft/source/operator-requirements-hardware-selection.rst b/doc/arch-design-draft/source/operator-requirements-hardware-selection.rst
new file mode 100644
index 0000000000..f55d411914
--- /dev/null
+++ b/doc/arch-design-draft/source/operator-requirements-hardware-selection.rst
@@ -0,0 +1,449 @@

==================
Hardware selection
==================

Hardware selection involves three key areas:

* Network

* Compute

* Storage

Network hardware selection
~~~~~~~~~~~~~~~~~~~~~~~~~~

The network architecture determines which network hardware will be used.
Networking software is determined by the selected networking hardware.

There are more subtle design impacts that need to be considered. The selection
of certain networking hardware (and the networking software) affects the
management tools that can be used. There are exceptions to this; the rise of
*open* networking software that supports a range of networking hardware means
there are instances where the relationship between networking hardware and
networking software is not as tightly defined.

For a compute-focused architecture, we recommend designing the network
architecture using a scalable network model that makes it easy to add capacity
and bandwidth. A good example of such a model is the leaf-spine model. In this
type of network design, it is possible to easily add additional bandwidth as
well as scale out to additional racks of gear. It is important to select
network hardware that supports the required port count, port speed, and port
density while also allowing for future growth as workload demands increase. It
is also important to evaluate where in the network architecture it is valuable
to provide redundancy.

Some of the key considerations that should be included in the selection of
networking hardware include:

Port count
   The design will require networking hardware that has the requisite port
   count.

Port density
   The network design will be affected by the physical space that is required
   to provide the requisite port count. A higher port density is preferred, as
   it leaves more rack space for compute or storage components that may be
   required by the design. This can also lead to considerations about fault
   domains and power density.
   Higher-density switches are more expensive, so it is important not to
   overdesign the network.

Port speed
   The networking hardware must support the proposed network speed, for
   example: 1 GbE, 10 GbE, or 40 GbE (or even 100 GbE).

Redundancy
   User requirements for high availability and cost considerations influence
   the required level of network hardware redundancy. Network redundancy can
   be achieved by adding redundant power supplies or paired switches.

   .. note::

      If this is a requirement, the hardware must support this configuration.
      User requirements determine if a completely redundant network
      infrastructure is required.

Power requirements
   Ensure that the physical data center provides the necessary power for the
   selected network hardware.

   .. note::

      This is not an issue for top of rack (ToR) switches. This may be an
      issue for spine switches in a leaf-spine fabric, or end of row (EoR)
      switches.

Protocol support
   It is possible to gain more performance out of a single storage system by
   using specialized network technologies such as RDMA, SRP, iSER, and SCST.
   The specifics of using these technologies are beyond the scope of this
   book.

There is no single best practice architecture for the networking hardware
supporting an OpenStack cloud that will apply to all implementations. Some of
the key factors that will have a major influence on selection of networking
hardware include:

Connectivity
   All nodes within an OpenStack cloud require network connectivity. In some
   cases, nodes require access to more than one network segment. The design
   must encompass sufficient network capacity and bandwidth to ensure that all
   communications within the cloud, both north-south and east-west traffic,
   have sufficient resources available.

Scalability
   The network design should encompass a physical and logical network design
   that can be easily expanded upon. Network hardware should offer the
   appropriate types of interfaces and speeds that are required by the
   hardware nodes.

Availability
   To ensure access to nodes within the cloud is not interrupted, we recommend
   that the network architecture identify any single points of failure and
   provide some level of redundancy or fault tolerance. The network
   infrastructure often involves use of networking protocols such as LACP,
   VRRP, or others to achieve a highly available network connection. It is
   also important to consider the networking implications on API availability.
   We recommend designing a load balancing solution within the network
   architecture to ensure that the APIs, and potentially other services in the
   cloud, are highly available.
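To make the port count and port speed trade-offs concrete, the following
back-of-envelope sketch estimates the oversubscription ratio of a single leaf
switch in a leaf-spine fabric. All port counts and speeds are illustrative
assumptions, not recommendations.

.. code-block:: python

   # Back-of-envelope leaf-switch oversubscription estimate for a
   # leaf-spine fabric; all figures are illustrative assumptions.

   def oversubscription(server_ports, server_gbps, uplinks, uplink_gbps):
       """Ratio of southbound (server) to northbound (uplink) bandwidth."""
       return (server_ports * server_gbps) / float(uplinks * uplink_gbps)

   # 40 x 10 GbE server-facing ports against 4 x 40 GbE uplinks -> 2.5:1
   print(oversubscription(server_ports=40, server_gbps=10,
                          uplinks=4, uplink_gbps=40))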
Compute (server) hardware selection
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Consider the following factors when selecting compute (server) hardware:

* Server density
  A measure of how many servers can fit into a given measure of physical
  space, such as a rack unit [U].

* Resource capacity
  The number of CPU cores, how much RAM, or how much storage a given server
  delivers.

* Expandability
  The number of additional resources you can add to a server before it
  reaches capacity.

* Cost
  The relative cost of the hardware weighed against the level of design
  effort needed to build the system.

Weigh these considerations against each other to determine the best design for
the desired purpose. For example, increasing server density means sacrificing
resource capacity or expandability. Increasing resource capacity and
expandability can increase cost but decrease server density. Decreasing cost
often means decreasing supportability, server density, resource capacity, and
expandability.

For a compute-focused cloud, emphasis should be on server hardware that can
offer more CPU sockets, more CPU cores, and more RAM; network connectivity and
storage capacity are secondary considerations. The hardware still needs to
provide enough network connectivity and storage capacity to meet the user
requirements.

When designing an OpenStack cloud architecture, you must consider whether you
intend to scale up or scale out. Selecting a smaller number of larger hosts,
or a larger number of smaller hosts, depends on a combination of factors:
cost, power, cooling, physical rack and floor space, support and warranty, and
manageability.

Consider the following when selecting a server hardware form factor suited to
your OpenStack design architecture:

* Most blade servers can support dual-socket multi-core CPUs. To avoid this
  CPU limit, select ``full width`` or ``full height`` blades. Be aware,
  however, that this also decreases server density. For example, high-density
  blade servers such as HP BladeSystem or Dell PowerEdge M1000e support up to
  16 servers in only ten rack units when using half-height blades; full-height
  blades result in only eight servers per ten rack units.

* 1U rack-mounted servers can offer greater server density than a blade server
  solution, but are often limited to dual-socket, multi-core CPU
  configurations. It is possible to place forty 1U servers in a rack,
  providing space for the top of rack (ToR) switches, compared to 32
  full-width blade servers.

  To obtain greater than dual-socket support in a 1U rack-mount form factor,
  customers need to buy their systems from Original Design Manufacturers
  (ODMs) or second-tier manufacturers.

  .. warning::

     This may cause issues for organizations that have preferred vendor
     policies or concerns with support and hardware warranties of non-tier 1
     vendors.

* 2U rack-mounted servers provide quad-socket, multi-core CPU support, but
  with a corresponding decrease in server density (half the density offered by
  1U rack-mounted servers).

* Larger rack-mounted servers, such as 4U servers, often provide even greater
  CPU capacity, commonly supporting four or even eight CPU sockets. These
  servers have greater expandability, but much lower server density, and are
  often more expensive.

* ``Sled servers`` are rack-mounted servers that support multiple independent
  servers in a single 2U or 3U enclosure. These deliver higher density as
  compared to typical 1U or 2U rack-mounted servers. For example, many sled
  servers offer four independent dual-socket nodes in 2U for a total of eight
  CPU sockets in 2U.

Other factors that influence server hardware selection for an OpenStack design
architecture include:

Instance density
   More hosts are required to support the anticipated scale if the design
   architecture uses dual-socket hardware designs.

   For a general purpose OpenStack cloud, sizing is an important
   consideration.
   The expected or anticipated number of instances that each hypervisor can
   host is a common meter used in sizing the deployment. The selected server
   hardware needs to support the expected or anticipated instance density.

Host density
   Another option to address the higher host count is to use a quad-socket
   platform. Taking this approach decreases host density, which also increases
   rack count. This configuration affects the number of power connections and
   also impacts network and cooling requirements.

   Physical data centers have limited physical space, power, and cooling. The
   number of hosts (or hypervisors) that can be fitted into a given metric
   (rack, rack unit, or floor tile) is another important method of sizing.
   Floor weight is an often overlooked consideration. The data center floor
   must be able to support the weight of the proposed number of hosts within a
   rack or set of racks. These factors need to be applied as part of the host
   density calculation and server hardware selection.

Power and cooling density
   The power and cooling density requirements might be lower than with blade,
   sled, or 1U server designs due to lower host density (by using 2U, 3U, or
   even 4U server designs). For data centers with older infrastructure, this
   might be a desirable feature.

   Data centers have a specified amount of power fed to a given rack or set of
   racks. Older data centers may have a power density as low as 20 amps per
   rack, while more recent data centers can be architected to support power
   densities as high as 120 amps per rack. The selected server hardware must
   take power density into account.

Network connectivity
   The selected server hardware must have the appropriate number of network
   connections, as well as the right type of network connections, in order to
   support the proposed architecture. Ensure that, at a minimum, there are at
   least two diverse network connections coming into each rack.

The selection of form factors or architectures affects the selection of server
hardware. Ensure that the selected server hardware is configured to support
enough storage capacity (or storage expandability) to match the requirements
of the selected scale-out storage solution. Similarly, the network
architecture impacts the server hardware selection and vice versa.

Hardware for general purpose OpenStack cloud
--------------------------------------------

Hardware for a general purpose OpenStack cloud should reflect a cloud with no
pre-defined usage model, designed to run a wide variety of applications with
varying resource usage requirements. These applications include any of the
following:

* RAM-intensive

* CPU-intensive

* Storage-intensive

Certain hardware form factors may better suit a general purpose OpenStack
cloud due to the requirement for an equal (or nearly equal) balance of
resources. Server hardware must provide the following:

* Equal (or nearly equal) balance of compute capacity (RAM and CPU)

* Network capacity (number and speed of links)

* Storage capacity (gigabytes or terabytes as well as Input/Output Operations
  Per Second (:term:`IOPS`))

The best form factor for server hardware supporting a general purpose
OpenStack cloud is driven by outside business and cost factors. No single
reference architecture applies to all implementations; the decision must flow
from user requirements, technical considerations, and operational
considerations.
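As a minimal sizing sketch tying instance density to host count, the following
estimates how many hypervisors a target instance population requires, with
headroom for failures and maintenance. The instances-per-host figure and spare
fraction are assumptions to be replaced with your own measurements.

.. code-block:: python

   # Minimal sizing sketch: hosts needed for a target instance count.
   # The instances-per-host figure and spare fraction are assumptions.
   import math

   def hosts_required(target_instances, instances_per_host,
                      spare_fraction=0.2):
       """Host count including spare capacity for failures/maintenance."""
       base = math.ceil(target_instances / float(instances_per_host))
       return int(math.ceil(base * (1 + spare_fraction)))

   print(hosts_required(target_instances=1000, instances_per_host=40))  # 30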
Selecting storage hardware
~~~~~~~~~~~~~~~~~~~~~~~~~~

Storage hardware architecture is determined by the selected storage
architecture. Select the storage architecture by evaluating possible solutions
against critical factors: user requirements, technical considerations, and
operational considerations. Consider the following factors when selecting
storage hardware:

Cost
   Storage can be a significant portion of the overall system cost. For an
   organization that is concerned with vendor support, a commercial storage
   solution is advisable, although it comes with a higher price tag. If
   minimizing initial capital expenditure is a priority, a design based on
   commodity hardware is appropriate. The trade-off is potentially higher
   support costs and a greater risk of incompatibility and interoperability
   issues.

Performance
   The latency of storage I/O requests indicates performance. Performance
   requirements affect which solution you choose.

Scalability
   Scalability, along with expandability, is a major consideration in a
   general purpose OpenStack cloud. It might be difficult to predict the final
   intended size of the implementation as there are no established usage
   patterns for a general purpose cloud. It might become necessary to expand
   the initial deployment in order to accommodate growth and user demand.

Expandability
   Expandability is a major architecture factor for storage solutions in a
   general purpose OpenStack cloud. A storage solution that expands to 50 PB
   is considered more expandable than a solution that only scales to 10 PB.
   This meter is related to scalability, which is the measure of a solution's
   performance as it expands.

General purpose cloud storage requirements
------------------------------------------

Using a scale-out storage solution with direct-attached storage (DAS) in the
servers is well suited for a general purpose OpenStack cloud. Cloud services
requirements determine your choice of scale-out solution. You need to
determine if a single, highly expandable and vertically scalable, centralized
storage array is suitable for your design. After determining an approach,
select the storage hardware based on these criteria.

This list expands upon the potential impacts of including a particular storage
architecture (and corresponding storage hardware) in the design for a general
purpose OpenStack cloud:

Connectivity
   If storage protocols other than Ethernet are part of the storage solution,
   ensure the appropriate hardware has been selected. If a centralized storage
   array is selected, ensure that the hypervisor will be able to connect to
   that storage array for image storage.

Usage
   How the particular storage architecture will be used is critical for
   determining the architecture. Some of the configurations that will
   influence the architecture include whether it will be used by the
   hypervisors for ephemeral instance storage, or if OpenStack Object Storage
   will use it for object storage.

Instance and image locations
   Where instances and images will be stored will influence the architecture.

Server hardware
   If the solution is a scale-out storage architecture that includes DAS, it
   will affect the server hardware selection. This could ripple into the
   decisions that affect host density, instance density, power density,
   OS-hypervisor, management tools, and others.

A general purpose OpenStack cloud has multiple options.
The key factors that will have an influence on selection of storage hardware
for a general purpose OpenStack cloud are as follows:

Capacity
   Hardware resources selected for the resource nodes should be capable of
   supporting enough storage for the cloud services. Defining the initial
   requirements and ensuring the design can support adding capacity is
   important. Hardware nodes selected for object storage should be capable of
   supporting a large number of inexpensive disks with no reliance on RAID
   controller cards. Hardware nodes selected for block storage should be
   capable of supporting high speed storage solutions and RAID controller
   cards to provide performance and redundancy to storage at a hardware level.
   Selecting hardware RAID controllers that automatically repair damaged
   arrays will assist with the replacement and repair of degraded or failed
   storage devices.

Performance
   Disks selected for object storage services do not need to be
   fast-performing disks. We recommend that object storage nodes take
   advantage of the best cost per terabyte available for storage. In contrast,
   disks chosen for block storage services should take advantage of
   performance boosting features that may entail the use of SSDs or flash
   storage to provide high performance block storage pools. Storage
   performance of ephemeral disks used for instances should also be taken into
   consideration.

Fault tolerance
   Object storage resource nodes have no requirements for hardware fault
   tolerance or RAID controllers. It is not necessary to plan for fault
   tolerance within the object storage hardware because the object storage
   service provides replication between zones as a feature of the service.
   Block storage nodes, compute nodes, and cloud controllers should all have
   fault tolerance built in at the hardware level by making use of hardware
   RAID controllers and varying levels of RAID configuration. The level of
   RAID chosen should be consistent with the performance and availability
   requirements of the cloud.
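To make the capacity point concrete, the following sketch converts raw object
storage capacity into usable capacity under replication. It assumes triple
replication, a common Object Storage configuration; every figure is an
illustrative placeholder.

.. code-block:: python

   # Raw-to-usable capacity sketch for a replicated object store.
   # Assumes three replicas, a common Object Storage configuration.

   def usable_capacity_tb(disks_per_node, disk_tb, nodes, replicas=3):
       """Usable terabytes once every object is stored `replicas` times."""
       raw = disks_per_node * disk_tb * nodes
       return raw / float(replicas)

   # e.g. 10 nodes of 24 x 4 TB disks under 3x replication -> 320 TB
   print(usable_capacity_tb(disks_per_node=24, disk_tb=4, nodes=10))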
Storage-focused cloud storage requirements
------------------------------------------

Storage-focused OpenStack clouds must address I/O intensive workloads. These
workloads are not CPU intensive, nor are they consistently network intensive.
The network may be heavily utilized to transfer storage data, but these
workloads are not otherwise network intensive.

The selection of storage hardware determines the overall performance and
scalability of a storage-focused OpenStack design architecture. Several
factors impact the design process.

Latency is a key consideration in a storage-focused OpenStack cloud. Using
solid-state disks (SSDs) to minimize latency, and to reduce CPU delays caused
by waiting for the storage, increases performance. Use RAID controller cards
in compute hosts to improve the performance of the underlying disk subsystem.

Depending on the storage architecture, you can adopt a scale-out solution, or
use a highly expandable and scalable centralized storage array. If a
centralized storage array meets your requirements, then the array vendor
determines the hardware selection. It is possible to build a storage array
using commodity hardware with open source software, but this requires people
with expertise to build such a system.

On the other hand, a scale-out storage solution that uses direct-attached
storage (DAS) in the servers may be an appropriate choice. This requires
configuration of the server hardware to support the storage solution.

Considerations affecting the storage architecture (and corresponding storage
hardware) of a storage-focused OpenStack cloud include:

Connectivity
   Ensure the connectivity matches the storage solution requirements. We
   recommend confirming that the network characteristics minimize latency to
   boost the overall performance of the design.

Latency
   Determine if the use case has consistent or highly variable latency.

Throughput
   Ensure that the storage solution throughput is optimized for your
   application requirements.

Server hardware
   Use of DAS impacts the server hardware choice and affects host density,
   instance density, power density, OS-hypervisor, and management tools.

diff --git a/doc/arch-design-draft/source/operator-requirements-licensing.rst b/doc/arch-design-draft/source/operator-requirements-licensing.rst
new file mode 100644
index 0000000000..b48f8774cf
--- /dev/null
+++ b/doc/arch-design-draft/source/operator-requirements-licensing.rst
@@ -0,0 +1,33 @@

=========
Licensing
=========

The many different forms of license agreements for software are often written
with the use of dedicated hardware in mind. This model is relevant for the
cloud platform itself, including the hypervisor operating system and
supporting software for items such as database, RPC, backup, and so on.
Careful consideration is needed when offering Compute service instances and
applications to end users of the cloud, since the license terms for that
software may need adjustment to operate economically in the cloud.

Multi-site OpenStack deployments present additional licensing considerations
over and above regular OpenStack clouds, particularly where site licenses are
in use to provide cost efficient access to software licenses. The licensing
for host operating systems, guest operating systems, OpenStack distributions
(if applicable), software-defined infrastructure including network controllers
and storage systems, and even individual applications needs to be evaluated.

Topics to consider include:

* The definition of what constitutes a site in the relevant licenses, as the
  term does not necessarily denote a geographic or otherwise physically
  isolated location.

* Differentiations between "hot" (active) and "cold" (inactive) sites, where
  significant savings may be made in situations where one site is a cold
  standby for disaster recovery purposes only.

* Certain locations might require local vendors to provide support and
  services for each site, which may vary with the licensing agreement in
  place.

diff --git a/doc/arch-design-draft/source/operator-requirements-logging-monitoring.rst b/doc/arch-design-draft/source/operator-requirements-logging-monitoring.rst
new file mode 100644
index 0000000000..74a5f2ba71
--- /dev/null
+++ b/doc/arch-design-draft/source/operator-requirements-logging-monitoring.rst
@@ -0,0 +1,27 @@

======================
Logging and monitoring
======================

OpenStack clouds require appropriate monitoring platforms to catch and manage
errors.

.. note::

   We recommend leveraging existing monitoring systems to see if they are
   able to effectively monitor an OpenStack environment.

Specific meters that are critically important to capture include:

* Image disk utilization

* Response time to the Compute API

A simple probe for the second meter is sketched after this list.
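The following hypothetical probe times a lightweight authenticated request
against the Compute API. The endpoint and token are placeholders for your own
deployment, and the ``requests`` library is an assumed dependency rather than
anything this guide prescribes.

.. code-block:: python

   # Hypothetical probe: time one authenticated Compute API request.
   # The endpoint and token are placeholders for your own deployment.
   import time

   import requests

   def compute_api_latency(endpoint, token):
       """Return the response time, in seconds, of a GET /servers call."""
       start = time.time()
       resp = requests.get(endpoint + "/servers",
                           headers={"X-Auth-Token": token}, timeout=10)
       resp.raise_for_status()
       return time.time() - start

   print("%.3fs" % compute_api_latency(
       "https://nova.example.com:8774/v2.1", "TOKEN"))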
Logging and monitoring does not significantly differ for a multi-site
OpenStack cloud. The tools described in the `Logging and monitoring chapter
`__ of the Operations Guide remain applicable. Logging and monitoring can be
provided on a per-site basis, as well as in a common centralized location.

When attempting to deploy logging and monitoring facilities to a centralized
location, care must be taken with the load placed on the inter-site networking
links.

diff --git a/doc/arch-design-draft/source/operator-requirements-network-design.rst b/doc/arch-design-draft/source/operator-requirements-network-design.rst
new file mode 100644
index 0000000000..239d2b7a58
--- /dev/null
+++ b/doc/arch-design-draft/source/operator-requirements-network-design.rst
@@ -0,0 +1,59 @@

==============
Network design
==============

The network design for an OpenStack cluster includes decisions regarding the
interconnect needs within the cluster, the need to allow clients to access
their resources, and the need for operators to access the cluster for
maintenance. The bandwidth, latency, and reliability of these networks need
consideration.

Make additional design decisions about monitoring and alarming. This can be an
internal responsibility or the responsibility of the external provider. In the
case of using an external provider, service level agreements (SLAs) likely
apply. In addition, other operational considerations such as bandwidth,
latency, and jitter can be part of an SLA.

Consider the ability to upgrade the infrastructure. As demand for network
resources increases, operators add additional IP address blocks and add
additional bandwidth capacity. In addition, consider managing hardware and
software life cycle events, for example upgrades, decommissioning, and
outages, while avoiding service interruptions for tenants.

Factor maintainability into the overall network design. This includes the
ability to manage and maintain IP addresses as well as the use of overlay
identifiers including VLAN tag IDs, GRE tunnel IDs, and MPLS tags. As an
example, if you need to change all of the IP addresses on a network (a process
known as renumbering), the design must support this function.

Address network-focused applications when considering certain operational
realities. For example, consider the impending exhaustion of IPv4 addresses,
the migration to IPv6, and the use of private networks to segregate different
types of traffic that an application receives or generates. In the case of
IPv4 to IPv6 migrations, applications should follow best practices for storing
IP addresses. We recommend you avoid relying on IPv4 features that did not
carry over to the IPv6 protocol or have differences in implementation.

To segregate traffic, allow applications to create a private tenant network
for database and storage network traffic. Use a public network for services
that require direct client access from the internet. Upon segregating the
traffic, consider quality of service (QoS) and security to ensure each network
has the required level of service.

Finally, consider the routing of network traffic. For some applications,
develop a complex policy framework for routing. To create a routing policy
that satisfies business requirements, consider the economic cost of
transmitting traffic over expensive links versus cheaper links, in addition to
bandwidth, latency, and jitter requirements.

Additionally, consider how to respond to network events.
As an example, how load transfers from one link to another during a failure
scenario could be a factor in the design. If you do not plan network capacity
correctly, failover traffic could overwhelm other ports or network links and
create a cascading failure scenario. In this case, traffic that fails over to
one link overwhelms that link and then moves to the subsequent links until all
network traffic stops.

diff --git a/doc/arch-design-draft/source/operator-requirements-ops-access.rst b/doc/arch-design-draft/source/operator-requirements-ops-access.rst
new file mode 100644
index 0000000000..cf37b49dd4
--- /dev/null
+++ b/doc/arch-design-draft/source/operator-requirements-ops-access.rst
@@ -0,0 +1,20 @@

==========================
Operator access to systems
==========================

As more applications are migrated into a cloud-based environment, systems that
are critical for cloud operations end up hosted within the cloud that is being
operated. Consider how operators will access the systems and tools required to
resolve a major incident.

If a significant portion of the cloud is on externally managed systems,
prepare for situations where it may not be possible to make changes.
Additionally, providers may differ on how infrastructure must be managed and
exposed. This can lead to delays in root cause analysis where each insists the
blame lies with the other provider.

Ensure that the network structure connects all clouds to form an integrated
system, keeping in mind the state of handoffs. These handoffs must be as
reliable as possible and introduce as little latency as possible to ensure the
best performance of the overall system.

diff --git a/doc/arch-design-draft/source/operator-requirements-policy-management.rst b/doc/arch-design-draft/source/operator-requirements-policy-management.rst
new file mode 100644
index 0000000000..f80d38c004
--- /dev/null
+++ b/doc/arch-design-draft/source/operator-requirements-policy-management.rst
@@ -0,0 +1,14 @@

=================
Policy management
=================

OpenStack provides a default set of Role Based Access Control (RBAC) policies,
defined in a ``policy.json`` file, for each service. Operators edit these
files to customize the policies for their OpenStack installation. If the
application of consistent RBAC policies across sites is a requirement, then it
is necessary to ensure proper synchronization of the ``policy.json`` files to
all installations.

This must be done using system administration tools such as rsync, as
functionality for synchronizing policies across regions is not currently
provided within OpenStack.
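A minimal sketch of such synchronization, assuming SSH access from an admin
host and using placeholder host names and paths, might drive rsync from
Python's ``subprocess``:

.. code-block:: python

   # Minimal sketch: push one canonical policy.json to every site.
   # Host names and paths are placeholders; assumes SSH access.
   import subprocess

   CONTROLLERS = ("ctl1.site-a.example.com", "ctl1.site-b.example.com")
   POLICY = "/etc/nova/policy.json"

   for host in CONTROLLERS:
       subprocess.check_call(
           ["rsync", "-av", POLICY, "%s:%s" % (host, POLICY)])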
diff --git a/doc/arch-design-draft/source/operator-requirements-quota-management.rst b/doc/arch-design-draft/source/operator-requirements-quota-management.rst
new file mode 100644
index 0000000000..859a600c30
--- /dev/null
+++ b/doc/arch-design-draft/source/operator-requirements-quota-management.rst
@@ -0,0 +1,24 @@

================
Quota management
================

Quotas are used to set operational limits to prevent system capacities from
being exhausted without notification. They are currently enforced at the
tenant (or project) level rather than at the user level.

Quotas are defined on a per-region basis. Operators can define identical
quotas for tenants in each region of the cloud to provide a consistent
experience, or even create a process for synchronizing allocated quotas across
regions. It is important to note that only the operational limits imposed by
the quotas will be aligned; consumption of quotas by users will not be
reflected between regions.

For example, given a cloud with two regions, if the operator grants a user a
quota of 25 instances in each region then that user may launch a total of 50
instances spread across both regions. They may not, however, launch more than
25 instances in any single region.

For more information on managing quotas refer to the `Managing projects and
users chapter `__ of the OpenStack Operators Guide.
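As a hedged sketch of how identical quotas might be applied in every region,
the following uses python-novaclient with a keystoneauth session. Credentials,
region names, and the project ID are illustrative placeholders.

.. code-block:: python

   # Hypothetical sketch: set the same instance quota in each region.
   # Credentials, regions, and the project ID are placeholders.
   from keystoneauth1.identity import v3
   from keystoneauth1.session import Session
   from novaclient import client

   auth = v3.Password(auth_url="https://keystone.example.com:5000/v3",
                      username="admin", password="secret",
                      project_name="admin",
                      user_domain_id="default", project_domain_id="default")
   sess = Session(auth=auth)

   for region in ("RegionOne", "RegionTwo"):
       nova = client.Client("2", session=sess, region_name=region)
       nova.quotas.update("PROJECT_ID", instances=25)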
diff --git a/doc/arch-design-draft/source/operator-requirements-skills-training.rst b/doc/arch-design-draft/source/operator-requirements-skills-training.rst
new file mode 100644
index 0000000000..e92a6d2c8a
--- /dev/null
+++ b/doc/arch-design-draft/source/operator-requirements-skills-training.rst
@@ -0,0 +1,12 @@

===================
Skills and training
===================

Projecting growth for storage, networking, and compute is only one aspect of a
growth plan for running OpenStack at massive scale. Growing and nurturing
development and operational staff is an additional consideration. Sending team
members to OpenStack conferences and meetup events, and encouraging active
participation in the mailing lists and committees, is an important way to
maintain skills and forge relationships in the community. For a list of
OpenStack training providers in the marketplace, see:
http://www.openstack.org/marketplace/training/.

diff --git a/doc/arch-design-draft/source/operator-requirements-sla.rst b/doc/arch-design-draft/source/operator-requirements-sla.rst
new file mode 100644
index 0000000000..771f172f28
--- /dev/null
+++ b/doc/arch-design-draft/source/operator-requirements-sla.rst
@@ -0,0 +1,89 @@

==================
SLA considerations
==================

Service-level agreements (SLAs) are contractual obligations that provide
assurances for service availability. They define the levels of availability
that drive the technical design, often with penalties for not meeting
contractual obligations. When designing an OpenStack cloud, factoring in
promises of availability implies a certain level of redundancy and resiliency;
the expectations set by the SLA directly affect when and where you should
implement redundancy and high availability.

SLA terms that affect the design include:

* API availability guarantees implying multiple infrastructure services and
  highly available load balancers.

* Network uptime guarantees affecting switch design, which might require
  redundant switching and power.

* Networking security policy requirements that need to be factored in to
  deployments.

In any environment larger than just a few hosts, it is important to note that
there are two separate areas that might be subject to an SLA. Firstly, there
are the services that provide actual virtualization, networking, and storage,
which customers of the environment are most likely to want to be continuously
available. This is often referred to as the 'Data Plane'.

Secondly, there are the ancillary services such as API endpoints, and the
various services that control CRUD operations. These are often referred to as
the 'Control Plane'. The services in this category are usually subject to a
different SLA expectation and therefore may be better hosted on separate
hardware, or at least in separate containers, from the Data Plane services.

To effectively run cloud installations, initial downtime planning includes
creating processes and architectures that support the following:

* Planned (maintenance)
* Unplanned (system faults)

It is important to determine as part of the SLA negotiation which party is
responsible for monitoring and starting up Compute service instances should an
outage occur which shuts them down.

The resiliency of the overall system and of individual components is dictated
by the requirements of the SLA, meaning that designing for :term:`high
availability (HA)` can have cost ramifications.

Upgrading, patching, and changing configuration items may require downtime for
some services. In these cases, stopping services that form the Control Plane
may leave the Data Plane unaffected, while actions such as live migration of
Compute instances may be required in order to perform any actions that require
downtime to Data Plane components while still meeting SLA expectations.

Note that there are many services outside the realms of pure OpenStack code
which affect the ability of any given design to meet the SLA, including:

* Database services, such as ``MySQL`` or ``PostgreSQL``.
* Services providing RPC, such as ``RabbitMQ``.
* External network attachments.
* Physical constraints such as power, rack space, network cabling, etc.
* Shared storage including SAN based arrays, storage clusters such as
  ``Ceph``, and/or NFS services.

Depending on the design, some Network service functions may fall into both the
Control and Data Plane categories. For example, the neutron L3 agent service
may be considered a Control Plane component, but the routers themselves would
be Data Plane.

A given set of requirements may dictate an SLA under which some services need
HA and some do not.

In a design with multiple regions, the SLA would also need to take into
consideration the use of shared services such as the Identity service,
Dashboard, and so on.

Any SLA negotiation must also take into account the reliance on third parties
for critical aspects of the design. For example, if there is an existing SLA
on a component such as a storage system, the cloud SLA must take this
limitation into account. If the required SLA for the cloud exceeds the agreed
uptime levels of the components comprising that cloud, additional redundancy
is required. This consideration is critical to review in a hybrid cloud
design, where multiple third parties are involved.

diff --git a/doc/arch-design-draft/source/operator-requirements-software-selection.rst b/doc/arch-design-draft/source/operator-requirements-software-selection.rst
new file mode 100644
index 0000000000..20b01b9c7b
--- /dev/null
+++ b/doc/arch-design-draft/source/operator-requirements-software-selection.rst
@@ -0,0 +1,226 @@

==================
Software selection
==================

Software selection, particularly for a general purpose OpenStack architecture
design, involves three areas:

* Operating system (OS) and hypervisor

* OpenStack components

* Supplemental software

Operating system and hypervisor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The operating system (OS) and hypervisor have a significant impact on the
overall design.
Selecting a particular operating system and hypervisor can directly affect
server hardware selection. Make sure the storage hardware and topology support
the selected operating system and hypervisor combination. Also ensure the
networking hardware selection and topology will work with the chosen operating
system and hypervisor combination.

Some areas that could be impacted by the selection of OS and hypervisor
include:

Cost
   Selecting a commercially supported hypervisor, such as Microsoft Hyper-V,
   will result in a different cost model than community-supported open source
   hypervisors such as :term:`KVM` or :term:`Xen`. When comparing open source
   OS solutions, choosing Ubuntu over Red Hat (or vice versa) will have an
   impact on cost due to support contracts.

Support
   Depending on the selected hypervisor, staff should have the appropriate
   training and knowledge to support the selected OS and hypervisor
   combination. If they do not, training will need to be provided, which could
   have a cost impact on the design.

Management tools
   The management tools used for Ubuntu and KVM differ from the management
   tools for VMware vSphere. Although both OS and hypervisor combinations are
   supported by OpenStack, there will be very different impacts to the rest of
   the design as a result of the selection of one combination versus the
   other.

Scale and performance
   Ensure that selected OS and hypervisor combinations meet the appropriate
   scale and performance requirements. The chosen architecture will need to
   meet the targeted instance-host ratios with the selected OS-hypervisor
   combinations.

Security
   Ensure that the design can accommodate regular periodic installations of
   application security patches while maintaining required workloads. The
   frequency of security patches for the proposed OS-hypervisor combination
   will have an impact on performance, and the patch installation process
   could affect maintenance windows.

Supported features
   Determine which OpenStack features are required. This will often determine
   the selection of the OS-hypervisor combination. Some features are only
   available with specific operating systems or hypervisors.

Interoperability
   You will need to consider how the OS and hypervisor combination interacts
   with other operating systems and hypervisors, including other software
   solutions. Operational troubleshooting tools for one OS-hypervisor
   combination may differ from the tools used for another OS-hypervisor
   combination and, as a result, the design will need to address whether the
   two sets of tools need to interoperate.

OpenStack components
~~~~~~~~~~~~~~~~~~~~

Selecting which OpenStack components are included in the overall design is
important. Some OpenStack components, like compute and Image service, are
required in every architecture. Other components, like Orchestration, are not
always required.

A compute-focused OpenStack design architecture may contain the following
components:

* Identity (keystone)

* Dashboard (horizon)

* Compute (nova)

* Object Storage (swift)

* Image (glance)

* Networking (neutron)

* Orchestration (heat)

  .. note::

     A compute-focused design is less likely to include OpenStack Block
     Storage. However, there may be some situations where the need for
     performance requires a block storage component to improve data I/O.
Excluding certain OpenStack components can limit or constrain the
functionality of other components. For example, if the architecture includes
Orchestration but excludes Telemetry, then the design will not be able to take
advantage of Orchestration's auto-scaling functionality. It is important to
research the component interdependencies in conjunction with the technical
requirements before deciding on the final architecture.

Networking software
~~~~~~~~~~~~~~~~~~~

OpenStack Networking (neutron) provides a wide variety of networking services
for instances. There are many additional networking software packages that can
be useful when managing OpenStack components. Some examples include:

* Software to provide load balancing

* Network redundancy protocols

* Routing daemons

Some of these software packages are described in more detail in the `OpenStack
network nodes chapter `_ in the OpenStack High Availability Guide.

For both general purpose and compute-focused OpenStack clouds, the OpenStack
infrastructure components need to be highly available. If the design does not
include hardware load balancing, networking software packages like HAProxy
will need to be included.

Management software
~~~~~~~~~~~~~~~~~~~

Management software includes software for providing:

* Clustering

* Logging

* Monitoring

* Alerting

.. important::

   The factors for determining which software packages in this category to
   select are outside the scope of this design guide.

The selected supplemental software solution impacts the overall OpenStack
cloud design.

The inclusion of clustering software, such as Corosync or Pacemaker, is
primarily determined by the availability requirements of the cloud
infrastructure and the complexity of supporting the configuration after it is
deployed. The `OpenStack High Availability Guide `_ provides more details on
the installation and configuration of Corosync and Pacemaker, should these
packages need to be included in the design.

Operational considerations determine the requirements for logging, monitoring,
and alerting. Each of these sub-categories includes various options.

For example, in the logging sub-category you could select Logstash, Splunk,
Log Insight, or another log aggregation-consolidation tool. Store logs in a
centralized location to facilitate performing analytics against the data. Log
data analytics engines can also provide automation and issue notification by
providing a mechanism to both alert and automatically attempt to remediate
some of the more commonly known issues.

If these software packages are required, the design must account for the
additional resource consumption (CPU, RAM, storage, and network bandwidth).
Some other potential design impacts include:

* OS-hypervisor combination
  Ensure that the selected logging, monitoring, or alerting tools support the
  proposed OS-hypervisor combination.

* Network hardware
  The network hardware selection needs to be supported by the logging,
  monitoring, and alerting software.

Database software
~~~~~~~~~~~~~~~~~

Most OpenStack components require access to back-end database services to
store state and configuration information.
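As a quick, hedged sketch, connectivity to the chosen back end can be verified
with SQLAlchemy, the library OpenStack services use underneath oslo.db. The
connection string below is a placeholder for your own deployment.

.. code-block:: python

   # Minimal sketch: verify connectivity to the chosen database back end.
   # The connection string is a placeholder; requires SQLAlchemy and a
   # suitable DB-API driver (for example, PyMySQL).
   from sqlalchemy import create_engine, text

   engine = create_engine(
       "mysql+pymysql://nova:NOVA_DBPASS@db.example.com/nova")
   with engine.connect() as conn:
       print(conn.execute(text("SELECT VERSION()")).scalar())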
Choose an appropriate back-end database that satisfies the availability and
fault tolerance requirements of the OpenStack services.

MySQL is the default database for OpenStack, but other compatible databases
are available.

.. note::

   Telemetry uses MongoDB.

The chosen high availability database solution changes according to the
selected database. MySQL, for example, provides several options. Use a
replication technology such as Galera for active-active clustering. For
active-passive configurations, use some form of shared storage. Each of these
potential solutions has an impact on the design:

* Solutions that employ Galera/MariaDB require at least three MySQL nodes.

* MongoDB has its own design considerations for high availability.

* OpenStack design, generally, does not include shared storage. However, for
  some high availability designs, certain components might require it
  depending on the specific implementation.

diff --git a/doc/arch-design-draft/source/operator-requirements-upgrades.rst b/doc/arch-design-draft/source/operator-requirements-upgrades.rst
new file mode 100644
index 0000000000..72307cec7d
--- /dev/null
+++ b/doc/arch-design-draft/source/operator-requirements-upgrades.rst
@@ -0,0 +1,50 @@

========
Upgrades
========

Running OpenStack with a focus on availability requires striking a balance
between stability and features. For example, it might be tempting to run an
older stable release branch of OpenStack to make deployments easier. However,
known issues that may be of some concern, or that have only minimal impact in
smaller deployments, could become pain points as scale increases. Recent
releases may address well known issues. The OpenStack community can help
resolve reported issues by applying the collective expertise of the OpenStack
developers.

In multi-site OpenStack clouds deployed using regions, sites are independent
OpenStack installations which are linked together using shared centralized
services such as OpenStack Identity. At a high level, the recommended order of
operations to upgrade an individual OpenStack environment is (see the
`Upgrades chapter `_ of the Operations Guide for details):

#. Upgrade the OpenStack Identity service (keystone).

#. Upgrade the OpenStack Image service (glance).

#.
Upgrade OpenStack Compute (nova), including networking components. + +#. Upgrade OpenStack Block Storage (cinder). + +#. Upgrade the OpenStack dashboard (horizon). + +The process for upgrading a multi-site environment is not significantly +different: + +#. Upgrade the shared OpenStack Identity service (keystone) deployment. + +#. Upgrade the OpenStack Image service (glance) at each site. + +#. Upgrade OpenStack Compute (nova), including networking components, at + each site. + +#. Upgrade OpenStack Block Storage (cinder) at each site. + +#. Upgrade the OpenStack dashboard (horizon), at each site or in the + single central location if it is shared. + +Compute upgrades within each site can also be performed in a rolling +fashion. Compute controller services (API, Scheduler, and Conductor) can +be upgraded prior to upgrading of individual compute nodes. This allows +operations staff to keep a site operational for users of Compute +services while performing an upgrade. diff --git a/doc/arch-design-draft/source/operator-requirements.rst b/doc/arch-design-draft/source/operator-requirements.rst index 2a02e08a67..27b8f92c1a 100644 --- a/doc/arch-design-draft/source/operator-requirements.rst +++ b/doc/arch-design-draft/source/operator-requirements.rst @@ -5,9 +5,21 @@ Operator requirements .. toctree:: :maxdepth: 2 - -Introduction -~~~~~~~~~~~~ + operator-requirements-sla.rst + operator-requirements-logging-monitoring.rst + operator-requirements-network-design.rst + operator-requirements-licensing.rst + operator-requirements-support-maintenance.rst + operator-requirements-ops-access.rst + operator-requirements-capacity-planning.rst + operator-requirements-quota-management.rst + operator-requirements-policy-management.rst + operator-requirements-hardware-selection.rst + operator-requirements-software-selection.rst + operator-requirements-external-idp.rst + operator-requirements-upgrades.rst + operator-requirements-bleeding-edge.rst + operator-requirements-skills-training.rst Several operational factors affect the design choices for a general purpose cloud. Operations staff receive tasks regarding the maintenance @@ -51,450 +63,3 @@ teams can work more efficiently because fewer resources are required for these common tasks. Administrators are then free to tackle tasks that are not easy to automate and that have longer-term impacts on the business, for example, capacity planning. - - -SLA Considerations -~~~~~~~~~~~~~~~~~~ - -Service-level agreements (SLAs) are contractual obligations that ensure the -availability of a service. When designing an OpenStack cloud, factoring in -promises of availability implies a certain level of redundancy and resiliency. - -Expectations set by the Service Level Agreements (SLAs) directly affect -knowing when and where you should implement redundancy and high -availability. SLAs are contractual obligations that provide assurances -for service availability. They define the levels of availability that -drive the technical design, often with penalties for not meeting -contractual obligations. - -SLA terms that affect design include: - -* API availability guarantees implying multiple infrastructure services - and highly available load balancers. - -* Network uptime guarantees affecting switch design, which might - require redundant switching and power. - -* Factor in networking security policy requirements in to your - deployments. - -In any environment larger than just a few hosts, it is important to note that -there are two separate areas that might be subject to an SLA. 
-
-To effectively run cloud installations, initial downtime planning
-includes creating processes and architectures that support the
-following:
-
-* Planned downtime (maintenance)
-* Unplanned downtime (system faults)
-
-It is important to determine, as part of the SLA negotiation, which
-party is responsible for monitoring and starting up Compute service
-instances should an outage occur which shuts them down.
-
-The resiliency of the overall system, and of its individual components,
-is dictated by the requirements of the SLA, meaning that designing for
-:term:`high availability (HA)` can have cost ramifications.
-
-Upgrading, patching, and changing configuration items may require
-downtime for some services. Stopping services that form the Control
-Plane may leave the Data Plane unaffected. For actions that require
-downtime to Data Plane components, techniques such as live migration of
-Compute instances may be required in order to continue meeting SLA
-expectations.
-
-Many services outside the realm of the OpenStack code itself affect the
-ability of any given design to meet an SLA, including:
-
-* Database services, such as ``MySQL`` or ``PostgreSQL``.
-* Services providing RPC, such as ``RabbitMQ``.
-* External network attachments.
-* Physical constraints such as power, rack space, network cabling, etc.
-* Shared storage including SAN-based arrays, storage clusters such as
-  ``Ceph``, and/or NFS services.
-
-Depending on the design, some Network service functions may fall into
-both the Control and Data Plane categories. For example, the neutron L3
-agent service may be considered a Control Plane component, but the
-routers it manages are Data Plane components.
-
-A given set of requirements may also dictate an SLA under which some
-services need HA and some do not.
-
-In a design with multiple regions, the SLA would also need to take into
-consideration the use of shared services such as the Identity service,
-Dashboard, and so on.
-
-Any SLA negotiation must also take into account the reliance on third
-parties for critical aspects of the design. For example, if there is an
-existing SLA on a component such as a storage system, the cloud SLA must
-take this limitation into account. If the required SLA for the cloud
-exceeds the agreed uptime levels of the components comprising that
-cloud, additional redundancy is required. This consideration is critical
-to review in a hybrid cloud design, where multiple third parties are
-involved.
-
-Logging and Monitoring
-~~~~~~~~~~~~~~~~~~~~~~
-
-OpenStack clouds require appropriate monitoring platforms to catch and
-manage errors.
-
-.. note::
-
-   We recommend leveraging existing monitoring systems to see if they
-   are able to effectively monitor an OpenStack environment.
-
-Specific meters that are critically important to capture include:
-
-* Image disk utilization
-
-* Response time to the Compute API
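The response time of the Compute API, for example, can be spot-checked
with a simple timed request from a monitoring host. This is an
illustrative sketch only; the endpoint URL and port depend on your
deployment:

.. code-block:: console

   # Print the total time, in seconds, taken by a request to the
   # Compute API endpoint (illustrative URL)
   $ curl -o /dev/null -s -w '%{time_total}\n' http://controller:8774/v2/

A monitoring platform would normally run such checks on a schedule and
alert when the measured latency crosses an agreed threshold.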
-
-Logging and monitoring does not significantly differ for a multi-site
-OpenStack cloud. The tools described in the `Logging and monitoring
-chapter`__ of the Operations Guide remain applicable. Logging and
-monitoring can be provided on a per-site basis, or in a common
-centralized location.
-
-When attempting to deploy logging and monitoring facilities to a
-centralized location, care must be taken with the load placed on the
-inter-site networking links.
-
-
-Network
-~~~~~~~
-
-The network design for an OpenStack cluster includes decisions regarding
-the interconnect needs within the cluster, the need to allow clients to
-access their resources, and the need for operators to access the cluster
-for maintenance. The bandwidth, latency, and reliability of these
-networks need consideration.
-
-Make additional design decisions about monitoring and alarming. This can
-be an internal responsibility or the responsibility of the external
-provider. In the case of using an external provider, service level
-agreements (SLAs) likely apply. In addition, other operational
-considerations such as bandwidth, latency, and jitter can be part of an
-SLA.
-
-Consider the ability to upgrade the infrastructure. As demand for
-network resources increases, operators add additional IP address blocks
-and additional bandwidth capacity. In addition, consider managing
-hardware and software life cycle events, for example upgrades,
-decommissioning, and outages, while avoiding service interruptions for
-tenants.
-
-Factor maintainability into the overall network design. This includes
-the ability to manage and maintain IP addresses as well as the use of
-overlay identifiers including VLAN tag IDs, GRE tunnel IDs, and MPLS
-tags. For example, if you may need to change all of the IP addresses on
-a network, a process known as renumbering, then the design must support
-this function.
-
-Address network-focused applications when considering certain
-operational realities. For example, consider the impending exhaustion of
-IPv4 addresses, the migration to IPv6, and the use of private networks
-to segregate different types of traffic that an application receives or
-generates. In the case of IPv4 to IPv6 migrations, applications should
-follow best practices for storing IP addresses. We recommend you avoid
-relying on IPv4 features that did not carry over to the IPv6 protocol or
-that differ in implementation.
-
-To segregate traffic, allow applications to create a private tenant
-network for database and storage network traffic. Use a public network
-for services that require direct client access from the internet. Upon
-segregating the traffic, consider quality of service (QoS) and security
-to ensure each network has the required level of service.
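As a sketch of how a tenant might create such a segregated network with
the networking CLI, the names and address range below are illustrative
assumptions only:

.. code-block:: console

   # Create a private tenant network and subnet reserved for database
   # traffic (names and CIDR are illustrative)
   $ neutron net-create database-net
   $ neutron subnet-create --name database-subnet database-net 10.0.10.0/24

Security group rules, and any quality of service policy, would then be
applied according to the level of service the network requires.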
-
-Finally, consider the routing of network traffic. For some applications,
-develop a complex policy framework for routing. To create a routing
-policy that satisfies business requirements, consider the economic cost
-of transmitting traffic over expensive links versus cheaper links, in
-addition to bandwidth, latency, and jitter requirements.
-
-Additionally, consider how to respond to network events. As an example,
-how load transfers from one link to another during a failure scenario
-could be a factor in the design. If you do not plan network capacity
-correctly, failover traffic could overwhelm other ports or network links
-and create a cascading failure scenario. In this case, traffic that
-fails over to one link overwhelms that link and then moves to the
-subsequent links until all network traffic stops.
-
-
-Licensing
-~~~~~~~~~
-
-The many different forms of license agreements for software are often
-written with the use of dedicated hardware in mind. This model is
-relevant for the cloud platform itself, including the hypervisor
-operating system and supporting software for items such as databases,
-RPC, backup, and so on. Consider the license terms carefully when
-offering Compute service instances and applications to end users of the
-cloud, since the terms for that software may need some adjustment to be
-able to operate economically in the cloud.
-
-Multi-site OpenStack deployments present additional licensing
-considerations over and above regular OpenStack clouds, particularly
-where site licenses are in use to provide cost efficient access to
-software licenses. The licensing for host operating systems, guest
-operating systems, OpenStack distributions (if applicable),
-software-defined infrastructure including network controllers and
-storage systems, and even individual applications need to be evaluated.
-
-Topics to consider include:
-
-* The definition of what constitutes a site in the relevant licenses,
-  as the term does not necessarily denote a geographic or otherwise
-  physically isolated location.
-
-* Distinctions between "hot" (active) and "cold" (inactive) sites,
-  where significant savings may be made in situations where one site is
-  a cold standby for disaster recovery purposes only.
-
-* Certain locations might require local vendors to provide support and
-  services for each site, which may vary with the licensing agreement in
-  place.
-
-Support and maintainability
-~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-To support and maintain an installation, OpenStack cloud management
-requires operations staff to understand the design architecture. The
-skill level of the operations and engineering staff, and their degree of
-separation, depend on the size and purpose of the installation. Large
-cloud service providers, or telecom providers, are more likely to be
-managed by specially trained, dedicated operations organizations.
-Smaller implementations are more likely to rely on support staff that
-need to take on combined engineering, design, and operations functions.
-
-Maintaining OpenStack installations requires a variety of technical
-skills. You may want to consider using a third-party management company
-with special expertise in managing OpenStack deployments.
-
-Operator access to systems
-~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-As more applications are migrated into cloud-based environments, systems
-that are critical for cloud operations end up hosted within the cloud
-that is being operated. Consider whether operators will still be able to
-access the systems and tools required to resolve a major incident.
-
-If a significant portion of the cloud is on externally managed systems,
-prepare for situations where it may not be possible to make changes.
-Additionally, providers may differ on how infrastructure must be managed
-and exposed. This can lead to delays in root cause analysis where each
-provider insists the blame lies with the other.
-
-Ensure that the network structure connects all clouds to form an
-integrated system, keeping in mind the state of handoffs. These handoffs
-must be as reliable as possible and introduce as little latency as
-possible, to ensure the best performance of the overall system.
-
-Capacity planning
-~~~~~~~~~~~~~~~~~
-
-An important consideration in running a cloud over time is projecting
-growth and utilization trends in order to plan capital expenditures for
-the short and long term. Gather utilization meters for compute, network,
-and storage, along with historical records of these meters. While
-securing major anchor tenants can lead to rapid jumps in the utilization
-rates of all resources, the steady adoption of the cloud inside an
-organization or by consumers in a public offering also creates a steady
-trend of increased utilization.
-
-Capacity constraints for a general purpose cloud environment include:
-
-* Compute limits
-* Storage limits
-
-A relationship exists between the size of the compute environment and
-the OpenStack infrastructure controller nodes required to support it.
-
-Increasing the size of the supporting compute environment increases the
-network traffic and messages, adding load to the controller or
-networking nodes. Effective monitoring of the environment will help with
-capacity decisions on scaling.
-
-Compute nodes automatically attach to OpenStack clouds, making the
-addition of extra compute capacity to an OpenStack cloud a horizontal
-scaling process. Additional processes are required to place nodes into
-appropriate availability zones and host aggregates. When adding
-compute nodes to environments, ensure identical or functionally
-compatible CPUs are used, otherwise live migration features will break.
-It is necessary to add rack capacity or network switches as scaling out
-compute hosts directly affects network and data center resources.
-
-Compute host components can also be upgraded to account for increases in
-demand; this is known as vertical scaling. Upgrading CPUs with more
-cores, or increasing the overall server memory, can add the extra
-capacity needed, depending on whether the running applications are more
-CPU intensive or memory intensive.
-
-Another option is to assess the average workloads and increase the
-number of instances that can run within the compute environment by
-adjusting the overcommit ratio.
-
-.. note::
-
-   It is important to remember that changing the CPU overcommit ratio
-   can have a detrimental effect, potentially increasing the noisy
-   neighbor problem.
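Overcommit ratios are configured in the Compute service. A minimal
sketch, assuming the ratios are tuned in ``nova.conf`` on a compute
node; the values shown are illustrative, not recommendations (the
defaults at the time of writing are 16.0 for CPU and 1.5 for RAM):

.. code-block:: ini

   # /etc/nova/nova.conf fragment: how far the scheduler may overcommit
   # physical CPU and RAM on this host (illustrative values)
   [DEFAULT]
   cpu_allocation_ratio = 8.0
   ram_allocation_ratio = 1.5

Lowering the CPU ratio reduces the number of instances the host can
accept, but also reduces the contention that produces noisy neighbor
behavior.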
-
-Insufficient disk capacity could also have a negative effect on overall
-performance, including CPU and memory usage. Depending on the back-end
-architecture of the OpenStack Block Storage layer, adding capacity may
-mean adding disk shelves to enterprise storage systems or installing
-additional block storage nodes. It may also be necessary to upgrade
-directly attached storage installed in compute hosts, and to add
-capacity to the shared storage that provides additional ephemeral
-storage to instances.
-
-For a deeper discussion on many of these topics, refer to the `OpenStack
-Operations Guide`_.
-
-Quota management
-~~~~~~~~~~~~~~~~
-
-Quotas are used to set operational limits to prevent system capacities
-from being exhausted without notification. They are currently enforced
-at the tenant (or project) level rather than at the user level.
-
-Quotas are defined on a per-region basis. Operators can define identical
-quotas for tenants in each region of the cloud to provide a consistent
-experience, or even create a process for synchronizing allocated quotas
-across regions. It is important to note that only the operational limits
-imposed by the quotas will be aligned; consumption of quotas by users
-will not be reflected between regions.
-
-For example, given a cloud with two regions, if the operator grants a
-user a quota of 25 instances in each region then that user may launch a
-total of 50 instances spread across both regions. They may not, however,
-launch more than 25 instances in any single region.
-
-For more information on managing quotas refer to the `Managing projects
-and users chapter`__ of the OpenStack Operations Guide.
-
-Policy management
-~~~~~~~~~~~~~~~~~
-
-OpenStack provides a default set of Role Based Access Control (RBAC)
-policies, defined in a ``policy.json`` file, for each service. Operators
-edit these files to customize the policies for their OpenStack
-installation. If the application of consistent RBAC policies across
-sites is a requirement, then it is necessary to ensure proper
-synchronization of the ``policy.json`` files to all installations.
-
-This must be done using system administration tools such as rsync, as
-functionality for synchronizing policies across regions is not currently
-provided within OpenStack.
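For example, a scheduled job on the site where policies are maintained
can push the policy file for each service to the other sites. This is an
illustrative sketch only; the host name and path are assumptions:

.. code-block:: console

   # Push the local Compute policy file to a controller at a second
   # site (illustrative host); repeat for each service and each site
   $ rsync -av /etc/nova/policy.json controller.site2.example.com:/etc/nova/policy.json

The same approach applies to the ``policy.json`` files of the other
services, and the job should run from a single authoritative copy to
avoid divergent edits.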
-
-Selecting Hardware
-~~~~~~~~~~~~~~~~~~
-
-
-Integration with external IDP
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-
-Upgrades
-~~~~~~~~
-
-Running OpenStack with a focus on availability requires striking a
-balance between stability and features. For example, it might be
-tempting to run an older stable release branch of OpenStack to make
-deployments easier. However, known issues that are of minor concern, or
-that only have minimal impact in smaller deployments, could become pain
-points as scale increases. Recent releases may address well-known
-issues. The OpenStack community can help resolve reported issues by
-applying the collective expertise of the OpenStack developers.
-
-In multi-site OpenStack clouds deployed using regions, sites are
-independent OpenStack installations which are linked together using
-shared centralized services such as OpenStack Identity. At a high level,
-the recommended order of operations to upgrade an individual OpenStack
-environment is (see the `Upgrades chapter`__ of the Operations Guide for
-details):
-
-#. Upgrade the OpenStack Identity service (keystone).
-
-#. Upgrade the OpenStack Image service (glance).
-
-#. Upgrade OpenStack Compute (nova), including networking components.
-
-#. Upgrade OpenStack Block Storage (cinder).
-
-#. Upgrade the OpenStack dashboard (horizon).
-
-The process for upgrading a multi-site environment is not significantly
-different:
-
-#. Upgrade the shared OpenStack Identity service (keystone) deployment.
-
-#. Upgrade the OpenStack Image service (glance) at each site.
-
-#. Upgrade OpenStack Compute (nova), including networking components, at
-   each site.
-
-#. Upgrade OpenStack Block Storage (cinder) at each site.
-
-#. Upgrade the OpenStack dashboard (horizon), at each site or in the
-   single central location if it is shared.
-
-Compute upgrades within each site can also be performed in a rolling
-fashion. Compute controller services (API, Scheduler, and Conductor) can
-be upgraded prior to upgrading individual compute nodes. This allows
-operations staff to keep a site operational for users of Compute
-services while performing an upgrade.
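As a sketch of what one step in this sequence involves, upgrading a
single service generally means upgrading its packages, migrating its
database schema, and restarting its daemons. The commands below assume
an Ubuntu-based deployment and use the Image service as the example:

.. code-block:: console

   # Upgrade the Image service packages on the node
   $ sudo apt-get update && sudo apt-get install glance
   # Apply any pending database schema migrations
   $ sudo glance-manage db_sync
   # Restart the service daemons so the new code is running
   $ sudo service glance-api restart
   $ sudo service glance-registry restart

Configuration files should be reviewed against the new release's sample
configurations before the restart, and the same pattern repeats for each
service in the order given above.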
-
-
-The bleeding edge
------------------
-
-The number of organizations running at massive scale is a small
-proportion of the OpenStack community; it is therefore important to
-share related issues with the community and be a vocal advocate for
-resolving them. Some issues only manifest when operating at large scale,
-and the number of organizations able to duplicate and validate an issue
-is small, so it is important to document issues and dedicate resources
-to their resolution.
-
-In some cases, the resolution to the problem is ultimately to deploy a
-more recent version of OpenStack. Alternatively, when you must resolve
-an issue in a production environment where rebuilding the entire
-environment is not an option, it is sometimes possible to deploy updates
-to specific underlying components in order to resolve issues or gain
-significant performance improvements. Although this may appear to expose
-the deployment to increased risk and instability, in many cases it is no
-worse than continuing to run with an undiscovered issue.
-
-We recommend building a development and operations organization that is
-responsible for creating desired features, diagnosing and resolving
-issues, and building the infrastructure for large scale continuous
-integration tests and continuous deployment. This helps catch bugs early
-and makes deployments faster and easier. In addition to development
-resources, we also recommend the recruitment of experts in the fields of
-message queues, databases, distributed systems, networking, cloud, and
-storage.
-
-
-Skills and training
-~~~~~~~~~~~~~~~~~~~
-
-Projecting growth for storage, networking, and compute is only one
-aspect of a growth plan for running OpenStack at massive scale. Growing
-and nurturing development and operational staff is an additional
-consideration. Sending team members to OpenStack conferences and meetup
-events, and encouraging active participation in the mailing lists and
-committees, are important ways to maintain skills and forge
-relationships in the community. For a list of OpenStack training
-providers in the marketplace, see:
-http://www.openstack.org/marketplace/training/.