<?xml version="1.0" encoding="UTF-8"?>
<section xmlns="http://docbook.org/ns/docbook"
    xmlns:xi="http://www.w3.org/2001/XInclude"
    xmlns:xlink="http://www.w3.org/1999/xlink"
    version="5.0"
    xml:id="operational-considerations-general-purpose">
    <?dbhtml stop-chunking?>
    <title>Operational considerations</title>
    <para>Many operational factors will affect general purpose cloud
        design choices. In larger installations, it is not uncommon
        for operations staff to be tasked with maintaining cloud
        environments, and this team often differs from the staff
        responsible for building or designing the infrastructure. It
        is important to include the operations function in the
        planning and design phases of the build-out.</para>
    <para>Service Level Agreements (SLAs) are contractual obligations
        that provide assurances for service availability. SLAs define
        levels of availability that drive the technical design, often
        with penalties for not meeting the contractual obligations.
        The strictness of the SLA dictates the level of redundancy and
        resiliency in the OpenStack cloud design. Knowing when and
        where to implement redundancy and high availability is
        directly affected by expectations set by the terms of the SLA.
        Some of the SLA terms that will affect the design
        include:</para>
    <itemizedlist>
        <listitem>
            <para>Guarantees for API availability imply multiple
                infrastructure services combined with highly available
                load balancers, as sketched below.</para>
        </listitem>
        <listitem>
            <para>Network uptime guarantees will affect the switch
                design and might require redundant switching and
                power.</para>
        </listitem>
        <listitem>
            <para>Network security policy requirements need to be
                factored into deployments.</para>
        </listitem>
    </itemizedlist>
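    <para>As an illustration only, the following minimal HAProxy
        fragment sketches one way to meet an API availability guarantee
        by load balancing the Compute API across two controllers. The
        virtual IP, controller addresses, and health-check settings are
        assumptions for the example, not values prescribed by this
        guide.</para>
    <programlisting># Minimal sketch: front the Compute API (port 8774) with HAProxy.
# The virtual IP and controller addresses below are placeholders.
frontend nova-api
    mode http
    bind 192.0.2.10:8774
    default_backend nova-api-servers

backend nova-api-servers
    mode http
    balance roundrobin
    option httpchk GET /
    server controller1 192.0.2.11:8774 check inter 2000 rise 2 fall 5
    server controller2 192.0.2.12:8774 check inter 2000 rise 2 fall 5</programlisting>
    <para>A similar pattern applies to the other OpenStack API
        endpoints; keeping the load balancers themselves redundant, for
        example as a failover pair, is what makes the front end highly
        available.</para>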
<section xml:id="support-and-maintainability-general-purpose">
|
|
<title>Support and maintainability</title>
|
|
        <para>OpenStack cloud management requires operations staff to
            understand the architecture design on some level. The
            required skill level and the degree of separation between
            the operations and engineering staff depend on the size and
            purpose of the installation. A large cloud service provider
            or a telecom provider is more likely to be managed by a
            specially trained, dedicated operations organization. A
            smaller implementation is more likely to rely on a smaller
            support staff that might need to take on the combined
            engineering, design, and operations functions.</para>
        <para>Furthermore, maintaining OpenStack installations requires
            a variety of technical skills. Some of these skills may
            include the ability to debug Python log output to a basic
            level and an understanding of networking concepts.</para>
        <para>Consider incorporating features into the architecture and
            design that reduce the operations burden, for example by
            automating some of the operations functions. In some cases
            it may be beneficial to use a third-party management
            company with special expertise in managing OpenStack
            deployments.</para>
    </section>
<section xml:id="monitoring-general-purpose">
|
|
<title>Monitoring</title>
|
|
        <para>Like any other infrastructure deployment, OpenStack
            clouds need an appropriate monitoring platform to ensure
            errors are caught and managed appropriately. Consider
            leveraging any existing monitoring system to see if it will
            be able to effectively monitor an OpenStack environment.
            While there are many aspects that need to be monitored,
            specific metrics that are critically important to capture
            include image disk utilization and response time to the
            Compute API.</para>
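        <para>As a minimal sketch only, the following Python script
            shows one way an external monitoring system could time a
            Compute API call. The endpoint URL, token, and two-second
            threshold are hypothetical placeholders, not values defined
            by this guide.</para>
        <programlisting language="python">#!/usr/bin/env python
# Minimal monitoring sketch: measure Compute API response time.
# The endpoint, token, and threshold below are assumptions.
import sys
import time

import requests

COMPUTE_ENDPOINT = "http://controller:8774/v2/TENANT_ID/servers"  # hypothetical
TOKEN = "REPLACE_WITH_KEYSTONE_TOKEN"  # hypothetical
THRESHOLD_SECONDS = 2.0  # hypothetical

start = time.time()
response = requests.get(COMPUTE_ENDPOINT, headers={"X-Auth-Token": TOKEN})
elapsed = time.time() - start

if response.status_code != 200 or elapsed > THRESHOLD_SECONDS:
    # A non-zero exit code signals the monitoring system that the check failed.
    print("CRITICAL: status %s in %.2f seconds" % (response.status_code, elapsed))
    sys.exit(2)
print("OK: Compute API responded in %.2f seconds" % elapsed)
sys.exit(0)</programlisting>
    </section>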
<section xml:id="downtime-general-purpose">
|
|
<title>Downtime</title>
|
|
        <para>No matter how robust the architecture is, at some point
            components will fail. Designing for high availability (HA)
            can have significant cost ramifications; therefore, the
            resiliency of the overall system and of the individual
            components will be dictated by the requirements of the SLA.
            Downtime planning includes creating processes and
            architectures that support both planned (maintenance) and
            unplanned (system faults) downtime.</para>
        <para>An example of an operational consideration is the
            recovery of a failed compute host. This might mean
            restoring instances from a snapshot or respawning an
            instance on another available compute host, which can have
            consequences for the overall application design. A general
            purpose cloud should not need to provide an ability to
            migrate instances from one host to another. However, if the
            expectation is that applications cannot be designed to
            tolerate failure, additional considerations need to be made
            around supporting instance migration. In this scenario,
            extra supporting services, including shared storage
            attached to compute hosts, might need to be
            deployed.</para>
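        <para>As a hedged illustration, the following command sketches
            how an instance could be respawned on another compute host
            after a host failure, assuming shared storage is available
            and that the evacuate workflow is in use; the instance and
            host names are placeholders.</para>
        <screen><prompt>$</prompt> <userinput>nova evacuate --on-shared-storage INSTANCE_UUID TARGET_COMPUTE_HOST</userinput></screen>
    </section>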
<section xml:id="capacity-planning">
|
|
<title>Capacity planning</title>
|
|
        <para>Capacity planning for future growth is a critically
            important and often overlooked consideration. Capacity
            constraints in a general purpose cloud environment include
            compute and storage limits. There is a relationship between
            the size of the compute environment and the number of
            OpenStack infrastructure controller nodes required to
            support it. As the compute environment grows, network
            traffic and messaging increase, which adds load to the
            controller or networking nodes. While no hard and fast rule
            exists, effective monitoring of the environment will help
            with capacity decisions about when to scale the back-end
            infrastructure as part of scaling the compute
            resources.</para>
        <para>Adding extra compute capacity to an OpenStack cloud is a
            horizontal scaling process: consistently configured compute
            nodes automatically attach to the cloud. Be mindful of any
            additional work that is needed to place the nodes into
            appropriate availability zones and host aggregates, as
            sketched below. Make sure to use identical or functionally
            compatible CPUs when adding compute nodes to the
            environment; otherwise, live migration features will break.
            Scaling out compute hosts will directly affect network and
            other data center resources, so it may be necessary to add
            rack capacity or network switches.</para>
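        <para>As an illustrative sketch only, the following commands
            show one way a newly added node could be placed into a host
            aggregate that maps to an availability zone; the aggregate,
            zone, and host names are hypothetical.</para>
        <screen><prompt>$</prompt> <userinput>nova aggregate-create rack42-aggregate az-rack42</userinput>
<prompt>$</prompt> <userinput>nova aggregate-add-host rack42-aggregate new-compute-01</userinput></screen>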
        <para>Another option is to assess the average workloads and
            increase the number of instances that can run within the
            compute environment by adjusting the overcommit ratio.
            While only appropriate in some environments, it is
            important to remember that changing the CPU overcommit
            ratio can have a detrimental effect and cause a potential
            increase in noisy neighbor issues. The added risk of
            increasing the overcommit ratio is that more instances will
            fail when a compute host fails.</para>
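        <para>As a hedged sketch, the overcommit ratio is typically
            controlled through the allocation ratio options in
            <filename>nova.conf</filename>; the values shown here are
            illustrative assumptions, not recommendations.</para>
        <programlisting># /etc/nova/nova.conf
# Illustrative values only; tune to the measured workloads.
[DEFAULT]
cpu_allocation_ratio = 4.0
ram_allocation_ratio = 1.5</programlisting>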
        <para>Compute host components can also be upgraded to account
            for increases in demand; this is known as vertical scaling.
            Upgrading CPUs with more cores, or increasing the overall
            server memory, can add needed capacity depending on whether
            the running applications are more CPU intensive or memory
            intensive.</para>
        <para>Insufficient disk capacity could also have a negative
            effect on overall performance, including CPU and memory
            usage. Depending on the back-end architecture of the
            OpenStack Block Storage layer, adding capacity might mean
            adding disk shelves to enterprise storage systems or
            installing additional block storage nodes. It may also be
            necessary to upgrade directly attached storage installed in
            compute hosts, or to add capacity to the shared storage, to
            provide additional ephemeral storage to instances.</para>
        <para>For a deeper discussion on many of these topics, refer to
            the <link xlink:href="http://docs.openstack.org/ops"><citetitle>OpenStack
            Operations Guide</citetitle></link>.</para>
    </section>
</section>