Reference architecture: common bits
This change introduces the layout of the future reference architecture guide.
It also introduces common ideas and considerations, to avoid repeating them
for each provided architecture.

Change-Id: Icc56cdfc1c97a2bb0674a9a397259cecc0a08514
parent 7a2f3482d0
commit d16a205acc
@@ -52,6 +52,8 @@ If there are slow or unresponsive BMCs in the environment, the
 need to be raised. The default is fairly conservative, as setting this timeout
 too low can cause older BMCs to crash and require a hard-reset.
 
+.. _ipmi-sensor-data:
+
 Collecting sensor data
 ~~~~~~~~~~~~~~~~~~~~~~
@@ -13,6 +13,7 @@ It contains the following sections:
    :maxdepth: 2
 
    get_started.rst
+   refarch/index
    install.rst
    configure-integration.rst
    deploy-ramdisk.rst
doc/source/install/refarch/common.rst (new file, 327 lines)
@@ -0,0 +1,327 @@
Common Considerations
=====================

This section covers considerations that are equally important to all described
architectures.

.. contents::
   :local:

Components
----------

As explained in :doc:`../get_started`, the Bare Metal service has three
components.

* The Bare Metal API service (``ironic-api``) should be deployed in a similar
  way as the control plane API services. The exact location will depend on the
  architecture used.

* The Bare Metal conductor service (``ironic-conductor``) is where most of the
  provisioning logic lives. The following considerations are the most
  important when deciding on the way to deploy it:

  * The conductor manages a certain proportion of nodes, distributed to it
    via a hash ring. This includes constantly polling these nodes for their
    current power state and hardware sensor data (if enabled and supported
    by hardware, see :ref:`ipmi-sensor-data` for an example).

  * The conductor needs access to the `management controller`_ of each node
    it manages.

  * The conductor co-exists with TFTP (for PXE) and/or HTTP (for iPXE)
    services that provide the kernel and ramdisk to boot the nodes. The
    conductor manages them by writing files to their root directories
    (a configuration sketch follows this list).

  * If serial console is used, the conductor launches console processes
    locally. If the nova-serialproxy service (part of the Compute service)
    is used, it has to be able to reach them. Otherwise, they have to be
    directly accessible by the users.

  * There has to be mutual connectivity between the conductor and the nodes
    being deployed or cleaned. See Networking_ for details.

* The provisioning ramdisk, which runs the ``ironic-python-agent`` service
  on start up.

  .. warning::
     The ``ironic-python-agent`` service is not intended to be used anywhere
     other than a provisioning/cleaning ramdisk.
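
As a minimal illustration of the points above, the sketch below shows
``ironic.conf`` options that tie a conductor to its co-located TFTP/HTTP
services and enable hardware sensor data collection. The paths and interval
are conventional examples, not requirements; check the sample configuration
for your release before relying on exact option names:

.. code-block:: ini

   [pxe]
   # Root directory of the co-located TFTP server; the conductor writes
   # kernels, ramdisks and boot configuration files here.
   tftp_root = /tftpboot

   [deploy]
   # Root directory of the co-located HTTP server, used when iPXE is enabled.
   http_root = /httpboot

   [conductor]
   # Poll hardware sensor data (if supported) on the managed nodes and emit
   # it as notifications; the interval is in seconds.
   send_sensor_data = true
   send_sensor_data_interval = 600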

Hardware and drivers
--------------------

The Bare Metal service strives to provide the best support possible for a
variety of hardware. However, not all hardware is supported equally well.
It depends on both the capabilities of the hardware itself and the available
drivers. This section covers various considerations related to the hardware
interfaces. See :doc:`/install/enabling-drivers` for a detailed introduction
to hardware types and interfaces before proceeding.

Power and management interfaces
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The minimum set of capabilities that the hardware has to provide and the
driver has to support is as follows:

#. getting and setting the power state of the machine
#. getting and setting the current boot device
#. booting an image provided by the Bare Metal service (in the simplest case,
   support booting using PXE_ and/or iPXE_)

.. note::
   Strictly speaking, it is possible to make the Bare Metal service provision
   nodes without some of these capabilities via some manual steps. It is not
   the recommended way of deployment, and thus it is not covered in this
   guide.

Once you make sure that the hardware supports these capabilities, you need to
find a suitable driver. Most enterprise-grade hardware has support for
IPMI_ and thus can utilize :doc:`/admin/drivers/ipmitool`. Some newer hardware
also supports :doc:`/admin/drivers/redfish`. Several vendors provide more
specific drivers that usually offer additional capabilities.
Check :doc:`/admin/drivers` to find the most suitable one.
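
For example, enabling the generic IPMI support mentioned above is done in
``ironic.conf``. This is an illustrative sketch only; see
:doc:`/install/enabling-drivers` for the authoritative option list:

.. code-block:: ini

   [DEFAULT]
   # Hardware types this deployment will support.
   enabled_hardware_types = ipmi
   # Interface implementations matching the enabled hardware types.
   enabled_power_interfaces = ipmitool
   enabled_management_interfaces = ipmitool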

Boot interface
~~~~~~~~~~~~~~

The boot interface of a node manages booting of both the deploy ramdisk and
the user instances on the bare metal node. The deploy interface orchestrates
the deployment and defines how the image gets transferred to the target disk.

The ``pxe`` boot interface uses PXE_ or iPXE_ to deliver the target
kernel/ramdisk pair. PXE uses the relatively slow and unreliable TFTP protocol
for transfer, while iPXE uses HTTP. The downside of iPXE is that it is less
common and usually requires bootstrapping via PXE first. It is recommended
to use iPXE whenever it is supported by the target hardware; see
:doc:`../configure-pxe` for details.
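
A minimal sketch of enabling iPXE in ``ironic.conf`` might look as follows.
The URL and paths are placeholders for your environment, and the exact
options may vary between releases (see :doc:`../configure-pxe`):

.. code-block:: ini

   [pxe]
   # Deliver boot files over HTTP via iPXE instead of plain TFTP.
   ipxe_enabled = True

   [deploy]
   # HTTP server root and the URL under which nodes can reach it.
   http_root = /httpboot
   http_url = http://192.0.2.10:8080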

.. note::
   Both PXE and iPXE are configured differently when UEFI boot is used
   instead of conventional BIOS boot. This is particularly important for CPU
   architectures that do not have BIOS support at all.

Alternatively, several vendors provide *virtual media* implementations of the
boot interface. They work by pushing an ISO image to the node's `management
controller`_, and do not require either PXE or iPXE. If such a boot
implementation is available for the hardware, it is recommended to use it
for better scalability and security. Check your driver documentation at
:doc:`/admin/drivers` for details.

Deploy interface
~~~~~~~~~~~~~~~~

There are two deploy interfaces in-tree, ``iscsi`` and ``direct``. See
:doc:`../enabling-drivers` for an explanation of the difference. With the
``iscsi`` deploy method, most of the deployment operations happen on the
conductor. If the Object Storage service (swift) or RadosGW is present in
the cloud, it is recommended to use the ``direct`` deploy method for better
scalability and reliability.
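
As a sketch, both deploy interfaces can be enabled with ``direct`` preferred
by default. Treat the option names as indicative of the approach rather than
definitive, and check :doc:`../enabling-drivers` for your release:

.. code-block:: ini

   [DEFAULT]
   enabled_deploy_interfaces = direct,iscsi
   # Nodes that do not explicitly specify a deploy interface get this one.
   default_deploy_interface = direct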

Hardware specifications
~~~~~~~~~~~~~~~~~~~~~~~

The Bare Metal service does not impose too many restrictions on the
characteristics of the hardware itself. However, keep in mind that:

* By default, the Bare Metal service will pick the smallest hard drive that
  is larger than 4 GiB for deployment. A smaller hard drive can be used, but
  doing so requires setting :ref:`root device hints <root-device-hints>`.

* The machines should have enough RAM to fit the deployment/cleaning ramdisk
  to run. The minimum varies greatly depending on the way the ramdisk was
  built. For example, *tinyipa*, the TinyCoreLinux-based ramdisk used in the
  CI, only needs 400 MiB of RAM, while ramdisks built by *diskimage-builder*
  may require 3 GiB or more.

Image types
-----------

The Bare Metal service can deploy two types of images:

* *Whole-disk* images contain a complete partitioning table with all necessary
  partitions. Such images are the most universal, but may be harder to build.

* *Partition images* contain only the root partition. The Bare Metal service
  will create the necessary partitions and install a boot loader, if needed.

  .. warning::
     Partition images are only supported with GNU/Linux operating systems,
     and require the GRUB2 bootloader to be present on the root image.

Local vs network boot
---------------------

The Bare Metal service supports booting user instances either using a local
bootloader or using the driver's boot interface (e.g. via the PXE_ or iPXE_
protocol in case of the ``pxe`` interface).

Network boot cannot be used with certain architectures (for example, when no
tenant networks have access to the control plane).

Additional considerations apply to the ``pxe`` boot interface, and to other
boot interfaces based on it:

* Local boot makes a node's boot process independent of the Bare Metal
  conductor managing it. Thus, nodes are able to reboot correctly even if the
  Bare Metal TFTP or HTTP service is down.

* Network boot (and iPXE) must be used when booting nodes from remote volumes.

The default boot option for the cloud can be changed via the Bare Metal
service configuration file, for example:

.. code-block:: ini

   [deploy]
   default_boot_option = local

This default can be overridden by setting the ``boot_option`` capability on a
node. See :ref:`local-boot-partition-images` for details.

.. note::
   Currently, network boot is used by default. However, we plan on changing
   this in the future, so it is safer to set the ``default_boot_option``
   explicitly.

Networking
----------

There are several recommended network topologies to be used with the Bare
Metal service. They are explained in depth in the specific architecture
documentation. However, several considerations are common for all of them:

* There has to be a *provisioning* network, which is used by nodes during
  the deployment process. If allowed by the architecture, this network should
  not be accessible by end users, and should not have access to the internet.

* There has to be a *cleaning* network, which is used by nodes during
  the cleaning process. In the majority of cases, the same network should
  be used for both cleaning and provisioning for simplicity (a configuration
  sketch follows this list).

  Unless noted otherwise, everything in these sections applies to both
  networks.

* The bare metal nodes have to have access to the Bare Metal API while
  connected to the provisioning/cleaning network.

  .. note::
     Actually, only two endpoints need to be exposed there::

        GET /v1/lookup
        POST /v1/heartbeat/[a-z0-9\-]+

     You may want to limit access from this network to only these endpoints,
     and make these endpoints inaccessible from other networks.

* If the ``pxe`` boot interface (or any boot interface based on it) is used,
  then the bare metal nodes should have untagged (access mode) connectivity
  to the provisioning/cleaning networks. This allows PXE firmware, which does
  not support VLANs, to communicate with the services required for
  provisioning.

  .. note::
     Whether the Bare Metal service will handle this automatically depends on
     the *network interface* used. Check the networking documentation for the
     specific architecture.

* The bare metal nodes need to have access to any services required for
  provisioning/cleaning, while connected to the provisioning/cleaning network.
  This may include:

  * a TFTP server for PXE boot and also an HTTP server when iPXE is enabled
  * either an HTTP server or the Object Storage service in case of the
    ``direct`` deploy interface and some virtual media boot interfaces

* The Bare Metal conductor needs to have access to the booted bare metal
  nodes during provisioning/cleaning. The conductor communicates with an
  internal API, provided by **ironic-python-agent**, to conduct actions on
  nodes.
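
When the Networking service (neutron) manages these networks, the sketch
below shows how the provisioning and cleaning phases could be pointed at the
same network, per the recommendation above. The network name is a
placeholder, and the option names may differ between releases:

.. code-block:: ini

   [neutron]
   # Both phases use the same network here; a name or UUID is accepted.
   provisioning_network = baremetal-provisioning
   cleaning_network = baremetal-provisioning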

HA and Scalability
------------------

ironic-api
~~~~~~~~~~

The Bare Metal API service is stateless, and thus can be easily scaled
horizontally. It is recommended to deploy it as a WSGI application behind,
for example, Apache or another WSGI container.

Note that this service accesses the ironic database for reading entities
(e.g. in response to a ``GET /v1/nodes`` request) and, in rare cases, for
writing.

ironic-conductor
~~~~~~~~~~~~~~~~

High availability
^^^^^^^^^^^^^^^^^

The Bare Metal conductor service utilizes the active/active HA model. Every
conductor manages a certain subset of nodes. The nodes are organized in a
hash ring that tries to keep the load spread more or less uniformly across
the conductors. When a conductor is considered offline, its nodes are taken
over by other conductors. As a result, you need at least 2 conductor hosts
for an HA deployment.

Performance
^^^^^^^^^^^

Conductors can be resource intensive, so it is recommended (but not required)
to keep all conductors separate from other services in the cloud. The minimum
required number of conductors in a deployment depends on several factors:

* the performance of the hardware where the conductors will be running,
* the speed and reliability of the `management controller`_ of the
  bare metal nodes (for example, handling slower controllers may require
  having fewer nodes per conductor),
* the frequency at which the management controllers are polled by the Bare
  Metal service (see the ``sync_power_state_interval`` option),
* the bare metal driver used for nodes (see `Hardware and drivers`_ above),
* the network performance,
* the maximum number of bare metal nodes that are provisioned simultaneously
  (see the ``max_concurrent_builds`` option for the Compute service; a sketch
  of both tuning options follows this list).
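
The two tunables mentioned above live in different services. The values below
are purely illustrative starting points (the upstream defaults), not
recommendations for any particular deployment:

.. code-block:: ini

   # ironic.conf: how often, in seconds, each conductor polls the power
   # state of the nodes it manages.
   [conductor]
   sync_power_state_interval = 60

   # nova.conf (Compute service): cap on simultaneous instance builds.
   [DEFAULT]
   max_concurrent_builds = 10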

We recommend a target of **100** bare metal nodes per conductor for maximum
reliability and performance. There is some tolerance for a larger number per
conductor. However, it was reported [1]_ [2]_ that reliability degrades when
handling approximately 300 bare metal nodes per conductor.

Disk space
^^^^^^^^^^

Each conductor needs enough free disk space to cache images it uses.
Depending on the combination of the deploy interface and the boot option,
the space requirements are different:

* The deployment kernel and ramdisk are always cached during the deployment.

* The ``iscsi`` deploy method requires caching of the whole instance image
  locally during the deployment. The image has to be converted to the raw
  format, which may increase the required amount of disk space, as well as
  the CPU load.

  .. note::
     This is not a concern for the ``direct`` deploy interface, as in this
     case the deployment ramdisk downloads the image and either streams it
     to the disk or caches it in memory.

* When network boot is used, the instance image kernel and ramdisk are cached
  locally while the instance is active.

.. note::
   All images may be stored for some time after they are no longer needed.
   This is done to speed up simultaneous deployments of many similar images.
   The caching can be configured via the ``image_cache_size`` and
   ``image_cache_ttl`` configuration options in the ``pxe`` group.
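
For instance, a larger cache with a shorter lifetime could be expressed as
follows; the values are arbitrary examples of the two options named in the
note above:

.. code-block:: ini

   [pxe]
   # Maximum size of the image cache, in MiB.
   image_cache_size = 40960
   # Time-to-live of cached images, in minutes.
   image_cache_ttl = 2880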

.. [1] http://lists.openstack.org/pipermail/openstack-dev/2017-June/118033.html
.. [2] http://lists.openstack.org/pipermail/openstack-dev/2017-June/118327.html

Other services
~~~~~~~~~~~~~~

When integrating with other OpenStack services, additional considerations may
apply. These are covered in other parts of this guide.


.. _PXE: https://en.wikipedia.org/wiki/Preboot_Execution_Environment
.. _iPXE: https://en.wikipedia.org/wiki/IPXE
.. _IPMI: https://en.wikipedia.org/wiki/Intelligent_Platform_Management_Interface
.. _management controller: https://en.wikipedia.org/wiki/Out-of-band_management
doc/source/install/refarch/index.rst (new file, 12 lines)
@@ -0,0 +1,12 @@
Reference Deploy Architectures
==============================

This section covers the way we recommend the Bare Metal service to be
deployed and managed. It is assumed that the reader has already gone through
:doc:`/user/index`. It may also be useful to try :ref:`deploy_devstack` first
to become more familiar with the concepts used in this guide.

.. toctree::
   :maxdepth: 2

   common