<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE chapter [
<!ENTITY % openstack SYSTEM "../common/entities/openstack.ent">
%openstack;
]>
<chapter xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
xmlns="http://docbook.org/ns/docbook"
version="5.0"
xml:id="ch013_node-bootstrapping">
<?dbhtml stop-chunking?>
<title>Integrity life-cycle</title>
<para>We define the integrity life cycle as a deliberate process that
provides assurance that we are always running the expected
software with the expected configurations throughout the cloud.
This process begins with secure bootstrapping and is maintained
through configuration management and security monitoring. This
chapter provides recommendations on how to approach the integrity
life-cycle process.</para>
<section xml:id="ch013_node-bootstrapping-idp44768">
<title>Secure bootstrapping</title>
<para>Nodes in the cloud&mdash;including compute, storage, network,
service, and hybrid nodes&mdash;should have an automated
provisioning process. This ensures that nodes are provisioned
consistently and correctly. This also facilitates security
patching, upgrading, bug fixing, and other critical changes.
Since this process installs new software that runs at the
highest privilege levels in the cloud, it is important to verify
that the correct software is installed. This includes the
earliest stages of the boot process.</para>
<para>There are a variety of technologies that enable verification
of these early boot stages. These typically require hardware
support such as the Trusted Platform Module (TPM), Intel Trusted
Execution Technology (TXT), dynamic root of trust measurement
(DRTM), and Unified Extensible Firmware Interface (UEFI) secure
boot. In this book, we will refer to all of these collectively
as <emphasis>secure boot technologies</emphasis>. We recommend
using secure boot, while acknowledging that many of the pieces
necessary to deploy this require advanced technical skills in
order to customize the tools for each environment. Utilizing
secure boot will require deeper integration and customization
than many of the other recommendations in this guide. TPM
technology, while common in most business-class laptops and
desktops for several years, is now becoming available in
servers together with supporting BIOS. Proper planning is
essential to a successful secure boot deployment.</para>
<para>A complete tutorial on secure boot deployment is beyond the
scope of this book. Instead, here we provide a framework for how
to integrate secure boot technologies with the typical node
provisioning process. For additional details, cloud architects
should refer to the related specifications and software
configuration manuals.</para>
<section xml:id="ch013_node-bootstrapping-idp48720">
<title>Node provisioning</title>
<para>Nodes should use Preboot eXecution Environment (PXE) for
provisioning. This significantly reduces the effort required
for redeploying nodes. The typical process involves the node
receiving various boot stages&mdash;that is, progressively more
complex software to execute&mdash;from a server.</para>
<para><inlinemediaobject>
<imageobject role="html">
<imagedata contentdepth="203" contentwidth="274"
fileref="static/node-provisioning-pxe.png" format="PNG"
scalefit="1"/>
</imageobject>
<imageobject role="fo">
<imagedata contentdepth="100%"
fileref="static/node-provisioning-pxe.png" format="PNG"
scalefit="1" width="100%"/>
</imageobject>
</inlinemediaobject></para>
<para>We recommend using a separate, isolated network within the
management security domain for provisioning. This network will
handle all PXE traffic, along with the subsequent boot stage
downloads depicted above. Note that the node boot process
begins with two insecure operations: DHCP and TFTP. The boot
process then uses SSL to download an initial payload, such as
an initramfs and a kernel, and it concludes by downloading the
remaining information needed to deploy the node. This may be
an operating system installer, a basic install managed by
<link xlink:href="http://www.opscode.com/chef/">Chef</link>
or <link xlink:href="https://puppetlabs.com/">Puppet</link>,
or even a complete file system image that is written directly
to disk.</para>
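<para>As an illustration, the following is a minimal sketch of a
dnsmasq configuration providing the DHCP and TFTP services used
in the first, insecure stage of this process (dnsmasq is only one
of several options for these services). The interface name,
address range, and boot loader file name are placeholders that
must be adapted to each environment:</para>
<programlisting># Listen only on the isolated provisioning network
interface=eth-provision
# Addresses handed out to nodes being provisioned
dhcp-range=10.1.0.50,10.1.0.150,12h
# Serve the PXE boot loader over TFTP
enable-tftp
tftp-root=/srv/tftp
dhcp-boot=undionly.kpxe</programlisting>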
<para>While utilizing SSL during the PXE boot process is
somewhat more challenging, common PXE firmware projects, such
as iPXE, provide this support. Typically this involves
building the PXE firmware with knowledge of the allowed SSL
certificate chain(s) so that it can properly validate the
server certificate. This raises the bar for an attacker by
limiting the number of insecure, plain text network
operations.</para>
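<para>As a sketch of how this might look with iPXE, the firmware
can be built with the deployment's CA certificate embedded as the
trusted root, together with an embedded script that fetches the
later boot stages over HTTPS (assuming HTTPS support is enabled
in the iPXE build configuration). The certificate file, host
name, and image paths below are placeholders:</para>
<screen>make bin/undionly.kpxe TRUST=ca.crt EMBED=boot.ipxe</screen>
<para>where <filename>boot.ipxe</filename> is an embedded script
along these lines:</para>
<programlisting>#!ipxe
dhcp
kernel https://deploy.example.com/images/vmlinuz
initrd https://deploy.example.com/images/initrd.img
boot</programlisting>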
</section>
<section xml:id="ch013_node-bootstrapping-idp58144">
<title>Verified boot</title>
<para>In general, there are two different strategies for
verifying the boot process. Traditional <emphasis>secure
boot</emphasis> will validate the code run at each step in
the process, and stop the boot if code is incorrect.
<emphasis>Boot attestation</emphasis> will record which code
is run at each step, and provide this information to another
machine as proof that the boot process completed as expected.
In both cases, the first step is to measure each piece of code
before it is run. In this context, a measurement is
effectively a SHA-1 hash of the code, taken before it is
executed. The hash is stored in a platform configuration
register (PCR) in the TPM.</para>
<note>
<para>SHA-1 is used here because this is what the TPM
chips support.</para>
</note>
<para>Each TPM has at least 24 PCRs. The TCG Generic Server
Specification, v1.0, March 2005, defines the PCR assignments
for boot-time integrity measurements. The table below shows a
typical PCR configuration. The context indicates if the values
are determined based on the node hardware (firmware) or the
software provisioned onto the node. Some values are influenced
by firmware versions, disk sizes, and other low-level
information. Therefore, it is important to have good practices
in place around configuration management to ensure that each
system deployed is configured exactly as desired.</para>
<informaltable rules="all" width="80%">
<colgroup>
<col/>
<col/>
<col/>
</colgroup>
<tbody>
<tr>
<td><para><emphasis role="bold"
>Register</emphasis></para></td>
<td><para><emphasis role="bold">What is
measured</emphasis></para></td>
<td><para><emphasis role="bold"
>Context</emphasis></para></td>
</tr>
<tr>
<td><para>PCR-00</para></td>
<td><para>Core Root of Trust Measurement (CRTM), BIOS
code, Host platform extensions</para></td>
<td><para>Hardware</para></td>
</tr>
<tr>
<td><para>PCR-01</para></td>
<td><para>Host platform configuration</para></td>
<td><para>Hardware</para></td>
</tr>
<tr>
<td><para>PCR-02</para></td>
<td><para>Option ROM code</para></td>
<td><para>Hardware</para></td>
</tr>
<tr>
<td><para>PCR-03</para></td>
<td><para>Option ROM configuration and data</para></td>
<td><para>Hardware</para></td>
</tr>
<tr>
<td><para>PCR-04</para></td>
<td><para>Initial Program Loader (IPL) code. For example,
master boot record.</para></td>
<td><para>Software</para></td>
</tr>
<tr>
<td><para>PCR-05</para></td>
<td><para>IPL code configuration and data</para></td>
<td><para>Software</para></td>
</tr>
<tr>
<td><para>PCR-06</para></td>
<td><para>State transition and wake events</para></td>
<td><para>Software</para></td>
</tr>
<tr>
<td><para>PCR-07</para></td>
<td><para>Host platform manufacturer control</para></td>
<td><para>Software</para></td>
</tr>
<tr>
<td><para>PCR-08</para></td>
<td><para>Platform specific, often kernel, kernel
extensions, and drivers</para></td>
<td><para>Software</para></td>
</tr>
<tr>
<td><para>PCR-09</para></td>
<td><para>Platform specific, often Initramfs</para></td>
<td><para>Software</para></td>
</tr>
<tr>
<td><para>PCR-10 to PCR-23</para></td>
<td><para>Platform specific</para></td>
<td><para>Software</para></td>
</tr>
</tbody>
</informaltable>
<para>At the time of this writing, very few clouds are using
secure boot technologies in a production environment. As a
result, these technologies are still somewhat immature. We
recommend planning carefully in terms of hardware selection:
for example, ensure that you have a TPM and Intel TXT support.
Then verify how the node hardware vendor populates the PCR
values; for example, find out which values will be available
for validation. Typically the PCR values listed under the software
context in the table above are the ones that a cloud architect
has direct control over. But even these may change as the
software in the cloud is upgraded. Configuration management
should be linked into the PCR policy engine to ensure that the
validation is always up to date.</para>
<para>Each manufacturer must provide the BIOS and firmware code
for their servers. Different servers, hypervisors, and
operating systems will choose to populate different PCRs. In
most real world deployments, it will be impossible to validate
every PCR against a known good quantity ("golden
measurement"). Experience has shown that, even within a single
vendor's product line, the measurement process for a given PCR
may not be consistent. We recommend establishing a baseline
for each server and monitoring the PCR values for unexpected
changes. Third-party software may be available to assist in
the TPM provisioning and monitoring process, depending upon
your chosen hypervisor solution.</para>
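<para>One lightweight way to apply this recommendation, assuming
a TPM 1.2 chip and a kernel that exposes the PCR values through
sysfs (the exact path varies with the kernel version), is to
record a per-node baseline at provisioning time and periodically
compare the live values against it:</para>
<screen># Record the baseline once the node is known to be in a good state
cat /sys/class/tpm/tpm0/pcrs > /var/lib/pcr-baseline
# Later, report any unexpected change in the measurements
cat /sys/class/tpm/tpm0/pcrs > /tmp/pcr-current
diff /var/lib/pcr-baseline /tmp/pcr-current</screen>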
<para>The initial program loader (IPL) code will most likely be
the PXE firmware, assuming the node deployment strategy
outlined above. Therefore, the secure boot or boot attestation
process can measure all of the early stage boot code, such as
the BIOS and firmware, the PXE firmware, and the node
kernel. Ensuring that each node has the correct versions of
these pieces installed provides a solid foundation on which to
build the rest of the node software stack.</para>
<para>Depending on the strategy selected, in the event of a
failure the node will either fail to boot or it can report the
failure back to another entity in the cloud. For secure boot,
the node will fail to boot and a provisioning service within
the management security domain must recognize this and log the
event. For boot attestation, the node will already be running
when the failure is detected. In this case the node should be
immediately quarantined by disabling its network access. Then
the event should be analyzed for the root cause. In either
case, policy should dictate how to proceed after a failure. A
cloud may automatically attempt to re-provision a node a
certain number of times, or it may immediately notify a cloud
administrator to investigate the problem. The right policy
here will be deployment and failure mode specific.</para>
</section>
<section xml:id="ch013_node-bootstrapping-idp3728">
<title>Node hardening</title>
<para>At this point we know that the node has booted with the
correct kernel and underlying components. There are many paths
for hardening a given operating system deployment. The
specifics on these steps are outside of the scope of this
book. We recommend following the guidance from a hardening
guide specific to your operating system. For example, the
<link xlink:href="http://iase.disa.mil/stigs/">security
technical implementation guides</link> (STIG) and the <link
xlink:href="http://www.nsa.gov/ia/mitigation_guidance/security_configuration_guides/"
>NSA guides</link> are useful starting places.</para>
<para>The nature of the nodes makes additional hardening
possible. We recommend the following additional steps for
production nodes:</para>
<itemizedlist>
<listitem>
<para>Use a read-only file system where possible. Ensure
that writeable file systems do not permit execution. This
can be handled through the mount options provided in
<filename>/etc/fstab</filename>, as shown in the example
after this list.</para>
</listitem>
<listitem>
<para>Use a mandatory access control policy to contain the
instances, the node services, and any other critical
processes and data on the node. See the discussions on
sVirt / SELinux and AppArmor below.</para>
</listitem>
<listitem>
<para>Remove any unnecessary software packages. This should
result in a very stripped down installation because a
compute node has a relatively small number of
dependencies.</para>
</listitem>
</itemizedlist>
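<para>The following <filename>/etc/fstab</filename> excerpt
sketches how such mount options can be applied. The devices,
mount points, and sizes are illustrative only:</para>
<programlisting># Keep writeable file systems non-executable
/dev/sda3  /var/lib/nova  ext4   defaults,nodev,nosuid,noexec  0 2
tmpfs      /tmp           tmpfs  nodev,nosuid,noexec,size=1g   0 0</programlisting>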
<para>Finally, the node kernel should have a mechanism to
validate that the rest of the node starts in a known good
state. This provides the necessary link from the boot
validation process to validating the entire system. The steps
for doing this will be deployment specific. As an example, a
kernel module could verify a hash over the blocks comprising
the file system before mounting it using <link
xlink:href="https://code.google.com/p/cryptsetup/wiki/DMVerity"
>dm-verity</link>.</para>
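<para>As a sketch of this approach with dm-verity, the hash
metadata is generated once when the file system image is
created, and the recorded root hash is then used at boot to set
up a verified, read-only mapping. The device paths below are
placeholders:</para>
<screen># At image creation time: build the hash tree and note the root hash
veritysetup format /dev/sda2 /dev/sda3
# At boot time: create the verified mapping and mount it read-only
veritysetup create vroot /dev/sda2 /dev/sda3 <replaceable>root_hash</replaceable>
mount -o ro /dev/mapper/vroot /mnt</screen>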
</section>
</section>
<section xml:id="ch013_node-bootstrapping-idp11376">
<title>Runtime verification</title>
<para>Once the node is running, we need to ensure that it remains
in a good state over time. Broadly speaking, this includes both
configuration management and security monitoring. The goals for
each of these areas are different. By checking both, we achieve
higher assurance that the system is operating as desired. We
discuss configuration management in the management section, and
security monitoring below.</para>
<section xml:id="ch013_node-bootstrapping-idp135504">
<title>Intrusion detection system</title>
<para>Host-based intrusion detection tools are also useful for
automated validation of the cloud internals. There are a wide
variety of host-based intrusion detection tools available.
Some are open source projects that are freely available, while
others are commercial. Typically these tools analyze data from
a variety of sources and produce security alerts based on rule
sets and/or training. Typical capabilities include log
analysis, file integrity checking, policy monitoring, and
rootkit detection. More advanced tools, often custom-built, can
validate that in-memory process images match the on-disk
executable and validate the execution state of a running
process.</para>
<para>One critical policy decision for a cloud architect is what
to do with the output from a security monitoring tool. There
are effectively two options. The first is to alert a human to
investigate and/or take corrective action. This could be done
by including the security alert in a log or events feed for
cloud administrators. The second option is to have the cloud
take some form of remedial action automatically, in addition
to logging the event. Remedial actions could include anything
from re-installing a node to performing a minor service
configuration. However, automated remedial action can be
challenging due to the possibility of false positives.</para>
<para>False positives occur when the security monitoring tool
produces a security alert for a benign event. Due to the
nature of security monitoring tools, false positives will most
certainly occur from time to time. Typically a cloud
administrator can tune security monitoring tools to reduce the
false positives, but this may also reduce the overall
detection rate at the same time. These classic trade-offs must
be understood and accounted for when setting up a security
monitoring system in the cloud.</para>
<para>The selection and configuration of a host-based intrusion
detection tool is highly deployment specific. We recommend
starting by exploring the following open source projects,
which implement a variety of host-based intrusion detection
and file monitoring features:</para>
<itemizedlist>
<listitem>
<para><link xlink:href="http://www.ossec.net/"
>OSSEC</link></para>
</listitem>
<listitem>
<para><link xlink:href="http://la-samhna.de/samhain/"
>Samhain</link></para>
</listitem>
<listitem>
<para><link
xlink:href="http://sourceforge.net/projects/tripwire/"
>Tripwire</link></para>
</listitem>
<listitem>
<para><link xlink:href="http://aide.sourceforge.net/"
>AIDE</link></para>
</listitem>
</itemizedlist>
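<para>As an example of the file integrity checking workflow that
these tools provide, a minimal AIDE deployment initializes a
database of file hashes from a known good state and later
compares the running system against it. The database location
varies by distribution:</para>
<screen># Build the baseline database from the current, known good state
aide --init
cp /var/lib/aide/aide.db.new /var/lib/aide/aide.db
# Later, report any files that have changed since the baseline
aide --check</screen>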
<para>Network intrusion detection tools complement the
host-based tools. OpenStack does not have a specific network
IDS built-in, but OpenStack Networking
provides a plug-in mechanism to enable different technologies
through the Networking API. This plug-in architecture will allow
tenants to develop API extensions to insert and configure
their own advanced networking services, such as a firewall, an
intrusion detection system, or a VPN between the VMs.</para>
<para>Similar to host-based tools, the selection and
configuration of a network-based intrusion detection tool is
deployment specific. <link xlink:href="http://www.snort.org/"
>Snort</link> is the leading open source network
intrusion detection tool, and a good starting place to learn
more.</para>
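<para>For instance, a basic Snort instance can be run in IDS
mode against a monitoring interface. The interface name and
configuration path below are deployment specific:</para>
<screen># Run Snort in IDS mode, printing alerts to the console
snort -i eth1 -c /etc/snort/snort.conf -A console</screen>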
<para>There are a few important security considerations for
network and host-based intrusion detection systems.</para>
<itemizedlist>
<listitem>
<para>It is important to consider the placement of the
network IDS in the cloud (for example, adding it to the
network boundary and/or around sensitive networks). The
placement depends on your network environment, but be sure
to monitor the impact the IDS may have on your services
depending on where you choose to add it.
Encrypted traffic, such as SSL, cannot generally be
inspected for content by a network IDS. However, the
network IDS may still provide some benefit in identifying
anomalous unencrypted traffic on the network.</para>
</listitem>
<listitem>
<para>In some deployments it may be necessary to add a
host-based IDS to sensitive components on security domain
bridges. A host-based IDS may detect anomalous activity
by compromised or unauthorized processes on the component.
The IDS should transmit alert and log information over the
management network.</para>
</listitem>
</itemizedlist>
</section>
</section>
</chapter>