<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE chapter [
<!ENTITY % openstack SYSTEM "../common/entities/openstack.ent">
%openstack;
]>
<chapter xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
xmlns="http://docbook.org/ns/docbook"
version="5.0"
xml:id="ch013_node-bootstrapping">
<?dbhtml stop-chunking?>
<title>Integrity life-cycle</title>
<para>We define the integrity life cycle as a deliberate process that
provides assurance that we are always running the expected
software with the expected configurations throughout the cloud.
This process begins with secure bootstrapping and is maintained
through configuration management and security monitoring. This
chapter provides recommendations on how to approach the integrity
life-cycle process.</para>
<section xml:id="ch013_node-bootstrapping-idp44768">
<title>Secure bootstrapping</title>
<para>Nodes in the cloud&mdash;including compute, storage, network,
service, and hybrid nodes&mdash;should have an automated
provisioning process. This ensures that nodes are provisioned
consistently and correctly. This also facilitates security
patching, upgrading, bug fixing, and other critical changes.
Since this process installs new software that runs at the
highest privilege levels in the cloud, it is important to verify
that the correct software is installed. This includes the
earliest stages of the boot process.</para>
<para>There are a variety of technologies that enable verification
of these early boot stages. These typically require hardware
support such as the Trusted Platform Module (TPM), Intel Trusted
Execution Technology (TXT), dynamic root of trust measurement
(DRTM), and Unified Extensible Firmware Interface (UEFI) secure
boot. In this book, we will refer to all of these collectively
as <emphasis>secure boot technologies</emphasis>. We recommend
using secure boot, while acknowledging that many of the pieces
necessary to deploy this require advanced technical skills in
order to customize the tools for each environment. Utilizing
secure boot will require deeper integration and customization
than many of the other recommendations in this guide. TPM
technology, while common in most business-class laptops and
desktops for several years, is now becoming available in
servers together with supporting BIOS. Proper planning is
essential to a successful secure boot deployment.</para>
<para>A complete tutorial on secure boot deployment is beyond the
scope of this book. Instead, here we provide a framework for how
to integrate secure boot technologies with the typical node
provisioning process. For additional details, cloud architects
should refer to the related specifications and software
configuration manuals.</para>
<section xml:id="ch013_node-bootstrapping-idp48720">
<title>Node provisioning</title>
<para>Nodes should use Preboot eXecution Environment (PXE) for
provisioning. This significantly reduces the effort required
for redeploying nodes. The typical process involves the node
receiving various boot stages&mdash;that is, progressively more
complex software to execute&mdash;from a server.</para>
<para><inlinemediaobject>
<imageobject role="html">
<imagedata contentdepth="203" contentwidth="274"
fileref="static/node-provisioning-pxe.png" format="PNG"
scalefit="1"/>
</imageobject>
<imageobject role="fo">
<imagedata contentdepth="100%"
fileref="static/node-provisioning-pxe.png" format="PNG"
scalefit="1" width="100%"/>
</imageobject>
</inlinemediaobject></para>
<para>We recommend using a separate, isolated network within the
management security domain for provisioning. This network will
handle all PXE traffic, along with the subsequent boot stage
downloads depicted above. Note that the node boot process
begins with two insecure operations: DHCP and TFTP. The boot
process then uses SSL to download an initial payload, such as
an initramfs and a kernel, and it concludes by downloading the
remaining information needed to deploy the node. This may be
an operating system installer, a basic install managed by
<link xlink:href="http://www.opscode.com/chef/">Chef</link>
or <link xlink:href="https://puppetlabs.com/">Puppet</link>,
or even a complete file system image that is written directly
to disk.</para>
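<para>As an illustration, the following is a minimal sketch of a
dnsmasq configuration providing the DHCP and TFTP services used
in the first, insecure stage of this process (dnsmasq is only one
of several options for these services). The interface name,
address range, and boot loader file name are placeholders that
must be adapted to each environment:</para>
<programlisting># Listen only on the isolated provisioning network
interface=eth-provision
# Addresses handed out to nodes being provisioned
dhcp-range=10.1.0.50,10.1.0.150,12h
# Serve the PXE boot loader over TFTP
enable-tftp
tftp-root=/srv/tftp
dhcp-boot=undionly.kpxe</programlisting>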
<para>While utilizing SSL during the PXE boot process is
somewhat more challenging, common PXE firmware projects, such
as iPXE, provide this support. Typically this involves
building the PXE firmware with knowledge of the allowed SSL
certificate chain(s) so that it can properly validate the
server certificate. This raises the bar for an attacker by
limiting the number of insecure, plain text network
operations.</para>
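<para>As a sketch of how this might look with iPXE, the firmware
can be built with the deployment's CA certificate embedded as the
trusted root, together with an embedded script that fetches the
later boot stages over HTTPS (assuming HTTPS support is enabled
in the iPXE build configuration). The certificate file, host
name, and image paths below are placeholders:</para>
<screen>make bin/undionly.kpxe TRUST=ca.crt EMBED=boot.ipxe</screen>
<para>where <filename>boot.ipxe</filename> is an embedded script
along these lines:</para>
<programlisting>#!ipxe
dhcp
kernel https://deploy.example.com/images/vmlinuz
initrd https://deploy.example.com/images/initrd.img
boot</programlisting>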
</section>
<section xml:id="ch013_node-bootstrapping-idp58144">
<title>Verified boot</title>
<para>In general, there are two different strategies for
verifying the boot process. Traditional <emphasis>secure
boot</emphasis> will validate the code run at each step in
the process, and stop the boot if code is incorrect.
<emphasis>Boot attestation</emphasis> will record which code
is run at each step, and provide this information to another
machine as proof that the boot process completed as expected.
In both cases, the first step is to measure each piece of code
before it is run. In this context, a measurement is
effectively a SHA-1 hash of the code, taken before it is
executed. The hash is stored in a platform configuration
register (PCR) in the TPM.</para>
<note>
<para>SHA-1 is used here because this is what the TPM
chips support.</para>
</note>
<para>Each TPM has at least 24 PCRs. The TCG Generic Server
Specification, v1.0, March 2005, defines the PCR assignments
for boot-time integrity measurements. The table below shows a
typical PCR configuration. The context indicates if the values
are determined based on the node hardware (firmware) or the
software provisioned onto the node. Some values are influenced
by firmware versions, disk sizes, and other low-level
information. Therefore, it is important to have good practices
in place around configuration management to ensure that each
system deployed is configured exactly as desired.</para>
<informaltable rules="all" width="80%">
<colgroup>
<col/>
<col/>
<col/>
</colgroup>
<tbody>
<tr>
<td><para><emphasis role="bold"
>Register</emphasis></para></td>
<td><para><emphasis role="bold">What is
measured</emphasis></para></td>
<td><para><emphasis role="bold"
>Context</emphasis></para></td>
</tr>
<tr>
<td><para>PCR-00</para></td>
<td><para>Core Root of Trust Measurement (CRTM), BIOS
code, Host platform extensions</para></td>
<td><para>Hardware</para></td>
</tr>
<tr>
<td><para>PCR-01</para></td>
<td><para>Host platform configuration</para></td>
<td><para>Hardware</para></td>
</tr>
<tr>
<td><para>PCR-02</para></td>
<td><para>Option ROM code</para></td>
<td><para>Hardware</para></td>
</tr>
<tr>
<td><para>PCR-03</para></td>
<td><para>Option ROM configuration and data</para></td>
<td><para>Hardware</para></td>
</tr>
<tr>
<td><para>PCR-04</para></td>
<td><para>Initial Program Loader (IPL) code. For example,
master boot record.</para></td>
<td><para>Software</para></td>
</tr>
<tr>
<td><para>PCR-05</para></td>
<td><para>IPL code configuration and data</para></td>
<td><para>Software</para></td>
</tr>
<tr>
<td><para>PCR-06</para></td>
<td><para>State transition and wake events</para></td>
<td><para>Software</para></td>
</tr>
<tr>
<td><para>PCR-07</para></td>
<td><para>Host platform manufacturer control</para></td>
<td><para>Software</para></td>
</tr>
<tr>
<td><para>PCR-08</para></td>
<td><para>Platform specific, often kernel, kernel
extensions, and drivers</para></td>
<td><para>Software</para></td>
</tr>
<tr>
<td><para>PCR-09</para></td>
<td><para>Platform specific, often Initramfs</para></td>
<td><para>Software</para></td>
</tr>
<tr>
<td><para>PCR-10 to PCR-23</para></td>
<td><para>Platform specific</para></td>
<td><para>Software</para></td>
</tr>
</tbody>
</informaltable>
<para>At the time of this writing, very few clouds are using
secure boot technologies in a production environment. As a
result, these technologies are still somewhat immature. We
recommend planning carefully in terms of hardware selection:
for example, ensure that you have a TPM and Intel TXT support.
Then verify how the node hardware vendor populates the PCR
values; for example, find out which values will be available
for validation. Typically the PCR values listed under the software
context in the table above are the ones that a cloud architect
has direct control over. But even these may change as the
software in the cloud is upgraded. Configuration management
should be linked into the PCR policy engine to ensure that the
validation is always up to date.</para>
<para>Each manufacturer must provide the BIOS and firmware code
for their servers. Different servers, hypervisors, and
operating systems will choose to populate different PCRs. In
most real world deployments, it will be impossible to validate
every PCR against a known good quantity ("golden
measurement"). Experience has shown that, even within a single
vendor's product line, the measurement process for a given PCR
may not be consistent. We recommend establishing a baseline
for each server and monitoring the PCR values for unexpected
changes. Third-party software may be available to assist in
the TPM provisioning and monitoring process, depending upon
your chosen hypervisor solution.</para>
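<para>One lightweight way to apply this recommendation, assuming
a TPM 1.2 chip and a kernel that exposes the PCR values through
sysfs (the exact path varies with the kernel version), is to
record a per-node baseline at provisioning time and periodically
compare the live values against it:</para>
<screen># Record the baseline once the node is known to be in a good state
cat /sys/class/tpm/tpm0/pcrs > /var/lib/pcr-baseline
# Later, report any unexpected change in the measurements
cat /sys/class/tpm/tpm0/pcrs > /tmp/pcr-current
diff /var/lib/pcr-baseline /tmp/pcr-current</screen>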
<para>The initial program loader (IPL) code will most likely be
the PXE firmware, assuming the node deployment strategy
outlined above. Therefore, the secure boot or boot attestation
process can measure all of the early stage boot code, such as
the BIOS and firmware, the PXE firmware, and the node
kernel. Ensuring that each node has the correct versions of
these pieces installed provides a solid foundation on which to
build the rest of the node software stack.</para>
<para>Depending on the strategy selected, in the event of a
failure the node will either fail to boot or it can report the
failure back to another entity in the cloud. For secure boot,
the node will fail to boot and a provisioning service within
the management security domain must recognize this and log the
event. For boot attestation, the node will already be running
when the failure is detected. In this case the node should be
immediately quarantined by disabling its network access. Then
the event should be analyzed for the root cause. In either
case, policy should dictate how to proceed after a failure. A
cloud may automatically attempt to re-provision a node a
certain number of times, or it may immediately notify a cloud
administrator to investigate the problem. The right policy
here will be deployment and failure mode specific.</para>
</section>
<section xml:id="ch013_node-bootstrapping-idp3728">
<title>Node hardening</title>
<para>At this point we know that the node has booted with the
correct kernel and underlying components. There are many paths
for hardening a given operating system deployment. The
specifics on these steps are outside of the scope of this
book. We recommend following the guidance from a hardening
guide specific to your operating system. For example, the
<link xlink:href="http://iase.disa.mil/stigs/">security
technical implementation guides</link> (STIG) and the <link
xlink:href="http://www.nsa.gov/ia/mitigation_guidance/security_configuration_guides/"
>NSA guides</link> are useful starting places.</para>
<para>The nature of the nodes makes additional hardening
possible. We recommend the following additional steps for
production nodes:</para>
<itemizedlist>
<listitem>
<para>Use a read-only file system where possible. Ensure
that writeable file systems do not permit execution. This
can be handled through the mount options provided in
<filename>/etc/fstab</filename>, as shown in the example
after this list.</para>
</listitem>
<listitem>
<para>Use a mandatory access control policy to contain the
instances, the node services, and any other critical
processes and data on the node. See the discussions on
sVirt / SELinux and AppArmor below.</para>
</listitem>
<listitem>
<para>Remove any unnecessary software packages. This should
result in a very stripped down installation because a
compute node has a relatively small number of
dependencies.</para>
</listitem>
</itemizedlist>
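<para>The following <filename>/etc/fstab</filename> excerpt
sketches how such mount options can be applied. The devices,
mount points, and sizes are illustrative only:</para>
<programlisting># Keep writeable file systems non-executable
/dev/sda3  /var/lib/nova  ext4   defaults,nodev,nosuid,noexec  0 2
tmpfs      /tmp           tmpfs  nodev,nosuid,noexec,size=1g   0 0</programlisting>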
<para>Finally, the node kernel should have a mechanism to
validate that the rest of the node starts in a known good
state. This provides the necessary link from the boot
validation process to validating the entire system. The steps
for doing this will be deployment specific. As an example, a
kernel module could verify a hash over the blocks comprising
the file system before mounting it using <link
xlink:href="https://code.google.com/p/cryptsetup/wiki/DMVerity"
>dm-verity</link>.</para>
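<para>As a sketch of this approach with dm-verity, the hash
metadata is generated once when the file system image is
created, and the recorded root hash is then used at boot to set
up a verified, read-only mapping. The device paths below are
placeholders:</para>
<screen># At image creation time: build the hash tree and note the root hash
veritysetup format /dev/sda2 /dev/sda3
# At boot time: create the verified mapping and mount it read-only
veritysetup create vroot /dev/sda2 /dev/sda3 <replaceable>root_hash</replaceable>
mount -o ro /dev/mapper/vroot /mnt</screen>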
</section>
</section>
<section xml:id="ch013_node-bootstrapping-idp11376">
<title>Runtime verification</title>
<para>Once the node is running, we need to ensure that it remains
in a good state over time. Broadly speaking, this includes both
configuration management and security monitoring. The goals for
each of these areas are different. By checking both, we achieve
higher assurance that the system is operating as desired. We
discuss configuration management in the management section, and
security monitoring below.</para>
<section xml:id="ch013_node-bootstrapping-idp135504">
<title>Intrusion detection system</title>
<para>Host-based intrusion detection tools are also useful for
automated validation of the cloud internals. There are a wide
variety of host-based intrusion detection tools available.
Some are open source projects that are freely available, while
others are commercial. Typically these tools analyze data from
a variety of sources and produce security alerts based on rule
sets and/or training. Typical capabilities include log
analysis, file integrity checking, policy monitoring, and
rootkit detection. More advanced tools, often custom-built, can
validate that in-memory process images match the on-disk
executable and validate the execution state of a running
process.</para>
<para>One critical policy decision for a cloud architect is what
to do with the output from a security monitoring tool. There
are effectively two options. The first is to alert a human to
investigate and/or take corrective action. This could be done
by including the security alert in a log or events feed for
cloud administrators. The second option is to have the cloud
take some form of remedial action automatically, in addition
to logging the event. Remedial actions could include anything
from re-installing a node to performing a minor service
configuration. However, automated remedial action can be
challenging due to the possibility of false positives.</para>
<para>False positives occur when the security monitoring tool
produces a security alert for a benign event. Due to the
nature of security monitoring tools, false positives will most
certainly occur from time to time. Typically a cloud
administrator can tune security monitoring tools to reduce the
false positives, but this may also reduce the overall
detection rate at the same time. These classic trade-offs must
be understood and accounted for when setting up a security
monitoring system in the cloud.</para>
<para>The selection and configuration of a host-based intrusion
detection tool is highly deployment specific. We recommend
starting by exploring the following open source projects,
which implement a variety of host-based intrusion detection
and file monitoring features:</para>
<itemizedlist>
<listitem>
<para><link xlink:href="http://www.ossec.net/"
>OSSEC</link></para>
</listitem>
<listitem>
<para><link xlink:href="http://la-samhna.de/samhain/"
>Samhain</link></para>
</listitem>
<listitem>
<para><link
xlink:href="http://sourceforge.net/projects/tripwire/"
>Tripwire</link></para>
</listitem>
<listitem>
<para><link xlink:href="http://aide.sourceforge.net/"
>AIDE</link></para>
</listitem>
</itemizedlist>
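<para>As an example of the file integrity checking workflow that
these tools provide, a minimal AIDE deployment initializes a
database of file hashes from a known good state and later
compares the running system against it. The database location
varies by distribution:</para>
<screen># Build the baseline database from the current, known good state
aide --init
cp /var/lib/aide/aide.db.new /var/lib/aide/aide.db
# Later, report any files that have changed since the baseline
aide --check</screen>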
<para>Network intrusion detection tools complement the
host-based tools. OpenStack does not have a specific network
IDS built-in, but OpenStack Networking
provides a plug-in mechanism to enable different technologies
through the Networking API. This plug-in architecture will allow
tenants to develop API extensions to insert and configure
their own advanced networking services, such as a firewall, an
intrusion detection system, or a VPN between the VMs.</para>
<para>Similar to host-based tools, the selection and
configuration of a network-based intrusion detection tool is
deployment specific. <link xlink:href="http://www.snort.org/"
>Snort</link> is the leading open source network
intrusion detection tool, and a good starting place to learn
more.</para>
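<para>For instance, a basic Snort instance can be run in IDS
mode against a monitoring interface. The interface name and
configuration path below are deployment specific:</para>
<screen># Run Snort in IDS mode, printing alerts to the console
snort -i eth1 -c /etc/snort/snort.conf -A console</screen>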
<para>There are a few important security considerations for
network and host-based intrusion detection systems.</para>
<itemizedlist>
<listitem>
<para>It is important to consider the placement of the
network IDS in the cloud (for example, adding it to the
network boundary and/or around sensitive networks). The
placement depends on your network environment, but be sure
to monitor the impact the IDS may have on your services
depending on where you choose to add it.
Encrypted traffic, such as SSL, cannot generally be
inspected for content by a network IDS. However, the
network IDS may still provide some benefit in identifying
anomalous unencrypted traffic on the network.</para>
</listitem>
<listitem>
<para>In some deployments it may be necessary to add a
host-based IDS to sensitive components on security domain
bridges. A host-based IDS may detect anomalous activity
by compromised or unauthorized processes on the component.
The IDS should transmit alert and log information over the
management network.</para>
</listitem>
</itemizedlist>
</section>
</section>
</chapter>