Add Airship 1 Flow Doc

This adds some Airship 1 documentation that Rodolfo authored to describe the end-to-end Airship 1 deployment flow. Change-Id: Ie1d2ecfa2cc3ba9883ffcfbc238b8f3499def9da
2021-04-09 15:52:53 -05:00 · 2021-04-09 15:52:53 -05:00 · ff0c344f2a
commit ff0c344f2a
parent 6ee51e9ca5
3 changed files with 277 additions and 0 deletions
--- a/doc/source/airship1/airship-1-flow.png
+++ b/doc/source/airship1/airship-1-flow.png
--- a/doc/source/airship1/airship-1-flow.rst
+++ b/doc/source/airship1/airship-1-flow.rst
@ -0,0 +1,276 @@
+****************************
+Airship 1.0 Deployment Flows
+****************************
+
+.. |vspace| raw:: latex
+
+   \vspace{5mm}
+
+.. image:: airship-1-flow.png
+
+Airship 1.0 Deploy and Update Site Flow
+#######################################
+
+
+1. Pegleg facilitates cloning the repositories necessary to interact with
+   a site.  Each site has a single site-definition.yaml which contains
+   the repositories that “compose” that site.  These may be global
+   repositories, type level repositories (e.g. cruisers or cloud harbor),
+   and finally site-level repositories.  These may be entirely different
+   repositories with different permissions.  Pegleg facilitates cloning
+   all of these at the correct revisions according to the definition for
+   that site. Pegleg can be driven via a jenkins pipeline, which can be
+   further abstracted in something like an NC3C dashboard, or it can be
+   driven on the command line directly by imitating the behavior in the
+   pipeline.
+
+|vspace|
+
+2. Pegleg wears several different hats.  The CI/CD workflows leverage
+   different pipelines in order to call upon these hats but under the
+   hood, it’s really just different command line flags on the pegleg CLI
+   command depending on what type of action is occurring.  Pegleg can:
+
+   a. Generate (and re-generate/rotate) new secure secrets for a site
+      according to each secret’s  requirement (e.g. length, type, and so
+      on). For instance: UUIDs, passwords, keys, and so on.
+
+      |vspace|
+
+   b. Encrypt secrets, and Decrypt secrets.  When secrets are encrypted,
+      they are wrapped in a YAML envelope containing metadata for each
+      secret.  This allows for understanding when secrets are going to
+      expire, when they were last rotated, and so on.  All deployment and
+      update pipelines for instance would leverage the decrypt functionality
+      in order to render the documents successfully.
+
+      |vspace|
+
+   c. Lint the YAML to ensure it is valid and meets certain basic syntax
+      criteria and deckhand does not have an issue processing rules
+      encountered.  For instance, development gating pipelines that validate
+      changes to YAML would invoke pegleg in this way.
+
+      |vspace|
+
+   d. Render will actually process the documents through the deckhand
+      library, which will perform substitutions, pull in the secrets that
+      are referenced from the configuration YAML so you can see the target
+      document locally.  This is effectively a very in-depth linting process
+      and again would be used in development gates and potentially to
+      fast-fail in deployment and update pipelines if there was an issue.
+
+      |vspace|
+
+   e. Collect will bundle up all the documents but not actually render them
+      which is appropriate for deployment and update pipelines as it sends
+      the documents through raw (but presumably with decrypted secrets)
+      because each cloud site has its own deckhand instance running
+      maintaining its own revision history capable of rendering the
+      documents in-site.  It is used in every deployment and update pipeline
+      as the results of collect are what is sent to shipyard.
+
+|vspace|
+
+3. Once pegleg has decrypted the secrets in the document set within an
+   ephemeral jenkins pipeline, pegleg collect is called to assemble them
+   all, and finally that is piped to the shipyard client which will
+   publish them via REST API to a Shipyard API service running within the
+   site. There are two scenarios under which Shipyard may be running in
+   the site.
+
+   a. On the genesis host, which is a single node running Kubernetes in a
+      green-field site that will be expanded to a full cluster once more
+      nodes are provisioned.
+
+      |vspace|
+
+   b. On the control plane of a greenfield site, receiving a site-update
+      or expansion.
+
+   Simply put, the entire Shipyard workflow can be summarized as follows:
+
+   * Initial region/site data will be passed to Shipyard from either a
+     human operator or Jenkins
+   * The data (in YAML format) will be sent to Deckhand for validation and storage
+   * Shipyard will make use of the post-processed data from DeckHand to
+     interact with Drydock.
+   * Drydock will interact with Promenade to provision and deploy bare metal
+     nodes using Ubuntu MAAS and a resilient Kubernetes cluster will be created
+     at the end of the process
+   * Once the Kubernetes clusters are up and validated to be working properly,
+     Shipyard will interact with Armada to deploy OpenStack using OpenStack Helm
+   * Once the OpenStack cluster is deployed, Shipyard will trigger a workflow to
+     perform basic sanity health checks on the cluster
+
+|vspace|
+
+4. Shipyard will do a number of pre-validations before delivering the
+   document set to deckhand.  Things such as a concurrency check, to
+   ensure we don’t try to run updates in parallel unaware of each other.
+   It will also run a number of fail-fast validation checks.
+
+|vspace|
+
+5. Shipyard will leverage the deckhand client library to deliver the
+   documents to deckhand over its REST API, which will again validate
+   them and render them (which again involves performing all layering,
+   substitution, secret interpolation, and so on) and publishes a
+   document revision, so that there is an on-site record of every change
+   that has ever been requested.  This document revision that is fully
+   rendered will be available at a deckhand REST API URL that can be
+   retrieved by various Airship sub-components.
+
+|vspace|
+
+6. Deckhand will store secrets within Barbican so that they are not
+   stored in clear text within a database, and the rendered document set
+   revision itself is stored directly in a database.  Deckhand will
+   change every secret to a Barbican reference which will be rendered
+   on-demand by Deckhand whenever someone asks for that document revision
+   through the API.
+
+|vspace|
+
+7. At this point, with the documents stored in Deckhand, Shipyard will
+   perform another fail-fast step and ask each of the components
+   highlighted in yellow to perform a dry-run no-op validation of the
+   entire document set from their perspective.  This means that Drydock
+   for instance, would be validating and acknowledging it would not have
+   any issue processing the document set it sees in Deckhand.  This
+   helps ensure we do not encounter updates that fail in the middle
+   of the process.  If a component is unhappy with the document set
+   we want to know early and fail before making any changes.
+
+|vspace|
+
+8. Shipyard will now invoke Drydock to provision baremetal hosts that
+   have not already been provisioned and continue to call back or poll
+   for when Drydock has completed this process.  Airship has a concept
+   called deployment strategies because the hardware aspect of
+   deployment is not guaranteed or reliable, and we don’t always want
+   failures here to block every other process in the stack.  In other
+   words, our deployment strategies require that 100% of nodes marked
+   as control plane nodes must be provisioned successfully to
+   continue, but that a certain percentage of each rack of workers
+   could fail and we can still continue past the hardware
+   provisioning steps successfully.  In other words, this is where we
+   introduce a threshold of failure.
+
+|vspace|
+
+9. Shipyard will send Drydock the Deckhand URL to obtain the document set
+   for itself for this update.  Drydock will retrieve the entire document
+   set from Deckhand but it will only process documents it cares
+   about.
+
+|vspace|
+
+10. Drydock will process any Drydock/BootAction documents that have
+    external references in them to render those upfront before writing an
+    operating system to the physical host.  Most importantly, this allows
+    Promenade to construct a host-specific join script.   In other words,
+    Drydock calls out to the Promenade REST API to construct a join shell
+    script for each host and this is driven by Drydock/Bootaction
+    documents.
+
+|vspace|
+
+11. Drydock will orchestrate MaaS based on the document set.  It does this
+    through several internal tasks, prepare_site, prepare_nodes, and
+    deploy_site.  Within prepare_site, upfront orchestration of MaaS
+    occurs setting non-host specific settings via the MaaS API, such as
+    CIDRs, and VLANs.  Within prepare_nodes, we identify hosts that
+    haven’t already been provisioned and then power cycle hosts, wait for
+    them to be discovered by MaaS, and then aligning and renaming them to
+    hosts in our static inventory. Then the host configuration is
+    orchestrated in MaaS so they have the proper networking and storage
+    configuration as well as receive the correct static overlays, like
+    Kubernetes join scripts, the correct Drivers, and so on, on
+    first-boot.  Finally within Drydock’s deploy_nodes task we orchestrate
+    several MaaS flows to actually provision the nodes with an operating
+    system where they execute any additional static scripts delivered on
+    first-boot.
+
+|vspace|
+
+12. During the deploy_nodes phase of Drydock, MaaS is effectively writing
+    an operating system to the baremetal nodes.
+
+|vspace|
+
+13. Driven by cloud-init on first boot post provision, the nodes will
+    actually make a rest call back to the MaaS API to inform it that
+    provisioning has completed and they have successfully booted up into
+    functional networking and have booted up successfully.  Drydock can
+    use this status within MaaS to understand the nodes were provisioned
+    successfully.
+
+|vspace|
+
+14. The nodes run the Promenade generated shell script to join them to
+    Kubernetes. This host-specific script installs the appropriate
+    dependencies and joins the node as a Kubernetes node, either as a
+    worker, or as a control plane host depending on the hosts profile in
+    the YAML inventory.
+
+|vspace|
+
+15. Shipyard has been polling Drydock for completion of processing the
+    site update. Once the polling for Drydock provisioning completes,
+    Shipyard will move on to performing a similar request to Armada.
+    Armada is asked to update the site and given a Deckhand URL and
+    revision to pull from.
+
+|vspace|
+
+16. Armada pulls the rendered document set from Deckhand.
+
+|vspace|
+
+17. Armada then proceeds to help orchestrate any helm installs or upgrades
+    necessary in the site, and helps do this across a vast number of
+    charts, their ordering, and dependencies. Armada also supports
+    fetching Helm chart source and then building charts from source from
+    various local and remote locations, such as Git endpoints, tarballs or
+    local directories.  It will also give the operator some indication of
+    what is about to change by assisting with diffs for both values,
+    values overrides, and actual template changes. Its functionality
+    extends beyond Helm, assisting in interacting with Kubernetes directly
+    to perform basic pre- and post-steps, such as removing completed or
+    failed jobs, running backup jobs, blocking on chart readiness, or
+    deleting resources that do not support upgrades. However, primarily,
+    it is an interface to support orchestrating Helm.
+
+|vspace|
+
+18. Armada effectively interacts with Tiller for installation (although it
+    may interact with k8s directly to poll, wait, remove jobs, and
+    otherwise help protect helm from failures).  Tiller will then interact
+    with k8s to perform helm chart installations or upgrades.
+
+Airship 1.0 Update Software Flow
+################################
+
+The Update Software flow (or “action” in Shipyard -- depicted with
+green numbers in the image) is effectively a subset of the above flow.
+It is used primarily to speed the process up by bypassing the Drydock
+flow entirely. The reason for this is both speed as interacting with
+MaaS is slow, as well as times where you want to avoid trying to
+process hardware requests (e.g. waiting for Drydock to try and
+provision a piece of failed hardware only to ultimately timeout some
+time later before moving on to the next step because the deployment
+strategy allows it).
+
+Further Documentation
+#####################
+
+* https://airshipit.readthedocs.io/projects/shipyard/en/latest/
+* https://airshipit.readthedocs.io/projects/pegleg/en/latest/
+* https://airshipit.readthedocs.io/projects/armada/en/latest/
+* https://airshipit.readthedocs.io/projects/promenade/en/latest/
+* https://airshipit.readthedocs.io/projects/drydock/en/latest/
+* https://airshipit.readthedocs.io/projects/deckhand/en/latest/
+* https://airshipit.readthedocs.io/en/latest/
+
+
--- a/doc/source/index.rst
+++ b/doc/source/index.rst
@ -75,6 +75,7 @@ developers.
   Seaworthy: Production-grade Airship <https://docs.airshipit.org/treasuremap/seaworthy.html>
   develop/airship1-developers.rst
   develop/conventions.rst
+   airship1/airship-1-flow.rst

 Other Resources
 ---------------