Enabling VDU auto-healing

This spec aims to add a new action "vdu_autohealing" that brings back the failed VDU, instead of the existing "respawn" action, which deletes the entire VNF and creates a new one.

Change-Id: Idb4f22faa9327ee8d63e545aaa8916d1ec7d3f68
Implements: blueprint vdu-auto-healing
Co-Authored-By: Tushar Patil <tushar.vitthal.patil@gmail.com>
Co-Authored-By: Bhagyashri Shewale <bhagyashri.shewale@nttdata.com>
2018-09-12 15:13:01 +09:00
commit 5a6c75d55d
parent cf9b74dcf3
1 changed file with 188 additions and 0 deletions
--- a/specs/stein/vdu-auto-healing.rst
+++ b/specs/stein/vdu-auto-healing.rst
@@ -0,0 +1,188 @@
+..
+ This work is licensed under a Creative Commons Attribution 3.0 Unported
+ License.
+
+ http://creativecommons.org/licenses/by/3.0/legalcode
+
+
+================
+VDU auto healing
+================
+
+https://blueprints.launchpad.net/tacker/+spec/vdu-auto-healing
+
+With the anti-affinity policy now in place, it is possible to deploy
+highly available VNF applications (high availability is handled by the
+application running on the VDUs). If one of the VDUs stops responding,
+the existing ``respawn`` action deletes the entire stack and creates a
+new one.
+
+Our plan is to add a new action, ``vdu_autohealing``, that brings back
+only the failed VDU instead of deleting the entire stack and creating a
+new one.
+
+
+Problem description
+===================
+
+If one of the VDUs stops responding, there is no way to bring back only
+that failed VDU, because the existing ``respawn`` action deletes the
+entire stack and creates a new one. If all VDUs are deleted and
+replaced by new ones, high availability is compromised, because there
+is downtime until the VDUs are back.
+
+Our plan is to add a new action, ``vdu_autohealing``, that brings back
+the failed VDU, thereby allowing another VDU in the VNF to switch over
+to master and keep services uninterrupted.
+
+
+Proposed change
+===============
+
+Add a new action, ``vdu_autohealing``, that works in two steps. First,
+it marks the failed VDU and the CPs assigned to it as unhealthy (this
+is the status on the Heat side) using the Heat
+``resource-mark-unhealthy`` API. Second, it updates the stack, which
+brings the VDU and CPs marked as unhealthy back to ``CHECK_COMPLETE``.
+Internally, when the stack is updated, Heat deletes the VDU and CPs
+marked as unhealthy and replaces them with new ones. The Heat APIs are
+called from the infra driver used by the VNF. For now, this action is
+supported only for the ``openstack`` infra driver. After the stack is
+updated, the action keeps checking the VNF status until it reaches
+``CREATE_COMPLETE``; monitoring of that particular VNF is stopped until
+then.
+
+Note: There is no plan to support the new ``vdu_autohealing`` action
+for policy type ``tosca.policies.tacker.alarming``, so it will not be
+included in the ``DEFAULT_ALARM_ACTIONS`` constant defined in
+``tacker.plugins.common.constants``. If this action needs to be
+supported for alarm policies, ``metadata`` must be passed to the place
+where the action is invoked. This metadata contains the name of the
+``metering.server_group`` specified in the VDU metadata. In the action
+itself, based on the policy type and ``metadata``, we can scan the VNFD
+template available in the ``vnf_dict`` parameter to get the VDU name.
+Once the VDU name is known, the same logic can be reused for
+auto-healing.
+
+An example VNFD is shown below. The VDU is monitored using any of the
+existing monitoring drivers. If monitoring of a specific VDU fails, the
+new ``vdu_autohealing`` action is executed.
+
+.. code-block:: yaml
+  :caption: Example VNFD
+
+  tosca_definitions_version: tosca_simple_profile_for_nfv_1_0_0
+  description: "Monitoring policy action : vdu_autohealing"
+  topology_template:
+    node_templates:
+      VDU:
+        type: tosca.nodes.nfv.VDU.Tacker
+        # ...snip...
+        properties:
+            monitoring_policy:
+                name: ping
+                parameters:
+                    monitoring_delay: 45
+                    count: 3
+                    interval: 1
+                    timeout: 2
+                actions:
+                    failure: vdu_autohealing
+
+Alternatives
+------------
+
+The alternative is the same one described in the Alternatives section
+of the vdu-affinity-policy spec [#f1]_. In the NSD, we can add two
+VNFs, one active and one standby. If Tacker detects any issue during
+monitoring, the ``respawn`` action deletes that particular VNF and
+creates a new one. If the failed VNF is the active one, the standby
+VNF becomes active and continues servicing requests without
+interruption.
+
+Data model impact
+-----------------
+
+None
+
+REST API impact
+---------------
+
+None
+
+Security impact
+---------------
+
+None
+
+Notifications impact
+--------------------
+
+None
+
+Other end user impact
+---------------------
+
+None
+
+Performance Impact
+------------------
+
+None
+
+Other deployer impact
+---------------------
+
+None
+
+Developer impact
+----------------
+
+None
+
+Implementation
+==============
+
+Assignee(s)
+-----------
+
+Primary assignee:
+  Bhagyashri Shewale <bhagyashri.shewale@nttdata.com>
+
+Other contributors:
+  Hiroyuki Jo <jo.hiroyuki@lab.ntt.co.jp>
+
+  Tushar Patil <tushar.vitthal.patil@gmail.com>
+
+Work Items
+----------
+
+* Add the `vdu_autohealing` action to mark the VDU status as unhealthy
+  and update the stack
+* Unit Tests
+* Functional Tests
+* Update documentation
+
+Dependencies
+============
+
+None
+
+Testing
+=======
+
+Unit and functional tests are sufficient to test the `vdu_autohealing`
+action.
+
+Documentation Impact
+====================
+
+* Add a VNFD TOSCA template under samples to show how to configure
+  the `vdu_autohealing` action.
+
+* Document the new `vdu_autohealing` action in the Tacker Monitoring
+  Framework guide [#f2]_.
+
+References
+==========
+
+.. [#f1] https://specs.openstack.org/openstack/tacker-specs/specs/rocky/vdu-affinity-policy.html
+.. [#f2] https://docs.openstack.org/tacker/latest/contributor/monitor-api.html
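
The two-step flow described under "Proposed change" (mark the failed VDU and its CPs as unhealthy, update the stack, then poll until the update finishes) can be sketched roughly as follows. This is only an illustration, not the Tacker implementation: `FakeHeat`, `heal_vdu`, and all of their method and parameter names are hypothetical, and the in-memory fake stands in for the real Heat API calls. The status to wait for is left as a parameter rather than hard-coded.

```python
import time


class FakeHeat:
    """Hypothetical in-memory stand-in for a Heat client (illustration only)."""

    def __init__(self, final_status, polls_until_done=2):
        self.final_status = final_status
        self.polls_left = polls_until_done
        self.status = final_status
        self.unhealthy = []   # resources currently marked unhealthy
        self.replaced = []    # resources replaced by the last stack update

    def mark_unhealthy(self, stack_id, resource_name, reason):
        self.unhealthy.append(resource_name)

    def update_stack(self, stack_id):
        # On update, every resource marked unhealthy is replaced.
        self.replaced, self.unhealthy = self.unhealthy, []
        self.status = "UPDATE_IN_PROGRESS"

    def get_stack_status(self, stack_id):
        # Simulate an update that needs a few polls before it completes.
        if self.status == "UPDATE_IN_PROGRESS":
            if self.polls_left == 0:
                self.status = self.final_status
            else:
                self.polls_left -= 1
        return self.status


def heal_vdu(heat, stack_id, vdu_name, cp_names, target_status,
             poll_interval=0.0, max_polls=10):
    """Mark one VDU and its CPs unhealthy, update the stack, wait for it.

    Returns True if target_status is seen within max_polls, else False.
    """
    # Step 1: mark the VDU and the CPs assigned to it as unhealthy.
    for name in [vdu_name, *cp_names]:
        heat.mark_unhealthy(stack_id, name, reason="vdu_autohealing")

    # Step 2: update the stack so the unhealthy resources are replaced.
    heat.update_stack(stack_id)

    # Step 3: poll until the stack reaches the expected status.
    for _ in range(max_polls):
        if heat.get_stack_status(stack_id) == target_status:
            return True
        time.sleep(poll_interval)
    return False


if __name__ == "__main__":
    heat = FakeHeat(final_status="UPDATE_COMPLETE")
    healed = heal_vdu(heat, "stack-1", "VDU1", ["CP1"], "UPDATE_COMPLETE")
    print(healed, heat.replaced)
```

Only the named VDU and its CPs are touched, which is the difference from `respawn`: the other resources in the stack are never marked, so the update leaves them in place.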