distcloud/distributedcloud/dccommon/exceptions.py
Kyle MacLeod 886697755b Add timeout for prestage ansible playbooks
Fix an issue observed during the testing of a large-scale
subcloud prestage operation. In one of many rounds of testing,
ansible hung in the middle of prestaging a subcloud, causing
the whole strategy to hang for many hours. The process had
to be killed manually because strategy abort did not work in
this case.

The issue is addressed by invoking the 'ansible-playbook' call
via '/usr/bin/timeout'. The timeout command will kill the
ansible-playbook tree if the given timeout value is hit.
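
A minimal sketch of the wrapping idea, not the code from this commit.
The helper names (wrap_with_timeout, run_playbook_sketch) and the 10s
kill-after grace period are assumptions made for illustration; only
'/usr/bin/timeout' and the '--kill-after' option of GNU coreutils
timeout are taken as given.

```python
import subprocess

TIMEOUT_BIN = "/usr/bin/timeout"


def wrap_with_timeout(playbook_cmd, timeout_s, kill_after_s=10):
    """Prefix a command list with GNU timeout (hypothetical helper).

    timeout sends SIGTERM after timeout_s seconds; --kill-after follows
    up with SIGKILL if the command is still running kill_after_s later.
    """
    return [
        TIMEOUT_BIN,
        "--kill-after=%ds" % kill_after_s,
        "%ds" % timeout_s,
    ] + list(playbook_cmd)


def run_playbook_sketch(playbook_cmd, timeout_s):
    """Run the wrapped command and report whether it timed out."""
    rc = subprocess.call(wrap_with_timeout(playbook_cmd, timeout_s))
    # GNU timeout exits with 124 when the time limit is hit, or 137
    # (128 + SIGKILL) when the follow-up KILL signal was needed.
    timed_out = rc in (124, 137)
    return rc, timed_out
```

For example, wrap_with_timeout(["ansible-playbook", "site.yml"], 1800)
yields a command that starts with '/usr/bin/timeout --kill-after=10s 1800s'.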

For now, only the prestaging operations are using the
new timeout. The original 'run_playbook' method is
preserved in order to reduce any risk in this new
method of invoking a subprocess.

When a timeout occurs, the ansible log is updated before
the process is killed. Example:

    2022-04-28-17:28:44 TIMEOUT (1800s) - playbook is terminated

Default timeout:
- We use a global timeout (default: 3600s / 1hr)
- The default can be modified from the [DEFAULT] section
  in /etc/dcmanager/dcmanager.conf. To change it, add the
  'playbook_timeout' option as shown below, then restart the
  dcmanager-manager service.

      playbook_timeout=3600
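
A rough illustration of how such an option resolves to a value with a
3600s fallback. The service presumably reads its settings through its
own configuration machinery; this sketch uses the standard-library
configparser as a stand-in, assumes the section header is spelled
'[DEFAULT]', and read_playbook_timeout is a hypothetical helper.

```python
import configparser

DEFAULT_PLAYBOOK_TIMEOUT = 3600  # seconds; the default stated above


def read_playbook_timeout(conf_text):
    """Return playbook_timeout from config text, or the default."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    return parser.getint(
        "DEFAULT", "playbook_timeout", fallback=DEFAULT_PLAYBOOK_TIMEOUT
    )


SAMPLE_CONF = """
[DEFAULT]
playbook_timeout=1800
"""

print(read_playbook_timeout(SAMPLE_CONF))
print(read_playbook_timeout("[DEFAULT]\n"))
```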

Future considerations (not part of this commit):
- In python3, this code can be simplified to
  use the new subprocess.run(timeout=val) method
  or Popen with p.wait(timeout=val)
- Beginning with ansible 2.10, we can introduce
  the ANSIBLE_TASK_TIMEOUT value to set a
  task-level timeout. This is not available
  in our current version of ansible (2.7.5)
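
A sketch of the python3 simplification mentioned above, assuming
subprocess.run(timeout=...) is acceptable for the use case. The
run_with_timeout helper is hypothetical. Note that subprocess.run
kills only the direct child on timeout, so a real playbook invocation
may still need extra handling for descendant processes.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout):
    """Return the exit code, or None if the command timed out."""
    try:
        completed = subprocess.run(cmd, timeout=timeout)
    except subprocess.TimeoutExpired:
        # subprocess.run has already killed and reaped the child here.
        return None
    return completed.returncode


# Stand-ins for a playbook command: one finishes at once, one hangs.
FAST_CMD = [sys.executable, "-c", "pass"]
SLOW_CMD = [sys.executable, "-c", "import time; time.sleep(30)"]
```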

Test Plan:

PASS:
Add unit tests covering:
  - no timeout given (maintain current functionality)
  - timeout given but not hit
  - timeout given; process is killed
  - timeout given; hung process (ignoring SIGTERM) is killed

Run prestage operations as normal
  - no regression

Modify default timeout to 5s, run prestage operations
  - verify that timeout occurs
  - verify that ansible-playbook is terminated
  - verify that ansible log file shows TIMEOUT log

Modify default timeout to 5s for a single subcloud, then
run prestage operations
  - verify that only the single subcloud operation is killed

Modify prestage-sw-packages/tasks/main.yml to use
'--bwlimit=128' in the rsync from registry.central. This slows down
the package prestaging so that the playbook timeout is reached.

Add a 'pause' task in the prestage-sw-packages ansible for a
single subcloud. Ensure just the one task times out.

Exercise a non-prestaging ansible playbook (to ensure the subprocess
Popen change does not impact other playbooks)
  - provisioned a new subcloud

Closes-Bug: 1971994
Change-Id: Iaf1bee786afc505594c6671c959cc2650202ee6c
Signed-off-by: Kyle MacLeod <kyle.macleod@windriver.com>
2022-05-09 20:58:47 +00:00

116 lines
3.3 KiB
Python

# Copyright 2015 Huawei Technologies Co., Ltd.
# Copyright 2015 Ericsson AB.
# All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License"); you may
# not use this file except in compliance with the License. You may obtain
# a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
# License for the specific language governing permissions and limitations
# under the License.
#
# Copyright (c) 2020 Wind River Systems, Inc.
#
"""
DC Orchestrator base exception handling.
"""
import six
from oslo_utils import encodeutils
from oslo_utils import excutils
from dcorch.common.i18n import _


class DCCommonException(Exception):
    """Base Common Driver Exception.

    To correctly use this class, inherit from it and define
    a 'message' property. That message will get printf'd
    with the keyword arguments provided to the constructor.
    """

    message = _("An unknown exception occurred.")

    def __init__(self, **kwargs):
        try:
            super(DCCommonException, self).__init__(self.message % kwargs)  # pylint: disable=W1645
            self.msg = self.message % kwargs  # pylint: disable=W1645
        except Exception:
            with excutils.save_and_reraise_exception() as ctxt:
                if not self.use_fatal_exceptions():
                    ctxt.reraise = False
                    # at least get the core message out if something happened
                    super(DCCommonException, self).__init__(self.message)  # pylint: disable=W1645

    if six.PY2:
        def __unicode__(self):
            return encodeutils.exception_to_unicode(self.msg)

    def use_fatal_exceptions(self):
        return False


class NotFound(DCCommonException):
    pass


class Forbidden(DCCommonException):
    message = _("Requested API is forbidden")


class Conflict(DCCommonException):
    pass


class ServiceUnavailable(DCCommonException):
    message = _("The service is unavailable")


class InvalidInputError(DCCommonException):
    message = _("An invalid value was provided")


class InternalError(DCCommonException):
    message = _("Error when performing operation")


class ProjectNotFound(NotFound):
    message = _("Project %(project_id)s doesn't exist")


class OAMAddressesNotFound(NotFound):
    message = _("OAM Addresses Not Found")


class CertificateNotFound(NotFound):
    message = _("Certificate in region=%(region_name)s with signature "
                "%(signature)s not found")


class LoadNotFound(NotFound):
    message = _("Load in region=%(region_name)s with id %(load_id)s not found")


class LoadNotInVault(NotFound):
    message = _("Load at path %(path)s not found")


class LoadMaxReached(Conflict):
    message = _("Load in region=%(region_name)s at maximum number of loads")


class PlaybookExecutionFailed(DCCommonException):
    message = _("Playbook execution failed, command=%(playbook_cmd)s")


class PlaybookExecutionTimeout(PlaybookExecutionFailed):
    message = _("Playbook execution failed [TIMEOUT (%(timeout)s)], "
                "command=%(playbook_cmd)s")
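

The message-template pattern used by these classes can be tried out
with a dependency-free stand-in. The Demo* classes below are
hypothetical simplifications (no oslo_utils, six, or i18n) that mirror
only the "format the class message with the constructor kwargs, fall
back to the raw template on failure" behaviour. They are not the
classes from this file.

```python
class DemoException(Exception):
    """Simplified stand-in for DCCommonException."""

    message = "An unknown exception occurred."

    def __init__(self, **kwargs):
        try:
            self.msg = self.message % kwargs
        except Exception:
            # Mirror the non-fatal path: keep the unformatted template.
            self.msg = self.message
        super().__init__(self.msg)


class DemoPlaybookExecutionFailed(DemoException):
    message = "Playbook execution failed, command=%(playbook_cmd)s"


class DemoPlaybookExecutionTimeout(DemoPlaybookExecutionFailed):
    message = ("Playbook execution failed [TIMEOUT (%(timeout)s)], "
               "command=%(playbook_cmd)s")


try:
    raise DemoPlaybookExecutionTimeout(
        playbook_cmd="ansible-playbook site.yml", timeout=1800)
except DemoPlaybookExecutionFailed as exc:
    # The timeout subclass is caught by handlers for the failure class.
    caught_message = str(exc)
```

Because the timeout exception derives from the failure exception,
existing handlers for failed playbooks also see timeouts.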