Handle online_data_migrations exceptions

When online_data_migrations raise exceptions, nova/cinder-manage catches
the exceptions, prints fairly useless "something didn't work" messages,
and moves on. Two issues:

1) The user(/admin) has no way to see what actually failed (exception
   detail is not logged)

2) The command returns exit status 0, as if all possible migrations have
   been completed successfully - this can cause failures to get missed,
   especially if automated

This change adds logging of the exceptions, and introduces a new exit
status of 2, which indicates that no updates took effect in the last
batch attempt, but some are (still) failing, which requires intervention.

Change-Id: Ib684091af0b19e62396f6becc78c656c49a60504
Closes-Bug: #1796192
This commit is contained in:
imacdonn
2018-10-04 21:27:18 +00:00
committed by Matt Riedemann
parent 1c8b14d24c
commit 3eea37b85b
5 changed files with 96 additions and 26 deletions

View File

@@ -86,16 +86,25 @@ Nova Database
Lists and optionally deletes database records where instance_uuid is NULL.
``nova-manage db online_data_migrations [--max-count]``
Perform data migration to update all live data. Return exit code 0 if
migrations were successful or exit code 1 for partial updates. This command
should be called after upgrading database schema and nova services on all
controller nodes. If the command exits with partial updates (exit code 1)
the command will need to be called again.
Perform data migration to update all live data.
``--max-count`` controls the maximum number of objects to migrate in a given
call. If not specified, migration will occur in batches of 50 until fully
complete.
Returns exit code 0 if no (further) updates are possible, 1 if the ``--max-count``
option was used and some updates were completed successfully (even if others generated
errors), 2 if some updates generated errors and no other migrations were able to take
effect in the last batch attempted, or 127 if invalid input is provided (e.g.
non-numeric max-count).
This command should be called after upgrading database schema and nova services on
all controller nodes. If it exits with partial updates (exit status 1) it should
be called again, even if some updates initially generated errors, because some updates
may depend on others having completed. If it exits with status 2, intervention is
required to resolve the issue causing remaining updates to fail. It should be
considered successfully completed only when the exit status is 0.
``nova-manage db ironic_flavor_migration [--all] [--host] [--node] [--resource_class]``
Perform the ironic flavor migration process against the database
while services are offline. This is `not recommended` for most

View File

@@ -140,14 +140,18 @@ same time.
``nova-manage db online_data_migrations --max-count <number>``. Note
that you can use the ``--max-count`` argument to reduce the load this
operation will place on the database, which allows you to run a
small chunk of the migrations until all of the work is done. Each
time it is run, it will show a summary of completed and remaining
records. You run this command until you see completed and
remaining records as zeros. The chunk size you should use depend
on your infrastructure and how much additional load you can
impose on the database. To reduce load, perform smaller batches
with delays between chunks. To reduce time to completion, run
larger batches.
small chunk of the migrations until all of the work is done. The chunk size
you should use depends on your infrastructure and how much additional load
you can impose on the database. To reduce load, perform smaller batches
with delays between chunks. To reduce time to completion, run larger batches.
Each time it is run, the command will show a summary of completed and remaining
records. If using the ``--max-count`` option, the command should be rerun
while it returns exit status 1 (which indicates that some migrations took
effect, and more work may remain to be done), even if some migrations
produce errors. If all possible migrations have completed and some are
still producing errors, exit status 2 will be returned. In this case, the
cause of the errors should be investigated and resolved. Migrations should be
considered successfully completed only when the command returns exit status 0.
* At this point, you must also ensure you update the configuration, to stop
using any deprecated features or options, and perform any required work

View File

@@ -75,6 +75,8 @@ from nova.virt import ironic
CONF = nova.conf.CONF
LOG = logging.getLogger(__name__)
QUOTAS = quota.QUOTAS
# Keep this list sorted and one entry per line for readability.
@@ -670,14 +672,18 @@ Error: %s""") % six.text_type(e))
def _run_migration(self, ctxt, max_count):
ran = 0
exceptions = False
migrations = {}
for migration_meth in self.online_migrations:
count = max_count - ran
try:
found, done = migration_meth(ctxt, count)
except Exception:
print(_("Error attempting to run %(method)s") % dict(
method=migration_meth))
msg = (_("Error attempting to run %(method)s") % dict(
method=migration_meth))
print(msg)
LOG.exception(msg)
exceptions = True
found = done = 0
name = migration_meth.__name__
@@ -691,7 +697,7 @@ Error: %s""") % six.text_type(e))
ran += done
if ran >= max_count:
break
return migrations
return migrations, exceptions
@args('--max-count', metavar='<number>', dest='max_count',
help='Maximum number of objects to consider')
@@ -713,8 +719,9 @@ Error: %s""") % six.text_type(e))
ran = None
migration_info = {}
exceptions = False
while ran is None or ran != 0:
migrations = self._run_migration(ctxt, max_count)
migrations, exceptions = self._run_migration(ctxt, max_count)
ran = 0
for name in migrations:
migration_info.setdefault(name, (0, 0))
@@ -734,6 +741,18 @@ Error: %s""") % six.text_type(e))
t.add_row([name, info[0], info[1]])
print(t)
# NOTE(imacdonn): In the "unlimited" case, the loop above will only
# terminate when all possible migrations have been effected. If we're
# still getting exceptions, there's a problem that requires
# intervention. In the max-count case, exceptions are only considered
# fatal if no work was done by any other migrations ("not ran"),
# because otherwise work may still remain to be done, and that work
# may resolve dependencies for the failing migrations.
if exceptions and (unlimited or not ran):
print(_("Some migrations failed unexpectedly. Check log for "
"details."))
return 2
# TODO(mriedem): Potentially add another return code for
# "there are more migrations, but not completable right now"
return ran and 1 or 0

View File

@@ -799,13 +799,38 @@ Running batches of 50 until complete
self.assertEqual(0, total[0])
self.assertEqual([50, 50, 50, 50], runs)
def test_online_migrations_error(self):
fake_migration = mock.MagicMock()
fake_migration.side_effect = Exception
fake_migration.__name__ = 'fake'
command_cls = self._fake_db_command((fake_migration,))
@mock.patch('nova.context.get_admin_context')
def test_online_migrations_error(self, mock_get_context):
good_remaining = [50]
def good_migration(context, count):
self.assertEqual(mock_get_context.return_value, context)
found = good_remaining[0]
done = min(found, count)
good_remaining[0] -= done
return found, done
bad_migration = mock.MagicMock()
bad_migration.side_effect = test.TestingException
bad_migration.__name__ = 'bad'
command_cls = self._fake_db_command((bad_migration, good_migration))
command = command_cls()
command.online_data_migrations(None)
# bad_migration raises an exception, but it could be because
# good_migration had not completed yet. We should get 1 in this case,
# because some work was done, and the command should be reiterated.
self.assertEqual(1, command.online_data_migrations(max_count=50))
# When running this for the second time, there's no work left for
# good_migration to do, but bad_migration still fails - should
# get 2 this time.
self.assertEqual(2, command.online_data_migrations(max_count=50))
# When --max-count is not used, we should get 2 if all possible
# migrations completed but some raise exceptions
good_remaining = [125]
self.assertEqual(2, command.online_data_migrations(None))
def test_online_migrations_bad_max(self):
self.assertEqual(127,
@@ -817,19 +842,19 @@ Running batches of 50 until complete
def test_online_migrations_no_max(self):
with mock.patch.object(self.commands, '_run_migration') as rm:
rm.return_value = {}
rm.return_value = {}, False
self.assertEqual(0,
self.commands.online_data_migrations())
def test_online_migrations_finished(self):
with mock.patch.object(self.commands, '_run_migration') as rm:
rm.return_value = {}
rm.return_value = {}, False
self.assertEqual(0,
self.commands.online_data_migrations(max_count=5))
def test_online_migrations_not_finished(self):
with mock.patch.object(self.commands, '_run_migration') as rm:
rm.return_value = {'mig': (10, 5)}
rm.return_value = {'mig': (10, 5)}, False
self.assertEqual(1,
self.commands.online_data_migrations(max_count=5))

View File

@@ -0,0 +1,13 @@
---
upgrade:
- |
The ``nova-manage db online_data_migrations`` command now returns exit
status 2 in the case where some migrations failed (raised exceptions) and
no others were completed successfully from the last batch attempted. This
should be considered a fatal condition that requires intervention. Exit
status 1 will be returned in the case where the ``--max-count`` option was
used and some migrations failed but others succeeded (updated at least one
row), because more work may remain for the non-failing migrations, and
their completion may be a dependency for the failing ones. The command
should be reiterated while it returns exit status 1, and considered
completed successfully only when it returns exit status 0.