164 Commits

Author SHA1 Message Date
Jenkins
9b4a981f95 Merge "Fix race during major-upgrade-pacemaker step" 2016-11-10 19:00:08 +00:00
Jenkins
1efaa8c6a2 Merge "Reload haproxy configuration as a post-deployment step" 2016-11-09 18:10:35 +00:00
Michele Baldessari
dde12b075f Fix race during major-upgrade-pacemaker step
Currently when we call the major-upgrade step we do the following:
"""
...
if [[ -n $(is_bootstrap_node) ]]; then
    check_clean_cluster
fi
...
if [[ -n $(is_bootstrap_node) ]]; then
    migrate_full_to_ng_ha
fi
...
for service in $(services_to_migrate); do
    manage_systemd_service stop "${service%%-clone}"
    ...
done
"""

The problem with the above code is that it is open to the following race
condition:
1. Code gets run first on a non-bootstrap controller node so we start
stopping a bunch of services
2. Pacemaker will notice that the services are down and will mark
them as stopped
3. Code gets run on the bootstrap node (controller-0) and the
check_clean_cluster function will fail and exit
4. Eventually the script on the non-bootstrap controller node will also
time out and exit, because the cluster never shut down (the shutdown never
actually started, since we failed at step 3)

Let's make sure we first call only the HA NG migration, as a separate
heat step. Only afterwards do we start shutting down the systemd
services on all nodes.

We also need to move the STONITH_STATE variable into a file because it
is being used across two different scripts (1 and 2) and we need to
store that state.
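
A minimal sketch of what persisting that state could look like (the file path
and the pcs query are assumptions, not taken from this change):

# in script 1: remember whether stonith was enabled before the upgrade
STONITH_STATE=$(pcs property show stonith-enabled | grep "stonith-enabled" | awk '{ print $2 }')
echo "$STONITH_STATE" > /root/stonith-state.txt   # assumed path

# in script 2: read it back and restore the property
STONITH_STATE=$(cat /root/stonith-state.txt)
pcs property set stonith-enabled=$STONITH_STATE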

Co-Authored-By: Athlan-Guyot Sofer <sathlang@redhat.com>

Closes-Bug: #1640407
Change-Id: Ifb9b9e633fcc77604cca2590071656f4b2275c60
2016-11-09 14:51:51 +01:00
Pradeep Kilambi
094bbefe71 ceilometer compute agent needs restart on compute upgrade
After compute nodes are upgraded, the ceilometer compute agent
doesn't poll and throws warnings. Restarting the compute agent
at this step gets the service back to its normal state.
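
A minimal sketch of such a restart on the compute node (the systemd unit name
openstack-ceilometer-compute is an assumption, not quoted from this change):

# restart the polling agent once the compute packages have been upgraded
systemctl restart openstack-ceilometer-compute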

Closes-Bug: #1640177

Change-Id: I7392de43e933b1d16002e12e407748ae289d5e99
2016-11-08 15:52:04 +00:00
Carlos Camacho
17e727d716 Reload haproxy configuration as a post-deployment step
After deploying a freshly installed Overcloud or updating the stack,
the haproxy configuration is updated correctly but no change in the
HAProxy stats happens.

This submission adds the missing resources to run pre- and post-puppet
tasks.
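
Effectively, the post-deployment step boils down to something like the
following on each controller (a sketch; the exact task wiring lives in the
heat templates):

# re-read the freshly written configuration without dropping existing connections
systemctl reload haproxy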

Closes-bug: 1640175

Change-Id: I2f08704daeee502c618256695a30ce244a1d7ba5
2016-11-08 13:56:18 +00:00
Jenkins
cac8c7d285 Merge "Update openstack-puppet-modules dependencies" 2016-11-04 14:08:15 +00:00
Jenkins
27a9382dd8 Merge "Fixup the start of swift services" 2016-11-04 14:08:08 +00:00
Jenkins
2df74cc829 Merge "Rework gnocchi-upgrade to run in a separate upgrade step" 2016-11-03 17:28:58 +00:00
marios
a7af5b90e4 Fixup the start of swift services
It seems the conditional has changed and we should pick up the
tripleo::profile::base::swift::storage::enable_swift_storage
hiera data.

After the controller nodes were upgraded the swift services were down
even though there was no stand-alone swift node (the current
conditional was failing as that hiera key isn't set any more).
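
A sketch of picking up that hiera key in the upgrade script (the hiera
invocation and the service names are assumptions):

# only (re)start the swift storage services when the role actually hosts them
ENABLE_SWIFT_STORAGE=$(hiera -c /etc/puppet/hiera.yaml tripleo::profile::base::swift::storage::enable_swift_storage)
if [[ "$ENABLE_SWIFT_STORAGE" != "false" ]]; then
    systemctl start openstack-swift-account openstack-swift-container openstack-swift-object
fi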

Closes-Bug: 1638821
Change-Id: Id1383c1e54f9cae13fd375e90da525230e5d23eb
2016-11-03 07:33:40 +00:00
Lukas Bezdicka
d8fa70d2fd Update openstack-puppet-modules dependencies
The OPM package is a metadata package with unversioned requirements,
which means that updating it does not update its dependencies. This
leaves us with old puppet modules and an old puppet during the puppet run.

Change-Id: I80f8a73142a09bb4178bb5a396d256ba81ba98a8
Closes-Bug: #1638266
Resolves: rhbz#1390559
2016-11-01 13:44:57 +01:00
Pradeep Kilambi
a8e119094f Rework gnocchi-upgrade to run in a separate upgrade step
When configured with swift, gnocchi requires keystone to be available
to authenticate in order to migrate to v3. At this step keystone is not
yet available and the gnocchi upgrade fails with an auth error. Instead,
start apache first in step 3, and then run the gnocchi upgrade in a
separate step, letting the migration happen there.
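
A sketch of the intended ordering (check_resource_systemd is the helper used
elsewhere in these scripts; the exact invocation here is an assumption):

# step 3: bring apache (and with it keystone) up first
systemctl start httpd
check_resource_systemd httpd started 600

# separate, later step: the schema migration can now authenticate against keystone
gnocchi-upgrade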

Closes-Bug: #1634897

Change-Id: I22d02528420e4456f84b80905a7b3a80653fa7b0
2016-11-01 08:33:23 -04:00
Mathieu Bultel
61cba946cd Add replacepkgs to the manual ovs upgrade workaround and fix a typo
The rpm command will return exit code 1 if the ovs package is already
there, which will make the step_1.sh script exit. To get around this,
force the update with --replacepkgs.

Also remove the \ just before the $, which causes a syntax
error on the ceph storage nodes.
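
A sketch of the workaround (the package path is illustrative):

# force the (re)install so an already-installed ovs package does not abort the script
rpm -Uvh --replacepkgs openvswitch-*.rpm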

Change-Id: I11fcf688982ceda5eef7afc8904afae44300c2d9
Closes-bug: 1636748
2016-10-27 11:38:12 +03:00
Jenkins
ab00d9393b Merge "Fix the stonith property during upgrades" 2016-10-25 14:38:50 +00:00
Michele Baldessari
3866490052 Fix the rabbitmq/redis pacemaker resource timeouts on updates
With the following two changes we increased the timeout for redis and
rabbit for both starting and stopping to 200s:
https://review.openstack.org/386618 newton (merged)
https://review.openstack.org/385555 master (merged)

We also want to fix that on minor updates on all our supported
releases upstream and downstream (newton, mitaka, liberty, kilo).
This way we can guarantee that we have a uniform timeout for
start and stop of rabbit and redis across all our releases.

Change-Id: If59bf3386832ee78d3a654f01077aff2e8be76e8
Closes-Bug: #1634851
2016-10-22 20:59:16 +00:00
Michele Baldessari
7ce217909a Fix the stonith property during upgrades
We currently set the stonith property from all controller nodes during
the upgrade. This is racy and can actually end up disabling stonith after
the upgrade even if it was enabled beforehand.

Let's set the property only from the bootstrap node.
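
A sketch of the guarded call, reusing the is_bootstrap_node helper quoted
elsewhere in this log (the value written is illustrative):

if [[ -n $(is_bootstrap_node) ]]; then
    pcs property set stonith-enabled=false    # value shown is illustrative
fi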

Change-Id: Id4afb867b485ac853be874a0179a7ed7cc914068
Closes-Bug: #1635294
2016-10-20 20:16:28 +02:00
marios
7e09b70bc3 Add special case handling for OVS upgrade in updates and upgrades
This adds special-case handling for the openvswitch package,
as discussed in the related bug below.
It is added/handled here for both the minor update and the
major mitaka...newton upgrade.

Change-Id: I9b1f0eaa0d36a28e20b507bec6a4e9b3af1781ae
Closes-Bug: 1635205
2016-10-20 13:42:37 +03:00
Michele Baldessari
30a570a7f4 Actually start the systemd services in step3 of the major-upgrade step
We have the following code in the upgrade process, after we have updated
the packages and called the db-sync commands:
services=$(services_to_migrate)
...
for service in $(services); do
    manage_systemd_service start "${service%%-clone}"
    check_resource_systemd "${service%%-clone}" started 600
done

The above is broken because $services contains the list of services to
start, but $(services) substitutes a command named services instead of
expanding the variable, so the for loop never executes anything.

One of the symptoms for this is the openstack-nova-compute service not
restarting on the compute nodes during the yum -y upgrade. The reason
for this is that during the service restart, nova-compute waits for
nova-conductor to show up in the rabbitmq queues, which cannot happen
since the service was actually never started.
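
The fix boils down to expanding the variable rather than substituting a
command, roughly:

services=$(services_to_migrate)
for service in $services; do
    manage_systemd_service start "${service%%-clone}"
    check_resource_systemd "${service%%-clone}" started 600
done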

Change-Id: I811ff19d7b44a935b2ec5c5e66e5b5191b259eb3
Closes-Bug: #1630580
2016-10-10 21:18:26 +02:00
Pradeep Kilambi
eaf91da5ef Ceilometer Wsgi Mitaka->Newton upgrades
In Newton, the ceilometer api is changed to run under apache wsgi
instead of eventlet. This requires mitaka deployments to switch
to wsgi when upgrading.

Closes-Bug: 1631297
Change-Id: If9d6987cd0a8fc5d3f9de518ba422d97d5149732
2016-10-07 11:43:33 +03:00
marios
2e6cc07c1a Adds Environment File for Removing Sahara during M/N upgrade
The default path, if the operator does nothing, is to keep the
sahara services during mitaka to newton upgrades.

If the operator wishes to remove the sahara services, then they
need to specify the provided major-upgrade-remove-sahara.yaml
environment file in the stack upgrade commands.
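
A sketch of such an upgrade command (the template path and the other
environment file are assumptions; only the remove-sahara file name comes from
this change):

openstack overcloud deploy --templates \
  -e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-pacemaker.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-remove-sahara.yaml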

The existing migration to the ha arch already removes the constraints
and pcs resources for the sahara api/engine, so we just need to stop
them from starting again if we want to remove them.

This adds a KeepSaharaServiceOnUpgrade parameter to determine whether
Sahara is disabled from starting up after the controllers are
upgraded (it defaults to true).

Finally, it is worth noting that we default the sahara services
to 'on' during converge, in the resource_registry of the
converge environment file; any subsequent stack update where
the deployment contains sahara services will need to
include the -e /environments/services/sahara.yaml environment
file.

Related-Bug: 1630247
Change-Id: I59536cae3260e3df52589289b4f63e9ea0129407
2016-10-05 16:32:31 +03:00
Jenkins
575bf581ea Merge "Set ceph osd max object name and namespace len on upgrade when on ext4" 2016-10-04 03:01:11 +00:00
Jenkins
5f7d913c10 Merge "Update $service to $resource this variable does not exist in the context" 2016-10-03 18:17:47 +00:00
Mathieu Bultel
dc6f93da4f Update $service to $resource this variable does not exist in the context
heat failed due to:
service: unbound variable
In this context $service is never set.

Change-Id: If82ee4562612f2617b676732956396278ee40a88
Closes-Bug: #1629903
2016-10-03 17:28:08 +02:00
Michele Baldessari
1d7231aae2 Change the rabbitmq ha policies during an M/N Upgrade
This takes care of the M->N upgrade path when changing
the ha rabbitmq policy.

Partial-Bug: #1628998

Change-Id: I2468a096b5d7042bc801a742a7a85fb1521c1c02
2016-10-03 10:49:39 +02:00
Jenkins
01198c81d2 Merge "Use -L with chown and set crush map tunables when upgrading Ceph" 2016-09-29 23:58:01 +00:00
Jenkins
f5f41504e5 Merge "Fix typo in fixing gnocchi upgrade." 2016-09-29 23:57:43 +00:00
Giulio Fidente
27e1d105fb Set ceph osd max object name and namespace len on upgrade when on ext4
As per [1] we need to lower osd max object name and namespace len when
upgrading from Hammer and the OSD is backed by ext4.

These could also be given via ExtraConfig but on upgrade we only run
puppet apply after this script is executed, so the values won't be
effective unless the daemon is restarted. Yet we do not want puppet
to restart the daemon because we can't bring all OSDs down
unconditionally or guests will die.

1. http://tracker.ceph.com/issues/16187
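
A sketch of what lowering those limits could look like (the exact values and
the crudini usage are assumptions based on the referenced ceph issue, not
quoted from this change):

# values commonly recommended for ext4-backed OSDs
crudini --set /etc/ceph/ceph.conf global osd_max_object_name_len 256
crudini --set /etc/ceph/ceph.conf global osd_max_object_namespace_len 64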

Co-Authored-By: Michele Baldessari <michele@acksyn.org>
Co-Authored-By: Dimitri Savineau <dsavinea@redhat.com>
Change-Id: I7fec4e2426bdacd5f364adbebd42ab23dcfa523a
Closes-Bug: 1628874
2016-09-29 16:15:13 +00:00
Jenkins
72aa430246 Merge "Relax pre-upgrade check for failed actions" 2016-09-29 14:56:49 +00:00
Jenkins
6fadcce868 Merge "Fix races in major-upgrade-pacemaker Step2" 2016-09-29 14:56:41 +00:00
Sofer Athlan-Guyot
371698a203 Fix typo in fixing gnocchi upgrade.
Change-Id: I44451a280dd928cd694dd6845d5d83040ad1f482
Related-Bug: #1626592
2016-09-29 15:22:16 +02:00
Jenkins
77480ec29c Merge "Full HA->HA NG migration might fail setting maintenance-mode" 2016-09-29 13:08:35 +00:00
Giulio Fidente
059307718f Use -L with chown and set crush map tunables when upgrading Ceph
Previously the chown command wasn't traversing symlinks, causing
the new ownership to not be set for some needed files.

This change also ensures the crush map tunables are set to the 'default'
profile after the upgrade.

Finally redirects the output of a pidof to /dev/null to avoid spurious
logging.
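
A sketch of the commands described above (the data path is illustrative):

# follow symlinks while fixing ownership for the ceph user
chown -R -L ceph:ceph /var/lib/ceph
# reset the crush map tunables to the default profile after the upgrade
ceph osd crush tunables default
# silence the spurious logging from the pid lookup
pidof ceph-mon &> /dev/null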

Change-Id: Id4865ffff207edfc727d729f9cc04e6e81ad19d8
2016-09-29 13:35:05 +02:00
Michele Baldessari
32c54304f4 Relax pre-upgrade check for failed actions
Before this change we checked the cluster for any failed actions and
we stopped the upgrade process if there were any.
This is likely excessive, as a failed action could have happened in the
past while the cluster is now fully functional.

Better to check if any of the resources are in Stopped state and break
the upgrade process if any of them are.

We also need to restrict this check to the bootstrap node because
otherwise the following might happen:
1) Bootstrap node does the check, it is successful and it starts
   the full HA -> HA NG migration which *will* create failed actions
   and will start stopping resources
2) If the check now starts on a non-bootstrap node while 1) is ongoing,
   it will find either failed actions or stopped resources so it will
   fail.
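
A sketch of the relaxed check (the exact grep pattern is an assumption;
echo_error and is_bootstrap_node are helpers quoted elsewhere in this log):

if [[ -n $(is_bootstrap_node) ]]; then
    if pcs status | grep -q "Stopped"; then
        echo_error "ERROR: some pacemaker resources are stopped, please fix them before upgrading"
        exit 1
    fi
fi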

Change-Id: Ib091f6dd8884025d2e23bf2fa700169e2dec778f
Closes-Bug: #1628653
2016-09-29 09:02:24 +02:00
Michele Baldessari
ad07a29f94 Fix races in major-upgrade-pacemaker Step2
tripleo-heat-templates/extraconfig/tasks/major_upgrade_controller_pacemaker_2.sh
has the following code:
"""
...
check_resource mongod started 600

if [[ -n $(is_bootstrap_node) ]]; then
...
    tstart=$(date +%s)
    while ! clustercheck; do
        sleep 5
        tnow=$(date +%s)
        if (( tnow-tstart > galera_sync_timeout )) ; then
            echo_error "ERROR galera sync timed out"
            exit 1
        fi
    done

    # Run all the db syncs
    cinder-manage db sync
...
fi

start_or_enable_service rabbitmq
check_resource rabbitmq started 600
start_or_enable_service redis
check_resource redis started 600
start_or_enable_service openstack-cinder-volume
check_resource openstack-cinder-volume started 600

systemctl_swift start

for service in $(services_to_migrate); do
    manage_systemd_service start "${service%%-clone}"
    check_resource_systemd "${service%%-clone}" started 600
done
"""

The problem with the above code is that it is open to the following race
condition:
1) Bootstrap node is busy checking the galera status via cluster check
2) Non-bootstrap node has already reached: start_or_enable_service
   rabbitmq and later lines. These lines will be skipped because
   start_or_enable_service is a noop on non-bootstrap nodes and
   check_resource rabbitmq only checks that pcs status |grep rabbitmq
   returns true.
3) Non-bootstrap node can then reach the manage_systemd_service start
   and it will fail with stuff like:
  "Job for openstack-nova-scheduler.service failed because the control
  process exited with error code. See \"systemctl status
  openstack-nova-scheduler.service\" and \"journalctl -xe\" for
  details.\n" (because the db tables are not migrated yet)

This happens because step 3) starts on non-bootstrap nodes before the
db-sync statements have completed on the bootstrap node. I did not feel
like changing the semantics of check_resource and removing the noop on
non-bootstrap nodes, as other parts of the tree might rely on this
behaviour.

Depends-On: Ia016264b51f485b97fa150ebd357b109581342ed
Change-Id: I663313e183bb05b35d0c5af016c2d1705c772bd9
Closes-Bug: #1627965
2016-09-29 07:41:28 +02:00
Sofer Athlan-Guyot
89efa79599 Update gnocchi database during M/N upgrade.
We call gnocchi-upgrade to make sure we update all the needed schemas
during the major-upgrade-pacemaker step.

We also make sure that redis is started before we call gnocchi-upgrade;
otherwise the command will be stuck in a loop trying to contact redis.
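
A sketch of the ordering, reusing helpers quoted elsewhere in this log:

start_or_enable_service redis
check_resource redis started 600
# only now can the schema migration talk to redis
gnocchi-upgrade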

Closes-Bug: #1626592
Change-Id: Ia016264b51f485b97fa150ebd357b109581342ed
2016-09-28 22:46:01 +02:00
Michele Baldessari
35da6af8bd Full HA->HA NG migration might fail setting maintenance-mode
Currently we do the following in the migration path:
pcs property set maintenance-mode=true
if ! timeout -k 10 300 crm_resource --wait; then
     echo_error "ERROR: cluster remained unstable after setting maintenance-mode for more than 300 seconds, exiting."
     exit 1
fi

crm_resource --wait can actually take forever under certain conditions.
The property will be set atomically across the cluster nodes so we should be good
without this.

Change-Id: I8f531d63479b81d65b572c4431c9db6f610f7e04
Closes-Bug: #1628393
2016-09-28 12:28:39 +02:00
Michele Baldessari
da53e9c00b Fix "Not all flavors have been migrated to the API database"
After a successful upgrade to Newton, I ran the tripleo.sh
--overcloud-pingtest and it failed with the following:

resources.test_flavor: Not all flavors have been migrated to the API database (HTTP 409)

The issue is the fact that some tables have migrated to the
nova_api db and we need to migrate the data as well.

Currently we do:
    nova-manage db sync
    nova-manage api_db sync

We want to add:
    nova-manage db online_data_migrations

After launching this command the overcloud-pingtest works correctly:
tripleo.sh -- Overcloud pingtest SUCCEEDED

Change-Id: Id2d5b28b5d4ade7dff6c5e760be0f509b4fe5096
Closes-Bug: #1628450
2016-09-28 12:20:33 +02:00
Jenkins
a7b7a118bf Merge "Remove deprecated scheduler_driver settings" 2016-09-27 08:50:49 +00:00
Jenkins
9e1d7f0495 Merge "Disable openstack-cinder-volume in step1 and reenable it in step2" 2016-09-27 06:50:12 +00:00
Jenkins
fb8338ec24 Merge "Fix ignore warning on ceph major upgrade." 2016-09-27 02:04:27 +00:00
Jenkins
7565e03a82 Merge "A few major-upgrade issues" 2016-09-27 01:11:46 +00:00
Jenkins
9023746e1f Merge "Start mongod before calling ceilometer-dbsync" 2016-09-27 01:11:39 +00:00
Jenkins
a85936ea65 Merge "Reinstantiate parts of code that were accidentally removed" 2016-09-27 01:11:32 +00:00
Sofer Athlan-Guyot
def3801fcb Fix ignore warning on ceph major upgrade.
The parameter IgnoreCephUpgradeWarnings is type-cast into a boolean
which is rendered as the string 'True' or 'False', not 'true' or
'false'. This fixes the check.
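
A sketch of the corrected comparison (the shell variable name and the health
command are assumptions):

# the heat boolean reaches the script as the string 'True' or 'False'
if [[ "$ignore_ceph_upgrade_warnings" != "True" ]]; then
    # abort if ceph is not fully healthy
    if ! ceph health | grep -q HEALTH_OK; then
        echo "ERROR: ceph reports warnings, aborting the upgrade"
        exit 1
    fi
fi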

Change-Id: I8840c384d07f9d185a72bde5f91a3872a321f623
Closes-Bug: 1627736
2016-09-26 16:23:18 +00:00
Michele Baldessari
9393a3e2a5 get_param calls with multiple arguments need brackets around them
This issue was spotted during major upgrade where we had calls like
this:

   servers: {get_param: servers, Controller}

These get_param calls hang indefinitely and make the whole
upgrade end in a timeout. We need to put brackets around the
get_param arguments when there are multiple arguments:
http://docs.openstack.org/developer/heat/template_guide/hot_spec.html#get-param
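
The corrected form, per the linked hot_spec documentation:

   servers: {get_param: [servers, Controller]}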

This is already done in most of the tree, and the few places where this
was not happening were parts not under CI. After this change the
following grep returns only one false positive:

   grep -ir get_param: |grep -v -- '\[' |grep ','

Change-Id: I65b23bb44f37b93e017dd15a5212939ffac76614
Closes-Bug: #1626628
2016-09-25 22:05:00 +02:00
Michele Baldessari
f9e6a26f32 A few major-upgrade issues
This commit does the following:
1. We now explicitly disable/stop and then remove the resources that are
   moving to systemd. We do this because we want to make sure they are all
   stopped before doing a yum upgrade, which otherwise would take ages due
   to rabbitmq and galera being down. It is best if we do this via pcs
   while we do the HA Full -> HA NG migration because it is simpler to make
   sure all the services are stopped at that stage. For extra safety we can
   still do a check by hand. By doing it via pacemaker we have the
   guarantee that all the migrated services are down already when we stop
   the cluster (which happens to be a synchronization point between all
   controller nodes). That way we can be certain that they are all down on
   all nodes before starting the yum upgrade process (see the sketch after
   this list).

2. We actually need to start the systemd services in
   major_upgrade_controller_pacemaker_2.sh and not stop them.

3. We need to use the proper bash variable name

4. Use is_bootstrap_node everywhere to make the code more consistent
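
A sketch of the idea in point 1, using the services_to_migrate helper quoted
elsewhere in this log (the exact pcs invocations are assumptions):

for resource in $(services_to_migrate); do
    # stop the resource under pacemaker control first...
    pcs resource disable "$resource"
done
# ...and once everything is confirmed down, drop it from the cluster definition
for resource in $(services_to_migrate); do
    pcs resource delete "$resource"
done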

Change-Id: Ic565c781b80357bed9483df45a4a94ec0423487c
Closes-Bug: #1627490
2016-09-25 14:10:31 +02:00
Michele Baldessari
b70d6e6f34 Disable openstack-cinder-volume in step1 and reenable it in step2
Currently we do not disable openstack-cinder-volume during our
major-upgrade-pacemaker step. This leads to the following scenario. In
major_upgrade_controller_pacemaker_2.sh we do:

  start_or_enable_service galera
  check_resource galera started 600
  ....
  if [[ -n $(is_bootstrap_node) ]]; then
  ...
      cinder-manage db sync
  ...

What happens here is that, since openstack-cinder-volume was never
disabled, it will already have been started by pacemaker before we call
cinder-manage, and this will give us the following errors during the
start:
06:05:21.861 19482 ERROR cinder.cmd.volume DBError:
                   (pymysql.err.InternalError) (1054, u"Unknown column 'services.cluster_name' in 'field list'")
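
A sketch of the resulting ordering (the pcs calls are assumptions; the db sync
line is quoted from the script above):

# step 1: keep the volume service down across the upgrade
pcs resource disable openstack-cinder-volume

# step 2: run the migration first, then bring the service back
cinder-manage db sync
pcs resource enable openstack-cinder-volume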

Change-Id: I01b2daf956c30b9a4985ea62cbf4c941ec66dcdf
Closes-Bug: #1627470
2016-09-25 11:52:04 +02:00
Michele Baldessari
9593981149 Start mongod before calling ceilometer-dbsync
Currently in major_upgrade_controller_pacemaker_2.sh we are calling
ceilometer-dbsync before mongod is actually started (only galera is
started at this point). This makes the dbsync hang indefinitely
until the heat stack times out.

Now this approach should be okay, but do note that when we start mongod
via systemctl we are not guaranteed that it will be up on all nodes
before we call ceilometer-dbsync. This *should* be okay because
ceilometer-dbsync keeps retrying and eventually one of the nodes will
be available. A completely clean fix here would be to add another
step in heat to have the guarantee that all mongo servers are up and
running before the dbsync call.
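
A sketch of the reordering (the exact systemctl invocation is an assumption):

# make sure mongod is up locally before kicking off the migration
systemctl start mongod
ceilometer-dbsync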

Change-Id: I10c960b1e0efdeb1e55d77c25aebf1e3e67f17ca
Closes-Bug: #1627453
2016-09-25 10:49:15 +02:00
Michele Baldessari
16aba8f2b6 Remove deprecated scheduler_driver settings
In bug https://bugs.launchpad.net/tripleo/+bug/1615035 we fixed the
scheduler_host setting, which got deprecated in newton. It seems the
scheduler_driver setting also needs tweaking:

systemctl status openstack-nova-scheduler.service:
2016-09-24 20:24:54.337 15278 WARNING stevedore.named [-] Could not load nova.scheduler.filter_scheduler.FilterScheduler
2016-09-24 20:24:54.338 15278 CRITICAL nova [-] RuntimeError: (u'Cannot load scheduler driver from configuration %(conf)s.',
                              {'conf': 'nova.scheduler.filter_scheduler.FilterScheduler'})

Let's set this to default during the upgrade step. From newton's nova.conf:

  The class of the driver used by the scheduler. This should be chosen
  from one of the entrypoints under the namespace 'nova.scheduler.driver'
  of file 'setup.cfg'. If nothing is specified in this option, the
  'filter_scheduler' is used.

  This option also supports deprecated full Python path to the class to
  be used.  For example, "nova.scheduler.filter_scheduler.FilterScheduler".
  But note: this support will be dropped in the N Release.
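
A sketch of resetting the option during the upgrade step, using crudini as the
other upgrade fixes in this log do (the value written is an assumption):

crudini --set /etc/nova/nova.conf DEFAULT scheduler_driver filter_scheduler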

Change-Id: Ic384292ad05a57757158995ec4c1a269fe4b00f1
Depends-On: I89124ead8928ff33e6b6907a7c2178169e91f4e6
Closes-Bug: #1627450
2016-09-25 10:34:22 +02:00
Michele Baldessari
24a73efdd0 Reinstantiate parts of code that were accidentally removed
With commit fb25385d34e604d2f670cebe3e03fd57c14fa6be
"Rework the pacemaker_common_functions for M..N upgrades" we
accidentally removed some lines that fixed M/N upgrade issues.
Namely:
extraconfig/tasks/major_upgrade_controller_pacemaker_1.sh

  -# https://bugzilla.redhat.com/show_bug.cgi?id=1284047
  -# Change-Id: Ib3f6c12ff5471e1f017f28b16b1e6496a4a4b435
  -crudini --set /etc/ceilometer/ceilometer.conf DEFAULT rpc_backend rabbit
  -# https://bugzilla.redhat.com/show_bug.cgi?id=1284058
  -# Ifd1861e3df46fad0e44ff9b5cbd58711bbc87c97 Swift Ceilometer middleware no longer exists
  -crudini --set /etc/swift/proxy-server.conf pipeline:main pipeline "catch_errors healthcheck cache ratelimit tempurl formpost authtoken keystone staticweb proxy-logging proxy-server"
  -# LP: 1615035, required only for M/N upgrade.
  -crudini --set /etc/nova/nova.conf DEFAULT scheduler_host_manager host_manager

extraconfig/tasks/major_upgrade_controller_pacemaker_2.sh
  nova-manage db sync
- nova-manage api_db sync

This patch simply puts that code back without reverting the
whole commit that removed it, because the rest of that commit is needed.

Closes-Bug: #1627448

Change-Id: I89124ead8928ff33e6b6907a7c2178169e91f4e6
2016-09-25 10:18:57 +02:00
Sofer Athlan-Guyot
bc7f6ab041 Make sure major upgrade script fails.
Running upgrade-non-controller.sh against compute and object storage
nodes did not fail if /root/tripleo_upgrade_node.sh failed.

This makes it harder to detect errors, for instance in a CI system.
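
A sketch of the idea: check and propagate the exit status of the node upgrade
script instead of ignoring it (the surrounding wiring is assumed):

/root/tripleo_upgrade_node.sh
RESULT=$?
if [[ $RESULT -ne 0 ]]; then
    echo "ERROR: tripleo_upgrade_node.sh failed with exit code $RESULT"
    exit $RESULT
fi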

Change-Id: I12b7d640547d3b8ec1f70104d159d6052b7638ff
Closes-Bug: 1620973
2016-09-21 08:02:17 +00:00