164 Commits

Author SHA1 Message Date
Jenkins
9b4a981f95 Merge "Fix race during major-upgrade-pacemaker step" 2016-11-10 19:00:08 +00:00
Jenkins
1efaa8c6a2 Merge "Reload haproxy configuration as a post-deployment step" 2016-11-09 18:10:35 +00:00
Michele Baldessari
dde12b075f Fix race during major-upgrade-pacemaker step
Currently when we call the major-upgrade step we do the following:
"""
...
if [[ -n $(is_bootstrap_node) ]]; then
    check_clean_cluster
fi
...
if [[ -n $(is_bootstrap_node) ]]; then
    migrate_full_to_ng_ha
fi
...
for service in $(services_to_migrate); do
    manage_systemd_service stop "${service%%-clone}"
    ...
done
"""

The problem with the above code is that it is open to the following race
condition:
1. Code gets run first on a non-bootstrap controller node so we start
stopping a bunch of services
2. Pacemaker will notice that the services are down and will mark
them as stopped
3. Code gets run on the bootstrap node (controller-0) and the
check_clean_cluster function will fail and exit
4. Eventually the script on the non-bootstrap controller node will also
time out and exit, because the cluster never shut down (the shutdown never
actually started, since we failed at step 3)

Let's make sure we first call only the HA NG migration, as a separate
heat step. Only afterwards do we start shutting down the systemd
services on all nodes.

We also need to move the STONITH_STATE variable into a file because it
is being used across two different scripts (1 and 2) and we need to
store that state.
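
A minimal sketch of what persisting that state could look like (the file path
and the pcs query are assumptions, not taken from this change):

# in script 1: remember whether stonith was enabled before the upgrade
STONITH_STATE=$(pcs property show stonith-enabled | grep "stonith-enabled" | awk '{ print $2 }')
echo "$STONITH_STATE" > /root/stonith-state.txt   # assumed path

# in script 2: read it back and restore the property
STONITH_STATE=$(cat /root/stonith-state.txt)
pcs property set stonith-enabled=$STONITH_STATE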

Co-Authored-By: Athlan-Guyot Sofer <sathlang@redhat.com>

Closes-Bug: #1640407
Change-Id: Ifb9b9e633fcc77604cca2590071656f4b2275c60
2016-11-09 14:51:51 +01:00
Pradeep Kilambi
094bbefe71 ceilometer compute agent needs restart on compute upgrade
After compute nodes are upgraded, the ceilometer compute agent
doesn't poll and throws warnings. Restarting the compute agent
at this step gets the service back to its normal state.
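
A minimal sketch of such a restart on the compute node (the systemd unit name
openstack-ceilometer-compute is an assumption, not quoted from this change):

# restart the polling agent once the compute packages have been upgraded
systemctl restart openstack-ceilometer-compute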

Closes-Bug: #1640177

Change-Id: I7392de43e933b1d16002e12e407748ae289d5e99
2016-11-08 15:52:04 +00:00
Carlos Camacho
17e727d716 Reload haproxy configuration as a post-deployment step
After deploying a freshly installed Overcloud or updating the stack,
the haproxy configuration is updated correctly but no change in the
HAProxy stats happens.

This submission adds the missing resources to run pre- and post-puppet
tasks.
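
Effectively, the post-deployment step boils down to something like the
following on each controller (a sketch; the exact task wiring lives in the
heat templates):

# re-read the freshly written configuration without dropping existing connections
systemctl reload haproxy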

Closes-bug: 1640175

Change-Id: I2f08704daeee502c618256695a30ce244a1d7ba5
2016-11-08 13:56:18 +00:00
Jenkins
cac8c7d285 Merge "Update openstack-puppet-modules dependencies" 2016-11-04 14:08:15 +00:00
Jenkins
27a9382dd8 Merge "Fixup the start of swift services" 2016-11-04 14:08:08 +00:00
Jenkins
2df74cc829 Merge "Rework gnocchi-upgrade to run in a separate upgrade step" 2016-11-03 17:28:58 +00:00
marios
a7af5b90e4 Fixup the start of swift services
It seems the conditional has changed and we should pick up the
tripleo::profile::base::swift::storage::enable_swift_storage
hiera data.

After the controller nodes were upgraded the swift services were down
even though there was no stand-alone swift node (the current
conditional was failing as that hiera key isn't set any more).
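
A sketch of picking up that hiera key in the upgrade script (the hiera
invocation and the service names are assumptions):

# only (re)start the swift storage services when the role actually hosts them
ENABLE_SWIFT_STORAGE=$(hiera -c /etc/puppet/hiera.yaml tripleo::profile::base::swift::storage::enable_swift_storage)
if [[ "$ENABLE_SWIFT_STORAGE" != "false" ]]; then
    systemctl start openstack-swift-account openstack-swift-container openstack-swift-object
fi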

Closes-Bug: 1638821
Change-Id: Id1383c1e54f9cae13fd375e90da525230e5d23eb
2016-11-03 07:33:40 +00:00
Lukas Bezdicka
d8fa70d2fd Update openstack-puppet-modules dependencies
The OPM package is a metadata package with unversioned requirements,
which means that updating it does not update its dependencies. This
leaves us with old puppet modules and an old puppet during the puppet run.

Change-Id: I80f8a73142a09bb4178bb5a396d256ba81ba98a8
Closes-Bug: #1638266
Resolves: rhbz#1390559
2016-11-01 13:44:57 +01:00
Pradeep Kilambi
a8e119094f Rework gnocchi-upgrade to run in a separate upgrade step
When configured with swift, gnocchi requires keystone to be available
to authenticate in order to migrate to v3. At this step keystone is not
yet available and the gnocchi upgrade fails with an auth error. Instead,
start apache first in step 3, and then run the gnocchi upgrade in a
separate step, letting the migration happen there.
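
A sketch of the intended ordering (check_resource_systemd is the helper used
elsewhere in these scripts; the exact invocation here is an assumption):

# step 3: bring apache (and with it keystone) up first
systemctl start httpd
check_resource_systemd httpd started 600

# separate, later step: the schema migration can now authenticate against keystone
gnocchi-upgrade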

Closes-Bug: #1634897

Change-Id: I22d02528420e4456f84b80905a7b3a80653fa7b0
2016-11-01 08:33:23 -04:00
Mathieu Bultel
61cba946cd Add replacepkgs to the manual ovs upgrade workaround and fix a typo
The rpm command will return exit code 1 if the ovs package is already
there, which will make the step_1.sh script exit. To get around this,
force the update with --replacepkgs.

Also remove the \ just before the $, which causes a syntax
error on the ceph storage nodes.
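
A sketch of the workaround (the package path is illustrative):

# force the (re)install so an already-installed ovs package does not abort the script
rpm -Uvh --replacepkgs openvswitch-*.rpm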

Change-Id: I11fcf688982ceda5eef7afc8904afae44300c2d9
Closes-bug: 1636748
2016-10-27 11:38:12 +03:00
Jenkins
ab00d9393b Merge "Fix the stonith property during upgrades" 2016-10-25 14:38:50 +00:00
Michele Baldessari
3866490052 Fix the rabbitmq/redis pacemaker resource timeouts on updates
With the following two changes we increased the timeout for redis and
rabbit for both starting and stopping to 200s:
https://review.openstack.org/386618 newton (merged)
https://review.openstack.org/385555 master (merged)

We also want to fix that on minor updates on all our supported
releases upstream and downstream (newton, mitaka, liberty, kilo).
This way we can guarantee that we have a uniform timeout for
start and stop of rabbit and redis across all our releases.

Change-Id: If59bf3386832ee78d3a654f01077aff2e8be76e8
Closes-Bug: #1634851
2016-10-22 20:59:16 +00:00
Michele Baldessari
7ce217909a Fix the stonith property during upgrades
We currently set the stonith property from all controller nodes during
the upgrade. This is racy and can actually end up disabling stonith after
the upgrade even if it was enabled beforehand.

Let's set the property only from the bootstrap node.
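
A sketch of the guarded call, reusing the is_bootstrap_node helper quoted
elsewhere in this log (the value written is illustrative):

if [[ -n $(is_bootstrap_node) ]]; then
    pcs property set stonith-enabled=false    # value shown is illustrative
fi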

Change-Id: Id4afb867b485ac853be874a0179a7ed7cc914068
Closes-Bug: #1635294
2016-10-20 20:16:28 +02:00
marios
7e09b70bc3 Add special case handling for OVS upgrade in updates and upgrades
This adds special-case handling for the openvswitch package,
as discussed in the related bug below.
It is added/handled here for both the minor update and the
major mitaka...newton upgrade.

Change-Id: I9b1f0eaa0d36a28e20b507bec6a4e9b3af1781ae
Closes-Bug: 1635205
2016-10-20 13:42:37 +03:00
Michele Baldessari
30a570a7f4 Actually start the systemd services in step3 of the major-upgrade step
We have the following code in the upgrade process, after we have updated
the packages and called the db-sync commands:
services=$(services_to_migrate)
...
for service in $(services); do
    manage_systemd_service start "${service%%-clone}"
    check_resource_systemd "${service%%-clone}" started 600
done

The above is broken because $services contains the list of services to
start, but $(services) substitutes a command named services instead of
expanding the variable, so the for loop never executes anything.

One of the symptoms for this is the openstack-nova-compute service not
restarting on the compute nodes during the yum -y upgrade. The reason
for this is that during the service restart, nova-compute waits for
nova-conductor to show up in the rabbitmq queues, which cannot happen
since the service was actually never started.
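
The fix boils down to expanding the variable rather than substituting a
command, roughly:

services=$(services_to_migrate)
for service in $services; do
    manage_systemd_service start "${service%%-clone}"
    check_resource_systemd "${service%%-clone}" started 600
done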

Change-Id: I811ff19d7b44a935b2ec5c5e66e5b5191b259eb3
Closes-Bug: #1630580
2016-10-10 21:18:26 +02:00
Pradeep Kilambi
eaf91da5ef Ceilometer Wsgi Mitaka->Newton upgrades
In Newton, the ceilometer api is changed to run under apache wsgi
instead of eventlet. This requires mitaka deployments to switch
to wsgi when upgrading.

Closes-Bug: 1631297
Change-Id: If9d6987cd0a8fc5d3f9de518ba422d97d5149732
2016-10-07 11:43:33 +03:00
marios
2e6cc07c1a Adds Environment File for Removing Sahara during M/N upgrade
The default path, if the operator does nothing, is to keep the
sahara services during mitaka to newton upgrades.

If the operator wishes to remove the sahara services, then they
need to specify the provided major-upgrade-remove-sahara.yaml
environment file in the stack upgrade commands.
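
A sketch of such an upgrade command (the template path and the other
environment file are assumptions; only the remove-sahara file name comes from
this change):

openstack overcloud deploy --templates \
  -e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-pacemaker.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-remove-sahara.yaml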

The existing migration to the ha arch already removes the constraints
and pcs resources for the sahara api/engine, so we just need to stop
them from starting again if we want to remove them.

This adds a KeepSaharaServiceOnUpgrade parameter to determine whether
Sahara is disabled from starting up after the controllers are
upgraded (it defaults to true).

Finally, it is worth noting that we default the sahara services
to 'on' during converge, in the resource_registry of the
converge environment file; any subsequent stack update where
the deployment contains sahara services will need to
include the -e /environments/services/sahara.yaml environment
file.

Related-Bug: 1630247
Change-Id: I59536cae3260e3df52589289b4f63e9ea0129407
2016-10-05 16:32:31 +03:00
Jenkins
575bf581ea Merge "Set ceph osd max object name and namespace len on upgrade when on ext4" 2016-10-04 03:01:11 +00:00
Jenkins
5f7d913c10 Merge "Update $service to $resource this variable does not exist in the context" 2016-10-03 18:17:47 +00:00
Mathieu Bultel
dc6f93da4f Update $service to $resource this variable does not exist in the context
heat failed due to:
service: unbound variable
In this context $service is never set.

Change-Id: If82ee4562612f2617b676732956396278ee40a88
Closes-Bug: #1629903
2016-10-03 17:28:08 +02:00
Michele Baldessari
1d7231aae2 Change the rabbitmq ha policies during an M/N Upgrade
This takes care of the M->N upgrade path when changing
the ha rabbitmq policy.

Partial-Bug: #1628998

Change-Id: I2468a096b5d7042bc801a742a7a85fb1521c1c02
2016-10-03 10:49:39 +02:00
Jenkins
01198c81d2 Merge "Use -L with chown and set crush map tunables when upgrading Ceph" 2016-09-29 23:58:01 +00:00
Jenkins
f5f41504e5 Merge "Fix typo in fixing gnocchi upgrade." 2016-09-29 23:57:43 +00:00
Giulio Fidente
27e1d105fb Set ceph osd max object name and namespace len on upgrade when on ext4
As per [1] we need to lower osd max object name and namespace len when
upgrading from Hammer and the OSD is backed by ext4.

These could also be given via ExtraConfig but on upgrade we only run
puppet apply after this script is executed, so the values won't be
effective unless the daemon is restarted. Yet we do not want puppet
to restart the daemon because we can't bring all OSDs down
unconditionally or guests will die.

1. http://tracker.ceph.com/issues/16187
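
A sketch of what lowering those limits could look like (the exact values and
the crudini usage are assumptions based on the referenced ceph issue, not
quoted from this change):

# values commonly recommended for ext4-backed OSDs
crudini --set /etc/ceph/ceph.conf global osd_max_object_name_len 256
crudini --set /etc/ceph/ceph.conf global osd_max_object_namespace_len 64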

Co-Authored-By: Michele Baldessari <michele@acksyn.org>
Co-Authored-By: Dimitri Savineau <dsavinea@redhat.com>
Change-Id: I7fec4e2426bdacd5f364adbebd42ab23dcfa523a
Closes-Bug: 1628874
2016-09-29 16:15:13 +00:00
Jenkins
72aa430246 Merge "Relax pre-upgrade check for failed actions" 2016-09-29 14:56:49 +00:00
Jenkins
6fadcce868 Merge "Fix races in major-upgrade-pacemaker Step2" 2016-09-29 14:56:41 +00:00
Sofer Athlan-Guyot
371698a203 Fix typo in fixing gnocchi upgrade.
Change-Id: I44451a280dd928cd694dd6845d5d83040ad1f482
Related-Bug: #1626592
2016-09-29 15:22:16 +02:00
Jenkins
77480ec29c Merge "Full HA->HA NG migration might fail setting maintenance-mode" 2016-09-29 13:08:35 +00:00
Giulio Fidente
059307718f Use -L with chown and set crush map tunables when upgrading Ceph
Previously the chown command wasn't traversing symlinks, causing
the new ownership to not be set for some needed files.

This change also ensures the crush map tunables are set to the 'default'
profile after the upgrade.

Finally redirects the output of a pidof to /dev/null to avoid spurious
logging.
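
A sketch of the commands described above (the data path is illustrative):

# follow symlinks while fixing ownership for the ceph user
chown -R -L ceph:ceph /var/lib/ceph
# reset the crush map tunables to the default profile after the upgrade
ceph osd crush tunables default
# silence the spurious logging from the pid lookup
pidof ceph-mon &> /dev/null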

Change-Id: Id4865ffff207edfc727d729f9cc04e6e81ad19d8
2016-09-29 13:35:05 +02:00
Michele Baldessari
32c54304f4 Relax pre-upgrade check for failed actions
Before this change we checked the cluster for any failed actions and
we stopped the upgrade process if there were any.
This is likely excessive, as a failed action could have happened in the
past while the cluster is now fully functional.

Better to check if any of the resources are in Stopped state and break
the upgrade process if any of them are.

We also need to restrict this check to the bootstrap node because
otherwise the following might happen:
1) Bootstrap node does the check, it is successful and it starts
   the full HA -> HA NG migration which *will* create failed actions
   and will start stopping resources
2) If the check now starts on a non-bootstrap node while 1) is ongoing,
   it will find either failed actions or stopped resources so it will
   fail.
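
A sketch of the relaxed check (the exact grep pattern is an assumption;
echo_error and is_bootstrap_node are helpers quoted elsewhere in this log):

if [[ -n $(is_bootstrap_node) ]]; then
    if pcs status | grep -q "Stopped"; then
        echo_error "ERROR: some pacemaker resources are stopped, please fix them before upgrading"
        exit 1
    fi
fi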

Change-Id: Ib091f6dd8884025d2e23bf2fa700169e2dec778f
Closes-Bug: #1628653
2016-09-29 09:02:24 +02:00
Michele Baldessari
ad07a29f94 Fix races in major-upgrade-pacemaker Step2
tripleo-heat-templates/extraconfig/tasks/major_upgrade_controller_pacemaker_2.sh
has the following code:
"""
...
check_resource mongod started 600

if [[ -n $(is_bootstrap_node) ]]; then
...
    tstart=$(date +%s)
    while ! clustercheck; do
        sleep 5
        tnow=$(date +%s)
        if (( tnow-tstart > galera_sync_timeout )) ; then
            echo_error "ERROR galera sync timed out"
            exit 1
        fi
    done

    # Run all the db syncs
    cinder-manage db sync
...
fi

start_or_enable_service rabbitmq
check_resource rabbitmq started 600
start_or_enable_service redis
check_resource redis started 600
start_or_enable_service openstack-cinder-volume
check_resource openstack-cinder-volume started 600

systemctl_swift start

for service in $(services_to_migrate); do
    manage_systemd_service start "${service%%-clone}"
    check_resource_systemd "${service%%-clone}" started 600
done
"""

The problem with the above code is that it is open to the following race
condition:
1) Bootstrap node is busy checking the galera status via cluster check
2) Non-bootstrap node has already reached: start_or_enable_service
   rabbitmq and later lines. These lines will be skipped because
   start_or_enable_service is a noop on non-bootstrap nodes and
   check_resource rabbitmq only checks that pcs status |grep rabbitmq
   returns true.
3) Non-bootstrap node can then reach the manage_systemd_service start
   and it will fail with stuff like:
  "Job for openstack-nova-scheduler.service failed because the control
  process exited with error code. See \"systemctl status
  openstack-nova-scheduler.service\" and \"journalctl -xe\" for
  details.\n" (because the db tables are not migrated yet)

This happens because step 3) starts on non-bootstrap nodes before the
db-sync statements have completed on the bootstrap node. I did not feel
like changing the semantics of check_resource and removing the noop on
non-bootstrap nodes, as other parts of the tree might rely on this
behaviour.

Depends-On: Ia016264b51f485b97fa150ebd357b109581342ed
Change-Id: I663313e183bb05b35d0c5af016c2d1705c772bd9
Closes-Bug: #1627965
2016-09-29 07:41:28 +02:00
Sofer Athlan-Guyot
89efa79599 Update gnocchi database during M/N upgrade.
We call gnocchi-upgrade to make sure we update all the needed schemas
during the major-upgrade-pacemaker step.

We also make sure that redis is started before we call gnocchi-upgrade;
otherwise the command will be stuck in a loop trying to contact redis.
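
A sketch of the ordering, reusing helpers quoted elsewhere in this log:

start_or_enable_service redis
check_resource redis started 600
# only now can the schema migration talk to redis
gnocchi-upgrade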

Closes-Bug: #1626592
Change-Id: Ia016264b51f485b97fa150ebd357b109581342ed
2016-09-28 22:46:01 +02:00
Michele Baldessari
35da6af8bd Full HA->HA NG migration might fail setting maintenance-mode
Currently we do the following in the migration path:
pcs property set maintenance-mode=true
if ! timeout -k 10 300 crm_resource --wait; then
     echo_error "ERROR: cluster remained unstable after setting maintenance-mode for more than 300 seconds, exiting."
     exit 1
fi

crm_resource --wait can actually take forever under certain conditions.
The property will be set atomically across the cluster nodes so we should be good
without this.

Change-Id: I8f531d63479b81d65b572c4431c9db6f610f7e04
Closes-Bug: #1628393
2016-09-28 12:28:39 +02:00
Michele Baldessari
da53e9c00b Fix "Not all flavors have been migrated to the API database"
After a successful upgrade to Newton, I ran the tripleo.sh
--overcloud-pingtest and it failed with the following:

resources.test_flavor: Not all flavors have been migrated to the API database (HTTP 409)

The issue is the fact that some tables have migrated to the
nova_api db and we need to migrate the data as well.

Currently we do:
    nova-manage db sync
    nova-manage api_db sync

We want to add:
    nova-manage db online_data_migrations

After launching this command the overcloud-pingtest works correctly:
tripleo.sh -- Overcloud pingtest SUCCEEDED

Change-Id: Id2d5b28b5d4ade7dff6c5e760be0f509b4fe5096
Closes-Bug: #1628450
2016-09-28 12:20:33 +02:00
Jenkins
a7b7a118bf Merge "Remove deprecated scheduler_driver settings" 2016-09-27 08:50:49 +00:00
Jenkins
9e1d7f0495 Merge "Disable openstack-cinder-volume in step1 and reenable it in step2" 2016-09-27 06:50:12 +00:00
Jenkins
fb8338ec24 Merge "Fix ignore warning on ceph major upgrade." 2016-09-27 02:04:27 +00:00
Jenkins
7565e03a82 Merge "A few major-upgrade issues" 2016-09-27 01:11:46 +00:00
Jenkins
9023746e1f Merge "Start mongod before calling ceilometer-dbsync" 2016-09-27 01:11:39 +00:00
Jenkins
a85936ea65 Merge "Reinstantiate parts of code that were accidentally removed" 2016-09-27 01:11:32 +00:00
Sofer Athlan-Guyot
def3801fcb Fix ignore warning on ceph major upgrade.
The parameter IgnoreCephUpgradeWarnings is type-cast into a boolean
which is rendered as the string 'True' or 'False', not 'true' or
'false'. This fixes the check.
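
A sketch of the corrected comparison (the shell variable name and the health
command are assumptions):

# the heat boolean reaches the script as the string 'True' or 'False'
if [[ "$ignore_ceph_upgrade_warnings" != "True" ]]; then
    # abort if ceph is not fully healthy
    if ! ceph health | grep -q HEALTH_OK; then
        echo "ERROR: ceph reports warnings, aborting the upgrade"
        exit 1
    fi
fi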

Change-Id: I8840c384d07f9d185a72bde5f91a3872a321f623
Closes-Bug: 1627736
2016-09-26 16:23:18 +00:00
Michele Baldessari
9393a3e2a5 get_param calls with multiple arguments need brackets around them
This issue was spotted during major upgrade where we had calls like
this:

   servers: {get_param: servers, Controller}

These get_param calls hang indefinitely and make the whole
upgrade end in a timeout. We need to put brackets around the
get_param arguments when there are multiple arguments:
http://docs.openstack.org/developer/heat/template_guide/hot_spec.html#get-param
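
The corrected form, per the linked hot_spec documentation:

   servers: {get_param: [servers, Controller]}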

This is already done in most of the tree, and the few places where this
was not happening were parts not under CI. After this change the
following grep returns only one false positive:

   grep -ir get_param: |grep -v -- '\[' |grep ','

Change-Id: I65b23bb44f37b93e017dd15a5212939ffac76614
Closes-Bug: #1626628
2016-09-25 22:05:00 +02:00
Michele Baldessari
f9e6a26f32 A few major-upgrade issues
This commit does the following:
1. We now explicitly disable/stop and then remove the resources that are
   moving to systemd. We do this because we want to make sure they are all
   stopped before doing a yum upgrade, which otherwise would take ages due
   to rabbitmq and galera being down. It is best if we do this via pcs
   while we do the HA Full -> HA NG migration because it is simpler to make
   sure all the services are stopped at that stage. For extra safety we can
   still do a check by hand. By doing it via pacemaker we have the
   guarantee that all the migrated services are down already when we stop
   the cluster (which happens to be a synchronization point between all
   controller nodes). That way we can be certain that they are all down on
   all nodes before starting the yum upgrade process (see the sketch after
   this list).

2. We actually need to start the systemd services in
   major_upgrade_controller_pacemaker_2.sh and not stop them.

3. We need to use the proper bash variable name

4. Use is_bootstrap_node everywhere to make the code more consistent
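
A sketch of the idea in point 1, using the services_to_migrate helper quoted
elsewhere in this log (the exact pcs invocations are assumptions):

for resource in $(services_to_migrate); do
    # stop the resource under pacemaker control first...
    pcs resource disable "$resource"
done
# ...and once everything is confirmed down, drop it from the cluster definition
for resource in $(services_to_migrate); do
    pcs resource delete "$resource"
done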

Change-Id: Ic565c781b80357bed9483df45a4a94ec0423487c
Closes-Bug: #1627490
2016-09-25 14:10:31 +02:00
Michele Baldessari
b70d6e6f34 Disable openstack-cinder-volume in step1 and reenable it in step2
Currently we do not disable openstack-cinder-volume during our
major-upgrade-pacemaker step. This leads to the following scenario. In
major_upgrade_controller_pacemaker_2.sh we do:

  start_or_enable_service galera
  check_resource galera started 600
  ....
  if [[ -n $(is_bootstrap_node) ]]; then
  ...
      cinder-manage db sync
  ...

What happens here is that, since openstack-cinder-volume was never
disabled, it will already have been started by pacemaker before we call
cinder-manage, and this will give us the following errors during the
start:
06:05:21.861 19482 ERROR cinder.cmd.volume DBError:
                   (pymysql.err.InternalError) (1054, u"Unknown column 'services.cluster_name' in 'field list'")
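
A sketch of the resulting ordering (the pcs calls are assumptions; the db sync
line is quoted from the script above):

# step 1: keep the volume service down across the upgrade
pcs resource disable openstack-cinder-volume

# step 2: run the migration first, then bring the service back
cinder-manage db sync
pcs resource enable openstack-cinder-volume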

Change-Id: I01b2daf956c30b9a4985ea62cbf4c941ec66dcdf
Closes-Bug: #1627470
2016-09-25 11:52:04 +02:00
Michele Baldessari
9593981149 Start mongod before calling ceilometer-dbsync
Currently in major_upgrade_controller_pacemaker_2.sh we are calling
ceilometer-dbsync before mongod is actually started (only galera is
started at this point). This makes the dbsync hang indefinitely
until the heat stack times out.

Now this approach should be okay, but do note that when we start mongod
via systemctl we are not guaranteed that it will be up on all nodes
before we call ceilometer-dbsync. This *should* be okay because
ceilometer-dbsync keeps retrying and eventually one of the nodes will
be available. A completely clean fix here would be to add another
step in heat to have the guarantee that all mongo servers are up and
running before the dbsync call.
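
A sketch of the reordering (the exact systemctl invocation is an assumption):

# make sure mongod is up locally before kicking off the migration
systemctl start mongod
ceilometer-dbsync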

Change-Id: I10c960b1e0efdeb1e55d77c25aebf1e3e67f17ca
Closes-Bug: #1627453
2016-09-25 10:49:15 +02:00
Michele Baldessari
16aba8f2b6 Remove deprecated scheduler_driver settings
In bug https://bugs.launchpad.net/tripleo/+bug/1615035 we fixed the
scheduler_host setting, which got deprecated in newton. It seems the
scheduler_driver setting also needs tweaking:

systemctl status openstack-nova-scheduler.service:
2016-09-24 20:24:54.337 15278 WARNING stevedore.named [-] Could not load nova.scheduler.filter_scheduler.FilterScheduler
2016-09-24 20:24:54.338 15278 CRITICAL nova [-] RuntimeError: (u'Cannot load scheduler driver from configuration %(conf)s.',
                              {'conf': 'nova.scheduler.filter_scheduler.FilterScheduler'})

Let's set this to default during the upgrade step. From newton's nova.conf:

  The class of the driver used by the scheduler. This should be chosen
  from one of the entrypoints under the namespace 'nova.scheduler.driver'
  of file 'setup.cfg'. If nothing is specified in this option, the
  'filter_scheduler' is used.

  This option also supports deprecated full Python path to the class to
  be used.  For example, "nova.scheduler.filter_scheduler.FilterScheduler".
  But note: this support will be dropped in the N Release.
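
A sketch of resetting the option during the upgrade step, using crudini as the
other upgrade fixes in this log do (the value written is an assumption):

crudini --set /etc/nova/nova.conf DEFAULT scheduler_driver filter_scheduler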

Change-Id: Ic384292ad05a57757158995ec4c1a269fe4b00f1
Depends-On: I89124ead8928ff33e6b6907a7c2178169e91f4e6
Closes-Bug: #1627450
2016-09-25 10:34:22 +02:00
Michele Baldessari
24a73efdd0 Reinstantiate parts of code that were accidentally removed
With commit fb25385d34e604d2f670cebe3e03fd57c14fa6be
"Rework the pacemaker_common_functions for M..N upgrades" we
accidentally removed some lines that fixed M/N upgrade issues.
Namely:
extraconfig/tasks/major_upgrade_controller_pacemaker_1.sh

  -# https://bugzilla.redhat.com/show_bug.cgi?id=1284047
  -# Change-Id: Ib3f6c12ff5471e1f017f28b16b1e6496a4a4b435
  -crudini --set /etc/ceilometer/ceilometer.conf DEFAULT rpc_backend rabbit
  -# https://bugzilla.redhat.com/show_bug.cgi?id=1284058
  -# Ifd1861e3df46fad0e44ff9b5cbd58711bbc87c97 Swift Ceilometer middleware no longer exists
  -crudini --set /etc/swift/proxy-server.conf pipeline:main pipeline "catch_errors healthcheck cache ratelimit tempurl formpost authtoken keystone staticweb proxy-logging proxy-server"
  -# LP: 1615035, required only for M/N upgrade.
  -crudini --set /etc/nova/nova.conf DEFAULT scheduler_host_manager host_manager

extraconfig/tasks/major_upgrade_controller_pacemaker_2.sh
  nova-manage db sync
- nova-manage api_db sync

This patch simply puts that code back without reverting the
whole commit that removed it, because the rest of that commit is needed.

Closes-Bug: #1627448

Change-Id: I89124ead8928ff33e6b6907a7c2178169e91f4e6
2016-09-25 10:18:57 +02:00
Sofer Athlan-Guyot
bc7f6ab041 Make sure major upgrade script fails.
Running upgrade-non-controller.sh against compute and object storage
nodes did not fail if /root/tripleo_upgrade_node.sh failed.

This makes it harder to detect errors, for instance in a CI system.
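
A sketch of the idea: check and propagate the exit status of the node upgrade
script instead of ignoring it (the surrounding wiring is assumed):

/root/tripleo_upgrade_node.sh
RESULT=$?
if [[ $RESULT -ne 0 ]]; then
    echo "ERROR: tripleo_upgrade_node.sh failed with exit code $RESULT"
    exit $RESULT
fi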

Change-Id: I12b7d640547d3b8ec1f70104d159d6052b7638ff
Closes-Bug: 1620973
2016-09-21 08:02:17 +00:00