904 Commits

Author SHA1 Message Date
Anderson, Craig (ca846m)
48a0c09fea Truncate long host names for overrides
Long hostnames can cause the 63-character name limit to be exceeded.
Truncate the hostname if it is longer than 20 characters.

Change-Id: Ieb7e4dafb41d1fe3ab3d663d2614f75c814afee6
2018-11-26 17:04:58 -08:00
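The truncation described in this commit can be sketched as follows. This is a minimal Python illustration, not the chart's actual template logic: the function name `override_name`, the `prefix` argument and the way the final name is assembled are assumptions made for the example. Only the 63-character limit and the 20-character hostname cut-off come from the commit message.

```python
MAX_NAME_LEN = 63      # name length limit cited in the commit message
MAX_HOSTNAME_LEN = 20  # hostnames longer than this are truncated


def override_name(prefix: str, hostname: str) -> str:
    """Build a per-host override name (hypothetical naming scheme)."""
    # Keep at most the first 20 characters of the hostname.
    short_host = hostname[:MAX_HOSTNAME_LEN]
    name = f"{prefix}-{short_host}"
    # Truncating the hostname does not help if the prefix itself is too long.
    if len(name) > MAX_NAME_LEN:
        raise ValueError(f"name {name!r} still exceeds {MAX_NAME_LEN} characters")
    return name


if __name__ == "__main__":
    print(override_name("ceph-osd", "a-very-long-hostname.rack12.example.com"))
```

Cutting at a fixed length can make two long hostnames with a shared prefix collide, so a real implementation may need an extra uniqueness check.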
Zuul
0730df5973 Merge "Prometheus: Add session affinity to ingress" 2018-11-26 18:21:14 +00:00
Zuul
4b76f8c280 Merge "Nagios: Update image tag" 2018-11-26 17:40:20 +00:00
Steve Wilkerson
71c1a16758 Prometheus: Add session affinity to ingress
This adds session affinity to the Prometheus ingress, which allows
cookies to be used for Prometheus session affinity.

Change-Id: I2e7e1d1b5120c1fb3ddecb5883845e46d61273de
2018-11-26 14:30:08 +00:00
Steve Wilkerson
439079693d Nagios: Update image tag
This updates the Nagios image tag to include the updated plugin
for querying Elasticsearch for alerting on logged events

Change-Id: Idd61d82463b79baab0e94c20b32da1dc6a8b3634
2018-11-26 08:29:22 -06:00
Zuul
8e369d2c9c Merge "Ingress: Update version of ingress controller image" 2018-11-23 20:39:38 +00:00
Zuul
89b651dc1d Merge "Ingress: Make healthz port configurable" 2018-11-21 20:01:26 +00:00
Pete Birley
4d2085f0af Ingress: Update version of ingress controller image
This PS updates the version of the ingress controller image used.

This brings in the ability to update the ingress configuration without
reloading nginx. There may also need to be some changes for prom based
monitoring:
 * https://github.com/kubernetes/ingress-nginx/blob/master/Changelog.md#0100

Change-Id: Ia0bf3dbb9b726f3a5cfb1f95d7ede456af13374a
Signed-off-by: Pete Birley <pete@port.direct>
2018-11-21 19:21:40 +00:00
Zuul
16072765bf Merge "Ingress: Allow status port to be customised" 2018-11-20 18:29:16 +00:00
Pete Birley
ea875b1dcc Ingress: Make healthz port configurable
This PS updates the healthz port to be configurable

Change-Id: Ifa5ea4b7b422156a7309886ecc21668fc096065b
Signed-off-by: Pete Birley <pete@port.direct>
2018-11-20 12:28:14 -06:00
Pete Birley
f3e1fa4e72 Ingress: Allow status port to be customised
This PS updates the ingress chart to allow the status port to be
changed.

Change-Id: Ia38223c56806f6113622a809e792b4fedd010d87
Signed-off-by: Pete Birley <pete@port.direct>
2018-11-20 09:57:56 -06:00
Matthew Heler
5ce9f2eb3b Enable Ceph charts to be rack aware for CRUSH
Add support for a rack level CRUSH map. Rack level CRUSH support is
enabled by using the "rack_replicated_rule" crush rule.

Change-Id: I4df224f2821872faa2eddec2120832e9a22f4a7c
2018-11-20 09:07:36 -06:00
Zuul
5d356f9265 Merge "Document how to recover from a Ceph namespace deletion" 2018-11-15 17:27:45 +00:00
Matthew Heler
cfc2d4abd8 Document how to recover from a Ceph namespace deletion
Change-Id: Ib1b03cd046fbdad6f18478cfa9c9f0bf70ec9430
2018-11-14 13:31:16 -06:00
Zuul
dd6b2a0a1d Merge "Additional Ceph RGW tuning and cleanups" 2018-11-14 18:48:36 +00:00
Zuul
5bf9c26bd8 Merge "Move default CEPH journal size from 5GB to 10GB" 2018-11-13 05:28:45 +00:00
Matthew Heler
225b85eb5f Additional Ceph RGW tuning and cleanups
Set RGW rados handles from 1 to 4
Remove support for fastcgi (it's no longer supported)

Change-Id: Ie260a3e1e5eab2065ec6a4d0637c144965a4214d
2018-11-12 20:13:33 +00:00
Zuul
2640e7422d Merge "This fixes host-specific overrides" 2018-11-10 02:37:53 +00:00
Zuul
2c9ff8bee8 Merge "Fix the checkPGs cronjob" 2018-11-09 22:57:50 +00:00
Ian Howell
9b132225c6 This fixes host-specific overrides
This properly assigns k8s secrets to volumes, rather than using
configMaps

Change-Id: Ifcabd3565fb2abee063f5da117d83ac3a5602536
2018-11-09 16:24:03 -06:00
Steve Wilkerson
dfb4654fba Nagios: Configuration updates
This moves to update the host used for the ceph health checks, as
we should be checking the ceph-mgr service directly for ceph
metrics instead of trying to curl the host directly.

This also changes the ceph_health_check to use the base-os
hostgroup instead of the placeholder ceph-mgr host group, as we're
just executing a simple check against the ceph-mgr service.

This also adds default configuration values for the
max_concurrent_checks (60) and check_workers (4) values instead
of leaving them at the defaults Nagios uses (0 and # cores,
respectively)

Change-Id: Ib4072fcd545d8c05d5e9e4a93085a8330be6dfe0
2018-11-09 13:28:50 -06:00
Steve Wilkerson
325b3cea4d Nagios: Update host check mechanism
This updates the Nagios image to use a tag that includes a fix for
the service discovery mechanism used for updating host checks.
After moving the Nagios chart to either run in shared or host PID
namespaces, the service discovery mechanism no longer worked due
to the plugin attempting to restart PID 1 instead of determining
the appropriate PID to restart.

For reference, see:
https://review.gerrithub.io/#/c/att-comdev/nagios/+/432205/

Change-Id: Ie01c3a93dd109a9dc99cfac5d27991583546605a
2018-11-09 09:12:16 -06:00
Zuul
b55e9b10a7 Merge "Nagios: Add session affinity to ingress" 2018-11-09 04:45:36 +00:00
Zuul
98c9b148f3 Merge "Nagios: Update ceph_health check" 2018-11-09 03:24:23 +00:00
Steve Wilkerson
2c6aa8ad1b Nagios: Add session affinity to ingress
This adds session affinity to the Nagios ingress, which allows
cookies to be used for Nagios session affinity.

Change-Id: I6054a92f644dc533dd06d35a2541fb44d46cba88
2018-11-09 02:07:39 +00:00
Zuul
a90ebb784c Merge "Prometheus: Update discovery configuration for ceph-mgr services" 2018-11-09 01:01:54 +00:00
Zuul
77772547e2 Merge "RGW: Fix multinode deploy for ceph rgw" 2018-11-08 22:54:01 +00:00
Zuul
d530635348 Merge "Do not use OSH_INFRA_PATH in osh-infra" 2018-11-08 22:54:00 +00:00
Meg Heisler
774e0cb654 RGW: Fix multinode deploy for ceph rgw
Change the deployment script for rgw to not use the docker
bridge for public and cluster network overrides. Instead,
calculate network values in the same way as the other Ceph
multinode deployment steps.

Change-Id: I2bacd1af1cc331d76a5d61f3b589ca6ef80b1b2e
2018-11-08 11:39:23 -06:00
Matthew Heler
55446e1f41 Move default CEPH journal size from 5GB to 10GB
Request from downstream to use 10GB journal sizes. Journals are
currently created manually, but there is upcoming work to have the
journals created by the Helm charts themselves. This value needs to be
put in as a default to ensure journals are sized appropriately.

Change-Id: Idaf46fac159ffc49063cee1628c63d5bd42b4bc6
2018-11-08 17:34:12 +00:00
Zuul
7274c5f95f Merge "Revert "Fix rally deployment config to rally 1.2.0"" 2018-11-07 22:26:22 +00:00
Zuul
47d49bcfd4 Merge "prometheus ceph.rules changes" 2018-11-07 20:51:42 +00:00
Pete Birley
b7e77dfea0 Revert "Fix rally deployment config to rally 1.2.0"
This reverts commit 5c2859c3e9026e464bf0c35b591aaae810ff2a1c.

This commit breaks the ability to declare users to use with rally/helm test - and needs to be refactored to match the commit message's intent.

Change-Id: I2bc66ef40694c277058b4324b8a3528f4f25d1d1
2018-11-07 19:31:49 +00:00
Zuul
b28aed8331 Merge "Fix rally deployment config to rally 1.2.0" 2018-11-07 14:12:32 +00:00
Matthew Heler
e1c82f3465 Fix the checkPGs cronjob
Currently the cronjob is broken due to syntax and
permission issues.

Additionally move the cronjob from once a month to
every 15 minutes, and automatically disable the job
unless explicitly enabled.

Change-Id: Id72bdb286c805ccb0ea4e9fcf65fabca94a180dd
2018-11-06 19:39:23 -06:00
Steve Wilkerson
ba22b0e726 Nagios: Update ceph_health check
The ceph_health check in Nagios incorrectly sets the warning and
error level to 0. The ceph_health_status metric's value of 0
indicates the cluster is healthy, while 1 indicates a warning and
2 indicates an error state. The Nagios check for ceph_health is
updated to reflect these values

Change-Id: Iffe80f1c34f6edee6370dd7e707e5f55f83f1ec1
2018-11-06 14:51:40 -06:00
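The value mapping described in this commit can be sketched as below. This is an illustration in Python, not the actual Nagios check definition from the chart: the function name `ceph_health_state` is hypothetical, and returning UNKNOWN for any other value is an assumption. The 0/1/2 meanings of `ceph_health_status` come from the commit message, and the numeric states are the standard Nagios plugin exit codes.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def ceph_health_state(metric_value: float) -> int:
    """Translate a ceph_health_status sample into a Nagios state."""
    if metric_value == 0:
        return OK        # cluster is healthy
    if metric_value == 1:
        return WARNING   # cluster reports a warning
    if metric_value == 2:
        return CRITICAL  # cluster reports an error state
    # Assumption: anything else is treated as UNKNOWN rather than healthy.
    return UNKNOWN


if __name__ == "__main__":
    for sample in (0, 1, 2):
        print(sample, ceph_health_state(sample))
```

With both thresholds set to 0, as before the fix, a healthy cluster could not be told apart from a degraded one.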
Steve Wilkerson
e0f2d66ee3 Prometheus: Update discovery configuration for ceph-mgr services
This updates the Prometheus scrape configuration to use the
service based discovery mechanism instead of endpoints. This
removes issues associated with multiple ceph-mgr replicas deployed

Change-Id: I2c557af0c7200d0c4aea646c5f9ecd1a070db33e
2018-11-06 13:56:37 -06:00
Jean-Philippe Evrard
ff1f75fc45 Do not use OSH_INFRA_PATH in osh-infra
OSH_INFRA_PATH is never used in the openstack-helm-infra repository,
as all the references use relative paths.

The keystone script is not using a relative path, and relies on
OSH_INFRA_PATH to be defined to work.

This is a problem, because when it is not defined, the expected path
for ldap chart is /ldap, which is an incorrect path.

This fixes the problem by ensuring the path is relative.

Change-Id: I04a8d5c074b7c1e6fa66617bbb907f2ad4dcb3af
2018-11-05 13:36:03 +00:00
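The failure mode described in this commit can be reproduced with a small sketch. It is written in Python for illustration only, since the actual deployment script is a shell script that is not shown here. The function names and the `./ldap` relative form are assumptions; the `/ldap` result for an undefined variable is what the commit message reports.

```python
import os


def ldap_chart_path(env: dict) -> str:
    """Mimic expanding ${OSH_INFRA_PATH}/ldap, where an unset variable is empty."""
    return f"{env.get('OSH_INFRA_PATH', '')}/ldap"


def relative_ldap_chart_path() -> str:
    """Relative form (assumed): does not depend on the environment at all."""
    return "./ldap"


if __name__ == "__main__":
    # With the variable unset, the chart path collapses to the filesystem root.
    print(ldap_chart_path({}))
    print(ldap_chart_path(dict(os.environ)))
    print(relative_ldap_chart_path())
```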
Zuul
fca344900f Merge "Enable the mgr balancer module by default." 2018-11-02 22:36:13 +00:00
Steve Wilkerson
69196031cd Nagios: Ensure processes are reaped
This moves Nagios to run its processes as children of either
the pause container or the host's init system (for k8s <1.10)
to prevent defunct process sprawl

Change-Id: I6a93d446577674b0b012f9567d5e6a5794ebc44b
2018-11-02 08:12:24 -05:00
Matthew Heler
a79562a28b Enable the mgr balancer module by default.
The balancer module will distribute PGs more evenly across OSDs.
While CRUSH does a good job at this, it is not perfect and hot spots
(where an OSD has more PGs than its peers) can occur.

Change-Id: Ic45a6bf745bdd09a3f5782e9e8bda89c3d3da2aa
2018-11-01 15:52:51 +00:00
inspurericzhang
f1c2bf976f [Trivial Fix] modify spelling error of "resource"
Although these are only spelling mistakes, they affect readability.

Change-Id: I75a1f66002ec46fe206f31fec02fbd47f9cee443
2018-11-01 09:52:04 +08:00
kranthi guttikonda
fac358a575 prometheus ceph.rules changes
With the new Ceph Luminous release, the existing ceph.rules are obsolete.

Added a new rule for ceph-mgr count

Changed ceph_monitor_quorum_count to ceph_mon_quorum_count

Updated ceph_cluster_usage_high as ceph_cluster_used_bytes,
ceph_cluster_capacity_bytes aren't valid

Updated ceph_placement_group_degrade_pct_high as
ceph_degraded_pgs, ceph_total_pgs aren't valid

Updated ceph_osd_down_pct_high as ceph_osds_down,
ceph_osds_up aren't available, ceph_osd_up is
available but ceph_osd_down isn't. Need to
calculate the down based on count(ceph_osd_up==0)
and total osd using count(ceph_osd_metadata)

Removed ceph_monitor_clock_skew_high as the metric
ceph_monitor_clock_skew_seconds isn't valid anymore

Added new alarms ceph_osd_down, ceph_osd_out

Implements: prometheus ceph.rules changes with new valid metrics
Closes-Bug: #1800548
Change-Id: Id68e64472af12e8dadffa61373c18bbb82df96a3
Signed-off-by: Kranthi Guttikonda <kranthi.guttikonda@b-yond.com>
2018-10-31 10:23:11 -04:00
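The OSD down calculation described in this commit, count(ceph_osd_up==0) over count(ceph_osd_metadata), can be sketched as below. This is a Python illustration of the arithmetic only, not the Prometheus rule itself. The function name, the dict-based sample format and the OSD labels are assumptions made for the example.

```python
def osd_down_percent(osd_up: dict, osd_metadata: dict) -> float:
    """Percentage of OSDs that are down.

    osd_up maps an OSD label to its ceph_osd_up sample (1 = up, 0 = down).
    osd_metadata has one entry per known OSD, like ceph_osd_metadata.
    """
    total = len(osd_metadata)  # count(ceph_osd_metadata)
    if total == 0:
        return 0.0             # no OSDs known, nothing to report
    down = sum(1 for value in osd_up.values() if value == 0)  # count(ceph_osd_up == 0)
    return 100.0 * down / total


if __name__ == "__main__":
    up = {"osd.0": 1, "osd.1": 0, "osd.2": 1, "osd.3": 0}
    meta = {name: {} for name in up}
    print(osd_down_percent(up, meta))
```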
Matthew Heler
3e7ba37290 Ensure latest Ceph packages during deployment
Change-Id: Ia5bc0802577e2b72a1de078085f5fe7e60f63604
2018-10-31 02:16:50 -05:00
Tin Lam
5730631ba6 Clean-up script
This patch set cleans up the script to be consistent with other OSH
installation scripts.

Change-Id: I212cd0cf0e818f1fc924b9b690d18f5d107b850b
Signed-off-by: Tin Lam <tin@irrational.io>
2018-10-30 16:22:45 +00:00
Zuul
31a9bb6ad4 Merge "[gate] Use Kubernetes 1.10.9" 2018-10-30 08:05:08 +00:00
Steve Wilkerson
45da8c2b69 Ceph: Update log directory host mount path
This updates the ceph-mon and ceph-osd charts to use the release
name for the hostpath defined for mounting the /var/log/ceph
directories to. This gives us a mechanism for creating unique log
directories for multiple releases of the same chart without the
need for specifying an override for each deployment of that chart

Change-Id: Ie6e05b99c32f24440fbade02d59c7bb14d8aa4c8
2018-10-29 13:05:46 -05:00
Chris Wedgwood
b10ebbb63a [gate] Use Kubernetes 1.10.9
Change-Id: I5bb951f455fa6d7d344a264336a2a9b985fd85f4
2018-10-29 15:10:35 +00:00
Matthew Heler
6ef48d3706 Further performance tuning changes for Ceph
- Throttle down snap trimming so as to lessen its performance impact
(Setting just osd_snap_trim_priority isn't effective enough to throttle
down the impact)
osd_snap_trim_sleep: 0.1 (default 0)
osd_pg_max_concurrent_snap_trims: 1 (default 2)

- Align filestore_merge_threshold with upstream Ceph values
(A negative number disables this function, no change in behavior)
filestore_merge_threshold: -10 (formerly -50, default 10)

- Increase RGW pool thread size for more concurrent connections
rgw_thread_pool_size: 512 (default 100)

- Disable in-memory logs for the ms subsystem.
debug_ms: 0/0 (default 0/5)

- Formatting cleanups

Change-Id: I4aefcb6e774cb3e1252e52ca6003cec495556467
2018-10-26 15:10:50 +00:00
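The tuning values listed in this commit could be collected and rendered as an INI-style configuration, sketched below in Python. This is not how the chart itself renders its configuration. The grouping into `[osd]` and `[global]` sections and the names `TUNING` and `render` are assumptions for illustration; only the option names and values are taken from the commit message.

```python
import configparser
import io

# Option names and values from the commit message; section placement is assumed.
TUNING = {
    "osd": {
        "osd_snap_trim_sleep": "0.1",             # default 0
        "osd_pg_max_concurrent_snap_trims": "1",  # default 2
        "filestore_merge_threshold": "-10",       # formerly -50, default 10
    },
    "global": {
        "rgw_thread_pool_size": "512",            # default 100
        "debug_ms": "0/0",                        # default 0/5
    },
}


def render(tuning: dict) -> str:
    """Render the nested dict as INI text."""
    parser = configparser.ConfigParser()
    parser.read_dict(tuning)
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    print(render(TUNING))
```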
Zuul
62f49e7c74 Merge "Define OSH_PATH by default" 2018-10-26 11:35:11 +00:00