Update libvirt cgroup controllers initialization

File: openstack-armada-app/openstack-helm-infra/debian/deb_folder/patches/0023-Update-libvirt-cgroup-controllers-initiation.patch
Author: Daniel Caires (3c33e176d0)
A bug in STX-O prevented the libvirt container from
being killed when a VM was running on a system where
only one worker is available. It was found that the VM
processes run by QEMU were placed in sub-cgroups under
the libvirt pod's cgroup, creating a parent/child
relationship in the cgroup hierarchy.
Some of the cgroup resource controllers (defined in [1])
used by the VM could not be moved back to the host
root cgroup when the libvirt pod was deleted, which
prevented the kernel from fully terminating the container.
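The parent/child layout can be observed directly through /proc. A
minimal sketch of that inspection (the helper name show_cgroup is
made up for this example; nothing here is part of the patch):

```shell
#!/bin/sh
# Print the cgroup membership of a given process, the same way one
# can confirm that QEMU VM processes sit in sub-cgroups under the
# libvirt pod's cgroup. On cgroup v2 each line is "0::<path>"; on
# v1 there is one "<id>:<controllers>:<path>" line per hierarchy.
show_cgroup() {
    cat "/proc/$1/cgroup"
}

# Example: inspect the current shell itself.
show_cgroup "$$"
```

Running the same helper against a QEMU PID on an affected host would
show a path nested below the libvirt pod's cgroup directory.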

When the pod was deleted, its preStop hook attempted to
kill all libvirt processes, but because some of the VM's
cgroup resource controllers remained attached to the
pod's own cgroup hierarchy, the kernel could not release
those resources back to the host. As a result, the
libvirt container could not transition to a clean
terminated state and stayed stuck terminating.

The libvirt cgroup initialization in the caracal version
uses a small hard-coded list of controllers set directly
in the libvirt bash script.

In addition, a cgroup_controllers list was added to the
libvirt static Helm overrides so that the set of controllers
used can be configured explicitly from the chart values.
This ensures that any future changes in the available
cgroup controllers can be handled through the override
file without requiring further changes to the libvirt
initialization script.
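With the new key in place, the active set can be narrowed or extended
from a Helm overrides file. A minimal sketch, following the structure
of the chart's values.yaml (the particular controllers chosen here are
illustrative, not a recommendation):

```yaml
conf:
  kubernetes:
    cgroup: "kubepods.slice"
    cgroup_controllers:
      - cpu
      - memory
      - hugetlb
```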

This review creates a patch that updates the .sh to its
latest version, where it compares a list of controllers
set in the values file with the controllers available on
the host [2], and uses that list to initialize the
controllers for the libvirt process. The patch also
removes a hugepage validation that existed in the bash
file, since the validation is no longer necessary given
that libvirt does not run in the pod cgroup anymore [3].
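The comparison the updated script performs can be sketched as a plain
shell function (filter_controllers is a name invented for this sketch;
the real script inlines the loop and reads /sys/fs/cgroup directly):

```shell
#!/bin/bash
# Keep only the desired controllers that the host actually offers,
# joined with commas as cgcreate/cgexec expect (e.g. "cpu,memory").
# In the real script the available set comes from the v1 directories
# under /sys/fs/cgroup and from /sys/fs/cgroup/cgroup.controllers (v2).
filter_controllers() {
    local desired="$1" available="$2" out="" c
    for c in ${desired}; do
        if echo " ${available} " | grep -qw "${c}"; then
            out="${out}${c},"
        fi
    done
    echo "${out%,}"
}

# Example: rdma and misc are requested but not offered by this host.
filter_controllers "cpu memory rdma misc" "cpu memory pids"
# -> cpu,memory
```

The resulting comma-joined string is what gets passed to
`cgcreate -g <controllers>:/osh-libvirt` in the patched script.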

Reference [1] is the kernel cgroup documentation; [2] and
[3] are the upstream OSH commits that added these changes:

[1] - https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html
[2] - https://opendev.org/openstack/openstack-helm/commit/3903f54d0c1701f86f92da9023b67b7b453c4760
[3] - https://opendev.org/openstack/openstack-helm/commit/ea3c04a7d9e39d63402751353e00d21762d988e5

This patch can be dropped after the epoxy upversion.

Test-Plan:
PASS - Build all STX-O packages and tarball
PASS - Patch is present on OSH-I package
PASS - Upload and apply STX-Openstack
PASS - Launch 3 VMs and delete libvirt pod
PASS - Upgrade k8s version from 1.29 to 1.31
PASS - Libvirt pod always comes back running
PASS - All VMs are accessible after pod restart

Closes-Bug: #2125753

Change-Id: Ie7dcac64a55834d670a3a2e0b689b22f25e01ce0
Signed-off-by: Daniel Caires <DanielMarques.Caires@windriver.com>
2025-09-29 16:51:23 -03:00

From 809afdbc5bada6acbe0e16fcd650b0fed8d4824e Mon Sep 17 00:00:00 2001
From: Daniel Caires <DanielMarques.Caires@windriver.com>
Date: Fri, 26 Sep 2025 07:07:05 -0300
Subject: [PATCH] Update libvirt cgroup controllers initialization

The libvirt cgroup initialization in the caracal version
uses a hard-coded list of controllers that are set
in the libvirt bash file. This patch updates the .sh
to its latest version [1], where it compares a list of
controllers set in the values file with the controllers
available on the host, and uses that list to initialize
the controllers for the libvirt process. This patch also
removes a hugepage validation that existed in the bash
file, as it was removed from the upstream repo as well [2].

Commit SHAs that added the changes in this patch, on the
upstream repository:

[1] - https://opendev.org/openstack/openstack-helm/commit/3903f54d0c1701f86f92da9023b67b7b453c4760
[2] - https://opendev.org/openstack/openstack-helm/commit/ea3c04a7d9e39d63402751353e00d21762d988e5
Signed-off-by: Daniel Caires <DanielMarques.Caires@windriver.com>
---
libvirt/templates/bin/_libvirt.sh.tpl | 76 +++++----------------------
libvirt/values.yaml | 14 +++++
2 files changed, 26 insertions(+), 64 deletions(-)
diff --git a/libvirt/templates/bin/_libvirt.sh.tpl b/libvirt/templates/bin/_libvirt.sh.tpl
index d16cdca3..af1b4f5e 100644
--- a/libvirt/templates/bin/_libvirt.sh.tpl
+++ b/libvirt/templates/bin/_libvirt.sh.tpl
@@ -24,13 +24,6 @@ if [ -f /tmp/vnc.crt ]; then
mv /tmp/vnc-ca.crt /etc/pki/libvirt-vnc/ca-cert.pem
fi
-# TODO: We disable cgroup functionality for cgroup v2, we should fix this in the future
-if $(stat -fc %T /sys/fs/cgroup/ | grep -q cgroup2fs); then
- CGROUP_VERSION=v2
-else
- CGROUP_VERSION=v1
-fi
-
if [ -n "$(cat /proc/*/comm 2>/dev/null | grep -w libvirtd)" ]; then
set +x
for proc in $(ls /proc/*/comm 2>/dev/null); do
@@ -55,16 +48,14 @@ if [ "$(cat /etc/os-release | grep -w NAME= | grep -w CentOS)" ]; then
fi
fi
-if [ $CGROUP_VERSION != "v2" ]; then
- #Setup Cgroups to use when breaking out of Kubernetes defined groups
- CGROUPS=""
- for CGROUP in cpu rdma hugetlb; do
- if [ -d /sys/fs/cgroup/${CGROUP} ]; then
- CGROUPS+="${CGROUP},"
- fi
- done
- cgcreate -g ${CGROUPS%,}:/osh-libvirt
-fi
+#Setup Cgroups to use when breaking out of Kubernetes defined groups
+CGROUPS=""
+for CGROUP in {{ .Values.conf.kubernetes.cgroup_controllers | include "helm-toolkit.utils.joinListWithSpace" }}; do
+ if [ -d /sys/fs/cgroup/${CGROUP} ] || grep -w $CGROUP /sys/fs/cgroup/cgroup.controllers; then
+ CGROUPS+="${CGROUP},"
+ fi
+done
+cgcreate -g ${CGROUPS%,}:/osh-libvirt
# We assume that if hugepage count > 0, then hugepages should be exposed to libvirt/qemu
hp_count="$(cat /proc/meminfo | grep HugePages_Total | tr -cd '[:digit:]')"
@@ -86,50 +77,11 @@ if [ 0"$hp_count" -gt 0 ]; then
echo "ERROR: Hugepages configured in kernel, but libvirtd container cannot access /dev/hugepages"
exit 1
fi
-
- if [ $CGROUP_VERSION != "v2" ]; then
- # Kubernetes 1.10.x introduced cgroup changes that caused the container's
- # hugepage byte limit quota to zero out. This workaround sets that pod limit
- # back to the total number of hugepage bytes available to the baremetal host.
- if [ -d /sys/fs/cgroup/hugetlb ]; then
- limits="$(ls /sys/fs/cgroup/hugetlb/{{ .Values.conf.kubernetes.cgroup }}/hugetlb.*.limit_in_bytes)" || \
- (echo "ERROR: Failed to locate any hugetable limits. Did you set the correct cgroup in your values used for this chart?"
- exit 1)
- for limit in $limits; do
- target="/sys/fs/cgroup/hugetlb/$(dirname $(awk -F: '($2~/hugetlb/){print $3}' /proc/self/cgroup))/$(basename $limit)"
- # Ensure the write target for the hugepage limit for the pod exists
- if [ ! -f "$target" ]; then
- echo "ERROR: Could not find write target for hugepage limit: $target"
- fi
-
- # Write hugetable limit for pod
- echo "$(cat $limit)" > "$target"
- done
- fi
-
- # Determine OS default hugepage size to use for the hugepage write test
- default_hp_kb="$(cat /proc/meminfo | grep Hugepagesize | tr -cd '[:digit:]')"
-
- # Attempt to write to the hugepage mount to ensure it is operational, but only
- # if we have at least 1 free page.
- num_free_pages="$(cat /sys/kernel/mm/hugepages/hugepages-${default_hp_kb}kB/free_hugepages | tr -cd '[:digit:]')"
- echo "INFO: '$num_free_pages' free hugepages of size ${default_hp_kb}kB"
- if [ 0"$num_free_pages" -gt 0 ]; then
- (fallocate -o0 -l "$default_hp_kb" /dev/hugepages/foo && rm /dev/hugepages/foo) || \
- (echo "ERROR: fallocate failed test at /dev/hugepages with size ${default_hp_kb}kB"
- rm /dev/hugepages/foo
- exit 1)
- fi
- fi
fi
if [ -n "${LIBVIRT_CEPH_CINDER_SECRET_UUID}" ] || [ -n "${LIBVIRT_EXTERNAL_CEPH_CINDER_SECRET_UUID}" ] ; then
- if [ $CGROUP_VERSION != "v2" ]; then
- #NOTE(portdirect): run libvirtd as a transient unit on the host with the osh-libvirt cgroups applied.
- cgexec -g ${CGROUPS%,}:/osh-libvirt systemd-run --scope --slice=system libvirtd --listen &
- else
- systemd-run --scope --slice=system libvirtd --listen &
- fi
+
+ cgexec -g ${CGROUPS%,}:/osh-libvirt systemd-run --scope --slice=system libvirtd --listen &
tmpsecret=$(mktemp --suffix .xml)
if [ -n "${LIBVIRT_EXTERNAL_CEPH_CINDER_SECRET_UUID}" ] ; then
@@ -205,9 +157,5 @@ EOF
fi
-if [ $CGROUP_VERSION != "v2" ]; then
- #NOTE(portdirect): run libvirtd as a transient unit on the host with the osh-libvirt cgroups applied.
- cgexec -g ${CGROUPS%,}:/osh-libvirt systemd-run --scope --slice=system libvirtd --listen
-else
- systemd-run --scope --slice=system libvirtd --listen
-fi
+# NOTE(vsaienko): changing CGROUP is required as restart of the pod will cause domains restarts
+cgexec -g ${CGROUPS%,}:/osh-libvirt systemd-run --scope --slice=system libvirtd --listen
diff --git a/libvirt/values.yaml b/libvirt/values.yaml
index b3a4373b..7f41ae60 100644
--- a/libvirt/values.yaml
+++ b/libvirt/values.yaml
@@ -125,6 +125,20 @@ conf:
group: "kvm"
kubernetes:
cgroup: "kubepods.slice"
+ # List of cgroup controllers we want to use when breaking out of
+ # Kubernetes defined groups
+ cgroup_controllers:
+ - blkio
+ - cpu
+ - devices
+ - freezer
+ - hugetlb
+ - memory
+ - net_cls
+ - perf_event
+ - rdma
+ - misc
+ - pids
vencrypt:
# Issuer to use for the vencrypt certs.
issuer:
--
2.34.1