This upgrade is needed to support the A100 GPU, the kernel
upgrade, and bug 1948050. It eliminates the requirement to
create an nvidia-specific RuntimeClass prior to installing
the charts, by pre-installing the toolkit through the
toolkit-installer subchart.
This commit has been tested with the following:
driver: 470.57.02
toolkit: 1.7.1-ubi8
defaultRuntime: containerd
Test Plan:
PASS: Verify gpu-operator starts and adds nvidia.com/gpu
to the node.
PASS: Verify nvidia-toolkit is removed with helm override
of global.toolkit_force_clean=true.
PASS: Verify pods can access the GPU device and the nvidia
tools to monitor the GPU.
PASS: Verify a pod can build and execute CUDA sample code.
PASS: Verify the driver pod prints a warning when building
on a Low Latency kernel with a helm override of:
--set driver.env[0].name=IGNORE_PREEMPT_RT_PRESENCE
(see the override sketch after the test plan).
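
For reference, a sketch of how the overrides above could be
passed on the helm command line; the release name and chart
path are illustrative placeholders, not part of this change:

# Illustrative: force removal of the pre-installed nvidia-toolkit
helm upgrade gpu-operator <chart-path> --reuse-values \
--set global.toolkit_force_clean=true

# Illustrative: allow the driver build on a Low Latency (PREEMPT_RT) kernel
helm upgrade gpu-operator <chart-path> --reuse-values \
--set driver.env[0].name=IGNORE_PREEMPT_RT_PRESENCE
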
Closes-Bug: 1948050
Signed-off-by: Babak Sarashki <babak.sarashki@windriver.com>
Change-Id: I18dd2a0ab1adc6f9364314a22373aadc93cad27f

This commit adds the nvidia gpu-operator helm charts as a use
case for the custom container runtime feature. To load
nvidia-gpu-operator on StarlingX, register the runtime with:

system service-parameter-add platform container_runtime \
custom_container_runtime=\
nvidia:/usr/local/nvidia/toolkit/nvidia-container-runtime
And define a RuntimeClass for nvidia GPU pods:
kind: RuntimeClass
apiVersion: node.k8s.io/v1beta1
metadata:
  name: nvidia
handler: nvidia
The above directs containerd to use nvidia-container-runtime
when creating pods with the nvidia runtimeClass; the
nvidia-container-runtime itself is installed by the operator
onto a host mount.
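
As an illustration only (the pod name, image and resource
request below are assumptions, not part of this change), a pod
opts into this runtime as follows:

apiVersion: v1
kind: Pod
metadata:
  name: cuda-test                # illustrative name
spec:
  runtimeClassName: nvidia       # selects the nvidia handler in containerd
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:11.4.2-base-ubi8   # illustrative image
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1        # resource advertised by the operator
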
Story: 2008434
Task: 41978
Signed-off-by: Babak Sarashki <babak.sarashki@windriver.com>
Change-Id: Ifea8cdf6eb89a159f446c53566279e72fcf0e45e

This reverts commit 41bdf53f65684b54abaa3098a5fe3acf568cdf2a.
Reason for revert: the gpu-operator patch breaks the stx-master
build, e.g.:
08:06:44 Failed to build packages: gpu-operator-1.6.0-0.tis.1.src.rpm; problem with:
Patch #2 (enablement-support-on-starlingx-cloud-platform.patch):
. .
Skipping patch.
1 out of 1 hunk ignored -- saving rejects to file deployments/gpu-operator/templates/operator.yaml.rej
patching file deployments/gpu-operator/values.yaml
error: Bad exit status from /var/tmp/rpm-tmp.VQuqLh (%prep)
Change-Id: Id7a05987586582c940d605874d1e0f813333f2c3

This commit adds the nvidia gpu-operator helm charts as a use
case for the custom container runtime feature. To load
nvidia-gpu-operator on StarlingX, register the runtime with:

system service-parameter-add platform container_runtime \
custom_container_runtime=\
nvidia:/usr/local/nvidia/toolkit/nvidia-container-runtime
And define a RuntimeClass for nvidia GPU pods:
kind: RuntimeClass
apiVersion: node.k8s.io/v1beta1
metadata:
  name: nvidia
handler: nvidia
The above directs containerd to use nvidia-container-runtime
when creating pods with the nvidia runtimeClass; the
nvidia-container-runtime itself is installed by the operator
onto a host mount.
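
As a quick, illustrative check that the handler and the GPU
resource are in place (the node name is a placeholder):

kubectl get runtimeclass nvidia
kubectl describe node <node-name> | grep nvidia.com/gpu
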
Story: 2008434
Task: 41978
Signed-off-by: Babak Sarashki <babak.sarashki@windriver.com>
Change-Id: I999804d4697349bc0966d0a6e653d7bce15e18fc