This section describes Nvidia GPU support on Konvoy and assumes familiarity with Kubernetes GPU support. AMD GPUs are currently not supported.
Konvoy GPU Overview
GPU support on Konvoy is achieved by using the nvidia-container-runtime and the Nvidia Driver Container. With Nvidia GPU support enabled, Konvoy configures the container runtime for running GPU containers and installs everything needed to power the Nvidia GPU devices. Konvoy runs every Nvidia GPU-dependent component as a DaemonSet, making them easier to manage and upgrade.
The following components provide Nvidia GPU support on Konvoy:
- libnvidia-container and nvidia-container-runtime: Konvoy uses containerd as the Kubernetes container runtime by default. libnvidia-container and nvidia-container-runtime act as a shim between containerd and runc, which simplifies container runtime integration with GPU support and avoids the need for nvidia-docker2.
- Nvidia Driver Container: Runs the Nvidia driver inside a container, making it easier to deploy, faster to install, and reproducible.
- [Nvidia Device Plugin][nvidia_device_plugin]: Exposes the number of GPUs on each node, tracks the health of the GPUs, and enables running GPU-enabled containers on Kubernetes (see the sample pod spec after this list).
- Nvidia GPU Metrics Exporter and NVIDIA Data Center GPU Manager (DCGM): A Prometheus exporter that exposes Nvidia GPU metrics.
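Once the device plugin is running, workloads request GPUs through the `nvidia.com/gpu` extended resource. Here is a minimal sketch of such a pod; the pod name and CUDA image are placeholders, not part of the Konvoy addon:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test              # placeholder name, for illustration only
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:10.2-base    # any CUDA-enabled image works here
    command: ["nvidia-smi"]         # prints the visible GPU if scheduling succeeded
    resources:
      limits:
        nvidia.com/gpu: 1           # GPUs are requested via this extended resource
```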
Requirements
- Ubuntu 16.04 or CentOS 7, with the IPMI driver enabled and the Nouveau driver disabled.
- NVIDIA GPU with Fermi architecture version 2.1 or greater.
- On CentOS 7, we recommend using the latest kernel version in production. The default AMI's kernel version `3.10.0-1062` should work as well. If you want to use a specific kernel version on CentOS, refer to the following sections for detailed configuration. Note that Ubuntu does not have this kernel version restriction.
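Before enabling the addon, you can verify the driver-related requirements on each GPU node. A quick check (module names can vary slightly by distribution):

```bash
# The Nouveau driver must be disabled: this should print nothing.
lsmod | grep -i nouveau

# The IPMI driver should be loaded: expect modules such as ipmi_msghandler.
lsmod | grep -i ipmi

# The running kernel version, relevant for the CentOS 7 guidance above.
uname -r
```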
Configuration
Enable Nvidia GPU on Konvoy
To enable Nvidia GPU support on Konvoy, add the Nvidia GPU `nodePools` in the `ClusterProvisioner` and `ClusterConfiguration` definitions, then enable the `nvidia` addon. Here is an example `cluster.yaml`:
```yaml
kind: ClusterProvisioner
apiVersion: konvoy.mesosphere.io/v1beta2
spec:
  provider: aws
  nodePools:
  - name: gpu-worker
    count: 4
    machine:
      rootVolumeSize: 80
      rootVolumeType: gp2
      imagefsVolumeEnabled: true
      imagefsVolumeSize: 160
      imagefsVolumeType: gp2
      imagefsVolumeDevice: xvdb
      type: p2.xlarge
---
kind: ClusterConfiguration
apiVersion: konvoy.mesosphere.io/v1beta2
spec:
  nodePools:
  - name: gpu-worker
    gpu:
      nvidia: {}
  addons:
  - configRepository: https://github.com/mesosphere/kubernetes-base-addons
    configVersion: stable-1.18-3.0.1
    addonsList:
    - name: nvidia
      enabled: true
```
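After the cluster (or the addon) is deployed, you can confirm that the GPU nodes advertise the `nvidia.com/gpu` resource and that the addon pods are running; the output will differ per cluster:

```bash
# Allocatable GPU count per node; non-GPU nodes show "<none>".
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"

# The Nvidia addon pods should all reach the Running state.
kubectl get pod -A | grep nvidia
```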
Nvidia Driver Configuration Based on Kernel Version
Additional configuration may be needed if the kernel version on your GPU nodes is not identical to the latest one from the yum repository. In that case, you must pick the matching image tag for your Nvidia Driver Container. For example, the following kernel versions are supported by the upstream `nvidia/driver` image:
```text
440.64.00-1.0.0-3.10.0-1127.8.2.el7.x86_64-centos7
440.64.00-1.0.0-3.10.0-1127.el7.x86_64-centos7
440.64.00-1.0.0-3.10.0-1062.18.1.el7.x86_64-centos7
440.33.01-3.10.0-1062.12.1.el7.x86_64-centos7
440.33.01-3.10.0-1062.9.1.el7.x86_64-centos7
440.33.01-3.10.0-1062.7.1.el7.x86_64-centos7
440.33.01-3.10.0-1062.4.3.el7.x86_64-centos7
418.87.01-3.10.0-1062.4.3.el7.x86_64-centos7
418.87.01-3.10.0-1062.4.1.el7.x86_64-centos7
418.87.01-3.10.0-1062.1.2.el7.x86_64-centos7
418.40.04-3.10.0-957.21.3.el7.x86_64-centos7
418.40.04-3.10.0-957.21.2.el7.x86_64-centos7
418.40.04-3.10.0-957.12.2.el7.x86_64-centos7
418.40.04-3.10.0-957.12.1.el7.x86_64-centos7
418.40.04-3.10.0-957.10.1.el7.x86_64-centos7
396.37-3.10.0-957.5.1.el7.x86_64-centos7
396.37-3.10.0-957.1.3.el7.x86_64-centos7
396.37-3.10.0-862.14.4.el7.x86_64-centos7
396.37-3.10.0-862.14.4-centos7
396.37-3.10.0-862.11.6-centos7
396.37-3.10.0-862.9.1-centos7
```
This list is from the Nvidia Public Hub Repository. You need to identify your kernel version first. For example:

```bash
[centos@ip-10-0-128-77 ~]$ uname -r
3.10.0-1062.12.1.el7.x86_64
```
Then, find the corresponding pre-built `nvidia/driver` image in the list above. In this case, `440.33.01-3.10.0-1062.12.1.el7.x86_64-centos7` is the right image tag to pick. Configure `cluster.yaml` to use this image tag for your Nvidia driver container:
```yaml
- name: nvidia
  enabled: true
  values: |
    nvidia-driver:
      image:
        tag: "440.33.01-3.10.0-1062.12.1.el7.x86_64-centos7"
```
GPU on Air-gapped On-prem Cluster
Follow the Konvoy Air-gapped Installations documentation. Re-tag the `nvidia/driver` image with the corresponding tag identified in the section above, and push it to your local registry. For example:
```bash
REGISTRY=yourlocalregistry.com:6443

docker tag nvidia/driver:440.33.01-3.10.0-1062.12.1.el7.x86_64-centos7 ${REGISTRY}/nvidia/driver:440.33.01-3.10.0-1062.12.1.el7.x86_64-centos7
docker push ${REGISTRY}/nvidia/driver:440.33.01-3.10.0-1062.12.1.el7.x86_64-centos7
```
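If the upstream image is not already present locally, pull it first on a host with Internet access; if that host cannot reach the air-gapped registry, move the image as a tarball. A sketch, using the same placeholder registry as above:

```bash
# On a host with Internet access: fetch the upstream driver image.
docker pull nvidia/driver:440.33.01-3.10.0-1062.12.1.el7.x86_64-centos7

# Optionally move it into the air-gapped environment as a tarball.
docker save nvidia/driver:440.33.01-3.10.0-1062.12.1.el7.x86_64-centos7 -o nvidia-driver.tar
docker load -i nvidia-driver.tar    # run this on a host that can reach ${REGISTRY}
```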
Konvoy GPU Support on Ubuntu
By default, Konvoy assumes the cluster OS is CentOS. If you want to run GPU workloads on Ubuntu, you must update the Nvidia Driver Container image in `cluster.yaml`:
```yaml
......
---
kind: ClusterConfiguration
apiVersion: konvoy.mesosphere.io/v1beta2
spec:
  addons:
  - configRepository: https://github.com/mesosphere/kubernetes-base-addons
    configVersion: stable-1.18-3.0.1
    addonsList:
    - name: nvidia
      enabled: true
      values: |
        nvidia-driver:
          enabled: true
          image:
            tag: "418.87.01-ubuntu16.04"
```
See Nvidia Public Hub Repository for available driver container images.
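Before picking a tag, confirm which OS image your GPU nodes actually run. For example:

```bash
# The OS image reported by each node.
kubectl get nodes -o wide

# Or, directly on a GPU node:
. /etc/os-release && echo "${NAME} ${VERSION_ID}"
```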
How to prevent other workloads from running on GPU nodes
Use Kubernetes taints to ensure that only dedicated workloads are deployed on GPU machines. You must add tolerations to your GPU workloads for them to be deployed on the dedicated GPU nodes. See Kubernetes Taints and Tolerations for details; a sample GPU workload with the matching toleration is shown after the `cluster.yaml` example below.
Setting custom tolerations for an addon replaces the tolerations set in the config repository for that addon. To add tolerations to an addon while keeping those set in the config repository, copy the config repository's tolerations into your `cluster.yaml` as well.
Here is an example `cluster.yaml` using custom taints and tolerations:
```yaml
kind: ClusterProvisioner
apiVersion: konvoy.mesosphere.io/v1beta2
......
spec:
  ......
  nodePools:
  - name: gpu-worker
    count: 4
    machine:
      rootVolumeSize: 80
      rootVolumeType: gp2
      imagefsVolumeEnabled: true
      imagefsVolumeSize: 160
      imagefsVolumeType: gp2
      imagefsVolumeDevice: xvdb
      type: p2.xlarge
  ......
---
kind: ClusterConfiguration
......
spec:
  ......
  nodePools:
  - name: gpu-worker
    gpu:
      nvidia: {}
    labels:
    - key: dedicated
      value: gpu-worker
    taints:
    - key: dedicated
      value: gpu-worker
      effect: NoExecute
  addons:
  - configRepository: https://github.com/mesosphere/kubernetes-base-addons
    configVersion: stable-1.18-3.0.1
    addonsList:
    ......
    - name: nvidia
      enabled: true
      values: |
        nvidia-driver:
          tolerations:
          - key: "dedicated"
            operator: "Equal"
            value: "gpu-worker"
            effect: "NoExecute"
        nvidia-device-plugin:
          tolerations:
          - key: "dedicated"
            operator: "Equal"
            value: "gpu-worker"
            effect: "NoExecute"
        nvidia-dcgm-exporter:
          tolerations:
          - key: "dedicated"
            operator: "Equal"
            value: "gpu-worker"
            effect: "NoExecute"
    ......
```
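With this configuration, a GPU workload must tolerate the `dedicated=gpu-worker:NoExecute` taint and can additionally use a node selector on the matching label. A sketch of such a workload, with a placeholder name and image:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload                 # placeholder name
spec:
  restartPolicy: Never
  nodeSelector:
    dedicated: gpu-worker            # matches the label set on the gpu-worker node pool
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "gpu-worker"
    effect: "NoExecute"              # required to schedule onto the tainted GPU nodes
  containers:
  - name: cuda
    image: nvidia/cuda:10.2-base     # any CUDA-enabled image works here
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
```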
Nvidia GPU Monitoring
Konvoy uses the Nvidia GPU Metrics Exporter and the NVIDIA Data Center GPU Manager (DCGM) to display Nvidia GPU metrics. By default, Konvoy includes a Grafana dashboard called `GPUs/Nvidia` for monitoring GPU metrics. This GPU dashboard is shown in Konvoy's Grafana UI.
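You can also inspect the raw metrics from one of the exporter pods directly. The listen port and metric names below are assumptions that vary between dcgm-exporter versions (recent releases listen on 9400), so adjust them to your deployment:

```bash
# Pick one exporter pod (the namespace follows the debugging examples below).
kubectl get pod -n kubeaddons | grep nvidia-dcgm-exporter

# Forward its metrics port locally; 9400 is an assumption based on recent
# dcgm-exporter releases and may differ in your deployment.
kubectl port-forward -n kubeaddons <nvidia-dcgm-exporter-pod> 9400:9400

# In another terminal, list the exported GPU metrics.
curl -s localhost:9400/metrics | grep -i dcgm | head
```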
Upgrade
Konvoy can upgrade the Nvidia GPU addon automatically. However, due to a limitation in Helm, which Konvoy uses internally, the GPU addon pods repeatedly restart with the `CrashLoopBackOff` status for 5-10 minutes during the upgrade. The Nvidia driver requires that at most one driver container runs on any single node, so that the driver can successfully load the necessary kernel modules. This conflicts with the current `helm upgrade` strategy: when Helm upgrades a chart, it deploys pods for the newer chart version while the old pods are still in the `Terminating` state, which briefly violates Nvidia's singleton requirement.
To overcome this limitation and upgrade the Nvidia GPU addon manually (a consolidated sketch of these steps follows the list):

- Delete all GPU workloads on the GPU nodes where the Nvidia addon needs to be upgraded.

- Delete the existing Nvidia addon:

  ```bash
  kubectl delete clusteraddon nvidia
  ```

- Wait for all Nvidia-related resources in the `Terminating` state to be cleaned up. You can check the pod status with:

  ```bash
  kubectl get pod -A | grep nvidia
  ```

- Specify the desired `configVersion` in your `cluster.yaml`. Then, deploy the addons to upgrade the Nvidia GPU addon:

  ```bash
  konvoy deploy addons
  ```
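A consolidated sketch of these manual steps (the resource names follow the examples on this page):

```bash
# 1. Make sure no GPU workloads remain on the affected nodes, then delete the addon.
kubectl delete clusteraddon nvidia

# 2. Wait until no Nvidia pods remain, including those stuck in Terminating.
while kubectl get pod -A | grep -q nvidia; do
  echo "waiting for nvidia pods to terminate..."
  sleep 10
done

# 3. With the desired configVersion set in cluster.yaml, redeploy the addons.
konvoy deploy addons
```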
Debugging
- Determine whether all Nvidia pods are in the `Running` state, as expected:

  ```bash
  kubectl get pod -A | grep nvidia
  ```

- If any Nvidia pods are crashing, returning errors, or flapping, collect the logs for the problematic pod. For example:

  ```bash
  kubectl logs -n kube-system nvidia-kubeaddons-nvidia-driver-bbkwg
  ```

- To recover from this problem, you must restart all Nvidia-addon pods that are running on the SAME host, because both `nvidia-dcgm-exporter` and `nvidia-device-plugin` depend on `nvidia-driver`. In the example below, all Nvidia resources are restarted on the node `ip-10-0-129-201.us-west-2.compute.internal` (a loop that automates this for one node is shown after this list):

  ```bash
  $ kubectl get pod -A -o wide | grep nvidia
  kube-system   nvidia-kubeaddons-nvidia-device-plugin-dxtch   1/1   Running   0   4m20s   192.168.57.153    ip-10-0-129-191.us-west-2.compute.internal   <none>   <none>
  kube-system   nvidia-kubeaddons-nvidia-device-plugin-j4dm2   1/1   Running   0   4m20s   192.168.39.88     ip-10-0-128-134.us-west-2.compute.internal   <none>   <none>
  kube-system   nvidia-kubeaddons-nvidia-device-plugin-qb29b   1/1   Running   0   4m20s   192.168.119.35    ip-10-0-128-208.us-west-2.compute.internal   <none>   <none>
  kube-system   nvidia-kubeaddons-nvidia-device-plugin-tsbk2   1/1   Running   0   4m20s   192.168.243.99    ip-10-0-129-201.us-west-2.compute.internal   <none>   <none>
  kube-system   nvidia-kubeaddons-nvidia-driver-6m59m          1/1   Running   3   4m20s   192.168.119.34    ip-10-0-128-208.us-west-2.compute.internal   <none>   <none>
  kube-system   nvidia-kubeaddons-nvidia-driver-79rmt          1/1   Running   3   4m20s   192.168.57.152    ip-10-0-129-191.us-west-2.compute.internal   <none>   <none>
  kube-system   nvidia-kubeaddons-nvidia-driver-fnhts          1/1   Running   3   4m20s   192.168.39.87     ip-10-0-128-134.us-west-2.compute.internal   <none>   <none>
  kube-system   nvidia-kubeaddons-nvidia-driver-ks9hf          1/1   Running   3   4m20s   192.168.243.98    ip-10-0-129-201.us-west-2.compute.internal   <none>   <none>
  kubeaddons    nvidia-kubeaddons-nvidia-dcgm-exporter-8ngx9   2/2   Running   0   4m20s   192.168.57.154    ip-10-0-129-191.us-west-2.compute.internal   <none>   <none>
  kubeaddons    nvidia-kubeaddons-nvidia-dcgm-exporter-mwwl6   2/2   Running   0   4m20s   192.168.243.100   ip-10-0-129-201.us-west-2.compute.internal   <none>   <none>
  kubeaddons    nvidia-kubeaddons-nvidia-dcgm-exporter-ttjqs   2/2   Running   0   4m20s   192.168.39.89     ip-10-0-128-134.us-west-2.compute.internal   <none>   <none>
  kubeaddons    nvidia-kubeaddons-nvidia-dcgm-exporter-xqj6r   2/2   Running   0   4m20s   192.168.119.36    ip-10-0-128-208.us-west-2.compute.internal   <none>   <none>

  $ kubectl delete pod -n kubeaddons nvidia-kubeaddons-nvidia-dcgm-exporter-mwwl6
  pod "nvidia-kubeaddons-nvidia-dcgm-exporter-mwwl6" deleted

  $ kubectl delete pod -n kube-system nvidia-kubeaddons-nvidia-device-plugin-tsbk2 nvidia-kubeaddons-nvidia-driver-ks9hf
  pod "nvidia-kubeaddons-nvidia-device-plugin-tsbk2" deleted
  pod "nvidia-kubeaddons-nvidia-driver-ks9hf" deleted
  ```

- To collect more debug information on the Nvidia addon, run:

  ```bash
  helm get nvidia-kubeaddons
  ```
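Instead of deleting the affected pods one by one, you can restart every Nvidia addon pod on a given node with a small loop; the node name is a placeholder, and the DaemonSets recreate the pods automatically:

```bash
NODE=ip-10-0-129-201.us-west-2.compute.internal   # replace with the affected node

# Delete every Nvidia addon pod scheduled on that node.
kubectl get pod -A -o wide | grep nvidia | grep "${NODE}" | \
  awk '{print $1, $2}' | while read -r ns pod; do
    kubectl delete pod -n "${ns}" "${pod}"
  done
```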