GPUs

Configure GPU support for a Konvoy cluster

This section describes Nvidia GPU support in Konvoy and assumes familiarity with Kubernetes GPU support. AMD GPUs are currently not supported.

Konvoy GPU Overview

GPU support in Konvoy is provided by the nvidia-container-runtime and the Nvidia Driver Container. When Nvidia GPU support is enabled, Konvoy configures the container runtime to run GPU containers and installs everything needed to enable the Nvidia GPU devices. Konvoy runs every Nvidia GPU-dependent component as a daemonset, which makes the components easier to manage and upgrade.

NOTE: Konvoy assumes there are no legacy GPU drivers or device plugins running on the host machine. Legacy GPU drivers and plugins can conflict with the Konvoy deployed drivers and plugins.

The following components provide Nvidia GPU support on Konvoy: the nvidia-container-runtime, the Nvidia Driver Container (nvidia-driver), the Nvidia Device Plugin (nvidia-device-plugin), and the Nvidia GPU Metrics Exporter (nvidia-dcgm-exporter).
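
Because these components run as daemonsets, you can inspect them directly once the nvidia addon is deployed. For example (the exact daemonset names depend on your addon release):

kubectl get daemonsets -A | grep nvidia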

Requirements

  1. Ubuntu 16.04 or CentOS 7, with the IPMI driver enabled and the Nouveau driver disabled.

  2. NVIDIA GPU with Fermi architecture version 2.1 or greater.

  3. On CentOS 7, we recommend using the latest kernel version in production. The default AMI's kernel version, 3.10.0-1062, works as well. If you want to use a specific kernel version on CentOS, refer to the following sections for detailed configuration. Ubuntu does not have this kernel version restriction. The checks shown after this list can help verify these requirements.
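
For example, the following commands, run on a GPU node, confirm the kernel version and verify that the Nouveau driver is not loaded (empty output from the lsmod check means Nouveau is disabled):

uname -r
lsmod | grep nouveau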

Configuration

Enable Nvidia GPU on Konvoy

To enable Nvidia GPU support on Konvoy, add a GPU nodePool to both the ClusterProvisioner and ClusterConfiguration specs, then enable the nvidia addon. Here is an example cluster.yaml:

kind: ClusterProvisioner
apiVersion: konvoy.mesosphere.io/v1beta2
spec:
  provider: aws
  nodePools:
  - name: gpu-worker
    count: 4
    machine:
      rootVolumeSize: 80
      rootVolumeType: gp2
      imagefsVolumeEnabled: true
      imagefsVolumeSize: 160
      imagefsVolumeType: gp2
      imagefsVolumeDevice: xvdb
      type: p2.xlarge
---
kind: ClusterConfiguration
apiVersion: konvoy.mesosphere.io/v1beta2
spec:
  nodePools:
  - name: gpu-worker
    gpu:
      nvidia: {}
  addons:
  - configRepository: https://github.com/mesosphere/kubernetes-base-addons
    configVersion: stable-1.18-3.0.0
    addonsList:
    - name: nvidia
      enabled: true
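
After the cluster is provisioned and the nvidia addon is running, the Nvidia Device Plugin advertises GPUs on each node as the nvidia.com/gpu resource. A quick way to confirm that the GPU nodes have registered their devices:

kubectl describe nodes | grep "nvidia.com/gpu"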

Nvidia Driver Configuration Based on Kernel Version

Additional configuration may be needed if the kernel version of your GPU nodes is not the latest one available from the yum repository. In that case, you need to pick the matching image tag for your Nvidia Driver Container. For example, the following kernel versions are supported by the upstream nvidia/driver image:

440.64.00-1.0.0-3.10.0-1127.8.2.el7.x86_64-centos7
440.64.00-1.0.0-3.10.0-1127.el7.x86_64-centos7
440.64.00-1.0.0-3.10.0-1062.18.1.el7.x86_64-centos7
440.33.01-3.10.0-1062.12.1.el7.x86_64-centos7
440.33.01-3.10.0-1062.9.1.el7.x86_64-centos7
440.33.01-3.10.0-1062.7.1.el7.x86_64-centos7
440.33.01-3.10.0-1062.4.3.el7.x86_64-centos7
418.87.01-3.10.0-1062.4.3.el7.x86_64-centos7
418.87.01-3.10.0-1062.4.1.el7.x86_64-centos7
418.87.01-3.10.0-1062.1.2.el7.x86_64-centos7
418.40.04-3.10.0-957.21.3.el7.x86_64-centos7
418.40.04-3.10.0-957.21.2.el7.x86_64-centos7
418.40.04-3.10.0-957.12.2.el7.x86_64-centos7
418.40.04-3.10.0-957.12.1.el7.x86_64-centos7
418.40.04-3.10.0-957.10.1.el7.x86_64-centos7
396.37-3.10.0-957.5.1.el7.x86_64-centos7
396.37-3.10.0-957.1.3.el7.x86_64-centos7
396.37-3.10.0-862.14.4.el7.x86_64-centos7
396.37-3.10.0-862.14.4-centos7
396.37-3.10.0-862.11.6-centos7
396.37-3.10.0-862.9.1-centos7

This list is from the Nvidia Public Hub Repository. First, identify your kernel version. For example:

[centos@ip-10-0-128-77 ~]$ uname -r
3.10.0-1062.12.1.el7.x86_64

Then, find the corresponding pre-built nvidia/driver image in the list above. In this case, 440.33.01-3.10.0-1062.12.1.el7.x86_64-centos7 is the matching image tag. Configure cluster.yaml to use this image tag for your Nvidia Driver Container:

    - name: nvidia
      enabled: true
      values: |
        nvidia-driver:
          image:
            tag: "440.64.00-1.0.0-3.10.0-1127.el7.x86_64-centos7"

GPU on an Air-gapped On-premises Cluster

Follow the Konvoy air-gapped installation doc. Re-tag the nvidia/driver image with the corresponding tag identified in the previous section, and push it to your local registry. For example:

REGISTRY=yourlocalregistry.com:6443

docker tag nvidia/driver:440.33.01-3.10.0-1062.12.1.el7.x86_64-centos7 ${REGISTRY}/nvidia/driver:440.33.01-3.10.0-1062.12.1.el7.x86_64-centos7

docker push ${REGISTRY}/nvidia/driver:440.33.01-3.10.0-1062.12.1.el7.x86_64-centos7
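
Depending on how image mirroring is configured in your air-gapped setup, you may also need to point the nvidia-driver image at your local registry in cluster.yaml. The repository key below is an assumption based on the image.tag layout shown earlier; verify the exact key against the nvidia addon chart:

    - name: nvidia
      enabled: true
      values: |
        nvidia-driver:
          image:
            repository: "yourlocalregistry.com:6443/nvidia/driver"  # assumed chart key; verify against the nvidia addon chart values
            tag: "440.33.01-3.10.0-1062.12.1.el7.x86_64-centos7"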

Konvoy GPU Support on Ubuntu

By default, Konvoy assumes the cluster OS is CentOS. If you want to run GPU workloads on Ubuntu, you must update the Nvidia Driver Container image in cluster.yaml:

......
---
kind: ClusterConfiguration
apiVersion: konvoy.mesosphere.io/v1beta2
spec:
  addons:
  - configRepository: https://github.com/mesosphere/kubernetes-base-addons
    configVersion: stable-1.18-3.0.0
    addonsList:
    - name: nvidia
      enabled: true
      values: |
        nvidia-driver:
          enabled: true
          image:
            tag: "418.87.01-ubuntu16.04"

See the Nvidia Public Hub Repository for available driver container images.

How to prevent other workloads from running on GPU nodes

Use Kubernetes taints to ensure that only dedicated workloads are deployed on GPU machines, and add matching tolerations to your GPU workloads so they can be scheduled on the dedicated GPU nodes. See Kubernetes Taints and Tolerations for details. Here is an example cluster.yaml; an example GPU workload with a matching toleration follows it.

kind: ClusterProvisioner
apiVersion: konvoy.mesosphere.io/v1beta2
......
spec:
......
  nodePools:
  - name: gpu-worker
    count: 4
    machine:
      rootVolumeSize: 80
      rootVolumeType: gp2
      imagefsVolumeEnabled: true
      imagefsVolumeSize: 160
      imagefsVolumeType: gp2
      imagefsVolumeDevice: xvdb
      type: p2.xlarge
......
---
kind: ClusterConfiguration
......
spec:
......
  nodePools:
  - name: gpu-worker
    gpu:
      nvidia: {}
    labels:
      - key: dedicated
        value: gpu-worker
    taints:
      - key: dedicated
        value: gpu-worker
        effect: NoExecute
  addons:
  - configRepository: https://github.com/mesosphere/kubernetes-base-addons
    configVersion: stable-1.18-3.0.0
    addonsList:
......
    - name: nvidia
      enabled: true
      values: |
        nvidia-driver:
          tolerations:
            - key: "dedicated"
              operator: "Equal"
              value: "gpu-worker"
              effect: "NoExecute"
        nvidia-device-plugin:
          tolerations:
            - key: "dedicated"
              operator: "Equal"
              value: "gpu-worker"
              effect: "NoExecute"
        nvidia-dcgm-exporter:
          tolerations:
            - key: "dedicated"
              operator: "Equal"
              value: "gpu-worker"
              effect: "NoExecute"
......
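
With the taint above in place, a GPU workload must carry a matching toleration and request the nvidia.com/gpu resource to be scheduled on the gpu-worker nodes. The following pod spec is a minimal sketch; the pod name and CUDA image are illustrative:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test   # illustrative name
spec:
  restartPolicy: OnFailure
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "gpu-worker"
    effect: "NoExecute"
  containers:
  - name: cuda
    image: nvidia/cuda:10.2-base   # illustrative CUDA image
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1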

Nvidia GPU Monitoring

Konvoy uses the Nvidia GPU Metrics Exporter and the NVIDIA Data Center GPU Manager to display Nvidia GPU metrics. By default, Konvoy includes a Grafana dashboard called GPUs/Nvidia to monitor GPU metrics. This GPU dashboard is shown in Konvoy's Grafana UI.
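
If you want to inspect the raw metrics behind this dashboard, you can port-forward to one of the nvidia-dcgm-exporter pods and scrape its metrics endpoint. The pod name below is taken from the Debugging example, and port 9400 is the usual dcgm-exporter port; adjust both for your cluster:

kubectl port-forward -n kubeaddons nvidia-kubeaddons-nvidia-dcgm-exporter-mwwl6 9400:9400
# in a second terminal, while the port-forward is running:
curl -s localhost:9400/metrics | grep -i gpu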

Upgrade

Konvoy can upgrade the Nvidia GPU addon automatically. However, due to a limitation in Helm, which Konvoy uses internally, the GPU addon pods repeatedly restart with the CrashLoopBackOff status for 5-10 minutes during the upgrade. The Nvidia driver requires that at most one driver container run on any single node, so that the driver can reliably load the necessary kernel modules. This conflicts with the current Helm upgrade strategy: when Helm upgrades a chart, it deploys pods for the newer chart version while the old pods are still in the Terminating state, which creates a race condition against Nvidia's singleton requirement.

To overcome this limitation and upgrade the Nvidia GPU addon manually:

  1. Delete all GPU workloads on the GPU nodes where the Nvidia addon needs to be upgraded.

  2. Delete the existing Nvidia addon:

    kubectl delete clusteraddon nvidia
    
  3. Wait for all Nvidia-related resources in the Terminating state to be cleaned up. You can check pod status with:

    kubectl get pod -A | grep nvidia
    
  4. Specify the desired configVersion in your cluster.yaml. Then, deploy addons to upgrade the Nvidia GPU addon:

    konvoy deploy addons
    

Debugging

  1. Determine if all Nvidia pods are in Running state, as expected:

    kubectl get pod -A | grep nvidia
    
  2. If there are any Nvidia pods crashing, returning errors, or flapping, collect the logs for the problematic pod. For example:

    kubectl logs -n kube-system nvidia-kubeaddons-nvidia-driver-bbkwg
    
  3. To recover from this problem, you must restart all Nvidia addon pods running on the SAME host, because both nvidia-dcgm-exporter and nvidia-device-plugin depend on nvidia-driver. In the example below, all Nvidia resources on the node ip-10-0-129-201.us-west-2.compute.internal are restarted:

    $ kubectl get pod -A -o wide | grep nvidia
    kube-system    nvidia-kubeaddons-nvidia-device-plugin-dxtch                         1/1     Running     0          4m20s   192.168.57.153    ip-10-0-129-191.us-west-2.compute.internal   <none>           <none>
    kube-system    nvidia-kubeaddons-nvidia-device-plugin-j4dm2                         1/1     Running     0          4m20s   192.168.39.88     ip-10-0-128-134.us-west-2.compute.internal   <none>           <none>
    kube-system    nvidia-kubeaddons-nvidia-device-plugin-qb29b                         1/1     Running     0          4m20s   192.168.119.35    ip-10-0-128-208.us-west-2.compute.internal   <none>           <none>
    kube-system    nvidia-kubeaddons-nvidia-device-plugin-tsbk2                         1/1     Running     0          4m20s   192.168.243.99    ip-10-0-129-201.us-west-2.compute.internal   <none>           <none>
    kube-system    nvidia-kubeaddons-nvidia-driver-6m59m                                1/1     Running     3          4m20s   192.168.119.34    ip-10-0-128-208.us-west-2.compute.internal   <none>           <none>
    kube-system    nvidia-kubeaddons-nvidia-driver-79rmt                                1/1     Running     3          4m20s   192.168.57.152    ip-10-0-129-191.us-west-2.compute.internal   <none>           <none>
    kube-system    nvidia-kubeaddons-nvidia-driver-fnhts                                1/1     Running     3          4m20s   192.168.39.87     ip-10-0-128-134.us-west-2.compute.internal   <none>           <none>
    kube-system    nvidia-kubeaddons-nvidia-driver-ks9hf                                1/1     Running     3          4m20s   192.168.243.98    ip-10-0-129-201.us-west-2.compute.internal   <none>           <none>
    kubeaddons     nvidia-kubeaddons-nvidia-dcgm-exporter-8ngx9                         2/2     Running     0          4m20s   192.168.57.154    ip-10-0-129-191.us-west-2.compute.internal   <none>           <none>
    kubeaddons     nvidia-kubeaddons-nvidia-dcgm-exporter-mwwl6                         2/2     Running     0          4m20s   192.168.243.100   ip-10-0-129-201.us-west-2.compute.internal   <none>           <none>
    kubeaddons     nvidia-kubeaddons-nvidia-dcgm-exporter-ttjqs                         2/2     Running     0          4m20s   192.168.39.89     ip-10-0-128-134.us-west-2.compute.internal   <none>           <none>
    kubeaddons     nvidia-kubeaddons-nvidia-dcgm-exporter-xqj6r                         2/2     Running     0          4m20s   192.168.119.36    ip-10-0-128-208.us-west-2.compute.internal   <none>           <none>
    $ kubectl delete pod -n kubeaddons nvidia-kubeaddons-nvidia-dcgm-exporter-mwwl6
    pod "nvidia-kubeaddons-nvidia-dcgm-exporter-mwwl6" deleted
    $ kubectl delete pod -n kube-system nvidia-kubeaddons-nvidia-device-plugin-tsbk2 nvidia-kubeaddons-nvidia-driver-ks9hf
    pod "nvidia-kubeaddons-nvidia-device-plugin-tsbk2" deleted
    pod "nvidia-kubeaddons-nvidia-driver-ks9hf" deleted
    
  4. To collect more debug information on the Nvidia addon, run:

    helm get nvidia-kubeaddons