Kommander GPU Settings and Troubleshooting
Validate and troubleshoot GPUs
Validate that the Application has Started Correctly
Run the following command to validate that your application has started correctly:
kubectl get pods -A | grep nvidia
The output should be similar to the following:
nvidia-container-toolkit-daemonset-7h2l5 1/1 Running 0 150m
nvidia-container-toolkit-daemonset-mm65g 1/1 Running 0 150m
nvidia-container-toolkit-daemonset-mv7xj 1/1 Running 0 150m
nvidia-cuda-validator-pdlz8 0/1 Completed 0 150m
nvidia-cuda-validator-r7qc4 0/1 Completed 0 150m
nvidia-cuda-validator-xvtqm 0/1 Completed 0 150m
nvidia-dcgm-exporter-9r6rl 1/1 Running 1 (149m ago) 150m
nvidia-dcgm-exporter-hn6hn 1/1 Running 1 (149m ago) 150m
nvidia-dcgm-exporter-j7g7g 1/1 Running 0 150m
nvidia-dcgm-jpr57 1/1 Running 0 150m
nvidia-dcgm-jwldh 1/1 Running 0 150m
nvidia-dcgm-qg2vc 1/1 Running 0 150m
nvidia-device-plugin-daemonset-2gv8h 1/1 Running 0 150m
nvidia-device-plugin-daemonset-tcmgk 1/1 Running 0 150m
nvidia-device-plugin-daemonset-vqj88 1/1 Running 0 150m
nvidia-device-plugin-validator-9xdqr 0/1 Completed 0 149m
nvidia-device-plugin-validator-jjhdr 0/1 Completed 0 149m
nvidia-device-plugin-validator-llxjk 0/1 Completed 0 149m
nvidia-operator-validator-9kzv4 1/1 Running 0 150m
nvidia-operator-validator-fvsr7 1/1 Running 0 150m
nvidia-operator-validator-qr9cj 1/1 Running 0 150m
If you see errors, ensure that you set the container toolkit version appropriately for your OS, as described in the previous section.
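To narrow down a failure, it can help to inspect the events and logs of any pod that is not Running or Completed. A minimal sketch, where the namespace and pod name are placeholders you substitute from the output above:
# Show events and status for a failing pod (placeholders: <namespace>, <failing-pod-name>)
kubectl describe pod -n <namespace> <failing-pod-name>
# Show logs from all of its containers
kubectl logs -n <namespace> <failing-pod-name> --all-containers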
NVIDIA GPU Monitoring
Kommander uses the NVIDIA Data Center GPU Manager (DCGM) to export GPU metrics to Prometheus. By default, Kommander includes a Grafana dashboard called NVIDIA DCGM Exporter Dashboard to monitor GPU metrics. This GPU dashboard is shown in Kommander's Grafana UI.
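If you want to confirm that GPU metrics are actually being produced, you can query the exporter directly. A minimal sketch, assuming the exporter service is named nvidia-dcgm-exporter and listens on the default port 9400; adjust the namespace and service name for your installation:
# Forward the exporter port locally, then look for a standard DCGM metric
kubectl port-forward -n <gpu-operator-namespace> svc/nvidia-dcgm-exporter 9400:9400 &
curl -s localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL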
NVIDIA MIG Settings
MIG stands for Multi-Instance GPU. It is a mode of operation on newer NVIDIA GPUs that allows the user to partition a GPU into a set of MIG devices. Each MIG device appears to the software consuming it as a mini-GPU with a fixed partition of memory and a fixed partition of compute resources.
NOTE: MIG is only available for the following NVIDIA devices: H100, A100, and A30.
To Configure MIG
Set the MIG strategy according to your GPU topology.
• mig.strategy should be set to mixed when MIG mode is not enabled on all GPUs on a node.
• mig.strategy should be set to single when MIG mode is enabled on all GPUs on a node and they have the same MIG device types across all of them.
For the Management Cluster, this can be set at install time by modifying the Kommander configuration file to add configuration for the nvidia-gpu-operator application:
apiVersion: config.kommander.mesosphere.io/v1alpha1
kind: Installation
apps:
  nvidia-gpu-operator:
    values: |
      mig:
        strategy: single
...
Alternatively, modify the clusterPolicy object for the GPU operator once it has already been installed, as sketched below.
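A minimal sketch, assuming the GPU operator's ClusterPolicy resource uses the default name cluster-policy (verify the name with kubectl get clusterpolicies.nvidia.com before patching):
# Switch the MIG strategy on an already-installed GPU operator
kubectl patch clusterpolicies.nvidia.com/cluster-policy \
  --type merge -p '{"spec": {"mig": {"strategy": "single"}}}'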
Set the MIG profile for the GPU you are using. In our example, we are using the A30 GPU, which supports the following MIG profiles:
4 GPU instances @ 6GB each
2 GPU instances @ 12GB each
1 GPU instance @ 24GB
Set the MIG profile by labeling the node ${NODE} with the profile, as in the example below:
kubectl label nodes ${NODE} nvidia.com/mig.config=all-1g.6gb --overwrite
Check the node labels to see if the changes were applied to your MIG-enabled GPU node:
kubectl get no -o json | jq .items[0].metadata.labels
CODE"nvidia.com/mig.config": "all-1g.6gb", "nvidia.com/mig.config.state": "success", "nvidia.com/mig.strategy": "single"
Deploy a sample workload:
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vectoradd
      image: "nvidia/samples:vectoradd-cuda11.2.1"
      resources:
        limits:
          nvidia.com/gpu: 1
  nodeSelector:
    "nvidia.com/gpu.product": NVIDIA-A30-MIG-1g.6gb
If the workload successfully finishes, then your GPU has been properly MIG partitioned.
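To confirm the result, you can check the pod's logs; the CUDA vectorAdd sample prints a pass message on success. A sketch, assuming the pod was created in the default namespace:
# Expect output containing a line similar to: Test PASSED
kubectl logs cuda-vector-add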
Troubleshooting NVIDIA GPU Operator on Kommander
If you run into any errors with the NVIDIA GPU Operator, here are a couple of commands you can run to troubleshoot:
Connect (using SSH or similar) to your GPU-enabled nodes and run the nvidia-smi command. Your output should be similar to the following example:
[ec2-user@ip-10-0-0-241 ~]$ nvidia-smi
Thu Nov  3 22:52:59 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.06    Driver Version: 535.183.06    CUDA Version: 12.2.2 |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   54C    P8    11W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Another common issue is a misconfigured toolkit version, which leaves the NVIDIA pods in a bad state. For example:
nvidia-container-toolkit-daemonset-jrqt2   1/1   Running                 0             29s
nvidia-dcgm-exporter-b4mww                 0/1   Error                   1 (9s ago)    16s
nvidia-dcgm-pqsz8                          0/1   CrashLoopBackOff        1 (13s ago)   27s
nvidia-device-plugin-daemonset-7fkzr       0/1   Init:0/1                0             14s
nvidia-operator-validator-zxn4w            0/1   Init:CrashLoopBackOff   1 (7s ago)    11s
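To see why a validator pod is failing, its container logs are usually the quickest signal. A sketch, assuming the namespace and init container names used by your GPU operator version (toolkit-validation is typical for the operator validator pod; the pod name below is taken from the example output above):
# Inspect the failing init container of the operator validator pod
kubectl logs -n <gpu-operator-namespace> nvidia-operator-validator-zxn4w -c toolkit-validation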
To modify the toolkit version, run the following commands to modify the AppDeployment for the nvidia-gpu-operator application:
• Provide the name of a ConfigMap with the custom configuration in the AppDeployment:
cat <<EOF | kubectl apply -f -
apiVersion: apps.kommander.d2iq.io/v1alpha3
kind: AppDeployment
metadata:
  name: nvidia-gpu-operator
  namespace: kommander
spec:
  appRef:
    kind: ClusterApp
    name: nvidia-gpu-operator-1.11.1
  configOverrides:
    name: nvidia-gpu-operator-overrides
EOF
• Create the ConfigMap with the name provided in the previous step. This ConfigMap provides the custom configuration on top of the default configuration; set the toolkit version appropriately:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  namespace: kommander
  name: nvidia-gpu-operator-overrides
data:
  values.yaml: |
    toolkit:
      version: v1.10.0-centos7
EOF
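To confirm that the override was picked up once the operator reconciles, you can check the image used by the toolkit DaemonSet. A sketch; the namespace in which the GPU operator creates its operands may vary by installation:
# The IMAGES column should show the toolkit version you configured
kubectl get daemonsets -A -o wide | grep nvidia-container-toolkit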
If a node has an NVIDIA GPU installed and the nvidia-gpu-operator application is enabled on the cluster, but the node is still not accepting GPU workloads, it is possible that the node is missing a label indicating that an NVIDIA GPU is present.
By default, the GPU operator attempts to configure nodes with the following labels present, which are usually applied by the node feature discovery component:
"feature.node.kubernetes.io/pci-10de.present": "true",
"feature.node.kubernetes.io/pci-0302_10de.present": "true",
"feature.node.kubernetes.io/pci-0300_10de.present": "true",
If these labels are not present on a node that you know contains an NVIDIA GPU, you can manually label the node using the following command:
kubectl label node ${NODE} feature.node.kubernetes.io/pci-0302_10de.present=true
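After labeling, you can verify that the operator finished configuring the node and that the GPU is advertised as an allocatable resource. A sketch, assuming the default resource name nvidia.com/gpu exposed by the device plugin:
# The Allocatable block should list nvidia.com/gpu with a non-zero count
kubectl describe node ${NODE} | grep -A 8 "Allocatable:"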
Disable NVIDIA GPU Operator Platform Application on Kommander
Delete all GPU workloads on the GPU nodes where the NVIDIA GPU Operator platform application is present.
Delete the existing NVIDIA GPU Operator AppDeployment using the following command:
kubectl delete appdeployment -n kommander nvidia-gpu-operator
Wait for all NVIDIA related resources in the Terminating state to be cleaned up. You can check pod status with the following command:
kubectl get pods -A | grep nvidia
For information on how to delete node pools, refer to Pre-provisioned Create and Delete Node Pools.