KIB for GPU

Create a GPU-supported OS image using Konvoy Image Builder

Using the Konvoy Image Builder (KIB), you can build an OS image that supports NVIDIA GPU hardware for running GPU workloads.

NOTE: The NVIDIA driver requires a specific Linux kernel version. Make sure that the base image for the OS version has the required kernel version.

See Supported Infrastructure Operating Systems for a list of OS versions and the corresponding kernel versions known to work with the NVIDIA driver.

If you have not already downloaded the NVIDIA runfile installer, download it by running the following commands. The first command downloads the runfile, and the second moves it into the artifacts directory. Create the artifacts directory first if you don’t already have one; otherwise, mv renames the runfile to artifacts instead of moving it into a directory.

```
curl -O https://download.nvidia.com/XFree86/Linux-x86_64/470.82.01/NVIDIA-Linux-x86_64-470.82.01.run
mv NVIDIA-Linux-x86_64-470.82.01.run artifacts
```

DKP supports NVIDIA driver version 470.x.
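
The download step can be wrapped so that the artifacts directory is created when missing and the runfile is not fetched twice. The following is a minimal sketch, not part of KIB: the function names `runfile_name`, `runfile_url`, and `stage_runfile` are hypothetical, and the URL layout is assumed to follow the download URL shown above for other driver versions.

```shell
# Hypothetical helpers; only the URL layout and the artifacts directory
# come from the steps above.

# Print the runfile name for a driver version, for example 470.82.01.
runfile_name() {
  printf 'NVIDIA-Linux-x86_64-%s.run\n' "$1"
}

# Print the download URL, assuming the same layout as the URL shown above.
runfile_url() {
  printf 'https://download.nvidia.com/XFree86/Linux-x86_64/%s/%s\n' "$1" "$(runfile_name "$1")"
}

# Download the runfile into the artifacts directory unless it is already there.
# $1 = driver version, $2 = target directory (defaults to ./artifacts).
stage_runfile() (
  version="$1"
  dir="${2:-artifacts}"
  target="$dir/$(runfile_name "$version")"
  mkdir -p "$dir"
  if [ -f "$target" ]; then
    echo "already staged: $target"
  else
    # -f fails on HTTP errors, -L follows redirects, -o sets the output path.
    curl -fL -o "$target" "$(runfile_url "$version")"
  fi
)
```

For example, `stage_runfile 470.82.01` would place the installer in `artifacts/` relative to the current directory.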

To build an image for use on GPU-enabled hardware, perform the following steps.

In your overrides/nvidia.yaml file, add the following to enable GPU builds. You can also find these override files in the overrides repo, or in the documentation under Nvidia GPU Override File and Offline Nvidia Override file.
1. Non-air-gapped GPU override:
```
gpu:
  types:
    - nvidia
build_name_extra: "-nvidia"
```
2. Air-gapped GPU override:
```
# Use this file when building a machine image, not as an override secret for preprovisioned environments
nvidia_runfile_local_file: "{{ playbook_dir}}/../artifacts/{{ nvidia_runfile_installer }}"
gpu:
  types:
    - nvidia

build_name_extra: "-nvidia"
```
  NOTE: For RHEL Pre-provisioned Override Files used with KIB, see the specific note for GPU.
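
If you prefer to create the override file from the command line, the two variants above can be generated with a small helper. This is a sketch only: `write_nvidia_override` is a hypothetical function name, and the YAML it writes is copied from the overrides shown above.

```shell
# Write a GPU override file for KIB (hypothetical helper).
# $1 = output path, $2 = "airgapped" to add the local runfile setting (optional).
write_nvidia_override() (
  out="$1"
  mode="${2:-online}"
  mkdir -p "$(dirname "$out")"
  {
    if [ "$mode" = "airgapped" ]; then
      # Single quotes keep the template braces literal.
      echo 'nvidia_runfile_local_file: "{{ playbook_dir}}/../artifacts/{{ nvidia_runfile_installer }}"'
    fi
    # The quoted heredoc delimiter prevents any shell expansion in the body.
    cat <<'EOF'
gpu:
  types:
    - nvidia
build_name_extra: "-nvidia"
EOF
  } > "$out"
)
```

For example, `write_nvidia_override overrides/nvidia.yaml airgapped` would write the air-gapped variant.
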
Build your image using the Konvoy Image Builder build command. Include the --instance-type flag to specify an AWS instance type that has an available GPU:
AWS Example:
```
konvoy-image build --region us-east-1 --instance-type=p2.xlarge --source-ami=ami-12345abcdef images/ami/centos-7.yaml --overrides overrides/nvidia.yaml
```
In this example, we chose an instance type with an NVIDIA GPU using the --instance-type flag, and we provided the NVIDIA overrides using the --overrides flag. See KIB with AWS for more information on creating an AMI.
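
Because a missing overrides file is easy to overlook, the build command can be wrapped in a small pre-flight check. The sketch below is an illustration, not part of KIB: `build_gpu_image` and the `DRY_RUN` variable are hypothetical names, and the wrapper uses only the flags that appear in the example above.

```shell
# Hypothetical wrapper around the build command shown above.
# Set DRY_RUN=1 to print the command instead of running it.
# $1 = region, $2 = instance type, $3 = source AMI,
# $4 = image definition file, $5 = overrides file.
build_gpu_image() (
  region="$1"
  instance_type="$2"
  source_ami="$3"
  image="$4"
  overrides="$5"
  # Stop early if the overrides file does not exist.
  if [ ! -f "$overrides" ]; then
    echo "overrides file not found: $overrides" >&2
    exit 1
  fi
  # Rebuild the positional parameters as the full command line.
  set -- konvoy-image build --region "$region" "--instance-type=$instance_type" \
    "--source-ami=$source_ami" "$image" --overrides "$overrides"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
)
```

Running it with `DRY_RUN=1` set and the values from the example prints the same command line as above, which lets you review the command before starting a build.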

Additional helpful information can be found in the NVIDIA Device Plug-in for Kubernetes instructions and the Installation Guide of Supported Platforms.

Verification

To verify that the NVIDIA driver is working, connect to the node and execute this command:

```
nvidia-smi
```

When the drivers are successfully installed, the output looks similar to the following. The driver version, GPU model, and utilization values depend on your environment:

```
Fri Jun 11 09:05:31 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   35C    P0    73W / 149W |      0MiB / 11441MiB |     99%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```
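
To check the installed driver against the supported 470.x series without reading the banner by eye, the version can be extracted from the nvidia-smi output. This is a minimal sketch: `driver_version_from_smi` and `is_470_series` are hypothetical helper names, and the parsing assumes the banner contains a `Driver Version:` field formatted as in the output above.

```shell
# Read nvidia-smi banner text on stdin and print the reported driver version.
driver_version_from_smi() {
  sed -n 's/.*Driver Version: \([0-9.][0-9.]*\).*/\1/p' | head -n 1
}

# Succeed only for driver versions in the 470.x series.
is_470_series() {
  case "$1" in
    470.*) return 0 ;;
    *) return 1 ;;
  esac
}
```

On a GPU node, `version="$(nvidia-smi | driver_version_from_smi)"` followed by `is_470_series "$version" || echo "unexpected driver: $version"` reports a mismatch. The sample output above shows driver 460.73.01, which this check would flag.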