Create a GPU supported OS image using Konvoy Image Builder

Using Konvoy Image Builder, you can build an OS image with NVIDIA GPU support for running GPU workloads.

If you have not already downloaded the NVIDIA runfile installer, retrieve it by running the following commands. The first command downloads the runfile and the second places it in the artifacts directory.

curl -O https://download.nvidia.com/XFree86/Linux-x86_64/470.82.01/NVIDIA-Linux-x86_64-470.82.01.run
mv NVIDIA-Linux-x86_64-470.82.01.run artifacts
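If you script this step, it can help to guard against an empty or failed download before moving the file. A minimal sketch under that assumption; the RUNFILE_STATE variable is illustrative, not part of Konvoy Image Builder:

```shell
# Move the runfile into artifacts/ only if the download produced a non-empty file.
RUNFILE=NVIDIA-Linux-x86_64-470.82.01.run
mkdir -p artifacts
if [ -s "$RUNFILE" ]; then
  mv "$RUNFILE" artifacts/
  RUNFILE_STATE=moved
else
  RUNFILE_STATE=absent   # download failed or has not been run yet
fi
echo "runfile state: $RUNFILE_STATE"
```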

The NVIDIA driver version supported by DKP is 470.x.

To build an image for use on GPU enabled hardware, perform the following steps.

  1. In your overrides/nvidia.yaml file, add the following to enable GPU builds. You can also find this override in the overrides repo, or in the documentation under NVIDIA GPU Overrides or Offline NVIDIA Override.

    gpu:
      type:
        - nvidia

  2. Build your image using the following Konvoy Image Builder commands, making sure to include the --instance-type flag, which specifies an AWS instance type that has an available GPU:
    AWS Example:

    konvoy-image build --region us-west-2 --instance-type=p2.xlarge --source-ami=ami-12345abcdef images/ami/centos-7.yaml --overrides overrides/nvidia.yaml

    By default, your image builds in the us-west-2 region. To specify another region, set the --region flag:

    konvoy-image build --region us-east-1 --instance-type=p2.xlarge --overrides override-source-ami.yaml images/ami/<Your OS>.yaml


NOTE: Ensure that an AMI YAML file is available for your OS selection (for example, images/ami/centos-7.yaml).

When the command completes, the AMI ID is printed and written to ./manifest.json.
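If you need the AMI ID in a script, it can be pulled out of manifest.json. A minimal sketch, assuming a Packer-style manifest in which each build entry carries an artifact_id of the form region:ami-id; the heredoc below fabricates a sample manifest purely for illustration, and in practice you would read the real ./manifest.json that the build wrote:

```shell
# Sample manifest standing in for the one konvoy-image build writes;
# in practice, skip this heredoc and read the real ./manifest.json.
cat > manifest.json <<'EOF'
{"builds":[{"artifact_id":"us-west-2:ami-0123456789abcdef0"}]}
EOF

# Strip the "region:" prefix to get the bare AMI ID.
AMI=$(sed -n 's/.*"artifact_id": *"[^:"]*:\(ami-[0-9a-f]*\)".*/\1/p' manifest.json)
echo "AMI: $AMI"
```

The extracted value can then be passed to the --ami flag when creating the cluster.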

To use the built AMI with Konvoy, specify it with the --ami flag when running dkp create cluster.

dkp create cluster aws --cluster-name=$(whoami)-aws-cluster --region us-west-2 --ami <ami>

To use overrides/nvidia.yaml on pre-provisioned infrastructure, follow the GPU steps in the Pre-provisioned section of the documentation.

Verification

To verify that the NVIDIA driver is working, connect to the node and execute this command:

nvidia-smi

When the drivers are successfully installed, the output looks similar to the following:

Fri Jun 11 09:05:31 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   35C    P0    73W / 149W |      0MiB / 11441MiB |     99%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
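To run the same check non-interactively, for example from a provisioning script, one possible sketch is below; the GPU_STATUS variable is illustrative and not part of any NVIDIA tooling:

```shell
# Record whether nvidia-smi exists on PATH and can reach the driver.
if command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi >/dev/null 2>&1; then
  GPU_STATUS=ok
else
  GPU_STATUS=missing
fi
echo "NVIDIA driver status: $GPU_STATUS"
```

A script can branch on this value, for example failing the provisioning run when the status is missing.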

Additional helpful information is available in the NVIDIA Device Plugin for Kubernetes instructions and in the Installation Guide's list of supported platforms.

See also: NVIDIA documentation