VMware Modern Apps Community
_paddy_
Contributor

Photon OS v4 or v5 with NVidia CUDA

I have a requirement for a docker container to utilise the NVidia CUDA system.

Currently I use an Ubuntu Server VM on ESXi 6.7u2 with the NVidia GFX card passed through exclusively. I want to move to Photon OS due to the lower system footprint and to consolidate OS types!

I found the following answer on this community but following those steps results in errors.

Below are the steps followed, which combine the instructions from the VMware Communities post, and the NVidia Installation Guide for Docker.

Any help or advice is welcome.

VM Creation

 

Create new VM in ESXi
Add PCI Device and select GP107GL [Quadro P620]
20GB disk - thin prov
8GB RAM - All reserved
Mount disk ISO of photon-minimal-4.0-rev2-c001795b8.iso
VM setting of:
Hypervisor.CPUID.v0 FALSE
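
If the passthrough setting is applied by editing the VM's .vmx file rather than through the host client UI, the equivalent line (standard ESXi .vmx syntax is assumed here) is:

hypervisor.cpuid.v0 = "FALSE"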

 

Photon Install

 

>> Start VM
>> Select "VMware kernel" (not generic Linux)

# System update
tdnf -y update
tdnf -y upgrade

# Configure SSH
systemctl start sshd
systemctl enable sshd
vim /etc/ssh/sshd_config
PermitRootLogin yes
systemctl restart sshd
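
# (Optional) non-interactive alternative to editing sshd_config by hand; assumes the
# stock Photon sshd_config ships PermitRootLogin commented out or set to a non-yes value
sed -i 's/^#\?PermitRootLogin.*/PermitRootLogin yes/' /etc/ssh/sshd_config
sshd -t && systemctl restart sshd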

# Docker start
systemctl start docker
systemctl enable docker

 

Install NVidia drivers

 

# Get sources
tdnf install -y linux-esx-devel
reboot

# install kernel api headers and devel
tdnf install -y build-essential wget tar
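
# Sanity check (optional): the running kernel and the installed devel headers should
# match, otherwise the NVidia .run installer cannot build its kernel module
uname -r
rpm -qa | grep linux-esx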

# Resize tmp
umount /tmp
mount -t tmpfs -o size=2G tmpfs /tmp
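
# (Optional) keep the larger /tmp across reboots instead of remounting by hand;
# assumes a 2G tmpfs-backed /tmp is acceptable on this host
echo 'tmpfs /tmp tmpfs size=2G 0 0' >> /etc/fstab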

# NVidia drivers from here: https://www.nvidia.com/en-us/drivers/unix/
wget https://uk.download.nvidia.com/XFree86/Linux-x86_64/525.105.17/NVIDIA-Linux-x86_64-525.105.17.run
chmod a+x ./NVIDIA-Linux-x86_64-525.105.17.run
./NVIDIA-Linux-x86_64-525.105.17.run
reboot

# check nvidia device is found
nvidia-smi

Thu Apr 27 07:19:37 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro P620         Off  | 00000000:0B:00.0 Off |                  N/A |
| 40%   47C    P0    N/A /  40W |      0MiB /  2048MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

 

Drivers installed ok.

Install NVidia Container Toolkit

 

# Setup the package repository and the GPG key:
tdnf install -y gpg
cd /etc/pki/rpm-gpg/
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | gpg --dearmor -o /etc/pki/rpm-gpg/nvidia-container-toolkit-keyring.gpg

cat << EOF >>/etc/yum.repos.d/nvidia-container-toolkit.repo
[libnvidia-container]
name=libnvidia-container
baseurl=https://nvidia.github.io/libnvidia-container/centos7/x86_64
gpgcheck=0
enabled=1
EOF
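
# (Alternative, untested here) since the key was already placed under /etc/pki/rpm-gpg,
# signature checking could be enabled instead of gpgcheck=0 by using:
#   gpgcheck=1
#   gpgkey=file:///etc/pki/rpm-gpg/nvidia-container-toolkit-keyring.gpg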

# Install the toolkit
tdnf makecache
tdnf install nvidia-container-toolkit

# Register the runtime with docker
nvidia-ctk runtime configure --runtime=docker
systemctl restart docker

rm /etc/yum.repos.d/nvidia-container-toolkit.repo
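
For reference, after "nvidia-ctk runtime configure --runtime=docker" the daemon config normally ends up looking roughly like this (exact contents can vary by toolkit version):

{
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}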

 

Test with a base CUDA container

According to the installation guide, the output of the following should match the nvidia-smi table above:

 

docker run --rm --runtime=nvidia --gpus all nvidia/cuda:12.0.0-base-ubuntu20.04 nvidia-smi

 

but I get:

 

docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: driver rpc error: failed to process request: unknown.

 

dmesg:

 

[36185.054996] audit: type=1006 audit(1682579973.442:412): pid=20385 uid=0 subj=unconfined old-auid=4294967295 auid=0 tty=(none) old-ses=4294967295 ses=3 res=1
[36823.023793] docker0: port 2(veth5e0f5e7) entered blocking state
[36823.023796] docker0: port 2(veth5e0f5e7) entered disabled state
[36823.023843] device veth5e0f5e7 entered promiscuous mode
[36823.023864] audit: type=1700 audit(1682580611.410:413): dev=veth5e0f5e7 prom=256 old_prom=0 auid=4294967295 uid=0 gid=0 ses=4294967295
[36823.109463] nvc:[driver][20748]: segfault at 30 ip 00007f50a8466866 sp 00007fff51909d30 error 4 in libnvidia-container.so.1.13.1[7f50a8444000+39000]
[36823.109468] Code: 00 e8 fe 4a 00 00 39 c5 7c 12 45 85 e4 0f 85 f9 00 00 00 5b 5d 41 5c c3 0f 1f 40 00 48 8b 05 21 af 21 00 48 63 fd 48 8d 04 f8 <48> 39 18 75 db 81 fd ff 03 00 00 48 c7 00 00 00 00 00 7f 7e e8 e1
[36823.109496] audit: type=1701 audit(1682580611.498:414): auid=4294967295 uid=0 gid=0 ses=4294967295 subj=unconfined pid=20748 comm="nvc:[driver]" exe="/usr/bin/nvidia-container-cli" sig=11 res=1
[36823.262165] docker0: port 2(veth5e0f5e7) entered disabled state
[36823.262536] device veth5e0f5e7 left promiscuous mode
[36823.262549] docker0: port 2(veth5e0f5e7) entered disabled state
[36823.262576] audit: type=1700 audit(1682580611.650:415): dev=veth5e0f5e7 prom=0 old_prom=256 auid=4294967295 uid=0 gid=0 ses=4294967295
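
A narrower way to exercise the failing path, taking docker and runc out of the loop, would be to call the container CLI directly on the host (this is the same binary that segfaults above):

nvidia-container-cli -k -d /dev/tty info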

 

Searching on the internet reveals people on various platforms with the same error, but no general resolution.

As I said, this is working fine with Ubuntu, but I would like to consolidate my VMs onto Photon.

Any help or advice is welcome.

---------

Photon OS 5_RC

Trying with photon-minimal-5.0_RC-4d5974638.x86_64 doesn't work either:

* Installing the drivers works ok
* Installing the NVidia-container-toolkit works ok

Registering the toolkit with docker

 

nvidia-ctk runtime configure --runtime=docker
systemctl restart docker

 

results in the error:

 

INFO[0000] Loading docker config from /etc/docker/daemon.json
INFO[0000] Config file does not exist, creating new one
ERRO[0000] unable to flush config: unable to open /etc/docker/daemon.json for writing: open /etc/docker/daemon.json: no such file or directory
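
A possible workaround (not tried here) would be to seed an empty JSON file so nvidia-ctk has something to rewrite:

mkdir -p /etc/docker
echo '{}' > /etc/docker/daemon.json
nvidia-ctk runtime configure --runtime=docker
systemctl restart docker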

 

Manually registering the runtime with a systemd drop-in does work, though:

 

sudo mkdir -p /etc/systemd/system/docker.service.d

sudo tee /etc/systemd/system/docker.service.d/override.conf <<EOF
[Service]
ExecStart=
ExecStart=/usr/bin/dockerd --host=fd:// --add-runtime=nvidia=/usr/bin/nvidia-container-runtime
EOF

sudo systemctl daemon-reload \
&& sudo systemctl restart docker
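
If the drop-in is picked up, docker should now list the extra runtime:

docker info 2>/dev/null | grep -i runtimes
# expected, roughly:  Runtimes: io.containerd.runc.v2 nvidia runc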

 

Running the CUDA test container still results in the same error message, but dmesg is slightly different:

 

[ 2057.563936] __vm_enough_memory: pid: 2461, comm: nvc:[driver], no enough memory for the allocation
[ 2057.563941] __vm_enough_memory: pid: 2461, comm: nvc:[driver], no enough memory for the allocation
[ 2057.563946] __vm_enough_memory: pid: 2461, comm: nvc:[driver], no enough memory for the allocation

 

 

3 Replies
DCasota
Expert

Hi @_paddy_ ,

A very comprehensive step-by-step guide! Congrats on the research and findings.

 

The NVidia driver page shows a hint about a workaround for a runc issue affecting the currently second-latest driver version they published. You've used that version.

 

(screenshot of the runc workaround note on the NVidia driver page)

In the NVidia docs, Step 3 runs the tests explicitly with sudo. Did that not work?

 

According to https://github.com/opencontainers/runc/issues/3708 the issue has been resolved recently.
The runc spec file for Photon OS 5.0 currently contains 1.1.4, while the latest runc release is 1.1.7; see the release notes about systemd v240+ and DeviceAllow.

 

If it's not possible to go with the latest ESXi/Photon/NVidia driver/CUDA/container-toolkit combination, try an older one.



Assuming the latest runc version fixes the issue completely, building Photon OS with the latest runc could be a possibility too, but unknown dependencies may be difficult to resolve. The Photon OS team can help; you could ask on https://github.com/vmware/photon/issues to prioritize the newer runc release.

 

About the script: EOF is a heredoc delimiter, and it should be unique. You could enumerate each block:

 

cat << EOF1 >>/etc/yum.repos.d/nvidia-container-toolkit.repo
[libnvidia-container]
name=libnvidia-container
baseurl=https://nvidia.github.io/libnvidia-container/centos7/x86_64
gpgcheck=0
enabled=1
EOF1

 

 

 

 

_paddy_
Contributor

Thanks for your research!

I've been running all my commands while logged in as root, so haven't been using `sudo`.

Trying those steps in Photon OS 5 gives the same result:

 

root@Photon5 [ /etc/pki/rpm-gpg ]# cat /etc/photon-release
VMware Photon OS 5.0
PHOTON_BUILD_NUMBER=4d5974638
root@Photon5 [ /etc/pki/rpm-gpg ]# nvidia-smi
Fri Apr 28 08:42:09 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro P620         Off  | 00000000:0B:00.0 Off |                  N/A |
| 38%   46C    P0    N/A /  40W |      0MiB /  2048MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
root@Photon5 [ /etc/pki/rpm-gpg ]# sudo ctr run --rm -t \
    --runc-binary=/usr/bin/nvidia-container-runtime \
    --env NVIDIA_VISIBLE_DEVICES=all \
    docker.io/nvidia/cuda:11.6.2-base-ubuntu20.04 \
    cuda-11.6.2-base-ubuntu20.04 nvidia-smi
ctr: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: driver rpc error: failed to process request: unknown

 

 

[  780.689512] __vm_enough_memory: pid: 1861, comm: nvc:[driver], no enough memory for the allocation
[  780.689529] __vm_enough_memory: pid: 1861, comm: nvc:[driver], no enough memory for the allocation
[  780.689535] __vm_enough_memory: pid: 1861, comm: nvc:[driver], no enough memory for the allocation

 

I've raised an issue on the Photon GitHub to move to runc 1.1.7.

 

In the meantime I will try some older drivers and see if they work.

 

DCasota
Expert

Hi @_paddy_ ,

a remark about

# System update
tdnf -y update
tdnf -y upgrade

A blanket system update/upgrade always puts me into "testing" territory rather than moving in the direction of "resilience".

For example, on Photon OS 4.0 rev2, runc was updated to 1.1.1 on May 13th 2022 and to 1.1.4 on October 18th 2022, and the current 1.1.4-X release is from March 23rd 2023. According to the runc 1.1.7 release notes, the issue described began with 1.1.3.

Hence, to improve reproducibility, it is better to specify the sequence of updates by package release and, where reasonable, to double-check the behaviour with root privileges as well. This is a learning note for myself, too.
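
For example, instead of a blanket tdnf upgrade, the runc builds could be inspected and pinned explicitly (the version string below is a placeholder, and the name-version install syntax is assumed to behave as in yum/dnf):

tdnf list runc                 # show installed and available runc builds
tdnf install runc-1.1.1-1.ph4  # hypothetical build string: pin a known-good release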

Best of luck with your project.
Daniel
