tomtomek
Enthusiast

Unable to load NVIDIA driver on ESXi 7.0.2 / Dell R740


Hi all,

I am trying to configure a POC environment and am currently stuck getting the NVIDIA driver to load.


[root@localhost:~] nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

[root@localhost:~] esxcli software vib list | grep -i nvidia
NVIDIA-VMware_ESXi_7.0.2_Driver  470.63-1OEM.702.0.0.17630552         NVIDIA  VMwareAccepted    2021-09-13

[root@localhost:~] dmesg | grep -E "NVRM|nvidia"
2021-09-13T23:36:11.216Z cpu0:2097152)Loading nvidia_b.v00...
2021-09-13T23:36:11.217Z cpu0:2097152)VisorFSTar: 1871: nvidia_b.v00 for 0x5e18082 bytes
2021-09-13T23:36:43.142Z cpu93:2100393)SchedVsi: 2098: Group: host/vim/vmvisor/plugins/nvidia(18804): max=70 min=70 minLimit=unlimited shares=1000, units: mb
2021-09-13T23:36:43.182Z cpu80:2098541)Starting service nvidia-init
2021-09-13T23:36:43.243Z cpu80:2098541)Activating Jumpstart plugin nvidia-init.
2021-09-13T23:36:51.802Z cpu56:2098541)Jumpstart plugin nvidia-init activated.
2021-09-13T23:36:52.801Z cpu42:2100998)SchedVsi: 1016: Group nvidia could not be created: Already exists

The host is a Dell R740. I have enabled SR-IOV in the BIOS and disabled the built-in video controller. I also changed the MMIO base to 12 TB.

Thanks.

 
 
 
Tom

Accepted Solutions
tomtomek
Enthusiast

I can confirm that after updating to the latest vCenter and ESXi, the issue is no longer present!

 

Thanks for your help!


21 Replies
fabio1975
Expert

Hi,

What is the model of the NVIDIA card?

Fabio
BLOG: https://vmvirtual.blog

if satisfied give me a kudos
tomtomek
Enthusiast

Sorry, I should have said: it's an A40.

a_p_
Leadership

Please check whether the card is configured for Passthrough in the ESXi host's settings.

André

tomtomek
Enthusiast

Hi there,

Yes, it is:

tomtomek_0-1631605692739.png

Unless I need to do the below on the host as well? It errors out anyway.

 

tomtomek_1-1631605775617.png

 

fabio1975
Expert

Hi,

Remove the GPU from running in Passthrough, and use a vGPU Profile instead. Then run nvidia-smi again.
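
Before re-running nvidia-smi, it can help to confirm from the ESXi shell that the driver's kernel module is actually loaded. The following is only a sketch: it assumes the NVIDIA VIB registers a VMkernel module whose name contains "nvidia", and the helper name nvidia_module_loaded is made up for illustration.

```shell
# Hypothetical helper: succeeds if a loaded VMkernel module has "nvidia"
# in its name. `vmkload_mod -l` lists the currently loaded modules.
nvidia_module_loaded() {
    vmkload_mod -l | grep -qi nvidia
}

# Usage on the host, after taking the GPU out of passthrough and rebooting:
#   nvidia_module_loaded && nvidia-smi
```

If the module is not listed, nvidia-smi is expected to fail with the "couldn't communicate with the NVIDIA driver" message regardless of which VIB is installed.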

 


Fabio
BLOG: https://vmvirtual.blog

if satisfied give me a kudos
tomtomek
Enthusiast

Hi Fabio,

 

OK, I removed the passthrough and now get the nvidia-smi output below:

[root@localhost:~] nvidia-smi
Tue Sep 14 10:53:42 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63       Driver Version: 470.63       CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A40          Off  | 00000000:3B:00.0 Off |                    0 |
|  0%   31C    P0   101W / 300W |      0MiB / 45634MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

 

But when I try to select a vGPU profile, the list is empty. I may try rebooting the host again.

 

tomtomek_0-1631609697317.png

 

fabio1975
Expert

OK, try another reboot.

In the meantime, which vSphere licenses do you have? Also check how Hardware Graphics is configured on the host:

fabio1975_0-1631610445064.png
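
The same host graphics setting can also be read from the ESXi shell instead of the UI. A minimal sketch, assuming `esxcli graphics host get` prints a "Default Graphics Type:" line on this build; the helper name default_graphics_type is hypothetical.

```shell
# Hypothetical helper: extract the "Default Graphics Type" value from the
# output of `esxcli graphics host get`, read from stdin.
default_graphics_type() {
    sed -n 's/^ *Default Graphics Type: *//p'
}

# Usage on the host:
#   esxcli graphics host get | default_graphics_type
# vGPU profiles need "SharedPassthru" (shown as "Shared Direct" in the UI).
# If the value is "Shared", it can be changed with:
#   esxcli graphics host set --default-type SharedPassthru
```

A change of the default type normally only takes effect after the host's Xorg service is restarted or the host is rebooted.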

 

Fabio
BLOG: https://vmvirtual.blog

if satisfied give me a kudos
tomtomek
Enthusiast

Hi Fabio,

 

After removing the card from passthrough and rebooting the host, I can now see the profiles. Thank you for your help.

 

Regards,

Tom

tomtomek
Enthusiast

I am now seeing the error below when trying to power on the VM with a vGPU configured. It only appears in vCenter; I can power the VM on fine directly from the ESXi host.

 

Any suggestions?

 

Task: Power On virtual machine
Target: TESTVM
Status: The operation is not allowed in the current state of the host.

tomtomek
Enthusiast

Just to add to this: if I power on the machine from the ESXi host, it starts fine. I managed to configure the card there and obtain a license from the NVIDIA license server. When I try to start the VM from vCenter, it fails. I need to fix this before deploying desktop pools. Any help is greatly appreciated.

fabio1975
Expert

Hi,

Which vSphere licenses do you have installed? What error do you see when starting the VM from vCenter?

 

Fabio
BLOG: https://vmvirtual.blog

if satisfied give me a kudos
tomtomek
Enthusiast

Hi Fabio,

Error is:

The operation is not allowed in the current state of the host.

Licenses:

VMware vCenter Server 7 Standard

vSphere 7 Desktop Host

Regards,

Tom

fabio1975
Expert

Hi,

It could be a communication problem between the ESXi host and the vCenter.

Try disconnecting and reconnecting the ESXi host to the vCenter.

 

 

Fabio
BLOG: https://vmvirtual.blog

if satisfied give me a kudos
tomtomek
Enthusiast

I did try this, but it did not resolve the issue. The issue only exists when the vGPU is added; without the vGPU, the machine starts up fine from vCenter.

fabio1975
Expert

Hi,

Can you check the VM log and check if there are any errors?

Locating virtual machine log files on an ESXi host (1007805) (vmware.com)
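
Once the log file is located, a quick filter can narrow it down to the relevant lines. This is a rough sketch only: the helper name vm_log_errors is hypothetical, the path in the usage comment is a placeholder, and the search terms assume that vGPU plugin messages carry a "vmiop" prefix in vmware.log.

```shell
# Hypothetical helper: print vGPU plugin lines and error-like lines from a
# VM log file, each prefixed with its line number.
vm_log_errors() {
    grep -inE 'vmiop|nvidia|error|fail' "$1"
}

# Usage on the host (replace the placeholders with the real path):
#   vm_log_errors /vmfs/volumes/<datastore>/<vm-name>/vmware.log
```

If nothing is printed after a failed power-on, the request probably never reached the point where the VM process writes to its log.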

 

 

Fabio
BLOG: https://vmvirtual.blog

if satisfied give me a kudos
tomtomek
Enthusiast

I tried that yesterday. The problem is that nothing is appended to the log when I try to power on the machine from vCenter. Is there a location on the vCenter server that has more detailed logging of vCenter activity?

fabio1975
Expert

Hi,

Do you have DRS enabled?
Have you already tried to remove the VM from the vCenter inventory and put it back?
Can you send me the screenshot of the HW assigned to the VM?

 

 

Fabio
BLOG: https://vmvirtual.blog

if satisfied give me a kudos
tomtomek
Enthusiast

Hi Fabio,

 

I just checked with colleagues from NVIDIA and noticed that vCenter is not on the latest version. I am updating it now to see whether this resolves the issue. Once it is updated, I will provide further feedback.

 

Thanks for your help so far!

Regards,

Tom

tomtomek
Enthusiast

I can confirm that after updating to the latest vCenter and ESXi, the issue is no longer present!

 

Thanks for your help!
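
For anyone comparing versions before and after patching, the host build number can be pulled out on the ESXi shell. A small sketch, assuming `vmware -v` prints a line of the form "VMware ESXi <version> build-<number>"; the helper name esxi_build is hypothetical.

```shell
# Hypothetical helper: extract the numeric build from the one-line output
# of `vmware -v`, read from stdin.
esxi_build() {
    sed -n 's/.*build-\([0-9][0-9]*\).*/\1/p'
}

# Usage on the host:
#   vmware -v | esxi_build
```

The vCenter version and build are shown separately, for example on the vCenter summary page in the vSphere Client.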
