Memnarch
Enthusiast

RTX 4090 GPU passthru, esxi 8.0

I'm working on passing an RTX 4090 GPU to a VM on an Intel 13900k system.

Things I did first: enabled VT-d and >4G addressing in the BIOS (no ACS option; it appeared to be enabled by default).

No problem turning on passthrough and assigning it to a VM. I'm also passing through some NVMe drives and a Renesas USB controller, which all seem to work. EFI firmware mode, Windows 11, 48 GB RAM (out of 96 on the host), 8 cores (out of 24).

VM won't power on without 64-bit MMIO enabled and a size of at least 64 GB (as expected for a 24 GB card).
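For reference, the two advanced settings involved look roughly like this in the .vmx (my values; the MMIO size just needs to cover the card's BAR space):

pciPassthru.use64bitMMIO = "TRUE"
pciPassthru.64bitMMIOSizeGB = "64"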

 

VM then powers on and works in Windows. BUT, the first time after each host boot, the VM spontaneously dies (powers off) within a few seconds of getting to the login screen. The VM log gives a message "attempted to map 65000 pages to host memory" and recommends setting pciHole.start to 1536.
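For clarity, that recommendation corresponds to a single .vmx line, roughly:

pciHole.start = "1536"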

If I do that, the VM doesn't power off as above, BUT the Windows OS inside it dies at the same point and the system reboots (successfully).

With or without pciHole.start, the VM can then be restarted and seems to work fine. I have seen it blow up spontaneously once in a few days of testing, otherwise rock solid. Only the first power-up after booting the physical host seems affected.

If I turn "Resize BAR" off in the BIOS, the recommended pciHole address changes, but otherwise it behaves as above. ReBAR works in the VM OS if enabled in the BIOS.

 

Several other VMs on the same host, using 2070 GPUs, don't show this behavior (and also don't require 64-bit MMIO).

The same Windows OS with the 4090 boots on the same hardware (without ESXi) and works fine.

Any ideas?

 

Thanks for thinking about it.

4 Replies
RobBenedit
Enthusiast

First of all, congrats on getting your hands on this video card! However, the first hurdle is making it past the VMware HCL, and if you manage to get this setup working on ESXi 8, then your next hurdle is FPS and mouse response in gaming.

For example, it's going to make you an easy target in a first-person shooter scenario.

-r   

Rob-o Benedit
Memnarch
Enthusiast

So, after considerable debugging...

TL;DR

I think it's a bug in how ESXi releases the console-claimed GPU to a VM, and there is a workaround.

Turn off boot display with

esxcli system settings kernel set -s vga -v FALSE

(Undo with: esxcli system settings kernel set -s vga -v TRUE)
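If you want to confirm the current value afterwards, listing the kernel setting should show it (same "vga" option, just queried instead of set):

esxcli system settings kernel list -o vga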

VM is now stable including on first boot.

Long version:

 

ESXi 7.0 had a "bug" where the display output claimed for the hypervisor console itself had to be manually re-enabled for passthrough after every reboot (in 6.7 this wasn't needed). This was reportedly later changed back to the 6.7 behavior. The command line above stops ESXi from claiming any GPU for the console at all, so during the early 7.0 releases it removed the need to manually re-enable a GPU for passthrough on every boot. The need for the command line reportedly went away once the behavior was changed back to the 6.7 style, where passthrough remains enabled across reboots and the VM simply takes over the GPU automatically on its first boot.

 

Using this command line appears to completely fix the problem described in the initial post, where the VM spontaneously combusts on its first boot after a host boot, and nothing else I tried does. It's probably not a coincidence that the 4090 is in the primary graphics slot and ESXi does in fact claim it for the console unless I force it not to. (I never had to re-enable passthrough on boot, so I didn't see any need for this command line until I tested whether it solved the problematic behavior.)

Source: https://williamlam.com/2020/06/passthrough-of-integrated-gpu-igpu-for-standard-intel-nuc.html

Beware: the setting persists across boots. It is reversible, but you can run into trouble if your host loses network connectivity for some reason, because you will then have no console AND no network access, and that usually means reinstalling ESXi to recover. In particular, the combination of "no console" and an off-HCL network driver is a very dicey idea, since those drivers often require resetting the network configuration and host services for minor changes. You have been warned.

 

fatbob01
Contributor

Did you run into Code 43 in the VM? I've been trying to pass through a 3060 for a few days on ESXi 8, no dice. Do you think disabling the boot display will resolve this? Tried setting hypervisor.cpuid.v0 = "FALSE", didn't work. Tried to edit /etc/vmware/passthru.map from bridge to link, same issue. Any help would be appreciated!
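For reference, the passthru.map edit I mean is the NVIDIA reset-method line; if I remember the file format correctly (vendor ID, device ID, reset method, fptShareable), it looks roughly like this after changing bridge to link:

# NVIDIA
10de ffff link false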

Jukari
Contributor

Did you ever get your setup working? I can't get direct access working at all; I just get a second VGA adapter, but for what I'm doing I need direct access to the CUDA cores.

Depending on how I install the driver, when I run "nvidia-smi", I get either:

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

or

No devices were found

____________________________________________________________________________________________

Here's what my devices look like:

*-display
description: VGA compatible controller
product: SVGA II Adapter
vendor: VMware
physical id: f
bus info: pci@0000:00:0f.0
logical name: /dev/fb0
version: 00
width: 32 bits
clock: 33MHz
capabilities: vga_controller bus_master cap_list rom fb
configuration: depth=32 driver=vmwgfx latency=64 resolution=1176,885
resources: irq:16 ioport:840(size=16) memory:f0000000-f7ffffff memory:ff000000-ff7fffff memory:c0000-dffff
*-display
description: VGA compatible controller
product: NVIDIA Corporation
vendor: NVIDIA Corporation
physical id: e
bus info: pci@0000:02:05.0
version: a1
width: 64 bits
clock: 33MHz
capabilities: pm msi pciexpress vga_controller bus_master cap_list
configuration: driver=nvidia latency=64
resources: irq:18 memory:fd000000-fdffffff memory:c0000000-cfffffff memory:d0000000-d1ffffff ioport:a80(size=128)

____________________________________________________________________________________________

Here's my build guide so far:

###BIOS
VT-d (Enabled)
SR-IOV (Enabled)

###Hypervisor Video Turned Off
esxcli system settings kernel set -s vga -v FALSE
(Undo with: esxcli system settings kernel set -s vga -v TRUE)

###System Settings
30 GB RAM (reserved)
2 CPUs (x1 Core)

###Assigned PCI Device to VM
RTX 4090
Audio Device

###VM Options
pciPassthru.use64bitMMIO = TRUE
pciPassthru.64bitMMIOSizeGB = 64
hypervisor.cpuid.v0 = FALSE
pciHole.start = 1536
pciHole.end = 2200

###Minimal Install + No Drivers
sudo apt-get install openssh-server
sudo apt-get update && sudo apt-get upgrade

###Install Nvidia Requirements
sudo apt install build-essential
sudo apt install pkg-config libglvnd-dev


###Shutdown GUI
sudo systemctl set-default multi-user.target
sudo telinit 3
sudo reboot
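(Optional sanity check, assuming systemd; it should print multi-user.target:)
systemctl get-default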

###Installed Nvidia Drivers
sudo apt install nvidia-driver-525 nvidia-dkms-525

or
sudo sh NVIDIA-Linux-x86_64-535.54.03.run
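After the driver install, this is how I check whether the nvidia module actually claimed the card (plain lspci/nvidia-smi, nothing exotic):

lspci -nnk | grep -iA3 nvidia
nvidia-smi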

____________________________________________________________________________________________

I've tried several combinations of the things above to get it to detect, but VMware absolutely refuses to pass the RTX 4090 through directly. Need help, been working on this for days now. Thanks for any tips/information.
