VMware Cloud Community
Memnarch
Enthusiast

Threadripper 3970X GPU and USB Passthrough, ESXi 6.7 U3, NVIDIA RTX 2080, TRX40

Hi all -- wanted to describe progress on an update to my former Threadripper system.

Starting point: four VMs on a Threadripper 1950X, each with GPU passthrough (1 x 2080, 3 x 2070), 64 GB RAM, ESXi 6.7 U3. The system was quite stable (see prior thread).

Target: Threadripper 3970X (double the cores), 128 GB RAM, on an ASRock Creator TRX40 motherboard.

I started by validating the new hardware under a temporary (non-virtualized) Windows build. Everything worked.

BIOS settings used: defaults, except:

Changed some fan settings to make them quieter

Turned on XMP

Turned on SR-IOV

Left PBO off (this is the default, but I explicitly set it to Disabled; PBO sucks up huge amounts of power for little performance benefit, to say nothing of validation!)

Used the current release BIOS version, not the beta for the 3990X.

ESXi installation: reused the previous installation. It had passthru.map entries for AMD and NVIDIA as detailed in my last post, along with the previously recommended EPYC configuration change (which I removed) and the Aquantia NIC driver preinstalled.

Moved the 2 x M.2 SSDs from the old system into the new one. The system booted nicely into ESXi. All hardware passthrough vanished, as expected. Of note, *neither* of the NICs on this board has a native driver. I used the Aquantia driver and live off the 10 Gb Aquantia port; I have no idea if there is a Realtek Dragon 2.5G driver out there.

Redid the hardware passthrough. All GPUs passed back through to their VMs, yay. Only two of them would boot, boo. Eventually, after much gnashing of teeth, I remade three VMs from scratch: the originals would all keep crashing immediately upon booting Windows.

This was interesting. ESXi would report that the GPUs had violated memory access and advise adding a passthru.map entry, which didn't fix the problem. Changing the host BIOS to remove CSM support and enable Above 4G Decoding, and enabling 64-bit MMIO in the VM, didn't fix it either. A new VM with a fresh Windows install worked.
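For anyone following along, the 64-bit MMIO change mentioned above is a per-VM edit in the .vmx file (or Advanced Settings in the UI); a minimal sketch, where the 64 GB size is an assumed example for a single large-BAR GPU, not a measured value:

```
# .vmx entries for 64-bit MMIO with a passed-through GPU
pciPassthru.use64bitMMIO = "TRUE"
# Must cover the total BAR space of all passed-through devices,
# rounded up to a power of two; 64 is an assumed example value.
pciPassthru.64bitMMIOSizeGB = "64"
```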

There were several other interesting changes from previous system:

Disabling MSI on the GPUs made them keep crashing, unlike on the previous system, where it fixed stuttering.

No CPU pinning or NUMA settings were used or needed.

The mystical cpuid.hypervisor setting remains required to avoid error 43
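For reference, the setting in question is (to the best of my knowledge) this .vmx entry, which hides the hypervisor CPUID bit so the GeForce driver doesn't refuse to start with Code 43:

```
# Hide the "running under a hypervisor" CPUID bit from the guest.
hypervisor.cpuid.v0 = "FALSE"
```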

With these caveats, I got four bootable VMs, each using the NVIDIA card's own USB-C connector for keyboard/mouse, with 8 CPUs per VM. Which led to the next problem, which I haven't been able to solve:

The mice/keyboards would all intermittently freeze for moments to minutes, and sometimes not come back. Lots of testing inside Windows showed no cause. Interestingly, the problem was 1) worse with a high-end G502 mouse, and 2) much worse inside the Windows UI -- it never happened, for example, in demanding real-time full-screen apps. I was sure it was going to be some bizarre Windows problem. Rarely (every few hours), systems would crash completely (while idle!) with the same memory access violation. Also, rebooting one of the VMs would make the other VMs momentarily stutter. None of this ever happened on the 1950X system, where these controllers were reliable.

I eventually worked around the problem with the motherboard's USB controllers. There are 5: 2 x Matisse, 2 x Starship, and 1 x ASMedia. The Matisse ones are lumped into the same IOMMU group and won't pass through (they are perpetually "reboot needed"). The ASMedia chip worked with no problems (USB-C port on the back of the motherboard). The Starship USB 3.0 controllers both worked IF you had a passthru.map entry moving them to the d3d0 reset method. Otherwise, booting a VM with one of these controllers failed AND crashed a different VM with a GPU memory access violation, and the controller then permanently disappeared until the system was powered down (not just rebooted). Wow, talk about bad crashes.
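A sketch of the passthru.map entry format, for anyone who hasn't edited it before. The device ID below is a placeholder; substitute the actual ID your host reports for the Starship USB 3.0 controller in its PCI device list (vendor 1022 is AMD):

```
# /etc/vmware/passthru.map
# format: <vendor-id> <device-id> <reset-method> <fptShareable>
# 148c is a placeholder device ID -- use the one your host reports.
1022  148c  d3d0  default
```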

Using these three motherboard controllers on three VMs appears rock stable (I haven't tested the fourth yet). One of them has 64-bit MMIO enabled, which probably isn't needed.

Things I haven't gotten around to testing yet:

1. Does isolating the VM to one CCX fix anything?

2. If only one VM is running, does the NVIDIA USB-C controller become reliable?

3. Does turning off XMP or using the latest beta BIOS change anything?

Other advice -- I'm obviously waaay off the HCL here -- but don't even try DRAM-less SSDs. The datastore *vanishes* under high load. Bad. The same thing happened with my OEM Samsung until I updated the firmware, but that's another story, well documented elsewhere.

I'm really puzzled by the NVIDIA USB-C thing. It would also be nice if the Matisse controllers worked. Otherwise I'm mostly pleased -- many of the kludges needed on older ESXi versions and on the 1950X with its wacky NUMA configuration are no longer needed, and the new system is *much* faster.

Hope this helps someone else. If anyone can tell me what's going on (or at least that it's not just me), it would be much appreciated. I speculate it's a BIOS bug.

Thanks LT

1 Solution

Accepted Solutions
Memnarch
Enthusiast

I think I've found the answer, and it looks like an issue that has nothing to do with VMware. I've found multiple other examples of non-virtualized users having the same problem.

Proposed solutions:

https://www.nvidia.com/en-us/geforce/forums/geforce-graphics-cards/5/332481/type-c-port-on-my-2080ti...

"TLDR: Make sure your PCI Express Link State Power Management savings mode is set to off! After changing this, all my issues went away; it also helped to switch from Balanced to the High Performance profile, which changes this setting for you. https://www.sevenforums.com/tutorials/292971-pcie-link-state-power-management-turn-off-windows.html 

Additionally, just to ensure this, I also set NVIDIA's root USB controller to uncheck "allow the computer to turn off this device to save power". I also moved the device to a powered USB hub. https://felixwong.com/2015/03/solved-microsoft-wired-keyboard-200-disconnects-and-makes-usb-connecti... "
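The first quoted suggestion can also be applied from an elevated Windows command prompt instead of the Power Options UI; a sketch using powercfg's built-in aliases:

```
:: Turn PCI Express Link State Power Management (ASPM) off for the
:: active power scheme, on both AC and DC, then re-apply the scheme.
powercfg /setacvalueindex SCHEME_CURRENT SUB_PCIEXPRESS ASPM 0
powercfg /setdcvalueindex SCHEME_CURRENT SUB_PCIEXPRESS ASPM 0
powercfg /setactive SCHEME_CURRENT
```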

Neither of the above actually solved the issue for me. But the following did:

NVIDIA Control Panel, Manage 3D Settings, Power management mode --> Prefer maximum performance.

Presto, stutter gone. I'm including choices 1 and 2 above because they seemed to help many other people with a similar issue.

With this fix, the RTX GPUs are very convenient -- each VM has its own USB-C controller built in.

Another (no longer needed) workaround: https://www.virtuallyghetto.com/2020/05/how-to-passthrough-usb-keyboard-mouse-hid-and-ccid-devices-t...


but see all the comments on that article for the "fine print" that actually makes it work. I found that approach worked for mouse and keyboard, but passing through a USB audio controller gave me massive lag. You can do audio over HDMI from the GPU instead, though. I much prefer passing through a whole USB controller; it's much simpler to do.

LT


4 Replies
Memnarch
Enthusiast

Some additional data--

Disabling XMP, updating the BIOS to the latest beta version, shutting down all other VMs, changing the test VM to high latency sensitivity pinned to a single CCX (4 cores) with full CPU reservation, and deleting all unused virtual devices on the VM did not fix the intermittent NVIDIA USB device freezing problem.
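For anyone wanting to reproduce the test, those settings map (as far as I know) to these .vmx options; the CPU list is an assumption for the first CCX's logical CPUs on this topology:

```
# High latency sensitivity (requires full CPU + memory reservation).
sched.cpu.latencySensitivity = "high"
# Pin vCPUs to one CCX; 0-7 assumes the first four cores with SMT --
# adjust for your own topology.
sched.cpu.affinity = "0,1,2,3,4,5,6,7"
```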

No Windows error messages when the mouse freezes; Device Manager shows no errors.

No corresponding errors in VMware logs.

Still stumped.


sprange
Contributor

Older thread but still relevant. I have essentially the same hardware and I'm literally in USB hell. Passthrough of onboard USB controllers causes the VM (Windows 10 with an NVIDIA GPU) to simply not boot if a device is attached to the controller -- even the USB-C on the RTX 20xx video cards. For example, I can boot up a VM, attach a USB-C hub to the video card, and the mouse and keyboard will work fine; reboot, and the VM will hang (but not crash the host). Remove the USB-C hub and it's fine. HID passthrough is flaky at best, and I have no idea how to manage multiple exceptions for the /bootbank/boot.cfg file. I literally cannot get the VM to boot if any USB device is attached to a passed-through controller. I haven't tried the d3d0 option for the Starship controllers yet. A PCIe card with the ASMedia 1142 chipset won't work either; actually, it doesn't work at my office on ESXi-approved servers either, so no surprise. I may end up buying some Ethernet-to-USB hub servers. Not ideal, so I'm hoping for any guidance! Thanks

sprange
Contributor

As an update, similar problems were mentioned about three years ago with 6.7, and the solution was to install Windows 10 with BIOS firmware instead of EFI. I set up another VM with an RTX 2060 Super, and the USB-C port on the video card worked fine. With the recommended changes to the NVIDIA driver (max performance), the mouse and other USB devices work fine, and the VM boots fine even with USB devices attached. So perhaps that is the issue...
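For anyone trying the same workaround: the firmware choice is set when creating the VM (Boot Options in the UI), and corresponds (as far as I know) to this .vmx entry:

```
# Use legacy BIOS firmware for the guest instead of EFI.
firmware = "bios"
```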
