VMware Cloud Community
Memnarch
Enthusiast

Threadripper ESXi 6.7 GPU and USB passthrough, experience progress and problems

Hello all-- thought I'd pass through (pun intended!) my experience so far getting a multiheaded ESXi box up and running on an AMD Threadripper platform.

Hardware:

ASRock X399 Professional Gaming motherboard

AMD Threadripper 1950X

64GB DDR4 RAM

1TB NVMe M.2 drive

NVIDIA GTX 1080 FE

Software:

ESXi 6.7

Windows 10 64-bit guest

Steps that have worked so far:

I can get a VM with GPU passthrough working. In order to do this, I had to:

Upgrade the BIOS to the beta version (AGESA update)-- otherwise the VM won't power up.

Set the mystical hypervisor.cpuid.v0 flag in the VM configuration, to avoid error 43 from the NVIDIA driver in Windows (exact line in the snippet after this list).

Edit passthru.map (I'm using d3d0, not bridge; others have used link). On my former build, not doing this yielded a complete host hang when restarting a VM. (Example entry in the snippet after this list.)

Turned on all IOMMU options, ACS, and SVM in the BIOS.

The VM must use BIOS firmware; otherwise it hangs at the Windows loading screen if USB passthrough is enabled.
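
For anyone who wants the exact lines, here's a minimal sketch of the two edits above. The vmx flag is the standard one; the passthru.map entry uses NVIDIA's vendor ID (10de) with the device-ID wildcard to match my d3d0 choice-- substitute whatever IDs your own card reports.

In the VM's advanced configuration parameters:

hypervisor.cpuid.v0 = "FALSE"

In /etc/vmware/passthru.map (columns: vendor-id device-id reset-method fptShareable):

# NVIDIA: d3d0 reset instead of the stock bridge entry
10de ffff d3d0 false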

The board has 3 USB controllers that I can see. Each appears to be associated with two other devices, a "nonessential device" and a platform security processor, in addition to the controller itself.

I can't get the Aquantia 10GbE to pass through, not that I've tried very hard. It seems to get stuck at "enabled but needs reboot" despite infinite reboots.
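
(Aside: to see what's behind each device, and to grab the vendor/device IDs you'll need for passthru.map, the host shell has stock commands for this; esxcli lists a "Passthru Capable" flag per device, if I recall correctly:)

esxcli hardware pci list | more
lspci -v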

Current problem:

I can pass through a USB controller and it works fine. If I then shut down or reboot the VM, it hangs on the Windows logo when starting again. From a cold boot, no problems.

Things that haven't worked:

Changing the PCIe switch mode in the BIOS to Gen 2 instead of Auto.

Idea: perhaps the USB controllers on this board (AMD Family 17h USB 3.0 controller?) also need a d3d0 reset when rebooting? Seems strange, but the bug is reminiscent of what the consumer NVIDIA boards do when not switched to bridge mode.

Hope my experience has been helpful to others; any advice would be much appreciated! Thanks, "LT"

14 Replies
Memnarch
Enthusiast

Hello all-- a few more notes, and progress.

The following didn't work:

Removing "USB Composite device" under windows device manager.

Passing through only the USB controller (not the other associated devices) and adding a passthru.map entry, d3d0 mode, for that controller only.

The following does seem to work:

Adding the USB controller, the associated platform security processor, and the "nonessential instrumentation" device to the VM, and then using a wildcard with the vendor ID (which is the same for all three devices) in passthru.map. I was reluctant to try this for fear that the wildcard could nuke other AMD devices somehow (same vendor?), but the VM now reboots happily with both USB and GPU passthrough, yay! (Example entry below.)
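
For reference, the entry looks something like this (1022 is AMD's vendor ID, and ffff is the device-ID wildcard-- so yes, it matches every AMD device, which is exactly what worried me):

# /etc/vmware/passthru.map -- vendor-id device-id reset-method fptShareable
1022 ffff d3d0 default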

Memnarch
Enthusiast

Further notes:

The Aquantia 10GbE controller, the Intel wireless controller, an add-in PCIe 3.0 board in the PCIe x1 slot, and the USB 3.1 controller (as compared to the 2 x USB 3.0 controllers, which do pass through) all won't pass through-- most of these sit perpetually at "needs reboot". I've read this can be caused by being behind a non-ACS-compatible PCIe switch. Setting the parameter that disables the ACS check to TRUE results in a non-bootable host. All BIOS parameters that I can identify (including "enable ACS") are enabled, and I am using the beta BIOS with AGESA 1.0.0.6. Stumped, although none of this is absolutely essential to me.

Another possibility (at least for the 3.1 controller) is that it seems to be in the same IOMMU group as a PCIe bridge that can't be enabled for passthrough (greyed out). The 3.1 controller also fails slightly differently: the VM won't power up, with a message that a device isn't passthrough capable.
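(For reference, the ACS-check parameter I mentioned above-- assuming I have the name right-- is the VMkernel boot option below; as I said, TRUE left my host unbootable, so handle with care:)

esxcli system settings kernel set -s disableACSCheck -v TRUE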

Of the 4 GPUs in the system (1080 FE, 2 x 1070 Ti FE, 1 x 1060 3GB), all can be passed through. One of the 1070 Tis gives the dreaded error 43, even with the hypervisor.cpuid flag set to FALSE. The others work very nicely.

I could replace one of the video boards with an AMD card and use software USB passthrough from the two working controllers, but it would be really nice not to have to do that (passing USB between VMs means that VMs become dependent on each other for their keyboards to work). And the error 43 just gets my goat because it is so poorly documented on the NVIDIA side.

Will post further updates; would appreciate any advice. Thanks

Memnarch
Enthusiast

And yet further progress--

The 1070 Ti with the error 43 turned out to be dead in a standalone single-GPU system too, so it was a board problem.

Was able to pass through Asus USB Bluetooth dongles as USB devices (not as PCI devices; done through a non-passthrough USB controller-- the very same 3.1 controller that can't be passed through anyway) to VMs, which could then attach Bluetooth keyboards and mice. However, I'm going to stick with VirtualHere, because I'd rather not deal with the input lag and batteries, and I can live with making some VMs dependent on others.

Passthrough of the onboard audio worked fine despite not passing through other items in its IOMMU grouping (like a SATA controller). I've never actually tested or used the SATA controllers; NVMe works fine.

The onboard Bluetooth eventually turned up as a USB device ("Intel USB device") and could be passed through, but not actually used in the VM (recognized as a Bluetooth controller, but it errored and wouldn't start).

juthi
Contributor

Hello Memnarch. Thanks for your posts-- interesting reading. We are in a similar position and are a hardware manufacturer, attempting to resolve this through trial and error, and we appreciate your efforts and public postings.

Q: How many passthrough devices do you currently have in your box?

According to the VMware docs, a max of 6 are permitted per VM:

Add a PCI Device

D3 and D0 are power states of the respective attached hardware (the d3d0 reset method cycles the device from D3 back to D0). Often, to save power, the main voltage rail is turned off for the respective PCIe function, and the device is then expected to remain active through the 3V3_VAUX rail; however, most designs do not make use of that power rail. Ethernet adapters may be an exception. I would not expect USB adapters to have such support.

From our review of USB host adapters: we purchased 5+ random adapters from Amazon and could not find a single board offering the "mandatory" USB current protection via a polymer fuse and/or a USB load switch to shut down the leg of a USB connector in case of excessive current draw. Rather, the only "current limit" was the PCB trace width, which will act like a fuse and is quite dangerous. Hoping your situation is different. Many motherboard-based USB ports do offer some level of current protection, but may be lacking on ESD protection.

From other reading, users report success with earlier versions of ESXi (e.g., 5.5 U3 appears to be OK), so not sure if that is an option for you.

VM with passthrough "freezes" entire ESXi box when shutdown/rebooting guest | ServeTheHome and Serve...

Memnarch
Enthusiast

4 GPU passthrough devices, 2 USB controllers, and 1 audio device, split among 4 VMs.

Using only motherboard USB now. It seems to work with d3d0; would you suggest changing it to something else?

Also, I'm getting random VM lockups under GPU load (not often-- on the order of once every few hours). The VM locks hard, the host is OK, but rebooting the VM puts it into error 43 with no GPU: I need to reboot the host to fix this. The VM with the GPU furthest from the CPU on the motherboard is most often affected. Don't know if this is a pciHole, msiEnabled, or AGESA problem... grrr.

I looked into the log files when rebooting the VM (into error 43). "Bad" reboots have this at the end (truncated). There's some earlier stuff about VMware Tools not loading, too. Anyone know what this stuff means? (This text doesn't appear on normal boots, only after a lockup that yields error 43 on VM reboot.)

Thanks LT

2018-07-05T21:49:31.734Z| vcpu-2| I125: Guest MSR write (0x49: 0x1)

2018-07-05T21:49:31.734Z| vcpu-4| I125: Guest MSR write (0x49: 0x1)

2018-07-05T21:49:31.734Z| vcpu-6| I125: Guest MSR write (0x49: 0x1)

2018-07-05T21:49:31.734Z| vcpu-3| I125: Guest MSR write (0x49: 0x1)

2018-07-05T21:49:31.734Z| vcpu-0| I125: Guest MSR write (0x49: 0x1)

2018-07-05T21:49:31.734Z| vcpu-5| I125: Guest MSR write (0x49: 0x1)

2018-07-05T21:49:32.836Z| svga| I125: SVGA disabling SVGA

2018-07-05T21:49:33.277Z| vcpu-7| I125: LSI: Invalid PageType [21] pageNo 0 Action 0

2018-07-05T21:49:43.510Z| vcpu-0| I125: Destroying virtual dev for scsi0:0 vscsi=8197

2018-07-05T21:49:43.510Z| vcpu-0| I125: VMMon_VSCSIStopVports: No such target on adapter

2018-07-05T21:49:43.511Z| vcpu-0| I125: DEVICE: Resetting device 'ALL'.

2018-07-05T21:49:43.511Z| vcpu-0| I125: USB: Per-Device Resetting device 0x200000050e0f0003

2018-07-05T21:49:43.511Z| vcpu-0| I125: Tools: ToolsRunningStatus_Reset, delayedRequest is 0x91DA8A4270

2018-07-05T21:49:43.511Z| vcpu-0| I125: Tools: Changing running status: 1 => 0.

2018-07-05T21:49:43.511Z| vcpu-0| I125: GuestLib Generated SessionId 11821741703570650877

2018-07-05T21:49:43.511Z| vcpu-0| I125: Checkpoint_Unstun: vm stopped for 3527 us

2018-07-05T21:49:43.511Z| vcpu-0| I125: CPU reset: hard (mode 0)

2018-07-05T21:49:43.512Z| vcpu-1| I125: CPU reset: hard (mode 0)

2018-07-05T21:49:43.512Z| vcpu-2| I125: CPU reset: hard (mode 0)

2018-07-05T21:49:43.512Z| vcpu-3| I125: CPU reset: hard (mode 0)

2018-07-05T21:49:43.512Z| vcpu-7| I125: CPU reset: hard (mode 0)

2018-07-05T21:49:43.512Z| vcpu-6| I125: CPU reset: hard (mode 0)

2018-07-05T21:49:43.512Z| vcpu-5| I125: CPU reset: hard (mode 0)

2018-07-05T21:49:43.512Z| vcpu-4| I125: CPU reset: hard (mode 0)

2018-07-05T21:49:43.518Z| vcpu-0| I125: SVGA: Unregistering IOSpace at 0x1070

2018-07-05T21:49:43.518Z| vcpu-0| I125: SVGA: Unregistering MemSpace at 0xe8000000(0xe8000000) and 0xfe000000(0xfe000000)

2018-07-05T21:49:43.519Z| vcpu-0| I125: SCSI: switching scsi0 to push completion mode

2018-07-05T21:49:43.519Z| vcpu-0| I125: Creating virtual dev for 'scsi0:0'.

2018-07-05T21:49:43.519Z| vcpu-0| I125: DumpDiskInfo: scsi0:0 createType=11, capacity = 536870912, numLinks = 1, allocationType = 2

2018-07-05T21:49:43.519Z| vcpu-0| I125: SCSIDiskESXPopulateVDevDesc: Using FS backend

2018-07-05T21:49:43.519Z| vcpu-0| I125: DISKUTIL: scsi0:0 : geometry=33418/255/63

2018-07-05T21:49:43.519Z| vcpu-0| I125: SCSIFilterESXAttachCBRCInt: CBRC not enabled or opened without filters, skipping CBRC filter attach.

2018-07-05T21:49:43.519Z| vcpu-0| I125: SCSIFilterSBDAttachCBRC: device scsi0:0 is not SBD. Skipping CBRC attach SBD way.

2018-07-05T21:49:43.567Z| vcpu-0| I125: PCIXHCI: Interrupt type changed from MSIX to INTX

2018-07-05T21:49:43.578Z| vcpu-0| I125: PCIBridge4: ISA/VGA decoding enabled (ctrl 0004)

2018-07-05T21:49:43.578Z| vcpu-0| I125: pciBridge4:1: ISA/VGA decoding enabled (ctrl 0004)

Memnarch
Enthusiast

Further notes--

The dreaded "error 43" after random lockups did NOT go away switching MB to an Asus zenith extreme.  However, after that board was updated to the latest bios (for the new threadrippers, although I'm using a 1950x) I have gotten zero error 43: random lockups still happen but the VM is rebootable from the host.

I am (still) getting random audio freezes periodically on all the VMs, which occasionally seem to result in these crashes. The freezes became much less frequent (but did not resolve entirely) after setting each VM to prefer CPU cores from a single die on the Threadripper (e.g., affinity for cores 0-7; see the snippet below). Still having occasional random crashes.
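
(For anyone trying the same thing, scheduling affinity can be set in the GUI under Edit Settings > CPU > Scheduling Affinity, or as a vmx line; "0-7" is just my first-die example:)

sched.cpu.affinity = "0-7"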

Disabling SMT in the BIOS strangely did not work-- ESXi still reports 32 cores with hyperthreading active. Strange. Will try disabling hyperthreading in ESXi itself (see below), but I would have thought the BIOS would be the definitive way to do that.
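
(The ESXi-side switch, for what it's worth, is the hyperthreading kernel option-- host reboot required:)

esxcli system settings kernel set -s hyperthreading -v FALSE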

Docop
Contributor

Hi

So, on your system, have you tested ESXi 6.5 U2? For many, all passthrough works there vs. 6.7. And more interesting to know about your NIC: the Aquantia AQC107-- was it working and listed as a vNIC, and can you move network data fine? I was looking at the onboard 10Gb NIC on new motherboards instead of an Intel X540, so if you can confirm this it would be very appreciated.

Thanks in advance

Memnarch
Enthusiast

The 1Gb NIC on this board works fine. Never tried to pass it through.

I don't have the included Aquantia PCIe board, so I couldn't tell you.

Using 6.7.

I'm embarrassed to say that a large chunk of problems disappeared with a CMOS clear; I probably had bad RAM training. Freezes and crashes have cleared up. Also, strangely, turning memory interleaving back to Auto DOUBLED my application performance (remember, I've got my VMs pinned to one CCX each).

The system works very nicely now. I would still like to figure out how to pass through the 3.1 USB controller (VM won't boot: "not a passthrough device") and the ASMedia USB controller (perpetually needs a reboot to enable passthrough).

Docop
Contributor

You said you have the X399 Prof., so it has 3 Ethernet ports, with the 10Gb being the red one, all onboard using the Aquantia chipset. In the NIC list, do you see 3 ports, and 1 at 10Gb?

For the USB, you can pass the whole controller. Are you booting from a USB key? Try installing to an SSD and then passing the whole controller through to a VM. You will probably end up with 1 full USB controller to pass to 1 VM. Let me know; I will try to check with one of my friends with another X399 board to see if...

Memnarch
Enthusiast

Hi--

I did start this thread on the X399 Prof., but I switched to the Asus Zenith Extreme partway through (as above). My recollection from the X399 is that the Aquantia did show up in the hardware list, but I never did much with it, because I have no 10G switch nor anything else with a 10G port.

I'm only passing through whole controllers at this point. I'd like to have 1 controller for each of the 4 VMs with GPUs, and there are 4 USB controllers on the Zenith. 2 of the 3 AMD controllers work; the AMD 3.1 controller won't allow the VM to boot when passed through, and the ASMedia is stuck in a perpetual "activated / needs reboot" no matter how many times I reboot-- so I only have 2 useful passthrough controllers for 4 VMs. I'm using software to pass individual devices from one of the VMs to the other 2, but this is not ideal.

D3DAiM
Contributor

I'm having a similar problem with my ASRock Z77 Extreme4 board + GeForce GTX 1060 3GB + Fresco Logic FL1100 USB controller.

If I pass through my 1060 + the Fresco Logic, it works from a cold boot, but it cannot handle more than 1 or 2 (if I'm lucky) shutdown/startup sequences before it crashes ESXi completely, to the point where the host stops pinging.

Interestingly, what I have found is that passing through the board's integrated ASMedia USB controller (it drives 2 USB ports on the back; the rest are Intel USB) instead works just fine!

I haven't found any way to fix this, and I've tried many different combinations of advanced settings. So I can't recommend this card-- find something else! Onboard USB 3.0 passthrough works fine.

Memnarch
Enthusiast

Hi all! OP here, a few more notes that I don't think I included above--

pciPassthru0.msiEnabled (where 0 is replaced by whichever device number is your GPU) must be set to FALSE in the advanced VM settings for the GPU closest to the CPU; otherwise you get constant crashing of the video driver (often with immediate restart and no bluescreen-- it looks like a slow computer with a screen that occasionally blinks). Beware: if you change or delete the device and then re-add it, you need to redo this.
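
In vmx terms (pciPassthru0 assumes the GPU is your first passthrough device; check the number in your own config):

pciPassthru0.msiEnabled = "FALSE"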

Very important post on changing the ESXi host configuration to prevent random core memory hopping. It was on reddit and made a substantial difference, reducing or eliminating microstutter on all the VMs; I also no longer needed to pin them to specific CCXs. The reddit post is titled "AMD EPYC on ESXi 6.5-6.7 NUMA issues: Mostly Resolved".
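
If I remember right, the key change from that post was this host-wide advanced setting (verify against the post itself before applying):

esxcli system settings advanced set -o /Numa/LocalityWeightActionAffinity -i 0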

Using 6.7 U1 now; VMs can use EFI instead of BIOS (which used to hang if USB passthrough was enabled).

Ran into a problem where any VM would slam to 100% disk usage under heavy load (say, a file download) and become almost unresponsive for >10 minutes after the download stopped. Eventually tracked it down to metadata corruption on the underlying VMFS. VOMA doesn't like VMFS 6 (it didn't last week, anyway), so I migrated to a new VMFS; problem solved.
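
(For the curious, the metadata check is run from the host shell with the datastore offline; the naa ID below is a placeholder for your own device and partition:)

voma -m vmfs -f check -d /vmfs/devices/disks/naa.xxxxxxxxxxxxxxxx:1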

Other notes:

The motherboard appears to retrain its BIOS (and renumber PCI devices) if you take any cards out or put any in, potentially breaking more VMs until you re-enable the devices for passthrough. You also need to reset lots of BIOS settings.

Very interestingly, I discovered my system was overheating from a dying Enermax Liqtech TR4 AIO. This caused really "interesting" features, like the system abruptly failing to POST after changing memory settings, even though it had been working fine a minute before. Even worse, if it did memory training while hot, it would set super low speeds. This sounds easy to diagnose, but with no way to check CPU temps outside of the BIOS it was slightly less straightforward. (Anyone know another way to check CPU temp under ESXi with a Threadripper, or from any of the VMs?)

Memnarch
Enthusiast

Hi all--- a few more notes.

tl;dr: Learn esxtop. Assigning more resources to VMs can result in lower performance.

I *finally* pinned down what seemed to be the last "stutter" problem. To put it briefly: running 4 VMs with 8 vCPUs each on a 16-core/32-thread CPU was a bad idea. Under load, I discovered up to 50% READY state in esxtop, which corresponded to massive stutter (it should be < 5%; this is a measure of how often the VM was ready to execute but forced to wait due to lack of host resources). I'm theorizing that the Win10 VMs didn't realize (weren't told) that those "8 vCPUs" were really 4 cores/8 threads. REDUCING the number of vCPUs on each VM from 8 down to 4 massively IMPROVED performance, with ready state < 0.5%. It also fixed the LatencyMon testing (the VMs now pass, yay!).
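
(For anyone chasing the same thing: esxtop opens in the CPU view by default ('c' switches back to it), and %RDY is the per-world ready time-- the usual rule of thumb is to stay under about 5% per vCPU:)

esxtop        # watch the %RDY column for each VM under load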

VMs with GPUs all have their RAM reserved. If you're running in NUMA mode, you probably want all of that RAM in one NUMA domain. The order in which you power on your VMs, and how much RAM they have, can affect ESXi's ability to do this. Again, more can be less. I haven't investigated UMA vs. NUMA comprehensively.
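
(The per-VM knobs involved, as far as I know-- the numbers below are examples only: the full memory reservation can be set in the GUI or via sched.mem.min, and a home node can be pinned with numa.nodeAffinity:)

sched.mem.min = "16384"       # reserve all 16 GB (value in MB)
numa.nodeAffinity = "0"       # keep the VM's memory on NUMA node 0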

Moved back to the ASRock X399 Professional Gaming MB; it works. Again, only 2 of the motherboard USB controllers pass through successfully. There seems to be a non-VMware driver available for the 10G NIC (Aquantia).

All VMs were moved to the virtual NVMe controllers (new since 6.5) successfully.

Peli67
Contributor

Can you please provide the string you added to passthru.map? I am in the same situation and I am a noob at this.

Thank you!
