VMware Cloud Community
rbdls
Contributor

VM hangs on startup with Nvidia T4 passthrough - tried everything

Hi guys,

I've been banging my head against a wall getting an NVIDIA Tesla T4 passthrough-enabled VM to boot. I have two ESXi hosts in a vSphere 8.0.0 setup (Enterprise Plus), each with one T4 card. These systems previously each ran a Quadro P620 via passthrough without issues. Moving to the T4 has been nothing but trouble. 😕

Both ESXi hosts boot properly and recognize the card, and I am able to enable passthrough on it in the vSphere UI as well as add it to a VM configuration. However, once I try to start the VM (on either host), it hangs at 88% and eventually errors out. vmware.log for the VM shows:

2023-01-25T19:27:18.723Z In(05) vmx - MX: init lock: rank(PCIPassLCK_0)=0x3e7 lid=26
2023-01-25T19:30:27.731Z In(05) vmx - AH Failed to find a suitable device for pciPassthru0
2023-01-25T19:30:27.731Z In(05) vmx - Module 'DevicePowerOn' power on failed.

Some more things:

  • The VM is set to boot via EFI and boots up fine without the GPU passthrough device added - stock Ubuntu 22.04 install.
  • I've tried both DirectPath I/O and Dynamic DirectPath I/O to pass the card through; no difference.
  • Embedded virtualization is not enabled in the VM.
  • All VM memory is reserved.
  • I have also tried enabling and disabling the IOMMU in the VM (under CPU).
  • Tried autodetecting the video card, and manually specifying it.
  • Have rebooted the hosts numerous times, including right after enabling passthrough.

I've also tried the below config parameters in the .vmx in varying combinations, with no success:

pciPassthru.use64bitMMIO="TRUE"
pciPassthru.64bitMMIOSizeGB="32" (as the card has 16 GB of memory)
pciPassthru0.msiEnabled = "FALSE"
hypervisor.cpuid.v0 = "FALSE"
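For the MMIO sizing in particular, my understanding of VMware's guidance for large-BAR passthrough devices is: add up the memory of all GPUs being passed through to the VM, then round up to the next power of two — which is how I arrived at 32 for a single 16 GB T4. A quick sketch of that rule (the function name is just mine; double-check against the KB):

```python
def mmio_size_gb(gpu_mem_gb):
    """Pick a pciPassthru.64bitMMIOSizeGB value: the next power of two
    strictly greater than the total GPU memory (in GB) being passed
    through. (My reading of the VMware guidance, not an official formula.)"""
    total = sum(gpu_mem_gb)
    size = 1
    while size <= total:
        size *= 2
    return size

print(mmio_size_gb([16]))      # single 16 GB T4 -> 32
print(mmio_size_gb([16, 16]))  # two T4s in one VM -> 64
```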

 

The host systems are each a Supermicro SuperServer 5019D-FN8TP running an up-to-date BIOS (v1.8), and this model is listed as supporting the T4 according to https://www.supermicro.com/en/support/resources/gpu -- now, I do have the GPU plugged into an x16 riser that adapts it to the x8 PCIe slot on the motherboard, but the T4 spec sheet says it supports PCIe 3.0 x8 as well as x16, so I didn't think this would be an issue.

BIOS is as follows:

Screenshot 2023-01-25 at 3.55.04 PM.png

 

The GPU shows up in the vSphere UI as follows:

Screenshot 2023-01-25 at 4.07.57 PM.png

 

Screenshot 2023-01-25 at 4.07.30 PM.png

 

GPU shows up fine on the host via `esxcli hardware pci list -c 0x300 -m 0xff`:

0000:65:00.0
Address: 0000:65:00.0
Segment: 0x0000
Bus: 0x65
Slot: 0x00
Function: 0x0
Vendor Name: NVIDIA Corporation
Device Name: TU104GL [Tesla T4]
Configured Owner: VM Passthru
Current Owner: VM Passthru
Vendor ID: 0x10de
Device ID: 0x1eb8
SubVendor ID: 0x10de
SubDevice ID: 0x12a2
Device Class: 0x0302
Device Class Name: 3D controller
Programming Interface: 0x00
Revision ID: 0xa1
Interrupt Line: 0x0b
IRQ: 255
Interrupt Vector: 0x00
PCI Pin: 0x00
Spawned Bus: 0x00
Flags: 0x3001
Module ID: 45
Module Name: pciPassthru
Chassis: 0
Physical Slot: 7
Slot Description: CPU SLOT7 PCI-E 3.0 X8
Device Layer Bus Address: s00000007.00
Passthru Capable: true
Parent Device: PCI 0:100:0:0
Dependent Device: PCI 0:101:0:0
Reset Method: Bridge reset
FPT Sharable: true
NUMA Node: 0
Hardware Label:
Virtual Function:
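Since that output looked right to me but I wanted to be sure I wasn't misreading it, I also ran it through a quick throwaway script (entirely my own, nothing official) to check the fields that matter for passthrough:

```python
def parse_pci_entry(text):
    """Parse the 'Key: Value' lines of one esxcli hardware pci list entry."""
    info = {}
    for line in text.splitlines():
        # Fields use 'Key: Value'; the bare address line has no ': ' and is skipped.
        key, sep, value = line.strip().partition(": ")
        if sep:
            info[key] = value
    return info

def looks_passthru_ready(info):
    """Sanity-check the ownership/capability fields for passthrough."""
    return (info.get("Passthru Capable") == "true"
            and info.get("Configured Owner") == "VM Passthru"
            and info.get("Current Owner") == "VM Passthru")

entry = parse_pci_entry("""\
Address: 0000:65:00.0
Configured Owner: VM Passthru
Current Owner: VM Passthru
Vendor ID: 0x10de
Device ID: 0x1eb8
Passthru Capable: true
""")
print(looks_passthru_ready(entry))  # -> True
```

It agreed with my reading: the card is owned by passthrough and flagged as passthru-capable.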

Here's the .vmx file for the VM I'm trying to boot:

.encoding = "UTF-8"
config.version = "8"
virtualHW.version = "20"
nvram = "oc.nvram"
svga.present = "TRUE"
vmci0.present = "TRUE"
hpet0.present = "TRUE"
floppy0.present = "FALSE"
numvcpus = "2"
memSize = "16384"
firmware = "efi"
powerType.powerOff = "default"
powerType.suspend = "default"
powerType.reset = "default"
tools.upgrade.policy = "manual"
sched.cpu.units = "mhz"
sched.cpu.affinity = "all"
sched.cpu.latencySensitivity = "normal"
vm.createDate = "1674612518956071"
scsi0.virtualDev = "pvscsi"
scsi0.present = "TRUE"
sata0.present = "TRUE"
scsi0:0.deviceType = "scsi-hardDisk"
scsi0:0.fileName = "oc.vmdk"
sched.scsi0:0.shares = "normal"
sched.scsi0:0.throughputCap = "off"
scsi0:0.present = "TRUE"
sata0:0.deviceType = "cdrom-image"
sata0:0.fileName = "/vmfs/volumes/9d696458-538d8b1c/iso/ubuntu-22.04-live-server-amd64.iso"
sata0:0.present = "TRUE"
ethernet0.allowGuestConnectionControl = "FALSE"
ethernet0.virtualDev = "vmxnet3"
ethernet0.dvs.switchId = "50 11 bd bf 4b da 72 f0-66 52 ed d6 5f 9a a5 b8"
ethernet0.dvs.portId = "34"
ethernet0.dvs.portgroupId = "dvportgroup-2041"
ethernet0.dvs.connectionId = "1114659673"
ethernet0.shares = "normal"
ethernet0.addressType = "vpx"
ethernet0.generatedAddress = "00:50:56:91:f3:77"
ethernet0.uptCompatibility = "TRUE"
ethernet0.present = "TRUE"
displayName = "oc"
guestOS = "ubuntu-64"
chipset.motherboardLayout = "acpi"
toolScripts.afterPowerOn = "TRUE"
toolScripts.afterResume = "TRUE"
toolScripts.beforeSuspend = "TRUE"
toolScripts.beforePowerOff = "TRUE"
uuid.bios = "42 11 41 c2 e2 4f 33 f8-bb e2 cc ae ec de ef e4"
vc.uuid = "50 11 cd 21 85 bf 53 07-6b 03 95 46 2f 0d f0 99"
migrate.hostLog = "oc-22261365.hlog"
sched.cpu.min = "0"
sched.cpu.shares = "normal"
sched.mem.min = "16384"
sched.mem.minSize = "16384"
sched.mem.shares = "normal"
migrate.encryptionMode = "opportunistic"
ftcpt.ftEncryptionMode = "ftEncryptionOpportunistic"
scsi0:0.ctkEnabled = "TRUE"
ctkEnabled = "TRUE"
sched.mem.pin = "TRUE"
numa.autosize.cookie = "40012"
numa.autosize.vcpu.maxPerVirtualNode = "4"
cpuid.coresPerSocket.cookie = "4"
sched.swap.derivedName = "/vmfs/volumes/611ffeaf-b4d4b252-6f7b-ac1f6b7d80aa/oc/oc-1416d0e7.vswp"
pciBridge1.present = "TRUE"
pciBridge1.virtualDev = "pciRootBridge"
pciBridge1.functions = "1"
pciBridge1:0.pxm = "0"
pciBridge0.present = "TRUE"
pciBridge0.virtualDev = "pciRootBridge"
pciBridge0.functions = "1"
pciBridge0.pxm = "-1"
scsi0.pciSlotNumber = "32"
ethernet0.pciSlotNumber = "34"
sata0.pciSlotNumber = "35"
scsi0:0.redo = ""
scsi0.sasWWID = "50 05 05 62 e2 4f 33 f0"
vmotion.checkpointFBSize = "16777216"
vmotion.checkpointSVGAPrimarySize = "16777216"
vmotion.svga.mobMaxSize = "16777216"
vmotion.svga.graphicsMemoryKB = "16384"
vmci0.id = "-320933916"
monitor.phys_bits_used = "45"
cleanShutdown = "TRUE"
softPowerOff = "TRUE"
tools.syncTime = "FALSE"
guestInfo.detailed.data = "architecture='X86' bitness='64' distroName='Ubuntu 22.04 LTS' distroVersion='22.04' familyName='Linux' kernelVersion='5.15.0-58-generic' prettyName='Ubuntu 22.04
toolsInstallManager.updateCounter = "1"
extendedConfigFile = "oc.vmxf"
sata0:0.startConnected = "FALSE"
bios.bootDelay = "5000"
vmx.buildType = "debug"
svga.autodetect = "TRUE"
svga.guestBackedPrimaryAware = "TRUE"
uuid.location = "56 4d f0 8d e1 dc 65 db-8e 50 1a 54 63 4b f8 3e"
svga.vramSize = "16777216"
vvtd.enable = "TRUE"
viv.moid = "f0c3d812-d205-4ee9-a1c6-452994dc9e42:vm-48044:A4Ad6e0tdI/Qwq+qN/eDfKIP6+cMXGD5Y6L6z5MTXBk="
pciPassthru.use64bitMMIO="TRUE"
pciPassthru.64bitMMIOSizeGB="32"
pciPassthru0.id = "00000:101:00.0"
pciPassthru0.deviceId = "0x1eb8"
pciPassthru0.vendorId = "0x10de"
pciPassthru0.systemId = "5c7944bd-360d-25c6-d570-ac1f6b7d80aa"
pciPassthru0.present = "TRUE"
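One sanity check I did on the pciPassthru0.id value: the vmx id appears to use decimal bus/device numbers, while esxcli prints the bus in hex, so 0000:65:00.0 from esxcli should map to 00000:101:00.0 in the vmx (0x65 = 101), which it does. The little converter below is just my own helper to illustrate that observation:

```python
def vmx_id_from_esxcli(address):
    """Convert an esxcli-style hex PCI address ('0000:65:00.0') to the
    decimal segment:bus:device.function form that pciPassthru0.id seems
    to use. (Helper is mine; the decimal convention is what I observed.)"""
    seg, bus, devfn = address.split(":")
    dev, fn = devfn.split(".")
    return f"{int(seg, 16):05d}:{int(bus, 16)}:{int(dev, 16):02d}.{int(fn, 16)}"

print(vmx_id_from_esxcli("0000:65:00.0"))  # -> 00000:101:00.0
```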

Items like svga.vramSize, the vmotion.* entries, and svga.present were added automatically by VMware. If I switch from DirectPath I/O to Dynamic DirectPath I/O, the pciPassthru0 entries become:

pciPassthru0.allowedDevices = "0x10de:0x1eb8"
pciPassthru0.present = "TRUE"

Thank you for any help on this matter! Would love to get these cards working over the Quadros.
