Hi guys,
I've been banging my head against a wall trying to get an NVIDIA Tesla T4 passthrough-enabled VM to boot. I have two ESXi hosts in a vSphere 8.0.0 setup (Enterprise Plus), each with one T4 card. These systems previously each ran a Quadro P620 via passthrough without issues. Moving to the T4 has been nothing but trouble.
Each ESXi host boots properly and recognizes the card, and I am able to enable passthrough on it in the vSphere UI and add it to a VM configuration. However, when I try to start the VM (on either host), power-on hangs at 88% and eventually errors out. vmware.log for the VM shows:
2023-01-25T19:27:18.723Z In(05) vmx - MX: init lock: rank(PCIPassLCK_0)=0x3e7 lid=26
2023-01-25T19:30:27.731Z In(05) vmx - AH Failed to find a suitable device for pciPassthru0
2023-01-25T19:30:27.731Z In(05) vmx - Module 'DevicePowerOn' power on failed.
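For anyone reading the log excerpt above, the timestamps give a rough idea of how long the power-on sits between the passthrough lock being initialized and the failure. A minimal Python sketch, assuming only the timestamp format visible in these lines (the helper name `log_time` is made up for illustration):

```python
from datetime import datetime

# The first and second log lines quoted above.
LOG_LINES = [
    "2023-01-25T19:27:18.723Z In(05) vmx - MX: init lock: rank(PCIPassLCK_0)=0x3e7 lid=26",
    "2023-01-25T19:30:27.731Z In(05) vmx - AH Failed to find a suitable device for pciPassthru0",
]

def log_time(line: str) -> datetime:
    """Parse the leading UTC timestamp of a vmware.log line."""
    stamp = line.split(" ", 1)[0]
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%fZ")

gap_seconds = (log_time(LOG_LINES[1]) - log_time(LOG_LINES[0])).total_seconds()
print(f"{gap_seconds:.3f} s between lock init and failure")
```

That works out to a little over three minutes, which matches the long hang at 88% before the error appears.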
Some more things:
I've also tried the below config parameters in the .vmx in varying combinations, with no success:
pciPassthru.use64bitMMIO="TRUE"
pciPassthru.64bitMMIOSizeGB="32" (as the card has 16 GB of memory)
pciPassthru0.msiEnabled = "FALSE"
hypervisor.cpuid.v0 = "FALSE"
svga.guestBackedPrimaryAware = "FALSE" (seems to like to be set to TRUE by default)
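On the 64bitMMIOSizeGB value: the rule of thumb I'm assuming here (it is not spelled out in this post, so treat it as an assumption) is to add up the memory of all passed-through GPUs and round up to the next power of two above that total. A small sketch of that arithmetic, with a hypothetical helper name:

```python
def mmio_size_gb(total_gpu_memory_gb: int) -> int:
    """Smallest power of two strictly greater than the total GPU memory (in GB).

    Assumed sizing rule for pciPassthru.64bitMMIOSizeGB; not an official formula.
    """
    size = 1
    while size <= total_gpu_memory_gb:
        size *= 2
    return size

# One 16 GB T4 per VM, as in the setup described here.
print(mmio_size_gb(16))
```

Under that assumption a single 16 GB card gives 32, which is the value tried above.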
The host systems are each a Supermicro SuperServer 5019D-FN8TP running an up-to-date BIOS (v1.8), and this model is listed as supporting the T4 according to Supermicro's "Qualified Platform List for GPUs". I do have the GPU plugged into an x16 riser, which adapts it to the x8 PCIe slot on the motherboard, but the T4 spec sheet says it supports PCIe 3.0 x8 and x16, so I didn't think this would be an issue.
BIOS is as follows:
The GPU shows up in the vSphere UI as follows:
The GPU shows up fine on the host via esxcli hardware pci list -c 0x300 -m 0xff:
0000:65:00.0
Address: 0000:65:00.0
Segment: 0x0000
Bus: 0x65
Slot: 0x00
Function: 0x0
Vendor Name: NVIDIA Corporation
Device Name: TU104GL [Tesla T4]
Configured Owner: VM Passthru
Current Owner: VM Passthru
Vendor ID: 0x10de
Device ID: 0x1eb8
SubVendor ID: 0x10de
SubDevice ID: 0x12a2
Device Class: 0x0302
Device Class Name: 3D controller
Programming Interface: 0x00
Revision ID: 0xa1
Interrupt Line: 0x0b
IRQ: 255
Interrupt Vector: 0x00
PCI Pin: 0x00
Spawned Bus: 0x00
Flags: 0x3001
Module ID: 45
Module Name: pciPassthru
Chassis: 0
Physical Slot: 7
Slot Description: CPU SLOT7 PCI-E 3.0 X8
Device Layer Bus Address: s00000007.00
Passthru Capable: true
Parent Device: PCI 0:100:0:0
Dependent Device: PCI 0:101:0:0
Reset Method: Bridge reset
FPT Sharable: true
NUMA Node: 0
Hardware Label:
Virtual Function:
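One thing that is easy to trip over in this output: the Address field is in hex (bus 0x65), while the Parent Device and Dependent Device lines appear to use decimal (0:100:0:0 and 0:101:0:0). A quick sketch to convert between the two, assuming the `segment:bus:slot.function` layout shown above (the function name is just for illustration):

```python
def esxcli_to_decimal(address: str) -> str:
    """Convert a hex 'ssss:bb:dd.f' PCI address to decimal 's:b:d:f' form."""
    segment, bus, slot_func = address.split(":")
    slot, func = slot_func.split(".")
    parts = (segment, bus, slot, func)
    return ":".join(str(int(part, 16)) for part in parts)

print(esxcli_to_decimal("0000:65:00.0"))
```

Bus 0x65 is 101 in decimal, so the address of the T4 lines up with the 0:101:0:0 entry in the listing.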
Here's the .vmx file for the VM I'm trying to boot:
.encoding = "UTF-8"
config.version = "8"
virtualHW.version = "20"
nvram = "oc.nvram"
svga.present = "TRUE"
vmci0.present = "TRUE"
hpet0.present = "TRUE"
floppy0.present = "FALSE"
numvcpus = "2"
memSize = "16384"
firmware = "efi"
powerType.powerOff = "default"
powerType.suspend = "default"
powerType.reset = "default"
tools.upgrade.policy = "manual"
sched.cpu.units = "mhz"
sched.cpu.affinity = "all"
sched.cpu.latencySensitivity = "normal"
vm.createDate = "1674612518956071"
scsi0.virtualDev = "pvscsi"
scsi0.present = "TRUE"
sata0.present = "TRUE"
scsi0:0.deviceType = "scsi-hardDisk"
scsi0:0.fileName = "oc.vmdk"
sched.scsi0:0.shares = "normal"
sched.scsi0:0.throughputCap = "off"
scsi0:0.present = "TRUE"
sata0:0.deviceType = "cdrom-image"
sata0:0.fileName = "/vmfs/volumes/9d696458-538d8b1c/iso/ubuntu-22.04-live-server-amd64.iso"
sata0:0.present = "TRUE"
ethernet0.allowGuestConnectionControl = "FALSE"
ethernet0.virtualDev = "vmxnet3"
ethernet0.dvs.switchId = "50 11 bd bf 4b da 72 f0-66 52 ed d6 5f 9a a5 b8"
ethernet0.dvs.portId = "34"
ethernet0.dvs.portgroupId = "dvportgroup-2041"
ethernet0.dvs.connectionId = "1114659673"
ethernet0.shares = "normal"
ethernet0.addressType = "vpx"
ethernet0.generatedAddress = "00:50:56:91:f3:77"
ethernet0.uptCompatibility = "TRUE"
ethernet0.present = "TRUE"
displayName = "oc"
guestOS = "ubuntu-64"
chipset.motherboardLayout = "acpi"
toolScripts.afterPowerOn = "TRUE"
toolScripts.afterResume = "TRUE"
toolScripts.beforeSuspend = "TRUE"
toolScripts.beforePowerOff = "TRUE"
uuid.bios = "42 11 41 c2 e2 4f 33 f8-bb e2 cc ae ec de ef e4"
vc.uuid = "50 11 cd 21 85 bf 53 07-6b 03 95 46 2f 0d f0 99"
migrate.hostLog = "oc-22261365.hlog"
sched.cpu.min = "0"
sched.cpu.shares = "normal"
sched.mem.min = "16384"
sched.mem.minSize = "16384"
sched.mem.shares = "normal"
migrate.encryptionMode = "opportunistic"
ftcpt.ftEncryptionMode = "ftEncryptionOpportunistic"
scsi0:0.ctkEnabled = "TRUE"
ctkEnabled = "TRUE"
sched.mem.pin = "TRUE"
numa.autosize.cookie = "40012"
numa.autosize.vcpu.maxPerVirtualNode = "4"
cpuid.coresPerSocket.cookie = "4"
sched.swap.derivedName = "/vmfs/volumes/611ffeaf-b4d4b252-6f7b-ac1f6b7d80aa/oc/oc-1416d0e7.vswp"
pciBridge1.present = "TRUE"
pciBridge1.virtualDev = "pciRootBridge"
pciBridge1.functions = "1"
pciBridge1:0.pxm = "0"
pciBridge0.present = "TRUE"
pciBridge0.virtualDev = "pciRootBridge"
pciBridge0.functions = "1"
pciBridge0.pxm = "-1"
scsi0.pciSlotNumber = "32"
ethernet0.pciSlotNumber = "34"
sata0.pciSlotNumber = "35"
scsi0:0.redo = ""
scsi0.sasWWID = "50 05 05 62 e2 4f 33 f0"
vmotion.checkpointFBSize = "16777216"
vmotion.checkpointSVGAPrimarySize = "16777216"
vmotion.svga.mobMaxSize = "16777216"
vmotion.svga.graphicsMemoryKB = "16384"
vmci0.id = "-320933916"
monitor.phys_bits_used = "45"
cleanShutdown = "TRUE"
softPowerOff = "TRUE"
tools.syncTime = "FALSE"
guestInfo.detailed.data = "architecture='X86' bitness='64' distroName='Ubuntu 22.04 LTS' distroVersion='22.04' familyName='Linux' kernelVersion='5.15.0-58-generic' prettyName='Ubuntu 22.04
toolsInstallManager.updateCounter = "1"
extendedConfigFile = "oc.vmxf"
sata0:0.startConnected = "FALSE"
bios.bootDelay = "5000"
vmx.buildType = "debug"
svga.autodetect = "TRUE"
svga.guestBackedPrimaryAware = "TRUE"
uuid.location = "56 4d f0 8d e1 dc 65 db-8e 50 1a 54 63 4b f8 3e"
svga.vramSize = "16777216"
vvtd.enable = "TRUE"
viv.moid = "f0c3d812-d205-4ee9-a1c6-452994dc9e42:vm-48044:A4Ad6e0tdI/Qwq+qN/eDfKIP6+cMXGD5Y6L6z5MTXBk="
pciPassthru.use64bitMMIO="TRUE"
pciPassthru.64bitMMIOSizeGB="32"
pciPassthru0.id = "00000:101:00.0"
pciPassthru0.deviceId = "0x1eb8"
pciPassthru0.vendorId = "0x10de"
pciPassthru0.systemId = "5c7944bd-360d-25c6-d570-ac1f6b7d80aa"
pciPassthru0.present = "TRUE"
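Since the file is long, here is a rough sketch for pulling just the pciPassthru entries out of a .vmx for comparison between attempts. It assumes every setting is a single `key = "value"` line, as in the file above; lines that don't match (such as a truncated entry) are simply skipped. The sample text and function name are only for illustration:

```python
import re

# A few lines copied from the .vmx above.
VMX_SAMPLE = '''\
svga.vramSize = "16777216"
pciPassthru.use64bitMMIO="TRUE"
pciPassthru.64bitMMIOSizeGB="32"
pciPassthru0.id = "00000:101:00.0"
pciPassthru0.deviceId = "0x1eb8"
pciPassthru0.vendorId = "0x10de"
pciPassthru0.present = "TRUE"
'''

# key, optional spaces around '=', then a double-quoted value
LINE_RE = re.compile(r'^\s*([^=\s]+)\s*=\s*"(.*)"\s*$')

def parse_vmx(text: str) -> dict:
    """Return a dict of key -> value for every well-formed line."""
    settings = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            settings[match.group(1)] = match.group(2)
    return settings

settings = parse_vmx(VMX_SAMPLE)
passthru = {k: v for k, v in settings.items() if k.startswith("pciPassthru")}
for key, value in sorted(passthru.items()):
    print(f"{key} = {value}")
```

Running the same filter on the DirectPath and Dynamic DirectPath versions of the file makes the differences between the two easy to spot.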
Items like svga.vramSize, vmotion.*, and svga.present were added automatically by VMware. If I change from DirectPath to Dynamic DirectPath, the pciPassthru0 items become:
pciPassthru0.allowedDevices = "0x10de:0x1eb8"
pciPassthru0.present = "TRUE"
Thank you for any help on this matter! Would love to get these cards working over the Quadros.
Which VMware Tools version is running? I was not able to find it in the .vmx file.
I am not sure that the problem is there, though. With Ubuntu as the guest OS, it's a bit complicated for me.