VMware Cloud Community
jmbraben2
Contributor

ESXi 7.0.1 'DevicePowerOn' power on failed with Nvidia Quadro P400 passthrough

I've updated from 6.7.0u2 to 7.0.1 (HPE-ESXi-6.7.0-Update2 to (Updated) HPE-Custom-AddOn_701.0.0.10.6.0-40) in an attempt to pass through an Nvidia Quadro P400 to a Linux guest.

In 6.7.0u2, Linux could see the GPU, but not "correctly": nvidia-smi did not report the card, although it showed up in lspci.

I've read about the passthrough changes in 7.0 and have created a new ESXi 7.0 U1 virtual machine from scratch.

The GPU is marked as active passthrough in the host hardware

pastedImage_7.png

The VM is configured with UEFI bios

pastedImage_5.png

I've added the GPU as a "Dynamic PCI device"

pastedImage_2.png

Although it does seem odd to see this at the top level:

pastedImage_3.png

I've reserved all the guest OS memory

pastedImage_4.png

This GPU only has 2 GB of memory, so nothing fancy there in memory allocations.

This card draws at most 30 watts and has no "extra" power connections.

Since trying this on 7.0.1 I cannot even start the VM as it always fails with:

"Power on failure messages: Module 'DevicePowerOn' power on failed."

I see nothing in the logs that would explain the failure.

2020-10-22T23:13:21.121Z| vmx| I005: DICT pciPassthru0.allowedDevices = "0x10de:0x1cb3,0x10de:0xfb9"
...
2020-10-22T23:13:21.410Z| vmx| I005+ Power on failure messages: Module 'DevicePowerOn' power on failed.
2020-10-22T23:13:21.410Z| vmx| I005+ Failed to start the virtual machine.
2020-10-22T23:13:21.410Z| vmx| I005+
2020-10-22T23:13:21.411Z| vmx| I005: Vix: [mainDispatch.c:4200]: VMAutomation_ReportPowerOpFinished: statevar=0, newAppState=1870, success=1 additionalError=0
2020-10-22T23:13:21.411Z| vmx| I005: Transitioned vmx/execState/val to poweredOff
2020-10-22T23:13:21.411Z| vmx| I005: Vix: [mainDispatch.c:4200]: VMAutomation_ReportPowerOpFinished: statevar=0, newAppState=1870, success=0 additionalError=0
2020-10-22T23:13:21.411Z| vmx| I005: Vix: [mainDispatch.c:4238]: Error VIX_E_FAIL in VMAutomation_ReportPowerOpFinished(): Unknown error
2020-10-22T23:13:21.411Z| vmx| I005: Vix: [mainDispatch.c:4200]: VMAutomation_ReportPowerOpFinished: statevar=0, newAppState=1870, success=1 additionalError=0
2020-10-22T23:13:21.411Z| vmx| I005: Transitioned vmx/execState/val to poweredOff
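
When the vmware.log is long, it can help to boil it down to the passthrough device list and the power-on failure messages. A minimal Python sketch, assuming the line layout seen in the excerpt above (timestamp| thread| level, then ":" or "+", then the message). The function name summarize_vmx_log is made up for illustration and is not part of any VMware tool.

```python
import re

# Matches lines such as:
# 2020-10-22T23:13:21.410Z| vmx| I005+ Failed to start the virtual machine.
LINE_RE = re.compile(
    r'^(?P<ts>\S+)\|\s*(?P<thread>[^|]+)\|\s*(?P<level>\w+)[:+]\s?(?P<msg>.*)$'
)
# Matches the DICT entry that lists the vendor:device IDs allowed for a slot.
ALLOWED_RE = re.compile(r'DICT\s+(pciPassthru\d+)\.allowedDevices\s*=\s*"([^"]*)"')


def summarize_vmx_log(lines):
    """Return (allowed_devices, failures) from an iterable of vmware.log lines.

    allowed_devices maps a slot name (e.g. "pciPassthru0") to a list of
    (vendor_id, device_id) integer pairs; failures is a list of
    (timestamp, message) tuples for power-on failure lines.
    """
    allowed = {}
    failures = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # not a log line in the assumed layout
        msg = m.group("msg")
        dev = ALLOWED_RE.search(msg)
        if dev:
            pairs = []
            for pair in dev.group(2).split(","):
                vendor, device = pair.split(":")
                # int(..., 16) accepts the "0x" prefix used in the log
                pairs.append((int(vendor, 16), int(device, 16)))
            allowed[dev.group(1)] = pairs
        if "power on failed" in msg or "Failed to start" in msg:
            failures.append((m.group("ts"), msg))
    return allowed, failures
```

Fed the lines quoted above, this yields one slot (pciPassthru0) with the two 0x10de vendor/device pairs and two failure messages, which matches the observation that the VM log itself gives no further cause.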

As soon as I remove the GPU from the VM, the VM will power up without issue.

I feel I have to be missing something obvious as this "should work", but I'm currently at a loss.

I've attached the log and vmx files.

TIA for any help.

21 Replies
jmbraben2
Contributor

Pulled the video card and made sure it was functional (it is)...anyone's thoughts on this would be greatly appreciated.

BMWAdriano
Contributor

I'm having the exact same problem with an Nvidia Grid K2 in passthrough. When I remove the PCI device, the VM boots up just fine. Other than the 6.7 to 7.0 upgrade, nothing else was changed, and everything worked flawlessly in 6.7. I tried everything possible and, worst of all, we can't roll back to 6.7. What a mess!

ashilkrishnan
VMware Employee

Hi @BMWAdriano ,

Please check if this helps --> https://kb.vmware.com/s/article/67587 

VMdicker
Contributor

No, unfortunately, it does not work!

I think the original post already showed that passthrough has been enabled on the ESXi host.

VMdicker
Contributor

@jmbraben2 have you been able to find a solution?

BMWAdriano
Contributor

Thank you but the information you have provided is only basic documentation of how to initialize pass-thru. I have done this already multiple times on 6.5, 6.7 with no issue and now on 7.1 it does not function correctly. Unfortunately this information does not help and we have given up on 7.1 and reverted back to 6.7 which works without issue.

VMdicker
Contributor

@BMWAdriano  Is there any convenient way to roll back to 6.7? 

 

VMdicker
Contributor

@ashilkrishnan This problem seems to be quite different from similar cases you can find on the web.

In other cases, we could at least find "device blah blah does not exist". 

However, we could not find any useful information in the log.

BMWAdriano
Contributor

Hello VMdicker,

Unfortunately upgrading to v7 does not have a simple "rollback" as did 6.7 to earlier versions (from my understanding). I ended up reinstalling 6.7 from scratch. Luckily my datastores stayed intact so it wasn't a complete loss.

Shacl0w
Contributor

I am having the same issue with an Nvidia Titan V on ESXi 7.0.2, which works fine on ESXi 6.7. There is no other error message in the VM's vmware.log besides "Module 'DevicePowerOn' power on failed". I did find some information in the ESXi host's vmkernel.log:

PCI: 886: 0000:xx:00.0: Translation for IO 0x0 - 0x7f failed: not a bug
PCIPassthru: 1420: Failed to get pci info for 0000:xx:00.0
PCIPassthru: 1431: Disable Domain for device 0000:xx:00.0
PCIPassthru: 808: pcipdevInfo: 0xxxxxxxxxxxxx (0000:xx:00.0), state 0, destroyed

I don't know whether this information is related to the problem.

Hope ESXi 7.0.3 will be released soon and solve the issue.

 

BMWAdriano
Contributor

Has anyone found that the latest ESXi solves this issue yet? We are postponing the purchase of Horizon for this reason and, until it is resolved, we will only be using an ESXi standalone installation.

cbbb
Contributor

Is this, whatever it is, still outstanding?

 

Frustratingly, I got it to work with a P2200 Quadro GL107 on my server with Force Enable Host Display to Embedded set in the HP BIOS.

 

But I just cannot get it working in the same server using a GL106 and the exact same settings 😞

 

I did, however, once get past the DevicePowerOn failure on another server by force-blocking the NVIDIA module at host boot, using a Linux-style module block.

However, I cannot seem to find what it was.

 

I need the PCI device to show up as an unknown VGA device on the ESXi host:

A "VGA compatible" device in the PCI list works.

An "NVIDIA VGA" device in the PCI list does not work.

peterbuckingham
VMware Employee

I have a Dell Precision Tower 5810. I have configured both an NVIDIA Quadro M4000 (GM204GL) and an NVIDIA Quadro P2000 (GP106GL). I am using ESXi 7.0.1 and vCenter 7.0.3. I have passed through both of these devices using DirectPath I/O passthrough and am able to boot VMs with either device, or a VM with both GPUs.

I'm not sure what you are doing, but it seems like maybe you are missing a config step. You might want to double check the steps in the KB article mentioned above.

cbbb
Contributor

Do you have onboard graphics, or a third graphics card?

I think that is the common trait, as I have a machine that either has only one PCIe graphics card or whose BIOS forces PCIe as both primary and secondary; mine does the latter, with no workaround.

 

Yes, the majority of comments respond to the link directly: we all get the hardware listed as passthrough, but the VM fails to boot.

Does your BIOS allow you to set the primary display to onboard/embedded? In those cases it always works for me.

Do you have a third PCIe graphics card? In those cases it always works for me.

When embedded and primary are enabled at the same time in the BIOS, it doesn't work for me.

When the primary display (or all displays) is in use by ESXi, it doesn't work for me.

 

If I enable the ESXi headless kernel option with TTYS0 output only, the lone PCIe card can be passed through in 7.0.

It seems 7.0 has introduced some NVIDIA drivers in its new non-Linux driver model that no longer allow the card to be shared.

peterbuckingham
VMware Employee

I expect my system has onboard graphics. I'm not using these GPUs for graphics at all, just for compute workloads. I have been able to use the NVIDIA guest driver (i.e. the native Linux driver) in my VMs without issue. I am also just using the emulated graphics card in my VMs, alongside the passed-through devices.

cbbb
Contributor

The issue is that we cannot pass the compute card through to the guest OS when it is attached while the VM is powered off:

powering on the VM halts with the DevicePowerOn failure, and it never works until the PCIe device is detached from the VM.

The HPE DL360 server has embedded graphics and a BIOS option to disable PCIe as the system default display, and vSphere passthrough works.

The HPE DL20, for example, has no such setting: it is either both or PCIe only, with no means to stop the host OS (vSphere 7.0) from using the PCIe card as a display.

 

It is reproducible on all VM templates, from Linux 2.6 through to Debian and Windows,

and on VM hardware formats 6, 6.5, 6.7 and 7.

I imagine there may be a way to manually take the PCIe device away from the vSphere host in the passthru.map file, by changing the reset mode from bridge to d3d0 or similar.
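
For anyone who wants to experiment with the passthru.map idea, the sketch below formats a candidate line. It assumes the commonly documented column layout of /etc/vmware/passthru.map (vendor-id, device-id, resetMethod, fptShareable, whitespace-separated, hex IDs without the 0x prefix) and the reset methods flr, d3d0, link, bridge and default; check both against the comments at the top of the file on your own host. The helper name passthru_map_entry is hypothetical, and whether a different reset method actually gets past the DevicePowerOn failure is untested here.

```python
# Assumed set of reset methods and fptShareable values; verify against the
# header comments of /etc/vmware/passthru.map on the host in question.
VALID_RESET_METHODS = {"flr", "d3d0", "link", "bridge", "default"}
VALID_SHAREABLE = {"true", "false", "default"}


def passthru_map_entry(vendor_id, device_id, reset_method, fpt_shareable="default"):
    """Format one passthru.map line for the given PCI vendor/device IDs.

    vendor_id and device_id are integers (0xffff as device_id is the
    wildcard used for "any device of this vendor").
    """
    if reset_method not in VALID_RESET_METHODS:
        raise ValueError(f"unknown reset method: {reset_method!r}")
    if fpt_shareable not in VALID_SHAREABLE:
        raise ValueError(f"unknown fptShareable value: {fpt_shareable!r}")
    # IDs are written as four lowercase hex digits, no 0x prefix.
    return f"{vendor_id:04x}  {device_id:04x}  {reset_method}  {fpt_shareable}"


if __name__ == "__main__":
    # 0x10de:0x1cb3 is the vendor:device pair from the allowedDevices line
    # in the original post.
    print(passthru_map_entry(0x10de, 0x1cb3, "d3d0", "false"))
```

Keep a copy of the original file before changing it, and expect to need a host reboot before any change is picked up.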

 

 

peterbuckingham
VMware Employee

Hi, I'm just commenting on my experience with the Dell Tower. If you are hitting this issue with a specific environment I would suggest filing an SR and going through the official support channels to triage your issue.

jmbraben2
Contributor

I've ignored this issue for a long time (not that I did not want a resolution, but I was not seeing anything that helped).

This is an HPE ML30 Gen10; I'm not seeing any BIOS settings that would imply the PCIe card is used by the system by default. I've looked at the card outputs and they don't seem to be active, FWIW.

One comment by @Shacl0w got me looking at the vmkernel.log. It appears that the following is logged every time the VM goes to start up:

2022-01-05T22:11:40.273Z cpu0:524325)PCI: 1330: Skipping device reset on 0000:0a:00.0 because PCIe link to the device is down.
2022-01-05T22:11:40.273Z cpu0:524325)WARNING: PCI: 891: 0000:0a:00.0: Translation for MEM64 0x4000000000 - 0x400fffffff failed: firmware bug
2022-01-05T22:11:40.273Z cpu0:524325)PCI: 533: \_SB_.PC00: root bridge resources (via ACPI):
2022-01-05T22:11:40.273Z cpu0:524325)PCI: 545: IO: 0x0 - 0xcf7 Translation: 0x0 vmk_IOResourceAttrs: 0x0
2022-01-05T22:11:40.273Z cpu0:524325)PCI: 545: IO: 0xd00 - 0xffff Translation: 0x0 vmk_IOResourceAttrs: 0x0
2022-01-05T22:11:40.273Z cpu0:524325)PCI: 545: Mem: 0xa0000 - 0xbffff Translation: 0x0 vmk_IOResourceAttrs: 0x0
2022-01-05T22:11:40.273Z cpu0:524325)PCI: 545: Mem: 0x80000000 - 0xfeafffff Translation: 0x0 vmk_IOResourceAttrs: 0x0
2022-01-05T22:11:40.273Z cpu0:524325)PCIPassthru: 1420: Failed to get pci info for 0000:0a:00.0
2022-01-05T22:11:40.273Z cpu0:524325)PCIPassthru: 1431: Disable Domain for device 0000:0a:00.0

Interesting that it does not show up in the VM logs, but whatever... 0000:0a:00.0 is the Quadro card, and the fact that the "translation is failing" would seem to be a problem... but what to do about it?
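
Since the useful messages turn up in vmkernel.log rather than in the VM's own log, a small filter that groups the PCI and PCIPassthru messages by device address makes them easier to spot. A minimal Python sketch, assuming the line layout of the excerpt above; the function name pci_messages_by_device is made up for illustration. Addresses that have been redacted (such as 0000:xx:00.0 earlier in the thread) will not match.

```python
import re
from collections import defaultdict

# PCI address in segment:bus:device.function form, e.g. 0000:0a:00.0
SBDF_RE = re.compile(r'\b[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]\b')
# Message part of a vmkernel.log line from the PCI or PCIPassthru subsystem,
# with an optional "WARNING: " prefix and the numeric tag that follows.
SOURCE_RE = re.compile(r'\)(?:WARNING: )?(PCI|PCIPassthru): \d+: (.*)$')


def pci_messages_by_device(lines):
    """Group PCI/PCIPassthru messages by the PCI address they mention."""
    by_device = defaultdict(list)
    for line in lines:
        m = SOURCE_RE.search(line.rstrip())
        if not m:
            continue  # some other subsystem, or not a vmkernel.log line
        addr = SBDF_RE.search(m.group(2))
        if addr:
            by_device[addr.group(0).lower()].append(f"{m.group(1)}: {m.group(2)}")
    return dict(by_device)


if __name__ == "__main__":
    import sys
    # Usage (hypothetical path): python this_script.py /var/log/vmkernel.log
    with open(sys.argv[1], errors="replace") as fh:
        for device, messages in pci_messages_by_device(fh).items():
            print(device)
            for message in messages:
                print("   ", message)
```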

I've moved on to 7.0u2 trying to avoid the "destroy the USB boot media" issues.
