VMware Cloud Community
rycher
Contributor

W10 Guest Crashes After Nvidia GRID K2 GPU Pass-Through

Hello VMware-ers,

I've struggled with this GPU issue for a while. See my setup below, and let me know if anyone has any ideas!

Problem: Any sort of stress on the GPU crashes the guest, which then restarts. To use the GPU again after the crash (even just to run shell commands against it on the ESXi host), I have to reboot the host. Small memory dumps are uploaded below. With the GPU/VM in pass-through, I have to use VNC to log in.

Thank you for any help/suggestions!

ESXi Host #1 ver. 6.5

Dell Precision T5600
Bios: Latest

RAM: 24GB
Multiple SSD/HDDs
PCIe Slot 1 Nvidia GRID K2 (ECC off, nvidia-smi looks good)

PCIe Slot 5 Nvidia Quadro K600 (set as primary video card in BIOS)

Latest matching Nvidia drivers for host (injected as .VIB) and guest

vSphere Enterprise Plus license

ESXi Host #2 ver. 6.5

Intel NUC

Assorted VMs, including vCenter Server. (Deployed via GUI installer)

vSphere Enterprise Plus license

vCenter Server ver. 7.0

Server 7 Standard

Virtual Machine with vGPU assigned (installed on ESXi host #1)

Windows 10 Enterprise LTSC ver. 1809

Nvidia GRID vGPU grid_k220q (have tried k200 and k280q)

Nested Virtualization Enabled

Latest VM Tools installed (ver. 10272)

hypervisor.cpuid.v0 = FALSE

VirusTotal link for the minidump zip

Message was edited by: Ryan (added photos)

Message was edited by: Ryan (updated title)

10 Replies
nachogonzalez
Commander

Hey, hope you are doing fine.
This might sound silly, but do you have VMware Tools installed and up to date? What do the VM logs say?

rycher
Contributor

Hi nachogonzalez,

I do have the latest VMware Tools installed. Which log? Are you talking about "Export System Logs..."?

nachogonzalez
Commander

Hey, hope you are doing fine.

Can you please upload the following logs:

VMkernel and VMKwarning logs --> ESXi Log File Locations

VM log files: VMware Knowledge Base

Thanks in advance

Warm regards

rycher
Contributor

Please see enclosed. I'm going to take a look myself, now that I know what the primary logs are.

nachogonzalez
Commander

Let me know if you need further assistance.

rycher
Contributor

I wasn't able to find much of a crash or resource limitation. Any tips on how to comb through these logs better?

nachogonzalez
Commander

Can you upload them, please?

rycher
Contributor

Oh, maybe you can't see them? I uploaded them above. Nonetheless, I re-uploaded them.

bluefirestorm
Champion

From the vmware.log, there are vmx settings that are mutually exclusive (i.e. one setting should not be used together with another, either because they contradict each other or because one takes effect and makes the other useless).

2020-09-15T13:30:10.423Z| vmx| I125: DICT  pciPassthru.use64bitMMIO = "TRUE"

2020-09-15T13:30:10.423Z| vmx| I125: DICT pciPassthru.64bitMMIOSizeGB = "16"

2020-09-15T13:30:10.423Z| vmx| I125: DICT             pciHole.start = "2048"

2020-09-15T13:30:10.423Z| vmx| I125: DICT          pciHole.dynStart = "3072"

There is no firmware setting in the vmx configuration, so I guess the VM is using virtual BIOS and not virtual UEFI, as I don't see this line in the vmware.log.

firmware="efi"

The use64bitMMIO and 64bitMMIOSizeGB settings only have an effect if the VM is using EFI for its virtual firmware. You wouldn't use the pciHole settings once the VM is using 64-bit MMIO, as the MMIO address range is already above the 4GB address area. pciHole.start = "2048" means the MMIO address range starts at the 2GB address boundary.
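
To make that check repeatable, here is a rough Python sketch that pulls the DICT lines out of a vmware.log and flags the two combinations described above. It is not a VMware tool: the function names are made up for illustration, and it assumes that a missing firmware entry means the VM uses virtual BIOS.

```python
import re

# Matches vmware.log lines such as:
#   2020-09-15T13:30:10.423Z| vmx| I125: DICT  pciPassthru.use64bitMMIO = "TRUE"
DICT_LINE = re.compile(r'DICT\s+(\S+)\s*=\s*"([^"]*)"')


def parse_dict_settings(log_text):
    """Collect key/value pairs from the DICT lines of a vmware.log (hypothetical helper)."""
    settings = {}
    for line in log_text.splitlines():
        match = DICT_LINE.search(line)
        if match:
            settings[match.group(1)] = match.group(2)
    return settings


def check_mmio_settings(settings):
    """Return warnings for the setting combinations discussed in this post."""
    warnings = []
    # Assumption: no firmware entry means the VM boots with virtual BIOS.
    uses_efi = settings.get("firmware", "bios").lower() == "efi"
    mmio_keys = [k for k in settings
                 if k.startswith("pciPassthru.") and "64bitMMIO" in k]
    hole_keys = [k for k in settings if k.startswith("pciHole.")]
    if mmio_keys and not uses_efi:
        warnings.append("64-bit MMIO settings are set, but firmware is not efi")
    if mmio_keys and hole_keys:
        warnings.append("pciHole settings are set together with 64-bit MMIO settings")
    return warnings
```

Feeding it the four lines quoted above should give both warnings. Once firmware = "efi" is present, only the pciHole warning should remain.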

Have a read of this KB

https://kb.vmware.com/s/article/2142307

and also read this to understand what a "PCI Hole" is

https://en.wikipedia.org/wiki/PCI_hole

The GRID K2 does not have display output so I suppose you intend to use this as a compute device (such as for CUDA). It is better to use EFI as virtual firmware for the VM (and along with it the 64-bit MMIO settings).

Note that you have to reinstall the guest OS from scratch if switching the virtual firmware from BIOS to EFI, as the existing installation will no longer boot. Virtual EFI looks for GPT on the boot disk while BIOS looks for MBR.
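
If you want to see which layout a disk currently has, a minimal Python sketch like the one below can tell the two apart from the first two sectors of a raw disk image. It assumes 512-byte sectors, and the function name is hypothetical. It only looks at the MBR boot signature and the GPT header signature, so treat it as a quick sanity check and not as a full partition parser.

```python
SECTOR_SIZE = 512  # assumption: 512-byte logical sectors


def partition_scheme(image_bytes):
    """Guess 'GPT', 'MBR' or 'unknown' from the start of a raw disk image."""
    # A GPT disk carries the header signature "EFI PART" at the start of LBA 1.
    has_gpt_header = image_bytes[SECTOR_SIZE:SECTOR_SIZE + 8] == b"EFI PART"
    # Both MBR disks and GPT disks (protective MBR) end sector 0 with 0x55 0xAA.
    has_boot_signature = image_bytes[510:512] == b"\x55\xaa"
    if has_gpt_header:
        return "GPT"
    if has_boot_signature:
        return "MBR"
    return "unknown"
```

Passing in the first 1024 bytes of the image is enough for this check.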

bluefirestorm
Champion

An alternative to reinstalling from scratch for a Windows 10 VM is to use the MBR2GPT tool, available in Windows 10 version 1703 and newer.

Convert to GPT first and then change to EFI in the vmx settings.

I have done a conversion successfully before for a Windows 10 VM in Workstation Pro 15.x.
