Hello VMware-ers,
I've struggled with this GPU issue for a while. See my setup below, and let me know if anyone has any ideas!
Problem: Any sort of stress on the GPU crashes the guest, which then restarts. To use the GPU again after the crash (even for running shell commands on the ESXi host), I have to reboot the host. Small memory dumps are uploaded below. With the GPU/VM in pass-through, I have to use VNC to log in.
Thank you for any help/suggestions!
ESXi Host #1 ver. 6.5
Dell Precision T5600
Bios: Latest
RAM: 24GB
Multiple SSD/HDDs
PCIe Slot 1 Nvidia GRID K2 (ECC off, nvidia-smi looks good)
PCIe Slot 5 Nvidia Quadro K600 (set as primary video card in BIOS)
Latest matching Nvidia drivers for host (injected as .VIB) and guest
vSphere Enterprise Plus license
ESXi Host #2 ver. 6.5
Intel NUC
Assorted VMs, including vCenter Server (deployed via GUI installer)
vSphere Enterprise Plus license
vCenter Server ver. 7.0
Server 7 Standard
Virtual Machine with vGPU assigned (installed on ESXi host #1)
Windows 10 Enterprise LTSC ver. 1809
Nvidia GRID vGPU grid_k220q (have tried k200 and k280q)
Nested Virtualization Enabled
Latest VM Tools installed (ver. 10272)
hypervisor.cpuid.v0 = FALSE
VirusTotal for minidump zip
Message was edited by: Ryan (added photos)
Message was edited by: Ryan (updated title)
Hey, hope you are doing fine
This might sound silly, but do you have VMware Tools installed and up to date? What do the VM logs say?
Hi nachogonzalez,
I do have the latest VMware Tools installed. Which logs? Are you talking about "Export System Logs..."?
Hey, hope you are doing fine:
Can you please upload the following logs:
vmkernel and vmkwarning logs --> ESXi Log File Locations
VM log files: VMware Knowledge Base
Thanks in advance
Warm regards
Let me know if you need further assistance.
I wasn't able to find much evidence of a crash or a resource limitation. Any tips on how to comb through these logs better?
Can you upload them, please?
From the vmware.log, I can see vmx settings that are mutually exclusive (i.e. they should not be used together, because they either contradict each other or one takes effect and makes the other useless).
2020-09-15T13:30:10.423Z| vmx| I125: DICT pciPassthru.use64bitMMIO = "TRUE"
2020-09-15T13:30:10.423Z| vmx| I125: DICT pciPassthru.64bitMMIOSizeGB = "16"
2020-09-15T13:30:10.423Z| vmx| I125: DICT pciHole.start = "2048"
2020-09-15T13:30:10.423Z| vmx| I125: DICT pciHole.dynStart = "3072"
There is no firmware setting in the vmx configuration, so I guess the VM is using virtual BIOS and not virtual UEFI, as I don't see this line in the vmware.log:
firmware="efi"
The pciPassthru.use64bitMMIO and pciPassthru.64bitMMIOSizeGB settings only take effect if the VM is using EFI for its virtual firmware. You wouldn't use the pciHole settings once the VM is using 64-bit MMIO, as the MMIO region is then already above the 4GB address boundary. pciHole.start = "2048" means the MMIO region starts at the 2GB address boundary.
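If you want to check other VMs for the same combination, here is a rough Python sketch that scans the DICT lines of a vmware.log and flags the two conflicts described above. The function names are made up for illustration, the regex is based only on the log lines quoted in this thread, and lowercasing the keys is just a convenience I am assuming is safe. Treat it as a starting point, not an official tool.

```python
import re

# Matches entries such as:
# 2020-09-15T13:30:10.423Z| vmx| I125: DICT pciHole.start = "2048"
# (pattern inferred from the log lines quoted in this thread)
DICT_LINE = re.compile(r'DICT\s+(\S+)\s*=\s*"([^"]*)"')


def parse_dict_entries(log_text):
    """Collect DICT key/value pairs from vmware.log text (keys lowercased)."""
    settings = {}
    for line in log_text.splitlines():
        match = DICT_LINE.search(line)
        if match:
            settings[match.group(1).lower()] = match.group(2)
    return settings


def find_mmio_conflicts(settings):
    """Return warnings for the two conflicts discussed in this thread."""
    warnings = []
    uses_64bit = settings.get("pcipassthru.use64bitmmio", "").upper() == "TRUE"
    # No firmware entry is treated as virtual BIOS, as guessed above.
    is_efi = settings.get("firmware", "bios").lower() == "efi"
    hole_keys = sorted(k for k in settings if k.startswith("pcihole."))

    if uses_64bit and not is_efi:
        warnings.append("64-bit MMIO is set but firmware is not efi")
    if uses_64bit and hole_keys:
        warnings.append("64-bit MMIO is set together with: " + ", ".join(hole_keys))
    return warnings


if __name__ == "__main__":
    # Hypothetical path; point this at a copy of the VM's vmware.log.
    with open("vmware.log", encoding="utf-8", errors="replace") as handle:
        for warning in find_mmio_conflicts(parse_dict_entries(handle.read())):
            print("WARNING:", warning)
```

For the four lines quoted above, this reports both warnings: no firmware="efi" entry, and pciHole settings combined with 64-bit MMIO.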
Have a read of this KB
https://kb.vmware.com/s/article/2142307
and also read this to understand what a "PCI Hole" is
https://en.wikipedia.org/wiki/PCI_hole
The GRID K2 does not have display output so I suppose you intend to use this as a compute device (such as for CUDA). It is better to use EFI as virtual firmware for the VM (and along with it the 64-bit MMIO settings).
Note that you have to reinstall the guest OS from scratch if switching from BIOS to EFI for the virtual firmware, as the VM will otherwise no longer boot. Virtual EFI looks for GPT on the boot disk while BIOS looks for MBR.
An alternative to reinstalling from scratch for a Windows 10 VM is to use the MBR2GPT tool, available in version 1703 and newer.
Convert to GPT first and then change to EFI in the vmx settings.
I have done a conversion successfully before for a Windows 10 VM in Workstation Pro 15.x.
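Before flipping the firmware setting, it can help to confirm which partitioning scheme the boot disk actually has. Below is a minimal Python sketch that inspects the first two sectors of a raw disk image. It assumes 512-byte sectors and a raw/flat image (it will not understand sparse or other container formats), and the function names and the file name are hypothetical. It relies on two well-known on-disk markers: the 0x55 0xAA boot signature at the end of sector 0, and the "EFI PART" signature at the start of the GPT header in sector 1.

```python
SECTOR_SIZE = 512  # assumption: 512-byte logical sectors


def detect_partition_scheme(first_sectors, sector_size=SECTOR_SIZE):
    """Return "GPT", "MBR" or "unknown" for the first two sectors of a raw disk."""
    if len(first_sectors) < sector_size + 8:
        raise ValueError("need at least the first two sectors of the disk")
    # The GPT header lives in LBA 1 and starts with the ASCII string "EFI PART".
    if first_sectors[sector_size:sector_size + 8] == b"EFI PART":
        return "GPT"
    # A classic MBR ends sector 0 with the boot signature 0x55 0xAA.
    if first_sectors[510:512] == b"\x55\xaa":
        return "MBR"
    return "unknown"


def detect_from_image(path):
    """Read the first two sectors of a raw image file and classify it."""
    with open(path, "rb") as handle:
        return detect_partition_scheme(handle.read(2 * SECTOR_SIZE))


if __name__ == "__main__":
    # Hypothetical file name; use a raw/flat copy of the boot disk.
    print(detect_from_image("bootdisk-flat.vmdk"))
```

GPT is checked first because a GPT disk also carries a protective MBR in sector 0, so the 0x55 0xAA signature alone does not tell the two apart.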