Hi all!
I have some servers in a Horizon 7.13 VDI cluster. The servers run "VMware-VMvisor-Installer-7.0.0.update01-16850804.x86_64-DellEMC_Customized-A00" (hardware is Dell R740 with 512GB RAM and two NVIDIA V100/16GB GPUs). The NVIDIA driver VIB is "NVD-VGPU_460.32.04-1OEM.700.0.0.15525992_17478485.zip".
vCenter "VMware-VCSA-all-7.0.1-17004997" runs on a separate vSphere 7.0.1 4-server cluster (hardware is Dell R740 with 256GB RAM). I also have an NVIDIA license server with vGPU licenses. The GPU hosts are configured with "Shared Direct" and "Spread VMs across GPUs". The VMs are Windows 10 with 8 vCPUs, 32GB RAM, and the "grid-v100_2q" vGPU profile.
And here is the problem. After the GPU hosts are freshly installed and configured, I have no issues at all starting or hot-migrating VMs. Everything seems fine.
But if I reboot any GPU host (through vCenter, iDRAC, or even the power button as a test), I can no longer start or migrate any VM on that host. When I try to start or hot-migrate a VM, I get this error on the vCenter migration screen: "The operation is not allowed in the current state", and the host shows an exclamation mark. All other hosts work fine. When I reboot another host, the same problems appear on it. The only way to restore the ability to start/migrate VMs on a problem host is to completely reinstall it!
After some googling I have tried two tests:
1) I cloned the VM and removed the GPU from the clone's hardware. The clone starts and migrates normally.
2) I powered off a VM with a GPU and cold-migrated it to the problem host. Then I started the VM through the host's web interface (not through vCenter). The VM starts normally.
Where could the problem be? Based on the results of these two tests, I think it is somewhere between the GPU driver API and vCenter.
WBR, Alexander.
Ciao
Did you try upgrading the ESXi hosts to 7.0 Update 2?
Because ESXi 7.0 U1 is not supported with NVIDIA GPUs.
Your duplicate thread has been reported, expect a moderator to remove it.
Hi
The NVIDIA driver version I use, 12.1 (460.32.04), is compatible with 7.0 and its compatible updates: https://docs.nvidia.com/grid/12.0/grid-vgpu-release-notes-vmware-vsphere/index.html#hypervisor-softw...
And all the other hosts work fine (until rebooted).
Could the problem be with the host GPU settings (e.g. "Spread VMs across GPUs") or the host BIOS (e.g. SR-IOV)?
Ciao
Sorry for the mistake.
You can try changing the "Spread VMs across GPUs" setting to "Group VMs on GPU until full ....", but I don't think that will resolve the issue.
Did you check the virtual machine's log when it fails to start, and /var/log/vmkernel.log on the ESXi host? Are there any errors?
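For example, something like the grep below, run over SSH on the problem host right after a failed power-on. The filter pattern is just my suggestion, and the sample log lines here are invented purely to show what kind of entries to look for; on a real host you would run the grep directly against /var/log/vmkernel.log:

```shell
# On the real ESXi host you would run:
#   grep -iE 'nvidia|nvrm|vgpu|vmiop' /var/log/vmkernel.log | tail -50
# Demo of the same filter against an inline sample log (lines are made up):
cat <<'EOF' > /tmp/vmkernel.sample
2021-05-01T10:00:00Z cpu1: NVRM: loading NVIDIA UNIX x86_64 Kernel Module
2021-05-01T10:00:05Z cpu2: WARNING: vmiop_log: plugin initialization failed
2021-05-01T10:00:06Z cpu3: some unrelated vmkernel line
EOF
# -i: case-insensitive, -c: count matching lines, -E: extended regex
grep -icE 'nvidia|nvrm|vgpu|vmiop' /tmp/vmkernel.sample
# prints 2 (the NVRM line and the vmiop warning match; the third does not)
```

If the vGPU plugin (vmiop) logs an initialization failure only after a reboot, that would point at the driver/host side rather than vCenter.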
Hi @Alex-M,
Have you tried temporarily reducing the VMs' RAM to below 32GB?
Hi.
I know what you mean. The old problem with large MMIO space? But all the other hosts (which have not been rebooted yet) work fine.
And here NVIDIA says that the problem was resolved in ESXi 6.7: https://docs.nvidia.com/grid/12.0/grid-vgpu-release-notes-vmware-vsphere/index.html#bug-2043171-vms-... I have 7.0.1.
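For reference, the classic workaround for that large-MMIO issue was adding the 64-bit MMIO advanced settings to the VM's .vmx. The size value below is only an example; it has to be large enough to cover the total BAR size of the assigned GPUs:

```
pciPassthru.use64bitMMIO = "TRUE"
pciPassthru.64bitMMIOSizeGB = "64"
```

I have not set these, since per the release notes above it should no longer be needed on 7.0.1 and, as noted, only rebooted hosts are affected.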