Alex-M
Contributor

Error starting VMs with GPU - "The operation is not allowed in the current state"

Hi all!

I have some servers in a Horizon 7.13 VDI cluster. The servers are running "VMware-VMvisor-Installer-7.0.0.update01-16850804.x86_64-DellEMC_Customized-A00" (the hardware is Dell R740 with 512 GB RAM and two NVIDIA V100 16 GB GPUs). The NVIDIA driver VIB is "NVD-VGPU_460.32.04-1OEM.700.0.0.15525992_17478485.zip".

vCenter "VMware-VCSA-all-7.0.1-17004997" runs on a separate four-server vSphere 7.0.1 cluster (the hardware is Dell R740 with 256 GB RAM). I also have an NVIDIA license server with vGPU licenses. The GPU hosts are configured with "Shared Direct" and "Spread VMs across GPUs". The VMs are Windows 10 with 8 vCPUs, 32 GB RAM and the "grid-v100_2q" GPU profile.

And here is the problem. When the GPU hosts are freshly installed and configured, I have no issues starting or hot-migrating VMs. Everything seems fine.

But if I reboot any GPU host (through vCenter, iDRAC, or even the power button as a test), I cannot start or migrate any VM on that host. When I try to start or hot-migrate a VM, I get this error in the vCenter migration screen: "The operation is not allowed in the current state", and the host shows an exclamation mark. All other hosts work fine. When I reboot another host, the same problems appear on it. The only way to restore the ability to start or migrate VMs on the problem hosts is to completely reinstall them!

After some googling, I tried two tests:

1) I cloned the VM and removed the GPU from the clone's hardware. The clone starts and migrates normally.

2) I powered off a VM with a GPU and cold-migrated it to the problem host. Then I started the VM through the web interface on the problem host (not through vCenter). The VM started normally.

Where could the problem be? Based on the results of these two tests, I think it is somewhere between the GPU driver API and vCenter.

WBR, Alexander.

 

6 Replies
fabio1975
Expert

Ciao 

Have you tried upgrading the ESXi hosts to 7.0 Update 2?

I ask because ESXi 7.0 U1 is not supported with NVIDIA GPUs.

[Screenshot: fabio1975_0-1651826601314.png]

VMware vSphere :: NVIDIA Virtual GPU Software Documentation

scott28tt
VMware Employee

Your duplicate thread has been reported, expect a moderator to remove it.

 


-------------------------------------------------------------------------------------------------------------------------------------------------------------

Although I am a VMware employee, I contribute to VMware Communities voluntarily (i.e. not in any official capacity)
VMware Training & Certification blog
Alex-M
Contributor

Hi

The NVIDIA driver I use, version 12.1 (460.32.04), is compatible with ESXi 7.0 and compatible updates: https://docs.nvidia.com/grid/12.0/grid-vgpu-release-notes-vmware-vsphere/index.html#hypervisor-softw...

And all the other hosts work fine (until they are rebooted).

Could the problem be in the host GPU settings (e.g. "Spread VMs across GPUs") or the host BIOS (e.g. SR-IOV)?

 

fabio1975
Expert

Ciao 

Sorry for the mistake.

You can try changing the "Spread VMs across GPUs" setting to "Group VMs on GPU until full", but I don't think that will resolve the issue.
Have you checked the virtual machine log from a start attempt that fails with the error, and /var/log/vmkernel.log on the ESXi host? Do they show any errors?
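
If it helps, here is a minimal Python sketch for pulling GPU-related lines out of those logs once you have copied them off the host. The keyword list and the function name `find_gpu_lines` are my own assumptions, not taken from NVIDIA or VMware documentation, so adjust them to whatever your driver actually logs.

```python
import re
import sys

# Assumed keywords; extend or trim this list to match your driver's messages.
KEYWORDS = ("nvidia", "nvrm", "vgpu", "vmiop")


def find_gpu_lines(log_text, keywords=KEYWORDS):
    """Return the lines of log_text that mention any keyword (case-insensitive)."""
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    return [line for line in log_text.splitlines() if pattern.search(line)]


if __name__ == "__main__":
    # Each command-line argument is a path to a log file copied from the host.
    for path in sys.argv[1:]:
        with open(path, errors="replace") as fh:
            for line in find_gpu_lines(fh.read()):
                print(f"{path}: {line}")
```

For example, saved under a name of your choice (say `find_gpu_lines.py`), it could be run as `python find_gpu_lines.py vmkernel.log vmware.log`. Comparing the output from a rebooted host with the output from a host that still works may show where they differ.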

 

 

mrkasius
Enthusiast

Hi @Alex-M,

Have you tried temporarily reducing the RAM of the VMs to below 32 GB?

Alex-M
Contributor

Hi.

I know what you mean: the old problem with large MMIO space? But all the other hosts (which have not been rebooted yet) work fine.

And here NVIDIA says that the problem was resolved in ESXi 6.7, while I have 7.0.1: https://docs.nvidia.com/grid/12.0/grid-vgpu-release-notes-vmware-vsphere/index.html#bug-2043171-vms-...

 
