VMware Horizon Community
jmacdaddy
Enthusiast

vSphere ESXi 7.0.3 - 7.0 Update 3 bug with NVIDIA GRID vGPU

Is anyone else seeing this?  After upgrading our hosts from 7.0.2 to 7.0.3, a fraction (10-15%) of the virtual desktops lose access to the vGPU at random times during the day.  The screen goes black and then comes back at 800x600 resolution.  Rebooting the virtual desktop, sometimes more than once, eventually fixes it.  In the vmware.log of the affected desktop you will see:

"Er(02) vthread-2108911 - vmiop_log: (0x0): Timeout occurred, reset initiated."

We have Tesla T4 cards.  Trying different NVIDIA guest drivers and host VIBs (12.1, 13.0) doesn't help; only rolling back the hosts to 7.0.2 fixes it.

116 Replies
BrianBurdorf
Contributor

We are seeing the same thing; I'm starting the revert process as well.  We were advised to upgrade to 7.0.3 to fix another PSOD issue with 7.0.2.

 

ericeby
Contributor

Same here, although we couldn't get vGPU to work at all with any of our GPUs except the A40s. All the earlier cards, like the P4 and T4, would not allow vGPU to work and would only boot one VM per graphics card. We had to revert all of our servers as well.

mhingu
Contributor

@ericeby Do your vGPU VMs fail to power on with the error “No host is compatible with the virtual machine” when you try to run multiple VMs on one GPU? If so, make sure your vCenter version is also 7.0.3 (and not 7.0.2), matching the ESXi version.

krd
VMware Employee

I verified that ESXi 7.0u3 with an M10 device, managed by VC 7.0u3, powers on vGPU VMs correctly, with multiple M10 vGPU VMs assigned to each device.  If the same ESXi host is managed by VC 7.0u2, however, it powers on only a single vGPU VM per device.  That is an unsupported configuration: the VC version should be the same as or greater than the ESXi version.  The solution is to upgrade VC to 7.0u3 _before_ upgrading ESXi to 7.0u3.
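
The ordering rule above (VC version the same as or greater than ESXi) can be sketched as a small pre-upgrade check. This is only an illustration: the function names are made up for this example, and it assumes versions are written as plain dotted strings such as "7.0.3" rather than "7.0u3" or build numbers.

```python
def parse_version(text):
    """Turn a dotted version string like '7.0.3' into a tuple of ints: (7, 0, 3)."""
    return tuple(int(part) for part in text.split("."))


def vcenter_can_manage(vcenter_version, esxi_version):
    """Return True when vCenter is the same version as, or newer than, the ESXi host.

    Hypothetical helper: it encodes only the rule stated in this thread,
    not the full VMware interoperability matrix.
    """
    return parse_version(vcenter_version) >= parse_version(esxi_version)
```

With this sketch, `vcenter_can_manage("7.0.2", "7.0.3")` is false, which corresponds to the unsupported case described above where only one vGPU VM powers on per device.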

Seph1
Contributor

We have exactly the same problem. It happens in Horizon instant clone pools and on persistent single VMs. At the moment the resolution drops to 800x600, the corresponding "vmware.log" of the VM shows "Er(02) vthread-2199546 - vmiop_log: (0x0): Timeout occurred, reset initiated.", then many lines of "Er(02) vthread-2199546 - vmiop_log: (0x0): TDR_DUMP:0x00989680 0x00000000 0x000001bb 0x0000000f", and after that "In(05) vthread-2199546 - vmiop_log: (0x0): Guest driver unloaded!".

After that the graphics device "NVIDIA GRID T4-1B" is stopped in the VM with Code 43.

We have VMware ESXi 7.0.3 and NVIDIA GRID 13.0.
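
For anyone who wants to check many desktops at once, here is a rough sketch of scanning vmware.log lines for the three messages quoted above. The patterns are copied from the log lines in this thread; the function names are invented for this example, and allowing any hex value in place of "(0x0)" is an assumption.

```python
import re

# Patterns based on the vmware.log lines quoted in this thread.
# Assumption: the "(0x0)" field may hold other hex values on other systems.
VMIOP_PATTERNS = {
    "timeout_reset": re.compile(
        r"vmiop_log: \(0x[0-9a-fA-F]+\): Timeout occurred, reset initiated\."
    ),
    "tdr_dump": re.compile(r"vmiop_log: \(0x[0-9a-fA-F]+\): TDR_DUMP:"),
    "driver_unloaded": re.compile(
        r"vmiop_log: \(0x[0-9a-fA-F]+\): Guest driver unloaded!"
    ),
}


def scan_vmware_log(lines):
    """Count how many lines match each vGPU timeout-related message."""
    counts = {name: 0 for name in VMIOP_PATTERNS}
    for line in lines:
        for name, pattern in VMIOP_PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


def looks_affected(counts):
    """Heuristic only: treat any timeout/reset message as a sign of this issue."""
    return counts["timeout_reset"] > 0
```

To use it, open a VM's vmware.log and pass the file object (or a list of its lines) to `scan_vmware_log`, then check the result with `looks_affected`.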

 

krd
VMware Employee

We are investigating this issue with NVIDIA.  At first glance this appears to be an NVIDIA issue because their driver initiates Windows TDR (timeout detection and recovery).  At this point we don't know exactly why.

At least in some cases, an incompatible NVIDIA guest driver could contribute to the problem.  For example, with the NVIDIA vGPU 13.0 release, the ESXi host driver should be 470.63 and the Windows guest driver should be 471.68.  I've seen cases where the guest driver is much older.
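
The host/guest pairing mentioned above can be written down as a small lookup so mismatches are easy to spot. This is a sketch, not an official compatibility list: only the 13.0 pairing comes from this thread, other releases are deliberately left out, and the function name is hypothetical.

```python
# Only the vGPU 13.0 pairing is stated in this thread; add other releases
# from NVIDIA's documentation rather than guessing.
VGPU_DRIVER_PAIRS = {
    "13.0": {"host": "470.63", "guest": "471.68"},
}


def check_driver_pair(vgpu_release, host_driver, guest_driver):
    """Return a list of mismatch descriptions, or None if the release is not in the table."""
    expected = VGPU_DRIVER_PAIRS.get(vgpu_release)
    if expected is None:
        return None
    problems = []
    if host_driver != expected["host"]:
        problems.append(f"host driver {host_driver}, expected {expected['host']}")
    if guest_driver != expected["guest"]:
        problems.append(f"guest driver {guest_driver}, expected {expected['guest']}")
    return problems
```

An empty list means the host and guest drivers match the pairing for that release; `None` means the table has no entry for it.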

jmacdaddy
Enthusiast

I can tell you that we started off with NVIDIA GRID 13.0 (470.63 host and 471.68 Windows drivers), then rolled back the Windows driver to the 12.1 version, then rolled the host driver back to the 12.1 version.  Nothing helped except rolling the hypervisor back from 7.0.3 to 7.0.2, which immediately fixed the problem.

mhingu
Contributor

@krd, the NVIDIA driver initiates TDRs very frequently on ESXi 7.0.3, but the same driver does not on ESXi 7.0.2, and rolling back from ESXi 7.0.3 to ESXi 7.0.2 fixes the issue. Any idea whether something has changed in ESXi 7.0.3 (or with the new hardware version) that could impact the NVIDIA driver?

BrianBurdorf
Contributor

Not sure if it helps with the troubleshooting, but we are able to crash the driver on demand using the Valley benchmark tool.  After rolling back to 7.0.2 with the latest NVIDIA 13.0 VIB, we can't reproduce the crash.

https://benchmark.unigine.com/valley

 

 

krd
VMware Employee

That's helpful.  What vGPU profile are you using?

BrianBurdorf
Contributor

@krd The card is an RTX8000, with both 4Q and 8Q profiles.

Seph1
Contributor

Hello,

Should we roll back to ESXi 7.0.2 or wait for a hotfix for 7.0.3?

Thank you

jmacdaddy
Enthusiast

If you are actively experiencing the issue, then it depends on your pain tolerance.  I haven't heard anything from VMware or NVIDIA about an impending fix, so you might want to test a rollback on one of your hosts to see how long the process will take.

Seph1
Contributor

Unfortunately I cannot rollback the version to 7.0.2 without a new installation because the alternative bootbank also has version 7.0.3 and esxcli software profile update --allow-downgrades gives me "Downgrade ESXi from version 7.0.3 to 7.0.2 is not supported".

Is there any news from NVIDIA or VMware about this TDR issue?
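
The rollback situation described above can be summed up as a small decision helper. It is a sketch based only on what was reported in this thread (a bootbank rollback helps only if the alternate bootbank still holds the target version, and the in-place downgrade was rejected); the function name and return strings are invented, and how the bootbank versions are read is left out.

```python
def plan_downgrade(current_version, alt_bootbank_version, target_version):
    """Suggest how to get a host to target_version, per the experience in this thread.

    Hypothetical helper: versions are plain strings such as "7.0.2".
    """
    if current_version == target_version:
        return "nothing to do"
    if alt_bootbank_version == target_version:
        # The previous image is still in the alternate bootbank.
        return "bootbank rollback"
    # Both bootbanks hold the newer version, and the in-place downgrade
    # was reported here as "not supported".
    return "fresh install"
```

In the case described above, `plan_downgrade("7.0.3", "7.0.3", "7.0.2")` gives "fresh install".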

BrianBurdorf
Contributor

@Seph1 I had the same problem. Luckily I had hardware and built a new cluster with the downgraded 7.0.2, then migrated the VDIs, which I'm still doing a week later 😞

I have an open ticket with NVIDIA, but no word yet on an updated VIB/driver.  When I hear anything I'll post to the group.

jmacdaddy
Enthusiast

Yeah, I had to rebuild my hosts from scratch as well.  Tried the rollback but same as you - 7.0.3 was my alt bootbank.  Thought for sure I had done something wrong that caused it.  Makes me feel a little better I guess.

BrianBurdorf
Contributor

NVIDIA released new drivers (13.1).  Per NVIDIA support, I am going to test them with 7.0.3; if this doesn't work, they are escalating to engineering.  #FingersCrossed 🙂

Seph1
Contributor

Our NVIDIA drivers got updated to 13.1 (host drivers and client drivers) last week, but unfortunately the problems still exist.

tomiboy78
Contributor

Same here! The problem still exists with the latest 7.0.3 and the 470.82 NVIDIA drivers.
