Anyone else seeing this? After upgrading hosts from 7.0.2 to 7.0.3, a fraction (10-15%) of the virtual desktops just lose access to the vGPU during the day, totally at random. The screen goes black and then comes back at 800x600 resolution. Rebooting the virtual desktop eventually fixes it. In the vmware.log of an affected desktop you will see:
" Er(02) vthread-2108911 - vmiop_log: (0x0): Timeout occurred, reset initiated."
Tesla T4 cards in our case. Trying different NVIDIA guest drivers and host VIBs (12.1, 13.0) doesn't help. Only rolling the hosts back to 7.0.2 fixes it.
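For anyone trying to spot affected desktops in bulk, the log signature above can simply be grepped for. A minimal sketch, with an assumed sample file standing in for a VM's vmware.log (on a real datastore the path would be along the lines of /vmfs/volumes/<datastore>/<vm-name>/vmware.log):

```shell
# Sample log standing in for a VM's vmware.log -- path and contents are
# illustrative, the error line is the signature quoted in this thread:
mkdir -p /tmp/vm-demo
cat > /tmp/vm-demo/vmware.log <<'EOF'
In(05) vthread-2108911 - vmiop_log: (0x0): Guest driver loaded
Er(02) vthread-2108911 - vmiop_log: (0x0): Timeout occurred, reset initiated.
EOF
# List every VM log that contains the vGPU TDR signature:
grep -l "Timeout occurred, reset initiated" /tmp/vm-demo/*.log
```

On a real host you would point the glob at the VM directories on your datastores instead of /tmp.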
We are seeing the same thing; I'm starting the revert process as well. We were advised to upgrade to 7.0.3 to fix another PSOD issue with 7.0.2.
Same here, although we couldn't get vGPU to work at all with any of our GPUs except the A40s. Older cards like the P4 and T4 would not allow vGPU to work and would only boot one VM per graphics card. We had to revert all of our servers as well.
@ericeby Do your vGPU VMs fail to power on with the error "No host is compatible with the virtual machine" when running multiple VMs on a GPU? If yes, make sure your vCenter version is also 7.0.3 (and not 7.0.2) along with your ESXi version.
I verified that ESXi 7.0u3 with an M10 device managed by VC 7.0u3 powers on vGPU VMs correctly, with multiple M10 vGPU VMs assigned to each device. If the same ESXi is managed by VC 7.0u2, however, it powers on only a single vGPU VM per device; this is an unsupported configuration. The VC version should be the same as or greater than the ESXi version, so the solution is to upgrade VC to 7.0u3 _before_ upgrading ESXi to 7.0u3.
We have exactly the same problem. It happens in Horizon instant-clone pools and on persistent single VMs. At the moment the resolution drops to 800x600, the VM's vmware.log shows "Er(02) vthread-2199546 - vmiop_log: (0x0): Timeout occurred, reset initiated.", then many lines of "Er(02) vthread-2199546 - vmiop_log: (0x0): TDR_DUMP:0x00989680 0x00000000 0x000001bb 0x0000000f", and finally "In(05) vthread-2199546 - vmiop_log: (0x0): Guest driver unloaded!".
After that the graphics device "NVIDIA GRID T4-1B" is stopped in the VM with Code 43.
We have VMware ESXi 7.0.3 and NVIDIA GRID 13.0.
We are investigating this issue with NVIDIA. At first glance this appears to be an NVIDIA issue because their driver initiates Windows TDR (timeout detection and recovery). At this point we don't know exactly why.
At least in some cases, an incompatible NVIDIA guest driver could contribute to the problem. For example, for the NVIDIA vGPU 13.0 release, the ESXi host driver should be 470.63 and the Windows guest driver 471.68. I've seen cases where the guest driver is much older.
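A quick way to sanity-check that pairing is to pull the driver version out of nvidia-smi output on both the host (vGPU manager) and inside the guest. A minimal sketch; the banner line below is a stand-in for real nvidia-smi output, and the expected versions are the 13.0 pair mentioned above:

```shell
# Stand-in for the banner line nvidia-smi prints; on a real host or guest,
# capture it with:  nvidia-smi | grep "Driver Version"
smi_line="| NVIDIA-SMI 470.63       Driver Version: 470.63       CUDA Version: N/A  |"
# Extract just the version number:
ver=$(printf '%s\n' "$smi_line" | sed -n 's/.*Driver Version: *\([0-9.]*\).*/\1/p')
# Compare against the expected host driver for the vGPU 13.0 release:
if [ "$ver" = "470.63" ]; then
    echo "host driver matches the vGPU 13.0 release"
else
    echo "unexpected host driver: $ver"
fi
```

The same extraction with 471.68 as the expected value would check the Windows guest side.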
I can tell you that we started off with NVIDIA GRID 13.0 (470.63 host and 471.68 Windows drivers), then rolled back the Windows driver to the 12.1 version, then rolled the host driver back to the 12.1 version. Nothing helped except rolling the hypervisor back from 7.0.3 to 7.0.2, which immediately fixed the problem.
@krd, the NVIDIA driver initiates TDRs on ESXi 7.0.3 very frequently, but the same driver does not on ESXi 7.0.2, and rolling back from ESXi 7.0.3 to ESXi 7.0.2 fixes the issue. Any idea if something changed in ESXi 7.0.3 (or with the new hardware version) that could impact the NVIDIA driver?
Not sure if it helps with the troubleshooting, but we are able to crash the driver on demand using the Unigine Valley benchmark tool. After rolling back to 7.0.2 with the latest NVIDIA 13.0 VIB we can't reproduce the crash.
https://benchmark.unigine.com/valley
That's helpful. What vGPU profile are you using?
@krd The card is an RTX 8000, with both 4Q and 8Q profiles.
Hello,
Should we roll back to ESXi 7.0.2 or wait for a hotfix for 7.0.3?
Thank you
If you are actively experiencing the issue, then it depends on your pain tolerance. I haven't heard from VMware or NVIDIA about an impending fix, so you might want to test a rollback on one of your hosts to see how long the process will take.
Unfortunately I cannot roll back to 7.0.2 without a fresh installation, because the alternate bootbank also has version 7.0.3, and "esxcli software profile update --allow-downgrades" gives me "Downgrade ESXi from version 7.0.3 to 7.0.2 is not supported".
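For anyone checking whether the boot-time revert (Shift+R at boot) is even an option before trying it: the previous image lives in /altbootbank, and its boot.cfg carries a build= line. A minimal sketch, using an assumed sample file; on a real host you would read /bootbank/boot.cfg and /altbootbank/boot.cfg directly (the build string here is a placeholder, not a real build number):

```shell
# Illustrative stand-in for /altbootbank/boot.cfg on an affected host:
mkdir -p /tmp/altbootbank
printf 'bootstate=0\nbuild=7.0.3-x.y.zzzzzzzz\n' > /tmp/altbootbank/boot.cfg
# If the alternate bootbank also carries a 7.0.3 build, the boot-time revert
# cannot take you back to 7.0.2 -- matching the fresh-install situation above.
grep '^build=' /tmp/altbootbank/boot.cfg
```

If both bootbanks report 7.0.3, rebuilding the host is the remaining path back to 7.0.2.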
Are there any news from NVIDIA or VMware related to this TDR issue?
@Seph1 I had the same problem; luckily I had hardware and built a new cluster on the downgraded 7.0.2, then migrated the VDIs, which I'm still doing a week later 😞
I have an open ticket with NVIDIA; no update yet on any fixed VIB/driver. When I hear anything I'll post to the group.
Yeah, I had to rebuild my hosts from scratch as well. Tried the rollback but same as you - 7.0.3 was my alt bootbank. Thought for sure I had done something wrong that caused it. Makes me feel a little better I guess.
NVIDIA released new drivers (13.1). Per NVIDIA support I am going to test them with 7.0.3; if this doesn't work they are escalating to engineering. #FingersCrossed 🙂
Our NVIDIA drivers got updated to 13.1 (host drivers and client drivers) last week, but unfortunately the problems still exist.
Same here! The problem still exists with the latest 7.0.3 and 470.82 NVIDIA drivers.