Horizon 8.6.0
vSphere 7.0.3, latest patches
Dell R740 hosts - latest firmware - SR-IOV enabled, Performance OS Controlled power profile
NVIDIA 14.2 vGPU drivers on hosts and virtual desktops
Tesla T4 - T4-2B profiles, ECC disabled
Blast protocol
Virtual Desktop OS: Windows 10 - 21H2
Primary apps - Autodesk Civil 3D
Instant Clone virtual desktop sessions randomly lock up. When we SSH to the host running the desktop and run the command "nvidia-smi vgpu", we see the following, with the impacted virtual desktop highlighted and showing 99% utilization:
A hard reset of the impacted virtual desktop resolves the issue, but because the desktop is reset, the user loses their work.
The occurrence is totally random with respect to host, desktop, user, or the applications that the user has open in their session. We have seen this occur on sessions where the user has nothing but a web browser and Outlook open.
If other desktops with active sessions are sharing the physical GPU and we don't reset the impacted desktop quickly, their sessions will freeze as well. However, resetting the "99%" machine will unfreeze those desktops, and their users will be able to reconnect without having lost work.
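Since the stuck desktop has to be reset before its neighbours on the same physical GPU freeze, the manual check could be scripted. The following is only a minimal sketch: it assumes the text printed by "nvidia-smi vgpu" has already been captured as a string, and it makes no assumption about the table layout beyond utilization appearing as a number followed by "%". The function name and the default threshold of 99 are illustrative, not part of any NVIDIA or VMware tooling.

```python
import re

# Matches a 1-3 digit number followed by a percent sign, e.g. "99%" or "7 %".
PERCENT = re.compile(r"(\d{1,3})\s*%")


def flag_busy_lines(smi_output, threshold=99):
    """Return the lines of captured `nvidia-smi vgpu` text that contain
    a percentage at or above `threshold`.

    Hypothetical helper: it only scans for percentages, so it reports
    whole lines and leaves identifying the VM to the reader.
    """
    flagged = []
    for line in smi_output.splitlines():
        values = [int(v) for v in PERCENT.findall(line)]
        if values and max(values) >= threshold:
            flagged.append(line.strip())
    return flagged
```

The captured text could come from a saved file or from something like subprocess.run(["nvidia-smi", "vgpu"], capture_output=True, text=True).stdout on a machine where that command is available. A line that is flagged on several consecutive polls would be the candidate for a reset.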
This issue started in early June of this year.
VMware Support shrugs its shoulders and says to ask NVIDIA. NVIDIA support says the 99% utilization is a symptom of frame buffer exhaustion, which is silly. You can see from the screenshot above that nothing else is actively occurring on the card. There is no correlation with GPU-intensive tasks being performed on the desktop that would cause this "exhaustion". Switching from a T4-1B to a T4-2B profile has made the issue occur less frequently. But this never happened before June of this year. There were no major changes to the applications being used on the desktops or to the intensity of the tasks being performed by the users. It doesn't make sense.
Any thoughts would be appreciated.
@jmacdaddy - I just added a reply on the other thread, "vSphere ESXi 7.0.3 - 7.0 Update 3 bug with NVidia Grid vGPU", with our experience of this same issue.
I thought I would add details of our environment to see if there are any parallels -
Horizon 7.13.2
vSphere 7.0.3, latest patches (20328353)
HPE DL385 Gen10 Plus v2 hosts, AMD EPYC 7543 - "Virtualization - Max Performance" workload profile
NVIDIA 13.4 vGPU drivers on hosts and virtual desktops
Tesla T4 - T4-2B profiles, ECC disabled (NVIDIA also suggested trying the T4-2Q profiles; we currently have a temporary license for that, but it made no difference)
PCoIP protocol
Virtual Desktop OS: Windows 10 - 21H2
Are you still experiencing this issue, or did you manage to resolve it?
We are experiencing the same issue with VMs randomly locking up.
We still have to check the load with the nvidia-smi vgpu command, though.
Hmmm, I'm just thinking basic IT stuff at this point. If you can't narrow the culprit down to, for example, a specific user, application, time of day, or team, then I'd start looking at the GRID drivers, the ESXi drivers, OS updates for Windows or Linux, or the Horizon Agent drivers, in whichever order is easiest for you.
Changing GRID profiles may also help, but I expect the problem to come back again, just with more time in between.
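One way to test the "no common culprit" assumption is to log every lockup and tally the records by attribute. The sketch below is illustrative only: the record keys (host, user, driver, hour) are hypothetical, and any real log would use whatever fields are actually being collected.

```python
from collections import Counter


def tally_incidents(incidents, fields):
    """Count how often each value of each field appears across incidents.

    `incidents` is a list of dicts, one per lockup; a field missing from
    a record is counted under "unknown".
    """
    return {
        field: Counter(rec.get(field, "unknown") for rec in incidents)
        for field in fields
    }


def dominant_values(tallies, total, share=0.5):
    """Return {field: value} where one value accounts for more than
    `share` of all incidents, i.e. a candidate common factor."""
    hits = {}
    for field, counts in tallies.items():
        if not counts:
            continue
        value, count = counts.most_common(1)[0]
        if total and count / total > share:
            hits[field] = value
    return hits
```

If dominant_values comes back empty after a reasonable number of incidents, that supports moving on to the driver, hypervisor, and agent layers; if a single host or driver version dominates, that is the place to start.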
We have been in a very similar situation lately: random lockups/freezes for random users. However, when we look at the host, the graphics card is not at 99%; it barely shows 1-9% when this happens. The VM shows up in Horizon as Agent Unreachable. We cannot remotely manage it, browse to it, or collect any logs from it. The only things we can do are ping it and then remove it from Horizon, which also takes a long time before it is finally deleted in vSphere; I think even VMware Tools is impacted at that point.
Our scenario is slightly different (ESXi 7U3, Horizon 2206, and Tesla M10s), but I was hoping that you have made some progress on this issue and could share the resolution.
Thanks in advance