jmacdaddy
Enthusiast

Nvidia vGPU causing random desktop lock-ups

Horizon 8.6.0

vSphere 7.0.3, latest patches

Dell R740 hosts - latest firmware - SR-IOV enabled, Performance OS Controlled power profile

NVIDIA 14.2 vGPU drivers on hosts and virtual desktops

Tesla T4 - T4-2B profiles, ECC disabled

Blast protocol

Virtual Desktop OS:  Windows 10 - 21H2

Primary apps - Autodesk Civil 3D

Instant Clone virtual desktop sessions randomly lock up. When we SSH to the host running the desktop and run the command "nvidia-smi vgpu", we see the following, with the highlighted entry being the impacted virtual desktop:

[Screenshot: "nvidia-smi vgpu" output, impacted desktop highlighted at 99% utilization]

A hard reset of the impacted virtual desktop resolves the issue, but because the desktop is reset, the user loses their work.

The occurrence is totally random with respect to host, desktop, user, or the applications that the user has open in their session. We have seen this occur on sessions where the user has nothing but a web browser and Outlook open.

If other desktops with active sessions are sharing the physical GPU and we don't reset the impacted desktop quickly, their sessions will freeze as well. However, resetting the "99%" machine will unfreeze those desktops, and the users will be able to reconnect without having lost work.
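Because the "99%" desktop has to be found before the other sessions on the card freeze, the check could be scripted. This is only a minimal sketch: it assumes the "nvidia-smi vgpu" output puts the utilization as the last percentage on each line, and that a Python 3 interpreter is available wherever the output is processed. The function names are made up for illustration.

```python
import re
import subprocess
from typing import List

# Matches utilization tokens such as "99%" anywhere in a line.
PERCENT = re.compile(r"(\d{1,3})\s*%")


def find_pegged_lines(smi_output: str, threshold: int = 99) -> List[str]:
    """Return output lines whose last percentage is at or above threshold.

    Assumption: utilization is the last "NN%" token on a line. The physical
    GPU line may be flagged too if the whole card reports the same value.
    """
    pegged = []
    for line in smi_output.splitlines():
        values = [int(v) for v in PERCENT.findall(line)]
        if values and values[-1] >= threshold:
            pegged.append(line.strip())
    return pegged


def read_vgpu_status() -> str:
    """Run "nvidia-smi vgpu" on the host and return its text output."""
    result = subprocess.run(
        ["nvidia-smi", "vgpu"], capture_output=True, text=True, check=True
    )
    return result.stdout


# Example use on a host (not executed here):
#   for line in find_pegged_lines(read_vgpu_status()):
#       print("Possible stuck vGPU:", line)
```

Flagged lines would still need to be matched to the VM name by eye, and a desktop that is legitimately busy could be flagged as well, so this is a prompt to look rather than a trigger for an automatic reset.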

This issue started in early June of this year.

VMware Support shrugs its shoulders and says to ask NVIDIA. NVIDIA support says the 99% utilization is a symptom of frame buffer exhaustion, which is silly. You can see from the screenshot above that nothing else is actively occurring on the card. There is no correlation to GPU-intensive tasks being performed on the desktop that would cause this "exhaustion". Switching from a T4-1B to a T4-2B profile has caused the issue to occur less frequently. But this never happened before June of this year. There were no major changes to the applications being used on the desktops or to the intensity of the tasks being performed by the users. It doesn't make sense.

Any thoughts would be appreciated.

4 Replies
adc_1997
Contributor

@jmacdaddy - I just added a reply on the other thread, "vSphere ESXi 7.0.3 - 7.0 Update 3 bug with NVidia Grid vGPU", with our experience of this same issue.

I thought I would add details of our environment to see if there are any parallels:

Horizon 7.13.2
vSphere 7.0.3, latest patches (20328353)
HPE DL385 Gen10 Plus v2 hosts, AMD EPYC 7543 - "Virtualization - Max Performance" workload profile
NVIDIA 13.4 vGPU drivers on hosts and virtual desktops
Tesla T4 - T4-2B profiles, ECC disabled (NVIDIA also suggested trying the T4-2Q profiles; we currently have a temporary license for that, but it made no difference)
PCoIP protocol
Virtual Desktop OS:  Windows 10 - 21H2

JordyGB
Contributor

@jmacdaddy 

Are you still experiencing this issue, or did you manage to resolve it?
We are experiencing the same issue, with VMs randomly locking up.

We still have to check the load with the nvidia-smi vgpu command, though.

zsalazar
Contributor

Hmmm, I'm just thinking basic IT stuff at this point. If you can't narrow the culprit down to, for example, a specific user, application, time of day, or team, then I'd start looking at the GRID drivers, ESXi drivers, OS updates for Windows or Linux, or Horizon Agent drivers, whichever is easiest for you.

Also, changing GRID profiles may help, but I expect the problem to come back again, just with more time in between.

LukaszDziwisz
Hot Shot

We are in a very similar situation lately: we are experiencing random lock-ups/freezes for random users, but when we look at the host we don't see the graphics card going to 99%; it barely shows 1-9% when that happens. The VM shows up in Horizon as Agent Unreachable. We cannot remotely manage it, browse to it, or collect any logs from it. The only things we can do are ping it and then remove it from Horizon, which also takes a long time before it finally gets deleted in vSphere, as I think even VMware Tools is impacted at that point.

We are in a slightly different scenario, being on ESXi 7U3, Horizon 2206 and Tesla M10s, but I was hoping that you have made some progress on this issue and could share the resolution.

Thanks in advance
