VMware Horizon Community
jmacdaddy
Enthusiast

vSphere ESXi 7.0.3 - 7.0 Update 3 bug with NVidia Grid vGPU

Anyone else seeing this? After upgrading hosts from 7.0.2 to 7.0.3, a fraction (10-15%) of the virtual desktops lose access to the vGPU during the day, seemingly at random. The screen goes black and then comes back at 800x600 resolution. Rebooting the virtual desktop (sometimes more than once) eventually fixes it. In the vmware.log of an affected desktop you will see:

" Er(02) vthread-2108911 - vmiop_log: (0x0): Timeout occurred, reset initiated."

Tesla T4 cards in our case.  Trying different NVidia guest drivers and host VIBs (12.1, 13.0) doesn't help.  Only rolling back the hosts to 7.0.2 fixes it.
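
If you want to check a batch of desktops for the log line quoted above, something like this Python sketch may help. It assumes you have copied each VM's vmware.log somewhere readable and that the message text matches what is quoted in this post (the prefix and thread ID will differ). The function names are purely illustrative and not part of any VMware or NVIDIA tooling.

```python
import re
import sys

# Assumption: the message text matches the line quoted in the post above.
# The "Er(02) vthread-..." prefix varies, so only the stable part is matched.
TIMEOUT_RE = re.compile(r"vmiop_log:.*Timeout occurred, reset initiated")


def find_vgpu_timeouts(lines):
    """Return (line_number, text) for every line containing the timeout message."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if TIMEOUT_RE.search(line)
    ]


def scan_log(path):
    """Scan one vmware.log file; undecodable bytes are replaced, not fatal."""
    with open(path, errors="replace") as handle:
        return find_vgpu_timeouts(handle)


if __name__ == "__main__":
    # Pass one or more log file paths on the command line.
    for log_path in sys.argv[1:]:
        for number, text in scan_log(log_path):
            print(f"{log_path}:{number}: {text}")
```

Run it with the log paths as arguments; any desktop that prints a hit has gone through at least one vGPU reset.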

116 Replies
RyanHardy
Enthusiast

LouisA1
Contributor

@RyanHardy From my understanding there is no workaround, but they were going to release a patch to fix it. Not sure if they have or not; I was told not to install any more patches until the hotfix came out.

RyanHardy
Enthusiast

Hm, well, at least for me the host's warning sign disappeared after running the commands in my linked article. What's funny is that I had no NTP problems with the 7.0U3b build, but now with the hotpatch I saw those too. Could be random though, as others have already had NTP issues with other 7.0U3 builds.

cristianomeloni
Enthusiast

LouisA1
Contributor

FYI,

I received the patch on Friday and installed it on all hosts; so far it appears to have fixed the issue. I am now on version 19037457. After the patch, Update Manager shows that I need 7 patches, but tech support advised me to wait until the next build is released and not to install the other patches.

Eric-Champagne
Contributor

@LouisA1: You are lucky! I never got any patch, and my SR is still open. I tried to reach them a couple of times last week. They answered that they are overwhelmed with the Apache Log4j issue for the moment... Sorry for the delay.

I'm lucky that @RyanHardy shared the patch with me. The patch fixed all my servers... Thanks again.

I can't believe all the mess at VMware right now... unbelievable.

RyanHardy
Enthusiast


@LouisA1 wrote:

I am now on version 19037457.


Interesting, your build is even more recent than the one I got. I hope they manage to keep track of which build fixed what...

 

@Eric-Champagne: Now that is a really bad support experience... I would talk with whoever sold you the product and demand your money back, or at least some additional months of support for free.

Arvee42
Contributor

How does one get the 19026913 patch? I've been fighting with this vGPU issue and Log4j for what seems like an eternity now. The company owners are losing faith in VMware as a path to go any further down, even though it is the best fit for what they want, IMHO.

RyanHardy
Enthusiast

@Arvee42 I can send it to you if you still need it. Use at your own risk though.

i4steck
Enthusiast

@RyanHardy 
We also have the same issue with NVIDIA Tesla T4.
Can you please send me the hotpatch?

gabor1
Contributor

Similar problem here with Tesla M10. Can anyone provide me with this patch, or confirm it has been resolved with vCenter/ESXi 7U3c? Thanks a lot.

RyanHardy
Enthusiast

I'm sure they have included the fix for this issue in U3c too - it would be very bad practice to do otherwise. I am waiting a few more days before updating my hosts to U3c though, as my hot-patched hosts are running flawlessly atm.

krd
VMware Employee

@gabor1, yes ESXi 7.0u3c resolves this VDI vGPU issue.  The release notes include:

  • Virtual desktop infrastructure (VDI) might become unresponsive due to a race condition in the VMKAPI driver

    Event delivery to applications might delay indefinitely due to a race condition in the VMKAPI driver. As a result, the virtual desktop infrastructure in some environments, such as systems using NVIDIA graphic cards, might become unresponsive or lose connection to the VDI client.

    This issue is resolved in this release.

i4steck
Enthusiast

We now have the latest version actively in use.
No more interruptions detectable.

VMVSF
Contributor

Can anyone confirm that 7.0u3c definitively fixes the vGPU bug? We've been on a broken 7.0u3a for a while and are now planning to bite the bullet and upgrade next week in hopes the vGPU bugs are gone!

Thanks.

LouisA1
Contributor

Yes, the issue was fully resolved.

VMVSF
Contributor

Great! Thanks, we'll give it a go then. Didn't want to go to 7.0 u3d as it's too new, but haven't heard much bad about u3c and really want to use vGPU again!

Thank you again!

BrianBurdorf
Contributor

@VMVSF I would hold off if you can. We were told to update to fix another issue, which it did, but we've been having issues with VMs powering on since then. I have 2 tickets open with NVIDIA; so far they haven't said whether it's a VMware issue or not. I've tried VIBs 13.2 and 14.0 with no change. It's also random: everything will work great for a week, then it's a week of issues.

 

krd
VMware Employee

@BrianBurdorf and @VMVSF I've been monitoring ESXi 7.0u3c and 7.0u3d activity, and I have not seen any vGPU VDI-related issues. I also discuss open vGPU issues with NVIDIA on a weekly basis, and we are not tracking any open VDI issue. As far as I can tell, ESXi 7.0u3c and 7.0u3d are stable for VDI (much better than the original ESXi 7.0u3).

kanid99
Enthusiast

We are experiencing an issue similar to the one originally described. The user's session 'freezes' on the client side. If they reset the client and reconnect, they have a 'small screen' and Device Manager shows the display drivers in a <!> state. Disabling and re-enabling the drivers fixes the issue with a subsequent session reconnect.

The odd thing is that this started happening to us when we were still on ESXi 6.7 and Horizon 2006, sometime around early March after patches were applied in February. We upgraded to 7.0u3 in April/May and the issue has only been getting worse with time.

I have a case open with support, but they are refusing to move forward with log analysis because our Windows thin clients are not certified for Horizon 2111 - even though they are just Windows 10 machines running the latest Horizon 8 client!
