VMware Horizon Community
jmacdaddy
Enthusiast

vSphere ESXi 7.0.3 - 7.0 Update 3 bug with NVidia Grid vGPU

Anyone else seeing this? After upgrading hosts from 7.0.2 to 7.0.3, a fraction (10-15%) of the virtual desktops lose access to the vGPU at random points during the day. The screen goes black and then comes back at 800x600 resolution. Rebooting the virtual desktop (sometimes more than once) eventually fixes it. In the vmware.log of the affected desktop you will see:

" Er(02) vthread-2108911 - vmiop_log: (0x0): Timeout occurred, reset initiated."

Tesla T4 cards in our case. Trying different NVIDIA guest drivers and host VIBs (12.1, 13.0) doesn't help. Only rolling back the hosts to 7.0.2 fixes it.
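
If you want to check a batch of desktops for the same symptom, a rough sketch like the one below may help. It assumes you have copied the VM folders (each containing a vmware.log) somewhere Python can read them; the function names and the directory layout are made up for illustration, and the pattern is taken only from the log line quoted above.

```python
import re
from pathlib import Path

# Matches the message quoted above. The number after "vthread-" differs
# per VM, so it is deliberately not part of the pattern.
RESET_PATTERN = re.compile(r"vmiop_log: \(0x0\): Timeout occurred, reset initiated")


def find_vgpu_resets(log_text):
    """Return the lines of a vmware.log that contain the vGPU timeout/reset message."""
    return [line.strip() for line in log_text.splitlines() if RESET_PATTERN.search(line)]


def scan_logs(root):
    """Map each vmware.log found under `root` to its matching lines (files with hits only)."""
    hits = {}
    for path in Path(root).rglob("vmware.log"):
        matches = find_vgpu_resets(path.read_text(errors="replace"))
        if matches:
            hits[str(path)] = matches
    return hits
```

Calling `scan_logs("/path/to/copied/vm/folders")` (a hypothetical path) returns a dict of affected log files, which gives a quick count of how many desktops hit the reset.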

116 Replies
jmacdaddy
Enthusiast

Tomiboy78 - What build is your ESXi host on? 

mhingu
Contributor

@tomiboy78 Can you confirm your ESXi build number? Is it build 18644231 (ESXi 7.0 U3) or 18825058 (ESXi 7.0 U3a)?
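
For anyone comparing several hosts, here is a small sketch that labels a build number using only the two builds named above. It assumes the version string has the form that `vmware -v` prints on the host (for example "VMware ESXi 7.0.3 build-18825058"); the function name and the fallback messages are made up for illustration.

```python
import re

# Build numbers exactly as quoted in this thread; extend the table as needed.
KNOWN_BUILDS = {
    18644231: "ESXi 7.0 U3",
    18825058: "ESXi 7.0 U3a",
}


def identify_build(version_string):
    """Extract the build number from a version string and label it if it is known."""
    match = re.search(r"build-(\d+)", version_string)
    if match is None:
        return None, "no build number found"
    build = int(match.group(1))
    return build, KNOWN_BUILDS.get(build, "not one of the builds named in this thread")


# Assumed sample input, not output captured from a real host.
print(identify_build("VMware ESXi 7.0.3 build-18825058"))
```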

tomiboy78
Contributor

18825058

tomiboy78
Contributor

We were able to reproduce the bug on a single node. From what we can tell, several conditions have to be met:

- the node needs a certain load
- the card must be loaded by the VM (we use Unigine Benchmark)
- the error occurs after an unpredictable time
- Zero clients with PCoIP always seem to have the problem in our environment
- Blast clients don't seem to have a problem

The error pattern on the PCoIP zero clients is:

- Loss of the 2nd screen (FHD)
- Fallback to 1024x768
- flashing screen

jmacdaddy
Enthusiast

All of our endpoints are Windows desktops and the users all connect with Blast, so I can tell you that the problem occurs regardless of Blast or PCoIP.

krd
VMware Employee

FYI - we have identified the root cause and are working on an ESXi 7.0u3 fix. Initial testing is positive.

RShumway
Contributor

Any idea when a fix will be released? This has been causing us issues for the last two weeks and I am about to roll my hosts back to 7.0.2 this weekend.

krd
VMware Employee

I'm not sure of the exact timeframe, but it could be weeks. If this is causing issues now, I suggest rolling back is prudent.

cristianomeloni
Enthusiast

Even with the KB from NVIDIA, I still have trouble with the Tesla T4 (Dell R740, ESXi 7.0.3 build 18644231) using NVD-VGPU_470.82-1OEM.702.0.0.17630552.

https://enterprise-support.nvidia.com/s/article/NVIDIA-vGPU-manager-VIB-installation-failure-after-u...

Following this guide, the driver comes up on the P40 but fails on the Tesla T4. The hosts are identical in both hardware and software.

Since a downgrade is not possible (VxRail Appliance), I'm stuck! Any ideas?

VanKhiem
Contributor

Check this KB. At least it helped me to solve the "Expected 1 component, found 2" issue. https://kb.vmware.com/s/article/85982

BR, Khiem

LouisA1
Contributor

What is the fastest way to get notified once a patch is released? Hard to believe it has taken this long and still nothing, not even a quick-fix patch, has been released yet!

RyanHardy
Enthusiast

Same problem here: VDI cluster was updated from 7.0U2d to 7.0U3b yesterday with latest Nvidia drivers on host and VMs and still seeing random disconnects and loss of vGPU.

Where oh where do you go, VMware? Lately every update breaks more than it fixes, it seems.

RyanHardy
Enthusiast

Additional note: has anyone checked if the error (that occurs randomly for most of you) maybe just occurs when DRS is moving VMs? Maybe we can minimize the impact of this bug with DRS set to "Partially Automated".

krd
VMware Employee

@RyanHardy The root cause of this particular issue is related to communication between the NVIDIA vmkernel driver and its associated VMs. The issue can occur on an individual host regardless of DRS and vMotion. A fix has been identified and will be released in an ESXi patch.

LouisA1
Contributor

Any idea of time frame at this point?

RyanHardy
Enthusiast

Thanks for the insight. Can't believe U3b didn't include this fix though as it is a showstopper for all us VDI users. How can a bug like this stay unfixed for so long?

As I am using vSAN for VDI too, I can't even go back to U2d, which means our users are getting really angry with us. I have opened a ticket and am hoping to get U3c a little earlier...

krd
VMware Employee

I can't give a specific date, but my understanding is that the next patch is a few weeks out.

LouisA1
Contributor

Man that is not good to say the least!  This is a major issue and the best they can do is this?  😕

RyanHardy
Enthusiast

It's plain unacceptable, is what it is.

LouisA1
Contributor

I just got word from technical support that the approximate ETA for a hot patch is the 24th of this month. They are also currently testing on various host models; so far so good.
