Anyone else seeing this? After upgrading hosts from 7.0.2 to 7.0.3, a fraction (10-15%) of the virtual desktops just lose access to the vGPU during the day, totally at random. The screen goes black and then comes back at 800x600 resolution. Rebooting the virtual desktop eventually fixes it. In the vmware.log of an affected desktop you will see:
" Er(02) vthread-2108911 - vmiop_log: (0x0): Timeout occurred, reset initiated."
Tesla T4 cards in our case. Trying different NVIDIA guest drivers and host VIBs (12.1, 13.0) doesn't help; only rolling the hosts back to 7.0.2 fixes it.
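For anyone trying to spot affected desktops before users call in, here's a rough sketch of a log scan you can run from the ESXi shell. The datastore layout and the exact signature match are assumptions based on the log line quoted above; adjust both for your environment:

```shell
# Sketch: scan every VM's vmware.log under a datastore root for the
# vGPU timeout signature quoted above. Paths assume the default
# /vmfs/volumes/<datastore>/<vm-folder>/vmware.log layout.
scan_vgpu_timeouts() {
    root="${1:-/vmfs/volumes}"
    for log in "$root"/*/*/vmware.log; do
        grep -q "vmiop_log: (0x0): Timeout occurred, reset initiated" "$log" 2>/dev/null \
            && echo "affected: $log"
    done
}

# On an ESXi host:
#   scan_vgpu_timeouts
```

This only finds desktops that have already hit the timeout at least once since their last power-on, since vmware.log rotates on VM power cycles.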
Tomiboy78 - What build is your ESXi host on?
@tomiboy78 Can you confirm your ESXi build number? Is it build 18644231 (ESXi 7.0 U3) or 18825058 (ESXi 7.0 U3a)?
18825058
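For anyone else double-checking their hosts: on the ESXi shell, `vmware -v` prints the version and build string (`esxcli system version get` works too). A minimal sketch of pulling out just the build number; the string below is hard-coded for illustration, substitute the real command output:

```shell
# On a live host you would capture this with: ver=$(vmware -v)
# Sample string in the usual "build-XXXXXXXX" format, hard-coded for illustration:
ver="VMware ESXi 7.0.3 build-18825058"
build="${ver##*build-}"   # strip everything up to and including "build-"
echo "$build"             # -> 18825058
```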
We were able to reproduce the bug on a single node. From our point of view, several conditions seem to have to be met:
- the node needs a certain load
- the card must be loaded by the VM (we use Unigine Benchmark)
- the error occurs after an unpredictable time
- Zero clients with PCoIP always seem to hit the problem for us
- Blast clients don't seem to have a problem
The error pattern on the PCoIP Zero clients is:
- Loss of the 2nd screen (FHD)
- Fallback to 1024x768
- flashing screen
All of our endpoints are Windows desktops and the users all connect with Blast, so I can tell you that the problem occurs regardless of Blast or PCoIP.
FYI - we have identified the root cause and are working on an ESXi 7.0 U3 fix. Initial testing is positive.
Any idea on when a fix will be released? This has been causing us issues for the last two weeks, and I am about to roll my hosts back to 7.0.2 this weekend.
I'm not sure of the exact timeframe, but it could be weeks. If this is causing issues now, I'd suggest a rollback is prudent.
Even with the KB from NVIDIA, we still have trouble with the Tesla T4 (Dell R740, ESXi 7.0.3 build 18644231) and NVD-VGPU_470.82-1OEM.702.0.0.17630552.
Following this guide, the driver comes up on the P40 but fails on the Tesla T4. The hosts are identical, both hardware and software.
Since downgrading is not possible (VxRail appliance), I'm stuck! Any ideas?
Check this KB. At least it helped me to solve the "Expected 1 component, found 2" issue. https://kb.vmware.com/s/article/85982
BR, Khiem
What will be the fastest way to get notified once a patch is released? Hard to believe it has taken this long and still nothing, not even a quick-fix patch, has been released!
Same problem here: VDI cluster was updated from 7.0U2d to 7.0U3b yesterday with latest Nvidia drivers on host and VMs and still seeing random disconnects and loss of vGPU.
Where oh where do you go, VMware? Lately every update breaks more than it fixes, it seems.
Additional note: has anyone checked whether the error (which occurs randomly for most of you) only occurs when DRS is moving VMs? Maybe we can minimize the impact of this bug by setting DRS to "Partially Automated".
@RyanHardy The root cause of this particular issue is related to communication between the NVIDIA vmkernel driver and its associated VMs. The issue can occur on an individual host regardless of DRS and vMotion. A fix has been identified and will be released in an ESXi patch.
Any idea of time frame at this point?
Thanks for the insight. Can't believe U3b didn't include this fix though, as it is a showstopper for all of us VDI users. How can a bug like this stay unfixed for so long?
As I am using vSAN for VDI too, I can't even go back to U2d, which means our users are getting really angry with us. I have opened a ticket and am hoping to get U3c a little early...
I can't give a specific date, but my understanding is that the next patch is a few weeks out.
Man that is not good to say the least! This is a major issue and the best they can do is this? 😕
It's plain unacceptable, is what it is.
I just got word from technical support that the approximate ETA for a hot patch is the 24th of this month. They are currently testing it on various host models; so far so good.