VMware Horizon Community
jmacdaddy
Enthusiast

vSphere ESXi 7.0.3 - 7.0 Update 3 bug with NVidia Grid vGPU

Anyone else seeing this? After upgrading hosts from 7.0.2 to 7.0.3, a fraction (10-15%) of the virtual desktops lose access to the vGPU at random during the day. The screen goes black and then comes back at 800x600 resolution. Rebooting the virtual desktop, sometimes more than once, eventually fixes it. In the vmware.log of an affected desktop you will see:

" Er(02) vthread-2108911 - vmiop_log: (0x0): Timeout occurred, reset initiated."

Tesla T4 cards in our case. Trying different NVIDIA guest drivers and host VIBs (12.1, 13.0) doesn't help. Only rolling the hosts back to 7.0.2 fixes it.
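
For anyone who wants to check a batch of desktops for the log line quoted above, here is a minimal Python sketch. It assumes you have copied the VMs' vmware.log files to a local directory; the function names `count_vgpu_timeouts` and `scan_logs` and the directory layout are hypothetical, and the pattern is taken only from the single log line in this post.

```python
import re
from pathlib import Path

# Pattern built from the log line quoted in this post.
# The vthread ID differs per VM, so it is not part of the match.
VMIOP_TIMEOUT = re.compile(
    r"vmiop_log: \(0x[0-9a-fA-F]+\): Timeout occurred, reset initiated"
)


def count_vgpu_timeouts(log_text):
    """Return the number of lines containing the vGPU timeout message."""
    return sum(1 for line in log_text.splitlines() if VMIOP_TIMEOUT.search(line))


def scan_logs(root):
    """Search `root` recursively for vmware*.log files.

    Returns a dict mapping each log path to its timeout count,
    listing only files with at least one hit.
    """
    hits = {}
    for path in Path(root).rglob("vmware*.log"):
        count = count_vgpu_timeouts(path.read_text(errors="replace"))
        if count:
            hits[str(path)] = count
    return hits
```

Calling `scan_logs("/path/to/copied/logs")` would list the desktops that hit the timeout. It only detects the symptom and says nothing about the cause.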

116 Replies
Eric-Champagne
Contributor

@LouisA1 : Do you have distributed switches (vDS)? Are they at version 7.0.3? Also, if you are running a vSAN cluster, I'm not sure what the impact of that is.

LouisA1
Contributor

@Eric-Champagne Yes, I do have a distributed switch, but it is still on 7.0.0. I am using vSAN, but I checked with that group and I did not upgrade that to 7.0.3 either. I have a 4-host cluster, and my thought was to take one host out with a full data migration, try to roll that one host back, and then add it back. Still checking with support.

Eric-Champagne
Contributor

@LouisA1 : I think you will be OK from what you describe. The vDS is the pitfall: you cannot re-add a 7.0.2 server to a 7.0.3 vDS. That's where the trouble started for me when I decided to try a downgrade. The other pitfall could be the vSAN disk group version. I can't answer that one, but I would recommend validating it.
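
As a rough illustration of the vDS pitfall, here is a small Python sketch of the check to make before a rollback. It assumes the rule is simply "the host version must be at least the vDS version", which matches what happened here but is not taken from VMware documentation. The function names are hypothetical.

```python
def parse_version(version):
    """Turn a dotted version such as '7.0.3' into a comparable tuple (7, 0, 3)."""
    return tuple(int(part) for part in version.split("."))


def can_join_vds(host_version, vds_version):
    """Assumed rule: a host can join a vDS only if it is at least as new as the vDS."""
    return parse_version(host_version) >= parse_version(vds_version)


# A 7.0.2 host against a vDS left on 7.0.0 passes the check;
# the same host against a vDS upgraded to 7.0.3 does not.
print(can_join_vds("7.0.2", "7.0.0"))
print(can_join_vds("7.0.2", "7.0.3"))
```

The vSAN disk group version would need a separate check, which this sketch does not cover.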

LouisA1
Contributor

@Eric-Champagne Yes, I have checked, and since I was on version 14 I should be OK there. The problem I am having now is that when I try to put the host in maintenance mode with full data migration, I get an error saying that one standalone host is needed or that it goes against the storage policy. So that is what I am checking on now. I want to be able to recover if the upgrade does not go well.

RyanHardy
Enthusiast

I sadly can't downgrade because of vSAN (who really runs vSAN without updating it as required by Skyline Health?).

But I got a reply from the "Technical Support Manager", and it could hardly say less in so few words:


I wanted to update you that our engineer team are actively working on the hot-patch and it’s in testing phase. We will keep you posted once it’s ready for deploy.


Trained answers when you're supposed to not tell the customer anything.

RyanHardy
Enthusiast

This just in from the "Technical Support Supervisor":


Our Engineering Team is currently finalizing the hotpatch and we will send it out to you shortly today.

Thanks for your patience and cooperation!

😲

LouisA1
Contributor

@RyanHardy Wow, hope that works. I am probably going to downgrade a host today and see what happens. They are making it seem that these hot patches are only for specific environments, as if each customer could need a different hotfix.

RyanHardy
Enthusiast

It sure seems so - I highly doubt that though, as they don't have that much information about my systems to write individual code.

Eric-Champagne
Contributor

@RyanHardy : Whatever you get, I'm ready to give it a try on one of my servers if you are willing to share your hot patch.

RyanHardy
Enthusiast

Sure thing. I will report back here once the patch arrives.

RyanHardy
Enthusiast

Just received the patch. Build number is now 19026913. I've updated one host and am currently moving some VMs onto it to test whether any errors occur with this build.

Eric-Champagne
Contributor

@RyanHardy Thanks, Ryan, big time! I will keep you updated on my side as well.

krd
VMware Employee

@LouisA1 FYI, a hot patch targets component(s) in a specific build and thus is unique to that environment. As of early December, there are at least three unique ESXi 7.0u3+ builds in the field, and thus there is the potential for three unique hot patches.

LouisA1
Contributor

Thank you for the explanation. @RyanHardy what build were you on previously?

Eric-Champagne
Contributor

Here, all of our Horizon ESXi servers are on: VMware ESXi, 7.0.3, 18905247

LouisA1
Contributor

I am currently on 7.0.3 18825058, so I wonder which patch was released for which version?

WuGeDe
Enthusiast

Just FYI

Here is the article (000001654) on the NVIDIA enterprise support page:

https://enterprise-support.nvidia.com/s/article/vGPU-VMs-have-TDRs-session-freezes-and-other-issues-...

RyanHardy
Enthusiast


@LouisA1 wrote:

@RyanHardy what build were you on previously?


We are/were on 18905247 (7.0U3b). Luckily I never installed 7.0U3 or 7.0U3a, else we would have had this problem even longer!

We are in the process of converting VMs back to GRID-enabled VMs and putting some load on the updated server, but so far we've had no issues with the hotpatch release.
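
Since a hot patch is tied to one specific base build, here is a minimal Python sketch for checking a host's version string against the builds mentioned in this thread. The mapping holds only the one pair reported here (18905247 patched to 19026913). The names `KNOWN_HOTPATCHES`, `parse_build`, and `hotpatch_for` are hypothetical, and the sketch assumes the build number is the last number in the string.

```python
import re

# Base build -> hot-patched build, as reported in this thread only.
# Other 7.0U3 builds would need their own hot patch from support.
KNOWN_HOTPATCHES = {18905247: 19026913}


def parse_build(version_string):
    """Extract the trailing build number from e.g. 'VMware ESXi, 7.0.3, 18905247'."""
    match = re.search(r"(\d{7,})\s*$", version_string.strip())
    if match is None:
        raise ValueError("no build number found in %r" % version_string)
    return int(match.group(1))


def hotpatch_for(version_string):
    """Return the hot-patched build for this host's base build, or None if unknown."""
    return KNOWN_HOTPATCHES.get(parse_build(version_string))
```

For example, `hotpatch_for("VMware ESXi, 7.0.3, 18905247")` gives 19026913, while a host on 18825058 gives `None`, meaning the patch described here was not built for it.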

RyanHardy
Enthusiast

FYI: We just finished updating the rest of the hosts as this specific issue seems to be fixed with the hotpatch. The NTP error still exists, but at least there is a workaround available for that...

Eric-Champagne
Contributor

@RyanHardy : Thanks for the update. Could you share the workaround for the NTP error?
