VMware Horizon Community
jmacdaddy
Enthusiast

vSphere ESXi 7.0.3 - 7.0 Update 3 bug with NVIDIA GRID vGPU

Anyone else seeing this? After upgrading hosts from 7.0.2 to 7.0.3, a fraction (10-15%) of the virtual desktops lose access to the vGPU at random times during the day. The screen goes black and then comes back at 800x600 resolution. Rebooting the virtual desktop (sometimes more than once) eventually fixes it. In the vmware.log of an affected desktop you will see:

" Er(02) vthread-2108911 - vmiop_log: (0x0): Timeout occurred, reset initiated."

Tesla T4 cards in our case. Trying different NVIDIA guest drivers and host VIBs (12.1, 13.0) doesn't help. Only rolling back the hosts to 7.0.2 fixes it.
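
If you want to check how many desktops are hitting this, here is a minimal sketch (mine, not from VMware or NVIDIA) that scans copies of the vmware.log files for the message quoted above. It assumes the logs have been copied into one folder tree and that the message text always matches the quoted line; the function names and the `vmware*.log` file pattern are my own choices, not anything official.

```python
import re
import sys
from pathlib import Path

# Matches the message quoted above. The exact log layout around it
# (timestamps, thread names) is an assumption and is not relied on.
TIMEOUT_RE = re.compile(r"vmiop_log:.*Timeout occurred, reset initiated")


def find_vgpu_timeouts(log_text):
    """Return the lines of a vmware.log that contain the vGPU timeout message."""
    return [line.strip() for line in log_text.splitlines() if TIMEOUT_RE.search(line)]


def scan_logs(root):
    """Map each vmware*.log under root to its matching lines (logs with hits only)."""
    hits = {}
    for log_path in Path(root).rglob("vmware*.log"):
        matches = find_vgpu_timeouts(log_path.read_text(errors="replace"))
        if matches:
            hits[str(log_path)] = matches
    return hits


if __name__ == "__main__":
    # Usage: python scan_vgpu_timeouts.py /path/to/copied/logs
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    for path, lines in scan_logs(root).items():
        print(f"{path}: {len(lines)} timeout(s)")
```

It only counts occurrences per log file, so it tells you which desktops were hit and how often, not why.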

116 Replies
RyanHardy
Enthusiast

While that date (24th) would still be very stressful for our users, I received an update from VMware in my support ticket:

As you may be aware, we have identified critical issues following the vSphere 7.0 U3 GA release, leading to two express patches. After further review, additional resolution complexities have come to light which we have been working around the clock to resolve, test and validate. To protect you from further impact and reduce the potential for further complexity until we have a properly vetted path forward, the decision has been made to put a hold on the full ESXi 7.0 U3 release (incl. ESXi U3, U3a and U3b) and vCenter Server U3b, removing it for download at this time. vCenter Server 7.0 U3 GA and U3a will remain available as a viable upgrade path and ESXi host backwards compatibility remains unchanged. 
 
For now, we are asking you to hold on any moves to ESXi 7.0 U3. Please reference the following FAQ (https://kb.vmware.com/s/article/86398) which goes into more detail and also gives guidance for those who have already updated in any form. For transparency and awareness, a banner is also being maintained directly on the download page. 
  
Please cascade this message across your teams and reach out to me should you have any questions not already addressed in the FAQ. I will work with our teams to get the answers needed. 
  
We are committed to getting this fully resolved in an upcoming release.

I can't tell you how disappointed I am in VMware. They are now on par with Microsoft regarding code quality and quality control. I hope the dollars saved on professional employees are worth the lost confidence in this once rock-solid product.

Better yet: I have to tell my management that all the money we are paying VMware for maintenance is so well invested that it is going to leave us in a non-functional state for quite some time.

Seph1
Contributor

We finished the rollback to 7.0.2 (build 17867351) and it seems to have helped; there are no more timeouts.

Eric-Champagne
Contributor

VMWARE EMPLOYEE: Is there an ETA for that bug? We have a very large VMware vSAN Horizon environment completely down, with tons of NVIDIA cards. We cannot downgrade to 7.0.2: our vSAN cluster uses vDSwitches that are already at version 7.0.3, so it is impossible to bring a vSAN node back to that vDS without breaking everything.

VanKhiem
Contributor

Same question here. 

@VMware: please provide an ETA.

RyanHardy
Enthusiast

Aside from the daily "we have it on our radar" and "you are our top priority" mails, I have gotten nothing from my support ticket. They suggested ripping the cluster apart and setting it up from scratch, but not on my watch. It's just insanity to put that much work into something VMware broke.

Eric-Champagne
Contributor

Rip the cluster... I can't imagine ripping apart 4 vSAN clusters with 32 nodes each and being comfortable with that recommendation: recreating the vDSwitch at version 7.0.2, recreating all port groups, the LACP config, VMware Horizon on top of it... WOW. A very poor recommendation for a large and complex environment.

UNBELIEVABLE!

RyanHardy
Enthusiast

Wow, with an environment of that size I don't envy you... How many VMs do you have to restart every minute I wonder?

For VDI we only have one cluster with four nodes and a little over 100 VMs - but even with this size one person has to take calls and restart VMs the whole day long.

 

BTW (if it helps): we changed almost all of our VMs to software graphics, so at least now only VMs with AutoCAD or Adobe CC are affected.

LouisA1
Contributor

Maybe we are getting close: I received the email below from support on Friday. Has anyone else seen anything yet?

"Requesting you to share the ESXi logs from any one host, along with the timestamp and details of the virtual machines on which the issue was observed, so that we can verify whether the hotpatch currently available can be used in our environment as well."

 

RyanHardy
Enthusiast

The update I got today does not read as optimistic as yours, but at least there is talk about a hotpatch. The least one can expect.

Eric-Champagne
Contributor

@LouisA1: Any update on your SR?

LouisA1
Contributor

The last info I got was that they were testing a hotfix and would give me an ETA soon. That was yesterday. They also said this was not a general patch but a specific one, which made it seem as though they would need log files from your environment to see whether the patch would fix it. So if you have not opened a ticket, I would suggest doing that. Who knows how long it will be before a general update is released. It is so disappointing that it is taking this long.

RyanHardy
Enthusiast

After another reminder by me within our support ticket we got a new answer today (what more can you ask for these days?).


I reviewed the PR created for the hot-patch and there seems to have been some delay last week. We have sent a high-priority update and I’ll be following up on that.


Oh boy...

LouisA1
Contributor

I have requested another update, as I have not heard anything except "we're waiting on engineering for an ETA". LOL, not the patch, but an ETA. VMWARE, I have a question: how much longer do I have to listen to my users tell me VDI SUCKS!!!!!! It's been over a month!!!!!!! of me saying "they're working on it".

RyanHardy
Enthusiast

Seems like a perfect time to put our heads together and find better alternatives.

Eric-Champagne
Contributor

@LouisA1: I know, it makes no sense at all. I wish I could downgrade, but I cannot because our vSAN cluster and distributed switches are already at 7.0.3. @RyanHardy: we are a heavy NVIDIA shop, so I don't know what the plan could be. I even tried the old vSGA method instead of vGPU Shared Direct access, and my 3D software (Dassault CATIA) is not able to start...

It's NOT funny at all here, actually... 300 VDI PCs with no 3D...

LouisA1
Contributor

Here is the latest response, from the engineering team I suppose:

Hi Team,

Thanks for the response. 

We will update you by tomorrow about the ETA on the patch.

Additionally, since it is a vSAN we are testing a couple of things internally for rolling back the patch.

 

LouisA1
Contributor

So far, crickets.

LouisA1
Contributor

I have requested a manager's assistance to answer one question: why is it taking so long just to get an ETA for when a patch will be released? Does anyone else find it odd that after a month they still can't say when one will be available?

sjesse
Leadership

I don't work for VMware. That's the best you can do. Also, if you have a local contact, let me know as well, to get it escalated if it's not. I'm assuming there are overall issues with Update 3 that they are trying to fix and prioritize as they can; they pulled the entire update, if you're not aware.

LouisA1
Contributor

Here is the latest info:

Hi Team,

Apologies for the delay. 

We check internally with the engineering team they confirmed the ETA for the hot patch will be 2-3 weeks since there are various things that need to be checked considering your environment.  

I am going to see about rolling back the update.  I think they have no clue when it will be fixed but they have to say something I guess.

This is beyond ridiculous. So they release an update that breaks all kinds of things, and it will take them a couple of months or more to get it fixed? Where was the testing before the release? Obviously there was none.
