VMware Horizon Community
jmacdaddy
Enthusiast

vSphere ESXi 7.0.3 - 7.0 Update 3 bug with NVidia Grid vGPU

Anyone else seeing this?  After upgrading our hosts from 7.0.2 to 7.0.3, a fraction (10-15%) of the virtual desktops lose access to the vGPU during the day.  Totally random.  The screen goes black and then comes back at 800x600 resolution.  Rebooting the virtual desktop (sometimes more than once) eventually fixes it.  In the vmware.log of the affected desktop you will see:

" Er(02) vthread-2108911 - vmiop_log: (0x0): Timeout occurred, reset initiated."

Tesla T4 cards in our case.  Trying different NVidia guest drivers and host VIBs (12.1, 13.0) doesn't help.  Only rolling back the hosts to 7.0.2 fixes it.

116 Replies
adc_1997
Contributor

Hi @kanid99 - did you get a resolution to this issue?  Really painful to troubleshoot.

We have the same symptoms as you mention, where the display drivers (473.47) end up in a <!> state.  

Our config is -

  • Hypervisor: VMware ESXi, 7.0.3, 19482537
  • Model: ProLiant DL385 Gen10 Plus v2
  • Processor Type: AMD EPYC 7543 32-Core Processor
  • Horizon View 2206 client and Horizon 7.13.1
kanid99
Enthusiast

Yes @adc_1997 we did. They suggested trying these two steps, making these registry changes on the base images affected, which worked for our environment.

A) HKLM\System\CurrentControlSet\Control\GraphicsDrivers
   TerminateIndirectOnStall = 0 (DWORD)
   IddCxDebugCtrl = 0x80 (DWORD)
B) HKLM\Software\Policies\VMware, Inc.\VMware Blast\config
   PixelProviderGpuCompareCopy = 0 (REG_SZ)
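
For anyone who wants to roll these values into a base image in one go, here is a minimal sketch that renders them as .reg file text. It assumes the value names, types and data are exactly as listed above; `render_reg` and `SETTINGS` are names made up for this illustration, not part of any VMware or NVIDIA tooling. Try it on a test image first.

```python
# Sketch: render the registry values from this post as .reg file text.
# Value names, types and data are copied from the post; render_reg is a
# hypothetical helper, not a VMware or NVIDIA tool.

SETTINGS = {
    r"HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\GraphicsDrivers": [
        ("TerminateIndirectOnStall", "dword", 0),
        ("IddCxDebugCtrl", "dword", 0x80),
    ],
    r"HKEY_LOCAL_MACHINE\Software\Policies\VMware, Inc.\VMware Blast\config": [
        ("PixelProviderGpuCompareCopy", "sz", "0"),
    ],
}


def render_reg(settings):
    """Return .reg file text for a {key_path: [(name, kind, data)]} mapping."""
    lines = ["Windows Registry Editor Version 5.00", ""]
    for key_path, values in settings.items():
        lines.append("[{}]".format(key_path))
        for name, kind, data in values:
            if kind == "dword":
                # DWORDs are written as 8 hex digits, e.g. dword:00000080
                lines.append('"{}"=dword:{:08x}'.format(name, data))
            elif kind == "sz":
                # String data: escape backslashes and double quotes
                escaped = data.replace("\\", "\\\\").replace('"', '\\"')
                lines.append('"{}"="{}"'.format(name, escaped))
            else:
                raise ValueError("unsupported value type: {}".format(kind))
        lines.append("")
    return "\r\n".join(lines)


if __name__ == "__main__":
    print(render_reg(SETTINGS))
```

Saving the printed text to a file and loading it with `reg import` on the base image should set all three values; whether they help in a given environment is a separate question.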

 

Wesley-VKAE
Contributor

Am I glad to see this post (sadly, I know ... but at least I now know I'm not alone in this struggle, which has been going on for months).

Our experience with VDI has been trouble from the start, and we have been running it for over 2 years now ...

The issues described in this topic are close to identical to those in my environment.

Hopefully someone can provide me with a solution that works; after months of talking to both VMware and NVidia I am still nowhere.

About the last post with the registry keys: can someone confirm whether this could actually resolve the issue for me?

I will post some details about my setup and how I experience the issues.

Setup : 

vSAN cluster with 4 nodes: 1 HP DL380 Gen10 / 3 DL580 Gen10

DL380 with 1 RTX8000 GPU / DL580s with 2 RTX8000 GPUs each (all RTX8000)

We use both instant clones and persistent VMs on Windows 10 22H2 (both have the same issue).

Windows 11 22H1 and 22H2 are in a sandbox, no results there (still testing)

ESX versions on hosts : VMware ESXi, 7.0.3, 20328353

vCenter version : 7.0.3 20150588

In communication with NVidia, they claimed to have addressed the issue and said we needed to update our VIB and guest drivers to :

ESX Package : 510.85.03-1OEM.702.0.0.17630552

Guest driver : 513.46_grid_win10_win11_server2019_server2022_64bit_international

VMware Horizon : 

Currently running on version 2206.

Agent version : 2206-8.6.0-20088748

Almost all clients are updated to at least 8.6

---------------------------------

Our issue is almost identical to what others describe:

- It affects different hosts; not one specific machine but the entire cluster.

- The user is doing his job, whether under heavy load or idle, and suddenly the connection is lost.

- Guest interaction no longer works. Reset / restart in the Horizon client does not work, and a guest restart or shutdown from vCenter gives the same result.

- Ping to the VM works and the UNC path to c$ also works, but when I try to copy the programdata\vmware folder (to inspect the logs after I reset the instant clone) the copy freezes immediately. It is like the VM is still there, but at the same time it isn't?

We have been struggling with a faulty VDI experience since the start. After numerous updates through the months, hoping for something better as others claim, it feels like whenever they fix one issue another one arises.

The way the environment handles the issue has changed from past to present. In the past one VM could eventually bring down an entire host (or GPU, call it what you want), and I then had to evacuate the entire host and reboot. Now the issue is isolated to the VM without crashing the entire host.

The errors I get are also different from before. In the past it was: Protocol blast not ready ... cannot establish transport protocol ...

Now I just get "A general system error occurred: Invalid fault". Awesome ... that says it all, doesn't it.

Reverting to 7.0.2 is not an option for us since I have a vSAN; I would need to start all over again, and this is a production environment.

I'm really hoping someone here can shed some light on my struggle and give me a possible solution.

I have lost faith in support at this point. It took me months to get a response from the devops team at VMware, and then it was simple ping-pong. VMware blamed NVidia, NVidia did the same to VMware, then the VMware infra teams said it should be the Horizon team, and the Horizon team pointed to the VMware infra team ... they are literally driving me insane with this issue.

Sorry for the long post, I guess I kind of lost track and put my frustration into it. I'm sure many will understand and feel the same way.

RyanHardy
Enthusiast

If you read this thread carefully you will find that the issue was fixed quite some time ago. With recent builds like the ones you are using you shouldn't be facing this particular issue. At least we have never experienced anything related to it since U3c.

I am not 100% sure you have the same problem, though. Have you tried changing some VMs from GRID to Software 3D? This was our solution until VMware finally fixed the bug. Not being able to (fully) copy log files from the VM seems a little unusual and would point more in the direction of networking issues?

What I totally have to repeat is that VMware Level 1 support is (almost always) a class of its own - in a bad, like really bad, way. 

Wesley-VKAE
Contributor

Hi,

thanks for the reply.

Yes, I have been reading this thread more and more and looking at every link / log file that is suggested, but at some point I'm not devops, just the administrator of the product 🙂

The issue should be resolved indeed. Like I said, the situation changed on my end too: individual random VMs now experience the issue instead of the entire host going "down" (by down I mean the GPU giving up).

NVidia support and VMware took over and did a check of the infrastructure; they confirmed it should work fine. We have also configured similar setups for other customers on the same hardware, and those do not have this issue.

Changing from GRID to Software 3D is not possible, I'm afraid. All the end users work in Autodesk products and the infrastructure was initially set up just for that. We were also explicitly told to make sure SVGA is no longer installed within VMware Tools and 3D support is disabled on the VMs / masters.

The environment actually worked for a few weeks, and all of a sudden on the first of November the issue started again. No changes were made in the weeks before, so I'm clueless.

I understand giving advice is not easy; every setup is different, and many parameters are involved.

The entire cluster is running on the same physical switches. Going by my personal log of users having issues, I now have about 10 reports a day for an environment with 100+ machines.

When I start going through the logs I get lost in there ... not sure what is relevant and what is not.

Regards

jmacdaddy
Enthusiast

When the user experiences a locked up session, don't restart the problem VM immediately.   First SSH to the host the VM is running on and run "nvidia-smi vgpu".  Do you see the impacted virtual machine showing 99% GPU Utilization?

If so, you are experiencing what we have been calling the "99% issue".  We have fought this since mid-2022 with no real insight from VMWare or NVidia on the root cause.  We did see it where one VM experiencing this issue would lock up the sessions of all the desktops on that same GPU card.  Resetting the problem VM would allow the other desktops to be reconnected to.  
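
To tell a momentary spike from a vGPU that is really stuck, it can help to note the utilization over a few consecutive runs. A minimal sketch of that check follows. It assumes the readings are collected by hand (or by your own tooling) from repeated runs of "nvidia-smi vgpu" and does not try to parse that command's output; the function name, VM names, threshold and sample count are all made up for illustration.

```python
# Sketch: flag vGPUs that stay pinned at ~99% utilization across samples.
# The readings are assumed to be copied by hand (or by your own tooling)
# from repeated runs of "nvidia-smi vgpu"; this code does not parse that
# command's output. The threshold and sample count are assumptions.

def find_pinned_vms(samples, threshold=99, min_consecutive=3):
    """samples: {vm_name: [utilization_percent, ...]} in time order.

    Returns a sorted list of VM names whose most recent `min_consecutive`
    readings are all at or above `threshold`.
    """
    pinned = []
    for vm_name, readings in samples.items():
        recent = readings[-min_consecutive:]
        if len(recent) == min_consecutive and all(r >= threshold for r in recent):
            pinned.append(vm_name)
    return sorted(pinned)


if __name__ == "__main__":
    # Hypothetical VM names and readings, for illustration only.
    readings = {"vdi-017": [12, 99, 99, 99], "vdi-018": [99, 40, 3]}
    print(find_pinned_vms(readings))
```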

In the end, we switched all of our vGPU profiles from GRID T4-1B to GRID T4-2B, and the problem went away.  Obviously, you may not be able to just double your assigned GPU frame buffer if you don't have enough physical GPU's in your hosts.  We were lucky in that we only had to add one additional server to our cluster to be able to do this.
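
To put rough numbers on that trade-off, here is a back-of-the-envelope sketch. It assumes frame buffer is the only limit, that a T4 has 16 GB, and that the 1B and 2B profiles take 1 GB and 2 GB each; real per-profile limits, licensing and headroom are ignored, and the function names are made up for illustration.

```python
# Sketch: how many time-sliced vGPUs of one profile fit on a physical GPU,
# assuming frame buffer is the only constraint (real limits can be lower).

def vgpus_per_gpu(gpu_framebuffer_gb, profile_framebuffer_gb):
    """Whole number of vGPUs of a given profile size that fit on one GPU."""
    if profile_framebuffer_gb <= 0:
        raise ValueError("profile size must be positive")
    return gpu_framebuffer_gb // profile_framebuffer_gb


def gpus_needed(desktops, gpu_framebuffer_gb, profile_framebuffer_gb):
    """Physical GPUs needed to host `desktops` vGPU desktops (rounded up)."""
    per_gpu = vgpus_per_gpu(gpu_framebuffer_gb, profile_framebuffer_gb)
    if per_gpu == 0:
        raise ValueError("profile does not fit on this GPU")
    return -(-desktops // per_gpu)  # ceiling division


if __name__ == "__main__":
    # 100 desktops is a made-up figure for illustration.
    print(gpus_needed(100, 16, 1), gpus_needed(100, 16, 2))
```

Under those assumptions a 16 GB card holds 16 desktops at 1 GB each but only 8 at 2 GB, so doubling the profile roughly doubles the number of cards needed.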

Just curious, what AutoDesk product are your users primarily working with on their virtual desktops?

 

Good luck.

Wesley-VKAE
Contributor

Curious that you mention this. In the past we were working with the RTX8000-2Q profile and had a lot of issues.

For testing purposes we changed to 3Q profiles, and then the issues kind of got better ...

Now all of a sudden it starts again, but then again the software is getting heavier and the projects are growing ...

Autodesk Civil 3D, Autodesk Revit ... fire simulation software (but that is CPU, not GPU).

My manager tries to avoid going to 4Q profiles because the investment (VMs per GPU) was calculated for 2Q profiles in the beginning.

Tomorrow the issue will happen again for sure, so I will make sure SSH is enabled and run the NVidia command to confirm whether this could be what is causing the issue.

Hypothetically speaking, if this is the case, should it be considered normal behaviour that a VM completely freezes because of this?? One would believe there should be safety mechanisms in place to ensure VMs don't go completely bananas 😉

Thanks for the feedback.

adc_1997
Contributor

@jmacdaddy - we have the same issue as you, also seeing one VM on 99%.  All the users on the same GPU get a black screen.  Although so far, resetting the problem VM doesn't allow the other users to reconnect - and they won't vmotion off, we have to restart them.  The hosts have 3 graphics cards, and we have to vmotion the users off the other two cards - then reboot the host to get it functional again.  Very disruptive all in all!

We do have a case logged with Nvidia, and they tell us it's a known issue which is currently with engineering as top priority.  However, they can't give any info on what exactly causes the issue, whether there is a way to mitigate or minimise it, or when there may be a fix.  Every time there is a failure we collect the logs and add them to the case.

Also we moved from 1B to 2B profiles for everyone about a month ago.  Unfortunately we are still experiencing the 99% issue - although possibly not quite so regularly.  We don't actually do anything GPU intensive, no tools like AutoDesk - and are considering reverting to software based graphics until this issue is actually resolved.

Wesley-VKAE
Contributor

Hi,

I don't know if this should be considered a good thing or a bad thing, but I have exactly the same issues you describe.

vmkernel.log is showing me this scrambled output:

2022-12-06T16:06:07.654Z cpu22:7675369)nvrm-nvlog: qYKCoPoVEMABAAAAAgAAAADVUAAAAAAAcGaPY6mCgqDvFRDAAQAAAAIAAACojFAA
2022-12-06T16:06:07.654Z cpu22:7675369)nvrm-nvlog: AAAAAHBmj2OpgoKg+hUQwAEAAAADAAAAAN1QAAAAAABwZo9jqYKCoO8VEMABAAAA
2022-12-06T16:06:07.654Z cpu22:7675369)nvrm-nvlog: AwAAAKyMUAAAAAAAcGaPY6mCgqD6FRDAAQAAAAQAAAAA5VAAAAAAAHBmj2OpgoKg
2022-12-06T16:06:07.654Z cpu22:7675369)nvrm-nvlog: 7xUQwAEAAAAEAAAAsIxQAAAAAABwZo9jqYKCoPoVEMABAAAABQAAAADtUAAAAAAA
2022-12-06T16:06:07.654Z cpu22:7675369)nvrm-nvlog: cGaPY6mCgqDvFRDAAQAAAAUAAAC0jFAAAAAAAHBmj2OpgoKg+hUQwAIAAACALFEA
2022-12-06T16:06:07.654Z cpu22:7675369)nvrm-nvlog: AAAAAHBmj2OpgoKgzhUMwAIAAACELFEAAAAAAHBmj2OpgoKg1BUMwAIAAACMLFEA
2022-12-06T16:06:07.654Z cpu22:7675369)nvrm-nvlog: AAAAAHBmj2OpgoKg2hUMwAIAAAAAAAAAAEVRAAAAAABwZo9jqYKCoO8VEMACAAAA
2022-12-06T16:06:07.654Z cpu22:7675369)nvrm-nvlog: AAAAAKAMUQAAAAAAcGaPY6mCgqD6FRDAAgAAAAEAAAAATVEAAAAAAHBmj2OpgoKg
2022-12-06T16:06:07.654Z cpu22:7675369)nvrm-nvlog: 7xUQwAIAAAABAAAApAxRAAAAAABwZo9jqYKCoPoVEMACAAAAAgAAAABVUQAAAAAA
2022-12-06T16:06:07.654Z cpu22:7675369)nvrm-nvlog: cGaPY6mCgqDvFRDAAgAAAAIAAACoDFEAAAAAAHBmj2OpgoKg+hUQwAIAAAADAAAA
2022-12-06T16:06:07.654Z cpu22:7675369)nvrm-nvlog: AF1RAAAAAABwZo9jqYKCoO8VEMACAAAAAwAAAKwMUQAAAAAAcGaPY6mCgqD6FRDA
2022-12-06T16:06:07.654Z cpu22:7675369)nvrm-nvlog: AgAAAAQAAAAAZVEAAAAAAHBmj2OpgoKg7xUQwAIAAAAEAAAAsAxRAAAAAABwZo9j
2022-12-06T16:06:07.654Z cpu22:7675369)nvrm-nvlog: qYKCoPoVEMACAAAABQAAAABtUQAAAAAAcGaPY6mCgqDvFRDAAgAAAAUAAAC0DFEA
2022-12-06T16:06:07.654Z cpu22:7675369)nvrm-nvlog: AAAAAHBmj2OpgoKg+hUQwAMAAACArFEAAAAAAHBmj2OpgoKgzhUMwAMAAACErFEA
2022-12-06T16:06:07.654Z cpu22:7675369)nvrm-nvlog: AAAAAHBmj2OpgoKg1BUMwAMAAACMrFEAAAAAAHBmj2OpgoKg2hUMwAMAAAAAAAAA
2022-12-06T16:06:07.654Z cpu22:7675369)nvrm-nvlog: AMVRAAAAAABwZo9jqYKCoO8VEMADAAAAAAAAAKCMUQAAAAAAcGaPY6mCgqD6FRDA
2022-12-06T16:06:07.654Z cpu22:7675369)nvrm-nvlog: AwAAAAEAAAAAzVEAAAAAAHBmj2OpgoKg7xUQwAMAAAABAAAApIxRAAAAAABwZo9j
2022-12-06T16:06:07.654Z cpu22:7675369)nvrm-nvlog: qYKCoPoVEMADAAAAAgAAAADVUQAAAAAAcGaPY6mCgqDvFRDAAwAAAAIAAACojFEA
2022-12-06T16:06:07.654Z cpu22:7675369)nvrm-nvlog: AAAAAHBmj2OpgoKg+hUQwAMAAAADAAAAAN1RAAAAAABwZo9jqYKCoO8VEMADAAAA
2022-12-06T16:06:07.654Z cpu22:7675369)nvrm-nvlog: AwAAAKyMUQAAAAAAcGaPY6mCgqD6FRDAAwAAAAQAAAAA5VEAAAAAAHBmj2OpgoKg

 

while the vmware.log of the virtual machine shows me this:

2022-12-06T15:56:19.486Z Er(02) vthread-11513368 - vmiop_log: (0x0): Idle channel timeout
2022-12-06T15:56:19.486Z Er(02) vthread-11513368 - vmiop_log: (0x0): VGPU message 22 failed, result code: 0x65
2022-12-06T15:56:19.486Z Er(02) vthread-11513368 - vmiop_log: (0x0): 0xfffefc10, 0xf4240, 0x1, [1]0xc1d00461, 0xff020000, 0xff040016
2022-12-06T15:56:19.486Z Er(02) vthread-11513368 - vmiop_log: (0x0):
2022-12-06T15:56:22.589Z Er(02) vthread-11513368 - vmiop_log: (0x0): Idle channel timeout
2022-12-06T15:56:22.589Z Er(02) vthread-11513368 - vmiop_log: (0x0): VGPU message 22 failed, result code: 0x65
2022-12-06T15:56:22.589Z Er(02) vthread-11513368 - vmiop_log: (0x0): 0xfffefc10, 0xf4240, 0x1, [1]0xc1d0044c, 0xff020000, 0xff040008
2022-12-06T15:56:22.589Z Er(02) vthread-11513368 - vmiop_log: (0x0):
2022-12-06T15:56:25.602Z Er(02) vthread-11513368 - vmiop_log: (0x0): Idle channel timeout
2022-12-06T15:56:25.602Z Er(02) vthread-11513368 - vmiop_log: (0x0): VGPU message 22 failed, result code: 0x65
2022-12-06T15:56:25.602Z Er(02) vthread-11513368 - vmiop_log: (0x0): 0xfffefc10, 0xf4240, 0x1, [1]0xc1d00454, 0xff020000, 0xff040010
2022-12-06T15:56:25.602Z Er(02) vthread-11513368 - vmiop_log: (0x0):
2022-12-06T15:56:25.642Z In(05) vthread-11513368 - VMIOP: Driver metadata = [vgpu_version:0x0]
2022-12-06T15:56:25.642Z No(00) vthread-11513368 - ConfigDB: Setting vmiop.guestVgpuVersion = "0"
2022-12-06T15:56:25.642Z In(05) vthread-11513368 - vmiop_log: (0x0): Guest driver unloaded!
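
When wading through a large vmware.log, a quick tally of the vmiop_log messages can show how often each one occurs. A minimal sketch, assuming lines formatted like the excerpts quoted in this thread; the message list only covers strings that appear in those excerpts, and the labels and function name are made up for illustration.

```python
import re
from collections import Counter

# Message fragments taken from the vmware.log excerpts quoted in this thread.
PATTERNS = {
    "timeout_reset": "Timeout occurred, reset initiated",
    "idle_channel_timeout": "Idle channel timeout",
    "vgpu_message_failed": "failed, result code",
    "guest_driver_unloaded": "Guest driver unloaded",
}

# Only look at lines logged through vmiop_log that carry a message.
VMIOP_LINE = re.compile(r"vmiop_log: \(0x[0-9a-fA-F]+\): (?P<msg>.*)$")


def summarize_vmiop(lines):
    """Count known vmiop_log messages in an iterable of vmware.log lines."""
    counts = Counter()
    for line in lines:
        match = VMIOP_LINE.search(line.rstrip("\n"))
        if not match:
            continue
        msg = match.group("msg")
        for label, needle in PATTERNS.items():
            if needle in msg:
                counts[label] += 1
    return counts
```

It can be fed an open file directly, for example `summarize_vmiop(open("vmware.log", errors="replace"))`, and returns a Counter keyed by the labels above.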

 

I really can't believe how they manage to "not fix this problem". What's the point of creating yet another ticket ... a few months back they promised they had found the problem and created a fix for it, yeah right ...

Did you manage to get any feedback in the meantime?

 

Regards

krd
VMware Employee

Those nvrm-nvlog messages in the vmkernel log are from the NVIDIA host driver and indicate an NVIDIA driver/hardware error. I suggest you report this issue to NVIDIA.

Wesley-VKAE
Contributor

You wouldn't believe it if I told you.

Months have gone by with NVidia support; sometimes we send logs daily for an entire host that crashes.

The number of times we have heard "we see a problem and will create a custom driver specially for you" ... now we are again on a custom NVidia driver, and here we go again: 2 hosts failed already on the same day. If I'm not mistaken we have been dealing with these problems for almost 2 years now.

NVidia clearly can't help us, and VMware won't help us because, well, it's easy: it is an NVidia problem. So that's that. We have no support … only a crappy system we invested way too much money into.

BrianBurdorf
Contributor

@Wesley-VKAE  I've had an open ticket with NVidia since March 2022; it's finally in an archived state with what seemed to be working.  We are using VIB and guest driver 13.4.  We have other issues: when we vMotion, random GPUs fall off the bus:  "cpu28:11556099)NVRM: Xid (PCI:0000:c3:00): 79, pid=11555595, GPU has fallen off the bus".  Apparently it's a hardware vendor issue, in our case HP.  NVidia states the Xid 79 error comes from heat or lack of power, although I can't find either one going over a threshold.  HP advises us to replace the card, and so far the issue hasn't been reproduced on the replaced cards.  Our M10 cards don't have this issue, just the RTX8000s.
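
To see whether the Xid errors keep landing on the same card, the log lines can be tallied per PCI address. A minimal sketch follows; the pattern is modelled only on the single line quoted above, so other driver versions may format it differently, and the function names are made up for illustration.

```python
import re
from collections import Counter

# Pattern modelled on the single Xid line quoted in this thread; other
# driver versions may format the line differently.
XID_LINE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+),"
    r"(?: pid=(?P<pid>[^,]+),)?\s*(?P<message>.*)$"
)


def parse_xid(line):
    """Return a dict with pci, xid, pid and message, or None if no Xid found."""
    match = XID_LINE.search(line.strip())
    if match is None:
        return None
    return {
        "pci": match.group("pci"),
        "xid": int(match.group("xid")),
        "pid": match.group("pid"),  # None when the line carries no pid
        "message": match.group("message"),
    }


def count_by_device(lines):
    """Count Xid events per (PCI address, Xid code) across log lines."""
    counts = Counter()
    for line in lines:
        event = parse_xid(line)
        if event is not None:
            counts[(event["pci"], event["xid"])] += 1
    return counts
```

If one PCI address dominates the counts, that would fit the advice to swap that particular card; an even spread across cards would point elsewhere.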

NateNateNAte
Hot Shot

I can't believe this problem is still persisting after 2 years.  I've moved from instant clone pools to dedicated Horizon desktops just because we got tired of the constant reboots (and user complaints). Fortunately it's a small user base needing the vGPU power, so we could get away with that.  Hoping to see some improvement when we try again with ESXi 8.x soon.

Wesley-VKAE
Contributor

Replace the cards, you say? That's the advice given?

We have 7 of them ... we are not a big enterprise, so these cards cost us a lot of money ... The ticket is still open with NVidia ... they recognize it is an issue with their driver ... and keep giving us a custom driver to solve the issue ... We are now 2 weeks further with no crash, but it is total BS if you ask me that we depend on a custom driver and reg keys we have to push in ESX to create timeouts and so on. Now we are stuck on version 13.4 because we work with their custom driver. I have a love/hate relationship with VDI at this point.

Wesley-VKAE
Contributor

Do you mean passthrough physical desktops or still VMs? We have the issue in general on instant clones and persistent VMs ...

We use 3Q profiles and are heavily dependent on GPU power (Civil 3D / AutoCAD / Revit).

ToddMartin
Contributor

I want to add something. My graphics cards are M60s, which have 2 GPU chips per card. In my configuration, I cannot have different profiles allocated on the same GPU chip. I noticed that all my guests were on one 1 GB profile, and when I changed one to a different profile (like 4Q) it wouldn't power on, even though I had 4 GB of RAM left on the physical card. When I changed it back to match the rest of the guests it powered on fine. Somehow you have to fully evacuate a GPU chip and allocate only one profile to it.

 

If you go into the CLI of the vSphere host, you can type the command

nvidia-smi vgpu

That will show you which guests are located on which GPU and card, if there are multiple cards. Then you can power off guests and remove the vGPU from them until you have evacuated one GPU. After that you can change the profile to something else and the VM should power on. Add the vGPU back to the other guests, and they should power on, on another GPU.

This was the issue I was having with the exact same verbiage as all of you.

krd
VMware Employee

Yes, this is standard behavior for all existing vGPU time-sliced VMs.  It is described by NVIDIA at Valid Time-Sliced Virtual GPU Configurations on a Single GPU 

If you want to support multiple M60 vGPU types, you can consider enabling ESXi host graphics consolidation mode (see Configuring Host Graphics). This will allow each M60 GPU to run a different vGPU size.
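
The placement rule described above can be sketched as a small check: a vGPU only starts on a physical GPU that is either empty or already running the same profile, no matter how much frame buffer is free. The data model here (dict fields, profile names as plain strings, 8 GB per M60 GPU) is an assumption for illustration, not how ESXi or the NVIDIA driver represent it.

```python
# Sketch of the time-sliced placement rule: every vGPU resident on one
# physical GPU must use the same profile, regardless of free memory.
# The dict layout and profile strings are hypothetical.

def can_place(gpu, profile, profile_gb):
    """Return True if a vGPU of `profile` could start on this physical GPU.

    gpu: {"total_gb": int, "profile": str or None, "used_gb": int}
         "profile" is None when the GPU currently hosts no vGPUs.
    """
    free_gb = gpu["total_gb"] - gpu["used_gb"]
    if profile_gb > free_gb:
        return False  # not enough frame buffer left
    # An empty GPU accepts any profile; an occupied one only its current type.
    return gpu["profile"] is None or gpu["profile"] == profile


if __name__ == "__main__":
    # One GPU half full of 1 GB vGPUs: a 4 GB profile is refused even
    # though 4 GB is free, while another 1 GB vGPU still fits.
    busy = {"total_gb": 8, "profile": "M60-1Q", "used_gb": 4}
    print(can_place(busy, "M60-4Q", 4), can_place(busy, "M60-1Q", 1))
```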
