Am I glad to see this post (sad, I know... but at least I now finally know I'm not alone in this struggle, which has been going on for months now).
Our experience with VDI has been trouble from the start, and we have been running it for over 2 years now...
The issues described in this topic are close to identical to what I see in my environment.
Hopefully someone can provide me with a solution that will work; after months of talking to both VMware and Nvidia I am still nowhere.
About the last post with the registry keys: can someone confirm this could actually resolve the issue for me?
I will post some details about my setup and how I experience the issues.
Setup:
vSAN cluster with 4 nodes: 1 HP DL380 Gen10 / 3 DL580 Gen10
DL380 with 1 RTX8000 GPU / DL580 with 2 RTX8000 GPUs (all RTX8000)
We use both Instant Clones and persistent VMs on Windows 10 22H2 (both have the same issue).
Windows 11 22H1 and 22H2 are in a sandbox; no results there yet (still testing).
ESXi version on hosts: VMware ESXi, 7.0.3, 20328353
vCenter version: 7.0.3 20150588
In our communication with Nvidia, they claimed to have addressed the issue and said we needed to update our VIB and guest drivers to:
ESX Package : 510.85.03-1OEM.702.0.0.17630552
Guest driver : 513.46_grid_win10_win11_server2019_server2022_64bit_international
VMware Horizon:
Currently running on version 2206.
Agent version: 2206-8.6.0-20088748
Clients are almost all updated to at least 8.6.
---------------------------------
Our issue is almost identical to what others describe:
- Different hosts: not one specific machine but the entire cluster.
- The user is doing his job, whether under heavy load or idle, and suddenly the connection is lost.
- Guest interaction does not work anymore: in the Horizon Client, reset / restart does not work, and in vCenter a guest restart or shutdown gives the same result.
- Ping to the VM works and the UNC path to c$ also works, but when I try to copy the programdata\vmware folder (to inspect the logs after I reset the instant clone) the copy freezes immediately. It is like the VM is still there, but at the same time it isn't?
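For anyone who wants to capture this half-alive state a bit more systematically, a minimal Python sketch could probe a few TCP ports on the hung VM. The port numbers are assumptions for a typical setup (445 for SMB, 22443 for Blast on the Horizon Agent, 3389 for RDP), and `probe` is just an illustrative helper, not anything from VMware or Nvidia.

```python
import socket

# Assumed ports, adjust to your environment:
#   445   = SMB (the c$ share)
#   22443 = Blast Extreme on the Horizon Agent
#   3389  = RDP
PORTS = {"smb": 445, "blast": 22443, "rdp": 3389}


def probe(host, ports=PORTS, timeout=3.0):
    """Return {name: bool}: True if a TCP connect to that port succeeds."""
    results = {}
    for name, port in ports.items():
        try:
            # A successful connect only proves the guest's network stack
            # still accepts connections, not that the service behind it works.
            with socket.create_connection((host, port), timeout=timeout):
                results[name] = True
        except OSError:
            # Covers refused connections, timeouts and unreachable hosts.
            results[name] = False
    return results
```

Calling `probe("name-of-the-hung-vm")` from a management machine and keeping the result next to the timestamp of the disconnect might help show support which services still answer while the desktop itself is frozen. A port that answers here while the file copy still hangs would at least be consistent with the guest network being alive and something deeper being stuck.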
We have been struggling with a faulty VDI experience since the start, and after numerous updates through the months, hoping for something better as others also report, it feels like when they fix one issue another one arises.
The way the environment handles the issue has changed over time. In the past one VM could eventually bring down an entire host (or GPU, call it what you want), and I then had to evacuate the entire host and reboot. Now the issue is isolated to the VM without crashing the entire host.
The errors I get are also different from the past. Back then it was: Protocol blast not ready ... cannot establish transport protocol ...
Now I just get "A general system error occurred: Invalid fault". Awesome... that says it all, doesn't it.
Reverting to 7.0.2 is not an option for us since I have a vSAN; I would need to start all over again, and this is a production environment.
Really hoping someone here can shed some light on my struggle and give me a possible solution.
I have lost faith in support at this point. It took months before the VMware devops team picked up my case, and then it was plain ping-pong: VMware blamed Nvidia, Nvidia blamed VMware, then the VMware infra teams said it should be the Horizon team, and the Horizon team pointed back to the VMware infra team... They are literally driving me insane with this issue.
Sorry for the long post, I guess I kind of lost track and put my frustration into this post. I'm sure many will understand and feel the same way.