We've been dealing with this issue for quite some time and feel like we're chasing our tail, so I wanted to reach out to the community to try to get a fresh perspective on it.
We are currently on Horizon 2111 with the matching agent, Windows 10 21H2, App Volumes 2103 (126.96.36.199) with the matching agent, and vCenter 7U3 with matching ESXi hosts and VMware Tools 11.3.5. We also use NVIDIA Tesla M10 GPUs on all of the clones.
It's not a huge environment, but we run roughly 600 sessions daily across two Horizon pods, everything on Pure Storage. I should add that we use User Profile-only writable volumes. Our instant clones have 6 vCPUs and 12 GB of memory. I know that's crazy for the banking industry, but the circumstances required it: we run Carbon Black App Control, Carbon Black Cloud, ObserveIT, and Tessian for email security, and we kept increasing resources to try to keep the VMs from going into an unresponsive state.
Symptoms: we typically see an average of 20 users per day calling in to report that their VDI session froze and they got disconnected. When we look at the affected VM, it is pingable and running, but we cannot console in due to NVIDIA GRID, and we also cannot remotely manage it. We recently added ControlUp to help with these issues, but the ControlUp agent also becomes unresponsive. We tried putting a VM in maintenance mode, shutting it down, removing NVIDIA GRID, and powering it back on, but it immediately gets removed. When the VM shows as Agent Unreachable in Horizon, the only option we have is to remove it, and it takes a very long time before the VM actually gets powered off and removed. To speed things up we typically go into vSphere and power it off ourselves, which tells me VMware Tools is also not responding to Shut Down Guest. In addition, after the recent upgrade from Horizon 7.13.2 to 2111, we definitely see a huge increase in VMs going into the Already Used state, which typically ends up as Agent Unreachable.
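One quick way we triage a "pingable but frozen" VM is to probe the Horizon agent's TCP ports from a jump box: if ICMP responds but the session ports refuse or time out, the guest OS is up while the session stack is wedged. Here's a minimal stdlib Python sketch; the port numbers are the standard Horizon defaults (Blast Extreme 22443, PCoIP 4172, RDP 3389), so adjust them if your environment differs.

```python
import socket

# Default Horizon agent session ports; change these if your
# environment uses non-standard ports.
AGENT_PORTS = {"Blast": 22443, "PCoIP": 4172, "RDP": 3389}

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def triage(host: str) -> dict:
    """Probe each Horizon agent port and report which ones are reachable."""
    return {name: port_open(host, port) for name, port in AGENT_PORTS.items()}
```

Run `triage("10.x.x.x")` against a frozen VM's IP; all-False alongside a successful ping matches the symptom described above and is worth capturing before you power the VM off.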
We have triple-checked the Carbon Black exclusions, and based on what we found everything is in place. Over the course of this issue we have already built a new optimized image, and then a new non-optimized image (the current one) to see if that would help, but to no avail.
Based on what we are gathering through ControlUp, we can see some page faults and disk I/O contributing to VMs reporting High/Critical stress, but we cannot really correlate that with the VMs actually becoming unreachable. The three things users typically report as the last thing they did before the session became unreachable are using Outlook, Cisco Jabber, or one in-house application.
Sorry about the super long post, but I'm hoping someone has something good to share that could aid us in troubleshooting this issue.
Thank you in advance
I would get rid of Carbon Black, App Volumes, and NVIDIA, create a test pool (with and without changing the VM config), and see how it behaves in terms of CPU usage and performance. Are the hosts capable of running the VMs without hitting their limits?
If the above doesn't help, please get VMware tech support team involved.
For the VMs showing as 'Already Used', try setting the attribute pae-DirtyVMPolicy to a value of 2 on the affected pools and see if that helps. I've had to do this in our environment for the longest time because of similar issues that I could never figure out either.
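If it helps, that attribute lives on the desktop pool object in the Connection Server's ADAM LDAP instance, which you can edit with ADSI Edit against the dc=vdi,dc=vmware,dc=int naming context. Expressed as an LDIF fragment (with `<pool-id>` as a placeholder for your actual pool ID), the change looks roughly like this:

```ldif
dn: cn=<pool-id>,ou=server groups,dc=vdi,dc=vmware,dc=int
changetype: modify
replace: pae-DirtyVMPolicy
pae-DirtyVMPolicy: 2
-
```

Make the change on one Connection Server and it replicates to the others in the pod.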
Also, if you upgrade your NVIDIA drivers to 15.0, you can finally see the VM console with vGPU.
I appreciate your response. For NVIDIA I'd like to wait until it's an LTSB release, because we don't really have time to upgrade it multiple times a year, and the Cisco UCS hardware compatibility matrix limits us quite a bit too.
As for the pools, we already have that attribute in place.