Okay here goes.
Short Description: After Farm Maintenance operations, RDSH instant clones occasionally fail their first user authentication challenge, drop all AppStack attachments, and report "Virtualization is Disabled."
RDSH Image: Server 2016
We have several Horizon RDSH Farms, each of which has multiple instant clone hosts, all of which are based on a single Server 2016 master image. Each Farm's clones live in their own OU, and various appstack assignments are made to each OU. We perform weekly Farm Maintenance operations to recover/reset the RDSH hosts to prevent configuration drift, and Farm Maintenance operations for patch updates.
After these Farm Maintenance operations, we often get users complaining of receiving "Virtualization is Disabled" messages at their Horizon Client. When we look at all the RDSH clones in AppVolumes (Directory -> Computers), we will find random RDSH hosts with 0 appstack attachments, while the other healthy clones have the expected number of appstack attachments. There isn't a pattern to which clones this happens on. Could be one or more clones in one or more farms. It changes each time.
The failing clones typically come up healthy with all stacks attached immediately after the maintenance operations. We know this because we check. Inspection of the agent and manager logs tells us that, when this happens, it generally happens at the moment of a user login... usually the first user login after farm maintenance for the clones that are going to fall down.
We have disabled agent cookies in the AppVolumes database.
We typically gather the logs from the failing agents, as well as the manager logs from the corresponding timeframe. A common theme we see is this in the manager log:
[2021-05-22 08:17:50 UTC] P4484R697 INFO Manager: User Login: upn=OURDOMAIN\UserXYZ account=UserXYZ (domain)
[2021-05-22 08:17:50 UTC] P4484R697 INFO Cvo: Found existing record for "Computer <OURDOMAIN\RDS-FARM1-CLONE1$>", associated to "Machine <RDS-FARM1-CLONE1> (5014c7c6-3004-b058-0ec0-bf7c16ae2345)"
[2021-05-22 08:17:50 UTC] P296R538 INFO RADIR: Creating persistent LDAP connection to domain "ourdomain.com" at "ldap.ourdomain.com (ldap.ourdomain.com):389" with base ""
[2021-05-22 08:17:50 UTC] P4484R697 INFO Cvo: Machine "Machine <RDS-FARM1-CLONE1> (5014c7c6-3004-b058-0ec0-bf7c16ae2345)" was marked as deleted in the past. Marking as existing.
[2021-05-22 08:17:50 UTC] P4484R697 WARN Manager: Unable to login because Computer "Computer <OURDOMAIN\RDS-FARM1-CLONE1$>" is offline, last seen online at "2021-05-22 06:10:25 UTC"
[2021-05-22 08:17:50 UTC] P4484R697 INFO Rendering text template
[2021-05-22 08:17:50 UTC] P4484R697 INFO Rendered text template (0.0ms)
[2021-05-22 08:17:50 UTC] P4484R697 INFO Completed 400 Bad Request in 253ms (Views: 0.4ms | ActiveRecord: 11.3ms)
(names changed to protect the innocent)
So for some reason the manager thinks the agent machine is offline (even though the authentication request is coming from the agent machine), and it denies the request with a 400 error.
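In case it helps anyone comparing notes, here is a rough Python sketch of how the "is offline" denials could be pulled out of a manager log and the gap between "last seen online" and the denied login measured. The regex is built only from the single WARN line quoted above, so it is an assumption that other builds log the same format; the names find_offline_denials, parse_utc and SAMPLE are made up for illustration.

```python
import re
from datetime import datetime, timezone

# Pattern derived from the one WARN line quoted in this post (assumed format).
OFFLINE_RE = re.compile(
    r'\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) UTC\]\s+\S+\s+WARN\s+Manager: '
    r'Unable to login because Computer "Computer <(?P<computer>[^>]+)>" is offline, '
    r'last seen online at "(?P<seen>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) UTC"'
)

# The WARN line from the excerpt, used as sample input.
SAMPLE = (
    r'[2021-05-22 08:17:50 UTC] P4484R697 WARN Manager: Unable to login because '
    r'Computer "Computer <OURDOMAIN\RDS-FARM1-CLONE1$>" is offline, '
    r'last seen online at "2021-05-22 06:10:25 UTC"'
)


def parse_utc(text):
    """Parse a 'YYYY-MM-DD HH:MM:SS' string as a timezone-aware UTC datetime."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)


def find_offline_denials(log_text):
    """Return (computer, denied_at, last_seen, gap) for every offline denial found."""
    results = []
    for match in OFFLINE_RE.finditer(log_text):
        denied_at = parse_utc(match.group("ts"))
        last_seen = parse_utc(match.group("seen"))
        results.append((match.group("computer"), denied_at, last_seen, denied_at - last_seen))
    return results


if __name__ == "__main__":
    for computer, denied_at, last_seen, gap in find_offline_denials(SAMPLE):
        print(f"{computer}: denied at {denied_at}, last seen {last_seen}, gap {gap}")
```

Run against a full manager log, the gap column might show whether the "last seen online" time lines up with the maintenance window or with something later, which would at least narrow down when the manager stopped hearing from the clone.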
FYI, the LDAP call is made to our NetScaler load-balanced LDAP VIP. For a while I worried that traffic was going to DCs in other sites and that computer account changes made during farm maintenance operations weren't replicating fast enough, but I've confirmed the load balancer's destinations are all in the same local AD site.
I don't really understand why this is happening. I especially don't understand why it will happen to like a single clone in a farm of 10 identical machines.
Would changing to LDAPS help? We're on that path, but just haven't done it yet.
Any help is appreciated. Thanks!