Multi-Tenant Horizon DaaS - Sporadic Instant Clone Domain Trust Issues: Possible Cause?

Recently we ran into a problem where select groups of Instant Clone machines lost domain trust shortly after certificate updates on the underlying vCenters in two DCs. Machines initially could not be allocated when users logged in through the normal methods, and direct console logins gave the usual "The trust relationship between this workstation and the primary domain failed". Everything has since been rectified, but we are still trying to find the actual root cause, both to confirm it is really fixed and to provide accurate documentation for all parties affected. We also need to update the certificates in the remaining DCs, so we want to understand this as well as possible to avoid causing further problems, if the cert update was in fact the cause.


This happened hours after we did SSL certificate updates on the hosting vCenters in two separate DCs, though the TAs had been done about two weeks earlier with no known issues. The scope of the affected machines was not consistent at all. It only affected a few tenants in each DC, and even then only seemed to affect machines that did not have an active user connection. Users who were already logged in were just fine well after the cert update. Approximately 125 out of 600 machines were affected.


Now, we could lazily pin the issue on the cert updates, but the fact that it only caused problems for certain machines (not ALL machines in each DC), along with its sporadic nature, leaves us confused.


We tried a few things to correct it before we found that deleting the affected machines and rebuilding them as new was the easiest/quickest fix. Reboots on the instant clones did nothing, as I assume the templates/replicas were ultimately busted. As for logs, the TAs did not show much. We also sent them in to VMware Support for assistance, but they requested the machine agent logs, which we no longer have because we deleted all the affected machines. However, I may have some snapshots still within retention that we can restore for further investigation and to grab those logs.

Has anybody else run into a similar issue, or does anyone have an idea of what could have caused the problems, or of items I can take a look at when I restore one of those machines? Any help would be appreciated. Thanks!

