Hello,
A customer asked me to look at their vCenter with 4 clusters, and all 4 clusters had the "vSphere DRS functionality was impacted due to unhealthy state vSphere Cluster Services" error.
The version is 7 U2b.
Me: what happened?
Customer: I don't know
Me: since when?
Customer: I saw that error for the first time a couple of days ago.
Me: and no-one did anything?
Customer: nope
Anyway, the first thing I thought was that someone did not like those vCLS VMs, found some blog and enabled "Retreat Mode". But in the vCenter Advanced Settings, there were no "config.vcls.clusters.domain-c(number).enabled" settings. Did somebody add and set them (4x, one for each cluster) and then delete them? Unlikely, because once set, such settings cannot be removed via the GUI, so I ruled that out. But then what deleted the vCLS VMs, and how, without leaving a trace? I was at this customer a month earlier and those VMs were there. I'm 100% sure.
I added 4 of those "config.vcls.clusters.domain-c(number).enabled = true" settings, one for each cluster, with the correct ID. Nothing happened.
The ESX Agent Manager service is running and there is nothing weird in the eam.log. None of the Knowledge Base articles I found apply. All looks good, but I just can't get these vCLS VMs back. There is no trace of them. They just disappeared into thin air. I'm stumped.
Does anyone have any ideas before I open an SR?
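For anyone wanting to double-check the setting names before typing them into Advanced Settings, here is a minimal sketch in Python that builds the key from a cluster's managed object ID. The helper name retreat_mode_key and the IDs domain-c8 and domain-c1006 are made up for illustration; use the IDs of your own clusters.

```python
import re


def retreat_mode_key(cluster_moref: str) -> str:
    """Build the per-cluster vCLS advanced setting name.

    cluster_moref is the cluster's managed object ID, e.g. "domain-c8"
    (a hypothetical ID, yours will differ).
    """
    # Reject anything that does not look like a cluster ID, so a typo
    # does not silently produce a setting name that matches no cluster.
    if not re.fullmatch(r"domain-c\d+", cluster_moref):
        raise ValueError(f"not a cluster ID: {cluster_moref!r}")
    return f"config.vcls.clusters.{cluster_moref}.enabled"


# Hypothetical cluster IDs, for illustration only.
for moref in ["domain-c8", "domain-c1006"]:
    print(retreat_mode_key(moref), "= true")
```

This only formats the names. It does not talk to vCenter, and as described above, setting the keys to true did not bring the VMs back in this case.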
Update: the issue is solved. As per Support's analysis, the issue is matching KB https://kb.vmware.com/s/article/85742?lang=en_US
Replacing the STS certs was not enough. I also replaced the Solution User certificates with VMCA-generated certificates and rebooted. After the reboot, the vCLS VMs started to be deployed again. All is good now.
An SR is probably best; there are a lot of certificate issues that can cause this. In my case, we cleared out a bunch of expired CA certs and regenerated the STS certificate.
I know, and I checked whether there is anything in the eam.log concerning certificates, expired items, etc., but couldn't find anything. None of the certs are expired. I regenerated the STS certs (fixsts.sh) anyway, but it did not help. There are also no vCLS folders anywhere, which is also strange.
I'll open an SR.
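As a side note on the expiry check, here is a rough sketch of how the "not after" date of a certificate could be compared against the current time in Python. It assumes you already have the date as a string in the usual "Jan 15 09:34:43 2018 GMT" form; the function name days_until_expiry and the sample dates are made up for illustration.

```python
import ssl
import time
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Return the days left until a certificate's notAfter date.

    not_after must be in the "%b %d %H:%M:%S %Y GMT" form.
    A negative result means the certificate has already expired.
    """
    if now is None:
        now = time.time()
    # ssl.cert_time_to_seconds converts the date string to epoch seconds.
    expires_at = ssl.cert_time_to_seconds(not_after)
    return (expires_at - now) / 86400


# Sample date for illustration only; it is long past, so this is negative.
print(days_until_expiry("Jan 15 09:34:43 2018 GMT"))
```

Keep in mind that a clean result here does not rule out a certificate problem. In this thread none of the certs were expired, and the issue still came down to certificates.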