sprouse94
Contributor

Cluster Agent VM is missing on cluster XYZ (vCLS)


Just upgraded to vSphere 7 Update 1 and see that in the VMs and Templates view a folder for vCLS has been created. I only have one cluster, and DRS is complaining about the unhealthy state of the vSphere Cluster Services...which makes sense, as none of the vCLS VMs have been created. Out of curiosity I created a new cluster and moved some hosts into it, but I'm getting the same issue.

 

In vCenter, when I go to Admin -> vCenter Server Extensions and look at the vSphere ESX Agent Manager, I see both clusters have alerts with the same message..."Cluster agent VM is missing in the cluster"...which makes sense, since none exist. Nice that there is a Resolve All Issues button, but it doesn't resolve any issues of mine.

 

I am poking around trying to find logs that help pinpoint the exact issue but haven't been successful just yet. Has anyone seen this before, or can anyone point me toward the logs that would show why the vCLS VMs are not getting created?

 

All ESXi hosts have been patched to 7.0.1 and have the same CPU make/model (Intel).

 

** Update: OK, found this in the EAM log: "can't provision VM for ClusterAgent due to lack of suitable datastore". All of my datastores have 100 GB or more free...but I will start down that path. **
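
For anyone else digging, the message turned up in the ESX Agent Manager log on the appliance. The path below is from my 7.0 U1 VCSA, so it may differ on other builds:

  # Follow the EAM log and filter for agent provisioning errors
  tail -f /var/log/vmware/eam/eam.log | grep -i "provision"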

SrVMwarer
Hot Shot

I had the exact same issue...just fixed it by running fixsts.sh : ))
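
In case it helps, the rough sequence from KB 76719 as I ran it (double-check the KB for the current steps before running anything):

  # On the VCSA shell, per KB 76719
  chmod +x fixsts.sh
  ./fixsts.sh
  # The KB has you restart all services afterwards
  service-control --stop --all && service-control --start --all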

Regards, İlyas
fojtp
Contributor

My problem was a little "bigger".
Before running fixsts, we created a test cluster and moved an ESXi host into it - vCenter didn't create a vCLS VM there either.
After running fixsts, vCenter tried to create a vCLS VM in the existing cluster and immediately deleted it (every minute!), while in the test cluster it created one without any problems. So fixsts alone didn't help us.

However, we tried the procedure described in https://kb.vmware.com/s/article/80472, and after disabling and re-enabling vCLS creation, the existing cluster recovered, created its vCLS VMs correctly, and DRS is fully functional.
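
For reference, the disable/enable from KB 80472 is done with a vCenter advanced setting; the domain ID below (domain-c1234) is only a placeholder - the real one comes from the cluster's URL in the vSphere Client:

  # vCenter Server > Configure > Advanced Settings, per KB 80472
  config.vcls.clusters.domain-c1234.enabled = false   # Retreat Mode on: vCLS VMs are removed
  config.vcls.clusters.domain-c1234.enabled = true    # Retreat Mode off: vCLS VMs are recreated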

 

Thanks to All for your help!

sprouse94
Contributor

Thanks for the info / results. I was able to get them to deploy using the lsdoctor tool.

Ran "phython lsdoctor.py -l"....told me there was an SSL Trust Mismatch

Ran "phython lsdoctor.py -t"....corrected the Mismatch

vCenter immediately created and powered up the vCLS machines and DRS appears to be happy again.
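
For anyone following along, the whole thing from the VCSA shell looked roughly like this (lsdoctor is from KB 80469; flags shown as I used them, so check the KB for current options):

  # Run from the directory where lsdoctor was extracted
  python lsdoctor.py -l    # check only - this reported the SSL trust mismatch
  python lsdoctor.py -t    # trust fix - corrected the mismatch
  # Restart services after the fix, as the KB describes
  service-control --stop --all && service-control --start --all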

funarpps
Contributor

Anyone heard any updates yet? Is there a 17004997_7.0.1.00100_vcsa floating about, or an updated lsdoctor?

dcolpitts
Enthusiast

So I swore at this issue for a day or so too, after migrating from VCSA 6.7 to 7.0.1 (build 17327586) yesterday. In my case, my 6.7 VCSA (whatever the most recent version of 6.7 was on January 24, 2021) had been migrated multiple times over the years, version to version, starting from what I want to say was VCSA 5 (but maybe it was 5.5). It has also had an AD Certificate Authority issued certificate on it for many years (the cert says it's valid from May 2015 to July 2024, so it's been around for a while). Eventually (after I opened a ticket with VMware Support 3+ hours ago, to which I haven't gotten a response yet), I stumbled onto this thread, which led me to this route of resolution.

These are the steps I took, in the order I took them.

  1. Enabled then disabled Retreat Mode as per https://kb.vmware.com/s/article/80472
  2. Ran lsdoctor.py from https://kb.vmware.com/s/article/80469 and had to use the trustfix and stalefix options to fix the two issues it identified.
  3. Ran checksts.py from https://kb.vmware.com/s/article/79248 - this identified multiple root certs with different thumbprints and expiry dates
  4. Ran fixsts.sh from https://kb.vmware.com/s/article/76719
  5. Ran checksts.py from https://kb.vmware.com/s/article/79248 again; this time it showed only a single root cert.
  6. Ran "service-control --stop --all" to stop all the services after fixsts.sh finished (as is detailed in the KB article).
  7. Ran "service-control --start --all" to restart all services after fixsts.sh finished (as is detailed in the KB article).
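
Once the services were back up, I confirmed EAM was behaving again by watching its log (same path as mentioned earlier in the thread; adjust for your build):

  # The "TokenNotAcquired ... Signature validation failed" errors stop after the fix,
  # and vCLS agent deployments start showing up instead
  grep -iE "vcls|TokenNotAcquired" /var/log/vmware/eam/eam.log | tail -n 20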

By the time I made a pitstop for coffee, cleared the Chrome cache, and managed to get logged back into VC, all the vCLS VMs were finally deployed.

Incidentally, the EAM log showed this prior to running fixsts.sh:

FAILED:  com.vmware.eam.sso.exception.TokenNotAcquired: Couldn't acquire token due to: Signature validation failed
Caused by: com.vmware.vapi.std.errors.ServiceUnavailable: ServiceUnavailable (com.vmware.vapi.std.errors.service_unavailable)
Can't provision VM for ClusterAgent(ID: 'Agent:48c988c8-570a-43d6-a12a-XXXXXXXXXX:null') due to lack of suitable datastore.

dcc

davidr78
Contributor

dcolpitts - that process worked for me. I had previously gone through the lsdoctor script, but it didn't resolve the issue. It was fixsts that was needed, as I had 3 root certs in the system. This vCenter has also been upgraded from previous versions. Thanks for sharing the solution.
