VMware Cloud Community
sprouse94
Enthusiast

Cluster Agent VM is missing on cluster XYZ (vCLS)

Just upgraded to vSphere 7 Update 1, and in the VMs and Templates view I see it created the folder for vCLS.  I only have one cluster, and DRS is complaining about the unhealthy state of the vSphere Cluster Service... which makes sense, as none of the vCLS VMs have been created.  I created a new cluster out of curiosity and moved some hosts into that cluster, but I am getting the same issue.

 

In vCenter, when I go to Administration -> vCenter Server Extensions and look at the vSphere ESX Agent Manager, I see both clusters have alerts and both have the same message... "Cluster agent VM is missing in the cluster", which makes sense, since none exist.  Nice that there is a Resolve All Issues button, but it doesn't resolve any issues of mine.

 

I am poking around trying to find logs that help pinpoint the exact issue but haven't been successful just yet.  Has anyone seen this before, or can you point me in the right direction of the logs to find the underlying reason why the vCLS VMs are not getting created?
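(For reference, the agent manager writes to its own log on the appliance; a minimal way to follow it, assuming the default VCSA log location for the EAM service:)

# Follow the ESX Agent Manager log on the vCenter appliance (default VCSA location)
tail -f /var/log/vmware/eam/eam.log

# Or search it after the fact for provisioning errors
grep -i "provision" /var/log/vmware/eam/eam.log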

 

All ESXi hosts have been patched to 7.0.1 and have the same CPU make / model (Intel).

 

** Update: OK, found this in the EAM log: "can't provision VM for ClusterAgent due to lack of suitable datastore".  All of my datastores have 100 or more GB free... but I will start down that path **
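(If you want to sanity-check free space from an ESXi host's shell rather than the UI, either of these will list the mounted datastores; a quick sketch, not specific to the vCLS placement logic:)

# On an ESXi host (SSH / ESXi Shell): mounted volumes with used/free space
df -h

# More detail per datastore (type, capacity, free bytes)
esxcli storage filesystem list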

34 Replies
SrVMwarer
Hot Shot

I had the exact same issue... just fixed it by running fixsts.sh :))
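(For anyone who hasn't run it before, the rough sequence is the same download/run/restart that dcolpitts lists further down this thread; the KB 76719 download link may have moved since:)

# Pull fixsts.sh from KB 76719 onto the VCSA, make it executable, and run it
curl https://kb.vmware.com/sfc/servlet.shepherd/version/download/068f400000JAn50AAD -o /root/fixsts.sh
chmod +x /root/fixsts.sh
cd /root
/root/fixsts.sh

# Then restart all vCenter services, as the KB describes
service-control --stop --all
service-control --start --all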

Regards, İlyas
fojtp
Contributor

My problem was a little "bigger".
Before running fixsts, we created a test cluster and moved an ESXi host into it - vCenter didn't create a vCLS VM there either.
After running fixsts, vCenter tried to create a vCLS VM in the existing cluster and immediately deleted it (every minute!), while in the test cluster it created one without any problems. So fixsts alone didn't help us.

However, we tried the procedure described in https://kb.vmware.com/s/article/80472 and after disabling/enabling vCLS creations, the existing cluster recovered and created vCLS correctly and DRS is fully functional.
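(For context, the KB 80472 procedure boils down to toggling a per-cluster advanced setting on the vCenter Server object; a sketch, where domain-c8 is a placeholder for your cluster's domain ID, visible in the browser URL when the cluster is selected:)

# vCenter Server object > Configure > Advanced Settings > Edit Settings
config.vcls.clusters.domain-c8.enabled = false   # "retreat mode" - existing vCLS VMs are cleaned up
config.vcls.clusters.domain-c8.enabled = true    # re-enable - vCenter redeploys the vCLS VMs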

 

Thanks to All for your help!

sprouse94
Enthusiast

Thanks for the info / results.  For me, I was able to get them to deploy using the lsdoctor tool.

Ran "python lsdoctor.py -l"... it told me there was an SSL trust mismatch.

Ran "python lsdoctor.py -t"... that corrected the mismatch.

vCenter immediately created and powered up the vCLS machines, and DRS appears to be happy again.
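(If it helps anyone else, the tool can be pulled straight onto the appliance with the same curl link dcolpitts posts below; a sketch assuming you unpack it under /root:)

# Download lsdoctor (KB 80469), unpack it, then run the check followed by the trust fix
curl https://kb.vmware.com/sfc/servlet.shepherd/version/download/0685G00000S5Q77QAF -o /root/lsdoctor.zip
unzip /root/lsdoctor.zip -d /root
cd /root/lsdoctor-master
python lsdoctor.py -l     # report-only check
python lsdoctor.py -t     # trust fix, only if the check flags an SSL trust mismatch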

funarpps
Contributor

Has anyone heard any updates yet?  Is there a 17004997_7.0.1.00100_vcsa floating about, or an updated lsdoctor?

dcolpitts
Enthusiast

So I swore at this issue for a day or so too, after migrating from VCSA 6.7 to 7.0.1 (build 17327586) yesterday.  In my case, my 6.7 VCSA (whatever the most recent version of 6.7 was on January 24, 2021) had been migrated multiple times over the years, version to version, starting from I want to say VCSA 5 (but maybe it was 5.5).  It has also had an AD Certificate Authority issued certificate on it for many years (the cert says it's valid from May 2015 to July 2024, so it's been around for a while).  Eventually (after I opened a ticket with VMware Support 3+ hrs ago, to which I haven't gotten a response yet) I stumbled onto this thread, which led me to this route of resolution.

These are the steps I took, in the order I took them.

  1. Enabled then disabled Retreat Mode as per https://kb.vmware.com/s/article/80472
  2. Ran lsdoctor.py from https://kb.vmware.com/s/article/80469 and had to use the trustfix and stalefix options to fix the two issues it identified.
  3. Ran checksts.py from https://kb.vmware.com/s/article/79248 - this identified multiple root certs with different thumbprints and expiry dates
  4. Ran fixsts.sh from https://kb.vmware.com/s/article/76719
  5. Ran checksts.py from https://kb.vmware.com/s/article/79248 again, and showed I now only have a single root cert.
  6. Ran "service-control --stop --all" to stop all the services after fixsts.sh finished (as is detailed in the KB article).
  7. Ran "service-control --start --all" to restart all services after fixsts.sh finished (as is detailed in the KB article).

By the time I made a pit stop for coffee, got the Chrome cache cleared, and managed to get logged back into VC, all the vCLS VMs were finally deployed.

Incidentally, the eam.log indicated this prior to running fixsts.sh:

FAILED:  com.vmware.eam.sso.exception.TokenNotAcquired: Couldn't acquire token due to: Signature validation failed
Caused by: com.vmware.vapi.std.errors.ServiceUnavailable: ServiceUnavailable (com.vmware.vapi.std.errors.service_unavailable)
Can't provision VM for ClusterAgent(ID: 'Agent:48c988c8-570a-43d6-a12a-XXXXXXXXXX:null') due to lack of suitable datastore.

dcc

davidr78
Enthusiast

dcolpitts - that process worked for me. I had previously gone through the lsdoctor script, but it didn't resolve the issue. It was fixsts that was needed, as I had 3 root certs within the system. This vCenter has also been upgraded from previous versions. Thanks for sharing the solution.

ksl281
Contributor

I've just spent hours trying to fix this issue, running all the scripts / commands from VMware.

Your guide worked for me! Thanks so much for the help! 🙂 

seslinger
Contributor

dcolpitts, I opened a support ticket with VMware and the technician and I ended up using steps 2-7 of your solution.  Thanks.  He also sends his Kudos to you.

rdowling2
Contributor

I raised a ticket for the same problem, pointed out this post to the support engineer, and still ended up waiting hours for them to drip feed me the steps themselves. 

Thank you dcolpitts

dcolpitts
Enthusiast

Slight update on my original instructions.  Getting the scripts onto the vCenter is a pain, so I now just use curl to pull them down.  The overall steps are still the same...

  1. Enabled then disabled Retreat Mode as per https://kb.vmware.com/s/article/80472
  2. Ran lsdoctor.py from https://kb.vmware.com/s/article/80469 and had to use the trustfix and stalefix options to fix the two issues it identified.
  3. Ran checksts.py from https://kb.vmware.com/s/article/79248 - this identified multiple root certs with different thumbprints and expiry dates
  4. Ran fixsts.sh from https://kb.vmware.com/s/article/76719
  5. Ran checksts.py from https://kb.vmware.com/s/article/79248 again, and showed I now only have a single root cert.
  6. Ran "service-control --stop --all" to stop all the services after fixsts.sh finished (as is detailed in the KB article).
  7. Ran "service-control --start --all" to restart all services after fixsts.sh finished (as is detailed in the KB article).

SSH into the vCenter appliance with PuTTY, log in as root, and then cut and paste these commands down to the first "--stop--".  Then apply each command / fix as required for your environment.  Note that the curl links were valid at the time I created this post (2021.05.17).

 

--start cut & paste below here--

 

curl https://kb.vmware.com/sfc/servlet.shepherd/version/download/0685G00000NxYfZQAV -o /root/configure_retreat_mode.py

curl https://kb.vmware.com/sfc/servlet.shepherd/version/download/0685G00000S5Q77QAF -o /root/lsdoctor.zip

curl https://kb.vmware.com/sfc/servlet.shepherd/version/download/068f400000HW9InAAL -o /root/checksts.py

curl https://kb.vmware.com/sfc/servlet.shepherd/version/download/068f400000JAn50AAD -o /root/fixsts.sh

chmod +x /root/fixsts.sh

unzip /root/lsdoctor.zip

cd /root/lsdoctor-master

python /root/lsdoctor-master/lsdoctor.py -l

 

--stop--

 

python /root/lsdoctor-master/lsdoctor.py --stalefix

 

--stop--

 

python /root/lsdoctor-master/lsdoctor.py --trustfix

 

--stop--

 

python /root/checksts.py

 

--stop--

 

cd /root

/root/fixsts.sh

 

--stop--

 

service-control --stop --all

 

--stop--

 

service-control --start --all

 

willoland
Contributor

Steps 3-7 resolved this for me. I had 3 root certs.

Thank you for the guide.

rbdrbd
Contributor

I'm running a fresh install of 7.0.2 (Update 2) and encountered this issue, but none of the solutions mentioned worked for me. I tailed the eam.log file and noticed errors stating that it was not able to connect to the database, along with 'service still initializing' type messages.

 

I found this KB article: https://kb.vmware.com/s/article/2112577 

The steps in it worked for me.
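(Not from the KB, but a quick way to see whether the EAM service itself is up on the appliance before digging into the database side; a sketch using the appliance's service-control tool:)

# Check the ESX Agent Manager service status on the VCSA
service-control --status vmware-eam

# Restart just that service if it is stuck initializing
service-control --stop vmware-eam
service-control --start vmware-eam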

OctaviaIT
Contributor

Facing a similar issue where the vCLS VMs couldn't be created. Error: Can't provision VM for ClusterAgent(ID: 'Agent::null') due to lack of suitable datastore.

None of the steps from this thread helped.

After some troubleshooting with VMware, they pointed out that SRM is the issue, as per https://docs.vmware.com/en/Site-Recovery-Manager/8.4/com.vmware.srm.admin.doc/GUID-531FB787-8B30-401...

Unfortunately, SRM also broke after the 7.0.2 upgrade, so work is still in progress to fix SRM first and then to unprotect one datastore for vCLS.

ace02000
Enthusiast

fixsts.sh fixed my problem too... thank you!

warnox
Enthusiast

None of the above options worked for me; the solution was as per https://kb.vmware.com/s/article/80588. The odd thing is that all the certificate checks were coming back successful.
