VMware Cloud Community
sprouse94
Enthusiast

Cluster Agent VM is missing on cluster XYZ (vCLS)

Just upgraded to vSphere 7 Update 1, and in the VMs and Templates view I see it created the folder for vCLS.  I only have one cluster, and DRS is complaining about the unhealthy state of the vSphere Cluster Service, which makes sense as no vCLS VMs have been created.  I created a new cluster out of curiosity and moved some hosts into it, but I'm getting the same issue.

 

In vCenter, when I go to Admin -> vCenter Server Extensions and look at the vSphere ESX Agent Manager, I see both clusters have alerts with the same message: "Cluster agent VM is missing in the cluster". That makes sense, since none exist.  It's nice that there is a Resolve All Issues button, but it doesn't resolve any of my issues.

 

I am poking around trying to find logs that help pinpoint the exact issue but haven't been successful yet.  Has anyone seen this before, or can anyone point me to the logs that would show why the vCLS VMs are not getting created?

 

All ESXi hosts have been patched to 7.0.1 and have the same CPU make / model (Intel)

 

** Update: OK, found in the EAM log "can't provision VM for ClusterAgent due to lack of suitable datastore".  All of my datastores have 100 GB or more free, but I will start down that path. **


Accepted solution: the reply from nswilson82 further down the thread, relaying VMware support's suggestion to run lsdoctor (https://kb.vmware.com/s/article/80469).

Lalegre
Virtuoso

Hey @sprouse94,

Do you have vSAN in this environment?

sprouse94
Enthusiast

No, the datastores are iSCSI

Lalegre
Virtuoso

So vCLS is deployed on the best-ranked datastore that is shared across multiple ESXi hosts. I presume the iSCSI storage you are talking about is presented to all the ESXi hosts in the cluster. How many nodes do you have?

sprouse94
Enthusiast

Yes, there are 4 nodes in the cluster and they all have access to about 12 different iSCSI datastores, all with more than enough space.  I think vCLS needs 2 GB; they all have well over 100 GB free.  There is also an NFS datastore that is accessible to all 4 nodes.
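The "shared by all hosts, with enough free space" condition discussed above can be sanity-checked with a small sketch like the one below. This is not EAM's actual placement logic, which is not shown anywhere in this thread. The host names, datastore names, free-space figures, the function name, and the 2 GB default (the figure guessed at above) are all hypothetical.

```python
from typing import Dict, List


def shared_datastores_with_space(
    host_datastores: Dict[str, Dict[str, float]],
    required_gb: float = 2.0,  # assumption: the "2 GB, I think" figure from the post
) -> List[str]:
    """Return datastores visible to every host with at least required_gb free."""
    if not host_datastores:
        return []
    # Start from the first host's datastores and intersect with the others.
    views = list(host_datastores.values())
    common = set(views[0])
    for view in views[1:]:
        common &= set(view)
    # Be conservative: use the smallest free-space figure any host reports.
    return sorted(
        name
        for name in common
        if min(view[name] for view in views) >= required_gb
    )


# Hypothetical inventory: free space in GB as seen from each host.
inventory = {
    "esx1": {"iscsi-01": 120.0, "iscsi-02": 150.0, "nfs-01": 300.0},
    "esx2": {"iscsi-01": 120.0, "iscsi-02": 150.0, "nfs-01": 300.0},
    "esx3": {"iscsi-01": 120.0, "nfs-01": 300.0, "local-3": 1.0},
}
print(shared_datastores_with_space(inventory))
```

If a check like this passes and EAM still reports a "lack of suitable datastore", the cause may well lie somewhere other than storage capacity.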

Lalegre
Virtuoso

Oh, then you definitely have the resources to deploy vCLS, so I can think of two scenarios: one is a bug, and the other is that the service is kind of stuck. Have you tried restarting the EAM service?

nswilson82
Contributor

We upgraded our vCenter from 6.5 to 7.0 Update 1 yesterday and are experiencing exactly the same problems.

At the moment DRS doesn't work on any of our clusters because all the vCLS VMs failed to deploy.

The eam log states - Can't provision VM for ClusterAgent(XYZ) due to lack of suitable datastore......

If anyone knows how to fix this please speak up!

nswilson82
Contributor

I've just had an update from VMware support: it's still with engineering and they do not have a fix for this issue yet.

sprouse94
Enthusiast

OK, glad I am not the only one with this issue.  Hopefully this is addressed sooner rather than later.

nswilson82
Contributor

After some digging around in the EAM log, I found that it's deploying this OVF:

-rw-r--r-- 1 root root    36782 Aug  1 17:02 photon-ova-0.0.1-16677410.ovf
-rw-r--r-- 1 root root     1909 Aug  1 17:02 photon-ova.cert
-rw-r--r-- 1 root root 75251200 Aug  1 17:02 photon-ova-disk1.vmdk
-rw-r--r-- 1 root root      148 Aug  1 17:02 photon-ova.mf

I was wondering if the vCLS VMs could be deployed manually, but I don't really want to test my theory on a production VC...

cfmorrell
Contributor

Did you get anywhere on this?  I'm having the same issue and seeing the same errors in the EAM log.  I don't have vSAN configured, so the VMware docs don't appear to offer any thoughts.

nswilson82
Contributor

Nope. VMware support said that the vCLS VMs cannot be deployed manually, and they still don't have a fix or an ETA for one.

The support engineer mistakenly thought that the problem was only affecting vSAN clusters, so we set them straight on that.

The last we heard from support was that they were "opening a PR internally to flag the issue further with engineering as there are many similar known issues which are unresolved" and that it could be a while before a fix was available.

So we reverted to 6.5 last week, which involved a fair amount of turning HA on and off again, disconnecting and reconnecting a bunch of hosts, and a couple of reboots of the 6.5 VC before everything would behave properly...

We are upgrading to 6.7 this week to avoid being affected by the loss of Flash at the end of December.

 

cfmorrell
Contributor

Well, that's a bummer.  I'm running an educational cluster with ~100 resource groups, ~1000 machines, and way too many people touching vCenter, and we're in the middle of final project season.  I think I'm going to cross my fingers that everything remains stable as we limp through the last couple of weeks in the semester.  It's super exciting that VMware put us in this situation though.  😐

fojtp
Contributor

Same problem here. Any news from VMware?

nswilson82
Contributor

I had this response from support yesterday:

I've been working with engineering and they mentioned a similar case to this which suggested the issue had to do with certificates.

Can you try following the steps in the below KB article to fix the SSL trusts and rebuild service registrations on the affected nodes using lsdoctor?

https://kb.vmware.com/s/article/80469

1. Run lsdoctor with the "-t, --trustfix" option to fix any trust issues.

2. Run lsdoctor with the "-r, --rebuild" option to rebuild service registrations

However, we already rolled back vCenter to 6.5 and then re-upgraded it to 6.7, so we cannot test whether this works at the moment. If anyone decides to try it, please reply and let us know whether it worked for you. 🙂

fojtp
Contributor

before fix:

root@vcenter [ ~/lsdoctor/lsdoctor-master ]# python ./lsdoctor.py -l

ATTENTION: You are running a reporting function. This doesn't make any changes to your environment.
You can find the report and logs here: /var/log/vmware/lsdoctor

2020-12-02T11:11:50 INFO main: You are reporting on problems found across the SSO domain in the lookup service. This doesn't make changes.
2020-12-02T11:11:51 INFO live_checkCerts: Checking services for trust mismatches...
2020-12-02T11:11:51 INFO generateReport: Listing lookup service problems found in SSO domain
2020-12-02T11:11:51 ERROR generateReport: default-first-site\vcenter.masked.domain (VC 7.0 or CGW) found Port 7444 Found: Please run python ls_doctor.py --stalefix option on this node.
2020-12-02T11:11:51 ERROR generateReport: default-first-site\vcenter.masked.domain (VC 7.0 or CGW) found SSL Trust Mismatch: Please run python ls_doctor.py --trustfix option on this node.
2020-12-02T11:11:51 INFO generateReport: No issues detected in the lookup service entries for ##NO_HOSTNAME##.
2020-12-02T11:11:51 INFO generateReport: Report generated: /var/log/vmware/lsdoctor/vcenter.masked.domain-2020-12-02-111150.json
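The report above already names the option to run for each problem, so a short sketch can pull those recommendations out of a saved copy of the output. The regular expression is written only against the ERROR lines shown in this post and is an assumption about their format; other lsdoctor versions may word them differently. The function name and the `sample` text are hypothetical.

```python
import re
from typing import List, Tuple

# Matches ERROR lines of the form shown in the report above, e.g.
# "<timestamp> ERROR generateReport: <node> found <problem>: Please run python ls_doctor.py --trustfix option on this node."
ERROR_LINE = re.compile(
    r"^\S+ ERROR generateReport: (?P<node>.+?) found (?P<problem>.+?): "
    r"Please run python ls_doctor\.py (?P<option>--\w+) option"
)


def suggested_fixes(report_text: str) -> List[Tuple[str, str, str]]:
    """Return (node, problem, suggested option) for each ERROR line in the report."""
    fixes = []
    for line in report_text.splitlines():
        match = ERROR_LINE.match(line.strip())
        if match:
            fixes.append(
                (match.group("node"), match.group("problem"), match.group("option"))
            )
    return fixes


# Lines copied from the report above (raw string so the backslash is kept as-is).
sample = r"""2020-12-02T11:11:51 INFO live_checkCerts: Checking services for trust mismatches...
2020-12-02T11:11:51 ERROR generateReport: default-first-site\vcenter.masked.domain (VC 7.0 or CGW) found Port 7444 Found: Please run python ls_doctor.py --stalefix option on this node.
2020-12-02T11:11:51 ERROR generateReport: default-first-site\vcenter.masked.domain (VC 7.0 or CGW) found SSL Trust Mismatch: Please run python ls_doctor.py --trustfix option on this node.
2020-12-02T11:11:51 INFO generateReport: No issues detected in the lookup service entries for ##NO_HOSTNAME##.
"""

for node, problem, option in suggested_fixes(sample):
    print(f"{option}: {problem} on {node}")
```

This only reads text; it does not run lsdoctor or change anything, so the snapshot warnings printed by the fix options still apply before acting on any recommendation.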

First fix - stalefix:

root@vcenter [ ~/lsdoctor/lsdoctor-master ]# python ./lsdoctor.py --stalefix

WARNING: This script makes permanent changes. Before running, please take *OFFLINE* snapshots
of all VC's and PSC's at the SAME TIME. Failure to do so can result in PSC or VC inconsistencies.
Logs can be found here: /var/log/vmware/lsdoctor

2020-12-02T11:12:36 INFO main: You are running a check on this node for stale 5.x data. NOTE: Please run this script on all VC's or PSC's in the SSO domain to be thorough.

Have you taken offline (PSCs and VCs powered down at the same time) snapshots of all nodes in the SSO domain or supported backups?[y/n]y


Provide password for administrator@vsphere.local:
2020-12-02T11:12:57 INFO __init__: Retrieved services for machine with hostname: vcenter.masked.domain
2020-12-02T11:12:57 INFO checkStale: Checking for logbrowser or 5.x vsphere client services...
2020-12-02T11:12:57 WARNING checkStale: PROBLEM FOUND: logbrowser service found. Attempting to unregister...
2020-12-02T11:12:57 INFO checkStale: Success!
2020-12-02T11:12:57 WARNING checkStale: PROBLEM FOUND: stale 5.x webclient service found. Attempting to unregister...
2020-12-02T11:12:57 INFO checkStale: Success!
2020-12-02T11:12:57 INFO checkStale: PASSED: 5.x vcenter service not found.
2020-12-02T11:12:57 INFO backup_machine: Exporting MACHINE_SSL_CERT cert and key
2020-12-02T11:12:57 INFO checkLegacy: Checking for STS_INTERNAL_SSL_CERT...
2020-12-02T11:12:57 INFO backup_machine: Exporting MACHINE_SSL_CERT cert and key
2020-12-02T11:12:57 INFO check_sts_internal: Checking for STS_INTERNAL_SSL_CERT...
2020-12-02T11:12:57 INFO backup_sts_internal: Backing up STS_INTERNAL_SSL_CERT
2020-12-02T11:12:57 INFO checkLegacy: PROBLEM FOUND: STS_INTERNAL_SSL_CERT found!
2020-12-02T11:12:57 INFO replace_sts_internal: Replacing STS_INTERNAL_SSL_CERT with MACHINE_SSL_CERT
2020-12-02T11:12:57 INFO replace_sts_internal: Successfully replaced STS_INTERNAL_SSL_CERT
2020-12-02T11:12:57 INFO checkLegacy: Checking for 7444 in legacy services...
2020-12-02T11:12:57 WARNING checkLegacy: PROBLEM FOUND: Found port 7444 in service registration URL! https://vcenter.masked.domain:7444/sso-adminserver/sdk/vsphere.local
2020-12-02T11:12:57 WARNING checkLegacy: PROBLEM FOUND: Found port 7444 in service registration URL! https://vcenter.masked.domain:7444/sts/STSService/vsphere.local
2020-12-02T11:12:57 WARNING checkLegacy: PROBLEM FOUND: Found port 7444 in service registration URL! https://vcenter.masked.domain:7444/sso-adminserver/sdk/vsphere.local
2020-12-02T11:12:57 INFO checkLegacy: Recreating legacy SSO service registrations...
2020-12-02T11:12:59 INFO checkLegacy: Successfully recreated legacy SSO endpoints.
2020-12-02T11:12:59 INFO main: Please restart services on all PSC's and VC's when you're done.

After stalefix:

root@vcenter [ ~/lsdoctor/lsdoctor-master ]# python ./lsdoctor.py -l

ATTENTION: You are running a reporting function. This doesn't make any changes to your environment.
You can find the report and logs here: /var/log/vmware/lsdoctor

2020-12-02T11:22:17 INFO main: You are reporting on problems found across the SSO domain in the lookup service. This doesn't make changes.
2020-12-02T11:22:17 INFO live_checkCerts: Checking services for trust mismatches...
2020-12-02T11:22:17 INFO generateReport: Listing lookup service problems found in SSO domain
2020-12-02T11:22:17 ERROR generateReport: default-first-site\vcenter.masked.domain (VC 7.0 or CGW) found SSL Trust Mismatch: Please run python ls_doctor.py --trustfix option on this node.
2020-12-02T11:22:17 INFO generateReport: No issues detected in the lookup service entries for ##NO_HOSTNAME##.
2020-12-02T11:22:17 INFO generateReport: Report generated: /var/log/vmware/lsdoctor/vcenter.masked.domain-2020-12-02-112217.json

Second fix - trustfix:

root@vcenter [ ~/lsdoctor/lsdoctor-master ]# python ./lsdoctor.py --trustfix

WARNING: This script makes permanent changes. Before running, please take *OFFLINE* snapshots
of all VC's and PSC's at the SAME TIME. Failure to do so can result in PSC or VC inconsistencies.
Logs can be found here: /var/log/vmware/lsdoctor

2020-12-02T11:22:33 INFO main: You are checking for and fixing SSL trust mismatches in the local SSO site. NOTE: Please run this script one PSC or VC per SSO site.

Have you taken offline (PSCs and VCs powered down at the same time) snapshots of all nodes in the SSO domain or supported backups?[y/n]y


Provide password for administrator@vsphere.local:
2020-12-02T11:22:42 INFO __init__: Retrieved services from SSO site: Default-First-Site
2020-12-02T11:22:42 INFO findAndFix: Checking services for trust mismatches...
2020-12-02T11:22:42 INFO findAndFix: Attempting to reregister d51c3647-4896-4823-acb2-1d1cb3acb48 for vcenter.masked.domain
2020-12-02T11:22:43 INFO findAndFix: We found 1 mismatch(s) and fixed them 🙂
2020-12-02T11:22:43 INFO main: Please restart services on all PSC's and VC's when you're done.

After trustfix:

root@vcenter [ ~/lsdoctor/lsdoctor-master ]# python ./lsdoctor.py -l

ATTENTION: You are running a reporting function. This doesn't make any changes to your environment.
You can find the report and logs here: /var/log/vmware/lsdoctor

2020-12-02T11:39:16 INFO main: You are reporting on problems found across the SSO domain in the lookup service. This doesn't make changes.
2020-12-02T11:39:17 INFO live_checkCerts: Checking services for trust mismatches...
2020-12-02T11:39:17 INFO generateReport: Listing lookup service problems found in SSO domain
2020-12-02T11:39:17 INFO generateReport: No issues detected in the lookup service entries for vcenter.masked.domain (VC 7.0 or CGW).
2020-12-02T11:39:17 INFO generateReport: No issues detected in the lookup service entries for ##NO_HOSTNAME##.
2020-12-02T11:39:17 INFO generateReport: Report generated: /var/log/vmware/lsdoctor/vcenter.masked.domain-2020-12-02-113916.json

The vCLS problem is still not fixed at this point.
Third fix - rebuild:

root@vcenter [ ~/lsdoctor/lsdoctor-master ]# python ./lsdoctor.py -r

WARNING: This script makes permanent changes. Before running, please take *OFFLINE* snapshots
of all VC's and PSC's at the SAME TIME. Failure to do so can result in PSC or VC inconsistencies.
Logs can be found here: /var/log/vmware/lsdoctor

2020-12-02T13:36:22 INFO main:
You have selected the Rebuild function. This is a potentially destructive operation!
All external solutions and 3rd party plugins that register with the lookup service will
have to be re-registered. For example: SRM, vSphere Replication, NSX Manager, etc.

Have you taken offline (PSCs and VCs powered down at the same time) snapshots of all nodes in the SSO domain or supported backups?[y/n]y


Provide password for administrator@vsphere.local:
2020-12-02T13:36:27 INFO __init__: Established LS connection to vcenter.masked.domain

Version Detected
Deployment type: embedded
Version: 17004997_7.0.1.00100_vcsa
========================

0. Exit
1. Generate a template.
2. Replace all services with new services.
3. Replace individual service.
4. Restore services from backup file.

========================

Please select an action: 2

No template found for 17004997_7.0.1.00100_vcsa. Proceeding to file select.

2020-12-02T13:36:33 INFO fileSelect: Getting files from /root/lsdoctor/lsdoctor-master/templates
Please select a file:

[0] 13010631_6.7.0.30000_vcsa.json
..
[80] 16749653_7.0.0.10700_vcsa.json
..
[96] 15808842_6.5.0.32300_vcsa.json
Select number:

As you can see, the current version of lsdoctor is missing a template for build 17004997_7.0.1.00100_vcsa.
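Whether a template exists for a given build can be checked before starting the rebuild, as in the rough sketch below. It assumes template files are named `<build>_<version>_vcsa.json`, as in the listing above; the function name is hypothetical, and the file list is abbreviated to the three names visible in that listing.

```python
import re
from typing import List, Optional

# Assumed naming scheme, inferred from the listing above: <build>_<version>_vcsa.json
TEMPLATE_NAME = re.compile(r"^(?P<build>\d+)_(?P<version>[\d.]+)_vcsa\.json$")


def find_template(detected: str, template_files: List[str]) -> Optional[str]:
    """Return the template whose name matches the detected version exactly, else None."""
    wanted = detected + ".json"
    for name in template_files:
        # Skip anything that does not follow the assumed naming scheme.
        if TEMPLATE_NAME.match(name) and name == wanted:
            return name
    return None


# The three template names visible in the listing above.
templates = [
    "13010631_6.7.0.30000_vcsa.json",
    "16749653_7.0.0.10700_vcsa.json",
    "15808842_6.5.0.32300_vcsa.json",
]

print(find_template("17004997_7.0.1.00100_vcsa", templates))
print(find_template("16749653_7.0.0.10700_vcsa", templates))
```

The sketch deliberately looks for an exact match only. Nothing in this thread establishes whether rebuilding from a template for a different build is safe.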

 

 

nswilson82
Contributor

Thanks for testing - I've given the link to this thread to support so that they can see your output.

fojtp
Contributor

from eam.log:

 

2020-12-02T13:08:09.934Z | INFO | sso-0 | AcquireTokenProvider.java | 53 | [CreateSAMLToken:124579314592473] Acquiring HoK token.
2020-12-02T13:08:09.980Z | INFO | sts-0 | Workflow.java | 121 | [CreateSAMLToken:124579314592473] FAILED
com.vmware.eam.sso.exception.TokenNotAcquired: Couldn't acquire token due to: Signature validation failed
..
2020-12-02T13:08:09.983Z | WARN | sts-0 | TagsChecker.java | 157 | [FilterNotAllowedDatastores:56795461689246] Unexpected error filtering datastores by tag category names.
..
2020-12-02T13:08:09.991Z | ERROR | cluster-agent-3 | AuditedJob.java | 106 | JOB FAILED: [#45471321] DeployVmJob(ClusterAgent(ID: 'Agent:54654654-6554-5454-4747-4587215768741:null'))
com.vmware.eam.job.DeployVmJob$DeployVmJobFailure: Can't provision VM for ClusterAgent(ID: 'Agent:54654654-6554-5454-4747-4587215768741:null') due to lack of suitable datastore.

 

Does this look like a problem with STS? The STS check from https://kb.vmware.com/s/article/79248 comes back OK, though.
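To see what happens just before the "lack of suitable datastore" error, the pipe-separated eam.log lines can be split into fields and filtered. This is a minimal sketch that assumes the six-field layout visible in the excerpt above (timestamp, level, thread, source file, line number, message); the class and function names are hypothetical.

```python
from typing import Iterator, List, NamedTuple


class EamEntry(NamedTuple):
    timestamp: str
    level: str
    thread: str
    source: str
    line: int
    message: str


def parse_eam_lines(text: str) -> Iterator[EamEntry]:
    """Yield one entry per pipe-separated log line; other lines are skipped."""
    for raw in text.splitlines():
        parts = raw.split(" | ", 5)
        if len(parts) != 6 or not parts[4].isdigit():
            continue  # exception text and elided ".." lines have no fields
        ts, level, thread, source, lineno, message = parts
        yield EamEntry(ts, level.strip(), thread, source, int(lineno), message)


def problems(text: str) -> List[EamEntry]:
    """Keep WARN/ERROR entries plus any entry whose message says FAILED."""
    return [
        entry
        for entry in parse_eam_lines(text)
        if entry.level in ("WARN", "ERROR") or "FAILED" in entry.message
    ]


# Lines copied from the eam.log excerpt above.
sample = """2020-12-02T13:08:09.934Z | INFO | sso-0 | AcquireTokenProvider.java | 53 | [CreateSAMLToken:124579314592473] Acquiring HoK token.
2020-12-02T13:08:09.980Z | INFO | sts-0 | Workflow.java | 121 | [CreateSAMLToken:124579314592473] FAILED
com.vmware.eam.sso.exception.TokenNotAcquired: Couldn't acquire token due to: Signature validation failed
2020-12-02T13:08:09.983Z | WARN | sts-0 | TagsChecker.java | 157 | [FilterNotAllowedDatastores:56795461689246] Unexpected error filtering datastores by tag category names.
2020-12-02T13:08:09.991Z | ERROR | cluster-agent-3 | AuditedJob.java | 106 | JOB FAILED: [#45471321] DeployVmJob(ClusterAgent(ID: 'Agent:54654654-6554-5454-4747-4587215768741:null'))
"""

for entry in problems(sample):
    print(entry.timestamp, entry.level, entry.source, entry.message)
```

In this excerpt the failed token request is logged a few milliseconds before the datastore filtering warning and the failed deploy job, which is the ordering that prompted the STS question above.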

cfmorrell
Contributor

Same result here.  I had to run stalefix, but everything is clear after that.  Still getting the "lack of suitable datastore" error.  Here are my lsdoctor -l results:

root@vcsa1 [ ~/lsdoctor-master ]# python lsdoctor.py -l

ATTENTION: You are running a reporting function. This doesn't make any changes to your environment.
You can find the report and logs here: /var/log/vmware/lsdoctor

2020-12-02T13:17:03 INFO main: You are reporting on problems found across the SSO domain in the lookup service. This doesn't make changes.
2020-12-02T13:17:03 INFO live_checkCerts: Checking services for trust mismatches...
2020-12-02T13:17:03 INFO generateReport: Listing lookup service problems found in SSO domain
2020-12-02T13:17:03 INFO generateReport: No issues detected in the lookup service entries for vcsa1.eecs.net (VC 7.0 or CGW).
2020-12-02T13:17:03 INFO generateReport: No issues detected in the lookup service entries for ##NO_HOSTNAME##.
2020-12-02T13:17:03 INFO generateReport: Report generated: /var/log/vmware/lsdoctor/vcsa1.eecs.net-2020-12-02-131703.json

cfmorrell
Contributor

My issue is finally resolved.  Turned out that checksts.py was telling me that there weren't any issues, but there were four certificates (1 leaf and 3 roots).  I've read in a few places that there is only supposed to be one.  I ran fixsts.sh, which dropped me to 3 certs in checksts (1 root and 2 leaf certs).  After that, the vCLS machines showed up almost immediately.