VMware Cloud Community
future2000
Enthusiast

VMware Cloud Foundation 3.10.1: Unable to expand cluster or create new cluster

Hi,

 

We're facing an issue with our vCF 3.10.1 environment that prevents us from expanding clusters or deploying new ones. Commissioning hosts completes successfully, but any attempt to add these hosts to an existing cluster results in the following error, which is reported approximately 15 seconds after the task starts:

 

Subtasks of Task Adding new host(s) to cluster

Validate that installed NSX-T version is supported
Error: Message: Unable to check NSX-T Version
Remediation Message:
Reference Token: F29CAT
Cause:
    Type: java.lang.RuntimeException
    Message: Unable to fetch NSX-T version.

These are brand new hosts which we have just built, and commissioning confirmed them as all good. Needless to say, no NSX-T VIBs are installed on the hosts yet, and this validation appears to be just one of many subtasks.

 

The operationsmanager.log provides absolutely no information as to why this is failing. Integration between vCF SDDC Manager and NSX-T appears fine.

 

Cheers

 

16 Replies
CyberNils
Hot Shot

Did you manage to figure this out?

I have the same problem when trying to remove a failed host from my stretched cluster in VCF 4.1.

The host is gone, so SDDC Manager can't validate anything against it. 

I just want it removed from the cluster so that I can replace it with a new one.



Nils Kristiansen
https://cybernils.net/
shank89
Expert

Try the steps in this guide. 

https://www.lab2prod.com.au/2020/12/the-unofficial-vcf-troubleshooting-guide.html

Shashank Mohan

VCIX-NV 2022 | VCP-DCV2019 | CCNP Specialist

https://lab2prod.com.au
LinkedIn https://www.linkedin.com/in/shankmohan/
Twitter @ShankMohan
Author of NSX-T Logical Routing: https://link.springer.com/book/10.1007/978-1-4842-7458-3
CyberNils
Hot Shot

Thanks, but I need the supported way to remove a failed host from a Stretched Cluster and this one does not work:

https://docs.vmware.com/en/VMware-Cloud-Foundation/4.1/com.vmware.vcf.admin.doc_41/GUID-92FD3AEE-5B5...

 



shank89
Expert

What exactly isn't working with the API method? Have you checked the logs for a reason, or are you just going by the UI?

If you need an official response, contact GSS, as the forums aren't the right channel for that. But if you are stuck in a sticky situation, I dare say it'll be a database edit.

CyberNils
Hot Shot

The first problem is that step 1 in the article I referenced is not possible on Stretched Clusters; it says so right in the GUI.

I decided to skip step 1, but then step 4 fails because it tries to verify the NSX-T version on a host which doesn't exist anymore.

This is only a lab exercise, so I haven't contacted GSS yet, as I have a limited number of support cases on my contract 🙂

I have given feedback to docs.vmware.com so hopefully they come back to me.



shank89
Expert

Have you got a host to replace the failed host with?

You can also use the SoS utility to perform stretched cluster operations.

I imagine you used this tool to stretch the cluster originally?

https://docs.vmware.com/en/VMware-Cloud-Foundation/3.9/com.vmware.vcf.admin.doc_39/GUID-92FD3AEE-5B5...

CyberNils
Hot Shot

I will look into the SOS utility. Thanks for the tip.

I used the API Explorer to stretch the cluster:

https://docs.vmware.com/en/VMware-Cloud-Foundation/4.1/com.vmware.vcf.admin.doc_41/GUID-CDEEF4C6-7DF...

In 3.x I used to do it with the SOS utility like you described, but I have never had to replace a failed host before.



future2000
Enthusiast

Hi,

 

Thanks for your reply and the great work you have done putting together the troubleshooting guide.

 

After completely decommissioning the hosts, I could still see their addresses listed in the used IP addresses section of the vMotion and vSAN networks in the vCF database. I therefore removed these entries manually from the database and tried recommissioning the hosts. This was successful, but unfortunately the problem adding these hosts to an existing cluster remained.

 

The failure occurs within a few seconds with the same task 'Adding new host(s) to cluster' showing Failed.

 

Subtasks of Task Adding new host(s) to cluster

Validate that installed NSX-T version is supported
Error: Message: Unable to check NSX-T Version
Remediation Message:
Reference Token: F29CAT
Cause:
    Type: java.lang.RuntimeException
    Message: Unable to fetch NSX-T version.

 

Interestingly, when I revisited the PostgreSQL database and ran the following command:

select * from vcf_network where type='VMOTION';

the output was formatted completely differently from when I ran the commands initially! I was simply copying and pasting from Notepad++ into PuTTY when running the commands. Either way, functionally nothing has changed. Just a bit weird.

 

 

shank89
Expert

Happy you were able to use it :).

Is this a stretched or non-stretched cluster?

future2000
Enthusiast

This is just a normal cluster in our NSX-T workload domain. Non-stretched vSAN.

shank89
Expert

And this is failing as part of adding the host to the cluster, not while commissioning the host, right?

If so, have you checked the domainmanager.log? If it is part of the commissioning process, the operationsmanager.log may give you some insight.
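
When a log is large, one way to surface candidate root causes is to filter it for lines that name a Java exception class. The following is only a rough Python sketch, not a VMware-provided tool: `find_exception_lines` is a hypothetical helper, and the regular expression is a simple heuristic that may miss or over-match some lines.

```python
import re
from typing import Iterable, List

# Heuristic: a dotted, lowercase package prefix followed by a class name
# ending in "Exception" or "Error", e.g. java.lang.RuntimeException.
EXCEPTION_RE = re.compile(r"\b(?:[a-z_][\w$]*\.)+[A-Z]\w*(?:Exception|Error)\b")


def find_exception_lines(lines: Iterable[str]) -> List[str]:
    """Return the lines that mention a fully qualified Java exception class."""
    return [line.rstrip("\n") for line in lines if EXCEPTION_RE.search(line)]


def scan_log(path: str) -> List[str]:
    """Read a log file and return only its exception lines."""
    # errors="replace" keeps the scan going if the log has odd bytes in it.
    with open(path, encoding="utf-8", errors="replace") as handle:
        return find_exception_lines(handle)
```

For example, `scan_log("/var/log/vmware/vcf/domainmanager/domainmanager.log")` could be run on a copy of the log; that path is an assumption about the default SDDC Manager layout and should be checked on the appliance.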

future2000
Enthusiast

Correct, the host commissioning is entirely successful. Adding any host to the cluster fails almost instantly. Unfortunately, nothing useful is present in either log, which is what makes this issue so frustrating. I will have to raise this with GSS.

shank89
Expert

Sounds like a plan. Have you ensured the host you are trying to add is 'clean'?

future2000
Enthusiast

They have been built from scratch manually from an ISO via the out-of-band (OOB) management interface, the same as all the other hosts in the environment.

future2000
Enthusiast

From the domainmanager.log, which I clearly hadn't been looking at carefully enough...

 

javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target

 

So, as usual, this appears to be an SSL certificate issue. I can confirm both NSX-T and SDDC Manager are still using the very same self-signed SSL certificates they have had since deployment. Both are valid and have never been changed, which makes this all the weirder.

 

https://kb.vmware.com/s/article/67030

 

The above is interesting as I did upgrade vCF from 3.8.1 to 3.10.1 relatively recently.
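
A "PKIX path building failed" error generally means the Java client's trust store has no chain to the certificate the server presents, even if that certificate is itself valid and unchanged. One way to check for a mismatch is to fingerprint the certificate NSX-T Manager actually serves and compare it with the fingerprint of whatever entry the client side trusts. The following is a minimal Python sketch under that assumption; the function names and the hostname in the usage note are hypothetical, and it does not read the SDDC Manager trust store itself.

```python
import hashlib
import ssl


def pem_sha256_fingerprint(pem_cert: str) -> str:
    """Return the SHA-256 fingerprint of a PEM certificate as colon-separated hex."""
    der = ssl.PEM_cert_to_DER_cert(pem_cert)
    digest = hashlib.sha256(der).hexdigest().upper()
    return ":".join(digest[i:i + 2] for i in range(0, len(digest), 2))


def fetch_server_fingerprint(host: str, port: int = 443) -> str:
    """Fetch the certificate a server presents and fingerprint it.

    No CA bundle is passed, so the certificate is retrieved without being
    validated, which is what is wanted for a self-signed certificate.
    """
    pem = ssl.get_server_certificate((host, port))
    return pem_sha256_fingerprint(pem)
```

For instance, `fetch_server_fingerprint("nsx-manager.example.local")` (a placeholder hostname) returns the fingerprint of the served certificate. If it differs from the fingerprint of the trusted copy, the trust store is likely stale rather than the certificate being invalid.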

future2000
Enthusiast

Unfortunately, for anyone out there who may hit this issue, the script doesn't resolve it on vCF 3.10.1. It fails with missing-parentheses errors, and then with a missing module once you correct the syntax issues. I will raise a case.
