VMware Cloud Community
SMcT
Enthusiast

VCF 4.0 NSX-T Deployment stuck in endless loop

Hi All

Currently attempting to install VCF 4.0 into my lab.

I'm stuck in an endless loop: the builder deploys the management cluster nodes, then tears them down again.

The vcf-bringup-debug.log shows entries of 'Waiting for NSX-T manager to become operational' once the VMs are online. Eventually it times out with the error 'NSXT_MANAGER_NON_OPERATIONAL NSX-T Manager operation status is false on 10.xx.xx.131'. It then proceeds to delete the VMs.
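
To see how the wait and the failure line up in time, the two messages quoted above can be pulled out of the log with their line numbers. This is only a minimal sketch: the function name is made up for illustration, and the default path is an assumption taken from a later reply in this thread.

```shell
# Hypothetical helper: list the NSX-T wait and failure messages in a bring-up log.
# The default path is an assumption; pass another path as the first argument.
nsxt_wait_events() {
    log="${1:-/opt/vmware/bringup/logs/vcf-bringup-debug.log}"
    # -n prefixes each match with its line number in the log.
    grep -n -E 'Waiting for NSX-T manager|NSXT_MANAGER_NON_OPERATIONAL' "$log"
}

# Usage on the Cloud Builder appliance:
#   nsxt_wait_events | tail -n 20
```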

It's always the same IP address (10.xx.xx.131), that of node A, that it mentions.

For the period of time the VMs are online, I can successfully log in and ping the other nodes, vCenter and the Cloud Builder appliance, so comms look to be OK.

The logs aren't giving me anything else useful.

The only other thing that could be relevant is that the deployment spreadsheet shows the NSX node A IP as valid, while nodes B and C are flagged as invalid (red). I haven't been able to resolve this and assumed it might be a conditional formatting error, since the spreadsheet passes all the validation checks when I load it into the Cloud Builder appliance.

Anyone else had a similar experience, or can suggest anything?

Blog: stephanmctighe.com Twitter: @vStephanMcTighe
29 Replies
toffaha1
Enthusiast

Hi,

I encountered the same issue, and it seems to have been storage latency: it worked fine after I changed the datastore RAID level in my nested lab.

BR,

Muhammad Toffaha

Technical Consultant

@vtoffaha
sv1984
Contributor

Hi Muhammad,

Can you please elaborate on the exact change that you made? I am using VCF 4.3.1, also in a nested lab.

Thanks!

shank89
Expert

Can you describe your lab and the hardware you are using?

He likely meant changing the underlying datastore RAID level to 0. Storage latency has a huge impact in nested labs, closely followed by memory and CPU resources.
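
If you want a rough number for that latency, one option is to time small synchronous writes from a Linux VM that sits on the datastore in question (the jump host, for example). The sketch below is an assumption-laden illustration, not a benchmark: the function name is hypothetical, and it relies on GNU dd (`oflag=dsync`) and GNU date (`%N`), so it will not run in the ESXi shell itself.

```shell
# Rough sketch: average latency of 4 KiB synchronous writes, in microseconds.
# Assumes GNU dd and GNU date; run it inside a Linux VM on the datastore under test.
sync_write_latency_us() {
    dir="${1:-.}"; count="${2:-200}"
    testfile="$dir/latency-test.$$"
    start="$(date +%s%N)"
    # oflag=dsync forces every block to reach storage before the next write starts.
    dd if=/dev/zero of="$testfile" bs=4k count="$count" oflag=dsync 2>/dev/null \
        || { rm -f "$testfile"; return 1; }
    end="$(date +%s%N)"
    rm -f "$testfile"
    # Elapsed nanoseconds divided by the write count, converted to microseconds.
    echo $(( (end - start) / count / 1000 ))
}

# Usage:
#   sync_write_latency_us /var/tmp 500
```

Comparing the figure before and after a storage change (such as the RAID level mentioned above) is more meaningful than the absolute value.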

Shashank Mohan

VCIX-NV 2022 | VCP-DCV2019 | CCNP Specialist

https://lab2prod.com.au
LinkedIn https://www.linkedin.com/in/shankmohan/
Twitter @ShankMohan
Author of NSX-T Logical Routing: https://link.springer.com/book/10.1007/978-1-4842-7458-3
sv1984
Contributor

Thanks for your reply Shashank!

Here are the host hardware specs.

sv1984_0-1632848635651.png

I'm running four nested ESXi nodes, Cloud Builder, a VyOS router and a jump host as VMs.

sv1984_2-1632848822912.png

sv1984_1-1632848719639.png

sv1984
Contributor

I also see the following error in the bringup logs, which seems to be the root cause of the failure: "Error occurred while getting certificate chain for 'nsx01b.tmelab.local'". Any idea how to resolve this?

sv1984_0-1632849110834.png

tenthirtyam
VMware Employee

Does the password used for the NSX Manager admin pass a cracklib-check?
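
For anyone unsure how to run that check: `cracklib-check` reads candidate passwords from standard input and prints `<password>: OK` or a reason for rejection. A minimal sketch of wrapping it follows; the function name, the `CRACKLIB_CHECK` override variable and the `NSX_ADMIN_PASSWORD` variable in the usage comment are all made up for illustration, and the tool has to be installed on the machine where you run it.

```shell
# Hypothetical helper: succeed when cracklib-check rates the password on stdin as OK.
# CRACKLIB_CHECK is an assumed override variable for when the binary lives elsewhere.
passes_cracklib() {
    result="$("${CRACKLIB_CHECK:-cracklib-check}")" || return 2
    case "$result" in
        *": OK") return 0 ;;
        # Print only the reason after the last ": ", not the password itself.
        *) printf '%s\n' "${result##*: }" >&2; return 1 ;;
    esac
}

# Usage:
#   printf '%s\n' "$NSX_ADMIN_PASSWORD" | passes_cracklib && echo "password accepted"
```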

sv1984
Contributor

Hi Ryan,

Yes, it does pass the check.

VasanthanB
Contributor

Is there a solution for the reported issue?

AbbedSedkaoui
Enthusiast

There is a timeout of 1,200,000 ms (20 min), visible in /opt/vmware/bringup/logs/vcf-bringup-debug.log, while bring-up waits for all the services in the NSX cluster to be up and stable.

  1. Each ESXi host should have at least 46 GB, or the installation of the NSX bits will fail later.
  2. In our lab we use the small form factor for NSX (4 vCPUs and 16 GB of RAM), but the official minimum from the API standpoint is medium (6 vCPUs and 24 GB respectively).
    • To get around that when deploying as small, stop the NSX VM(s) and give each of them at least 6 vCPUs (= 20000 MHz) and 20 GB.
    • Extend the timeout that waits for NSX to be up, depending on your storage. I gave it 180 min, just in case the installation of the NSX bits fails.
    • Extend the timeout of the overall OVA deployment, which defaults to 40 min. I gave it 180.

```
sed -i 's/ovf.deployment.timeout.period.in.minutes=40/ovf.deployment.timeout.period.in.minutes=180/' /opt/vmware/bringup/webapps/bringup-app/conf/application.properties
sed -i -e 's/nsxt.disable.certificate.validation=true/nsxt.disable.certificate.validation=true\nnsxt.manager.wait.minutes=180/' /opt/vmware/bringup/webapps/bringup-app/conf/application.properties
systemctl restart vcf-bringup
```

I know 180 min is probably overkill, since the cluster might need only 25 min, but it leaves time for troubleshooting if things don't go as expected.
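
One caveat with the sed approach: the first command only matches while the value is still 40, and running the second one again appends another `nsxt.manager.wait.minutes` line each time. A re-runnable alternative could look like the sketch below. It assumes GNU sed (`-i`), and `set_property` is a hypothetical helper name, not something shipped with Cloud Builder.

```shell
# Hypothetical helper: set key=value in a Java-style properties file.
# Replaces the line if the key exists, appends it otherwise, so re-runs are safe.
# Assumes GNU sed and values without '/' or '&' (plain numbers here).
set_property() {
    file="$1"; key="$2"; value="$3"
    # Escape dots so the key is matched literally by grep and sed.
    pattern="$(printf '%s' "$key" | sed 's/\./\\./g')"
    if grep -q "^${pattern}=" "$file"; then
        sed -i "s/^${pattern}=.*/${key}=${value}/" "$file"
    else
        printf '%s=%s\n' "$key" "$value" >> "$file"
    fi
}

# Usage (keep a backup of the original file first):
#   conf=/opt/vmware/bringup/webapps/bringup-app/conf/application.properties
#   cp -p "$conf" "$conf.bak"
#   set_property "$conf" ovf.deployment.timeout.period.in.minutes 180
#   set_property "$conf" nsxt.manager.wait.minutes 180
#   systemctl restart vcf-bringup
```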

HuyMai_SVT
Contributor

Hi guys,

I deployed VCF in my lab and faced the same issue. After I raised the deployment timeout from 40 to 180 minutes, it worked normally.

#sed -i 's/ovf.deployment.timeout.period.in.minutes=40/ovf.deployment.timeout.period.in.minutes=180/' /opt/vmware/bringup/webapps/bringup-app/conf/application.properties
#sed -i -e's/nsxt.disable.certificate.validation=true/nsxt.disable.certificate.validation=true\nnsxt.manager.wait.minutes=180/' /opt/vmware/bringup/webapps/bringup-app/conf/application.properties
#systemctl restart vcf-bringup
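
Since sed exits quietly when its pattern does not match, it may be worth confirming that both values actually landed in the file before retrying the bring-up. A small sketch, with `get_property` being a hypothetical helper name:

```shell
# Hypothetical helper: print the current value of one key from a properties file.
# Prints nothing and returns non-zero when the key is absent.
get_property() {
    file="$1"; key="$2"
    # Split on '=', compare the key exactly, keep the last definition if duplicated.
    awk -F= -v k="$key" '
        $1 == k { v = substr($0, length(k) + 2); found = 1 }
        END { if (found) print v; exit !found }
    ' "$file"
}

# Usage:
#   conf=/opt/vmware/bringup/webapps/bringup-app/conf/application.properties
#   get_property "$conf" ovf.deployment.timeout.period.in.minutes
#   get_property "$conf" nsxt.manager.wait.minutes
```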

 
