Currently attempting to install VCF 4.0 into my lab.
I'm stuck in an endless loop: the builder deploys the management cluster nodes, then tears them down again.
The vcf-bringup-debug.log shows entries of 'Waiting for NSX-T manager to become operational' once the VMs are online. Eventually it times out with the error 'NSXT_MANAGER_NON_OPERATIONAL NSX-T Manager operation status is false on 10.xx.xx.131'. It then proceeds to delete the VMs.
It's always the same IP address (10.xx.xx.131), that of node A, that it mentions.
While the VMs are online, I can successfully log in and ping the other nodes, vCenter and the Cloud Builder appliance, so comms look to be OK.
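Ping only proves reachability, not that the NSX services are up. Here is a rough sketch for polling the cluster state yourself while the VMs are still alive. `wait_until` and `nsx_cluster_stable` are helper names made up for this example, `NSX_HOST` and `NSX_PASS` are placeholders you set yourself, and the `grep` on the `/api/v1/cluster/status` response is a crude check that succeeds as soon as any status field reads STABLE, so treat it as a starting point only.

```shell
# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds or TIMEOUT_SECONDS have elapsed.
wait_until() {
  wu_timeout=$1
  wu_interval=$2
  shift 2
  wu_deadline=$(( $(date +%s) + wu_timeout ))
  while :; do
    if "$@"; then
      return 0
    fi
    if [ "$(date +%s)" -ge "$wu_deadline" ]; then
      return 1
    fi
    sleep "$wu_interval"
  done
}

# Hypothetical check: succeeds once the status response contains "STABLE".
# -k skips certificate verification, which is only acceptable in a lab.
nsx_cluster_stable() {
  curl -sk -u "admin:${NSX_PASS}" "https://${NSX_HOST}/api/v1/cluster/status" \
    | grep -q '"STABLE"'
}

# Example (not run automatically): poll every 30 s for up to 20 minutes.
# NSX_HOST=10.xx.xx.131 NSX_PASS='...' wait_until 1200 30 nsx_cluster_stable
```

If this never succeeds within the window, the NSX CLI command `get cluster status` on node A should show which service group is lagging.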
The logs aren't giving me anything else I am finding useful.
The only other thing that could be relevant is that the deployment spreadsheet shows the NSX node A IP as valid, while nodes B and C are flagged as invalid (red). I haven't been able to resolve this and assumed it might be a conditional formatting error, since the spreadsheet passes all the validation checks when I load it into the Cloud Builder appliance.
Anyone else had a similar experience, or can suggest anything?
I encountered the same issue, and it seems to have been storage latency: it worked fine after I changed the datastore RAID level in my nested lab.
Can you please elaborate on the exact change that you made? I am using VCF 4.3.1, also in a nested lab.
Can you describe your lab and the hardware you are using?
They likely meant changing the underlying datastore RAID level to 0. Storage latency has a huge impact in nested labs, closely followed by memory and CPU resources.
Thanks for your reply Shashank!
Here are the host hardware specs.
I'm running 4 nested ESXi nodes, with Cloud Builder, a VyOS router and a jump host running as VMs.
I also see the following error in the bringup logs, which seems to be the root cause of the failure: "Error occurred while getting certificate chain for 'nsx01b.tmelab.local'". Any idea how to resolve this?
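One thing worth trying for the certificate chain error is to look at what the node actually presents on port 443. This is just a sketch: `count_pem_certs` is a helper name invented here, and it only tells you whether a chain is being served at all, not why bringup failed to read it.

```shell
# Count PEM certificates on stdin (one per BEGIN CERTIFICATE marker).
# grep -c exits non-zero when the count is 0, so "|| true" keeps the
# function's exit status clean while still printing 0.
count_pem_certs() {
  grep -c -- '-----BEGIN CERTIFICATE-----' || true
}

# Example (not run automatically), from Cloud Builder or the jump host:
# openssl s_client -connect nsx01b.tmelab.local:443 -showcerts </dev/null 2>/dev/null \
#   | count_pem_certs
```

A result of 0 would suggest the HTTPS service on that node is not answering yet, which would fit the timeout theory rather than a certificate problem as such.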
Does the password used for the NSX Manager admin pass a cracklib-check?
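For anyone who hasn't run that check before: `cracklib-check` reads candidate passwords from stdin and prints `<candidate>: OK` or a rejection reason. A small sketch, where `cracklib_line_ok` is a helper name invented for this example and the password shown is only a placeholder:

```shell
# Interpret one line of cracklib-check output.
# A line ending in ": OK" means the candidate was accepted.
cracklib_line_ok() {
  case "$1" in
    *": OK") return 0 ;;
    *)       return 1 ;;
  esac
}

# Only runs if cracklib-check is installed on this machine.
# Note that the tool echoes the candidate password back in its output.
if command -v cracklib-check >/dev/null 2>&1; then
  result=$(printf '%s\n' 'Example-Passw0rd!' | cracklib-check)
  if cracklib_line_ok "$result"; then
    echo "accepted"
  else
    echo "rejected: $result"
  fi
fi
```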
Yes it does pass the check.
Is there any solution to the reported issue?
There is a timeout of 1,200,000 ms (20 min) visible in /opt/vmware/bringup/logs/vcf-bringup-debug.log. It is the wait for all the services in the NSX cluster to be up and stable.
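To sanity-check the arithmetic and find the relevant entries, something like the following should do. `ms_to_minutes` is just a throwaway helper for this example, and the grep pattern is the message quoted earlier in this thread.

```shell
# Convert a millisecond timeout, as printed in the log, to whole minutes.
ms_to_minutes() {
  echo $(( $1 / 60000 ))
}

ms_to_minutes 1200000   # prints 20

# Example (not run automatically), on the Cloud Builder appliance:
# grep -n 'Waiting for NSX-T manager to become operational' \
#   /opt/vmware/bringup/logs/vcf-bringup-debug.log
```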
````
# Raise the OVF deployment timeout from 40 to 180 minutes
sed -i 's/ovf.deployment.timeout.period.in.minutes=40/ovf.deployment.timeout.period.in.minutes=180/' /opt/vmware/bringup/webapps/bringup-app/conf/application.properties
# Append nsxt.manager.wait.minutes=180 on a new line after the certificate validation setting
sed -i -e 's/nsxt.disable.certificate.validation=true/nsxt.disable.certificate.validation=true\nnsxt.manager.wait.minutes=180/' /opt/vmware/bringup/webapps/bringup-app/conf/application.properties
# Restart the bringup service so the new values are picked up
systemctl restart vcf-bringup
````
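One caveat with those sed lines: the second one adds another `nsxt.manager.wait.minutes` line every time it is run, and does nothing if the certificate validation line is missing. Below is a sketch of a re-runnable alternative. `set_prop` is a helper invented for this example, it assumes GNU sed and simple values such as numbers, and the demo works on a scratch file rather than the real application.properties.

```shell
# set_prop FILE KEY VALUE
# Replace an existing "KEY=..." line, or append one if the key is absent.
set_prop() {
  sp_file=$1
  sp_key=$2
  sp_val=$3
  # Escape the dots in the key so they match literally in the regex.
  sp_re=$(printf '%s' "$sp_key" | sed 's/[.]/\\./g')
  if grep -q "^${sp_re}=" "$sp_file"; then
    sed -i "s|^${sp_re}=.*|${sp_key}=${sp_val}|" "$sp_file"
  else
    printf '%s\n' "${sp_key}=${sp_val}" >> "$sp_file"
  fi
}

# Demo on a scratch file with the two lines the original commands rely on.
demo=$(mktemp)
printf '%s\n' 'ovf.deployment.timeout.period.in.minutes=40' \
              'nsxt.disable.certificate.validation=true' > "$demo"
set_prop "$demo" ovf.deployment.timeout.period.in.minutes 180
set_prop "$demo" nsxt.manager.wait.minutes 180
set_prop "$demo" nsxt.manager.wait.minutes 180   # re-run adds no duplicate
cat "$demo"
rm -f "$demo"
```

On the appliance you would first keep a copy of the original with `cp -p`, then point `set_prop` at /opt/vmware/bringup/webapps/bringup-app/conf/application.properties and restart vcf-bringup as above.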
I know 180 min is probably overkill since the cluster might only need 25 min, but it leaves room for troubleshooting if things don't go as expected.
I deployed VCF in my lab and faced the same issue, but after changing the deployment timeout from 40 to 180 minutes it works normally.
#sed -i 's/ovf.deployment.timeout.period.in.minutes=40/ovf.deployment.timeout.period.in.minutes=180/' /opt/vmware/bringup/webapps/bringup-app/conf/application.properties
#sed -i -e's/nsxt.disable.certificate.validation=true/nsxt.disable.certificate.validation=true\nnsxt.manager.wait.minutes=180/' /opt/vmware/bringup/webapps/bringup-app/conf/application.properties
#systemctl restart vcf-bringup