VMware Cloud Community
SMcT
Enthusiast

VCF 4.0 NSX-T Deployment stuck in endless loop

Hi All

Currently attempting to install VCF 4.0 into my lab.

I'm stuck in an endless loop: the builder deploys the management cluster nodes, then tears them down again...

The vcf-bringup-debug.log shows entries of 'Waiting for NSX-T manager to become operational' once the VMs are online. Eventually it times out and gives the error 'NSXT_MANAGER_NON_OPERATIONAL NSX-T Manager operation status is false on 10.xx.xx.131'. It then proceeds to delete the VMs.

It's always the same IP address it mentions (10.xx.xx.131), which is node A.

For the period of time the VMs are online, I can successfully log in and ping the other nodes, vCenter and the Cloud Builder appliance, so comms look to be OK.

The logs aren't giving me anything else useful.
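
When the log is large, it can help to pull out just the lines around the two messages quoted above. A minimal Python sketch, assuming only that those message fragments appear verbatim somewhere in each relevant line; nothing else about the log format (timestamps, levels) is assumed, and the function names are made up for illustration:

```python
from collections import Counter

# Message fragments quoted in this post; the rest of the line format is unknown.
MARKERS = (
    "Waiting for NSX-T manager to become operational",
    "NSXT_MANAGER_NON_OPERATIONAL",
)


def count_markers(lines, markers=MARKERS):
    """Count how many lines contain each marker substring."""
    counts = Counter({marker: 0 for marker in markers})
    for line in lines:
        for marker in markers:
            if marker in line:
                counts[marker] += 1
    return counts


def lines_with_context(lines, marker, before=3):
    """Return each matching line together with up to `before` preceding lines."""
    lines = list(lines)
    hits = []
    for i, line in enumerate(lines):
        if marker in line:
            hits.append(lines[max(0, i - before): i + 1])
    return hits
```

To use it, open your copy of the bring-up debug log (for example with `open(path, errors="replace")`) and pass the file object to `count_markers`, or pass its lines to `lines_with_context` to see what was logged just before each failure.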

The only other thing that could be relevant is that the deployment spreadsheet shows the NSX node A IP as valid, while nodes B and C are flagged as invalid (red). I haven't been able to resolve this and assumed it might be a conditional formatting error, as the spreadsheet passes all the validation checks when I load it into the Cloud Builder appliance.

Anyone else had a similar experience, or can suggest anything?

Blog: stephanmctighe.com Twitter: @vStephanMcTighe
29 Replies
paramoyoo
Enthusiast

What about DHCP? You need it for the VTEPs.

vRon
Contributor

same here:

* I can ssh into each NSX-T Manager-VM

* each NSX-T-Manager-VM can ping everybody in the Management-VLAN

* each NSX-T-Manager-VM establishes "https"-connections to the cloud-builder vm

//on my DNS server I see the NSX-T-Manager-VMs trying to resolve the host "A? ${bundle:nsx-common-syslog:syslog.server.host}. (64)"

//looks like a variable which hasn't been set during the "Generate NSX-T Data Center Input Data"-phase

//my (unbound) DNS server won't answer this query with the syslog server IP - which would be a workaround...

//=> can't believe that this might be a showstopper
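
To see how often such an unsubstituted variable shows up in a DNS query log or a tcpdump text capture, something like the following might do. It is only a sketch: it assumes the placeholder always has the literal `${...}` shape seen in the query above, and the function names are invented for this example.

```python
import re

# A literal ${...} that was never replaced by a real value.
PLACEHOLDER = re.compile(r"\$\{[^}]*\}")


def unresolved_placeholders(text):
    """Return every ${...} placeholder found in a piece of text."""
    return PLACEHOLDER.findall(text)


def scan_lines(lines):
    """Yield (line_number, placeholder) for every placeholder in the input."""
    for number, line in enumerate(lines, start=1):
        for placeholder in unresolved_placeholders(line):
            yield number, placeholder
```

Feeding it the query shown above returns the single placeholder `${bundle:nsx-common-syslog:syslog.server.host}`; an ordinary host name returns an empty list.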

after 15 minutes of uptime, each NSX-T-Manager-VM gets shut down and rebuilt, and the loop starts again

some unsuccessful loops later, Cloud-Builder stops

So - why is the NSX-T-Manager operation status false?

//172.16.11.66 is the first NSX-T-Manager-VM, ..11.67 the second, ..11.68 the third ..11.65 would be the VIP

vRon
Contributor

the NSX-T Manager-VMs use static IPs, and they have no network interface connected to the VTEP portgroup.

...so I'm wondering why a possibly failed DHCP service in the VTEP VLAN would affect the bring-up process of the NSX-T Manager-VMs...

SMcT
Enthusiast

Using the parameters from the spreadsheet, I can successfully deploy NSX-T manually with all services starting.  I wanted to verify I wasn't using an incompatible value for something.  There must be something like a variable not passing through like you say... 

Blog: stephanmctighe.com Twitter: @vStephanMcTighe
vRon
Contributor

I've been using the "vcf-ems-deployment-parameter.xlsx" offered by the Cloud Builder VM as the basis for the deployment; it has no field for setting the syslog service.

The VCF docs (VMware Cloud Foundation Documentation) provide another "Planning and Preparation Workbook" (https://docs.vmware.com/en/VMware-Validated-Design/6.0/vmware-validated-design-60-vmware-cloud-found... ) which does contain syslog settings.

btw: vcf4.0.1 has been released https://my.vmware.com/group/vmware/downloads/details?downloadGroup=VCF401&productId=1015&rPId=48125

vRon
Contributor

VCF4.0.1

* new Excel spreadsheet (pro: better structure / con: the same content has to be migrated (copy&paste) from the old 4.0.0 spreadsheet to the new 4.0.1 one)

* new software //I've been using the original ESXi 7 release

* identical result: (almost) endless loop at NSX-Manager setup, VMs are up and running but not becoming operational.

so, a more systematic approach is needed:

* is the variable "bundle:nsx-common-syslog:syslog.server.host", which seems not to be set by the Cloud Builder, actually necessary?

  => I don't expect so, otherwise nobody would succeed using the Cloud Builder/spreadsheet combo

Since some people do succeed (or their deployment fails later in the process, for example when setting the edge size to "very small", which was possible in release 4.0.0 but is listed as a known bug), there must be a difference in the values inside the Excel spreadsheet.

I left most values at their defaults, changing just some host names (DNS is working and all DNS pre-checks are successful).

Action Plan:

* redeploy the lab

* use ESX7.0b-Patch

* catch the 15min timeframe between NSX-Manager-VM spin-up and shut-down

* log-in again via SSH into the VMs

* analyze the vm-operation (logs, tcpdump)

Any ideas, which log-files could be most interesting?
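
To avoid missing the 15-minute window, a small watcher can report the moment a freshly deployed manager VM accepts SSH connections. A minimal Python sketch, assuming SSH listens on the default port 22 and that a plain TCP connect is a good enough signal; `port_open` and `wait_for_port` are names made up for this example:

```python
import socket
import time


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_port(host, port=22, deadline_s=900.0, interval_s=5.0):
    """Poll until the port accepts connections or the deadline passes."""
    end = time.monotonic() + deadline_s
    while True:
        if port_open(host, port):
            return True
        if time.monotonic() >= end:
            return False
        time.sleep(interval_s)
```

Calling `wait_for_port("172.16.11.66")` (the first manager VM in my lab) blocks until the VM answers on port 22, or returns False after the deadline.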

SMcT
Enthusiast

Let me know if you get any further (I'll likewise let you know if I make progress).  It would be good to get to the bottom of this!  I am also using the VMUG version of the product, not sure if there is a difference to the one available via the VMware download page directly.

I also noticed the bug with the extra small sizing.  I have deployed vCenter and NSX in all sizes up to Medium to check whether there was a compatibility issue similar to the extra small.  Had the same error every time.

Blog: stephanmctighe.com Twitter: @vStephanMcTighe
paramoyoo
Enthusiast

VCF uses DHCP to configure each vmkernel port of an ESXi host used as a VTEP. Each host requires two IP addresses, one for each VTEP configured.
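
As a quick sanity check on the DHCP scope, the two-addresses-per-host rule can be turned into arithmetic. A rough Python sketch; the address range and host count in the usage note are placeholders rather than values from this thread, and the helper names are invented:

```python
import ipaddress


def teps_needed(host_count, teps_per_host=2):
    """Number of TEP addresses required: two per ESXi host by default."""
    return host_count * teps_per_host


def pool_size(first_ip, last_ip):
    """Number of addresses in an inclusive DHCP range."""
    first = int(ipaddress.ip_address(first_ip))
    last = int(ipaddress.ip_address(last_ip))
    if last < first:
        raise ValueError("last_ip must not be lower than first_ip")
    return last - first + 1


def pool_is_large_enough(first_ip, last_ip, host_count, teps_per_host=2):
    """True if the DHCP range can serve every TEP on every host."""
    return pool_size(first_ip, last_ip) >= teps_needed(host_count, teps_per_host)
```

For example, four hosts need eight leases, so a range of 192.0.2.10 to 192.0.2.17 is just big enough, with no headroom for stale leases.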

SMcT
Enthusiast

Can you elaborate on this?  The problem is the NSX manager service isn't starting.  The static IPs assigned to each node are coming online, and each node can communicate with the Cloud Builder appliance and with each other.

When deploying NSX-T manually using the same config, the manager service is starting.  I'm not using any DHCP during this.

How does DHCP come into it at this point?

Thanks in advance.

Blog: stephanmctighe.com Twitter: @vStephanMcTighe
marthneilson
Contributor

You can deploy NSX-T manually with the spreadsheet settings and all services start correctly. I wanted to make sure I wasn't using an incompatible value for something. There must be something that isn't being passed through, as you say.

charliek19
Contributor

This fixed my issue; it seems the spreadsheet isn't validated against the same schema as NSX-T.

VCF 4.0 - stuck at "Deploy and Configure NSX-T Data Center"

vRon
Contributor

I can log in to all controllers using SSH, so the password seems to be complex enough, doesn't it?

henryt_uk
Contributor

Hi SMcT and vRon. I am now observing the same NSX Manager loop during a vCF 4.0.1 bringup. Did you ever find a resolution to your issue?

Mine is not password related either (4.0.1 input spreadsheet has prevalidation for this).

Cheers.

SMcT
Enthusiast

Hi henryt

I haven't found a solution to this as of yet.  Perhaps vRon has.

Out of interest, are you using nested hosts?

Blog: stephanmctighe.com Twitter: @vStephanMcTighe
henryt_uk
Contributor

Hi - no, regular physical hosts. I found my issue that same evening - looking at the debug logs I could see the builder trying to contact the NSX manager API on the 1st node by name. Upon checking, it was a simple case of DNS records not being registered for the NSX node. Doh! After creating these, the deployment completed fine.
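
For anyone wanting to rule this out before starting a bring-up, here is a minimal Python sketch that checks forward and reverse records for each NSX node. The host names and addresses are whatever you entered in your deployment spreadsheet; `check_dns` is a made-up helper, and the resolver functions are passed in so they can be swapped out:

```python
import socket


def check_dns(fqdn, expected_ip,
              resolve=socket.gethostbyname,
              reverse=socket.gethostbyaddr):
    """Return a list of problems found for one name/IP pair (empty if OK)."""
    problems = []

    # Forward lookup: the name must resolve to the expected address.
    try:
        actual = resolve(fqdn)
    except OSError as exc:
        problems.append(f"forward lookup of {fqdn} failed: {exc}")
    else:
        if actual != expected_ip:
            problems.append(f"{fqdn} resolves to {actual}, expected {expected_ip}")

    # Reverse lookup: the address must map back to the same name.
    try:
        name = reverse(expected_ip)[0]
    except OSError as exc:
        problems.append(f"reverse lookup of {expected_ip} failed: {exc}")
    else:
        if name.rstrip(".").lower() != fqdn.rstrip(".").lower():
            problems.append(f"{expected_ip} maps to {name}, expected {fqdn}")

    return problems
```

Run it once per NSX manager node and for the cluster VIP; any non-empty result is worth fixing before kicking off Cloud Builder again.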

joejay
Contributor

Hi,

Did you ever find a solution to your problem?  I am having exactly the same problem, I've spent hours trying to troubleshoot and can't seem to find a solution.  Just curious to see if you were able to get around this.  I get the same loop.  Here's what I've seen it do:

1. Bring up process successfully deploys the NSX-T Cluster Members

2. Powers on the VMs

3. Once the VMs are powered on, I see it try to vMotion vCenter for some reason

4. Get a failure that it can't migrate vCenter "in its current state"

5. NSX-T VMs start powering off

6. NSX-T VMs get deleted

7. NSX-T VMs get re-deployed

8. Whole process starts again.

This repeats 3 or 4 times before Cloud Builder shows the failure, which is the same failure message you're getting.

Thanks,

Jay

shank89
Expert

In lab environments I have also seen this caused by latency: the manager does not come up fully before it is torn down and retried, until the workflow eventually fails.

This link may be useful to you as it talks about some tweaks you can perform in your lab environment to hopefully get around these issues.

VCF 4 Workload Domain  

Mind you, this article is about the workload domain, but you can make similar tweaks for the management domain. One hack is to shut down (or pause) the Cloud Builder VM just after the manager nodes get deployed, then restart it after a while, once the manager services are good to go.
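
To decide when the services are "good to go", you could poll the manager's cluster status instead of guessing. The Python sketch below is only an outline under stated assumptions: it expects the JSON returned by the NSX-T `GET /api/v1/cluster/status` call to carry `mgmt_cluster_status.status` and `control_cluster_status.status` fields with the value `STABLE` when healthy (check the API reference for your version), and `fetch_status` is a hypothetical callable you supply that performs the authenticated request and returns the parsed JSON as a dict.

```python
import time


def cluster_is_stable(status):
    """True if both management and control cluster report STABLE (assumed field names)."""
    mgmt = (status.get("mgmt_cluster_status") or {}).get("status")
    ctrl = (status.get("control_cluster_status") or {}).get("status")
    return mgmt == "STABLE" and ctrl == "STABLE"


def wait_until_stable(fetch_status, deadline_s=1800.0, interval_s=30.0,
                      sleep=time.sleep):
    """Call fetch_status() repeatedly until the cluster is stable or time runs out."""
    end = time.monotonic() + deadline_s
    while True:
        try:
            if cluster_is_stable(fetch_status()):
                return True
        except (OSError, ValueError):
            # API not reachable yet, or the response was not valid JSON.
            pass
        if time.monotonic() >= end:
            return False
        sleep(interval_s)
```

Once `wait_until_stable` returns True, power Cloud Builder back on (or unpause it) and let the workflow retry.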

Shashank Mohan

VCIX-NV 2022 | VCP-DCV2019 | CCNP Specialist

https://lab2prod.com.au
LinkedIn https://www.linkedin.com/in/shankmohan/
Twitter @ShankMohan
Author of NSX-T Logical Routing: https://link.springer.com/book/10.1007/978-1-4842-7458-3
raymondred506
Contributor

VCF uses DHCP to configure each vmkernel port of an ESXi host used as a VTEP. Each host therefore requires two IP addresses, one for each VTEP configured.
