Solved: ESXi host cannot communicate with NSX controllers

Floki00 · ‎01-15-2018

Hi Guys,

We have 2 clusters in the same transport zone, each with a dedicated vSphere Distributed Switch (vDS). Hosts in one cluster pass the VXLAN ping test, but hosts in the other cluster cannot ping each other or the hosts in the first cluster.

Running net-vdl2 -l shows that no controllers are connected to the faulty hosts, and running show control-cluster connection-table for the VNIs in the environment only shows the VTEPs for the working cluster. It seems the controllers don't know about the second cluster. I have restarted netcpad on the faulty hosts to no avail.

Has anyone faced this before, any ideas how to resolve this!? Quite weird error. How can I force update/sync the host information to the controllers?

Thanks in advance.

Floki00 · ‎01-31-2018

Hi Guys,

This problem is now solved. It seems VXLAN did not like VLAN "0" (untagged traffic): we tagged the VXLAN port group with a VLAN ID and the problem was solved. This used to work with VLAN "0", but not anymore.

Thanks,

Ola

cnrz · ‎01-16-2018

What is the output of the /etc/init.d/netcpad status and cat /etc/vmware/netcpa/config-by-vsm.xml commands?

Which NSX version is used?

Do DNS forward and reverse lookups resolve for the hosts, and is NTP time synchronized?

Is host preparation successful, with the same VIB modules shown as installed and working for both clusters?

Is it possible to ping from the management vmkernel interfaces to the controller IP addresses? Could a firewall be blocking port 1234, if the Cluster 2 hosts sit on a separate network from the Cluster 1 hosts?

Are there any errors in /var/log/netcpa.log on the Cluster 2 hosts?

Is the message bus between NSX Manager and the hosts working?

Floki00 · ‎01-16-2018

Hello Canero,

Thanks for the response, very much appreciated. Please find my answers below.

netcpad status is running, and config-by-vsm.xml lists all the controllers. NSX version is 6.3.1. DNS forward and reverse lookups work for both IPv4 and IPv6, and NTP is synchronised. Host preparation takes a while to complete but comes up green and successful, and all 3 VIBs are installed and loaded. Ping works from the management vmkernel interface to the controllers, no firewall is blocking port 1234, and I can see established connections to the controllers. RabbitMQ looks to be up and running on the manager and the hosts.

I have noticed my controllers don't know about the failing hosts, but the hosts know about the controllers! Weird state.

Thanks,

Ola

cnrz · ‎01-17-2018

The VTEP table for Cluster 2 hosts may be empty until there are VMs on a logical switch that spans both vDS switches through the common transport zone shared by the two clusters. Is it possible to create a logical switch in this transport zone and vMotion a VM to a Cluster 2 host? Do the controller tables then change to show the MAC and VTEP of this VM on the Cluster 2 host?

Floki00 · ‎01-17-2018

There are Edge Services Gateways (ESGs) deployed on the logical switches spanning both clusters' vDS switches.

cnrz · ‎01-17-2018

Are the ESG VMs on ESX hosts in Cluster 1 or Cluster 2? If all of them are in Cluster 1, is it possible to vMotion one ESG VM to any ESX host in Cluster 2? If there are ESG VMs on Cluster 2 and the controller cluster VTEP table is still empty for this VNI (logical switch), then it is possible that the controllers' view and the ESX hosts' view are not the same.

For this VNI, is it possible to run the following command from the NSX Manager CLI against an ESX host in Cluster 1 and another host in Cluster 2, and compare the output?

http://cloudmaniac.net/nsx-central-cli-operations-troubleshooting/

sx01-cap-z51.sddc.lab> show logical-switch host host-15 vni 10000 verbose

VXLAN Global States:

Control plane Out-Of-Sync: No --> Control plane Out-Of-Sync should be No

UDP port: 8472

VXLAN network: 10000

Multicast IP: N/A (headend replication)

Control plane: Enabled (multicast proxy,ARP proxy)

Controller: 10.51.10.72 (up) --> The controller should be in the up state

MAC entry count: 0

ARP entry count: 0

Port count: 1

VXLAN port: vdrPort

Switch port ID: 50331655

vmknic ID: 0
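When comparing this output between a Cluster 1 host and a Cluster 2 host, the two fields that matter most are the Out-Of-Sync flag and the controller state. A minimal Python sketch of that comparison follows, assuming the output has been saved as text and that the field labels look exactly like the sample above. The function names parse_vxlan_state and is_healthy are made up for illustration and are not part of any NSX tooling.

```python
import re


def parse_vxlan_state(output: str) -> dict:
    """Pull the control-plane fields out of saved 'show logical-switch host ... verbose' text."""
    state = {"out_of_sync": None, "controllers": []}
    for raw in output.splitlines():
        line = raw.strip()
        m = re.match(r"Control plane Out-Of-Sync:\s*(\w+)", line, re.IGNORECASE)
        if m:
            state["out_of_sync"] = m.group(1).lower() == "yes"
            continue
        m = re.match(r"Controller:\s*(\S+)\s*\((\w+)\)", line)
        if m:
            state["controllers"].append((m.group(1), m.group(2).lower()))
    return state


def is_healthy(state: dict) -> bool:
    """Healthy here means: not out of sync, and every listed controller is 'up'."""
    return (
        state["out_of_sync"] is False
        and bool(state["controllers"])
        and all(status == "up" for _, status in state["controllers"])
    )


# Sample text copied from the output shown in this post.
sample = """\
VXLAN Global States:
Control plane Out-Of-Sync: No
UDP port: 8472
VXLAN network: 10000
Control plane: Enabled (multicast proxy,ARP proxy)
Controller: 10.51.10.72 (up)
"""

print(is_healthy(parse_vxlan_state(sample)))
```

Running the same check on the saved output of a host from each cluster makes the difference easy to spot.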

For every logical switch, one of the 3 controllers has the master role for that VNI, so the other 2 controllers may not show the table. Was the table checked on the master controller?

What does the Communication Channel Health show between the host and the controllers?

Under Installation -> Host Preparation, select Cluster2 and choose Actions -> Communication Channel Health. It normally shows the status as Up, with a green arrow, in the Control Plane Agent to Controller column.

http://www.virtualizationblog.com/vmware-nsx-6-2-communication-channel-health/

Host and NSX Controller: heartbeats are sent every 30 seconds; if 3 iterations are lost, a sync will occur.

Are there any messages on the NSX Manager logs or system events?

https://docs.vmware.com/en/VMware-NSX-for-vSphere/6.3/com.vmware.nsx.troubleshooting.doc/GUID-6F1C02...

If the status of any of the three connections for a host changes, a message is written to the NSX Manager log. In the log message, the status of a connection can be UP, DOWN, or NOT_AVAILABLE (displayed as Unknown in vSphere Web Client). If the status changes from UP to DOWN or NOT_AVAILABLE, a warning message is generated. For example:

2016-05-23 23:36:34.736 GMT+00:00  WARN TaskFrameworkExecutor-25 VdnInventoryFacadeImpl$HostStatusChangedEventHandler:200 - Host Connection Status Changed: Event Code: 1941, Host: esx-04a.corp.local (ID: host-46), NSX Manager - Firewall Agent: UP, NSX Manager - Control Plane Agent: UP, Control Plane Agent - Controllers: DOWN.

If the status changes from DOWN or NOT_AVAILABLE to UP, an INFO message that is similar to the warning message is generated. For example:

2016-05-23 23:55:12.736 GMT+00:00  INFO TaskFrameworkExecutor-25 VdnInventoryFacadeImpl$HostStatusChangedEventHandler:200 - Host Connection Status Changed: Event Code: 1938, Host: esx-04a.corp.local (ID: host-46), NSX Manager - Firewall Agent: UP, NSX Manager - Control Plane Agent: UP, Control Plane Agent - Controllers: UP.
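To find which hosts lost the Control Plane Agent to Controllers channel, the "Host Connection Status Changed" lines can be filtered out of an exported NSX Manager log. This is a rough Python sketch, assuming the log lines follow the format of the two examples above. The regular expression and the name parse_status_change are illustrative only.

```python
import re

# Matches the tail of a "Host Connection Status Changed" log line.
LINE_RE = re.compile(
    r"Host Connection Status Changed: Event Code: (?P<code>\d+), "
    r"Host: (?P<host>\S+) \(ID: (?P<host_id>[^)]+)\), (?P<channels>.+?)\.?\s*$"
)


def parse_status_change(line: str):
    """Return event code, host and per-channel status, or None if the line does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    channels = {}
    for part in m.group("channels").split(", "):
        # Channel names contain " - ", so split on the last ": " only.
        name, _, status = part.rpartition(": ")
        channels[name] = status
    return {
        "code": int(m.group("code")),
        "host": m.group("host"),
        "host_id": m.group("host_id"),
        "channels": channels,
    }


# The WARN example quoted above.
sample_line = (
    "2016-05-23 23:36:34.736 GMT+00:00  WARN TaskFrameworkExecutor-25 "
    "VdnInventoryFacadeImpl$HostStatusChangedEventHandler:200 - "
    "Host Connection Status Changed: Event Code: 1941, Host: esx-04a.corp.local "
    "(ID: host-46), NSX Manager - Firewall Agent: UP, "
    "NSX Manager - Control Plane Agent: UP, Control Plane Agent - Controllers: DOWN."
)

record = parse_status_change(sample_line)
print(record["host"], record["channels"]["Control Plane Agent - Controllers"])
```

Any host whose "Control Plane Agent - Controllers" value is DOWN or NOT_AVAILABLE is a candidate for the netcpa checks discussed earlier in the thread.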

If the control plane channel experiences a communication fault, a system event with one of the following granular failure reasons is generated:

1255601: Incomplete Host Certificate
1255602: Incomplete Controller Certificate
1255603: SSL Handshake Failure
1255604: Connection Refused
1255605: Keep-alive Timeout
1255606: SSL Exception
1255607: Bad Message
1255620: Unknown Error
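When scanning system events in a script, the codes listed above can be kept in a small lookup table. A sketch in Python, using only the codes and reasons from the list above; the function name describe_failure is hypothetical.

```python
# Control plane channel failure reasons, as listed above.
CONTROL_PLANE_FAILURES = {
    1255601: "Incomplete Host Certificate",
    1255602: "Incomplete Controller Certificate",
    1255603: "SSL Handshake Failure",
    1255604: "Connection Refused",
    1255605: "Keep-alive Timeout",
    1255606: "SSL Exception",
    1255607: "Bad Message",
    1255620: "Unknown Error",
}


def describe_failure(code: int) -> str:
    """Translate an event code into its failure reason, if it is in the list."""
    return CONTROL_PLANE_FAILURES.get(code, f"Unrecognized event code {code}")


print(describe_failure(1255604))
```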

Also, heartbeat messages are generated from NSX Manager to hosts. A configuration full sync is triggered, if heartbeat between the NSX Manager and netcpa is lost.

Techstarts · ‎01-17-2018

Were you able to successfully deploy the controllers?

You mention you are using the same transport zone (VXLAN) for both clusters, which are on different vDS switches.

So you have used the same VLAN ID, right?

Any reason you are using different vDS switches but the same transport zone?

With Great Regards,

Floki00 · ‎01-18-2018

Hello Canero,

Thanks for the questions, much appreciated. The VMs were on one cluster, but the hosts have access to both vDS switches. I moved a VM onto the other cluster and it is still not working.

Running show logical-switch host host-15 vni 10000 verbose against hosts from both clusters shows the difference: controllers are up on the working cluster but not on the bad one.

We got fed up with the behaviour and installed NSX Manager 6.4.0, but we are experiencing a different problem now. The same cluster will not take the VXLAN configuration; it states that one host already has VXLAN installed. I have performed a reset of the network and reconfigured the host networking, deleted the vDS for the cluster and recreated it. Still getting the same problem. This makes me think this host may have been causing the issue for that cluster all along. I am rebuilding the host completely and will test whether the problem persists and update the discussion.

Thanks.

Floki00 · ‎01-18-2018

Hi Guys,

As I said, since installing NSX Manager version 6.4.0, we now have a new problem manifesting on just one host in the cluster that was not working before. Whenever we try to configure VXLAN, the error is "Can not use switch "vDS-Name" to configure VXLAN. Switch "vDS-Name" in host "Host3" is already configured for VXLAN".

I have wiped the network config on this host, rebuilt the host (re-installed ESXi 6.5.0U1) , recreated the vDS switch but the problem persists. I am perplexed!!

cnrz · ‎01-18-2018

Is this a new installation of NSX 6.4, or an upgrade of the previous NSX Manager 6.3.1?

What does Installation > Host Preparation show?

If the vDS was removed and recreated before taking the host into maintenance mode > removing the host from the prepared cluster > rebooting the host, then some configuration may be left over from the previous installation (as the error seems to indicate).

https://kb.vmware.com/s/article/2137959

If it shows VXLAN Error Unconfigured or VXLAN Transport as unconfigured these steps may be helpful(details below)

1. Unpreparation of Host (Removal of Host from NSX) and removing the host from dVS

2. Removing the Cluster from the NSX

3. Preparing the Cluster for NSX

4. Addition of Host to the dVS

5. Addition of Host to the Prepared Cluster

6. Configuring VXLAN

1. Unpreparing the host (if the problem is host related)

https://docs.vmware.com/en/VMware-NSX-for-vSphere/6.3/com.vmware.nsx.install.doc/GUID-C388C016-95FF-...

This would uninstall the previously installed VIBs and VTEPs. If the VTEPs are not removed, manual removal of the VTEPs may be necessary:

https://kb.vmware.com/s/article/2137959

An ESXi host that had been previously prepared for NSX was removed from the NSX cluster, and then added back to an NSX-prepared cluster (or different cluster), triggering a host preparation activity. When the ESXi host was removed initially, it should have triggered an un-prepare action. However, the un-prepare did not complete successfully, the host was not rebooted to complete the un-preparation, and/or the VXLAN Tunnel End Point (VTEP) vmkernel interface removal failed, which causes the host preparation task to fail when trying to re-create the VTEP, as it already exists.

Resolution

To resolve this issue, after removing the ESXi host from an NSX-prepared cluster, ensure that the VTEP vmkernel interface is removed before attempting to add it back into an NSX-prepared cluster.

To manually remove the VTEP vmkernel interface on the ESXi host:

Put the ESXi host into maintenance mode
Remove the ESXi host from the NSX prepared cluster.
Reboot the host
If the VTEP vmkernel interface is still present on the ESXi host after the reboot, manually delete it from the host networking VMkernel interface configuration.
Re-add the ESXi host to the NSX prepared cluster and allow it to be prepared.

2. Removing the cluster. The following procedure may be helpful:

https://docs.vmware.com/en/VMware-NSX-for-vSphere/6.3/com.vmware.nsx.install.doc/GUID-90BA85A9-1E3C-...

NSX - Remove clusters from Transport Zones - vCrooky

3. Preparing the host for the Cluster

https://docs.vmware.com/en/VMware-NSX-for-vSphere/6.4/com.vmware.nsx.install.doc/GUID-07ED3DD6-BF82-...

5. Addition of host to the prepared Cluster

https://docs.vmware.com/en/VMware-NSX-for-vSphere/6.4/com.vmware.nsx.install.doc/GUID-7411A51D-3FA3-...

6. Configuring VXLAN

https://docs.vmware.com/en/VMware-NSX-for-vSphere/6.4/com.vmware.nsx.install.doc/GUID-7411A51D-3FA3-...

Floki00 · ‎01-18-2018

Yes, correct, this is a new deployment of 6.4.0. I am able to perform the agent installation with no problems, but when it gets to configuring VXLAN in cluster 2, strange things begin to happen. It complains that the first host already has a VXLAN configuration. I have uninstalled the agents via NSX Manager, put the host in maintenance mode, removed the host from the cluster, removed the host from the vDS switch, restored the network settings and rebuilt the host, but it still causes the same error message.

One thing about our setup, though: we have 2 clusters, a Management/ESG cluster and a Production cluster. All ESGs and DLRs reside on the first cluster, which has its own vDS switch, and all production VMs reside in Production. Since we always add both clusters to the same transport zone, using the same subnet for the VTEPs, we can have a logical switch spanning both clusters, so VM traffic from Production can reach the DLR/ESG in Management/ESG. Hence we have hosts in both clusters connected to both vDS switches; cluster 2 has all its management uplinks on the management vDS. This has never been a problem.

I will be trying your recommendations once in the office in about 45 minutes, and will provide feedback as soon as.

Floki00 · ‎01-18-2018

Hi Techstarts,

Apologies for the late response, my answers below.

Yes for version 6.4.0 we have been able to deploy controllers, same vlan id "0" .

We have a separate ESG cluster for the North-South ingress/egress and East-West routing, hence having both clusters in the same transport zone, so the logical overlay stretches both clusters.

Floki00 · ‎01-19-2018

Hi All,

I can confirm the agent install and the VXLAN configuration have been performed on the hosts. There were no lingering VXLAN/VTEPs on the hosts after the clean-up activity, which was done as required, going via NSX Manager - Uninstall - Resolve - Maintenance Mode - Agent removal (I also checked with esxcli software vib list | grep esx-nsxv, as this is version 6.4.0).

I could not get both clusters prepared initially, as it was complaining that the second host already had a VXLAN configuration. This was because all hosts are connected to both vDS switches, so once one vDS had VXLAN configured it didn't want to do this again. I got round this by removing the hosts from one switch after the VXLAN configuration, performing the VXLAN configuration for the other vDS switch, then adding the hosts back into the vDS as before.

As before, I still cannot get the hosts in one cluster communicating over VXLAN. The first cluster (Management/ESG, with 2 hosts) does communicate: its hosts can reach each other. The second cluster (Production, with 3 hosts) does NOT: its hosts cannot communicate with each other or with the 2 hosts from Management/ESG.

I don't know where to go with this, as the underlying physical network ports are all trunked, with the same configuration in Management/ESG and Production. A normal vmkping to the VXLAN vmk does not work.

A vmkping ++netstack=vxlan -s 1470 -d -I vmk3 x.x.x.x (VTEP IP) [works OK]; it also works with a size of 1570. (Host from the second cluster pinging itself.)

A vmkping ++netstack=vxlan -s 1470 -d -I vmk3 x.x.x.x (VTEP IP of another host) [does not work]; a size of 1570 does not work either. (Host from the second cluster pinging other hosts.)
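For reference on the sizes used in these tests: the -s value is the ICMP payload, and with -d (don't fragment) the payload plus the 8-byte ICMP header and the 20-byte IPv4 header has to fit within the MTU of every hop. A small Python sketch of that arithmetic, assuming IPv4 without options; the function names are made up for illustration.

```python
IPV4_HEADER = 20  # bytes, IPv4 header without options
ICMP_HEADER = 8   # bytes, ICMP echo header


def max_ping_payload(mtu: int) -> int:
    """Largest -s value that fits in one unfragmented packet for a given MTU."""
    return mtu - IPV4_HEADER - ICMP_HEADER


def fits_without_fragmentation(payload: int, mtu: int) -> bool:
    """True if a ping with this payload can pass a path with this MTU when DF is set."""
    return payload <= max_ping_payload(mtu)


for mtu in (1500, 1600, 9000):
    print(mtu, max_ping_payload(mtu))

# The two sizes used in the tests above.
print(fits_without_fragmentation(1470, 1500))
print(fits_without_fragmentation(1570, 1500))
print(fits_without_fragmentation(1570, 1600))
```

So -s 1470 fits even a 1500-byte path, while -s 1570 only passes if the whole path carries at least 1598 bytes. Both sizes failing between hosts therefore points away from MTU and towards basic reachability of the VTEP interfaces.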

MTU on the switch is 9000 and nothing has changed. This used to work with NSX 6.3.1, but since upgrading to ESXi 6.5.0U1 the environment has been HELL!!

Floki00 · ‎01-19-2018

Just something I noticed, when running a ping test between hosts, I did a Packet capture, exported in .pcap format and it shows one of the faulty hosts pinging 224.0.0.1.

I know this to be a reserved multicast address. My transport zone is Unicast, so I'm not sure why a connectivity test using a logical switch in a Unicast transport zone is trying to reach the multicast IP 224.0.0.1.

I hope I'm missing something here and this is the approved behaviour!

cnrz · ‎01-19-2018

About the subnets and VLAN IDs used on the Edge/Mgmt and Production clusters: for the Edge cluster, vDS_Edge is used for VXLAN with VLAN 0, and for the Production cluster, vDS_Production is used for VXLAN with the same VLAN 0 (as in the previous message). Is this understanding correct?

Do both clusters use the same pool, i.e. are the IP addresses of the VTEPs on both clusters in the same subnet? If the common VLAN 0 is used for this subnet, this means untagged or native VLAN on the physical switch, which indicates an access port or a trunk with a native VLAN. Is the physical switch port configuration the same for the uplinks of both vDS switches, on both the Edge cluster ESX hosts and the Production ESX hosts? Also, is it possible to verify that the vmk3 MAC addresses are learned in the physical switch MAC table on the correct VLAN for both clusters (or whether they are learned at all)? Can the VTEPs ping their default gateways?

Also, does the vmkping between hosts work for small packets, such as 100 or 200 bytes? If small packets work for Cluster 2, then the MTU of the second cluster's vDS could be checked, although normally the MTU is set to 1600 bytes automatically during VXLAN preparation, as below:

From vCenter Server, click Home > Inventory > Networking.
Right-click the vDS and click Edit Settings.
On the Properties tab, select the Advanced option.

https://docs.vmware.com/en/VMware-NSX-for-vSphere/6.3/com.vmware.nsx.cross-vcenter-install.doc/GUID-...

The MTU for each switch must be set to 1550 or higher. By default, it is set to 1600. If the vSphere distributed switch MTU size is larger than the VXLAN MTU, the vSphere Distributed Switch MTU will not be adjusted down. If it is set to a lower value, it will be adjusted to match the VXLAN MTU. For example, if the vSphere Distributed Switch MTU is set to 2000 and you accept the default VXLAN MTU of 1600, no changes to the vSphere Distributed Switch MTU will be made. If the vSphere Distributed Switch MTU is 1500 and the VXLAN MTU is 1600, the vSphere Distributed Switch MTU will be changed to 1600.
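The adjustment rule quoted above amounts to taking the larger of the two MTU values. A short Python sketch of that rule, with the 1550 minimum and 1600 default taken from the quoted text; the function name adjusted_vds_mtu is hypothetical. The 1550 floor matches the usual VXLAN overhead of 50 bytes on top of a 1500-byte guest packet (14 inner Ethernet + 8 VXLAN + 8 UDP + 20 outer IPv4).

```python
VXLAN_MIN_MTU = 1550      # minimum stated in the quoted documentation
VXLAN_DEFAULT_MTU = 1600  # default stated in the quoted documentation


def adjusted_vds_mtu(vds_mtu: int, vxlan_mtu: int = VXLAN_DEFAULT_MTU) -> int:
    """vDS MTU after VXLAN configuration: raised to the VXLAN MTU if lower, never lowered."""
    if vxlan_mtu < VXLAN_MIN_MTU:
        raise ValueError(f"VXLAN MTU must be at least {VXLAN_MIN_MTU}")
    return max(vds_mtu, vxlan_mtu)


# The two examples from the quoted text, plus the 9000 used in this thread.
for current in (2000, 1500, 9000):
    print(current, "->", adjusted_vds_mtu(current))
```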

VTEPs have an associated VLAN ID. You can, however, specify VLAN ID = 0 for VTEPs, meaning frames will be untagged.

Vlan 0 sample configuration for Vxlan may be as below:

http://wahlnetwork.com/2014/07/07/working-nsx-configuring-vxlan-vteps/

Floki00 · ‎01-30-2018

Hi Guys,

Thanks for helping me with this, very much appreciated. We opened a call with GSS and they were equally stumped as to what the problem was; they did, however, say it was a data plane issue. We have now decided to go back to a version/revision we know works, so we are going back to NSX 6.3.1 and ESXi 6.5.0d. I will keep you updated on how that goes.

Thank you all for your support it has been hugely appreciated.

Thank you!!

tanurkov · ‎01-31-2018

HI

What kind of teaming policy do you use on the hosts in the prepared cluster?

Regards Dmitri

Floki00 · ‎01-31-2018

Using Failover_Order for NIC Teaming.

tanurkov · ‎01-31-2018

OK .

Can you please provide the output of the following, run on the ESXi host itself:

esxcli network ip connection list | grep 1234
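A plain grep for 1234 also matches local ports or IDs that happen to contain those digits, so it can help to check the foreign port and state explicitly. A rough Python sketch that does this on saved output, assuming whitespace-separated columns in the order Proto, Recv Q, Send Q, Local Address, Foreign Address, State. The sample rows and the name controller_connections are invented for illustration and are not real output from this environment.

```python
def controller_connections(output: str, port: int = 1234):
    """Return (remote_ip, state) for TCP rows whose foreign port equals the given port."""
    found = []
    for line in output.splitlines():
        fields = line.split()
        # Skip header, separator, and non-TCP rows.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        foreign, state = fields[4], fields[5]
        host, _, foreign_port = foreign.rpartition(":")
        if foreign_port == str(port):
            found.append((host, state))
    return found


# Hypothetical sample rows, for illustration only.
sample = """\
Proto  Recv Q  Send Q  Local Address      Foreign Address    State        World ID
-----  ------  ------  -----------------  -----------------  -----------  --------
tcp         0       0  10.51.10.21:51234  10.51.10.72:1234   ESTABLISHED     67890
tcp         0       0  10.51.10.21:40110  10.51.10.73:1234   SYN_SENT        67890
tcp         0       0  10.51.10.21:22     10.51.10.5:60211   ESTABLISHED     12345
"""

for ip, state in controller_connections(sample):
    print(ip, state)
```

A host with a healthy control plane would be expected to show ESTABLISHED sessions to the controllers on port 1234.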
