What is the output of /etc/init.d/netcpad status and
cat /etc/vmware/netcpa/config-by-vsm.xml commands?
Which NSX version is used?
Can forward and reverse DNS be resolved for the hosts, and is NTP time synchronized?
Is the host preparation successful, with the same VIB modules shown as installed and working on both clusters?
Is it possible to ping from the Management vmkernel interfaces to the Controller IP addresses? Could a firewall be blocking port 1234, if the Cluster2 hosts are separate from the Cluster1 hosts?
Are there any errors in /var/log/netcpa.log on the Cluster2 hosts?
Is the messaging bus between NSX Manager and the hosts working?
Thanks for response, very much appreciated, please find my answers below.
Netcpad status is running, config-by-vsm.xml shows all the controllers in there, NSX version is 6.3.1, DNS forward and reverse are available for both IPv4 and IPv6, NTP is synchronised, host preparation takes a while to complete but comes up green and successful, all 3 VIBs are installed and loaded, ping works from the management vmkernel interface to the controllers, and no firewall is blocking port 1234. I can see established connections to the controllers, and RabbitMQ looks to be up and running on the manager and hosts.
I have noticed my controllers don't know about the other failing hosts, but the hosts know about the controllers! Weird state.
The VTEP table for Cluster 2 hosts may be empty until there are VMs on a logical switch that spans both vDS switches through the common transport zone belonging to the two clusters. Is it possible to create a logical switch with this transport zone and vMotion a VM to a Cluster 2 host? Do the controller tables change to show the MAC and VTEP of this VM on the new Cluster 2 host?
There are Edge Services Gateways deployed on the logical switches spanning both cluster vDS switches.
Are the ESG VMs on ESX hosts of Cluster1 or Cluster2? If all of them are in Cluster1, is it possible to vMotion one ESG VM to an ESX host in Cluster2? If there are ESG VMs in Cluster2 and the controller cluster VTEP table is still empty for this VNI logical switch, then it is possible that the controllers' view and the ESX hosts' view are not the same.
For this VNI logical switch, is it possible to compare, on the NSX Manager CLI, the output of this command for an ESX host in Cluster1 and another host in Cluster2:
sx01-cap-z51.sddc.lab> show logical-switch host host-15 vni 10000 verbose
VXLAN Global States:
Control plane Out-Of-Sync: No --> Control plane Out-of-Sync should be No
UDP port: 8472
VXLAN network: 10000
Multicast IP: N/A (headend replication)
Control plane: Enabled (multicast proxy,ARP proxy)
Controller: 10.51.10.72 (up) --> The Controller should be in the up state
MAC entry count: 0
ARP entry count: 0
Port count: 1
VXLAN port: vdrPort
Switch port ID: 50331655
vmknic ID: 0
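To make the Cluster1 versus Cluster2 comparison less error-prone, the two command outputs can be diffed field by field. The following is only a minimal sketch: the function names are made up for illustration, and the Cluster2 sample is hypothetical (it is not real output from this environment).

```python
def parse_cli_output(text):
    """Parse 'key: value' lines, dropping any '--> note' annotations."""
    fields = {}
    for line in text.splitlines():
        line = line.split("-->")[0].strip()
        key, sep, value = line.partition(":")
        # Skip lines without a value, such as section headers.
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


def diff_hosts(good, bad):
    """Return {field: (good_value, bad_value)} for every field that differs."""
    keys = sorted(set(good) | set(bad))
    return {k: (good.get(k), bad.get(k)) for k in keys if good.get(k) != bad.get(k)}


# Shortened sample taken from the output above (working host).
CLUSTER1_SAMPLE = """VXLAN Global States:
Control plane Out-Of-Sync: No --> Control plane Out-of-Sync should be No
UDP port: 8472
VXLAN network: 10000
Controller: 10.51.10.72 (up) --> The Controller should be in the up state
"""

# Hypothetical output for a failing host, for illustration only.
CLUSTER2_SAMPLE = """VXLAN Global States:
Control plane Out-Of-Sync: No
UDP port: 8472
VXLAN network: 10000
Controller: 10.51.10.72 (down)
"""

differences = diff_hosts(parse_cli_output(CLUSTER1_SAMPLE), parse_cli_output(CLUSTER2_SAMPLE))
for field, (good_value, bad_value) in differences.items():
    print(f"{field}: cluster1={good_value!r} cluster2={bad_value!r}")
```

Pasting the real output from one host of each cluster into the two sample strings would list only the fields that differ.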
For every logical switch, one of the 3 controllers has the master role for the VNI, so the other 2 controllers may not show the table. Is the table checked on the master controller?
What does the Communication Channel Health show between the host and the controllers?
Under Installation -> Host Preparation, selecting Cluster2 and then Actions -> Communication Channel Health normally shows the status as Up with a green arrow in the Control Plane Agent to Controller column.
Host and NSX Controller: heartbeats are sent every 30 seconds; if 3 iterations are lost, a sync will occur.
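As a rough illustration of the heartbeat arithmetic, the sketch below computes how long a host can go without a heartbeat before a sync would be expected. It is a simplified model: the 30-second interval and the 3 missed iterations come from the statement above, while the function names and the exact boundary handling are assumptions.

```python
HEARTBEAT_INTERVAL_S = 30   # from the statement above
MISSED_BEFORE_SYNC = 3      # from the statement above


def seconds_until_sync(interval_s=HEARTBEAT_INTERVAL_S, missed=MISSED_BEFORE_SYNC):
    """Time without heartbeats after which a sync is expected."""
    return interval_s * missed


def sync_expected(seconds_since_last_heartbeat,
                  interval_s=HEARTBEAT_INTERVAL_S, missed=MISSED_BEFORE_SYNC):
    """True once the silence covers the configured number of missed heartbeats.

    Treating the boundary as inclusive is an assumption of this sketch.
    """
    return seconds_since_last_heartbeat >= seconds_until_sync(interval_s, missed)


print(seconds_until_sync())
```

Under these assumptions, a connectivity gap shorter than a minute and a half would not by itself trigger a sync.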
Are there any messages on the NSX Manager logs or system events?
If the status of any of the three connections for a host changes, a message is written to the NSX Manager log. In the log message, the status of a connection can be UP, DOWN, or NOT_AVAILABLE (displayed as Unknown in vSphere Web Client). If the status changes from UP to DOWN or NOT_AVAILABLE, a warning message is generated. For example:
2016-05-23 23:36:34.736 GMT+00:00 WARN TaskFrameworkExecutor-25 VdnInventoryFacadeImpl$HostStatusChangedEventHandler:200 - Host Connection Status Changed: Event Code: 1941, Host: esx-04a.corp.local (ID: host-46), NSX Manager - Firewall Agent: UP, NSX Manager - Control Plane Agent: UP, Control Plane Agent - Controllers: DOWN.
If the status changes from DOWN or NOT_AVAILABLE to UP, an INFO message that is similar to the warning message is generated. For example:
2016-05-23 23:55:12.736 GMT+00:00 INFO TaskFrameworkExecutor-25 VdnInventoryFacadeImpl$HostStatusChangedEventHandler:200 - Host Connection Status Changed: Event Code: 1938, Host: esx-04a.corp.local (ID: host-46), NSX Manager - Firewall Agent: UP, NSX Manager - Control Plane Agent: UP, Control Plane Agent - Controllers: UP.
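When scanning a large NSX Manager log for these messages, it can help to extract the per-channel status programmatically. The sketch below is one possible way to do that, assuming the messages keep the exact layout of the two examples above; the function names are hypothetical.

```python
import re

# Matches the "Host Connection Status Changed" part of the example messages.
STATUS_PATTERN = re.compile(
    r"Host Connection Status Changed: Event Code: (?P<code>\d+), "
    r"Host: (?P<host>\S+) \(ID: (?P<host_id>[^)]+)\), (?P<channels>.+?)\.?$"
)


def parse_status_change(line):
    """Return a dict describing the event, or None if the line does not match."""
    match = STATUS_PATTERN.search(line.strip())
    if match is None:
        return None
    channels = {}
    for part in match.group("channels").split(", "):
        # Channel names contain " - ", so split on the last ": " only.
        name, _, state = part.rpartition(": ")
        channels[name] = state
    return {
        "code": int(match.group("code")),
        "host": match.group("host"),
        "host_id": match.group("host_id"),
        "channels": channels,
    }


def channels_not_up(event):
    """List the channels whose status is anything other than UP."""
    return [name for name, state in event["channels"].items() if state != "UP"]


SAMPLE = (
    "2016-05-23 23:36:34.736 GMT+00:00 WARN TaskFrameworkExecutor-25 "
    "VdnInventoryFacadeImpl$HostStatusChangedEventHandler:200 - "
    "Host Connection Status Changed: Event Code: 1941, "
    "Host: esx-04a.corp.local (ID: host-46), NSX Manager - Firewall Agent: UP, "
    "NSX Manager - Control Plane Agent: UP, Control Plane Agent - Controllers: DOWN."
)

event = parse_status_change(SAMPLE)
print(event["host"], channels_not_up(event))
```

Run against the WARN example, this reports only the Control Plane Agent to Controllers channel, which is the symptom being discussed for the Cluster2 hosts.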
If the control plane channel experiences a communication fault, a system event with one of the following granular failure reasons is generated:
- 1255601: Incomplete Host Certificate
- 1255602: Incomplete Controller Certificate
- 1255603: SSL Handshake Failure
- 1255604: Connection Refused
- 1255605: Keep-alive Timeout
- 1255606: SSL Exception
- 1255607: Bad Message
- 1255620: Unknown Error
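For quick triage, the event codes listed above can be kept in a small lookup table. This is only a convenience sketch; the codes and descriptions are copied from the list, while the function name and the fallback text are made up.

```python
# Control plane channel failure reasons, copied from the list above.
FAILURE_REASONS = {
    1255601: "Incomplete Host Certificate",
    1255602: "Incomplete Controller Certificate",
    1255603: "SSL Handshake Failure",
    1255604: "Connection Refused",
    1255605: "Keep-alive Timeout",
    1255606: "SSL Exception",
    1255607: "Bad Message",
    1255620: "Unknown Error",
}


def describe_event(code):
    """Return the failure reason for a system event code, if it is in the list."""
    return FAILURE_REASONS.get(code, f"code {code} is not in the list above")


print(describe_event(1255604))
```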
Also, heartbeat messages are generated from NSX Manager to hosts. A full configuration sync is triggered if the heartbeat between NSX Manager and netcpa is lost.
Were you able to successfully deploy the controllers?
You mention you are using the same transport zone (VXLAN) for both clusters, which are on different vDS switches.
So you have used the same VLAN ID, right?
Any reason you are using different vDS switches but the same transport zone?
With Great Regards,
vExpert 2012-2017 | VCP3-5 | VCAP5-DCD | VCP-NV | vSAN Specialist | VDI | Germany
Thanks for the questions, much appreciated. The VMs were on one cluster but the hosts have access to both vDS switches. I moved a VM onto the other cluster and it is still not working.
Running show logical-switch host host-15 vni 10000 verbose on hosts from both clusters shows the difference: the controllers are up on the working cluster but not on the bad one.
We got fed up with the behaviour and installed NSX Manager 6.4.0, but we are experiencing a different problem now: the same cluster will not take the VXLAN configuration, and it states that one host already has VXLAN installed. I have performed a reset of the network and reconfigured the host networking, then deleted the vDS for the cluster and recreated it. Still getting the same problem. This makes me think this may have been the host causing the issue for that cluster all along. I am rebuilding the host completely and will test if the problem persists and update the discussion.
As I said, since installing NSX Manager version 6.4.0, we now have a new problem manifesting on just one host in the cluster that was not working before. Whenever we try to configure VXLAN, the error is "Can not use switch "vDS-Name" to configure VXLAN. Switch "vDS-Name" in host "Host3" is already configured for VXLAN".
I have wiped the network config on this host, rebuilt the host (re-installed ESXi 6.5.0U1), and recreated the vDS switch, but the problem persists. I am perplexed!!
Is this a new installation of NSX 6.4, or an upgrade of the previous NSX Manager 6.3.1?
What does Installation > Host Preparation show?
If the vDS was removed and recreated before taking the host to maintenance mode > removing the host from the prepared cluster > rebooting the host, then there could be some configuration left over from the previous installation (as the error seems to indicate).
If it shows VXLAN Error Unconfigured or VXLAN Transport as unconfigured, these steps may be helpful (details below):
1. Unpreparation of Host (Removal of Host from NSX) and removing the host from dVS
2. Removing the Cluster from the NSX
3. Preparing the Cluster for NSX
4. Addition of Host to the dVS
5. Addition of Host to the Prepared Cluster
6. Configuring VXLAN
1. Unpreparing the host
If the problem is host related, this would uninstall the previously installed VIBs and VTEPs. If the VTEPs are not removed, manual removal of the VTEPs may be necessary:
An ESXi host that had been previously prepared for NSX was removed from the NSX cluster, and then added back to an NSX-prepared cluster (or a different cluster), triggering a host preparation activity. When the ESXi host was removed initially, it should have triggered an un-prepare action. However, the un-prepare did not complete successfully, the host was not rebooted to complete the un-preparation, and/or the VXLAN Tunnel End Point (VTEP) vmkernel interface removal failed, which causes the host preparation task to fail when trying to re-create the VTEP, as it already exists.
Resolution
To resolve this issue, after removing the ESXi host from an NSX-prepared cluster, ensure that the VTEP vmkernel interface is removed before attempting to add it back into an NSX-prepared cluster.
- Put the ESXi host into maintenance mode
- Remove the ESXi host from the NSX prepared cluster.
- Reboot the host
- If the VTEP vmkernel interface is still present on the ESXi host after the reboot, manually delete it from the host networking VMkernel interface configuration.
- Re-add the ESXi host to the NSX prepared cluster and allow it to be prepared.
2. Removing the cluster as in the following procedure may be helpful:
3. Preparing the host for the Cluster
5. Addition of host to the prepared Cluster
6. Configuring VXLAN
Yes, correct, this is a new deployment of 6.4.0. I am able to perform the agent installation with no problems, but when it gets to configuring VXLAN in cluster 2, strange things begin to happen. It complains that the first host already has a VXLAN configuration. I have uninstalled the agents via NSX Manager, put the host in maintenance mode, removed the host from the cluster, removed the host from the vDS switch, restored the network settings and rebuilt the host, but it still causes the same error message.
One thing though about our setup: we have 2 clusters, a Management/ESG cluster and a Production cluster. All ESGs and DLRs reside on the first cluster with its own vDS switch, and all production VMs reside in Production. Since we always add both clusters to the same transport zone, using the same subnet for the VTEPs, we can have a logical switch spanning both clusters, so that VM traffic from Production can communicate with the DLR/ESG in Management/ESG. Hence we have hosts in both clusters connected to both vDS switches; cluster 2 has all management uplinks on the management vDS. This has never been a problem.
I will be trying your recommendations once in the office in about 45 minutes, and will provide feedback as soon as.
Apologies for the late response, my answers below.
Yes, for version 6.4.0 we have been able to deploy controllers, same VLAN ID "0".
We have a separate ESG cluster for the North-South ingress/egress and East-West routing, hence having both clusters in the same transport zone, so the logical overlay stretches both clusters.
I can confirm the agent install and the VXLAN configuration have been performed on the hosts. There were no lingering VXLAN/VTEPs on the hosts after the clean-up activity; this was done as required, going via NSX Manager - Uninstall - Resolve - Maintenance Mode - Agent removal (I also checked with esxcli software vib list | grep esx-nsxv, as this is version 6.4.0).
I could not get both clusters prepared initially, as it was complaining that the second host already had a VXLAN configuration. This was due to the fact that all hosts are connected to both vDS switches, so once one vDS had VXLAN prepared/installed it didn't want to do this again. I got round this by removing the hosts from one switch after its VXLAN configuration, then performing the VXLAN configuration for the other vDS switch, then adding the hosts back into the vDS as before.
As before, I still cannot get the hosts in one cluster communicating over VXLAN. The first cluster (Management/ESG, with 2 hosts) does communicate: its hosts can talk to each other. The second cluster (Production, with 3 hosts) will NOT talk to the hosts on the same switch: its hosts cannot communicate with each other or with the 2 hosts from Management/ESG.
I don't know where to go with this, as the underlying physical network connections are all trunked, with the same configuration in Management/ESG and Production. A normal vmkping to the vmk for VXLAN does not work.
A vmkping ++netstack=vxlan -s 1470 -d -I vmk3 x.x.x.x (vtep ip) [Works OK]. It also works with an MTU of 1570. (Host from second cluster pinging itself.)
A vmkping ++netstack=vxlan -s 1470 -d -I vmk3 x.x.x.x (vtep ip of another host) [Does not work]. It does not work with an MTU of 1570 either. (Host from second cluster pinging other hosts.)
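When reading these vmkping results it helps to remember that -s sets the ICMP payload size, so the IP packet on the wire is larger by the ICMP header (8 bytes) and the IPv4 header (20 bytes, assuming no IP options). The sketch below works through that arithmetic; the function names are made up for illustration.

```python
ICMP_HEADER_BYTES = 8    # ICMP echo header
IPV4_HEADER_BYTES = 20   # IPv4 header without options (assumption)
OVERHEAD_BYTES = ICMP_HEADER_BYTES + IPV4_HEADER_BYTES


def ip_packet_size(payload_bytes):
    """IP packet size produced by a ping with the given -s payload size."""
    return payload_bytes + OVERHEAD_BYTES


def max_payload(mtu_bytes):
    """Largest -s value that fits in one unfragmented packet for a given MTU."""
    return mtu_bytes - OVERHEAD_BYTES


def fits_without_fragmenting(payload_bytes, mtu_bytes):
    """True if a ping with -d (don't fragment) can pass a link with this MTU."""
    return ip_packet_size(payload_bytes) <= mtu_bytes


for size in (1470, 1570):
    print(size, ip_packet_size(size),
          fits_without_fragmenting(size, 1500),
          fits_without_fragmenting(size, 1600))
```

By this arithmetic a -s 1470 ping produces a 1498-byte packet that would pass even a standard 1500 MTU path, so a failure at that size points at basic connectivity rather than MTU, while -s 1570 (1598 bytes) is the test that actually exercises the 1600-byte VXLAN MTU.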
MTU on the switch is 9000 and nothing has changed. This used to work with NSX 6.3.1, but since upgrading to ESXi 6.5.0U1 the environment has been HELL!!
Just something I noticed, when running a ping test between hosts, I did a Packet capture, exported in .pcap format and it shows one of the faulty hosts pinging 126.96.36.199.
I know this to be a reserved multicast IP. My transport zone is Unicast, so I am not sure why my connectivity test using a logical switch in a Unicast transport zone is trying to access a multicast IP, 188.8.131.52.
I hope I'm missing something here and this is the approved behaviour!
About the subnets and VLAN ID used on the Edge/Mgmt and Production clusters: for the Edge cluster, vDS_Edge is used for VXLAN with VLAN 0, and for the Production cluster, vDS_Production is used for VXLAN with the same VLAN 0 (as in the previous message). Is this understanding correct?
Do both clusters use the same pool, i.e. are the IP addresses of the VTEPs on both clusters on the same subnet? If the common VLAN 0 is used for this subnet, this means the untagged or native VLAN on the physical switch, which indicates an access port or a trunk with a native VLAN. Are the physical switch port configurations the same for the Edge cluster ESX hosts and the Production ESX hosts, for the uplinks of both dVS switches? Also, is it possible to verify that the vmk3 MAC addresses are learned in the physical switch ports' MAC table on the correct VLAN for both clusters (or whether they are learned at all)? Can they ping their default gateways?
Also, does the vmkping between hosts work for small packets such as 100 bytes or 200 bytes? If such a small packet works for Cluster2, then the MTU of the second cluster's vDS could be checked, although normally during the VXLAN preparation this MTU is automatically set to 1600 bytes, as below:
- From vCenter Server, click Home > Inventory > Networking.
- Right-click the vDS and click Edit Settings.
- On the Properties tab, select the Advanced option.
The MTU for each switch must be set to 1550 or higher. By default, it is set to 1600. If the vSphere distributed
switch MTU size is larger than the VXLAN MTU, the vSphere Distributed Switch MTU will not be adjusted
down. If it is set to a lower value, it will be adjusted to match the VXLAN MTU. For example, if the vSphere
Distributed Switch MTU is set to 2000 and you accept the default VXLAN MTU of 1600, no changes to the
vSphere Distributed Switch MTU will be made. If the vSphere Distributed Switch MTU is 1500 and the
VXLAN MTU is 1600, the vSphere Distributed Switch MTU will be changed to 1600.
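The adjustment rule in the excerpt above amounts to taking the larger of the two values. The sketch below restates it in code so the two documented examples can be checked; the function name is made up, and the 1550 minimum is taken from the same excerpt.

```python
MIN_VXLAN_MTU = 1550  # "must be set to 1550 or higher", per the excerpt above


def resulting_vds_mtu(current_vds_mtu, vxlan_mtu=1600):
    """vDS MTU after VXLAN configuration: raised if too low, never lowered."""
    if vxlan_mtu < MIN_VXLAN_MTU:
        raise ValueError(f"VXLAN MTU must be {MIN_VXLAN_MTU} or higher")
    return max(current_vds_mtu, vxlan_mtu)


print(resulting_vds_mtu(2000))  # vDS MTU already larger: left unchanged
print(resulting_vds_mtu(1500))  # vDS MTU smaller: raised to the VXLAN MTU
```

With the physical switch MTU of 9000 mentioned earlier in the thread, a vDS MTU of 1600 would still be well within what the underlay can carry.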
VTEPs have an associated VLAN ID. You can, however, specify VLAN ID = 0 for VTEPs, meaning frames
will be untagged.
Vlan 0 sample configuration for Vxlan may be as below: