20 Replies Latest reply on Jan 31, 2018 3:33 AM by tanurkov

    ESXi host cannot communicate with NSX controllers

    Floki00 Novice

      Hi Guys,

       

      We have 2 clusters in the same transport zone, each with a dedicated vSphere Distributed Switch. Hosts in one cluster can ping each other over VXLAN; hosts in the other cluster cannot ping each other or the hosts in the first cluster.

      Running net-vdl2 -l shows that no controllers are connected to the faulty hosts. Running show control-cluster connection-table for the VNIs in the environment shows only the VTEPs for the working cluster. It seems the controllers don't know about the second cluster. I have restarted netcpad on the faulty hosts, to no avail.

       

      Has anyone faced this before? Any ideas how to resolve it? It is quite a weird error. How can I force an update/sync of the host information to the controllers?

       

      Thanks in advance.

        • 1. Re: ESXi host cannot communicate with NSX controllers
          canero Hot Shot

          What is the output of /etc/init.d/netcpad status and

          cat /etc/vmware/netcpa/config-by-vsm.xml commands?

           

          Which NSX version is used?

          Do forward and reverse DNS lookups resolve for the hosts, and is NTP time synchronized?

          Is host preparation successful, with the same VIB modules shown as installed and working on both clusters?

          Is it possible to ping from the management vmkernel interfaces to the controller IP addresses? Could a firewall be blocking port 1234, if these hosts are on a separate network from the Cluster1 hosts?

          Are there any errors in /var/log/netcpa.log on the Cluster2 hosts?

          Is the message bus between NSX Manager and the hosts working?

          • 2. Re: ESXi host cannot communicate with NSX controllers
            Floki00 Novice

            Hello Canero,

             

            Thanks for response, very much appreciated, please find my answers below.

             

            Netcpad status is running, and config-by-vsm.xml shows all the controllers in there. NSX version is 6.3.1. DNS forward and reverse lookups are available for both IPv4 and IPv6, and NTP is synchronised. Host preparation takes a while to complete but comes up green and successful, and all 3 VIBs are installed and loaded. Ping works from the management vmkernel interface to the controllers, no firewall is blocking port 1234, and I can see established connections to the controllers. RabbitMQ looks to be up and running on the manager and the hosts.

             

            I have noticed my controllers don't know about the failing hosts, but the hosts know about the controllers! Weird state.

             

            Thanks,

             

            Ola

            • 3. Re: ESXi host cannot communicate with NSX controllers
              canero Hot Shot

              The VTEP table for Cluster 2 hosts may be empty until there are VMs on a logical switch that spans both vDS switches through the common transport zone shared by the two clusters. Is it possible to create a logical switch in this transport zone and vMotion a VM to a Cluster 2 host? Do the controller tables then change to show the MAC and VTEP of this VM on the new Cluster 2 host?

              • 4. Re: ESXi host cannot communicate with NSX controllers
                Floki00 Novice

                There are Edge Services Gateways deployed on the logical switches spanning both cluster vDS switches.

                • 5. Re: ESXi host cannot communicate with NSX controllers
                  canero Hot Shot

                  Are the ESG VMs on ESX hosts in Cluster1 or Cluster2? If all of them are in Cluster1, is it possible to vMotion one ESG VM to any ESX host in Cluster2? If there are ESG VMs on Cluster2 and the controller cluster VTEP table is still empty for this VNI logical switch, then it is possible that the controllers' view and the ESX hosts' view are not the same.

                   

                  For this VNI logical switch, is it possible to compare the output of the following NSX Manager CLI command for an ESX host in Cluster1 and another host in Cluster2?

                   

                  http://cloudmaniac.net/nsx-central-cli-operations-troubleshooting/

                   

                  sx01-cap-z51.sddc.lab> show logical-switch host host-15 vni 10000 verbose

                  VXLAN Global States:

                          Control plane Out-Of-Sync:      No --> Control plane Out-of-Sync should be No

                          UDP port:       8472

                  VXLAN network:  10000

                          Multicast IP:   N/A (headend replication)

                          Control plane:  Enabled (multicast proxy,ARP proxy)

                          Controller:     10.51.10.72 (up) --> The Controller should be up state

                          MAC entry count:        0

                          ARP entry count:        0

                          Port count:     1

                          VXLAN port:     vdrPort

                                  Switch port ID: 50331655

                                  vmknic ID:      0
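One way to make the Cluster1 versus Cluster2 comparison less error-prone is to paste both outputs into a small script and print only the fields that differ. The following is a minimal sketch, not an NSX tool: `parse_verbose` and `diff_hosts` are hypothetical helper names, and the "bad" sample is invented for illustration (it is not output from this thread).

```python
def parse_verbose(text):
    """Collect 'key: value' pairs from pasted CLI output, dropping '-->' notes."""
    fields = {}
    for line in text.splitlines():
        line = line.split("-->")[0].strip()  # strip the hand-written annotations
        if ":" not in line:
            continue
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields


def diff_hosts(good, bad):
    """Return {field: (good_value, bad_value)} for fields that differ."""
    keys = sorted(set(good) | set(bad))
    return {k: (good.get(k), bad.get(k)) for k in keys if good.get(k) != bad.get(k)}


GOOD = """
VXLAN Global States:
        Control plane Out-Of-Sync:      No --> Control plane Out-of-Sync should be No
        UDP port:       8472
VXLAN network:  10000
        Controller:     10.51.10.72 (up) --> The Controller should be up state
"""

# Invented sample of a faulty host, for illustration only.
BAD = """
VXLAN Global States:
        Control plane Out-Of-Sync:      Yes
        UDP port:       8472
VXLAN network:  10000
        Controller:     10.51.10.72 (down)
"""

differences = diff_hosts(parse_verbose(GOOD), parse_verbose(BAD))
for field, (good_value, bad_value) in differences.items():
    print(f"{field}: working host = {good_value!r}, faulty host = {bad_value!r}")
```

With real output, the fields worth watching are the ones annotated above: the out-of-sync flag and the controller state.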

                   

                  For every logical switch, one of the 3 controllers has the master role for that VNI, so the other 2 controllers may not show the table. Was the table checked on the master controller?

                   

                  What does the Communication Channel Health show between the hosts and the controllers?

                  Installation > Host Preparation > select Cluster2 > Actions > Communication Channel Health normally shows the status as Up, with a green arrow, in the Control Plane Agent to Controller column.

                   

                  http://www.virtualizationblog.com/vmware-nsx-6-2-communication-channel-health/

                  Host and NSX Controller: heartbeats are sent every 30 seconds; if 3 iterations are lost, a sync will occur.

                   

                  Are there any messages on the NSX Manager logs or system events?

                  https://docs.vmware.com/en/VMware-NSX-for-vSphere/6.3/com.vmware.nsx.troubleshooting.doc/GUID-6F1C026C-79FD-490E-BFAD-196228B39AA6.html

                  If the status of any of the three connections for a host changes, a message is written to the NSX Manager log. In the log message, the status of a connection can be UP, DOWN, or NOT_AVAILABLE (displayed as Unknown in vSphere Web Client). If the status changes from UP to DOWN or NOT_AVAILABLE, a warning message is generated. For example:

                  2016-05-23 23:36:34.736 GMT+00:00  WARN TaskFrameworkExecutor-25 VdnInventoryFacadeImpl$HostStatusChangedEventHandler:200 - Host Connection Status Changed: Event Code: 1941, Host: esx-04a.corp.local (ID: host-46), NSX Manager - Firewall Agent: UP, NSX Manager - Control Plane Agent: UP, Control Plane Agent - Controllers: DOWN.

                   

                  If the status changes from DOWN or NOT_AVAILABLE to UP, an INFO message that is similar to the warning message is generated. For example:

                  2016-05-23 23:55:12.736 GMT+00:00  INFO TaskFrameworkExecutor-25 VdnInventoryFacadeImpl$HostStatusChangedEventHandler:200 - Host Connection Status Changed: Event Code: 1938, Host: esx-04a.corp.local (ID: host-46), NSX Manager - Firewall Agent: UP, NSX Manager - Control Plane Agent: UP, Control Plane Agent - Controllers: UP.
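To find the affected hosts in a large NSX Manager log, the status-change lines can be picked apart with a regular expression. This is a sketch based only on the two example lines quoted above; it assumes the real log keeps exactly that wording, and `parse_status_change` is a hypothetical helper name.

```python
import re

# Sample line copied from the documentation excerpt above.
WARN_LINE = (
    "2016-05-23 23:36:34.736 GMT+00:00  WARN TaskFrameworkExecutor-25 "
    "VdnInventoryFacadeImpl$HostStatusChangedEventHandler:200 - "
    "Host Connection Status Changed: "
    "Event Code: 1941, Host: esx-04a.corp.local (ID: host-46), "
    "NSX Manager - Firewall Agent: UP, NSX Manager - Control Plane Agent: UP, "
    "Control Plane Agent - Controllers: DOWN."
)

PATTERN = re.compile(
    r"Host Connection Status Changed: Event Code: (?P<code>\d+), "
    r"Host: (?P<host>\S+) \(ID: (?P<host_id>[^)]+)\), "
    r"NSX Manager - Firewall Agent: (?P<firewall_agent>\w+), "
    r"NSX Manager - Control Plane Agent: (?P<control_plane_agent>\w+), "
    r"Control Plane Agent - Controllers: (?P<controllers>\w+)"
)


def parse_status_change(line):
    """Return the three channel states as a dict, or None for unrelated lines."""
    match = PATTERN.search(line)
    return match.groupdict() if match else None


status = parse_status_change(WARN_LINE)
if status and status["controllers"] != "UP":
    print(f"{status['host']} ({status['host_id']}): controllers {status['controllers']}")
```

The `\w+` groups also match NOT_AVAILABLE, so the same check covers both non-UP states.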

                   

                  If the control plane channel experiences a communication fault, a system event with one of the following granular failure reasons is generated:

                  • 1255601: Incomplete Host Certificate
                  • 1255602: Incomplete Controller Certificate
                  • 1255603: SSL Handshake Failure
                  • 1255604: Connection Refused
                  • 1255605: Keep-alive Timeout
                  • 1255606: SSL Exception
                  • 1255607: Bad Message
                  • 1255620: Unknown Error
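When scanning exported system events, a lookup table saves going back to the list each time. This is a minimal sketch that uses only the codes listed above; `describe_event` is a hypothetical helper name, and how the events are exported is left out.

```python
# Event codes and reasons exactly as listed above.
FAILURE_REASONS = {
    1255601: "Incomplete Host Certificate",
    1255602: "Incomplete Controller Certificate",
    1255603: "SSL Handshake Failure",
    1255604: "Connection Refused",
    1255605: "Keep-alive Timeout",
    1255606: "SSL Exception",
    1255607: "Bad Message",
    1255620: "Unknown Error",
}


def describe_event(code):
    """Map a system event code to its control plane failure reason."""
    return FAILURE_REASONS.get(code, f"Not a listed control plane failure code: {code}")


print(describe_event(1255604))
```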

                   

                  Also, heartbeat messages are generated from NSX Manager to hosts. A configuration full sync is triggered, if heartbeat between the NSX Manager and netcpa is lost.

                  • 6. Re: ESXi host cannot communicate with NSX controllers
                    Preetam Zare Expert

                    Were you able to successfully deploy the controllers?

                     

                    You mention you are using the same transport zone (VXLAN) for both clusters, which are on different vDS switches.

                    So you have used the same VLAN ID, right?

                     

                    Any reason you are using different vDS switches but the same transport zone?

                    With Great Regards,
                    TechS
                    vExpert 2012-2017 | VCP3-5 | VCAP5-DCD | VCP-NV | vSAN Specialist | VDI | Germany
                    • 7. Re: ESXi host cannot communicate with NSX controllers
                      Floki00 Novice

                      Hello Canero,

                       

                      Thanks for the questions, much appreciated. The VMs were on one cluster, but the hosts have access to both vDS switches. I moved a VM onto the other cluster and it is still not working.

                      Running show logical-switch host host-15 vni 10000 verbose on hosts from both clusters shows the difference: the controllers are up on the working cluster but not on the bad one.

                       

                      We got fed up with the behaviour and installed NSX Manager 6.4.0, but we are experiencing a different problem now: the same cluster will not take the VXLAN configuration, and it states that one host already has VXLAN installed. I have performed a reset of the network, reconfigured the host networking, deleted the vDS for the cluster and recreated it. Still getting the same problem. This makes me think it may have been this host causing the issue for that cluster all along. I am rebuilding the host completely, will test whether the problem persists, and will update the discussion.

                       

                      Thanks.

                      • 8. Re: ESXi host cannot communicate with NSX controllers
                        Floki00 Novice

                        Hi Guys,

                         

                        As I said, since installing NSX Manager version 6.4.0, we now have a new problem manifesting on just one host in the cluster that was not working before. Whenever we try to configure VXLAN, the error is "Can not use switch "vDS-Name" to configure VXLAN. Switch "vDS-Name" in host "Host3" is already configured for VXLAN".

                         

                        I have wiped the network config on this host, rebuilt the host (re-installed ESXi 6.5.0U1), and recreated the vDS switch, but the problem persists. I am perplexed!!

                        • 9. Re: ESXi host cannot communicate with NSX controllers
                          canero Hot Shot

                          Is this a new installation of NSX 6.4, or an upgrade of the previous NSX Manager 6.3.1?

                          What does Installation > Host Preparation show?

                           

                          If the vDS was removed and recreated before the sequence "put the host in maintenance mode > remove the host from the prepared cluster > reboot the host", then some configuration could be left over from the previous installation (as the error seems to indicate).

                          https://kb.vmware.com/s/article/2137959

                           

                          If it shows VXLAN Error Unconfigured, or VXLAN Transport as unconfigured, these steps may be helpful (details below):

                           

                          1. Unpreparation of Host (Removal of Host from NSX) and removing the host from dVS

                          2. Removing the Cluster from the NSX

                          3. Preparing the Cluster for NSX

                          4. Addition of Host to the dVS

                          5. Addition of Host to the Prepared Cluster

                          6. Configuring VXLAN

                           

                           

                           

                           

                          1. Unpreparing the host, if the problem is host related:

                          https://docs.vmware.com/en/VMware-NSX-for-vSphere/6.3/com.vmware.nsx.install.doc/GUID-C388C016-95FF-44EC-B7EB-63154BACB32A.html#GUID-C388C016-95FF-44EC-B7EB-63154BACB32A

                          This would uninstall the previously installed VIBs and VTEPS. If VTEPs are not removed, manual removal of VTEPs may be necessary:

                          https://kb.vmware.com/s/article/2137959

                           

                          An ESXi host that had been previously prepared for NSX was removed from the NSX cluster, and then added back to an NSX-prepared cluster (or a different cluster), triggering a host preparation activity. When the ESXi host was removed initially, it should have triggered an un-prepare action. However, the un-prepare did not complete successfully, the host was not rebooted to complete the un-preparation, and/or the VXLAN Tunnel End Point (VTEP) vmkernel interface removal failed, which causes the host preparation task to fail when trying to re-create the VTEP, as it already exists.

                           

                           


                          Resolution


                          To resolve this issue, after removing the ESXi host from an NSX-prepared cluster, ensure that the VTEP vmkernel interface is removed before attempting to add it back into an NSX-prepared cluster.

                           

                          To manually remove the VTEP vmkernel interface on the ESXi host:
                          1. Put the ESXi host into maintenance mode
                          2. Remove the ESXi host from the NSX prepared cluster.
                          3. Reboot the host
                          4. If the VTEP vmkernel interface is still present on the ESXi host after the reboot, manually delete it from the host networking VMkernel interface configuration.
                          5. Re-add the ESXi host to the NSX prepared cluster and allow it to be prepar

                           

                           

                          2. Removing the cluster. The following procedure may be helpful:

                           

                          https://docs.vmware.com/en/VMware-NSX-for-vSphere/6.3/com.vmware.nsx.install.doc/GUID-90BA85A9-1E3C-4BD8-8127-6BEDD8E96B54.html

                          NSX - Remove clusters from Transport Zones - vCrooky

                           

                          3. Preparing the Cluster for NSX

                          https://docs.vmware.com/en/VMware-NSX-for-vSphere/6.4/com.vmware.nsx.install.doc/GUID-07ED3DD6-BF82-4097-8702-4587FA88CFE2.html

                           

                          5. Addition of host to the prepared Cluster

                          https://docs.vmware.com/en/VMware-NSX-for-vSphere/6.4/com.vmware.nsx.install.doc/GUID-7411A51D-3FA3-407D-B93D-15455EBF17D2.html

                           

                          6. Configuring VXLAN

                          https://docs.vmware.com/en/VMware-NSX-for-vSphere/6.4/com.vmware.nsx.install.doc/GUID-7411A51D-3FA3-407D-B93D-15455EBF17D2.html

                          • 10. Re: ESXi host cannot communicate with NSX controllers
                            Floki00 Novice

                            Yes, correct, this is a new deployment of 6.4.0. I am able to perform the agent installation with no problems, but when it gets to configuring VXLAN in cluster 2, strange things begin to happen. It complains that the first host already has a VXLAN configuration. I have uninstalled the agents via NSX Manager, put the host in maintenance mode, removed the host from the cluster, removed the host from the vDS switch, restored the network settings and rebuilt the host, but it still gives the same error message.

                             

                            One thing about our setup, though: we have 2 clusters, a Management/ESG cluster and a Production cluster. All ESGs and DLRs reside on the first cluster, which has its own vDS switch, and all production VMs reside in Production. Since we always add both clusters to the same transport zone, using the same subnet for the VTEPs, a logical switch can span both clusters, so VM traffic from Production can reach the DLR/ESG in Management/ESG. Hence we have hosts in both clusters connected to both vDS switches; cluster 2 has all its management uplinks on the management vDS. This has never been a problem.

                             

                            I will be trying your recommendations once in the office in about 45 minutes, and will provide feedback as soon as.

                            • 11. Re: ESXi host cannot communicate with NSX controllers
                              Floki00 Novice

                              Hi Techstarts,

                               

                              Apologies for the late response, my answers are below.

                               

                              Yes, for version 6.4.0 we have been able to deploy the controllers. Same VLAN ID, "0".

                               

                              We have a separate ESG cluster for the North-South ingress/egress and East-West routing, hence having both clusters in the same transport zone, so the logical overlay stretches both clusters.

                              • 12. Re: ESXi host cannot communicate with NSX controllers
                                Floki00 Novice

                                Hi All,

                                 

                                I can confirm the agent install and the VXLAN configuration have been performed on the hosts. There were no lingering VXLAN/VTEPs on the hosts after the clean-up, which was done as required via NSX Manager: Uninstall - Resolve - Maintenance Mode - Agent removal. (I also checked with esxcli software vib list | grep esx-nsxv, as this is version 6.4.0.)

                                 

                                I could not get both clusters prepared initially, as it complained that the second host already had a VXLAN configuration. This was because all hosts are connected to both vDS switches, so once VXLAN had been configured on one vDS it would not do it again for the other. I got round this by removing the hosts from one switch after its VXLAN configuration, then performing the VXLAN configuration on the other vDS switch, then adding the hosts back into the vDS as before.

                                 

                                As before, I still cannot get the hosts in one cluster to communicate over VXLAN. The first cluster (Management/ESG, 2 hosts) does communicate: its hosts can reach each other. The second cluster (Production, 3 hosts) will NOT: its hosts cannot reach each other on the same switch, nor the 2 hosts in Management/ESG.

                                 

                                I don't know where to go with this, as the underlying physical network connections are all trunked, with the same configuration in Management/ESG and Production. A normal vmkping to the VXLAN vmk does not work.

                                 

                                A vmkping ++netstack=vxlan -s 1470 -d -I vmk3 x.x.x.x (own VTEP IP) [works OK]; it also works with an MTU of 1570. (Host from the second cluster pinging itself.)

                                 

                                A vmkping ++netstack=vxlan -s 1470 -d -I vmk3 x.x.x.x (VTEP IP of another host) [does not work]; it does not work with an MTU of 1570 either. (Host from the second cluster pinging other hosts.)

                                 

                                MTU on the switch is 9000 and nothing has changed. This used to work with NSX 6.3.1, but since upgrading to ESXi 6.5.0U1 the environment has been HELL!!
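For reference, the vmkping -s value is the ICMP payload size, so the packet on the wire is 28 bytes larger (20-byte IPv4 header plus 8-byte ICMP header), and -d sets the don't-fragment bit. The sketch below works through the sizes mentioned in this post, assuming IPv4 without options and reading "1570" as a -s value; the function names are illustrative only.

```python
IPV4_HEADER = 20  # bytes, IPv4 header without options
ICMP_HEADER = 8   # bytes, ICMP echo header


def ip_packet_size(payload):
    """Size of the IP packet produced by a ping with this payload (-s value)."""
    return payload + IPV4_HEADER + ICMP_HEADER


def fits_without_fragmentation(payload, mtu):
    """True if a don't-fragment ping with this payload fits the given MTU."""
    return ip_packet_size(payload) <= mtu


def max_payload(mtu):
    """Largest -s value that still fits the given MTU with the DF bit set."""
    return mtu - IPV4_HEADER - ICMP_HEADER


for payload in (1470, 1570):
    for mtu in (1500, 1600, 9000):
        verdict = "fits" if fits_without_fragmentation(payload, mtu) else "too big"
        print(f"-s {payload} -> {ip_packet_size(payload)} byte packet, MTU {mtu}: {verdict}")
```

Since a 1470-byte payload fits even a standard 1500-byte MTU, a host-to-host failure at that size suggests the problem is probably not MTU alone.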

                                • 13. Re: ESXi host cannot communicate with NSX controllers
                                  Floki00 Novice

                                  Just something I noticed, when running a ping test between hosts, I did a Packet capture, exported in .pcap format and it shows one of the faulty hosts pinging 224.0.0.1.

                                   

                                  I know this to be a reserved multicast IP. My transport zone is Unicast, so I am not sure why a connectivity test using a logical switch in a Unicast transport zone is trying to reach the multicast IP 224.0.0.1.

                                   

                                  I hope I'm missing something here and this is the approved behaviour!

                                  • 14. Re: ESXi host cannot communicate with NSX controllers
                                    canero Hot Shot

                                    About the subnets and VLAN IDs used on the Edge/Mgmt and Production clusters: for the Edge cluster, vDS_Edge is used for VXLAN with VLAN 0, and for the Production cluster, vDS_Production is used for VXLAN, again with VLAN 0 (as in the previous message). Is that understanding correct?

                                     

                                    Do both clusters use the same IP pool, i.e. are the VTEP IP addresses on both clusters in the same subnet? If the common VLAN 0 is used for this subnet, the traffic is untagged (native VLAN) on the physical switch, which implies an access port or a trunk with a native VLAN. Are the physical switch port configurations the same for the Edge cluster ESX hosts and the Production ESX hosts, for the uplinks of both vDS switches? Also, is it possible to verify that the vmk3 MAC addresses are learned in the physical switch MAC table on the correct VLAN for both clusters (or whether they are learned at all)? Can the VTEPs ping their default gateways?

                                     

                                    Also, does the vmkping between hosts work for small packets, such as 100 or 200 bytes? If small packets work for Cluster2, then the MTU of the second cluster's vDS could be checked, although normally this MTU is set to 1600 bytes automatically during VXLAN preparation. To check it:

                                    1. From vCenter Server, click Home > Inventory > Networking.
                                    2. Right-click the vDS and click Edit Settings.
                                    3. On the Properties tab, select the Advanced option.

                                     

                                    https://docs.vmware.com/en/VMware-NSX-for-vSphere/6.3/com.vmware.nsx.cross-vcenter-install.doc/GUID-49BAECC2-B800-4670-AD8C-A5292ED6BC19.html

                                    The MTU for each switch must be set to 1550 or higher. By default, it is set to 1600. If the vSphere Distributed Switch MTU size is larger than the VXLAN MTU, the vSphere Distributed Switch MTU will not be adjusted down. If it is set to a lower value, it will be adjusted to match the VXLAN MTU. For example, if the vSphere Distributed Switch MTU is set to 2000 and you accept the default VXLAN MTU of 1600, no changes to the vSphere Distributed Switch MTU will be made. If the vSphere Distributed Switch MTU is 1500 and the VXLAN MTU is 1600, the vSphere Distributed Switch MTU will be changed to 1600.

                                    VTEPs have an associated VLAN ID. You can, however, specify VLAN ID = 0 for VTEPs, meaning frames will be untagged.
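The adjustment rule quoted above amounts to taking the larger of the two MTU values. The following is a small sketch of that rule as described in the excerpt, not NSX code; `resulting_vds_mtu` is a hypothetical name, and rejecting values below 1550 is an assumption drawn from the "must be set to 1550 or higher" sentence.

```python
MIN_VXLAN_MTU = 1550      # minimum stated in the quoted documentation
DEFAULT_VXLAN_MTU = 1600  # default stated in the quoted documentation


def resulting_vds_mtu(vds_mtu, vxlan_mtu=DEFAULT_VXLAN_MTU):
    """vDS MTU after VXLAN configuration: raised if too low, never lowered."""
    if vxlan_mtu < MIN_VXLAN_MTU:
        # Assumption: treat a too-small VXLAN MTU as a configuration error.
        raise ValueError(f"VXLAN MTU must be at least {MIN_VXLAN_MTU}")
    return max(vds_mtu, vxlan_mtu)


print(resulting_vds_mtu(2000))  # documentation example: stays at 2000
print(resulting_vds_mtu(1500))  # documentation example: raised to 1600
print(resulting_vds_mtu(9000))  # the 9000 mentioned earlier in the thread stays as is
```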

                                     

                                     

                                    Vlan 0 sample configuration for Vxlan may be as below:

                                     

                                    http://wahlnetwork.com/2014/07/07/working-nsx-configuring-vxlan-vteps/
