7 Replies Latest reply on Aug 30, 2016 7:43 AM by leosilvapaiola

    Communication Problems - EDGE gateway and DLR

    leosilvapaiola Novice

      Hello community, I hope everyone is well.


      I’m a vSphere & NSX rookie (I have a networking and security background, so everything involving VMware is very new to me). Anyway, I have an NSX 6.1.3 deployment and I’m experiencing a communication problem between the Edge GW and the DLR.

      Diagram DLR-EDGEgw.png

      I’m creating a very simple deployment: Network-A and Network-B connected through a DLR (see picture), and everything below the DLR can ping everything else with no problem. The VMs in vDS1 ping each other and ping the VM in vDS2, and vice versa.

       

      The issue is that the “Edge GW” simply does not communicate reliably with the DLR (the ping between them is sometimes up, sometimes down), and the VMs have no connectivity to the outside world whatsoever. FYI, I have disabled all firewalls on the Edge and the DLR.

       

      So I followed a set of instructions (here) (I was redirected to this article), and a simple workaround was to restart the netcpad daemon with the following commands:

       

      ~ # /etc/init.d/netcpad stop
      ~ # /etc/init.d/netcpad start

       

      and I’m getting a very strange error when I try it:

       

      vcenter:~ # /etc/init.d/netcpad stop
      bash: /etc/init.d/netcpad: No such file or directory

       

      and then I tried (also recommended in the article above):

       

      vcenter:~ # net-vdr

      bash: net-vdr: command not found

       

      Now I’m really stuck, because I’m not sure whether I completed the “Host Installation” properly (I remember that I had to force the installation; "Force-sync" is the option I think I selected) or whether something else is wrong that I need to check.

       

      I suspect I may be typing the commands at the wrong CLI prompt: I’ve noticed that the screenshots in the workaround only show ~ # at the beginning, while mine says “vcenter”. Maybe it sounds silly, but I’ve tried every workaround I could get my hands on and I’m now doubting everything about this deployment. Again, I'm a rookie, so I think I'm allowed to ask every silly question I want (haha).

       

      Any clue or idea you can send my way would be very much appreciated.

        • 1. Re: Communication Problems - EDGE gateway and DLR
          canero Hot Shot

          During Host Preparation, EAM (ESX Agent Manager) on vCenter is used to deploy the VIB modules to the ESXi hosts. vxlan, vsip, netcpa and vsfwd are the kernel modules and user-world processes that run on the ESXi hosts. So the /etc/init.d/netcpad stop and start commands, as well as the net-vdr commands, must be run on the ESXi hosts, since vCenter does not run netcpa itself.


          http://chansblog.com/tag/uwa/

           

          UWA (netcpa and vsfwd):

          ESXi_netcpad.jpg

           

           

          Force-sync may be helpful for synchronizing NSX Manager with the NSX Edge, but it is meant as a troubleshooting tool:

          https://pubs.vmware.com/NSX-6/index.jsp#com.vmware.nsx.admin.doc/GUID-21FF2937-4CDF-491C-933E-8F44E21ED55E.html

           

          One quick check is the NSX Communication Channel Health Check (available in NSX 6.2):

          http://www.virtually-limitless.com/nsx/nsx-6-2-communication-channel-health-check/

           

          The general steps for an NSX installation are described here:

          http://dailyhypervisor.com/vmware-nsx-for-vsphere-6-1-step-by-step-installation/

           

          If that order is followed, the connection between netcpad on the ESXi hosts and the controllers should be established.

           

          Could you send the output of this command?

          /etc/init.d/netcpad status

           

          For other commands, this link may be helpful:

          http://www.vmwarearena.com/vmware-nsx-installation-part-7-verify-nsx-vibs-installation-from-esxi-hosts/

           

          If the modules are installed successfully and host preparation is complete, the ESG-DLR communication can then be checked (extent of the transport zones, ARP or MAC tables), but the control plane has to be working first.

          • 2. Re: Communication Problems - EDGE gateway and DLR
            leosilvapaiola Novice

            Hi cnrz, thank you for your time and effort.

             

            A couple of things regarding the troubleshooting.

             

            First, I was running the commands on the wrong device (barking up the wrong tree), but like I said before, I'm a rookie and I'm allowed to make silly mistakes (haha). Following some of your directions and the instructions on chanaka_ek's blog, I was able to verify the netcpa daemon; some screenshots here:

             

            init.d status & restart.PNG, instances of DLR.PNG,
            interfaces of DLR.PNG, and finally ARP table of DLR.PNG

             

            So everything here looks fine and is consistent with the deployment I'm trying to build.

             

            I also made a diagram to show exactly where I'm having the communication issues.

             

            Ping Diagnostic.png

            As you can see (from top to bottom):

             

            The Edge GW can communicate with the VMs on the inside as well as with the DLR's three interfaces, and also with the outside world (Internet). So I guess there is no problem there.

             

            Then, the DLR can ping the VMs but cannot communicate with the Edge GW through the transport zone.

             

            And last but not least, the strangest behavior is with the VMs: they can ping each other (through the DLR), the DLR's interfaces and even both of the Edge GW's interfaces, but they cannot get through the Edge GW to the outside world.

             

            Any ideas or suggestions?

             

            Thanks in advance.

            • 3. Re: Communication Problems - EDGE gateway and DLR
              canero Hot Shot

              The unsuccessful ping could be related to the firewall on the Edge Gateway or the firewall on the DLR. Could you check or disable them? (They are separate from the DFW.)

               

              Besides the firewall, it could be a routing problem or a NAT problem, in either the northbound or the southbound direction. Could you check with traceroute from the VMs to the outside? For the DLR a default gateway is mostly sufficient, while the Edge and the outside firewall may need static routes towards the VM subnets. Again, this traffic must be allowed on the DFW (if the default rule is not permit).
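To make the northbound/southbound point concrete, here is a minimal Python sketch (not NSX code) that walks hypothetical routing tables hop by hop, roughly the way a traceroute would. The router names and tables are assumptions loosely based on this thread's lab (VM subnet 10.1.20.0/24, DLR-Edge transit 10.1.100.0/24, uplink 10.6.0.0/24); `lookup` and `trace` are illustrative helpers, not part of any VMware tool.

```python
import ipaddress

N = ipaddress.ip_network

# Hypothetical routing tables. A next hop is either another router name,
# "connected" (destination is on a local interface) or "internet"
# (handed to the upstream provider, outside this model).
ROUTERS = {
    "dlr": {
        N("10.1.20.0/24"): "connected",
        N("10.1.100.0/24"): "connected",
        N("10.6.0.0/24"): "edge",
        N("0.0.0.0/0"): "edge",           # northbound default route
    },
    "edge": {
        N("10.1.100.0/24"): "connected",
        N("10.6.0.0/24"): "connected",
        N("10.1.20.0/24"): "dlr",         # southbound static route to the VMs
        N("0.0.0.0/0"): "perimeter_fw",
    },
    "perimeter_fw": {
        N("10.6.0.0/24"): "connected",
        N("0.0.0.0/0"): "internet",       # no route back to 10.1.20.0/24
    },
}


def lookup(table, dst):
    """Longest-prefix match; returns the next hop, or None if nothing matches."""
    addr = ipaddress.ip_address(dst)
    matches = [net for net in table if addr in net]
    if not matches:
        return None
    return table[max(matches, key=lambda net: net.prefixlen)]


def trace(routers, start, dst, max_hops=10):
    """Follow next hops from `start` and report where the packet ends up."""
    path = [start]
    for _ in range(max_hops):
        hop = lookup(routers[path[-1]], dst)
        if hop is None:
            return path, "no route"
        if hop in ("connected", "internet"):
            return path, hop
        path.append(hop)
    return path, "hop limit reached"
```

In this model a VM's packet leaves via dlr, edge and perimeter_fw, but `trace(ROUTERS, "perimeter_fw", "10.1.20.99")` ends on the perimeter firewall's default route, i.e. the reply never finds its way back. A static route on the outside firewall towards the VM subnets (or SNAT on the Edge) closes that gap; removing the DLR's 0.0.0.0/0 entry likewise makes the northbound trace stop with "no route".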

              • 4. Re: Communication Problems - EDGE gateway and DLR
                chanaka_ek Novice
                vExpert

                For the VMs to communicate beyond the ESG (Edge Services Gateway), you need the appropriate routes configured (unless you have dynamic routing configured) as well as the appropriate NAT rules (SNAT and DNAT), so that when you ping an external IP (say, the default gateway), the response can come back to the source. Are these in place as well?
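As a rough illustration of why SNAT lets the response come back, here is a minimal Python toy model of port-based source NAT (PAT). It is a generic sketch, not how the NSX Edge implements its SNAT rules; the uplink address 10.6.0.230 is an assumption based on the ".230" mentioned later in the thread, and the 10.1.0.0/16 inside range is hypothetical.

```python
import ipaddress

# Assumed addresses (hypothetical, not taken from the actual lab config).
EDGE_UPLINK_IP = "10.6.0.230"
INTERNAL_NET = ipaddress.ip_network("10.1.0.0/16")


class SnatEdge:
    """Toy source NAT: rewrites internal sources to the uplink IP and keeps
    the mapping so replies can be translated back to the original VM."""

    def __init__(self, public_ip, inside):
        self.public_ip = public_ip
        self.inside = inside
        self.next_port = 40000
        self.table = {}  # public port -> (private ip, private port)

    def outbound(self, src, sport, dst, dport):
        if ipaddress.ip_address(src) not in self.inside:
            return (src, sport, dst, dport)  # no SNAT rule matches: untouched
        port = self.next_port
        self.next_port += 1
        self.table[port] = (src, sport)
        return (self.public_ip, port, dst, dport)

    def inbound(self, src, sport, dst, dport):
        if dst != self.public_ip or dport not in self.table:
            return None  # no translation state: the reply is dropped
        private_ip, private_port = self.table[dport]
        return (src, sport, private_ip, private_port)
```

The external host only ever sees 10.6.0.230, an address on the uplink network, so its reply is routable without any static route towards 10.1.x.x; the Edge then uses its translation state to hand the reply back to the VM.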

                 

                Also, as one of the others has pointed out, do you have any firewall rules or distributed firewall rules set up, etc.?

                 

                Also, which versions of ESXi and NSX are you running here?

                • 5. Re: Communication Problems - EDGE gateway and DLR
                  leosilvapaiola Novice

                  Hi chanaka_ek & cnrz, thank you both for the follow-up.

                   

                  I'm sorry I didn't give this information earlier. I've checked the configuration of routes, traceroutes, firewalls, etc., and to my eyes everything looks fine. However, I have to say I haven't configured any NAT anywhere, so maybe the solution lies there.

                   

                  But I'm going to go over all the configurations mentioned above with you, hoping you may see something I don't.

                   

                  EDGE GW:

                   

                  • Firewall: I didn't disable it; I just added an "any accept" traffic rule:

                  Edge GW firewall.PNG

                  • IP route table: I'm using static routing:

                  Edge GW ip table.PNG

                  • Traceroute from the EDGE GW to the VM:

                  traceroute from EDGE gw to 10.1.20.99.PNG

                  • Traceroute & ping from the EDGE GW to the perimeter firewall:

                  traceroute from EDGE gw to 10.6.0.1 & ping to 10.6.0.1.PNG

                  As you can see, the traceroute fails but the ping succeeds.

                   

                  DLR:


                  • Firewall: "any any accept" rule.

                  DLR firewall.PNG

                  • IP route table: static routes

                  DLR ip table.PNG

                  VM:

                   

                  • tracert from the VM to the EDGE GW (to the EDGE GW's uplink interface):

                  tracert from 20.99 to EDGE gw.PNG

                  • tracert from the VM to the perimeter firewall:

                  tracert from 20.99 to EDGE gw part2.PNG

                   

                  ESXi version = 6.0.0

                  NSX Version: 6.1.3 Build 2591148

                   

                  And again, the only configuration I'm sure I haven't touched is NAT. I'll take a look at it and report back.

                   

                  Regards.

                  • 6. Re: Communication Problems - EDGE gateway and DLR
                    canero Hot Shot

                    The DLR's routing table has no default route, only a static route for 10.6.0.0. So the DLR (and therefore the VMs) can't reach anything outside beyond 10.6.0.0/24. If the VMs can also ping 10.6.0.1, and the perimeter FW has static routes for the VM subnets pointing to .230 as well as NAT, then the missing DLR default route towards the Edge (10.1.100.1) may be the issue.
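The effect of the missing default route can be shown with a longest-prefix-match lookup. This is a minimal Python sketch of the DLR table as described above; the VM subnet 10.1.20.0/24 and the /24 prefix lengths are assumptions, and `next_hop` is an illustrative helper, not an NSX command.

```python
import ipaddress

# DLR table as described in the reply: connected VM and transit networks plus
# one static route for 10.6.0.0/24 via the Edge (10.1.100.1).
# The VM subnet and prefix lengths are assumptions for illustration.
DLR_ROUTES = [
    ("10.1.20.0/24", "connected"),
    ("10.1.100.0/24", "connected"),
    ("10.6.0.0/24", "10.1.100.1"),
]


def next_hop(routes, dst):
    """Return the next hop chosen by longest-prefix match, or None."""
    addr = ipaddress.ip_address(dst)
    best = None
    for prefix, hop in routes:
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, hop)
    return best[1] if best else None


# The fix from this thread: a default route towards the Edge.
with_default = DLR_ROUTES + [("0.0.0.0/0", "10.1.100.1")]

for dst in ("10.6.0.1", "8.8.8.8"):
    print(dst, next_hop(DLR_ROUTES, dst), next_hop(with_default, dst))
# 10.6.0.1 10.1.100.1 10.1.100.1
# 8.8.8.8 None 10.1.100.1
```

A destination in 10.6.0.0/24 (such as the perimeter firewall at 10.6.0.1) matches the static route, which is consistent with the VMs reaching the Edge's interfaces, while 8.8.8.8 matches nothing until 0.0.0.0/0 via 10.1.100.1 is added.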

                    • 7. Re: Communication Problems - EDGE gateway and DLR
                      leosilvapaiola Novice

                      Hello again guys,

                       

                      And again, chanaka_ek & cnrz, thank you so much for your replies; both of you pointed me in the right direction.

                       

                      I finally got the lab up and running thanks to your pointers.

                       

                      The first issue was the missing default route from the DLR to the Edge (thanks, cnrz).

                       

                      DLR ip table with default GW.PNG

                       

                      The second was the missing source NAT (SNAT) rule on the Edge (thanks, chanaka_ek).

                       

                      EDGE GW SNAT rule.PNG

                       

                      Adding these two missing configurations together resolved all the communication issues.

                       

                      From this point on, it's time for me to explore NSX's capabilities.

                       

                      Thanks again.