VMware Networking Community
chanaka_ek
Enthusiast

NSX - Ping issues between DLR and Edge gateway on the transit network

Hi,

I've deployed NSX in a POC environment and am having some weird issues. I've deployed a distributed logical router (DLR) with 2 internal interfaces (connected to the app & web network segments) and an uplink interface connected to a transit network (192.168.10.0/29). I've also deployed an Edge Services Gateway with an internal interface connected to the same transit network (192.168.10.0/29) and an uplink interface connected to the outside world.

The issue is that when I log on to the Edge Services Gateway with PuTTY and ping the DLR's uplink interface on its transit network IP address (192.168.10.2), I don't get a response. The firewall is set to accept all traffic on both the DLR and the Edge.

Does anyone have any ideas? Note that the DLR has its default gateway configured, pointing at the Edge gateway's IP on the transit network (as this is the only northbound connection the DLR has).

Cheers

Attached is a rough drawing of the topology. Ping fails from 192.168.10.1 to 192.168.10.2

1 Solution

Accepted Solutions
chanaka_ek
Enthusiast

FYI - this turned out to be an issue with NSX 6.1.2 which is fixed in 6.1.3. There is no KB for the issue as yet, nor is it mentioned as fixed in the 6.1.3 release notes, but a VMware GSS engineer confirmed that it's fixed.

The temporary workaround is to stop and start the netcpa daemon on the ESXi hosts of the compute & edge cluster.
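For anyone wanting to script that workaround, here is a minimal sketch to run in the shell of each affected ESXi host. It assumes the agent's init script is at /etc/init.d/netcpad, which is where I'd expect it on NSX 6.x hosts but is not stated anywhere in this thread, so check the path on your build first. The function name restart_netcpa is just something made up for the example.

```shell
#!/bin/sh
# Sketch: stop and start the netcpa agent on ONE ESXi host, then show its status.
# ASSUMPTION: the init script is /etc/init.d/netcpad (override via NETCPA_INIT).
NETCPA_INIT="${NETCPA_INIT:-/etc/init.d/netcpad}"

restart_netcpa() {
    # Bail out early if the init script is not present on this host.
    if ! command -v "$NETCPA_INIT" >/dev/null 2>&1; then
        echo "netcpa init script not found: $NETCPA_INIT" >&2
        return 1
    fi
    "$NETCPA_INIT" stop
    "$NETCPA_INIT" start
    "$NETCPA_INIT" status
}
```

Call restart_netcpa on each host in turn rather than on all of them at once, so you can re-test the ping after every restart.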

See more details on my blog: http://chansblog.com/nsx-6-1-2-bug-dlr-interface-communication-issues-how-to-troubleshoot-using-net-...


3 Replies
grosas
Community Manager

Hi chanaka_ek

The DLR gateway configuration shouldn't matter for this type of connectivity. Your intra-Logical Switch communication, though, is totally dependent on VXLAN functioning correctly.

Do you have control over the hosts in your POC? Is the PGW01 Edge Gateway in the same cluster as your DLR? If so, you can place the VMs together on the same host to isolate the issue. If you can achieve reachability that way, you may have an issue at the upstream switch or at the VTEP.

Are you able to test VTEP to VTEP successfully between all hosts? It sounds to me at a glance like the VTEP function may not be working wherever you have PGW01 deployed. I would first review the ESXi hosts where your PGW01 Edge is deployed to ensure the VXLAN configuration is healthy. You could also deploy a tiny VM to the transit network for easier troubleshooting.
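One way to run the VTEP to VTEP test from an ESXi host shell is vmkping over the VXLAN netstack with the don't-fragment bit set. The sketch below assumes a 1600-byte transport MTU (hence the 1572-byte payload: 1600 minus 20 bytes of IP header and 8 of ICMP); adjust PAYLOAD if your MTU differs. The function name check_vteps is only illustrative, and you supply the remote VTEP IPs yourself.

```shell
#!/bin/sh
# Sketch: ping each remote VTEP through the vxlan netstack with DF set.
# ASSUMPTION: transport MTU is 1600, so a 1572-byte ICMP payload should pass.
VMKPING="${VMKPING:-vmkping}"
PAYLOAD="${PAYLOAD:-1572}"

check_vteps() {
    failed=0
    for ip in "$@"; do
        # -d: set don't-fragment, -s: payload size, -c: number of pings
        if "$VMKPING" ++netstack=vxlan -d -s "$PAYLOAD" -c 3 "$ip" >/dev/null 2>&1; then
            echo "OK   $ip"
        else
            echo "FAIL $ip"
            failed=1
        fi
    done
    # Non-zero exit status if any VTEP was unreachable.
    return "$failed"
}
```

Run it on every host as check_vteps followed by the VTEP IPs of the other hosts. A FAIL only from the host carrying PGW01 would point at that host's VTEP or its upstream switch port.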

What do you see for the following:

esxcli network vswitch dvs vmware vxlan list (are you seeing your VXLAN VMKNIC counted?)

esxcli network vswitch dvs vmware vxlan network mapping list --vds-name [vdsname] --vxlan-id [vxlan-id for your transit network]

As long as all the components (DLR interface and Edge interface in this case) are correctly connected to the same Logical Switch with the FW off, you should be focusing on VXLAN functionality for each ESXi host subscribed to that Logical Switch.
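If you need to collect the same output from several hosts, the two esxcli commands above can be wrapped in a small function. This is only a convenience sketch: vxlan_report is a made-up name, and the VDS name and VXLAN ID are whatever applies to your transit network.

```shell
#!/bin/sh
# Sketch: run both VXLAN checks from this thread for one VDS / VXLAN ID pair.
ESXCLI="${ESXCLI:-esxcli}"

vxlan_report() {
    if [ "$#" -ne 2 ]; then
        echo "usage: vxlan_report <vds-name> <vxlan-id>" >&2
        return 2
    fi
    echo "== VXLAN config per VDS =="
    "$ESXCLI" network vswitch dvs vmware vxlan list
    echo "== Mapping for VXLAN $2 on $1 =="
    "$ESXCLI" network vswitch dvs vmware vxlan network mapping list \
        --vds-name "$1" --vxlan-id "$2"
}
```

Running it on each host subscribed to the Logical Switch and comparing the output side by side makes a host with a missing VMKNIC or an empty mapping easier to spot.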

_____________________________________
Gabe Rosas (VMware HCX team at VMware)
Blog: hcx.design
LinkedIn: /in/gaberosas
Twitter: gabe_rosas
chanaka_ek
Enthusiast

Hi Grosas,

Thanks for the reply....

Yes, all the components are in the same compute & Edge cluster and VXLAN communication is fine between all hosts (VTEP and VXLAN comms were fine).

I logged a call with GSS in the end, and one of the engineers found that the netcpa service on each host was somewhat buggered in that it didn't have the correct information about the logical router instances. A restart of the netcpa service seems to have re-established the connection to the controller node (RabbitMQ service on port 5671), and the hosts are now up to date with the configuration spec. I can also now ping the previously unpingable IPs.
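A rough way to confirm from the host side that the connection is back is to filter the host's connection table for that port. The sketch below uses esxcli network ip connection list and defaults to port 5671 only because that is the port mentioned above; pass a different port if your control-plane traffic uses another one. It also assumes addresses are printed as ip:port, and show_port_connections is a made-up name.

```shell
#!/bin/sh
# Sketch: list this host's connections on a given TCP port (default 5671).
# ASSUMPTION: addresses appear as ip:port in the connection list output.
ESXCLI="${ESXCLI:-esxcli}"

show_port_connections() {
    port="${1:-5671}"
    # Match ":<port>" only when not followed by another digit.
    if ! "$ESXCLI" network ip connection list | grep -E ":${port}([^0-9]|\$)"; then
        echo "no connections found on port $port" >&2
        return 1
    fi
}
```

Run it before and after restarting netcpa; an ESTABLISHED entry appearing only afterwards would be consistent with what the GSS engineer saw.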

They are doing a root cause analysis to see why the netcpa service went funny and will get back to me. I will post an update once they do.

Thanks for your help

Cheers

Chan
