The DLR gateway configuration shouldn't matter for this type of connectivity. Your intra-Logical Switch communication, though, is totally dependent on VXLAN functioning correctly.
Do you have control over the hosts in your POC? Is the PGW01 Edge Gateway in the same cluster as your DLR? If so, you can place the VMs together on the same host to isolate the issue. If you can achieve reachability that way, you may have an issue at the upstream switch or at the VTEP.
Are you able to test VTEP to VTEP successfully between all hosts? It sounds to me at a glance like the VTEP function may not be working wherever you have PGW01 deployed. I would first review the ESXi hosts where your PGW Edge is deployed to ensure the VXLAN configuration is healthy. You could also deploy a tiny VM to the transit network for easier troubleshooting.
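If it helps, here is a rough sketch of how the VTEP to VTEP test could be run from the ESXi shell with vmkping over the vxlan netstack. The IPs in VTEP_IPS are placeholders (not from your environment), and the 1572-byte payload assumes your transport network MTU is 1600. By default it only prints the commands; set RUN=1 to actually execute them on a host.

```shell
#!/bin/sh
# Sketch only: VTEP_IPS holds hypothetical placeholder addresses; replace them
# with the VXLAN vmknic IPs of the other hosts in the transport zone.
VTEP_IPS="${VTEP_IPS:-192.0.2.11 192.0.2.12}"
# Assumption: transport MTU is 1600, so 1572 bytes of payload plus 28 bytes
# of IP/ICMP headers should pass with the don't-fragment bit set (-d).
PAYLOAD="${PAYLOAD:-1572}"

vtep_ping_cmd() {
    # Print the vmkping command for one remote VTEP IP.
    echo "vmkping ++netstack=vxlan -d -s $PAYLOAD -c 3 $1"
}

for ip in $VTEP_IPS; do
    cmd=$(vtep_ping_cmd "$ip")
    if [ "${RUN:-0}" = "1" ]; then
        # Execute on the ESXi host and flag any VTEP that does not answer.
        $cmd || echo "FAILED: $ip"
    else
        # Dry run: just show what would be executed.
        echo "$cmd"
    fi
done
```

If the large, unfragmented ping fails but a plain vmkping without -d and -s works, that would point to an MTU problem on the transport network rather than at the VTEP itself.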
What do you see for the following:
esxcli network vswitch dvs vmware vxlan list (are you seeing your VXLAN VMKNIC counted?)
esxcli network vswitch dvs vmware vxlan network mapping list --vds-name [vdsname] --vxlan-id [vxlan-id for your transit network]
As long as all the components (the DLR interface and the Edge interface in this case) are correctly connected to the same Logical Switch with the firewall off, you should be focusing on VXLAN functionality for each ESXi host subscribed to that Logical Switch.
Thanks for the reply....
Yes, all the components are in the same compute & Edge cluster, and both VTEP and VXLAN communication are fine between all hosts.
I logged a call with GSS in the end, and one of the engineers found that the netcpa service on each host was in a bad state: it didn't have the correct information about the logical router instances. A restart of the netcpa service seems to have re-established the connection back to the controller node (rabbit MQ service on port 5671), and the hosts are now up to date with the configuration spec. I can also now ping the previously unpingable IPs.
They are doing a root cause analysis to see why the netcpa service misbehaved and will get back to me. I will post an update once they do.
Thanks for your help
FYI - this turned out to be an issue with NSX 6.1.2 which is fixed in 6.1.3. There is no KB for the issue as of yet, nor is it mentioned as fixed in the 6.1.3 release notes, but the VMware GSS engineer confirmed that it's fixed.
The temporary workaround is to stop and start the netcpa daemon on the ESXi hosts of the compute & Edge cluster.
See more details on my blog: http://chansblog.com/nsx-6-1-2-bug-dlr-interface-communication-issues-how-to-troubleshoot-using-net-vdr-command/
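For anyone hitting the same thing, the workaround above can be sketched roughly as follows from the ESXi shell. This assumes the netcpa init script lives at /etc/init.d/netcpad on your build (check before relying on it). The net-vdr call at the end lists the logical router instances the host knows about, which is what was missing in my case. The script only prints the steps unless you set RUN=1.

```shell
#!/bin/sh
# Sketch only: assumes the netcpa init script is /etc/init.d/netcpad.
NETCPA_INIT="${NETCPA_INIT:-/etc/init.d/netcpad}"

netcpa_steps() {
    # One command per line: check, stop, start, re-check, then list the
    # logical router (VDR) instances known to this host.
    echo "$NETCPA_INIT status"
    echo "$NETCPA_INIT stop"
    echo "$NETCPA_INIT start"
    echo "$NETCPA_INIT status"
    echo "net-vdr --instance -l"
}

netcpa_steps | while read -r step; do
    if [ "${RUN:-0}" = "1" ]; then
        echo "### $step"
        # stdin is redirected so the command cannot consume the step list.
        $step </dev/null
    else
        # Dry run: just show what would be executed.
        echo "$step"
    fi
done
```

Run it on each ESXi host in the compute & Edge cluster. If the final net-vdr output still shows no instances after the restart, the problem is probably something other than this bug.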