I have the following scenario:
Site 1: NSX Edge with L2VPN Client
Site 2: Edge in vCloud Director dedicated to L2VPN Server (no internet access, only to extend VLAN)
Egress Optimization Gateway Address is not enabled on either site.
When the L2VPN tunnel is brought up, the VLAN is extended without problems. However, some VMs at Site 1 begin to lose connectivity to the internet, and when I check the ARP table of an affected VM, the gateway IP address is mapped to the MAC of the L2VPN Edge.
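A quick way to confirm the symptom on an affected VM is to compare the MAC learned for the gateway IP with the MAC of the physical gateway. The following is only a minimal sketch, assuming a Linux VM where `ip neigh show` is available; the gateway IP 192.168.2.1 and all MAC addresses are hypothetical placeholders, not values from this environment.

```python
import re

# Matches one line of Linux `ip neigh show` output, for example:
# "192.168.2.1 dev eth0 lladdr 00:50:56:aa:bb:cc REACHABLE"
NEIGH_RE = re.compile(r"^(\S+)\s+dev\s+\S+\s+lladdr\s+([0-9a-fA-F:]{17})\b")


def parse_ip_neigh(text):
    """Return a dict of IP -> lowercase MAC for entries that have a MAC."""
    table = {}
    for line in text.splitlines():
        match = NEIGH_RE.match(line.strip())
        if match:
            table[match.group(1)] = match.group(2).lower()
    return table


def gateway_mac_matches(table, gateway_ip, expected_mac):
    """True if the VM has learned the expected MAC for the gateway IP."""
    return table.get(gateway_ip) == expected_mac.lower()


# Hypothetical sample output pasted from an affected VM.
sample = """\
192.168.2.1 dev eth0 lladdr 00:50:56:AA:BB:CC REACHABLE
192.168.2.50 dev eth0 lladdr 00:50:56:11:22:33 STALE
192.168.2.77 dev eth0  FAILED
"""

table = parse_ip_neigh(sample)
# Hypothetical MAC of the physical gateway; a mismatch means the VM
# learned the gateway IP from some other device (such as the Edge).
print(gateway_mac_matches(table, "192.168.2.1", "00:50:56:de:ad:01"))
```

If the check returns False while the tunnel is up and True after the tunnel is torn down and the ARP cache is cleared, that points to something across the tunnel answering ARP for the gateway IP.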
VMs at Site 2 communicate with VMs at Site 1 without problems.
Why might VMs at Site 1 recognize Site 2 Edge as a gateway?
I am not a networking expert, but here is the output of the show ip route command on each Edge.
The 10.40.2.0/24 network is the Edge's Uplink network
The 192.168.2.0/24 network is the subinterface that you create in vCloud to extend the customer network.
I was reading a little more about this. Because Local Egress Optimization is not enabled on either site, the L2VPN will capture all ARP packets and forward them to the server it is connected to.
Local egress should be enabled so that traffic is routed locally to the internet instead of going to the secondary site. All other connectivity within the same L2 segment will still be bridged to the NSX-V Edge Server.
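To make the idea concrete, here is a toy model of that decision. It is not how NSX is implemented, only a sketch under the assumption that ARP requests for an egress-optimized gateway IP stay local while everything else is bridged across the tunnel. The function name and the IP addresses are hypothetical.

```python
from ipaddress import ip_address


def bridge_arp_over_tunnel(arp_target_ip, egress_optimized_ips):
    """Toy model: should an ARP request for arp_target_ip cross the tunnel?

    Assumption: requests for an egress-optimized gateway IP are kept
    local; requests for any other IP are bridged to the remote site.
    """
    target = ip_address(arp_target_ip)
    local_only = {ip_address(ip) for ip in egress_optimized_ips}
    return target not in local_only


# No egress optimization: the ARP request for the gateway (hypothetical
# 192.168.2.1) crosses the tunnel, so a remote device holding that IP
# could answer it.
print(bridge_arp_over_tunnel("192.168.2.1", []))

# Egress optimization set to the gateway IP: the gateway ARP stays
# local, while ARP for an ordinary VM is still bridged.
print(bridge_arp_over_tunnel("192.168.2.1", ["192.168.2.1"]))
print(bridge_arp_over_tunnel("192.168.2.50", ["192.168.2.1"]))
```

In this model the three calls print True, False, True, which is the behavior described above: only the gateway address is treated specially, and the rest of the L2 segment remains stretched.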
Also please take a look at this interesting post: https://networkinferno.net/ingress-optimisation-with-nsx-for-vsphere
This one gives you a little more insight into the functionality and purpose of Egress IP Optimization: L2 VPN Overview
The current design is that Site 2 (Edge Server) should continue to reach the internet and other networks through the physical gateway. Currently, the Edge Server only acts as the L2VPN server; it has no other uplink that provides internet access. Because of this design, I cannot enable Egress Optimization.
Our network engineer tells me that, at the switch level, he sees the MAC of the Edge Server subinterface for all clients, yet some clients have had no ARP table problems on their virtual servers (Site A). So far only one client has reported the issue (their VLANs originate on the same L3 switch).
Are there any switch-level settings, enabled or disabled, that could be causing this problem?
That diagram was very helpful. I can see that the purpose of Site B is just to extend the L2 network and provide E-W connectivity, while on Site A, where everything is located, some internet connectivity is not working. I have a few questions:
I am not an expert in vCloud, but do you configure the NSX-V Edge on the VDC side as the Server? From there you can enable the Egress Optimization IP for Site A, and all the machines on that site will be able to reach the internet:
"For this use case, the VMware Cloud Provider NSX Edge appliance acts as the L2VPN server and on the customer side, the standalone NSX Edge appliance acts as the client. For implementations in which the migrated workloads require Internet access, enable egress optimization on the NSX Edge fulfilling the L2VPN server role. This supports the local routing of migrated systems as opposed to sending data across the VPN tunnel. This allows, for example, workloads on the VMware Cloud Provider Program side of the VPN to access the Internet locally"
In this scenario, VLAN 21 is the one that was extended to Site B. All VMs in VLAN 21 have had the problem (the gateway MAC changing to the Site B Edge MAC) at different times. As a result, they lose connectivity to other networks and to the internet, and VMs in other VLANs cannot communicate with these servers. We see this in our monitoring, which reports the VMs as down at the network level.
By default, all ARP requests are redirected to the L2VPN Server, so as long as you do not configure an Egress IP on the Client, this will happen for sure. You told me you cannot configure an Egress IP on Site B, but what if you configure it on Site A?
If I configure the Egress Optimization Gateway at Site A, the VMs at Site B are unable to communicate with VMs at Site A. I believe that by enabling Egress Optimization Gateway on either side, the communication is suspended bi-directionally.
We ran a test in which the subinterface on the Edge Server was configured with a Gateway CIDR that uses a free IP other than the original gateway, and we had no problems while the tunnel was running.
I think that when you only extend the L2 network and egress still goes through the physical gateway, this is the best option, because it avoids problems with the ARP table.
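Before applying that workaround, the chosen address can be sanity-checked. This is a minimal sketch using Python's standard ipaddress module; 192.168.2.0/24 is the extended network mentioned earlier, while the gateway 192.168.2.1, the subinterface IP 192.168.2.254, and the helper name are hypothetical.

```python
from ipaddress import ip_address, ip_network


def check_subinterface_ip(cidr, physical_gateway, subinterface_ip, used_ips=()):
    """Return a list of problems with the proposed subinterface IP.

    An empty list means the IP is inside the extended network, is a
    usable host address, and does not collide with the physical
    gateway or any address listed in used_ips.
    """
    net = ip_network(cidr)
    gateway = ip_address(physical_gateway)
    sub = ip_address(subinterface_ip)
    problems = []
    if sub not in net:
        problems.append("outside the extended network")
    elif sub in (net.network_address, net.broadcast_address):
        problems.append("network or broadcast address")
    if sub == gateway:
        problems.append("same IP as the physical gateway")
    if sub in {ip_address(ip) for ip in used_ips}:
        problems.append("already in use")
    return problems


# A free address distinct from the gateway passes the check.
print(check_subinterface_ip("192.168.2.0/24", "192.168.2.1", "192.168.2.254"))

# Reusing the gateway IP on the subinterface is flagged.
print(check_subinterface_ip("192.168.2.0/24", "192.168.2.1", "192.168.2.1"))
```

The second case is the one that matches the symptom in this thread: two devices on the stretched segment holding the gateway IP.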
What do you think? Is there a documented use case that matches this diagram? I have searched but have not found anything so far.
To be honest, that should not happen. Egress IP is meant to let the local site's VMs reach external networks without being routed to the other site. I remember trying this myself in the past, and it worked perfectly.
So what you did was configure a new IP to use as the gateway instead of the one on the physical router, and it worked, right?
I tried this a long time ago, even with vCloud, but during the week I will try to replicate your scenario between NSX-V and a legacy platform to understand it a little better. In the meantime, maybe one of the experts in this forum can help you understand the feature.