If I understand your scenario correctly, the issue is specific to the L2VPN server-side VMs (packet drops when the VMs reside on different hosts)? If that is the case, the issue has nothing to do with L2VPN itself. I suspect this could be a VLAN tagging issue on the host/switch, based on the design. To rule that out, can you double-check whether VLAN reachability works for the VMs when they reside on different hosts, keeping the VPN aside? You also mentioned an "L2-VPN routing loop": did you actually encounter a loop, or was that an assumption?
Thanks for your response, Sree, and sorry for missing one point in my initial post.
On the server site, when the L2VPN Server ESG and the stretched workload VMs are on the same host, the stretched workloads on-prem can reach the remote VMs. However, when I separate the workload VMs and the L2VPN Server ESG VM on the server site, the VMs within the site can still reach each other, but the VM on the separate host cannot be reached by the on-premises VMs.
VMs that are still on the same host as the L2VPN Server ESG VM can still be reached from the on-premises workload VMs.
It is a collapsed-cluster design, so one cluster hosts everything, i.e. workload VMs and Edges. I tried to follow the guide below, but I am not sure why it is not helping here. Please suggest.
"However when i separate the workload vms and L2VPN server ESG vm on the server site then within the site vms can reach each other however the vm which is on separate host cannot be reached by on-premise vm"
In the above case, can the workload VM reach the L2VPN server when they are on different hosts?
Thanks for the follow-up.
Short answer: no.
For the details, here is the current view.
L2VPN Server ESG VM (for testing purposes, I configured a trunk NIC with one available IP from the stretched subnet assigned to the sub-interface)
1- VM3 (on-prem) can reach VM2 (server site) and vice versa
2- VM3 (on-prem) can reach the L2VPN Server sub-interface (server site)
3- VM2 (server site) can reach the L2VPN Server sub-interface (server site)
4- VM2 (server site) can reach VM1 (server site) and vice versa
--> This is only possible when the trunk port group's teaming policy is "Route based on originating virtual port ID" with a single active uplink.
The workload port group has the same load-balancing policy as the trunk PG, but with two active uplinks. If I match the workload PG's teaming policy to the trunk PG's, even this communication stops. I am trying to follow the link I posted earlier for L2VPN routing-loop mitigation.
5- VM1 (server site) cannot reach the L2VPN Server sub-interface (server site), based on a packet capture taken on the trunk interface of the L2VPN Server.
--> Not sure why, as VM1 can reach VM2, which is on the same host where the L2VPN Server resides.
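To illustrate why the trunk PG is usually restricted to a single active uplink, here is a toy Python sketch (not NSX's actual data path; the switch model, port names, and MAC address are all hypothetical). The L2VPN server bridges remote MACs onto the host's uplinks; if frames from the same bridged source MAC egress two active uplinks, the upstream physical switch's MAC table flaps between ports, so return traffic can land on the wrong uplink and get dropped:

```python
# Toy model of an upstream physical switch's MAC learning table.
# Demonstrates why sourcing frames from one MAC over two active
# uplinks makes the switch's table flap (hypothetical names/values).

class ToySwitch:
    def __init__(self):
        self.mac_table = {}   # MAC address -> switch port last seen on
        self.moves = 0        # how many times a MAC changed port ("flaps")

    def learn(self, src_mac, port):
        prev = self.mac_table.get(src_mac)
        if prev is not None and prev != port:
            self.moves += 1   # same MAC seen on a different port
        self.mac_table[src_mac] = port

    def forward_port(self, dst_mac):
        # Return traffic follows the last-learned port (floods if unknown)
        return self.mac_table.get(dst_mac, "flood")

remote_mac = "00:50:56:aa:bb:cc"  # a MAC bridged in by the L2VPN server

# Case 1: trunk PG pinned to one active uplink -> stable MAC table
sw_single = ToySwitch()
for _ in range(10):
    sw_single.learn(remote_mac, "uplink1")
print(sw_single.moves)                    # -> 0 (no flapping)

# Case 2: two active uplinks -> frames alternate, table flaps
sw_dual = ToySwitch()
for i in range(10):
    sw_dual.learn(remote_mac, "uplink1" if i % 2 == 0 else "uplink2")
print(sw_dual.moves)                      # -> 9 (flaps on every frame)
print(sw_dual.forward_port(remote_mac))   # -> whichever port was last learned
```

The same reasoning is why matching the workload PG's uplink configuration matters: once the bridged MACs are only ever sourced from one uplink, the upstream switch's view stays stable.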
Thanks for the detailed explanation. A few more queries:
1. Is the L2VPN client-side deployment done on a Standard switch or a DVS?
2. What type of port policies are in use here: forged transmits / promiscuous mode, a sink port, or a combination of these?
3. Instead of moving the VMs behind the L2VPN server, if we migrate the L2VPN ESG to another host, does the tunnel go down in that case?