We have a pretty strange situation which I'm having trouble understanding.
In our VCF 3.9.0 SDDC, the Edge cluster (4 x ESXi 6.7 hosts) hosts 4 NSX-T 2.5.0 VM edge nodes. Each pair belongs to a different BGP VRF and is successfully peered with our network fabric. The symptoms described below do not affect the BGP relationships with the fabric, and edge node controller and manager connectivity is also unaffected. Each edge node pair has a single T0 and T1. All edge nodes were deployed through the NSX-T admin interface, but eth0 was moved to a segment after deployment, since the UI doesn't let you select an N-VDS segment during deployment. All ESXi host transport nodes are dual-NIC with N-VDS networking only.
We have L2 bridging configured to let us migrate systems across a couple of legacy VLANs in another DC. This works fine, and connectivity across the bridge behaves as expected for all guests, except when those guests are located on the ESXi host that runs the second edge node of our second edge pair. For whatever reason, guests placed on that ESXi host cannot get across to the bridge. The symptom is a missing ARP cache entry on the guest for the default gateway, which lives across in the other datacentre; adding a manual ARP entry on these guests does not help. When checking the logical switch ARP table on the affected host, the MAC entries end up as ff:ff:ff:ff:ff:ff. When we shut down the edge node on that host, bridge traffic for guests on that host immediately comes good, but bridge traffic for all VMs on the other ESXi transport node, where the other edge node lives, stops working. Shut that edge node down instead and the exact opposite happens!
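For anyone chasing similar symptoms, the checks above can be run roughly like this. Host names, the VNI, and the gateway IP are placeholders, and the nsxcli command paths are from memory for 2.5, so verify them against the CLI guide for your build:

```shell
# On an affected Linux guest: check whether the gateway MAC ever resolves
ip neigh show 192.0.2.1          # placeholder gateway IP; stays INCOMPLETE in our case

# On the affected ESXi transport node, via nsxcli:
nsxcli -c "get logical-switches"                  # note the VNI of the bridged segment
nsxcli -c "get logical-switch <vni> arp-table"    # this is where we see ff:ff:ff:ff:ff:ff
nsxcli -c "get logical-switch <vni> mac-table"    # compare MAC entries between the two hosts
```

Comparing the mac-table output between the working and broken transport nodes while each edge is up is what made the flip-flop behaviour obvious for us.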
None of the bridge segments are connected to any T0 or T1 constructs, as the routing point is across in the other datacentre. All bridge segments use the mac-bridge-profile for MAC discovery.
I wonder if we've hit some kind of bug in NSX-T 2.5.0, but the upgrade path from VCF 3.9.0 to 4.10 is a long and painful one.
I'd appreciate it if anyone has seen this or might understand what we've missed.
We may need to open a case with VMware and troubleshoot what's happening under the hood.
But would it be possible for you to deploy a new set of Edges dedicated to bridging?
I have a customer on VCF where bridging works using a dedicated Edge and a dedicated transport zone for that Edge.
However, we hit a bug where making a change to the bridging left it stuck with a corrupted config. The fix was to remove the bridging profile and create a new one, which causes downtime.
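For reference, that remove-and-recreate workaround can be driven through the NSX-T management API. This is a hedged sketch assuming the `/api/v1/bridge-endpoint-profiles` endpoint from the 2.5 API guide; the manager FQDN, credentials, IDs, and body fields here are placeholders you should check against your version's API reference, and bridging is down from the delete until the recreation completes:

```shell
NSX=https://nsx-manager.example.local     # placeholder manager FQDN
AUTH='admin:changeme'                     # placeholder credentials

# List bridge endpoint profiles and note the id of the stuck one
curl -ks -u "$AUTH" "$NSX/api/v1/bridge-endpoint-profiles"

# Delete the corrupted profile (bridging outage starts here)
curl -ks -u "$AUTH" -X DELETE "$NSX/api/v1/bridge-endpoint-profiles/<profile-id>"

# Recreate it against the same edge cluster as before
curl -ks -u "$AUTH" -X POST "$NSX/api/v1/bridge-endpoint-profiles" \
  -H 'Content-Type: application/json' \
  -d '{"display_name": "bridge-profile-new", "edge_cluster_id": "<edge-cluster-id>"}'
```

Any segments attached via the old profile will need their bridge endpoints re-pointed at the new profile, which is what makes this a downtime operation.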
Thanks Bayu, that could certainly be an option. Currently I'm looking to get our VCF 3.9.0 upgraded to VCF 3.10, which will then give us NSX-T 2.5.1, which I'm hoping could resolve this and a number of other issues we're seeing. The bridging is also temporary (it always is!), so once we get rid of it, that won't be a worry either.