We are working on an NSX deployment across two sites and three clusters, cross-vCenter with a UDLR control VM at each site. All clusters use a vDS with LAGs on the relevant interfaces up to a Juniper QFX top-of-rack. Each site has an HA pair of ESGs for north-south traffic, configured with local egress.

We are seeing issues where VMs are intermittently unable to egress traffic. When it works, it works correctly - traffic egresses at the right spot. VMs on the same segment can always ping each other regardless of where they are, but certain ones can't egress via either ESG; the furthest they can ping is the uplink interface of the UDLR. If we vMotion to a new host, it might start working, but it's entirely random - sometimes we vMotion from host A to host B and egress starts working, then we vMotion back to A and it keeps working, then we vMotion to host B again and it stops working. The issue occurs at both sites/clusters.
At one point earlier in the deployment we couldn't ping half the VTEP interfaces, and it turned out to be the load balancing policy on the NSX-created port groups; fixing that brought them all up consistently. I still suspect there is some LACP/LAG issue occurring here, but I'm not certain. I've perused a few guides on NSX configuration with vDS, but some of the interchangeable terminology has me confused - maybe someone can point out what we might want to change in our load balancing policies here:
Site 1:
Cluster A / DVS A
Cluster B / DVS B (Management/Edge)
Site 2:
Cluster C / DVS C
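For anyone wanting to compare notes, this is roughly how we pull the teaming/LACP state per host before changing anything (a diagnostic sketch; output formats vary by ESXi version):

```shell
# List the dvSwitches this host participates in, including uplink/LAG membership
esxcli network vswitch dvs vmware list

# Show negotiated LACP status for each LAG on this host's uplinks -
# look for partner flags and "bundled" state on every member NIC
esxcli network vswitch dvs vmware lacp status get

# Show vmknics; the VTEP interfaces are the ones on the vxlan netstack
esxcli network ip interface list
```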
Is LACP required? There are some strict requirements when using LACP, and it's very easy to mess up.
Otherwise, I'm assuming you've checked MTU is all good?
Not strictly required, but it's what is already configured across the entire environment. I'm looking at removing LACP, but it's a delicate operation on the production hosts (which include VDI) and I'm unsure of the transition path. First, though, I'd like to understand what those strict requirements are.
Yes, MTU is good - or it is now; there was an issue with that between sites earlier.
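For what it's worth, the check we use is a don't-fragment vmkping between VTEPs on the vxlan netstack (the vmk interface and destination VTEP IP below are examples; the payload size assumes a 1600-byte transport MTU):

```shell
# 1572 bytes payload + 8 ICMP header + 20 IP header = 1600 on the wire;
# -d sets the don't-fragment bit so an undersized hop fails loudly
vmkping ++netstack=vxlan -I vmk3 -d -s 1572 192.168.250.12
```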
Also, any reason why the ESGs were put in HA mode rather than peering with the UDLR individually?
They are peering individually - sorry if I wasn't clear. There are two distinct HA pairs: one pair for DC1 egress/ingress and the other for DC2. Four ESG VMs altogether.
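In case it helps anyone verify the same thing, the peering can be confirmed from each edge's CLI (we're assuming BGP between the ESGs and the UDLR here; substitute the OSPF equivalents if that's your protocol):

```shell
# On each ESG (console or SSH): confirm the session toward the UDLR
# uplink shows Established, and that prefixes are actually exchanged
show ip bgp neighbors

# Confirm the routing table contains what you expect from the peer
show ip route
```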
We decided this looked enough like a LACP/LAG issue, and there is enough VMware documentation frowning on LACP/LAG usage (particularly for the NSX edge clusters), that we went ahead and reworked the network to remove it. We tested the process at the DR site first, then rolled it through production host by host.
That worked well across the board and we are now LACP-free. The issue appears to have gone away with it.
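For anyone facing the same migration, the rough per-host shape of it looks like this (a sketch, not an exact runbook - NIC, LAG, and switch port names are examples, and the vDS uplink/teaming changes themselves are made in vCenter rather than at the CLI):

```shell
# 1. Evacuate the host before touching its uplinks
esxcli system maintenanceMode set --enable true

# 2. In vCenter, move one pNIC out of the LAG to a standalone vDS uplink
#    while the other LAG member stays active on the switch side.
# 3. On the QFX, remove the matching member from the ae bundle and
#    reconfigure it as a plain trunk, e.g. (Junos, illustrative):
#      delete interfaces xe-0/0/10 ether-options 802.3ad ae1
#      set interfaces xe-0/0/10 unit 0 family ethernet-switching interface-mode trunk
# 4. Change portgroup teaming from "Route based on IP hash" to
#    "Route based on originating virtual port".
# 5. Repeat for the second pNIC, then delete the now-empty LAG from the vDS.

# 6. Verify both uplinks are up and carrying traffic, then bring the host back
esxcli network nic list
esxcli system maintenanceMode set --enable false
```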