We are working on an NSX deployment across two sites and three clusters, cross-vCenter with a UDLR control VM at each site. All clusters use a vDS with LAGs on the relevant interfaces up to a Juniper QFX top-of-rack. Each site has an HA pair of ESGs for north-south traffic, configured with local egress.

We are seeing issues where VMs are intermittently unable to egress traffic. When it works, it works correctly - traffic egresses at the right spot. VMs on the same segment can always ping each other regardless of where they are, but certain ones can't egress via either ESG; the furthest they can ping is the uplink interface of the UDLR. If we vMotion to a new host, it might start working, but it's entirely random - sometimes we vMotion from host A to host B and egress starts working, then we vMotion back to A and it keeps working, then we vMotion to host B again and it stops working. The issue occurs at both sites/clusters.
At one point earlier in the deployment we couldn't ping half the VTEP interfaces, and it turned out to be the load balancing policy on the NSX-created port groups; fixing that brought them all up consistently. I still suspect there is some LACP/LAG issue occurring here, but I'm not certain. I've perused a few guides on NSX configuration with vDS, but some of the interchangeable terminology has me confused - maybe someone can point out what we might want to change in our load balancing policies here:
Site 1:
Cluster A / DVS A
Cluster B / DVS B (Management/Edge)
Site 2:
Cluster C / DVS C
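For anyone wanting to compare notes, this is roughly how we pull the teaming/LACP state per host before changing anything (a diagnostic sketch; output formats vary by ESXi version):

```shell
# List the dvSwitches this host participates in, including uplink/LAG membership
esxcli network vswitch dvs vmware list

# Show negotiated LACP status for each LAG on this host's uplinks -
# look for partner flags and "bundled" state on every member NIC
esxcli network vswitch dvs vmware lacp status get

# Show vmknics; the VTEP interfaces are the ones on the vxlan netstack
esxcli network ip interface list
```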
Is LACP required? There are some strict requirements when using LACP, and it's very easy to mess up.
Otherwise, I'm assuming you've checked MTU is all good?
Not strictly required, but it's what is already configured across the entire environment. I'm looking at removing LACP, but it's a delicate operation on the production hosts (which include VDI) and I'm unsure of the transition path. First, though, I'd like to understand what those strict requirements are.
Yes, MTU is good - or it is now; there was an issue with that between sites earlier.
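For what it's worth, the check we use is a don't-fragment vmkping between VTEPs on the vxlan netstack (the vmk interface and destination VTEP IP below are examples; the payload size assumes a 1600-byte transport MTU):

```shell
# 1572 bytes payload + 8 ICMP header + 20 IP header = 1600 on the wire;
# -d sets the don't-fragment bit so an undersized hop fails loudly
vmkping ++netstack=vxlan -I vmk3 -d -s 1572 192.168.250.12
```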
Also, any reason why the ESGs were put in HA mode rather than peering with the UDLR individually?
They are peering individually - sorry if I wasn't clear. There are two distinct HA pairs: one pair for DC1 egress/ingress and the other for DC2. Four ESG VMs altogether.
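In case it helps anyone verify the same thing, the peering can be confirmed from each edge's CLI (we're assuming BGP between the ESGs and the UDLR here; substitute the OSPF equivalents if that's your protocol):

```shell
# On each ESG (console or SSH): confirm the session toward the UDLR
# uplink shows Established, and that prefixes are actually exchanged
show ip bgp neighbors

# Confirm the routing table contains what you expect from the peer
show ip route
```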
We decided this looked enough like a LACP/LAG issue, and there is enough VMware documentation frowning on LACP/LAG usage (particularly for the NSX edge clusters), that we went ahead and reworked the network to remove it. We tested the process at the DR site first, then rolled it through production host by host.
That worked well across the board and we are now LACP-free. The issue appears to have gone away with it.
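For anyone facing the same migration, the rough per-host shape of it looks like this (a sketch, not an exact runbook - NIC, LAG, and switch port names are examples, and the vDS uplink/teaming changes themselves are made in vCenter rather than at the CLI):

```shell
# 1. Evacuate the host before touching its uplinks
esxcli system maintenanceMode set --enable true

# 2. In vCenter, move one pNIC out of the LAG to a standalone vDS uplink
#    while the other LAG member stays active on the switch side.
# 3. On the QFX, remove the matching member from the ae bundle and
#    reconfigure it as a plain trunk, e.g. (Junos, illustrative):
#      delete interfaces xe-0/0/10 ether-options 802.3ad ae1
#      set interfaces xe-0/0/10 unit 0 family ethernet-switching interface-mode trunk
# 4. Change portgroup teaming from "Route based on IP hash" to
#    "Route based on originating virtual port".
# 5. Repeat for the second pNIC, then delete the now-empty LAG from the vDS.

# 6. Verify both uplinks are up and carrying traffic, then bring the host back
esxcli network nic list
esxcli system maintenanceMode set --enable false
```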