grosas
Community Manager

Single-packet loss availability (Issues with ECMP-based approaches)

Bringing in a question that was addressed to me personally elsewhere, in comment #2382 on "Understanding High Availability on the NSX Edge Services Gateway". I'm looking into it, but thought it was interesting enough to share with the collective.

"Question for you. I’m currently working on a project to deploy NSX for our university. I’ve read through the NSX design guide and found it to be a bit too vague about how to achieve high availability with fast convergence and single-packet loss (1 sec?).

I’ve tested several different ESG routing designs thus far.

1.) ESG Active/Standby HA – failover is too slow (10 seconds?)

2.) Three ESGs with ECMP, running OSPF Area 0 to the core and Area 51 as an NSSA (VXLANs on the DLR) – failover of ESG 1 and ESG 2 is very fast, with a single packet lost. However, restarting ESG 3 results in a very long (many minutes) outage due to issue with the NSSA type 7 to type 5 translation only being done on the ESG with the highest IP.

Furthermore, I suspect that OSPF may present additional challenges when it comes to protecting the core, so that it accepts only trusted prefixes advertised from my ESGs. I'm not sure how this can be accomplished with OSPF.

3.) Three ESGs with ECMP, running iBGP peering with the core and OSPF Area 0 peering with the DLR. My BGP knowledge is admittedly somewhat limited, but I tried every combination of tweaks and still couldn’t get better than about 2 minutes to converge after the failure of an ESG. I’m inclined to believe that BGP is not the best option for achieving fast convergence with NSX.

Any input you might be able to provide would be greatly appreciated.

Thanks!"

_____________________________________
Gabe Rosas (VMware HCX team at VMware)
Blog: hcx.design
LinkedIn: /in/gaberosas
Twitter: gabe_rosas
grosas
Community Manager

Oops - forgot to add my initial response:

In your scenario #1, you saw a failover time of 10 seconds with a 6-second dead time? The standby peer should definitely take over in under 10 seconds, but you're looking for single-packet-loss convergence.

For scenario #2, I actually wasn't aware of that limited translation behavior. You say "due to issue with the NSSA type 7 to type 5 translation only being done on the ESG with the highest IP". Do you have a PR or reference for that? If the area type was implemented per the NSSA RFC, that LSA translation function should be distributed. Did you try an identical configuration with normal areas?

For scenario #3, are you running the BGP multipath/maximum-paths commands at the core for your peering with the ESG cluster? You definitely should not be seeing convergence in minutes with path load balancing.
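
One possible factor behind the roughly two-minute figure in scenario #3, offered as an assumption rather than a diagnosis: if an ESG fails silently and nothing else (link state, BFD) tears the session down, the core only declares the peer dead when the BGP hold timer expires. The sketch below computes that detection window. The 60/180 and 1/3 second timer pairs are illustrative values, not settings confirmed in this thread, and `detection_window` is a hypothetical helper.

```python
def detection_window(keepalive_s: int, hold_s: int) -> tuple[int, int]:
    """Return (best_case, worst_case) seconds to detect a silently dead BGP peer.

    Assumes the peer stops sending keepalives with no other failure signal.
    The hold timer restarts on each received keepalive, so:
      - worst case: the peer dies right after sending one -> full hold time
      - best case: the peer dies just before its next one -> hold - keepalive
    """
    if keepalive_s <= 0 or hold_s < keepalive_s:
        raise ValueError("expected 0 < keepalive_s <= hold_s")
    return (hold_s - keepalive_s, hold_s)


if __name__ == "__main__":
    # Illustrative timer pairs (assumed, not taken from the thread).
    for keepalive, hold in [(60, 180), (1, 3)]:
        best, worst = detection_window(keepalive, hold)
        print(f"keepalive={keepalive}s hold={hold}s -> detect in {best}-{worst}s")
```

If the timers in use are close to the first pair, a two-to-three-minute outage would be consistent with hold-timer expiry rather than with multipath behavior, so it may be worth checking the negotiated timers on both the core and the ESGs.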

tanurkov
Enthusiast

Hi grosas,

This is the issue that was mentioned:

"

Issue 1492547: Extended convergence time seen when NSX-based OSPF area border router with highest IP address is shut down or rebooted

If an NSSA area border router which does not have the highest IP address is shut down or rebooted, traffic converges rapidly to another path. If an NSSA area border router with the highest IP address is shut down or rebooted, a multi-minute re-convergence time is seen. The OSPF process can be cleared manually to reduce the convergence time.

Workaround: See VMware knowledge base article 2127369.

"
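
The "highest IP address" wording is consistent with the NSSA translator election in RFC 3101, where the candidate border router with the highest router ID is elected to translate type 7 LSAs into type 5. A minimal sketch of that selection rule follows, assuming router IDs are written as dotted-quad strings; the addresses and the `elect_translator` helper are hypothetical, and this does not model how NSX implements it.

```python
from ipaddress import IPv4Address
from typing import Optional


def elect_translator(router_ids: list[str]) -> Optional[str]:
    """Pick the candidate with the numerically highest router ID.

    Comparison is numeric, not lexical, so "10.0.0.10" beats "10.0.0.9".
    Returns None when no candidates remain.
    """
    if not router_ids:
        return None
    return max(router_ids, key=IPv4Address)


if __name__ == "__main__":
    # Hypothetical router IDs for three ESGs acting as NSSA border routers.
    esgs = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
    print("translator:", elect_translator(esgs))

    # Losing a non-translator leaves the translator unchanged.
    print("after losing 10.0.0.1:", elect_translator(["10.0.0.2", "10.0.0.3"]))

    # Losing the translator forces a new election among the survivors.
    print("after losing 10.0.0.3:", elect_translator(["10.0.0.1", "10.0.0.2"]))
```

Under this rule only the loss of the highest-ID router changes which device performs the translation, which matches the asymmetry described in the issue text.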

Regards,
Dmitri
