VMware Networking Community
rajeevsrikant
Expert

NSX Edge - ECMP

Below is the diagram of the setup.

The Edge Gateway is deployed in Active-Standby mode with ECMP enabled.

The OSPF cost is 1 on both the NSX side and the router side.

So there are 2 equal-cost paths from the Edge Gateway to the physical routers.

Traffic therefore flows through both paths.

Now I want to take VLAN 101 out of the traffic path for maintenance purposes.

Traffic should flow only through VLAN 100.

How should I do it?

I don't want any traffic interruption.

Are the steps below correct?

Step 1: Disable ECMP. I am not sure whether doing this will have any impact.

Step 2: Change the OSPF cost from 1 to 100 on both the NSX side and the R2 side for VLAN 101.

This should move the traffic off VLAN 101.

OR

Step 1: Change the OSPF cost from 1 to 100 on both the NSX side and the R2 side for VLAN 101, without disabling ECMP.

Let me know which is the right approach.

[Image: pastedImage_0.png, diagram of the setup]

21 Replies
rajeevsrikant
Expert

Any inputs

amolnjadhav
Enthusiast

Hi Rajeev,

We have a similar deployment, but my question to you is: why do you have an Active-Standby Edge deployment if you have ECMP enabled?

As per my understanding, ECMP should be enabled if you have Active-Active Edges deployed. Correct me if I am wrong.

rajeevsrikant
Expert

Since the active Edge Gateway has 2 equal-cost paths to the physical routers, ECMP has been enabled.

rajeevsrikant
Expert

Any inputs

oergmann
VMware Employee

I would go forward with your first approach.

"Are the steps below correct?

Step 1: Disable ECMP. I am not sure whether doing this will have any impact.

Step 2: Change the OSPF cost from 1 to 100 on both the NSX side and the R2 side for VLAN 101."

There should be no interruption during Step 1. I assume you have configured the OSPF Hello/Dead timers as 1/3, so the routing change should be very fast.
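
As a rough illustration of why the timers matter: when a neighbor disappears silently, OSPF declares it down only once the dead interval expires, so the dead interval bounds the detection time. The sketch below compares the common broadcast-network defaults (hello 10 s, dead 40 s) with the 1/3 setting. It ignores SPF calculation and forwarding-table update time, and the function name is made up for illustration.

```python
def worst_case_detection_seconds(hello: int, dead: int) -> int:
    """Upper bound on the time to notice a silently lost OSPF neighbor.

    The dead timer restarts on every hello received, so the neighbor is
    declared down at most `dead` seconds after its last hello arrived.
    """
    # Assumption for this sketch: a dead interval that is not longer than
    # the hello interval is treated as a configuration mistake.
    if dead <= hello:
        raise ValueError("dead interval must exceed the hello interval")
    return dead


for label, hello, dead in [("default 10/40", 10, 40), ("tuned 1/3", 1, 3)]:
    print(f"{label}: up to {worst_case_detection_seconds(hello, dead)} s")
```

This only covers unplanned neighbor loss; a planned cost or ECMP change is advertised right away and does not wait for the dead timer.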

rajeevsrikant
Expert

Thanks, but I have the following query.

If I disable ECMP, only 1 path will be selected.

My understanding is that it will be through either VLAN 100 or VLAN 101.

Is there any way to control this? How can I make sure that, once ECMP is disabled, my active route is via VLAN 100, so that I can change the cost on VLAN 101 and do the required maintenance?

Sreec
VMware Employee

Why don't you simply disable the second Edge uplink (20.x.x.x) and make the required change?

Cheers,
Sree | VCIX-5X| VCAP-5X| VExpert 7x|Cisco Certified Specialist

rajeevsrikant
Expert

Thanks.

That is one of the options I am looking at, but there will be an impact in doing that.

The live traffic flowing through that particular interface will be affected.

I want to make the change without interrupting traffic.

Techstarts
Expert

"The live traffic flowing through that particular interface will be affected. I just wanted to achieve uninterrupted traffic by doing that."

It will take at most 30 seconds for traffic to fail over. I say "at most" on the assumption that your Edges are service provider Edges.

With Great Regards,

rajeevsrikant
Expert

Neither the NSX Edges nor the physical devices are service provider devices.

So you mean to say that if I disable the interface on the NSX Edge, there will be no impact?

From my understanding there will be an impact. The reason is that the path from the NSX Edge to the physical device may change, but the path from the physical device to the NSX Edge will take time to reconverge. Because of this there will be downtime.

Correct me if I am wrong.

Sreec
VMware Employee

You can monitor the rx/tx packet flow with esxtop on the host where the Edge is running and check the vNIC counters. When there is no flow, you can flip ECMP to the disabled state or simply disable the second interface. This is one way to make the change with less outage.

rajeevsrikant
Expert

Thanks.

But this is not practical, since it is a production environment.

The traffic volume will be high, so I cannot expect to wait and watch for a moment with no flow.

So what is the best way to change from 2 equal-cost paths to 1 path without traffic interruption?

One way is to change the cost on both the NSX side and the physical router side.

But is it required to disable ECMP to do this?

Sreec
VMware Employee

Going by your steps and requirement, it will no longer be ECMP; it will be unequal-cost load balancing. If ECMP is enabled and you change the cost, then as far as I know existing flows will not be affected, because the hash has already been calculated and is stored in the router's memory (in this case the Edge's), and a cost change should not trigger a new hash calculation. Over time, however, a new hash will be calculated for new flows, and they will pass via the optimized path, which would be the 10.10.x.x network, and will take the 20.x.x.x path whenever the situation demands.
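
To illustrate the idea, here is a minimal Python sketch of hash-based flow placement over the lowest-cost uplinks. It is a simplified model, not the Edge's actual algorithm: the `Uplink` type, the SHA-256 hash over a source/destination pair, and the sample addresses are all assumptions for illustration, and it does not model any per-flow cache that would keep existing flows on their old path.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Uplink:
    name: str   # illustrative label taken from the addressing in this thread
    cost: int   # OSPF cost of the path through this uplink


def lowest_cost_uplinks(uplinks):
    """Return the uplinks tied at the lowest cost, in a stable order."""
    best = min(u.cost for u in uplinks)
    return sorted((u for u in uplinks if u.cost == best), key=lambda u: u.name)


def pick_uplink(flow, uplinks):
    """Hash a (source, destination) pair onto one lowest-cost uplink."""
    candidates = lowest_cost_uplinks(uplinks)
    digest = hashlib.sha256("|".join(flow).encode()).digest()
    index = int.from_bytes(digest[:4], "big") % len(candidates)
    return candidates[index]


# Hypothetical flows; the addresses are made up for the example.
flows = [("192.168.1.10", f"172.16.0.{i}") for i in range(1, 9)]
equal = [Uplink("10.10.x.x", 1), Uplink("20.x.x.x", 1)]
raised = [Uplink("10.10.x.x", 1), Uplink("20.x.x.x", 100)]

# With equal costs, each flow is pinned to one of the two uplinks by its hash.
print({flow[1]: pick_uplink(flow, equal).name for flow in flows})
# With the cost raised, only one uplink is left for every flow to hash onto.
print({pick_uplink(flow, raised).name for flow in flows})
```

In this model the same flow always hashes to the same uplink while the costs are equal, and once the 20.x.x.x cost is raised every newly hashed flow lands on 10.10.x.x.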

Techstarts
Expert

Dear Rajeev,

Please review with your management team how much downtime is acceptable.

If they cannot accept 30 seconds of downtime, I suspect the requirement was mapped incorrectly into the design and implementation.

Please review the design guide on modifying the hello interval and dead interval when you deploy ECMP.

With Great Regards,

rajeevsrikant
Expert

Thanks.

I cannot have downtime for this activity.

That is for sure.

Also, my previous experience was as follows:

ECMP was enabled.

I changed the OSPF cost of one of the NSX Edge Gateway's uplinks from 1 to 100.

No OSPF cost change was made on the router side.

With the above, I experienced packet loss and network disconnection.
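
One possible reading of this result: an OSPF interface cost only counts in the outbound direction of the device it is configured on. The sketch below models each side's outbound cost per uplink and shows that raising the cost on the Edge alone changes the Edge-to-router choice, while the router-to-Edge direction still sees two equal paths. The dictionary layout and names are assumptions for illustration, the physical side is collapsed into a single view, and the sketch does not by itself explain the packet loss.

```python
def lowest_cost_links(costs):
    """Return the links tied at the lowest outbound cost, sorted by name."""
    best = min(costs.values())
    return sorted(link for link, cost in costs.items() if cost == best)


# Outbound OSPF cost per uplink, as seen from each side.
edge_out = {"vlan100": 1, "vlan101": 100}   # cost raised on the Edge only
router_out = {"vlan100": 1, "vlan101": 1}   # physical side left unchanged

print("Edge -> routers:", lowest_cost_links(edge_out))
print("routers -> Edge:", lowest_cost_links(router_out))
```

Under these assumptions, traffic towards the Edge can still arrive on VLAN 101 even though the Edge itself only sends on VLAN 100, which is why the cost would need to be changed on both sides.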

ASIS_Intl
Contributor

This is the correct answer, I believe. Sessions will not renegotiate their path based on routing cost changes, but new sessions will use only the active path. To achieve no downtime, you would need to make the changes you describe some time before the maintenance window and monitor the active flows on the no-longer-routable-but-still-active link. Eventually the flows should drop to nothing, and then you could disable the inactive uplink on the ESG and/or proceed directly to the maintenance.

Edit: it appears I didn't read every post. Oops. I would guess you had packet loss because ECMP was still enabled on the ESG, so the ESG was still trying to send packets out of both active links while OSPF was screaming not to. I'm actually unsure of your options in that scenario. Is there no VMware KB or doc that explains how to handle this situation? Hmmm.

Edit 2: I will try testing this scenario in a lab later today if I have a few minutes. It should be fairly easy to replicate with my current setup.

Mparayil
Enthusiast

Based on my understanding, you want one link to be standby and one to be active:

1) Keep the cost at 1 for the primary link and increase the cost for the standby link.

2) Change the hello/dead timers to 1/3.

3) Disable graceful restart on all the devices, so that any OSPF reset or router restart flushes the routing table; you may notice 1 ICMP drop.

Alternatively, you can configure a default route on one link towards your physical router with an increased AD value, so that when OSPF goes down on the primary link, traffic uses the other, standby link.
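
The backup-route option can be sketched as a route selection by administrative distance. This is a minimal Python model under stated assumptions: the `Route` type, the link labels, and the distance values (110 for OSPF, 200 for the backup static route) are chosen for illustration and should be checked against the actual devices.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Route:
    source: str           # "ospf" or "static"
    next_hop: str         # illustrative label for the uplink
    admin_distance: int   # lower is preferred


def active_route(candidates: List[Route]) -> Optional[Route]:
    """Return the candidate with the lowest administrative distance."""
    return min(candidates, key=lambda r: r.admin_distance, default=None)


# Assumed values: 110 for OSPF, 200 for the backup static default route.
ospf_default = Route("ospf", "primary-link", 110)
backup_static = Route("static", "standby-link", 200)

# While OSPF is up, its route wins because its distance is lower.
print(active_route([ospf_default, backup_static]).next_hop)
# Once the OSPF route is withdrawn, the static route is the only candidate.
print(active_route([backup_static]).next_hop)
```

The static route stays out of the forwarding table as long as the OSPF-learned default exists, so it only takes over when OSPF on the primary link goes down.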

Techstarts
Expert

There is absolutely no need to change the OSPF cost. It is not the right approach.

As Sreec mentioned, it will result in unequal-cost paths and bring more downtime.

Instead, as suggested earlier, change the hello interval, which is neatly described in the NSX design guide.

ECMP is for near-zero downtime; it is unrealistic to expect ZERO downtime for this change. I believe zero downtime is more of an academic discussion than a real-world one.

With Great Regards,