Below is the diagram of the setup.
The Edge Gateway is in Active-Standby mode with ECMP enabled.
The OSPF cost is 1 on both the NSX side and the router side.
So there are 2 equal-cost paths from the Edge Gateway to the physical routers,
and traffic flows through both paths.
Now I want to move traffic off VLAN 101 for maintenance.
Traffic should flow only through VLAN 100.
How should I do it?
I don't want any traffic interruption.
Are the steps below correct?
Step 1: Disable ECMP. I am not sure whether doing this will have any impact.
Step 2: Change the OSPF cost from 1 to 100 on both the NSX side and the R2 side for VLAN 101.
This should move traffic off VLAN 101.
OR
Step 1: Change the OSPF cost from 1 to 100 on both the NSX side and the R2 side for VLAN 101. No need to disable ECMP.
Let me know which is the right approach.
Any inputs?
Hi Rajeev,
We have a similar deployment, but my question to you is: why do you have an Active-Standby Edge deployment if you have ECMP enabled?
As I understand it, ECMP should be enabled only if you have Active-Active Edges deployed. Correct me if I am wrong.
ECMP has been enabled because the active Edge Gateway has 2 equal-cost paths to the physical routers.
For reference, see "Elver's Opinion: When to ECMP over Edge HA".
I would go forward with your first approach.
Are the steps below correct?
Step 1: Disable ECMP. I am not sure whether doing this will have any impact.
Step 2: Change the OSPF cost from 1 to 100 on both the NSX side and the R2 side for VLAN 101.
There should be no interruption during Step 1. I assume you have configured the OSPF Hello/Dead timers as 1/3 seconds, so the routing change should be very fast.
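To put the 1/3 timer values in perspective, here is a minimal sketch of the arithmetic. The 10/40 second profile is an assumed set of common OSPF defaults used only for comparison, and the function name is hypothetical. The result bounds how long a silently failed neighbor can go undetected; a planned configuration change should not depend on the Dead timer at all.

```python
def worst_case_detection_seconds(hello_interval: int, dead_interval: int) -> int:
    """Upper bound, in seconds, before a silent OSPF neighbor is declared down."""
    if hello_interval <= 0 or dead_interval <= hello_interval:
        raise ValueError("dead interval must be larger than a positive hello interval")
    # The neighbor is declared down once no Hello arrives for a full dead interval.
    return dead_interval


# Assumption: 10/40 are common defaults; 1/3 are the values mentioned in the thread.
timer_profiles = {
    "assumed defaults (10/40)": (10, 40),
    "tuned as in the thread (1/3)": (1, 3),
}

for label, (hello, dead) in timer_profiles.items():
    hello_intervals = dead // hello  # Hello intervals that fit into one dead interval
    bound = worst_case_detection_seconds(hello, dead)
    print(f"{label}: down after up to {bound} s ({hello_intervals} Hello intervals)")
```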
Thanks, but I have a follow-up question.
If I disable ECMP, only 1 path will be selected.
My understanding is that it will be through either VLAN 100 or VLAN 101.
Is there any way to control which one? How do I make sure that once ECMP is disabled my active route is via VLAN 100, so that I can change the cost on VLAN 101 and do the required maintenance?
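One way to reason about this question: OSPF prefers the lowest-cost path, so if the cost on VLAN 101 is raised first, VLAN 100 is the only best path whether ECMP is on or off. The sketch below models only that selection rule. The function name, the link labels, and the tie-break used when ECMP is off are illustrative assumptions, not documented NSX Edge behavior.

```python
from typing import Dict, List


def select_next_hops(costs: Dict[str, int], ecmp_enabled: bool) -> List[str]:
    """Return the uplinks that would carry traffic for a given set of path costs."""
    best = min(costs.values())
    candidates = sorted(link for link, cost in costs.items() if cost == best)
    if ecmp_enabled:
        # ECMP installs every path that ties for the lowest cost.
        return candidates
    # Assumption: without ECMP a single path is installed. Which one wins a tie
    # is implementation-specific; taking the first sorted name is a placeholder.
    return candidates[:1]


equal = {"vlan100": 1, "vlan101": 1}
raised = {"vlan100": 1, "vlan101": 100}

print(select_next_hops(equal, ecmp_enabled=True))    # both uplinks tie
print(select_next_hops(raised, ecmp_enabled=True))   # only vlan100 is best
print(select_next_hops(raised, ecmp_enabled=False))  # same result without ECMP
```

With unequal costs the outcome no longer depends on the tie-break, which is the reason for changing the cost before disabling ECMP in this model.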
Why don't you simply disable the second Edge uplink (20.x.x.x) and make the required change?
Thanks.
That is one of the options I am looking at, but doing that will have an impact.
The live traffic flowing through that interface will be affected.
I just want to achieve this without interrupting traffic.
Traffic will take at most 30 seconds to fail over. I say "at most" assuming your Edges are service provider Edges.
Neither the NSX Edges nor the physical devices are service provider devices.
So you mean that if I disable the interface on the NSX Edge there will be no impact?
From my understanding there will be an impact. The path from the NSX Edge to the physical device may change right away, but the path from the physical device to the NSX Edge will take time to reconverge. Because of this there will be downtime.
Correct me if I am wrong.
You can monitor the rx/tx packet flow with esxtop on the host where the Edge is running and check the vNIC counters. When there is no flow, you can disable ECMP or simply disable the second interface. This is one way to make the change with less outage.
One way is to change the cost on both the NSX side and the physical router side.
But is it required to disable ECMP to do this?
With your steps and requirement it will no longer be ECMP; it will be unequal-cost load balancing. If ECMP is enabled and you change the cost, then as far as I know existing flows will not be affected, because the hash has already been calculated and is stored in router memory (in this case, on the Edge), and a cost change should not trigger a new hash calculation. Over time, however, a new hash will be calculated for new flows, and they will take the preferred path, which would be the 10.10.x.x network, and will take the 20.x.x.x path only when the situation demands.
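For readers unfamiliar with per-flow hashing, here is a generic sketch of how a flow can be mapped to one of several next hops. It is not the NSX Edge algorithm: the hash function, the 5-tuple fields, the addresses, and the function name are all assumptions for illustration. The sketch is stateless and keeps no per-flow cache, so it does not reproduce the "existing flows stay where they are" behavior described above; whether an Edge pins established flows is not confirmed anywhere in this thread.

```python
import hashlib
from typing import List, Tuple

# Hypothetical flow key: source IP, destination IP, source port, destination port, protocol.
Flow = Tuple[str, str, int, int, str]


def pick_next_hop(flow: Flow, next_hops: List[str]) -> str:
    """Map a flow to one next hop by hashing its 5-tuple (stateless, illustrative)."""
    if not next_hops:
        raise ValueError("no next hops available")
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    # The same 5-tuple always produces the same index for a given next-hop list.
    index = int.from_bytes(digest[:4], "big") % len(next_hops)
    return next_hops[index]


flow = ("192.168.1.10", "172.16.5.20", 49152, 443, "tcp")  # made-up example flow

with_ecmp = pick_next_hop(flow, ["vlan100", "vlan101"])  # two equal-cost uplinks
after_cost_change = pick_next_hop(flow, ["vlan100"])     # only one best path left

print(with_ecmp, after_cost_change)
```

In this stateless model every flow lands on vlan100 as soon as vlan101 leaves the next-hop list, so any difference between new and existing flows would have to come from flow state held elsewhere.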
Dear Rajeev,
Please review with your management team how much downtime is acceptable.
If they cannot accept 30 seconds of downtime, I suspect the requirement was not mapped correctly into the design and implementation.
Please review the design guide on modifying the Hello interval and Dead interval when you deploy ECMP.
Thanks
I cannot have downtime for this activity.
That is for sure.
Also, my previous experience is as follows.
ECMP was enabled.
I changed the OSPF cost of one of the NSX Edge Gateway's uplinks from 1 to 100.
No OSPF cost change was made on the router side.
With the above I experienced packet loss and network disconnection.
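A possible factor in this result: an OSPF interface cost applies only to traffic leaving that interface, so raising it on the Edge alone changes the Edge-to-router direction but not the return direction. The sketch below shows the resulting asymmetry with hypothetical link labels and a simplified one-hop model. It does not claim that this was the actual cause of the packet loss.

```python
from typing import Dict, List


def lowest_cost_links(costs: Dict[str, int]) -> List[str]:
    """Links that tie for the lowest outbound cost (all are used with ECMP on)."""
    best = min(costs.values())
    return sorted(link for link, cost in costs.items() if cost == best)


def links_in_use(edge_costs: Dict[str, int], router_costs: Dict[str, int]) -> Dict[str, List[str]]:
    """Each direction is decided by the sender's own outbound interface costs."""
    return {
        "edge_to_router": lowest_cost_links(edge_costs),
        "router_to_edge": lowest_cost_links(router_costs),
    }


# Cost raised on the Edge uplink only, as in the experience described above.
one_sided = links_in_use(
    edge_costs={"vlan100": 1, "vlan101": 100},
    router_costs={"vlan100": 1, "vlan101": 1},
)
# Cost raised on both sides, as in the proposed steps.
both_sides = links_in_use(
    edge_costs={"vlan100": 1, "vlan101": 100},
    router_costs={"vlan100": 1, "vlan101": 100},
)

print(one_sided)   # return traffic can still arrive over vlan101
print(both_sides)  # both directions use vlan100 only
```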
This is the correct answer, I believe. Sessions will not renegotiate their path based on the routing cost changes, but new sessions will use just the active path. To achieve no downtime you would need to make the changes you describe some time before the maintenance window and monitor the active flows on the no-longer-routable-but-still-active link. Eventually, the flows should go down to nothing and then you could either proceed with disabling the inactive uplink on the ESG and/or proceed directly to completing the maintenance.
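The "monitor until the flows go down to nothing" step can be sketched as a check over periodic counter readings. How the readings are collected (for example, noted by hand from esxtop) is left out. The function name, the sampling scheme, and the threshold are assumptions; a non-zero threshold may be needed because OSPF Hello packets keep crossing the link even when no user traffic does.

```python
from typing import Sequence


def is_drained(samples: Sequence[int], quiet_intervals: int = 3, threshold: int = 0) -> bool:
    """True when the last `quiet_intervals` counter deltas are all at or below `threshold`.

    `samples` are cumulative packet-counter readings taken at a fixed interval.
    """
    deltas = [later - earlier for earlier, later in zip(samples, samples[1:])]
    if len(deltas) < quiet_intervals:
        return False  # not enough history to decide yet
    return all(delta <= threshold for delta in deltas[-quiet_intervals:])


# Hypothetical readings from the uplink that is being drained.
still_busy = [1000, 1500, 1510, 1510, 1510]  # a 10-packet delta falls inside the window
quiet = [1000, 1500, 1500, 1500, 1500]       # three consecutive intervals with no packets

print(is_drained(still_busy))
print(is_drained(quiet))
```

Requiring several quiet intervals in a row, rather than a single zero delta, reduces the chance of acting during a short pause in otherwise active flows.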
edit - it appears I didn't read every post. Oops. I would guess you had packet loss because ECMP was still enabled on the ESG, so the ESG was still trying to send packets out of both active links while OSPF was screaming not to. I'm actually unsure of your options in that scenario. Is there no VMware KB or doc that explains how to handle this situation? Hmmm.
edit2 - I will try testing this scenario in a lab later today if I have a few minutes. Should be fairly easy to replicate with my current setup.
Based on my understanding, you want one link to be standby and one to be active.
1) Keep the cost at 1 for the primary link and increase the cost for the standby link.
2) Change the Hello/Dead timers to 1/3.
3) Disable Graceful Restart on all devices so that any OSPF reset or router restart flushes the routing table. You may notice 1 ICMP drop.
Or you can add a default route on one link towards your physical router with an increased AD value, so that when OSPF goes down on the primary link the standby link is used.
There is absolutely no need to change the OSPF cost. It is not the right approach.
As Sreec mentioned, it will result in unequal paths and bring more downtime.
Instead, as suggested earlier, change the Hello interval, which is neatly described in the NSX design guide.
ECMP is for near-zero downtime; it is unrealistic to expect ZERO downtime from it. I believe zero downtime is more of an academic discussion than a real-world one.