VMware Networking Community
janschreiber
Contributor

Availability of edges during NSX-T upgrade?

Hi,

A few weeks ago we rolled out NSX-T (version 3.1.1) and deployed two edges running in one edge cluster. On this edge cluster we're running Tier-0 and Tier-1 gateways (probably like everyone else does). The gateways run in Active/Standby, non-preemptive mode.

Last week we updated to NSX-T 3.1.2, and this caused 10 minutes of downtime for all traffic flowing through the Tier-0/Tier-1 gateways. This came as a surprise, because the NSX-T documentation states the edges are upgraded serially, so we assumed one edge would remain available to handle traffic at all times. If a couple of pings get lost during the transition between the edges, that's fine, but 10 minutes is way too much.

Is this normal? Would love to know how others are dealing with this.

Kind regards

Jan

11 Replies
Sreec
VMware Employee

1. What kind of services are hosted on the T0/T1?

2. For which traffic type did you face the outage?

3. Is routing from the T0 to the TOR/core static or dynamic?

4. Is your T0 in A/S mode?

5. Have you ever done a failover test on this platform?

Cheers,
Sree | VCIX-5X| VCAP-5X| VExpert 7x|Cisco Certified Specialist
Please KUDO helpful posts and mark the thread as solved if answered
janschreiber
Contributor

Hi,

 

1.) Gateway Firewall, DNS forwarder, NAT and IPsec on the T0; Gateway Firewall, Load Balancer and another DNS forwarder on the T1.

2.) All traffic between any segments, and all traffic to and from the internet (the LB had not been set up yet at that time).

3.) Not 100% sure what you mean, but we have only static routes; no BGP is configured anywhere.

4.) Yes, and non-preemptive.

5.) Yes (by switching each edge into maintenance mode back and forth before upgrading, and at that time there was no outage).

Thank you and kind regards

Jan

mauricioamorim
VMware Employee

That is not normal behavior. I would suggest testing failover by shutting down the active edge in vCenter and checking whether failover works.
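To put a number on the outage during such a failover test, one approach is to run a continuous `ping` against a destination behind the T0 while shutting down the active edge, then measure the largest gap in the ICMP sequence numbers afterwards. A minimal sketch (the helper name and the assumption of a one-second ping interval are mine, not from this thread):

```python
import re

def longest_loss_window(ping_lines, interval=1.0):
    """Estimate the longest outage in seconds from saved `ping` output,
    based on the largest gap between consecutive icmp_seq numbers seen."""
    seqs = []
    for line in ping_lines:
        m = re.search(r"icmp_seq=(\d+)", line)
        if m:
            seqs.append(int(m.group(1)))
    worst = 0.0
    for prev, cur in zip(seqs, seqs[1:]):
        # replies missing between two received ones -> lost interval
        worst = max(worst, (cur - prev - 1) * interval)
    return worst

# Usage: ping <vm-behind-t0> | tee failover.log, trigger the failover,
# then feed the log lines into longest_loss_window().
```

This turns "a couple of pings" versus "10 minutes" into a concrete measurement you can compare across tests.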

janschreiber
Contributor

Hi,

I did that now: as soon as I shut down the second edge VM, all traffic is interrupted until I start the VM again. The Tier-0 is currently tied to the second edge VM. I went through all the settings of the Tier-0 and came across the "HA VIP" setting, which is currently unconfigured.

Is this required for failover to work fully? That would explain it.

Kind regards

Jan

Sreec
VMware Employee

As of now, your static routes are pointing at one of the edge uplink IPs. Please configure an HA VIP (the static routes must point to the HA VIP) and test connectivity. Also keep in mind that there should be routing redundancy in the upstream routers.
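If you prefer to script this instead of using the UI, the HA VIP lives on the Tier-0's locale service in the Policy API (`ha_vip_configs`, spanning the uplink interfaces of both edges). A rough sketch of the PATCH body; the gateway/interface names and IPs below are placeholders for your environment, so verify the paths against your own `/infra` tree:

```python
import json

def build_ha_vip_payload(vip_ip, prefix_len, interface_paths):
    """Build the Tier-0 locale-service PATCH body that adds an HA VIP
    shared by the two edge uplink interfaces."""
    return {
        "ha_vip_configs": [
            {
                "enabled": True,
                "vip_subnets": [
                    {"ip_addresses": [vip_ip], "prefix_len": prefix_len}
                ],
                "external_interface_paths": interface_paths,
            }
        ]
    }

payload = build_ha_vip_payload(
    "192.0.2.3", 24,
    [
        "/infra/tier-0s/T0-GW/locale-services/default/interfaces/uplink-edge1",
        "/infra/tier-0s/T0-GW/locale-services/default/interfaces/uplink-edge2",
    ],
)
print(json.dumps(payload, indent=2))

# Apply with something like (IDs are placeholders):
#   curl -k -u admin -X PATCH \
#     https://<nsx-manager>/policy/api/v1/infra/tier-0s/T0-GW/locale-services/default \
#     -H 'Content-Type: application/json' -d @ha_vip.json
```

The same HA VIP then becomes the stable next hop for the upstream routers' static routes.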

janschreiber
Contributor

Hi,

 

the upstream routers are fully redundant. We're just moving from physical routers/firewalls at the perimeter to virtual NSX-T edges for the first time. The physical firewalls we currently use do not require an HA VIP to be fully redundant, so we missed this when setting up the edges.

May I ask a final question before fixing this: do I need 3 IPs in the transit network to the upstream routers (one for each edge plus the HA VIP), or is the one we're already using sufficient?

Thank you so much and kind regards

Jan

Sreec
VMware Employee

Yes. A minimum of 3 IPs is needed (one for each edge uplink interface, plus the HA VIP, which the active edge owns), all in the same subnet.
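A quick sanity check of the addressing plan can be done with Python's `ipaddress` module: the two uplink IPs and the VIP must be three distinct addresses inside one subnet. The example addresses below are placeholders (documentation range), not from this thread:

```python
import ipaddress

def check_uplink_plan(edge1, edge2, vip, prefix_len):
    """Return True if the two edge uplink IPs and the HA VIP are three
    distinct addresses that all fall inside the same subnet."""
    net = ipaddress.ip_network(f"{vip}/{prefix_len}", strict=False)
    ips = [ipaddress.ip_address(a) for a in (edge1, edge2, vip)]
    return len(set(ips)) == 3 and all(ip in net for ip in ips)

# Example plan: .1/.2 for the edge uplinks, .3 as the shared HA VIP
print(check_uplink_plan("192.0.2.1", "192.0.2.2", "192.0.2.3", 24))  # True
```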

 

[Attached screenshot: HA VIP configuration example]

 

janschreiber
Contributor

Hi,

I changed the setup so it now matches the pictured topology. I also made sure no gateway firewall rules, NATs or static routes are assigned to a single interface on a single edge.

Yesterday I did the upgrade to NSX-T 3.1.3, and the behaviour changed but is still not perfect:

When the upgrade of the second edge node started (this was the active one), all traffic stopped again, but this time for only 6-7 minutes, and then the first edge took over as expected. When the update on the second edge was complete, it took over again (it's defined as the preferred one), and this switchover was perfect and barely noticeable (at most a single ping was lost; on some interfaces no ping was lost).

This puzzles me a bit: why does the first switchover take so long, but not the second?

Does anyone have any ideas?

Thank you and kind regards

Jan

nutthanon
Contributor

On NSX-T 3.1.1 I faced a similar issue, whether running BGP or static routes on the T0. I found that when the T0 and T1 SRs run on the same edge cluster and the active node goes down, the T0 fails over, but the T1 gets stuck on the down node and does not fail over. (I think your T1 was stuck on the upgrading node because its active instance was probably on your second edge, so it took 6-7 minutes until that node was available again; the second switchover was faster than the first because the node was already upgraded by then.) Not sure why, maybe a bug or something. The workaround I use is to create a new edge cluster and move all T1s to that cluster; then everything looks fine. I tested taking down the active node of the T1 edge cluster: all T1s were only down for about 1-2 lost pings. Taking down the active node of the T0 cluster took around 1 minute (with BGP) to switch to the other T0. If I move the T1s back to the T0 edge cluster, the issue comes back. Hope this helps.

janschreiber
Contributor

Hi,

thanks for sharing this experience and solution! Just to make sure I got it right:

Once you moved your T1s to a new edge cluster running only the T1s, you were able to do updates and so on without any other measures? You upgrade serially like before, and it worked?

I'm definitely going to try that. Is moving the T1s to a new edge cluster disruptive, or barely noticeable?

 

Thank you so much and kind regards

 

Jan

nutthanon
Contributor

Yes. After moving all T1 routers to the new edge cluster, we can do the serial upgrade of the T0 edge cluster followed by the T1 edge cluster without any failover issue.
 
Moving a T1 to a new edge cluster was a disruptive action for me; services went down for around 1-4 lost pings per T1. Make sure you configure the new edge cluster with the proper uplinks and VLANs used by your T1 SRs. I recommend creating a new T1 on the old edge cluster, configuring a service on it, and then test-moving that T1 to the new edge cluster, so you can observe the impact of changing a T1's edge cluster in your environment before moving your production T1s.
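For reference, the move described above comes down to a single Policy API change: repointing the Tier-1's locale service at the new edge cluster via `edge_cluster_path`. A rough sketch of the request body; every ID below (site, enforcement point, cluster and gateway names) is a placeholder, so look up the real paths in your own environment before applying:

```python
import json

# Placeholder IDs -- replace with values from your /infra tree.
T1_ID = "T1-GW"
NEW_CLUSTER_PATH = (
    "/infra/sites/default/enforcement-points/default/"
    "edge-clusters/t1-edge-cluster"
)

# PATCH body for the Tier-1's locale service: only the edge cluster changes.
payload = {"edge_cluster_path": NEW_CLUSTER_PATH}
print(json.dumps(payload, indent=2))

# Apply with something like:
#   curl -k -u admin -X PATCH \
#     https://<nsx-manager>/policy/api/v1/infra/tier-1s/T1-GW/locale-services/default \
#     -H 'Content-Type: application/json' -d @move_t1.json
```

Doing it per-T1 this way also makes it easy to follow the test-first approach above: move one disposable T1, watch the ping loss, then script the rest.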
