srodenburg
Expert

Network goes down because of DLR VM vMotion

Hello,

I have a major issue that I have been having for a long time. The versions mentioned below are the current ones, but it also happened with older vSphere 6 and NSX 6.x versions.

Often (but not always), when I put a host from the "normal cluster" (with NSX) into maintenance mode, DRS vMotions all VMs away to other hosts. If one of the NSX Logical Router VMs is on that host and is also vMotioned to another host, the entire network goes down for 5 to 10 minutes.

From my physical admin workstation, I lose ping to everything. And I mean everything: core switch, distribution switches, edge switches, WiFi access points, all the ESXi servers and all the VMs. Really **everything** becomes unreachable.

After 5 to 10 minutes, everything comes back.

From the tsunami of Alert-emails that flood my mailbox (once my PC's email client re-established connection to the mailserver), it becomes clear that "everybody lost everybody" during the outage.

Environment:

NSX 6.3.2 running in Unicast mode.

vCenter Appliance 6.5 U1

1 x "Management Cluster" without NSX. This is where the NSX Manager and the three NSX Controller VM's live.

1 x "normal cluster". All ESXi 6.5 U1 hosts in it run NSX, and the two Logical Router VMs (an HA pair) run on this cluster.

No NSX Edge appliances other than the LR's.

Anti-affinity rules to ensure the LR VM's don't end up on the same host.

NSX shows no errors before the outage. Everything humming along fine.

Firewalls are non-VMware virtualized firewall appliances (the NSX Distributed Firewall is not in use at this site).

I spent a lot of time going through logs on all devices, and it is starting to look like spanning tree (RSTP) goes flippy, effectively killing layer 2 until everything settles down and gradually starts coming back. (I keep a screen open with dozens and dozens of pings running to all kinds of systems and network components.)
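To line up an outage like this with device logs afterwards, the ping results can be reduced to outage windows. Below is a minimal Python sketch of that idea. It does not run ping itself; the function name `outage_windows`, the `(timestamp, host, reachable)` sample format, and the `min_hosts_down` threshold are all assumptions made for illustration.

```python
def outage_windows(samples, min_hosts_down=2):
    """Return (start, end) windows in which many hosts were unreachable.

    samples: iterable of (timestamp_seconds, host, reachable) tuples,
             sorted by timestamp (hypothetical format, e.g. parsed
             from the output of a ping loop).
    min_hosts_down: how many hosts must be down at once before it
                    counts as a network-wide outage (assumed threshold).
    """
    down = set()       # hosts currently failing ping
    windows = []       # completed (start, end) outage windows
    start = None       # start time of the window in progress, if any
    last_ts = None

    for ts, host, reachable in samples:
        last_ts = ts
        if reachable:
            down.discard(host)
        else:
            down.add(host)

        if start is None and len(down) >= min_hosts_down:
            start = ts                      # outage begins
        elif start is not None and len(down) < min_hosts_down:
            windows.append((start, ts))     # outage ends
            start = None

    # An outage still in progress at the end of the data is closed
    # at the last timestamp seen.
    if start is not None:
        windows.append((start, last_ts))
    return windows


if __name__ == "__main__":
    demo = [
        (0, "core-switch", True),
        (0, "esxi-01", True),
        (10, "core-switch", False),
        (11, "esxi-01", False),
        (300, "core-switch", True),
        (310, "esxi-01", True),
    ]
    print(outage_windows(demo))
```

The resulting start and end times can then be compared against vMotion task times in vCenter and against topology-change entries in the switch logs.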

The pattern I have started noticing is that when it goes wrong (again, not always), the vMotioned LR VM was always part of a mass vMotion, where DRS evacuates a host that is entering maintenance mode.

What it looks like: in vCenter, I see all of the ESXi server's VMs in a vMotioning state at various percentages, some already done but most not yet, and the LR VM is among the ones that are not done yet. Then the web browser freezes, and on my other monitor, where all those pings are running, the pings to all devices and hosts are dead within a couple of seconds.

Question: Am I doing something wrong that causes such total network outages when an LR VM is vMotioned? It happened twice this evening (we were patching the ESXi hosts), and the LR VMs were vMotioned around a few times. During two such vMotions, the entire network went down as described.

Does anyone else have such experiences?

9 Replies
rajeevsrikant
Expert

I faced a similar issue, but not exactly the same.

In my scenario, I had a cluster with 2 hosts.

Host 1 - Active DLR Control VM

Host 2 - Standby DLR Control

Anti Affinity Rule - Enabled.

I vMotioned the DLR Control VM to the other host while the anti-affinity rule was enabled.

The vMotion was successful, but DRS kicked in because of the anti-affinity rule and moved the DLR back to the original host.

During this time, I faced network disruption: a split-brain scenario occurred, which interrupted my network.

Can you check from the logs whether a split-brain scenario occurred in your environment?

srodenburg
Expert

So if I understand you correctly, both DLR VM's ended up on the same host?

2 questions:

- why would that cause a split-brain? I thought that was fixed in versions after 6.2.2

- why does a split-brain of DLR VM's cause a total network melt-down? (instead of just the VM's attached to Logical Switches)

rajeevsrikant
Expert

When the active DLR Control VM was migrated from host1 to host2, the processes on this DLR Control VM did not resume because the DRS anti-affinity rule prevented it.

The controller VM was in stun stage for a longer time.

During this time, communication between the active and standby DLR was lost, and the pair entered a split-brain scenario, which resulted in network interruption.

When the active DLR Control VM was migrated back to host1 (automatically by DRS), it recovered from the split-brain scenario and network connectivity was restored.

srodenburg
Expert

"The controller VM was in stun stage for a longer time."

I think this is what causes my issue when DRS, triggered by maintenance mode, evacuates all VMs at once. A whole lot of VMs are "almost done vMotioning / waiting to finish", and when a DLR VM (probably the active one) is stunned for too long, boooom !!

That must also be the reason why it does not happen when I vMotion the DLR VM away as a single VM (not conflicting with anti-affinity, of course) before putting the host into maintenance mode and triggering the mass vMotion.

I noticed issues especially on very large ESXi hosts, where tons of VMs are evacuated all at once. One sees dozens of VMs in various states of vMotion. I'll experiment with reducing the number of parallel vMotions in vCenter, because migrating so many VMs causes enough issues on its own. I've seen enough problems with vMotions being aborted and retried, as even beefy environments bite off more than they can chew...

rajeevsrikant
Expert

The findings in my scenario were checked with Technical Support, and they confirmed them.

Yours is probably a similar issue.

Let me know whether you were able to check the DLR logs related to split brain.

srodenburg
Expert

"Let me know where you able to check the DLR logs related to split brain."

Any hints as to what to look for in Log Insight? Particular strings like "blah blah active" on both DLR VMs at more or less the same time?

rajeevsrikant
Expert

lcp-daemon: [daemon.notice] ovs|00223|lcp|INFO|Detected the start of split-brain-recovery

You can search for the keyword split-brain.

Also run the command below to see the last up and last down times, and check whether they match the times at which you faced the issue:

show service highavailability internal bfd
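
If the DLR Control VM logs have been exported to plain text (for example from a syslog collector), the keyword search can also be scripted. This is only a minimal Python sketch: the function name `find_split_brain` and the second sample line are made up for illustration, and only the first sample line is the log message quoted above.

```python
import re

# Matches "split-brain", "split brain", "split_brain" or "splitbrain",
# in any letter case.
SPLIT_BRAIN = re.compile(r"split[-_ ]?brain", re.IGNORECASE)


def find_split_brain(lines):
    """Return (line_number, text) pairs for lines mentioning split-brain.

    lines: any iterable of log lines, e.g. an open text file.
    Line numbers start at 1.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        if SPLIT_BRAIN.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    sample = [
        "lcp-daemon: [daemon.notice] ovs|00223|lcp|INFO|"
        "Detected the start of split-brain-recovery\n",
        "some unrelated log line (made up for this example)\n",
    ]
    for number, text in find_split_brain(sample):
        print(number, text)
```

To scan a real export, pass an open file object instead of the sample list. The timestamps of any hits can then be compared with the last up and last down times shown by the command above.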

srodenburg
Expert

I searched for "split", "brain" and "split-brain", but found nothing. And as I upgraded to 6.3.5 yesterday, I destroyed any evidence on the DLR VMs, because they were deleted and redeployed in the new version.

Guess I'll have to wait for another incident to see what information I can extract.

rajeevsrikant
Expert

OK, got it.

Please share with us if you face a similar issue again.
