VMware Cloud Community
vmrulz
Hot Shot

Full Network Down Scenario and expectations?

We have a small 5-node vSAN 6.6 all-flash (VxRail 4.5.x) cluster in a remote office with fully redundant access-layer 10GbE Cisco 4500 switches. The network uplinks are configured active/standby (since Dell EMC doesn't support port channels for vSAN networking 😞 ).

We were doing some failure testing before the site goes live by powering off the A- and B-side switches one at a time. The Switch A uplinks went down, the Switch B uplinks took over, and all was good. The Switch A uplinks never came back up, so our network guys thought they'd force the issue by powering off Switch B. Now we're in a full network-down scenario.

Since each host is partitioned network-wise, vSAN is obviously unhealthy. Is it safe to just shut down the nodes, and will vSAN recover, including the VMs riding on it, once the network is healthy again?
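
For reference, this is roughly how I've been confirming the partition from each host (the vmk interface and target IP below are just placeholders for our environment):

esxcli vsan cluster get
# "Sub-Cluster Member Count" should show 5 when the cluster is healthy;
# a partitioned host typically only lists itself
vmkping -I vmk3 -d -s 1472 192.168.100.12
# vmk3 = the vSAN vmkernel port, target = another host's vSAN IP (both placeholders);
# -d -s 1472 also rules out MTU problems at the default 1500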

Thanks,

Ron

4 Replies
TheBobkin
Champion

Hello Ron,

Shouldn't be an issue regarding the data - on a daily basis I help fix clusters where someone or something broke the networking between nodes, and generally all that is required is to restore communication (and sometimes use vsan.fix_renamed_vms).
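
If it comes to that, the renamed-VM cleanup is done from RVC on the vCenter Server - the paths below are placeholders for your datacenter and cluster:

vsan.check_state /localhost/<Datacenter>/computers/<Cluster>
# lists inaccessible objects and VMs that need attention after the partition heals
cd /localhost/<Datacenter>/vms
ls
# note any VMs that show up as their .vmx path instead of their name
vsan.fix_renamed_vms <affected VM(s) from the listing above>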

No real benefit in powering down the nodes - having them all online and ready to sync is a better option.

Out of interest, do you have failback set to No on the vSwitch?
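
For a standard vSwitch you can check that straight from the host (vSwitch0 is a placeholder); on the VxRail distributed switch the equivalent Failback setting lives in the port group's Teaming and Failover policy in vCenter:

esxcli network vswitch standard policy failover get -v vSwitch0
# shows Load Balancing, Network Failure Detection, Failback, and the Active/Standby uplink order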

Bob

srodenburg
Expert

Don't worry. vSAN reacts to a "nobody can talk to anybody" situation by freezing I/O (just like any proper storage-cluster solution would in a full split-brain scenario).

Your VMs will poop themselves and hang or go zombie because their disks suddenly disappeared. Poof, gone.

As soon as network-connectivity is restored, vSAN will re-sync and clean itself up. Give it some time to straighten things out.
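
If you want to watch it happen, the resync can be followed from any host or from RVC once connectivity is back - the cluster path below is a placeholder:

esxcli vsan debug resync summary get
# per-host summary of objects left to resync and bytes remaining (vSAN 6.6 and later)
vsan.resync_dashboard /localhost/<Datacenter>/computers/<Cluster>
# RVC view of the same thing, cluster-wide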

And don't shut anything down. There is no need.

We actually do this sort of thing on purpose during "Production approval testing" at customers. We yank everything out, let vSAN freak out, plug it back in, let vSAN heal itself. It always works. No big deal.

Oh, and fire that networking-person. He's a <fill in something not so nice>...

vmrulz
Hot Shot

Thanks for the replies,

One interesting thing I've never seen before is that even after the access-layer switches fully came back up, none of the ESXi hosts would re-establish their uplink connections, so they all stayed in a partitioned state. Knowing vSAN was not happy, I reluctantly chose, in the interest of time, to reboot a host to see what would happen. Sure enough, the host's networking came back online, although the host took forever to boot while waiting on vSAN to initialize. I then rebooted the other hosts, knowing vSAN would be happier if all the kids came to play. Eventually all the hosts came back online. The guests were in an orphaned-like state since they'd had their plugs pulled.

Have you guys ever seen a situation where the ESXi hosts would not re-establish their uplink connections short of a reboot? It sounds like STP port blocking at the switch, but the network guys didn't see that.
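
For what it's worth, the less invasive thing I'd try before another full reboot is bouncing the affected uplink from the host itself - vmnic2 is a placeholder for whichever uplink shows link down:

esxcli network nic list
# confirm which vmnics actually show link down
esxcli network nic down -n vmnic2
esxcli network nic up -n vmnic2
# administratively bounce the uplink to force renegotiation with the switch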

We have a Cisco case open and I'm going to open a case through DellEMC since it is VxRail.

Ron

srodenburg
Expert

"since it is VxRail."

Uh oh. I have a VxRail customer with Cisco Nexus switches with the exact same problem. After a reboot, some ports on the switch crap out and go offline. The reason, as it turned out, is that the ESXi driver, in combination with the NIC firmware, has issues with auto-negotiation. The switch sees this as a flapping interface and shuts the port down.

A VxRail upgrade is scheduled to update both the NIC firmware and the driver inside ESXi.

Until then, as a workaround, we hard-set the link speed to 10000 Full Duplex inside vCenter/ESXi and have had no issues since.
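
For reference, the same check and workaround can be done per host from the CLI - vmnic2 is a placeholder for the affected uplink:

esxcli network nic get -n vmnic2
# shows driver and firmware versions plus the current auto-negotiation status
esxcli network nic set -n vmnic2 -S 10000 -D full
# hard-set 10 Gbit full duplex instead of relying on auto-negotiation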
