Re: 2-node 6.6.1 Cluster, vSAN dead because one host lost its NIC

srodenburg · ‎02-19-2018

Hello,

A customer had quite a scare the other day. One of their 2-node back-2-back ROBOs lost all access to the vSAN Datastore because one of the two nodes lost its "back-2-back NIC".

Both nodes have a dedicated dual-port 10G NIC, with port 1 connected to port 1 of the other node's card, port 2 to port 2, etc. You know what it looks like.

In one of the two nodes, that "back-2-back" PCI NIC completely died. Poof. Gone. As if there was no NIC in the chassis. Just gone. Within vCenter, the two vmNICs were still there in the vSwitch, but both were in a "dead/zombie" state (for lack of a better term).

Anyway, the other node noticed both 10G links going down, but vSAN was still up. All the other NICs (with the VM Port-groups etc.) were all right and everything else had connectivity. In other words, it behaved like a node that sees its partner node rebooting or being shut down. All fine and dandy. No impact to production whatsoever. Behaved as expected.

Then it happened.

The administrator rebooted the node with the broken card. After the reboot, both vmNICs were gone from ESXi, and this is when the other node decided to crap itself: all VMs were inaccessible, but the vSAN Datastore was still there. Browsing it, however, revealed that all files and folders were gone.

All the VMs were powered off too. So storage was completely gone.

The customer replaced the broken NIC and booted the ESXi server, and both vmNICs re-appeared. I assigned them to the vSAN Port-group and voilà, the vSAN Datastore suddenly showed its contents again and the VMs no longer said "inaccessible". We powered on the VMs and all was fine.

My question is this: what the h*ll happened?? One node loses its vSAN Port-group uplinks (the corresponding physical vmNICs disappear) and the other node loses access to the vSAN storage (empty datastore).

That all the cluster's VMs got killed is understandable, that's just vSAN-specific HA kicking in. But why on earth did the vSAN datastore die (as said, still there but all files and folders on it have disappeared), all just because one node's physical NICs disappeared from the vSAN vSwitch/Port-group?

Cluster's HA settings:

- HA: On

- Admission Control: disabled

- Host failure: restart VMs

- Host Isolation: Power off and restart VMs

- Datastore for heartbeating: Use datastores only from the selected list, no datastores selected

- Advanced settings: das.ignoreInsufficientHbDatastore = true

(these recommended settings are from Duncan Epping's Blog)

TheBobkin · ‎02-19-2018

Hello srodenburg,

Was the node that lost its NICs the 'preferred' Fault Domain?

Using vDS or standard switches?

vCenter running on this cluster or somewhere else?

"That said, this particular scenario, where a node is powered back on, with essentially a vSwitch with a vSAN enabled vmkernel-port but without physical NIC's attached to the vSwitch, is something I've personally never come across. It made the other (healthy) node lose it's marbles."

Did the healthy node lose its proverbial marbles when the other host came back up, as opposed to when it went down? I am going to guess when it came back up.

Potentially the NIC still had some network functionality but its driver/firmware was misbehaving.

What do the logs say about the healthy node's connection to the Witness at this time, and about the cluster state as per clomd?

"as said, still there but all files and folders on it have disappeared"

This is how it will appear if all Objects are in State 12/13 (e.g. only 1 of 3 components accessible).
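For the arithmetic behind that: vSAN keeps an object accessible only while more than 50% of its votes are reachable. A minimal sketch, assuming the simplest 2-node layout with one vote each for the two data copies and the witness component (real objects can carry different vote counts, and the function name is made up for illustration):

```python
def object_accessible(reachable_votes: int, total_votes: int) -> bool:
    """Return True only if strictly more than half of the votes are reachable."""
    return 2 * reachable_votes > total_votes


# Assumed layout: 3 votes in total (data copy on node A, data copy on node B, witness).
TOTAL_VOTES = 3

# Healthy node sees its own copy plus the witness: 2 of 3 votes.
print(object_accessible(2, TOTAL_VOTES))

# Only 1 of 3 votes counted as valid: the object goes inaccessible,
# and the datastore looks empty when browsed.
print(object_accessible(1, TOTAL_VOTES))
```

So a node that holds a copy and can reach the witness should normally keep its objects; the datastore only "empties" when it ends up with a single usable vote.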

Bob

srodenburg · ‎02-20-2018

Hello Bob,

"Was the node that lost its NICs the 'preferred' Fault Domain?"

Yes.

"Using vDS or standard switches?"

Standard vSwitches. One for Management and VM. The second vSwitch is used for vMotion and vSAN (each on an active/passive setup but it's the same dual-port PCIe NIC)

vmk0 is configured for witness traffic.

"vCenter running on this cluster or somewhere else?"

Somewhere else (in the main environment which is a vBlock)

"Did the healthy node lose its proverbial marbles when the other host came back up, as opposed to when it went down?"

When the problem-node came back up. It came back with a vSwitch without any nic-ports whatsoever (so not half-dead like before the reboot).

According to the logs, we never lost access to the witness appliance (which runs in a 3rd datacenter).

Again, the question that races through my mind is: why does the "good" node even care about the other "broken" node? It has a mirror copy and it has access to the witness, so 2/3 of the components are available. Why does it care that the other guy is foaming at the mouth?

Could it be that, as they could still communicate over their management interfaces, they exchanged information leading to the "good" node getting confused or something? The physical back2back NIC ports were gone on the "broken" node, so the vSAN network between the nodes was down. But that is the same situation as when you reboot a node for regular maintenance.

Could we be talking about a split-brain / partition? Both nodes had perfect connectivity to the outside world, incl. the Witness appliance. It was only the inter-node vSAN communication that was down.

So essentially, both nodes are in an identical situation. Neither can communicate vSAN traffic to the other guy, but both can talk to everything else. It's like pulling out both back2back cables, isn't it? That raises the question: how will the Witness appliance decide which node should win? Will it select the node in the preferred fault domain? Both nodes have the same issue of both NIC ports being down (the reason why is not that relevant). Is this a "double failure" against which there is no protection?

depping · ‎02-21-2018

"That raises the question: how will the Witness appliance decide which node should win? Will it select the node in the preferred fault domain? Both nodes have the same issue of both NIC ports being down (the reason why is not that relevant). Is this a "double failure" against which there is no protection?"

If both nodes still have access to the witness then the preferred fault domain always wins.

depping · ‎02-21-2018

What is the isolation address, by the way? Is this an IP on the management network or the vSAN network?

srodenburg · ‎02-21-2018

It's the regular default gateway (we only use the default IP-stack).

According to your own blog, it's irrelevant in a 2-node back2back cluster (or am I a donkey for interpreting your writings incorrectly?).

srodenburg · ‎02-21-2018

"If both nodes still have access to the witness then the preferred fault domain always wins."

Exactly. And it was the preferred FD node that lost the NIC. But that should be irrelevant: effectively, both nodes lost all back2back connections to each other, so the preferred FD node should still have won after it came back. Instead, the datastore died as soon as the pref. FD node came back up after the reboot, and all files and folders visually disappeared (but came back after the NIC in the pref. FD node was replaced and the node was powered back on).

srodenburg · ‎02-27-2018

Discussion seems to have died. Pity. Maybe we can chat about it at the VMUG on March 8th in Switzerland. See you there Duncan.

In the meantime I'm building a 2-node cluster in our lab to see if I can reproduce the issue. I'll simply yank out both back2back cables at the same time, simulating a total vSAN network failure while both nodes can still talk to the witness appliance via their vmk0 vmkernel interfaces.

I'll share the results when I'm done. It will take a couple of days though, as I'm very busy.

Matt217 · ‎03-09-2018

Any update?

I too am experiencing this in my home lab. The ESXi management interfaces for both hosts and the witness are in their own VLAN and IP space, with vSAN traffic on another. Yet if both hosts lose connectivity to one another, the node holding all the VMs (the secondary in this case) will drop the vSAN datastore and its VMs, even if the primary node has 0 VMs. In this last instance, it was due to VUM patching the primary node: all was well until the primary node began responding to pings after the reboot, at which point the secondary node dropped storage. Not good...

srodenburg · ‎03-10-2018

I am able to reproduce this in the lab with 100% repeatability.

I talked to Duncan about it when he was in Switzerland last week. The root issue is, simply put:

- The preferred node is away (reboot) during the split brain. Nothing bad happens at this point.

- The secondary node lives on, and data is of course updated only there. The data on the downed preferred node gets older every second.

- The witness appliance now only sees the secondary, which has valid data, so the datastore remains intact. That is why VMs keep running as long as the preferred node does not come back.

- The preferred node comes back after its reboot (both are still in split brain at this point).

- The preferred node reports in with the witness appliance and says "I'm the boss", the witness goes "wtf??" and invalidates the data on the secondary, causing all VMs to lose access to their disks.

- As soon as the vSAN network (direct-connect) is restored, the split brain is gone, everybody now understands the bigger picture and the (good) data on the secondary is regarded as the valid data (because it is).

- At this point, VMs are automatically started by HA and all is good again.

All of this is exactly what I see happening in my Lab.
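A minimal sketch of that sequence as a toy model in Python. It only encodes the behaviour described in this thread (while partitioned, the witness sides with the preferred node whenever that node is up, and the datastore is usable only if the winner holds the latest data). It is not how vSAN is implemented internally, and the class, field and function names are invented for illustration:

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    preferred: bool
    up: bool = True
    data_version: int = 0  # hypothetical counter standing in for "how fresh is the data"


def partition_winner(nodes):
    """While the vSAN link is down: the preferred node wins if it is up, else any node that is up."""
    up_nodes = [n for n in nodes if n.up]
    if not up_nodes:
        return None
    for n in up_nodes:
        if n.preferred:
            return n
    return up_nodes[0]


def datastore_accessible(nodes):
    """Toy rule: the datastore is usable only if the winner holds the newest data."""
    winner = partition_winner(nodes)
    latest = max(n.data_version for n in nodes)
    return winner is not None and winner.data_version == latest


a = Node("ESXi-A", preferred=True)
b = Node("ESXi-B", preferred=False)

# 1) vSAN link is down and the preferred node reboots; writes land only on the secondary.
a.up = False
b.data_version += 1
while_a_is_down = datastore_accessible([a, b])

# 2) The preferred node returns, still partitioned, with stale data, and wins the tie-break.
a.up = True
after_a_returns = datastore_accessible([a, b])

# 3) The direct-connect link is restored and the preferred node resyncs from the secondary.
a.data_version = b.data_version
after_link_restored = datastore_accessible([a, b])

print(while_a_is_down, after_a_returns, after_link_restored)
```

The middle step is the one that bit the customer: the node that wins the tie-break is the one with the old data.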

Also remember that I, or the customer, never lost data.

Duncan told me this situation will be resolved in vSAN 6.7 (remember, vSAN version has little to do with the ESXi version).

In the meantime, he asked me to try the following: when, from a vSAN network perspective, both nodes lose sight of one another and the split brain effectively happens, change the cluster Fault Domain config before bringing the preferred node back up, and swap the node in the preferred Fault Domain with the one in the secondary.

Pref. FD ESXi-A -> Second. FD

Second. FD ESXi-B -> Pref. FD

Then bring the old preferred node back up. When it comes up and reports in with the witness and with vCenter, it is pushed into the role of secondary. The old secondary was promoted to preferred just before, so its data is regarded as "the winner" and the datastore is not brought down by the self-protect mechanism, which shuts everything down to avoid data corruption.
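As a rough illustration of why the swap helps, here is a small Python sketch. It assumes the simplified rule from this thread (with both nodes up but partitioned, the preferred fault domain wins, and the datastore survives only if that winner is not the node with stale data). The dictionary layout and function names are hypothetical, not a vSAN or vCenter API:

```python
def swap_fault_domains(fd: dict) -> dict:
    """Return a new mapping with the preferred and secondary hosts exchanged."""
    return {"preferred": fd["secondary"], "secondary": fd["preferred"]}


def datastore_survives_return(fd: dict, stale_host: str) -> bool:
    """Toy rule: the preferred host wins the tie-break, so it must not be the stale one."""
    return fd["preferred"] != stale_host


# ESXi-A was rebooted during the partition, so its copy of the data is stale.
stale_host = "ESXi-A"
fault_domains = {"preferred": "ESXi-A", "secondary": "ESXi-B"}

# Without the swap, the returning stale node is still the preferred one.
without_swap = datastore_survives_return(fault_domains, stale_host)

# With the swap done before ESXi-A boots, ESXi-B (fresh data) is preferred.
fault_domains = swap_fault_domains(fault_domains)
with_swap = datastore_survives_return(fault_domains, stale_host)

print(without_swap, with_swap)
```

The order matters: the swap has to happen while the stale node is still down, otherwise it reports in as preferred first.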

Mind you, an EMC VPLEX and many other stretched-data-cluster technologies do exactly the same thing. If the VPLEX witness cannot determine, with 100% certainty, which side should win, it halts all IO and lets the human decide where IO should be resumed. Being 99% sure is not enough. It's your data.

This concept of temporarily swapping the hosts in the two fault domains is what will be automated in some way in 6.7. The exact details he could not disclose, which I respect.

I will try this in my lab. As said, I can reproduce "what happened" at this customer with 100% repeatability and will write the results of the suggested workflow here.

Matt217 · ‎03-12-2018

Thank you for the response... and a timely one at that.

I too can say without a doubt that this is happening in my lab as well, and 100% of the time.

I have been manually moving the primary role around and this is working for now. I manually change the role after one or the other node is in maintenance mode (depending on which is being brought back online) so I am assured that the latest data has been replicated. I'm not certain if this is necessary, but I am playing it safe until a fix is in place. I'm amazed that this wasn't caught in QA... a basic offline/online operation of both nodes (one at a time of course) would have caught this.

Thanks again.

depping · ‎03-13-2018

Our apologies for that, sometimes these use cases / failure scenarios slip through the cracks. We know what the problem is, and it will be solved in an upcoming release. Thanks,

zid01 · ‎08-12-2019

Hi,

Is there any update on this issue, and is it fixed in vSAN 6.7?

"Duncan told me this situation will be resolved in vSAN 6.7"
