A customer had quite a scare the other day. One of their 2-node back-2-back ROBO sites lost all access to the vSAN datastore after one of the two nodes lost its "back-2-back NIC".
Both nodes have a dedicated dual-port 10G NIC, with port 1 connected to port 1 of the other node's card, port 2 to port 2. You know what it looks like.
In one of the two nodes, that "back-2-back" PCI NIC completely died. Poof. Gone. As if there was no NIC in the chassis. Just gone. Within vCenter, the two vmNICs were still there in the vSwitch, but both were in a "dead/zombie" state (for lack of a better term).
Anyway, the other node noticed both 10G links going down, but vSAN was still up. All the other NICs (with the VM port groups etc.) were all right and everything else had connectivity. In other words, it behaved like a node that sensed its partner node rebooting or being shut down. All fine and dandy. No impact to production whatsoever. Behaved as expected.
Then it happened.
The administrator rebooted the node with the broken card. After the reboot, both vmNICs were gone from ESXi, and this is when the other node decided to crap itself: all VMs were inaccessible, but the vSAN datastore was still there. Browsing it, however, revealed that all files and folders were gone.
All the VMs were powered off too. So storage was completely gone.
The customer replaced the broken NIC and booted the ESXi server, and both vmNICs reappeared. I assigned them to the vSAN port group and voilà, the vSAN datastore suddenly showed its content again and pang, the VMs no longer said "inaccessible". We powered on the VMs and all was fine.
My question is this: what the h*ll happened?? One node loses its vSAN port group uplinks (the corresponding physical vmNICs disappear) and the other node loses access to the vSAN storage (empty datastore).
That all the cluster's VMs got killed is understandable; that's just vSAN-specific HA kicking in. But why on earth did the vSAN datastore die (as said, still there, but all files and folders on it had disappeared), all just because one node's physical NICs disappeared from the vSAN vSwitch/port group?
Cluster's HA settings:
- HA: On
- Admission Control: disabled
- Host failure: restart VMs
- Host Isolation: Power off and restart VMs
- Datastore for heartbeating: Use datastores only from the selected list, no datastores selected
- Advanced settings: das.ignoreInsufficientHbDatastore = true
(These recommended settings are from Duncan Epping's blog.)