Solved: Cluster failure after failed host

atoerper · ‎05-17-2017

We are seeing issues with our 2-node ROBO cluster after one of the hosts failed and came back online.

Specifically, the previously failed host is reported as being in a separate partition, and the vSAN datastore shows only half of the capacity it should.

Many cluster health checks are failing, even though the nodes can ping each other and the witness over the vSAN networks.

TheBobkin · ‎05-17-2017

Hello,

Check that the host is not in vSAN Maintenance Mode (it can be in this state regardless of the Maintenance Mode state shown in vCenter):

# cmmds-tool find -t NODE_DECOM_STATE -f json

This should show "decomState": 0 and "decomJobType": 0 for all hosts.
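With several hosts to check, it can be easier to let a small filter read the output than to scan it by eye. The following is only a sketch: `check_decom_state` is a hypothetical helper name, not a vSAN tool, and it assumes a POSIX shell whose `grep` supports `-o` (for example a workstation where you have saved the output). It looks only at the `decomState` values and tolerates both plain and backslash-escaped quotes.

```shell
# Hypothetical helper (not part of vSAN): reads the output of
#   cmmds-tool find -t NODE_DECOM_STATE -f json
# on stdin and reports whether every entry has decomState 0.
check_decom_state() {
    # Extract each decomState value; the quotes may be escaped as \" in the output.
    states=$(grep -o 'decomState[\\"]*: *[0-9][0-9]*' | grep -o '[0-9][0-9]*$' || true)
    if [ -z "$states" ]; then
        echo "no decomState entries found"
        return 2
    fi
    for s in $states; do
        if [ "$s" -ne 0 ]; then
            echo "at least one host reports decomState $s (not 0)"
            return 1
        fi
    done
    echo "all hosts report decomState 0"
    return 0
}

# Example use (assumes the helper is defined in the current shell):
#   cmmds-tool find -t NODE_DECOM_STATE -f json | check_decom_state
```

A non-zero result only tells you to look closer at the full output; it does not say which host is affected.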

What version of vSAN are you using?

Check multicast connectivity if you are using 6.5 or lower.

What responses do you see when you run these commands on each site?

tcpdump-uw -i <VMk used for vSAN> -s0 udp port 23451

tcpdump-uw -i <VMk used for vSAN> -s0 udp port 12345
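If you want both captures to stop on their own, something like the sketch below may help. It wraps the two commands above in a loop and adds the standard tcpdump `-c` packet-count option. The function name `vsan_mcast_capture`, the interface name `vmk1`, and the default count of 10 are assumptions for illustration; use the VMkernel interface that actually carries vSAN traffic on your host.

```shell
# Sketch: run a short capture on each of the two ports used above.
# vsan_mcast_capture is a hypothetical helper, not a VMware command.
vsan_mcast_capture() {
    vmk="$1"          # VMkernel interface used for vSAN, e.g. vmk1 (assumption)
    count="${2:-10}"  # stop each capture after this many packets (tcpdump -c)
    for port in 23451 12345; do
        cmd="tcpdump-uw -i $vmk -s0 -c $count udp port $port"
        if command -v tcpdump-uw >/dev/null 2>&1; then
            echo "== $cmd"
            $cmd
        else
            # Not on an ESXi host: only print what would be run.
            echo "would run: $cmd"
        fi
    done
}

# Example use on a host:
#   vsan_mcast_capture vmk1 10
```

Note that with `-c` the capture still waits until that many packets arrive. If nothing shows up on a port and you have to stop it with Ctrl+C, that silence is itself a useful result to report back.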

If you are using 6.6 and have unicast configured, check unicast traffic instead.

Bob

atoerper · ‎05-17-2017

There was a multicast issue on the physical switches.
