VMware Cloud Community
atoerper
Enthusiast

Cluster failure after failed host

We are seeing issues with our 2-node ROBO cluster after one of the hosts failed and came back online.

Specifically, the previously failed host is reported as being in a separate partition, and the vSAN datastore shows only half of its expected capacity.

Many cluster health checks are failing. The nodes can ping each other and the witness over the vSAN networks.


Accepted Solutions
atoerper
Enthusiast

The cause turned out to be a multicast issue on the physical switches.

TheBobkin
Champion

Hello,

Check that the host is not in vSAN Maintenance Mode (it can be in this state regardless of the Maintenance Mode state shown in vCenter):

cmmds-tool find -t NODE_DECOM_STATE -f json

The output should show "decomState": 0 and "decomJobType": 0 for every host.
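If the cluster has more than a couple of hosts, scanning that output by eye gets tedious. Here is a rough sketch of a filter for it. The function name `decom_nonzero` is made up for illustration, and it assumes the decomState value appears on one line either with plain or backslash-escaped quotes, which may differ between builds.

```shell
# Hypothetical helper: reads cmmds-tool JSON on stdin and prints only the
# lines where decomState is non-zero. Prints nothing if every host is at 0.
# Assumption: the value is written as "decomState": N or \"decomState\": N.
decom_nonzero() {
  grep -E 'decomState\\?": *[1-9]'
}

# Intended use on an ESXi host (not run here):
#   cmmds-tool find -t NODE_DECOM_STATE -f json | decom_nonzero
```

Any line it prints would point to a host that vSAN still considers to be in (or entering) maintenance mode.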

What version of vSAN are you using?

If you are on vSAN 6.5 or lower, check multicast connectivity.

What output do you see when you run these commands at each site?

tcpdump-uw -i <VMk used for vSAN> -s0 udp port 23451

tcpdump-uw -i <VMk used for vSAN> -s0 udp port 12345
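To run both captures back to back on each host, something like the following sketch might save some typing. The function name `vsan_mcast_capture` and the `TCPDUMP` override variable are made up for illustration, and it assumes `tcpdump-uw` accepts the standard tcpdump `-c <count>` option to stop after a fixed number of packets.

```shell
# Hypothetical wrapper around the two captures above.
# TCPDUMP can be overridden (e.g. for a dry run); defaults to tcpdump-uw.
TCPDUMP="${TCPDUMP:-tcpdump-uw}"

vsan_mcast_capture() {
  vmk="$1"            # VMkernel interface used for vSAN, e.g. vmk1
  count="${2:-10}"    # packets to capture per port before moving on
  for port in 23451 12345; do
    echo "== udp port $port on $vmk =="
    # Assumption: -c behaves as in standard tcpdump (exit after N packets).
    "$TCPDUMP" -i "$vmk" -s0 -c "$count" udp port "$port"
  done
}

# Intended use on an ESXi host (not run here):
#   vsan_mcast_capture vmk1 10
```

If a capture sits there without printing any packets, that in itself suggests multicast traffic is not reaching the host on that port; interrupt it with Ctrl+C and compare against the other node.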

If you are on 6.6 and have unicast configured, check unicast traffic instead.

Bob

