VMware Cloud Community
Samuris
Contributor

vSAN Dual Partition

After a recent full stack shutdown and restart, our 2-Node vSAN witness came back up listed in both partition 1 and partition 2. This is causing an error on our Skyline Health page. The two entries have an identical hostname and an identical host UUID; the witness is just listed twice. How can I remove it from partition 2?

2 Replies
Tibmeister
Expert

There are some fun instructions on how to do this manually, but what I've found easiest is to keep a second witness appliance in the environment and move the cluster to the new witness when a network partition occurs. Unfortunately, I have two sites with pretty bad WAN links and have to do this at least once a week, moving clusters between my two witness appliances to resolve the cluster partition. I also had one cluster do this after a cluster shutdown; that was due to a DNS issue where the hosts could not perform a DNS lookup for the witness appliance, because I had set the hosts' primary DNS to a VM in that same cluster. That was pure PEBKAC, but the resolution was the same: transfer to the other witness appliance and all was happy again.

Having the second appliance is also nice if you have multiple 2-node clusters and want to do a full stack shutdown, because the wizard will shut down the witness appliance for the cluster as well. Being able to keep the cluster being shut down on its own appliance, with the others on a separate one, is absolutely a requirement in my book.

TheBobkin
Champion

@Samuris Do you still have this issue? Asking because the question was posted weeks ago (I only noticed it now when @Tibmeister replied to it).

Typically, when this occurs it is due to one-way-only communication between the data node(s) and the Witness - the health check shows this because from the data nodes' perspective the cluster is fully formed (3 members), whereas from the Witness's perspective it is isolated (1 member).
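To make the asymmetry concrete, here is a toy model in Python. It is only an illustration of the reasoning above, not how the health check is actually implemented; the host names and the `one_way_pairs` helper are hypothetical. Each host reports the set of members it believes are in its cluster, and the function lists every pair where one side counts the other as a member but not the reverse.

```python
from typing import Dict, FrozenSet, List, Tuple

# Hypothetical membership views: what each host believes the cluster contains.
views: Dict[str, FrozenSet[str]] = {
    "node-a": frozenset({"node-a", "node-b", "witness"}),  # sees 3 members
    "node-b": frozenset({"node-a", "node-b", "witness"}),  # sees 3 members
    "witness": frozenset({"witness"}),                     # sees only itself
}


def one_way_pairs(views: Dict[str, FrozenSet[str]]) -> List[Tuple[str, str]]:
    """Return sorted (a, b) pairs where a lists b as a member but b does not list a."""
    pairs = []
    for host, members in views.items():
        for other in members:
            if other == host:
                continue
            # If 'other' has no view recorded, treat it as seeing nobody.
            if host not in views.get(other, frozenset()):
                pairs.append((host, other))
    return sorted(pairs)


print(one_way_pairs(views))
```

With the views above, both data nodes count the witness as a member while the witness counts neither of them, which matches the "3 members versus 1 member" picture.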

 

This is often due to firewall rules or something else preventing proper communication in both directions. To validate that this is the cause, check with netcat ('nc' on ESXi) on ports UDP 12321 and TCP 2233, both from and to the Witness.
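If you would rather script the TCP half of that check from a machine with Python, a minimal sketch could look like the following. The `tcp_port_open` helper and the example address are assumptions for illustration only. It covers only TCP 2233: UDP is connectionless, so a plain connect cannot confirm that UDP 12321 gets through, and 'nc' on the hosts themselves remains the check for that port. It also tests only the direction it is run from, so it would need to be run from both sides.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        # create_connection performs the full TCP handshake or raises OSError
        # (refused, unreachable, or timed out).
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `tcp_port_open("192.0.2.10", 2233)` (a placeholder address standing in for the Witness's vSAN IP) returns `False` if a firewall drops or rejects the connection in that direction.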

I have also seen cases where incorrect traffic-type tagging on the Witness results in this, e.g. tagging both vsan and witness type traffic on the Witness - it should ONLY ever have vsan type traffic tagged.
