VMware Cloud Community
slekkus
Contributor

"Witness host not found" alert with witness partition (just one side).

Hi, for a stretched vSAN metro cluster with two data sites and one witness site, the Skyline health check raises an alert that reads "Witness host not found" when we partition the witness communication. While partitioned, I can still ping the witness appliance with esxcli network diag ping (using the vSAN vmknic) from the site where the link to the witness is still active.

For planned maintenance we have to restart core switches, which will sever the witness link as well as the link to the other data site.

Our assumption is that because one data site will still be communicating with the Witness, the cluster will stay up. But the alert description in Skyline reads as if the Witness appliance cannot be reached from either data site.

Is there any info on why the alert is so generic and what can be expected?

We run vSphere 6.5u3 (which is vSAN 6.6?).

 

witnessnotfound.jpg

Accepted Solutions
depping
Leadership

No, you shouldn't lose the cluster, assuming that one site indeed still has access to the witness. As long as it maintains that connection, the VMs in that location will continue to run.

Why the alert is so generic, I don't know. Also, you are running a version that is no longer supported; please consider updating/upgrading.
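
The reasoning in this answer can be sketched as a small model. This is a deliberately simplified illustration, not how vSAN is implemented: real vSAN decides accessibility per object from component votes. The names `site_a`, `site_b`, `witness`, `link`, and `accessible_data_sites` are hypothetical, and the `preferred` parameter is an assumption that stands in for the preferred-site setting of a stretched cluster.

```python
def link(a, b):
    """Represent an undirected network link between two fault domains."""
    return frozenset((a, b))


def accessible_data_sites(links_up, preferred="site_a"):
    """Return the data sites that can still form a majority (2 of 3 fault domains).

    Simplified model: a data site stays accessible if it can talk to the
    other data site or to the witness. If only the inter-site link is down,
    the witness is assumed to side with the preferred site.
    """
    isl = link("site_a", "site_b") in links_up
    a_w = link("site_a", "witness") in links_up
    b_w = link("site_b", "witness") in links_up

    if isl:
        # Both data sites see each other: 2 of 3 fault domains, with or without witness.
        return {"site_a", "site_b"}
    if a_w and b_w:
        # Inter-site link down, witness reachable from both: preferred site wins.
        return {preferred}
    if a_w:
        return {"site_a"}
    if b_w:
        return {"site_b"}
    # No site can reach a second fault domain: nothing holds a majority.
    return set()


# The maintenance scenario from the question: only site A still reaches the witness.
maintenance = {link("site_a", "witness")}
print(accessible_data_sites(maintenance))  # {'site_a'}
```

In this model the maintenance scenario leaves site A and the witness as the majority, which matches the expectation that the VMs in that location keep running.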

slekkus
Contributor

Hi depping, we reasoned the same: why would the software give up the cluster while communication is still available between one site and the witness? We were just put off by the stupid Skyline alert and did not dare to do the maintenance. It is what it is 🙂

TheBobkin
Champion

@slekkus, most vSAN/Skyline Health checks query from every node's perspective. For example, the nodes in the isolated site report (correctly) that they cannot communicate with the Witness; this is likely what you observed.

In such a case you can easily validate that the remaining site can still communicate with the Witness via the cluster partition and data health checks (e.g. everything is still accessible but in a reduced-availability state).
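
To illustrate why a per-node check can look like a cluster-wide failure, here is a minimal sketch. It assumes the alert goes red as soon as any single node reports that it cannot reach the witness; that rule is an assumption for illustration, not the documented Skyline logic. The host names and the `witness_alert` function are hypothetical.

```python
def witness_alert(reachability):
    """Aggregate per-node results into a single alert status.

    reachability maps a host name to True if that host can reach the witness.
    Assumption: one failing node is enough to turn the whole check red.
    """
    failing = sorted(host for host, ok in reachability.items() if not ok)
    status = "red" if failing else "green"
    return status, failing


# Site A still reaches the witness, site B is cut off.
status, failing = witness_alert({
    "esx-a1": True,
    "esx-a2": True,
    "esx-b1": False,
    "esx-b2": False,
})
print(status, failing)  # red ['esx-b1', 'esx-b2']
```

The point of the sketch is that the single status hides which hosts failed; the list of failing hosts is what tells you whether one site or both have lost the witness.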

slekkus
Contributor

Hi, just reporting back in case someone stumbles upon this.

Severing the ISL between site A and B did not result in any loss. On the contrary, the "Witness host not found" alert in Skyline turned green.

depping
Leadership

By the way, since many people were asking for such a thing, I created a fairly extensive failure scenario matrix and documented it here:

https://www.yellow-bricks.com/2023/05/30/vsan-stretched-cluster-failure-matrix/
