VMware Cloud Community
danstr1
Contributor

2 - Node VSAN Cluster - Witness Failure

I'm currently in the midst of upgrading our server infrastructure. This is for a small company but they'd like as much redundancy as possible. This has led me to explore a 2-node VSAN as our environment is already VMware based. I understand that this also requires a third host serving as a witness. Most of the related documentation I found deals primarily with larger implementations and also discusses what occurs during host failures etc but I've found little that addresses what happens when a witness fails with a simple 2-node implementation. Does the Witness represent a single point of failure? What happens if the Witness is offline/down as there's no redundancy for it in this implementation? Will the servers continue working normally or would it isolate the servers to protect them until the witness is back online? Would any down time be experienced by the users? If things function normally without the witness, what would the time window be to get the witness back online?

The hosts and the witness would all be running locally in a single data center. I've seen where the 2-node approach is often used for branch or remote offices. This wouldn't be the case for us as it represents our "data center." Pursuing the minimum of one more host would provide redundancy for the witness but this isn't feasible financially with the additional hardware and software costs. Basically, I have to weigh the risks of the 2-node or stick with a traditional 2 hosts running on a direct attached SAN.

Thanks in advance!

Accepted Solutions
TheBobkin
Champion

Hello danstr1

Welcome to Communities! Some useful info on participating here:

https://communities.vmware.com/docs/DOC-12286

"This has led me to explore a 2-node VSAN as our environment is already VMware based"

Are you considering re-purposing existing servers or acquiring new metal? If the former then please do ensure that the servers a) meet the ESXi supported HCL requirements for the version of ESXi you plan on running and b) have the necessary slots etc. to accommodate the required local storage, e.g. enough slots for the disk groups you plan on implementing and vSAN-HCL certified controller(s).

"Most of the related documentation I found deals primarily with larger implementations and also discusses what occurs during host failures etc but I've found little that addresses what happens when a witness fails with a simple 2-node implementation."

There is plenty of documentation available that is not just about larger stretched clusters (and either way the outcome is essentially the same):

Overview (of what it sounds like you are considering):

https://blogs.vmware.com/virtualblocks/2016/10/18/2nodedirectconnect/

Not directly applicable but gives good overview supporting information for what you intend here:

https://www.vmware.com/content/dam/digitalmarketing/vmware/en/pdf/vcat/vmware-virtual-san-two-node-a...

In-depth technical information:

https://storagehub.vmware.com/export_to_pdf/vsan-stretched-cluster-2-node-guide

"Does the Witness represent a single point of failure?"

No - provided both data nodes remain accessible from a vSAN perspective then all data Objects stored on these have a) Access to a full data-set and b) Quorum (>50% of components available).
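
To make a) and b) concrete, here is a minimal Python sketch of that accessibility rule. It assumes a simplified model in which each Object has one data component per data node plus one witness component, each carrying a single vote. The names (`Component`, `is_accessible`, `OBJECT_LAYOUT`, the location strings) are hypothetical and purely illustrative; real vSAN vote assignment can be more involved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    location: str      # "node-a", "node-b" or "witness" (illustrative names)
    holds_data: bool   # True for a data replica, False for a witness component
    votes: int = 1     # assumption: one vote per component


def is_accessible(components, failed_locations):
    """Return True if the object has a) a full data replica and b) >50% of votes."""
    alive = [c for c in components if c.location not in failed_locations]
    has_full_data = any(c.holds_data for c in alive)
    total_votes = sum(c.votes for c in components)
    alive_votes = sum(c.votes for c in alive)
    has_quorum = alive_votes * 2 > total_votes  # strictly more than half
    return has_full_data and has_quorum


# Simplified 2-node layout: a mirror across both data nodes plus a witness component.
OBJECT_LAYOUT = [
    Component("node-a", holds_data=True),
    Component("node-b", holds_data=True),
    Component("witness", holds_data=False),
]

if __name__ == "__main__":
    for failed in [set(), {"witness"}, {"node-a"}, {"witness", "node-a"}]:
        print(sorted(failed), is_accessible(OBJECT_LAYOUT, failed))
```

In this model, losing only the witness leaves 2 of 3 votes and both replicas, so the Object stays accessible; it only becomes inaccessible if a data node is lost on top of that.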

"What happens if the Witness is offline/down as there's no redundancy for it in this implementation?"

Nothing happens - all we have lost is the components responsible for being tie-breaker, we still have a) & b) as per above.

"Will the servers continue working normally or would it isolate the servers to protect them until the witness is back online?"

They will continue working normally. As we still have a) and b), quorum has not been lost, so there is no chance of split-brain and isolation would not occur.

"Would any down time be experienced by the users? If things function normally without the witness, what would the time window be to get the witness back online?"

No, as a) and b) still hold. You are essentially running as FTT=0 until you get the Witness back (a further failure of either data node could not be tolerated), but a Witness is fairly easy and fast to re-deploy if the original is somehow kaput. Rebuild time is minimal too, as each witness-component (per Object) is 16MB.
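
As a rough back-of-the-envelope illustration of why that rebuild is small, the sketch below multiplies an Object count by the 16MB witness-component size mentioned above. It assumes exactly one witness component per Object, and the function name and the Object count of 500 are hypothetical values chosen only for the example.

```python
WITNESS_COMPONENT_MB = 16  # per-Object witness component size quoted above


def witness_rebuild_mb(object_count: int) -> int:
    """Approximate data to recreate on a re-deployed Witness, in MB."""
    if object_count < 0:
        raise ValueError("object_count must be non-negative")
    return object_count * WITNESS_COMPONENT_MB


if __name__ == "__main__":
    objects = 500  # hypothetical number of vSAN Objects in a small cluster
    mb = witness_rebuild_mb(objects)
    print(f"{objects} objects -> about {mb} MB ({mb / 1024:.1f} GB) of witness components")
```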

"The hosts and the witness would all be running locally in a single data center. I've seen where the 2-node approach is often used for branch or remote offices. This wouldn't be the case for us as it represents our "data center." Pursuing the minimum of one more host would provide redundancy for the witness but this isn't feasible financially with the additional hardware and software costs."

Run the Witness as an appliance: it comes with a free license, and the host running it can run other things too.

As per the detailed vsan-stretched-cluster-2-node-guide you should consider how you are going to run the network to this e.g. L2/L3, and/or separate vmk for witness-traffic and vsan-traffic (Witness Traffic Separation).

"Basically, I have to weigh the risks of the 2-node or stick with a traditional 2 hosts running on a direct attached SAN."

Not to be a naysayer of the product I love and support 24/7 but if your current SAN solution is reliable and/or 'performant' for the workload it is running on then do consider if the money spent on licensing, upgrades/additions could be better spent elsewhere. That being said, if you want a cost-effective fast solution then (depending on the metal!) vSAN is a great solution, especially if you potentially plan to scale this out/up as this is far easier with HCI.

Feel free to ping me if you have any specific questions regarding clarification of any of the above (either via here or via a Support Request as I am VMware-GSS).

Bob

danstr1
Contributor

Thanks so much for your thorough response Bob! That clarified a great deal and gives me more confidence as I continue to consider VSAN as a solution. Dell has assisted me in the server builds to ensure their compliance with the VSAN HCL. The servers and VSAN include their ProDeploy services so they'll be assisting with the deployment and ongoing support. The servers will interface directly via 10G connections while the witness/monitoring aspect will be through our standard 1G switch, separating the traffic. In terms of our IT department, I'm it. I have the costs in line, so financially it's feasible. My apprehensions now deal primarily with my lack of resources and fear of the unknown. I'll continue reading the links you provided to gain a more thorough understanding as I make the final decision.

Thanks again!