I'm trying to nail down concepts about the VSAN architecture and resilience implementation.
So far I've been reading a lot, but still don't quite get some facts about the architecture.
Just to recap, VSAN provides storage with flexible services such as resilience (to failures of disks/hosts/network), performance (via parallel stripes across different disk groups, % of SSD) and safety (via % of space reservation), based on ESXi "native" services (i.e. no appliance involved) and resources (SSDs, disks, 1Gbps NIC).
Services are defined via policies on objects.
The implementation requires at least 3 hosts providing disk resources, and a minimum disk resource (group) is composed of 1 SSD and at least 1 disk.
Now, things start to get less firm for me.
i) Resilience is stated as the number of failures to tolerate. Does that mean a failure of any of disk group/host/network?
Most of the examples I've seen are for FTT = 1; the interesting cases are FTT = 2 or 3.
ii) Given that it tolerates network failures, an involved scheme using quorum is used. I guess that because of that, the number of hosts required goes from n+1 to 2n+1, but I have not been able to find a document that explains how it works. It would seem that n+1 replicas on different hosts are needed and the remainder up to 2n+1 is covered by witnesses (on different hosts), but again, I have not seen a document that clearly states so. I have not seen any requirements for network redundancy either.
iii) Were this to be correct, a FTT = 2 would need 3 replicas and 2 witnesses. In this scenario, a network failure could create a partition, say 2/3 hosts. Given that votes seem to have the same weight, it could be that majority is obtained in a 3 component partition. I assume that the 2 component partition will safely shutdown. But the 3 comp. side could be just 2 witnesses and 1 replica... then if THAT replica fails... ???
I'd love some insight.
One of the great blogs on vSAN. It is a one-stop shop:
If you want more insight on witnesses and components, the blog below is the only one I could find that does a deep dive into them.
Thanks. Actually, the reference you cite is the only place where primary, secondary and tertiary witnesses are mentioned, AFAIK.
But even there, how the architecture works is not described. Some things can be inferred, but I'm reaching some dead ends.
Let me try to help answer your questions.
i) FTT is specifically host failures to tolerate. To create resiliency in the network you must have multiple NICs attached to the vSwitch where your Virtual SAN enabled VMK is connected, using either explicit failover or LACP load balancing. Also, disk failures impact the objects on the disk group where the failure occurred, so if you have multiple groups per server, only objects that live on the group where the disk failed are degraded. More about the difference between "degraded" and "absent" states here
ii) and iii) Remember that network failures are going to create an "absent" situation for the host providing you time to fix the issue. Should you not be able to fix the issue within the default 60 minute time frame, the disk groups on the isolated host transition to degraded. See the following link for more information that I think will give you the info you are looking for regarding witness count, node requirements, etc. - VMware Virtual SAN: Witness Component Deployment Logic | VMware vSphere Blog - VMware Blogs
Basically, the formula for calculating the minimum number of hosts to achieve your FTT is 2 * FTT + 1 = number of hosts. This always results in an odd number of hosts, making primary and secondary witnesses unnecessary, which is how you would end up with 3 data components and 2 witness components in your example. However, in a 3 node cluster in your example you would need an additional 2 components - 1 primary witness, since your cluster is not greater than 2 * FTT + 1, and one tiebreaker witness to create an odd number of components.
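A minimal sketch of that formula and the resulting component layout (plain Python; `min_hosts` and `components` are hypothetical helper names, and the real witness count also depends on where components actually land):

```python
def min_hosts(ftt):
    """Minimum hosts needed to tolerate `ftt` host failures with
    mirrored objects: 2 * FTT + 1, per the formula above."""
    return 2 * ftt + 1

def components(ftt):
    """Replica and witness counts for a simple, unstriped object:
    ftt + 1 replicas, with witnesses filling the remaining hosts
    so the total component count is odd (a sketch only)."""
    replicas = ftt + 1
    witnesses = min_hosts(ftt) - replicas
    return replicas, witnesses

for ftt in (1, 2, 3):
    r, w = components(ftt)
    print(f"FTT={ftt}: >= {min_hosts(ftt)} hosts, {r} replicas, {w} witnesses")
```

For FTT=2 this gives 5 hosts, 3 replicas and 2 witnesses, matching the example discussed above.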
thanks for taking the time to answer.
But... do you have some official vmware document to back up your statement ?
I've seen/read many VMware docs/presentations where FTT is defined as failures to tolerate (no news up to here), be it disk/host/network. The last time I heard it was in the VMworld 2013 presentation.
Even better, the technical paper describing VSAN says (and I quote):
5.1.1 Number of Failures to Tolerate
This property requires the storage object to tolerate a defined number of concurrent host, network, or disk
failures in the cluster and still ensure availability of the objects.
ii) and iii)
I do not agree. Network failures can create all sorts of issues. E.g. a partition is not an absent situation.
And if you say you can tolerate 2 failures, it is no good if you then add qualifiers like "if they happen with enough time in between".
Many HA schemes can do self healing. That does not qualify as able to tolerate multiple failures, AFAIK.
Going to the detail, keep in mind that a stripe also counts as a component, so even if you have FTT=2, you may end up having witnesses to balance the host count toward partition quorum.
Regarding FTT, suppose you have 2 hosts with 3 disk groups (2 in the first host, 1 in the second). Your default policy of FTT=1 would be in violation from the beginning. Why? Because with only 2 hosts there is no third host for the witness, so you cannot actually tolerate losing a single host.
Now suppose you introduce a 3rd host and add a single disk group to it. FTT=1 is OK. To this end when you define FTT you are defining how many hosts you will tolerate failing.
Suppose you have 3 disk groups per host and 3 hosts for a total of 9. You lose one of the disk groups on one of the hosts and all disks are marked degraded. Assuming you have enough capacity remaining, your objects will be rebuilt using existing capacity. From the standpoint of the policy, you are not in violation of FTT=1. This is why disks/disk groups are not specifically impacted by FTT. The exception is when you have one disk group per host at which point losing that group causes you to lose the entire contribution from the host.
Now, loss of a disk or group does have an impact when FTT=0. This means you have 0 extra copies, and if you lose the disk you lose the objects that are not protected. From that perspective FTT has a direct relationship to disks/groups, but only when it is 0.
As for ii) and iii), what exactly don't you agree with? The term "absent" is Virtual SAN specific here - it will not begin to rebuild objects should you lose network connectivity until the 60 minute timer expires.
Proper network design is important here. If you are connecting 3 - 7 hosts to a pair of common upstream switches and you've configured your failover policy correctly, the only thing that would cause a 3/5 or 5/7 split would be someone misconfiguring multiple ports simultaneously or a physical issue with multiple cables.
Stripes are components, but if you designed the cluster properly to meet your FTT requirements, the same rules apply to 1 stripe vs. multiple stripes.
Let's focus on one thing, or else it is hard to have a meaningful interaction. So let's focus on i).
You say FTT is only defined as host failures. I say that the documents explicitly say otherwise.
Can we agree on something here ?
I disagree with how I think you may be interpreting the document.
5.1.1 Number of Failures to Tolerate
This property requires the storage object to tolerate a defined number of concurrent host, network, or disk
failures in the cluster and still ensure availability of the objects.
This is a general statement. Again, tell me where I'm wrong here. Suppose you have 3 servers in the cluster. Each has:
- 2 x 10GbE NICs
- 1 x SSD 100GB
- 2 x 1TB SATA drives
All configured with FTT=1. According to how I believe you are interpreting this, if you lose any of:
- 1 NIC
- 1 drive
- 1 host
You are violating FTT=1. I am saying this is not necessarily true at all. Depending on what fails and when, you have either an absent or a degraded situation.
Absent = 60 minute timer kicks in and flips to degraded at the end
Degraded = failed, time to rebuild the components
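The two states above can be sketched as a tiny decision rule (Python; `component_state` and the constant name are illustrative, not an actual vSAN API):

```python
REPAIR_DELAY_MINUTES = 60  # the default timer discussed above

def component_state(minutes_unreachable, confirmed_device_failure=False):
    """Sketch of the distinction above: a confirmed device failure is
    degraded immediately (rebuild starts), while a merely unreachable
    component stays absent until the repair-delay timer expires."""
    if confirmed_device_failure:
        return "degraded"
    if minutes_unreachable >= REPAIR_DELAY_MINUTES:
        return "degraded"
    return "absent"
```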
Now, let's walk through the 5.1.1 concurrent definition via some scenarios. Suppose you have 50 VMs, each with a single disk, and they are all FTT=1 via the default policy. You have (not simultaneously) the following failure scenarios:
- You lose a host. This means it died, it lost all of its drives simultaneously, or it lost all network connectivity somehow. For the first 60 minutes it is considered absent and all objects on that host are now effectively (not by policy) FTT=0. Since you have a policy of FTT=1 that can no longer be satisfied, you cannot provision new machines/disks because doing so would violate your policy. After 60 minutes, assuming you have capacity, the objects will begin to rebuild within the free space of your cluster; however, you may not provision new machines until you have a 3rd host. So this is where FTT is based on HOST failures, as I am stating.
- You lose a NIC. Assuming you have configured your switches/failover policy according to best practice, nothing happens. You are not out of policy, FTT=1 is still effectively your protection level and your policy is not violated. So depending on how you interpret 5.1.1, if you assume a NIC failure is a network failure, it is independent of FTT. Now suppose you only have one NIC per server. If you lose one, the host is completely isolated. This is basically a HOST failure and the above HOST failure scenario applies.
- You lose an HDD. All of the objects on that HDD are unprotected (effectively FTT=0) until rebuilt. The disk is marked degraded and background rebuilds of unprotected objects begin immediately. Note that this will protect things that would otherwise have no other copy until the disk is replaced. Your effective policy is still FTT=1 and you can provision VMs. This is not a HOST failure and FTT is not impacted (although total pool capacity is).
- You lose an SSD. While I haven't tested this scenario, it's effectively the same as the loss of a disk group (each disk group requires at least one SSD). If your host only has a single disk group, this is very much like a HOST failure. If your host has multiple disk groups, it's like losing a single HDD - your effective FTT for machines that lived on that group is 0 and they are unprotected. As far as policy goes you can still provision, everything is still considered FTT=1, life is good.
The entire point is that FTT is determining how many HOSTS you can lose. If you have a single disk group per host and you have an event that causes that disk group to be lost, whether that's network isolation for >60 minutes, loss of the SSD, all of the HDDs, or the host itself dies, this is considered a HOST failure and thus FTT policy is impacted.
OK, let's progress:
First "nitpick": FTT is configured in the object policy, not in the infrastructure, so "all configured with FTT=1" I take to imply that the cases we discuss deal with an object with FTT=1.
I'm not saying (nor thinking) that you ever violate FTT=1. I would only use "violate" at provision time (when you try to define a new object with a given policy on your infrastructure, and the infrastructure is not able to comply with the requested policy).
After that (i.e. your object has been admitted so to say) an event can degrade it, or make it unavailable.
So the first part of your view of what I interpret from the doc was wrong.
To be clear, what I interpret from the doc is that *after* being provisioned with FTT=n, an object will tolerate n failure events and still be available. And those failures can be concurrent, i.e., happen at once (not with healing time in between). And these failures can be from any of the three domains that are talked about: groups, hosts or network.
Just as an example, one network failure can create a partition. Nowhere in the design documents is there a need for network redundancy. I consider a switch failure to be a valid network failure (to be tolerated).
Now, you go on with examples, but present them as if an event changes your policy. I do not see it that way. Your policy stays, but you have a failure and a degraded system. The system can heal by itself, but that's not the point. If you had FTT=1, all that VSAN was promising is that it would keep the object available after 1 failure. I expect all such objects to be available after 1 failure. Nothing more.
The docs do not require failover NICs to be present for this to happen. So when you say that losing a NIC will change nothing, you are assuming more things than documented.
And to support my view that a network event can create a partition: I see lots of material covering witnesses and quorum, all issues that relate only to a partitioned cluster. So the solution seems to be designed to tolerate a partition as a network event.
So let's stick with 5.1.1 until we agree about what it means.
Do we?
I am not in disagreement with you at all that a network split can occur. By the way, the design doc page 8 discusses the use of multiple NICs. While Virtual SAN itself does not require such a design (you can use one NIC), vSphere best practices have always recommended the use of multiple NIC uplinks.
FTT=n means you can tolerate n host failures. I realize exactly what the doc is stating; we are totally on the same page there. I'm telling you it's host failures, and that network, disk, cpu/memory/etc. can all trigger host failures. The reason I'm saying this is laid out in my examples - you can have FTT=1, tolerate multiple disk failures, never violate policy, and have all of your objects online. You can tolerate 3 NIC failures, 1 switch failure (assuming you have 2) and a disk failure simultaneously - all of these things will maintain total policy compliance despite it being FTT=1, if you have things engineered correctly.
What you cannot tolerate is the loss of multiple hosts with FTT=1. If you have FTT=2 and you have enough hosts according to the formula I laid out - 2 * FTT + 1, which makes 5 hosts for FTT=2 and 7 hosts for FTT=3 - you can tolerate the number of host failures designated in the FTT.
If you are not following that formula - say you have FTT=2 but only 3 hosts - you cannot lose two hosts.
My point is that how many concurrent things you can lose is dictated not only by FTT but by your design.
At this point I'm not really sure what your issues are. Do you agree that FTT refers to the number of components that an object relies on failing? I.e. two disk groups containing an object simultaneously failing with FTT=1 would mean loss of data?
My entire point in saying this is a host issue is that, aside from the event where you simultaneously lose two disk groups on two different hosts, everything else requires host isolation/failure, and the point of planning FTT is planning around host failure.
> At this point I'm not really sure what your issues are.
My issue now is that you say FTT means HFTT (host failures to tolerate) and, sincerely, I see no document supporting your view.
I read FTT as concurrent failures to tolerate, being host, group or network failures. If we cannot agree on this, then going forward is pointless.
Did you stop reading my reply immediately after the statement you quoted? Because the very next sentence is
Do you agree that FTT refers to the number of components that an object relies on failing? I.e. two disk groups containing an object simultaneously failing with FTT=1 would mean loss of data?
Do you disagree or agree with this?
No reason for you to be argumentative here, I'm trying to help.
Sorry, I'm not trying to be argumentative. But if we don't agree on what FTT means, then it is pointless to move on, as this definition is fundamental.
Strictly speaking, I disagree, because a "component" is one kind of resource in the VSAN, be it a replica, a stripe, or a witness.
I guess I've just stumbled onto a piece of doc that shows where the issue might be: it seems that words like "component" are used with different meanings in different places, which does not help when being strict.
In 4.1, it says "... vSphere cluster can contend with the failure of a vSphere host, or of a component within a host—for example, magnetic disks, flash-based devices, and network interfaces—and continue to provide complete functionality for all virtual machines."
Which is in line with your view of a failure inside a host.
But still, a host failure cannot create a partition (do you agree?), and there is plenty of complexity there to cope with network failures in general. I just would like a clear definition of what failures a VSAN tolerates. It would seem clear that it tolerates some network failures, but now it seems (to me) that "host network failure" is what is implied in the FTT accounting.
BTW, even with FTT=1, two failures do not imply loss of data... you may get lucky and have the two failed disks be in the same group.
Ah ok I see the confusion now
I was trying to draw the distinction between Virtual SAN components and host components and I think I see where I failed.
There are 4 main Virtual SAN objects:
The VM home (its home folder in VMFS)
The VM swap file
The VM virtual disks (vmdks)
The snapshot delta disks
Each of these is made up of one or more components based on a variety of factors. One is disk size - disks larger than 255GB break apart into multiple components of 255GB or less. Each snapshot is a component (these are each a vmdk subject to the 255GB limit). Each stripe written is a component. Each mirror copy of the above is a component. Finally, each witness is a component.
This is different from host components such as a NIC.
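As a rough sketch of those counting rules (Python; the function names and the simple splitting model are assumptions for illustration, the real placement logic is more involved):

```python
import math

MAX_COMPONENT_GB = 255  # components above this size are split, per the above

def data_components(vmdk_gb, stripes=1):
    """Data components for ONE copy of a vmdk: each stripe is a
    component, and each stripe over 255 GB splits further."""
    per_stripe = vmdk_gb / stripes
    return stripes * math.ceil(per_stripe / MAX_COMPONENT_GB)

def object_components(vmdk_gb, ftt=1, stripes=1, witnesses=1):
    """Total components for one vmdk object: ftt + 1 mirror copies
    of the data components, plus witnesses (the witness count here
    is an illustrative assumption; it varies with placement)."""
    return (ftt + 1) * data_components(vmdk_gb, stripes) + witnesses
```

For example, a 500 GB vmdk with FTT=1 and no striping works out to 2 data components per copy, 2 copies, plus 1 witness: 5 components.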
From the perspective of the Virtual SAN cluster, a host failure creates a partition. If nodes 1 and 2 can't communicate with node 3, node 3 is essentially partitioned.
Now if vCenter can see all 3 nodes, but the vmkernels that are VSAN enabled can't communicate, that's also a partition.
Which of the above 2 scenarios are you referring to with regard to partition?
I feel like we are on the same page now. Curious what other questions you have regarding partitions.
Thanks for your patience
> From the perspective of the Virtual SAN cluster a host failure creates a partition. If nodes 1 and 2 can't communicate with node 3, node 3 is partitioned essentially.
Not really. Again, that depends on the usage of the word, but usually a partition is when you have more than one host on each side of a "fracture" (a new word, just to not confuse things). If there is only one host on a side, it is usually called "isolated" rather than "partitioned".
This is very well documented in the HA documents. The whole thing started for me because, as quorum and partitions were being analyzed, it was (wrongly) obvious to me that a network failure was, well, a network failure, and not a host network component failure, as it now seems evident.
> Now if vcenter can see all 3 nodes but the vmkernels that are VSAN enabled can't communicate that's also a partition.
You lost me here.
To use your terms:
In a four node cluster, if nodes 1 and 2 can communicate with each other but not with 3 and 4, and 3 and 4 can communicate with each other but not with nodes 1 and 2, you have a fracture creating two partitions. Correct?
As opposed to my scenario where you have three hosts fractured creating an isolated host and a partition?
Regarding vCenter: Virtual SAN uses different vmkernels than management, again assuming you followed best practice. In that case it's entirely possible for vCenter to see all three hosts and manage them while the Virtual SAN cluster is in a partitioned state. My point is we need to distinguish which hosts are isolated or partitioned.
Right. The problem with partitions is that each may think "I'm the one" and keep working.
That could turn ugly afterwards, hence the need for strictly greater than 50% of components to be available (so there is no way to have two active partitions).
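That >50% rule fits in one line (Python; `has_quorum` is a hypothetical name):

```python
def has_quorum(visible_components, total_components):
    """A partition may keep an object active only if it sees strictly
    more than half of the object's components; since two disjoint
    partitions cannot both exceed half, at most one stays active."""
    return 2 * visible_components > total_components

# FTT=2 example from earlier: 3 replicas + 2 witnesses = 5 components.
# A partition seeing 3 of 5 has quorum; the side seeing 2 does not.
```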
Yes, I know that a different network is used. This also matters for FDM (HA) having the "same idea" as VSAN of what's going on. I don't want to get into implications about vCenter and what it sees. It is quite complicated in terms of HA, and I don't have enough info to think it through in the case of VSAN. (HA can indeed keep working in a fractured cluster, with one master in each partition.)
I suspect (but have not seen info on) that a master is indeed implemented in the VSAN.
Working from phone and memory, but I believe the command is:
esxcli vsan cluster get
You'll see the cluster status and the GUID of the master.
Does this help your understanding? vSphere 5.5 Documentation Center
HA behaves differently with Virtual SAN. If a host is isolated and the objects fully reside on the isolated host, I believe the VM keeps going. If the objects are on other hosts, the VM would be restarted by HA.
I guess I will have to play with it to get more confident with how things work.
But the docs are not filling the gaps. To the contrary, every time I read something, I come up with a "how is this so?"
Case in point: the doc you cited has an example of how a system with 3 hosts cannot get to a working state even with quorum:
Consider an example where a Virtual SAN virtual machine is provisioned to tolerate one host failure. The virtual machine runs on a Virtual SAN cluster that includes three hosts, H1, H2, and H3. All three hosts fail in a sequence with H3 being the last host to fail.
After H1 and H2 recover, the cluster has a quorum (one host failure tolerated). Despite this, vSphere HA is unable to restart the virtual machine because the last host that failed (H3) contains the most recent copy of the virtual machine object and is still inaccessible.
The writing suggests that somehow H1 and H2 "know" that H3 was alive after them. I don't see how that could be the case,
although it might be that H2, holding a witness, knows that some component (presumably on H3) has a later state than what is stored at H1.
In any case, my main issue is now solved, Thank you!