VMware Cloud Community
RSEngineer
Enthusiast

What Happens When a Node Fails?

Hello - the VMs are set for FTT=1 and there are three nodes in the cluster (E-Nodes). What happens when one node fails? What do you lose? What won't you be able to do? Can someone paint a detailed picture of what to expect?

6 Replies
TheBobkin
Champion

Hello RSEngineer,

"What do you lose?"

You shouldn't lose access to anything provided all Objects are FTT=1 and compliant with their Storage Policy.

"What wont you be able to do?"

You won't be able to create (compliant) FTT=1 Objects until you have the node back up and participating in the cluster. This is why .vswp Objects have Force-Provisioning=1, so that they dynamically create FTT=0 Objects if there are not enough Fault Domains (nodes here) to make FTT=1 Objects, thus allowing VMs to be powered on with only 2/3 nodes online.
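To make that placement rule concrete, here is a minimal Python sketch. It assumes RAID-1 mirroring, where an Object needs 2 x FTT + 1 Fault Domains, and it models Force-Provisioning as a simple fallback to FTT=0. The function names are made up for illustration and are not a vSAN API.

```python
from typing import Optional


def min_fault_domains(ftt: int) -> int:
    """Fault Domains needed for a RAID-1 mirrored Object: 2 * FTT + 1."""
    if ftt < 0:
        raise ValueError("ftt must be non-negative")
    return 2 * ftt + 1


def placeable_ftt(requested_ftt: int,
                  available_fault_domains: int,
                  force_provisioning: bool) -> Optional[int]:
    """Return the FTT an Object would be created with, or None if creation fails.

    Simplified model (assumption): with Force-Provisioning the Object falls
    back to FTT=0 when the requested FTT cannot be satisfied.
    """
    if available_fault_domains >= min_fault_domains(requested_ftt):
        return requested_ftt
    if force_provisioning and available_fault_domains >= 1:
        return 0
    return None


# 3-node cluster with one node down: 2 Fault Domains left.
print(placeable_ftt(1, 2, force_provisioning=False))  # prints None
print(placeable_ftt(1, 2, force_provisioning=True))   # prints 0
```

With all 3 nodes online the same call returns 1, which is the compliant case.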

"Can someone paint a detailed picture of what to expect?"

If a host has a sudden hard failure, all of the VMs that are running on it are going to go down with it. Provided you have HA set up correctly and enough compute resources, HA will restart the VMs on the remaining nodes.

Once the absent node comes back, the changes that have been applied to the remaining data copies are synced to it.

Relatively old articles, but the fundamentals remain the same:

https://cormachogan.com/2013/09/17/vsan-part-9-host-failure-scenarios-vsphere-ha-interop/

How VSAN handles a disk or host failure

Bob

IRIX201110141
Champion

Duncan has written dozens of articles about vSAN. Take a look at: Search for "vsan" - Yellow Bricks


If the VM runs on another host, nothing happens, but you will lose the disk redundancy.

If the VM runs on the affected host, HA kicks in and restarts the VM, as long as one data copy and the witness, or both data copies, are available.

If you have one host in maintenance mode and you lose a disk during this period, you have a problem. This is why most people prefer to start with 4 or more hosts, and why vCenter Foundation was changed to 4 manageable hosts.

If you lose a cache device, the whole disk group goes offline.

If you lose a capacity device, the data will be recreated on host #4, if you have one.
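The availability rule above (one data copy plus the witness, or both data copies) can be sketched as a simple vote count. This is only an illustration in Python: it assumes one vote per component and an FTT=1 mirrored object, and the host names and function name are made up.

```python
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass(frozen=True)
class Component:
    kind: str      # "data" or "witness"
    host: str      # host holding this component (hypothetical names below)
    votes: int = 1


def object_accessible(components: Iterable[Component],
                      failed_hosts: Set[str]) -> bool:
    """Accessible if a data copy survives and more than half the votes remain."""
    components = list(components)
    alive = [c for c in components if c.host not in failed_hosts]
    total_votes = sum(c.votes for c in components)
    alive_votes = sum(c.votes for c in alive)
    has_data_copy = any(c.kind == "data" for c in alive)
    return has_data_copy and alive_votes * 2 > total_votes


# FTT=1 object on a 3-node cluster: data + data + witness.
layout = [
    Component("data", "esx1"),
    Component("data", "esx2"),
    Component("witness", "esx3"),
]

print(object_accessible(layout, {"esx1"}))          # prints True
print(object_accessible(layout, {"esx1", "esx3"}))  # prints False
```

Any single host failure leaves 2 of 3 votes and at least one data copy, so the object stays accessible. A second failure does not.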

Regards,

Joerg

RSEngineer
Enthusiast

Thanks, Bob.

Can you please elaborate a bit more on this phrase: "this is why .vswp Objects have Force-Provisioning=1 so that they dynamically create FTT=0 Objects if there are not enough Fault Domains (nodes here) to make FTT=1 Objects and thus allowing VMs to be powered on with only 2/3 nodes online."

Also, with only 2 out of 3 nodes online, can you create a VM with an FTT=0 during that outage? I know this isn't advisable because there is no protection, but I am just wondering if it's possible.

Also, when you mentioned the VMs migrating to the surviving nodes from the failed node, what VMware feature is doing that? Is it vSphere HA? Does that automatically come with vSAN or is it an added license?

RSEngineer
Enthusiast

Thanks to you, too, Joerg. Good stuff. I am asking all these questions because I have a client (I'm a pre-sales SE) who is balking at the price of a 4-node cluster. He has a small environment, but it runs all his mission-critical apps, so I configured 4 nodes to be as safe and as flexible as possible. I am wondering if I should just take out a node.

TheBobkin
Champion

Hello RSEngineer

"this is why .vswp Objects have Force-Provisioning=1 so that they dynamically create FTT=0 Objects if there are not enough Fault Domains (nodes here) to make FTT=1 Objects and thus allowing VMs to be powered on with only 2/3 nodes online."

Not sure what else there is to add: for an FTT=1 Object you need 3 Fault Domains (nodes in a non-stretched cluster) to place the data+data+witness components. .vswp Objects are essentially reserved space for the VM's assigned memory (minus its memory reservation) in case it needs to swap to disk (under contention/usage). These are stored as Objects in vSAN and thus need to be created when a VM is powered on. If you only have 2 nodes available, you can't make an FTT=1 Object, so making an FTT=0 Object is essential for the VM to power on (it can be made FTT=1 later when the 3rd node becomes available).
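As a quick worked example of the .vswp sizing described above (assigned memory minus memory reservation), here is a small Python sketch. The function name and the example VM sizes are made up, and it ignores any overhead vSAN adds on top.

```python
def swap_object_size_mb(configured_memory_mb: int,
                        memory_reservation_mb: int = 0) -> int:
    """Size of the .vswp Object: configured memory minus the memory reservation."""
    if not 0 <= memory_reservation_mb <= configured_memory_mb:
        raise ValueError("reservation must be between 0 and configured memory")
    return configured_memory_mb - memory_reservation_mb


# Hypothetical VM: 16 GB of memory, 4 GB of it reserved.
print(swap_object_size_mb(16384, 4096))  # prints 12288
```

A VM with a full memory reservation would come out at 0 in this model.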

"Also, with only 2 out of 3 nodes online, can you create a VM with an FTT=0 during that outage? I know this isn't advisable because there is no protection, but I am just wondering if it's possible."

Yes, of course you can, as these only require a single Fault Domain/node for component placement.

"Also, when you mentioned the VMs migrating to the surviving nodes from the failed node, what VMware feature is doing that? is it vSphere HA? Does that automatically come with vSAN or is it an added license?"

Yes, vSphere High Availability - vSphere Essentials Plus or higher licensing is required for this feature.

As Joerg mentioned, having 4 nodes (or more) versus 3 really adds to the in-place resiliency that vSAN can provide. If you have a 3-node cluster and a hardware failure that may take a number of days to replace/fix, the data is going to be FTT=0 until that gets resolved. If you have 4 or more nodes (and enough overhead space left free), the data can be rebuilt back to FTT=1 on the remaining available nodes, after which it can of course withstand another node failure with all the data still being available. This, in my opinion, provides a lot of peace of mind.
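The 3-node versus 4-node difference can be sketched in a few lines of Python. This is a deliberately simplified model: it only checks that 3 nodes remain for FTT=1 and that the survivors have enough free space in total, whereas real vSAN also looks at per-host and per-disk capacity and at where the surviving components already live. The function name and the capacity figures are made up.

```python
def can_rebuild_to_ftt1(total_nodes: int,
                        failed_nodes: int,
                        free_gb_on_survivors: float,
                        data_to_rebuild_gb: float) -> bool:
    """True if the lost components could be rebuilt on the surviving nodes."""
    surviving_nodes = total_nodes - failed_nodes
    enough_nodes = surviving_nodes >= 3  # FTT=1 needs 3 Fault Domains
    enough_space = free_gb_on_survivors >= data_to_rebuild_gb
    return enough_nodes and enough_space


# One node fails, 500 GB of components need rebuilding.
print(can_rebuild_to_ftt1(3, 1, 2000, 500))  # prints False
print(can_rebuild_to_ftt1(4, 1, 2000, 500))  # prints True
print(can_rebuild_to_ftt1(4, 1, 300, 500))   # prints False
```

The last call is the "enough overhead space left free" caveat: a 4th node only helps if there is room to rebuild onto.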

Bob

RSEngineer
Enthusiast

Thanks again, Bob.
