VMware Cloud Community
shahidpp
Contributor

Testing the "number of host failures to tolerate" capability on VSAN 6.0

I have a 4-node VSAN cluster where each host contributes one HDD (3.5 TB) and one SSD.

The VSAN datastore is up and running.

Say the hosts are A, B, C, and D.

I have a VM on the VSAN datastore that uses the default VSAN storage policy, so its data objects are mirrored across two hosts:

- The VM runs on host A.

- Its components are mirrored across host A and host B.

- The witness is on host C.

- HA is disabled in the cluster.

When I shut down host A, the VM shows as disconnected, even though "no. of host failures to tolerate" is set to 1.

Since the data is mirrored on host B, I expected that host to take ownership of the VM and keep it connected.

However, when HA is enabled, the VM is restarted on host B.

So my question is: do we have to enable HA to handle a VSAN host failure?

If so, could you shed some light on why the data is mirrored?

I am still a beginner with VSAN and would appreciate any help.

7 Replies
jonretting
Enthusiast

Assuming you are using DRS and put the host into maintenance mode before taking it down, the VM should be vMotioned to another host. If you are simulating a complete failure of that host, and the VM in question is using that host for compute, then the VM will go offline. It seems you are mixing up the compute node and the storage policy. In a single-host failure scenario the VM's storage is still 100% available, but the VM needs to be restarted on another host, either by you, by scripts, or (ideally) by HA. In certain situations without HA you might need to remove the VM from inventory and re-register it (via the datastore browser) on a live compute node. Cheers
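
To make the compute-versus-storage distinction concrete, here is a minimal Python sketch of the layout from the original post (replicas on hosts A and B, witness on C, VM running on A). It is a simplified model, not how VSAN is implemented: it assumes every component carries one vote and that an object is accessible when at least one full replica and more than half of the components are reachable. The class and function names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    host: str
    is_replica: bool  # True for a data mirror, False for a witness


def object_accessible(components, failed_hosts):
    """Simplified rule (assumption): one full replica plus a majority of components."""
    alive = [c for c in components if c.host not in failed_hosts]
    has_replica = any(c.is_replica for c in alive)
    has_quorum = 2 * len(alive) > len(components)
    return has_replica and has_quorum


def vm_running(compute_host, failed_hosts):
    """A VM only runs while the host providing its CPU and memory is up."""
    return compute_host not in failed_hosts


# Layout described in the original post.
layout = [Component("A", True), Component("B", True), Component("C", False)]
all_hosts = {"A", "B", "C", "D"}
failed = {"A"}

storage_ok = object_accessible(layout, failed)
running = vm_running("A", failed)
restart_candidates = sorted(all_hosts - failed)

print("storage accessible:", storage_ok)
print("VM still running:", running)
print("hosts that could restart the VM:", restart_candidates)
```

In this model the storage survives the loss of host A while the VM itself does not, which is the gap that HA (or a manual re-register and power-on) fills.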

shahidpp
Contributor

Thanks, jonretting, for clarifying.

Yes, I was completely powering off the VM's compute host, which took the VM offline.

But the HA scenario described is common to any cluster with shared storage (not just VSAN), i.e. the VM gets restarted on an available host in the cluster.

So what does this capability add?

What does "host failure" in the storage policy mean? Is it a disk failure or just a network partition?

Would a manual shutdown count as one?

Thanks in advance

zdickinson
Expert

Failures To Tolerate (FTT) is the number of hosts that can fail while the data stays available. The host can fail in any number of ways: a crash or purple screen of death, an SSD failure (assuming only one disk group in the host), or a network failure like you mentioned. If you have three nodes and one fails, you will be without redundancy until the node is brought back online. If you have four or more nodes, a rebuild will be started. I believe there is a timeout before the rebuild starts, to account for maintenance windows and reboots.

HA will power a machine up on another host in the event of a failure, if that machine was running.  If the machine was powered off at the time of the failure it will show as disconnected until the host is back online.  I hope this helps.  Thank you, Zach.
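
The three-node versus four-node remark follows from the arithmetic behind FTT with mirroring. As a rough sketch (the function name is hypothetical, and this covers only the mirrored layout discussed in this thread): tolerating n host failures takes n + 1 copies of the data and at least 2n + 1 hosts contributing storage.

```python
def mirroring_requirements(ftt: int) -> dict:
    """Replica count and minimum host count for a mirrored object with the given FTT."""
    if ftt < 0:
        raise ValueError("ftt must be non-negative")
    return {"replicas": ftt + 1, "min_hosts": 2 * ftt + 1}


for ftt in (0, 1, 2):
    print(ftt, mirroring_requirements(ftt))
```

With FTT=1 the minimum is three hosts, so a three-node cluster has nowhere to rebuild after a failure, while a four-node cluster has one spare host.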

jonretting
Enthusiast

The default delay before a rebuild takes place is still 60 minutes. On my lab setup I would occasionally forget to bring a host out of maintenance mode, or leave it off too long while working on hardware.

The setting to change is "VSAN.ClomRepairDelay"

And to avoid restarting the host after modification you can manually restart the "clomd" daemon with:

/etc/init.d/clomd restart

Cheers
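
As a rough illustration of what the repair delay means in practice, the sketch below models the timer only. It assumes the cluster has a spare host to rebuild onto, and the function name and timestamps are arbitrary examples, not anything VSAN exposes.

```python
from datetime import datetime, timedelta

DEFAULT_REPAIR_DELAY_MINUTES = 60  # default quoted in the post above


def rebuild_due(failed_at, now, repair_delay_minutes=DEFAULT_REPAIR_DELAY_MINUTES):
    """Return True once the host has been absent for at least the repair delay."""
    return now - failed_at >= timedelta(minutes=repair_delay_minutes)


failed_at = datetime(2015, 6, 1, 9, 0)  # arbitrary example timestamp

# Host back after a 45-minute reboot: no rebuild would have started yet.
print(rebuild_due(failed_at, datetime(2015, 6, 1, 9, 45)))

# Host still absent after 90 minutes: the rebuild is due.
print(rebuild_due(failed_at, datetime(2015, 6, 1, 10, 30)))

# Same 90 minutes with the delay raised to 120: still waiting.
print(rebuild_due(failed_at, datetime(2015, 6, 1, 10, 30), repair_delay_minutes=120))
```

Raising the delay trades a longer window without redundancy for fewer unnecessary rebuilds during long maintenance.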

npadmani
Virtuoso

"HA will power a machine up on another host in the event of a failure, if that machine was running.  If the machine was powered off at the time of the failure it will show as disconnected until the host is back online."

FYI, a small correction is needed in the above statement.

If a VM was powered off when its host failed, and that host was part of an HA cluster and the VM resided on a shared datastore, HA will still re-register the VM on one of the other healthy hosts in the cluster. It will simply remain powered off.

Narendra Padmani VCIX6-DCV | VCIX7-CMA | VCI | TOGAF 9 Certified
shahidpp
Contributor

Thanks a lot for the information

shahidpp
Contributor

Thanks a lot.

cheers
