VMware Cloud Community
cluey
Enthusiast

HA Isolation response with vSAN

Hi everyone, hopefully someone can offer some advice on this.

I'm new to vSAN but trying to get some design decisions together for HA clusters in a vSAN environment. Our environment (in short) is as follows:

  • 8 node cluster
  • All nodes have storage and participate in the vSAN
  • n+1 resilience required
  • HA/DRS required
  • dual 10GbE NICs will be used for all traffic (with NIOC shares configured for QoS)
  • VMFS datastore (shared across all hosts) will be used for templates, ISOs etc.


I'm struggling a little with some aspects of isolation response. There are some good articles out there and I'd say I understand 80-90% of it. In our scenario, if a host were to become isolated, HA heartbeats (which run over the vSAN network) would fail and the isolation response would be triggered. That's fine; I guess power off/shutdown would be the best option for us, as the VMs would have lost all network access too.

Question is, how does having a VMFS datastore available to all hosts in the cluster (which HA would configure for datastore heartbeats) change the decision on which isolation response to use?

Also, if, say, two hosts become partitioned from the other cluster hosts, the isolation response wouldn't be triggered on those two hosts, as they would simply elect a new master and continue to operate (along with the VMs running on them). However, the other hosts (let's say six of them), now in their own partition, can't see those two hosts and so initiate the HA response (restarting the VMs from the other two hosts). What strategy needs to be in place to deal with this?

Thanks in advance.

Andy


Accepted Solutions
depping
Leadership

Hi there, good question. Let's go over this.

Question is, how does having a VMFS datastore available to all hosts in the cluster (which HA would configure for datastore heartbeats) change the decision on which isolation response to use?

This would not change the choice of isolation response. Look at it differently: when the vSAN network has failed, the host can no longer access the components of the impacted objects, which means the VMs running on the isolated host have just lost the connection to their storage. If the connection to storage is lost, then in most cases the VMs running there are useless. Adding heartbeat datastores doesn't change the fact that those VMs cannot reach their storage. Either way, I would always go for "power off". That way, when the isolation is lifted, the "isolated VM" is already gone.
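To make the reasoning above concrete, here is a deliberately simplified Python sketch of how a host might classify its own state. It assumes the host uses network signals only (master heartbeats, traffic from other HA agents, and a ping to an isolation address); the function and parameter names are hypothetical and are not part of any vSphere API. The datastore heartbeat is passed in but never consulted, which is the point being illustrated.

```python
from enum import Enum


class HostState(Enum):
    CONNECTED = "connected"
    PARTITIONED = "partitioned"
    ISOLATED = "isolated"


def classify_host(receives_master_heartbeats: bool,
                  sees_other_ha_agents: bool,
                  can_ping_isolation_address: bool,
                  datastore_heartbeat_ok: bool) -> HostState:
    """Toy model (assumption, not vSphere code) of a host judging its own state.

    datastore_heartbeat_ok is accepted but deliberately unused: in this
    model a reachable heartbeat datastore does not stop a host from
    declaring itself isolated.
    """
    if receives_master_heartbeats:
        return HostState.CONNECTED
    if sees_other_ha_agents or can_ping_isolation_address:
        # Some part of the network is still reachable, so the host is
        # cut off from the master but not from everything.
        return HostState.PARTITIONED
    return HostState.ISOLATED


if __name__ == "__main__":
    # vSAN/HA network down, VMFS heartbeat datastore still reachable.
    print(classify_host(False, False, False, datastore_heartbeat_ok=True))
```

In this model the result is the same whether `datastore_heartbeat_ok` is true or false, so the isolation response fires either way.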

For a partition this is different. There is no "partition response" that you can define. If there is a partition, the partition which owns more than 50% of the components gets ownership of the object and the other side loses ownership. The VM can then be restarted, but the original instance won't be powered off automatically, as can be done with an isolation event. Instead, when the partition is lifted, the host running the VM which lost access to its storage will recognize that it has lost access and then kill the processes of the VM.
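The "more than 50% of the components" rule can be sketched as a small majority check. This is an illustration only: it assumes every component counts equally, and the host names and the two-replicas-plus-witness layout are made up for the example.

```python
def partition_owning_object(component_hosts, partitions):
    """Return the index of the partition that can see more than 50% of
    an object's components, or None if no partition has a majority.

    component_hosts: one host name per component of the object.
    partitions: list of sets of host names, one set per partition.
    """
    total = len(component_hosts)
    for index, hosts in enumerate(partitions):
        visible = sum(1 for host in component_hosts if host in hosts)
        if visible * 2 > total:
            return index
    return None


if __name__ == "__main__":
    # Hypothetical 8-node cluster split 2 / 6, as in the question.
    small = {"esx1", "esx2"}
    large = {"esx3", "esx4", "esx5", "esx6", "esx7", "esx8"}
    # Hypothetical object: two replicas plus a witness component.
    components = ["esx1", "esx3", "esx7"]
    print(partition_owning_object(components, [small, large]))
```

With this layout the six-host partition sees two of the three components and takes ownership, so a VM still running on esx1 has lost its storage even though that host is not isolated.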

Does that help?

cluey
Enthusiast

Thanks Duncan, it does help. Think I've got it now.

So even if the host had a completely separate storage network for the VMFS datastore, a slave wouldn't make use of the datastore heartbeat, so it would still class itself as isolated and trigger the isolation response. As each VM is powered off, the isolated host removes that VM's entry from the poweron file, which the master can see, and so the master restarts the virtual machines.
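The hand-off described above can be modelled roughly as follows. This is a toy sketch that represents the poweron file as a plain set of VM names; the function names are hypothetical, and the real file format and HA agent logic are not shown in this thread.

```python
def power_off_isolated_vms(poweron_entries, vms_on_isolated_host):
    """Isolated host: power off each VM and drop its poweron entry.

    poweron_entries is mutated in place to stand in for the file the
    master can read. Returns the VMs in the order they were handled.
    """
    powered_off = []
    for vm in vms_on_isolated_host:
        powered_off.append(vm)        # stands in for the actual power-off
        poweron_entries.discard(vm)   # entry removed once the VM is off
    return powered_off


def vms_master_should_restart(protected_on_host, poweron_entries):
    """Master: protected VMs with no remaining poweron entry on the
    isolated host are candidates for restart elsewhere."""
    return sorted(protected_on_host - poweron_entries)


if __name__ == "__main__":
    entries = {"vm-a", "vm-b"}  # hypothetical VM names
    power_off_isolated_vms(entries, ["vm-a", "vm-b"])
    print(vms_master_should_restart({"vm-a", "vm-b"}, entries))
```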

Thanks again.

Andy
