VMware Cloud Community
Ms934
Contributor

Host isolation vs Host Failure

I just wanted to know how ESXi hosts differentiate between a host failure and a host isolation, especially if they are in a vSAN cluster.


Accepted Solutions
sk84
Expert

This is quite a complex subject, but I will try to outline it briefly.

There are 3 different types of failures that vSphere HA (or now vSphere Availability) can detect:

- Complete host failure

- Network isolation / partition

- Host cannot communicate with the master host in the cluster

Various checks are performed to detect these failures. For example, the master host checks every second whether it can reach the HA agent on each slave host and, if necessary, pings the slave host. In addition, there are datastore heartbeats: every host periodically writes to a file on a shared datastore, so the other hosts can see whether each cluster member still has access to that storage.

If a host no longer responds to HA agent heartbeats, can no longer be reached by ping, and no longer produces datastore heartbeats, the master host assumes a total failure of this slave host.

If the master host can no longer reach a slave host over the network, but there are still datastore heartbeats, it suspects a network isolation of this host.

And a host can also consider itself network isolated if it no longer receives agent heartbeats and cannot ping the cluster isolation addresses.
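
The three rules above can be written down as a small decision sketch. This is only an illustration of the logic described in this post, not the actual HA agent code; all names (`Observations`, `classify_from_master`, `host_considers_itself_isolated`) are hypothetical, and it assumes that "reachable over the network" means either an agent heartbeat or a ping reply.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Sequence


class HostState(Enum):
    REACHABLE = "reachable"
    SUSPECTED_ISOLATION = "suspected network isolation"
    FAILED = "total host failure"


@dataclass
class Observations:
    """What the master host currently sees for one slave host (hypothetical model)."""
    agent_heartbeat: bool      # HA agent on the slave answers the master
    ping_reply: bool           # slave answers a ping from the master
    datastore_heartbeat: bool  # slave still updates its heartbeat file


def classify_from_master(obs: Observations) -> HostState:
    """Apply the master-side rules from the post to one slave host."""
    # Assumption: either signal counts as "still reachable over the network".
    if obs.agent_heartbeat or obs.ping_reply:
        return HostState.REACHABLE
    # No network contact, but the host still writes datastore heartbeats.
    if obs.datastore_heartbeat:
        return HostState.SUSPECTED_ISOLATION
    # No agent heartbeat, no ping, no datastore heartbeat.
    return HostState.FAILED


def host_considers_itself_isolated(
    receives_agent_heartbeats: bool,
    isolation_address_replies: Sequence[bool],
) -> bool:
    """Host-side rule: no agent heartbeats and no isolation address answers."""
    return not receives_agent_heartbeats and not any(isolation_address_replies)


if __name__ == "__main__":
    print(classify_from_master(Observations(False, False, True)).value)
    print(host_considers_itself_isolated(False, [False, False]))
```

Note that on a cluster with only vSAN datastores the `datastore_heartbeat` signal is not available, so in this sketch it would always be `False`.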

But with vSAN there are some changes that affect these failure checks.

First, the HA heartbeats are no longer sent over the management network but over the vSAN network. The default cluster isolation address should therefore be overridden with an IP address on the vSAN network. Furthermore, datastore heartbeats do not work on vSAN datastores. If you want to use this feature, the cluster needs at least one additional conventional datastore (NFS, iSCSI, FC).
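
As a rough sketch of what overriding the isolation address means in practice: to my knowledge this is done with the vSphere HA advanced options `das.usedefaultisolationaddress` and `das.isolationaddress0`, but please verify the names against the availability guide for your version. The helper below only builds the key/value pairs and does not apply them to a cluster; the function name and the address `192.0.2.10` are placeholders.

```python
import ipaddress
from typing import Dict


def vsan_isolation_options(vsan_isolation_ip: str) -> Dict[str, str]:
    """Build HA advanced options that point the isolation check at the vSAN network.

    The option names are assumed from the vSphere HA advanced options;
    the helper itself is hypothetical and changes nothing on a cluster.
    """
    # Raises ValueError if the string is not a valid IPv4/IPv6 address.
    ipaddress.ip_address(vsan_isolation_ip)
    return {
        # Do not use the management network default gateway as isolation address.
        "das.usedefaultisolationaddress": "false",
        # Use a pingable address on the vSAN network instead.
        "das.isolationaddress0": vsan_isolation_ip,
    }


if __name__ == "__main__":
    # 192.0.2.10 is a documentation placeholder, not a real vSAN address.
    for key, value in vsan_isolation_options("192.0.2.10").items():
        print(f"{key} = {value}")
```

The resulting pairs would then be entered as advanced options in the cluster's vSphere HA settings.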

Hope this helps.

You can find more information about this topic here:

- https://docs.vmware.com/en/VMware-vSphere/6.5/vsphere-esxi-vcenter-server-65-availability-guide.pdf

- vSphere HA considerations

--- Regards, Sebastian VCP6.5-DCV // VCP7-CMA // vSAN 2017 Specialist Please mark this answer as 'helpful' or 'correct' if you think your question has been answered correctly.
