We are currently running vSphere 5.1 Update 3 with the following HA settings applied:
Restart priority - High
Host Response - Leave Powered on
VM Monitoring - Disabled
Datastore Heartbeating - Select any of the cluster datastores. Checking the heartbeat datastores, I see that vSphere has selected 2 datastores for its heartbeat.
We had an issue with a single host where it dropped its datastores. The virtual machines still responded to ping because they were still in memory, but they weren't in a healthy state. HA didn't restart the virtual machines on other hosts in the cluster until we rebooted the ESXi host that had the storage issue.
Can someone please explain why HA didn't restart the servers from the ESXi host that had no SAN storage available to it? I would have thought it would have kicked in, as there would be no heartbeat for the datastores.
That's not the purpose of the heartbeat datastore. The heartbeat (HB) datastore exists so that a host can update it to tell the HA master that it is either partitioned or isolated. If the host is partitioned, no HA restart will happen. If it is isolated, the isolation response will kick in. In your case, the VMs would still not restart, even if the host were determined to be isolated. This is because vSphere 5.1 HA determines whether a host is dead by monitoring the management network. I am assuming that the host's management network was available the entire time during the storage outage.
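To make the decision logic described above easier to follow, here is a rough Python sketch. It is a simplified model of the behaviour as explained in this thread, not VMware's actual implementation: the names `HostState`, `classify_host`, and `master_restarts_vms` are made up for illustration, and the real HA agent performs additional checks that are left out here.

```python
from enum import Enum


class HostState(Enum):
    LIVE = "live"
    PARTITIONED = "partitioned"
    ISOLATED = "isolated"
    DEAD = "dead"


def classify_host(network_heartbeat, datastore_heartbeat, reports_isolated=False):
    """Simplified view of how the HA master might classify a host.

    network_heartbeat:   master still receives heartbeats over the management network
    datastore_heartbeat: host is still updating its heartbeat datastore
    reports_isolated:    host has flagged itself as isolated via the datastore
    """
    if network_heartbeat:
        # The management network is the primary signal; storage is not consulted.
        return HostState.LIVE
    if datastore_heartbeat:
        # Host is alive but unreachable over the network.
        return HostState.ISOLATED if reports_isolated else HostState.PARTITIONED
    return HostState.DEAD


def master_restarts_vms(state, isolation_response="leave_powered_on"):
    """Whether VMs would be restarted elsewhere in this simplified model."""
    if state is HostState.DEAD:
        return True
    if state is HostState.ISOLATED:
        # Hypothetical setting name; restart only follows if the VMs are stopped first.
        return isolation_response != "leave_powered_on"
    return False  # LIVE or PARTITIONED: no restart


# The scenario from this thread: storage lost, management network still up.
state = classify_host(network_heartbeat=True, datastore_heartbeat=False)
print(state.value, master_restarts_vms(state))
```

In this model, losing only the datastores never changes the host's state, which matches what was observed on 5.1: the host stays "live" as long as the management network heartbeats keep arriving.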
If you want vSphere HA to recover from a storage outage that impacts only a subset of hosts in a cluster, look into VM Component Protection (VMCP), which is included with vSphere 6.0 and later. It protects against storage outages and restarts the affected VMs on other hosts in the cluster.
Thanks for confirming. I was under the impression that the heartbeat datastore offered something else. Looking back on it, I guess it's best, if storage is lost on a host, to be able to confirm first that the storage is OK before moving servers off to other hosts in the environment.
This depends on what exactly happened there. When a host loses connectivity to the datastores, VMs become inaccessible and are restarted on other hosts. Apparently that is not exactly what happened in your case. The VMs were in a semi-failed state, as you could still ping them, and that's why HA didn't restart them.
Not sure what you mean by "if storage is lost ... you want to be able to confirm first that the storage is ok". Storage is either lost or not, so there is no need for additional mechanisms to determine that.