We had an isolation issue on our 4.1 production systems many months ago. Since then, we've not been allowed to use HA to automatically restart VMs on another server. Now that v5.0 is in the mix with datastore heartbeats, I'm pushing to be allowed to re-enable it.
In the event that occurred, our vCenter system became disconnected from our two-host cluster due to a switch outage. The same outage caused one of the hosts in the cluster to become isolated. The VMs were still running on the isolated host, but when the non-isolated host tried to launch the VMs covered under HA, data on the datastore was corrupted because two identical VMs were running against the same data (or rather, both hosts were attempting to start and restart the same VM over and over).
How do the hosts (in an isolation event) prevent data corruption from having multiple VMs accessing the same data at the same time? By that, I mean if we have a host that goes into isolation but the VM continues to run (and continues to access the datastore), and the VM subsequently starts up on another host, what's to prevent both running VMs from causing my datastore to fry? What is the best way to configure HA in this respect? We have several VMs for which a graceful shutdown would be highly desirable (SQL, Exchange), but we have some that can undergo a hard shutdown.
I've read a lot of documentation about HA, but none of it seems to cover the data access aspect of it. If you know of some documentation that does, I would appreciate a link.
Message was edited by: fpineau - Clarified incident to rule out file locks
Thanks, I found that just a few minutes after posting my question. The author describes my scenario exactly (except for the datastore corruption, which may have been only tangentially related to the isolation incident):
When one of the hosts is completely isolated, including the Storage Network, the following will happen:
- Host ESX001 is completely isolated, including the storage network (remember iSCSI/NFS based storage!), but the VMs will not be powered off because the isolation response is set to “leave powered on”.
- After 15 seconds the remaining, non isolated, hosts will try to restart the VMs.
- Because the iSCSI/NFS network is also isolated, the lock on the VMDK will time out and the remaining hosts will be able to boot up the VMs.
- When ESX001 returns from isolation, it will still have the VMX processes running in memory, and this is when you will see a “ping-pong” effect within vCenter; in other words, VMs flipping back and forth between ESX001 and any of the other hosts.
As of version 4.0 Update 2, ESX(i) detects that the lock on the VMDK has been lost and issues a question which is automatically answered. The VM will be powered off to recover from the split-brain scenario and to avoid the ping-pong effect. HA generates an event for this auto-answer mechanism, which is viewable within vCenter.
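The sequence described above can be walked through with a toy model. This is only a sketch for reasoning about the timeline, not how ESX(i) actually implements VMDK locking: the `DiskLock` class, the `LOCK_TIMEOUT_S` value, and the timestamps are all made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical timeout, chosen only for illustration; not a real ESX(i) value.
LOCK_TIMEOUT_S = 15.0


@dataclass
class DiskLock:
    """Toy stand-in for the lock a host holds on a VM's disk."""
    owner: Optional[str] = None
    last_refresh: float = 0.0

    def try_acquire(self, host: str, now: float) -> bool:
        # A lock that has not been refreshed within the timeout is stale
        # and may be taken over by another host.
        stale = self.owner is not None and (now - self.last_refresh) > LOCK_TIMEOUT_S
        if self.owner in (None, host) or stale:
            self.owner, self.last_refresh = host, now
            return True
        return False


def simulate_isolation():
    lock = DiskLock()
    running = set()   # hosts that currently have a copy of the VM running
    events = []

    # ESX001 runs the VM and holds the lock.
    lock.try_acquire("ESX001", now=0.0)
    running.add("ESX001")
    events.append("t=0 ESX001 runs the VM and holds the lock")

    # ESX001 is now isolated (storage included) and can no longer refresh
    # the lock. Its VM is left powered on.
    if not lock.try_acquire("ESX002", now=5.0):
        events.append("t=5 ESX002 restart attempt fails: lock still held")

    # Once the lock has timed out, the restart on ESX002 succeeds.
    if lock.try_acquire("ESX002", now=30.0):
        running.add("ESX002")
        events.append("t=30 lock timed out; ESX002 starts a second copy")

    split_brain = len(running) > 1

    # ESX001 comes back, notices it no longer owns the lock, and powers off
    # its copy (the auto-answered question in the quoted description).
    if lock.owner != "ESX001":
        running.discard("ESX001")
        events.append("t=60 ESX001 sees it lost the lock and powers off its copy")

    return events, running, split_brain


if __name__ == "__main__":
    for line in simulate_isolation()[0]:
        print(line)
```

In this model the split-brain window only opens after the lock has gone stale, and it closes when the returning host finds that it no longer owns the lock.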
It looks like, as of at least v4.0U2 (and especially in v5.0 with datastore heartbeats), this is no longer an issue.
"In the event that occurred, our vCenter system became disconnected from our two-host cluster due to a switch outage."
I know it's not a direct answer to your query, but I believe the best "preventing corruption" solution would have been to utilise a redundant switch and multipathing so that your datastore would have just failed over.
Normally you should never see corruption of VMDK files, at least not under 4.0 or 5.0. I would definitely recommend upgrading to 5.0 or 5.1 (when it is released), though. vSphere HA is a lot smarter in 5.0 / 5.1 when it comes to isolation events and should prevent many of the known issues in smaller environments.
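To make the "smarter" part a bit more concrete, here is a rough sketch of how a second heartbeat channel (the datastore heartbeats mentioned earlier in the thread) can help tell a dead host from one that has only lost its network. The function names, state labels, and restart rule are my own simplification for illustration, not the actual vSphere HA logic.

```python
def classify_host(network_heartbeat: bool, datastore_heartbeat: bool) -> str:
    """Simplified host-state decision using two heartbeat channels."""
    if network_heartbeat:
        return "alive"
    if datastore_heartbeat:
        # No network heartbeat, but the host still updates its datastore
        # heartbeat: it is running, just unreachable over the network.
        return "isolated_or_partitioned"
    return "failed"


def should_restart_vms(state: str, vms_still_running: bool) -> bool:
    """Assumed rule: restart only when no other copy of the VM is running."""
    if state == "failed":
        return True
    if state == "isolated_or_partitioned":
        return not vms_still_running
    return False


if __name__ == "__main__":
    state = classify_host(network_heartbeat=False, datastore_heartbeat=True)
    print(state, should_restart_vms(state, vms_still_running=True))
```

Note that a host that also loses its storage network, as in the scenario quoted earlier, would look "failed" in this model, so the lock-loss power-off still has to act as the safety net.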