VMware Cloud Community
DSeaman
Enthusiast

FT isolation response? secondary VM powers off and we have downtime

We have a 4.1 U1 ESXi cluster with three hosts. HA, DRS and FT are all enabled (FT for just one VM). If we pull the power from one server, HA kicks in and restarts the VMs on another host, the secondary FT VM becomes the primary, and network connectivity is restored in a second or two. Life is good!

But if we pull all network cables from the FT primary host and leave the host running, HA restarts the non-FT VMs on another host, but the secondary FT VM is powered OFF and the primary FT VM keeps running on the isolated host. The VM is then no longer reachable on the network, so it is effectively down until someone intervenes.

I can't imagine FT is designed so that a host isolation leaves the VM without any network connectivity. So this is either a bug, or something is wrong with our configuration. Thoughts?

Derek Seaman
9 Replies
FranckRookie
Leadership

Hi DSeaman,

As long as the first VM is still running and holds the locks on the vmdk files, the secondary can't run. There is no other option but to keep the first VM ON, since you configured it to be always available with FT. And as the hosts can't communicate with each other, how could the VM be switched to another host?

That demonstrates once more that having a VMware cluster without hardening the network could be useless. You must double all uplinks and physical network components to avoid having any trouble when one fails.

Good luck!

Regards

Franck

DSeaman
Enthusiast

But my point is that with "regular" HA the isolation response is to shut down the isolated VMs, thus freeing the lock and allowing them to be restarted on another host. This works perfectly. My problem is that the behavior for FT is NOT the same, and thus actually lowers availability, because it does not automatically power off the primary VM when it is isolated from the network. So the secondary can't take over, and my application is down until the primary host is brought back up or someone intervenes. That's not what I call fault tolerance.

So my question is: is the behavior I'm seeing "as designed" for FT, or is there something wrong with our environment/configuration? I can't imagine that VMware designed it this way.

Derek Seaman
FranckRookie
Leadership

I think what you observed is normal FT behaviour. You could read the document "The Design and Evaluation of a Practical System for Fault-Tolerant Virtual Machines", page 8: "Detecting and Responding to Failure".

Franck

GlennGD
Contributor

When a host is isolated (due to pulling all of the network cables) the heartbeat goes away and it should initiate an FT response. I believe by default the heartbeat is sent over the VMkernel port used for FT. If there is no response from either host within 1 second, FT should be initiated.

On the other hand, to detect HA isolation the host pings the isolation address (by default the gateway of the VMkernel network), and an HA event happens after 12 seconds of being isolated.
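
To make the timing difference concrete, here is a rough sketch in Python. It is only an illustration: the 1-second and 12-second thresholds are simply the figures quoted above, and the constant and function names are made up for this example, not taken from any VMware API.

```python
# Illustrative only: thresholds are the figures quoted in this post,
# and every name here is hypothetical (not a VMware API).
FT_HEARTBEAT_TIMEOUT_S = 1.0    # FT reacts after ~1 s without a heartbeat
HA_ISOLATION_TIMEOUT_S = 12.0   # HA declares isolation after ~12 s


def mechanisms_triggered(seconds_without_network: float) -> list[str]:
    """Return which mechanisms would have reacted after the given outage."""
    triggered = []
    if seconds_without_network >= FT_HEARTBEAT_TIMEOUT_S:
        triggered.append("FT")
    if seconds_without_network >= HA_ISOLATION_TIMEOUT_S:
        triggered.append("HA isolation response")
    return triggered


for outage in (0.5, 2.0, 15.0):
    print(f"{outage:>5.1f} s -> {mechanisms_triggered(outage) or 'nothing yet'}")
```

The point is just that FT has had its chance to react long before HA even declares the host isolated.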

Just curious to know exactly how you're simulating the host failure. Are you pulling all network cables?

DSeaman
Enthusiast

Yes, we are pulling all network cables from the server. The only connectivity left is the fibre channel cabling for the shared storage. When the primary FT host is isolated from the network, the primary FT protected VM continues to run and the secondary VM is powered off. Since the primary VM can't talk to the network, the application is essentially down until the network connected to the primary FT host is back up.

Derek Seaman
GlennGD
Contributor

OK, so now I understand your problem. You have a split-brain situation due to the loss of networking. When this happens both VMs try to become the primary, and the VM holding the lock on the VMDK (the original primary) wins. Since the secondary doesn't have a lock on the file it is shut down, and the primary would try to start a new secondary if another host were reachable.

If you had maintained connectivity on the FT logging network, or had lost shared storage, then the FT scenario would work the way you expect. I think the only way to safeguard yourself from this type of failure is to ensure you have redundant network connectivity.
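
The tie-break described above can be pictured with a toy model in Python. This is only a sketch under simplifying assumptions: it ignores the VM network entirely, reduces the race to "whoever holds the VMDK lock wins", and all class and function names are hypothetical rather than anything from VMware.

```python
# Toy model of the tie-break described above. All names are hypothetical;
# this is not VMware code, just the "whoever holds the VMDK lock wins" rule.
from dataclasses import dataclass


@dataclass
class FtPair:
    ft_logging_link_up: bool      # can primary and secondary still talk?
    primary_host_alive: bool      # is the primary's host still running?
    primary_sees_storage: bool    # can the primary still reach the VMDK?


def resolve(pair: FtPair) -> str:
    """Decide which copy ends up as the running primary."""
    if (pair.ft_logging_link_up and pair.primary_host_alive
            and pair.primary_sees_storage):
        return "no failover needed"
    # Something failed: the lock on the VMDK decides who stays primary.
    # The original primary keeps it only if it is alive and sees storage.
    primary_holds_lock = pair.primary_host_alive and pair.primary_sees_storage
    if primary_holds_lock:
        return "primary keeps running, secondary is powered off"
    return "secondary takes the lock and becomes primary"


# The case from this thread: cables pulled, host up, fibre channel intact.
isolated = FtPair(ft_logging_link_up=False,
                  primary_host_alive=True,
                  primary_sees_storage=True)
print(resolve(isolated))  # primary keeps running, secondary is powered off
```

Flip `primary_host_alive` or `primary_sees_storage` to `False` and the secondary wins instead, which matches the power-pull test from the first post.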

DSeaman
Enthusiast

I guess I'm failing to understand why this is by design, if that's the case. HA would shut down the isolated VM and restart it on another host, minimizing downtime. FT, from what you are saying, keeps the isolated primary VM running, and thus end users see the application as being down. This is a permanent situation until the underlying networking/hardware problem is fixed. This means regular HA gives better availability here, since it powers on the VM on a working host, but FT does not. So I'm failing to understand how FT is highly available if a split-brain situation causes downtime. HA handles split brain smartly (if you have the host isolation response set to shut down), but FT does not.

Derek Seaman
GlennGD
Contributor

The only reason the primary stays the primary is that it still has access to the shared storage but cannot reach the other host via the FT Logging VMkernel port. If you have a redundant network connection this situation doesn't happen unless you have multiple failures. I think VMware feels that if you have an application important enough to run FT on, then you would make the switches and NICs redundant too.

FT does have some limitations, like your situation. I guess you just have to look at the pros and cons. FT would give you zero downtime if your primary VM failed, the primary host failed, or the primary lost connectivity to your storage. FT would not work (due to split-brain) if you lost the FT network between the hosts while still maintaining storage connectivity.

Gleed
VMware Employee

Correct, when FT is enabled for a VM it gets excluded from HA, which is why it doesn't get shut down as part of the isolation response. The theory behind this is that since FT provides near-instantaneous failover protection, it would already have responded to an outage by the time HA responds (in the ~16 second timeframe).

You are looking at a specific corner case in which the decision to exclude FT protected VMs from HA actions can backfire. The key is to understand the risk and design the network with the required redundancy to avoid getting into this situation.
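
As a rough illustration of that exclusion, here is a small Python sketch. It is not how HA is implemented; the `Vm` class, the function name and the VM names are all hypothetical and exist only to show which VMs the isolation response would touch.

```python
# Sketch of the exclusion described above; names are hypothetical,
# not a VMware API.
from dataclasses import dataclass


@dataclass
class Vm:
    name: str
    ft_protected: bool


def vms_hit_by_isolation_response(vms_on_isolated_host: list[Vm]) -> list[str]:
    """HA's isolation response only applies to VMs that are not FT protected."""
    return [vm.name for vm in vms_on_isolated_host if not vm.ft_protected]


vms = [Vm("web01", False), Vm("db01-ft-primary", True), Vm("app01", False)]
print(vms_hit_by_isolation_response(vms))  # ['web01', 'app01']
```

The FT primary is skipped, so it keeps running (and keeps its lock) on the isolated host, which is the corner case discussed in this thread.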
