FT behavior during datastore outage

tehberg · 08-01-2012

Hi all,

I have a question regarding a datastore failure on a particular host, and the behavior of FT in this scenario.

I have two VM hosts. Each has two vSwitches and is running 5.0 U1.

vSwitch0 has my management/FT/vMotion vmkernel ports, connected via four 1Gb interfaces to separate switches, two interfaces per switch. Load balancing is set to route based on originating virtual port ID.

vSwitch1 has my VM portgroups and my NFS vmkernel port. There are two 10Gb ports in this vSwitch, each going from a distinct card to a separate Nexus 7000. Same load-balancing policy, no EtherChannels.

All storage lives on an NFS datastore.

Let's say that Host #1's 10Gb NICs freeze (crappy NC522SFP cards), so access to the NFS datastore is halted. However, I believe HA would not trigger, because the HA agent can still talk to the master via vSwitch0, so the condition of both the heartbeat and the network being gone would not be met. As I understand it, both have to be met for HA to trigger.
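To make that reasoning concrete, here is a toy Python model of the condition as I understand it. It is only a sketch: the class, field, and function names are made up for illustration, and it is not how the HA agent is actually implemented.

```python
from dataclasses import dataclass


@dataclass
class HostState:
    # Hypothetical flags, named for this example only.
    heartbeat_ok: bool  # HA agent can still talk to the master (vSwitch0)
    network_ok: bool    # management network is still reachable (vSwitch0)
    storage_ok: bool    # NFS datastore is reachable (vSwitch1)


def ha_would_trigger(state: HostState) -> bool:
    """Assumed rule from the post: HA acts only when BOTH the heartbeat
    and the network are gone. Storage state is not part of the rule."""
    return (not state.heartbeat_ok) and (not state.network_ok)


# The scenario in question: the 10Gb NICs freeze, so only storage is lost.
frozen_10gb_nics = HostState(heartbeat_ok=True, network_ok=True, storage_ok=False)
print(ha_would_trigger(frozen_10gb_nics))  # False
```

Under this assumed rule, losing only the storage path never satisfies the condition, which is exactly the gap I'm worried about.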

In this case, would the FT secondary take over because the primary begins to hang/freeze/freak out? Or would we end up in some weird state where FT replication continues but the VM never actually fails over? I know the two VMs in the FT pair heartbeat each other; would the loss of disk cause that heartbeat to be lost?

I feel like this is a weird edge case where FT may not provide protection since there is technically not an HA event here.

Thoughts? Design flaws?

Thanks.

Footnote:

I'm thinking that adding a management vmkernel port to vSwitch1 might be in order, to give HA more interfaces to check. But if I did that, the host would technically still be up, since the other management port works. Or should I move all management ports to the same vSwitch as the IP storage? If HA did trigger when those NICs died, I'd have to set my host isolation response to shut down all VMs, and if the NICs then recovered, that would be a bad thing.

prashantd · 08-13-2012

FT won't detect a datastore or VM network outage and won't trigger a failover unless the datastore outage results in a Primary VM crash. There have been numerous feature requests to detect storage failure and trigger FT failover if storage on the other side is healthy. If you are attending VMworld, you may be interested in https://vmworld2012.activeevents.com/connect/sessionDetail.ww?SESSION_ID=2807
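As a rough illustration of the behavior described above, the following Python sketch encodes only what this reply states. The function and parameter names are hypothetical, and other FT failover triggers (such as a host failure) are deliberately left out.

```python
def ft_would_fail_over(datastore_lost: bool,
                       vm_network_lost: bool,
                       primary_vm_crashed: bool) -> bool:
    """Simplified rule from this reply: FT does not look at datastore or
    VM network health; it fails over only if the Primary VM crashes."""
    # datastore_lost and vm_network_lost are accepted only to show that
    # they do not influence the decision on their own.
    return primary_vm_crashed


# Datastore outage, but the Primary VM merely hangs: no failover.
print(ft_would_fail_over(datastore_lost=True, vm_network_lost=False,
                         primary_vm_crashed=False))  # False

# Datastore outage that ends up crashing the Primary VM: failover.
print(ft_would_fail_over(datastore_lost=True, vm_network_lost=False,
                         primary_vm_crashed=True))   # True
```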