VMware Cloud Community
hennish
Hot Shot

Why can't ESXi and/or HA handle APD situations properly?

Hi. I'm designing a vSphere 5.1 solution for a customer that wants high uptime for its VMs and services (like everyone else), and wants to achieve this by placing two sets of ESXi hosts, EMC VPLEXes and VNXes in two different sites with low latency between them.

Most of the crash test cases will work well, but in the case of an APD caused by, for example, both FC switches crashing or zoning getting misconfigured, nothing will make the zombie VMs (which no longer have disk access) get restarted in the second site.

The KB article Permanent Device Loss (PDL) and All-Paths-Down (APD) in vSphere 5.0 talks about the difference between PDL and APD.

If the issue results in a PDL, everything is fine and the VMs can get killed and restarted by HA if the settings "disk.terminateVMOnPDLDefault" and "das.maskCleanShutdownEnabled" are set to True. (More info in http://www.yellow-bricks.com/2012/04/25/what-is-das-maskcleanshutdownenabled-about/)

But in an APD situation, nothing will happen(!) The ESXi host(s) will go into an endless loop (or one lasting 140 seconds if "Misc.APDHandlingEnable" is set to True), become completely useless and mostly unreachable, disconnect from the vCenter Server, and prevent vMotion, yet they will insist on keeping their VMs running.

The main problem here is that HA will (if I understand correctly) just sit by and watch, since all it cares about is the hosts' FDM heartbeats, not the health of the VMs or their network or storage connectivity. This holds even if VM monitoring is enabled, since all that checks is the VMware Tools-to-host heartbeat.

Perhaps we have to invent a VM monitoring app that monitors the VM's disks and interfaces with HA Application Monitoring.
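To make the idea concrete, here is a rough in-guest sketch in Python. It only shows the gating logic: a heartbeat is sent only while every monitored disk accepts a small write. The names `disk_is_writable`, `heartbeat_if_healthy` and the `send_heartbeat` callable are made up for illustration; `send_heartbeat` stands in for whatever interface HA Application Monitoring actually exposes, which is not shown here. Whether HA could act on a missing heartbeat while the host itself is hung in APD is a separate, open question.

```python
import os
import threading


def disk_is_writable(directory, timeout_s=5.0):
    """Write and fsync a small file; a hang longer than timeout_s counts as failure."""
    result = {"ok": False}

    def _write():
        path = os.path.join(directory, ".disk_probe")
        try:
            with open(path, "w") as f:
                f.write("ok")
                f.flush()
                os.fsync(f.fileno())  # force the write down to the device
            result["ok"] = True
        except OSError:
            result["ok"] = False

    # Daemon thread so a write that never returns cannot block process exit.
    worker = threading.Thread(target=_write, daemon=True)
    worker.start()
    worker.join(timeout_s)
    return (not worker.is_alive()) and result["ok"]


def heartbeat_if_healthy(directories, send_heartbeat, check=disk_is_writable):
    """Call send_heartbeat() only when every monitored directory passes the check.

    send_heartbeat is a hypothetical callable supplied by the caller.
    """
    healthy = all(check(d) for d in directories)
    if healthy:
        send_heartbeat()
    return healthy
```

Run in a loop at a fixed interval, this would let the heartbeats stop as soon as a disk stops responding, which is the signal the monitoring side would have to react to.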

Please tell me if there is something I am missing or misunderstanding in my story above.

Suggestion: Make HA check the actual health of the VMs, such as their network and storage connectivity, rather than only the FDM heartbeats.

Thanks in advance!

2 Replies
hennish
Hot Shot

I think I found the answer: This can't be solved in today's version, but might be solvable in an upcoming one:

http://www.yellow-bricks.com/2012/09/05/inf-bco2807-vsphere-ha-and-datastore-access-outages/

I also viewed INF-BCO2807 briefly and concluded the same thing that Duncan wrote in his post.

3CV
Enthusiast

We had a situation where VMs in an APD state needed to shut down quickly, so that the redundant partner of each VM, running on a second host and SAN, could take over the service. I had to script it: a script runs from crontab every minute and checks that each SAN's datastores are present by writing a small text file to them. An exit code of 0 means success; anything else shuts down the VMs on that SAN. It works perfectly, with VMs powering off in less than 45 seconds.
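The original script was not posted, so the following is only a sketch of the approach described above, written in Python under a few assumptions. Each datastore is probed by a child process that writes a small text file; exit code 0 means success, and a timeout is treated as failure because a write to a datastore in APD may never return. The names `probe_datastore`, `check_and_fence` and the `power_off` callable are hypothetical; `power_off` stands in for whatever command actually powers off a VM in your environment.

```python
import os
import subprocess
import sys

# Tiny program run in a child process: write a small text file to the given path.
PROBE_SNIPPET = (
    "import sys\n"
    "with open(sys.argv[1], 'w') as f:\n"
    "    f.write('probe')\n"
)


def probe_datastore(datastore_path, timeout_s=10.0):
    """Return 0 if a small file can be written to the datastore, non-zero otherwise."""
    probe_file = os.path.join(datastore_path, ".apd_probe.txt")
    child = subprocess.Popen(
        [sys.executable, "-c", PROBE_SNIPPET, probe_file],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        return child.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        child.kill()  # best effort; a process stuck in I/O may not die at once
        return 1


def check_and_fence(datastores, power_off, probe=probe_datastore):
    """Power off the VMs on every datastore whose probe fails.

    datastores: dict mapping a datastore path to the list of VM names on it.
    power_off:  hypothetical callable taking one VM name.
    Returns the list of VM names that were powered off.
    """
    powered_off = []
    for path, vms in datastores.items():
        if probe(path) != 0:
            for vm in vms:
                power_off(vm)
                powered_off.append(vm)
    return powered_off
```

Scheduled once a minute from crontab, a check like this bounds the detection time by the cron interval plus the probe timeout, which is in line with the sub-45-second power-off mentioned above only if the timeout is kept short.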
