2 Replies Latest reply on Nov 15, 2012 12:40 AM by hennish

    Why can't ESXi and/or HA handle APD situations properly?

    hennish Hot Shot
    vExpert

      Hi. I'm designing a solution for a customer that wants high uptime for its VMs and services (like everyone else) and wants to achieve this by placing two sets of ESXi hosts, EMC VPLEXes and VNXes in two different (low-latency) sites.

       

      Most of the crash test cases will work well, but in the case of an APD caused by, for example, both FC switches crashing or being misconfigured, nothing will cause the zombie VMs (which no longer have disk access) to be restarted in the second site.

       

      The KB article "Permanent Device Loss (PDL) and All-Paths-Down (APD) in vSphere 5.0" describes the difference between PDL and APD.

       

      If the issue results in a PDL, everything is fine: the VMs can be killed and restarted by HA if the settings "disk.terminateVMOnPDLDefault" and "das.maskCleanShutdownEnabled" are both set to True. (More info at http://www.yellow-bricks.com/2012/04/25/what-is-das-maskcleanshutdownenabled-about/)

       

      But in an APD situation, nothing will happen(!). The ESXi host(s) will go into an endless loop (or one of 140 seconds if "Misc.APDHandlingEnable" is set to True), become completely useless and largely unreachable, disconnect from the vCenter Server and prevent vMotion, yet they will insist on keeping their VMs running.

       

      The main problem here is that HA will (if I understand correctly) just sit by and watch, since all it cares about is the hosts' FDM heartbeats, not the health of the VMs or their network or storage connectivity. This holds even if VM monitoring is enabled, since all that checks is the VMware Tools-to-host heartbeat.
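
      To make the gap concrete, here is a toy model of the behavior as I have described it in this post. It is only a sketch of my understanding, not how FDM is actually implemented; the function name, the argument names and the "OK"/"PDL"/"APD" strings are made up for illustration, and host isolation handling is ignored.

```python
def ha_restarts_vm(host_heartbeat_lost, storage_state,
                   terminate_on_pdl=False, mask_clean_shutdown=False):
    """Toy model: True if HA would restart the VM on another host.

    storage_state is one of "OK", "PDL" or "APD" (labels invented here).
    The two flags stand in for disk.terminateVMOnPDLDefault and
    das.maskCleanShutdownEnabled.
    """
    if host_heartbeat_lost:
        # Classic host failure: the only signal HA reacts to by default.
        return True
    if storage_state == "PDL":
        # The VM is killed and restarted only if both settings are True.
        return terminate_on_pdl and mask_clean_shutdown
    # "OK" or "APD": the host keeps sending heartbeats, so nothing happens.
    return False
```

      With both settings enabled, the "PDL" case yields a restart while the "APD" case does not, which is exactly the hole I am asking about.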

       

      Perhaps we have to invent an app that monitors the VM's disks and interfaces with HA Application Monitoring.
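
      A minimal sketch of what such an in-guest app could look like, in Python. It assumes the guest disk hangs or returns errors when the host loses all paths, and it withholds the heartbeat when a small write does not complete in time. The names disk_responds, watchdog_tick and send_heartbeat are hypothetical; send_heartbeat is a placeholder for whatever call the real application monitoring interface provides, which is not shown here.

```python
import os
import tempfile
import threading


def disk_responds(directory, timeout_s=5.0):
    """Return True if a small write + fsync in `directory` finishes in time."""
    result = {"ok": False}

    def probe():
        try:
            fd, path = tempfile.mkstemp(dir=directory)
            try:
                os.write(fd, b"probe")
                os.fsync(fd)  # push the write down to the virtual disk
            finally:
                os.close(fd)
                os.unlink(path)
            result["ok"] = True
        except OSError:
            result["ok"] = False

    # Run the probe in a daemon thread so hung I/O cannot block the caller.
    worker = threading.Thread(target=probe, daemon=True)
    worker.start()
    worker.join(timeout_s)
    return result["ok"] and not worker.is_alive()


def watchdog_tick(directory, send_heartbeat, timeout_s=5.0):
    """Send one heartbeat only if the disk probe succeeds.

    `send_heartbeat` is a hypothetical callable supplied by the caller.
    """
    healthy = disk_responds(directory, timeout_s)
    if healthy:
        send_heartbeat()
    return healthy
```

      Calling watchdog_tick in a loop, at an interval shorter than the monitoring timeout, would let the heartbeat stop as soon as the disk stops responding. Whether the resulting VM reset would land on a host that still has storage access is a separate question that this sketch does not answer.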

       

      Please tell me if there is something I am missing or misunderstanding in my story above.

       

       

      Request/suggestion:


      Make HA check the actual health of the VMs, such as their network and storage connectivity, rather than only the FDM heartbeats.

       

      Thanks in advance!