VMware Cloud Community
virtualInceptio
Contributor

What happens to guests on an ESXi host with LUN storage during failure scenarios?

I have not been able to find information about how guest integrity is maintained when a single ESXi server with LUN storage experiences a failure, or about what happens to the guest virtual machines that were running at the time.

I'd like to hear feedback on how worried I should be about guest virtual machine integrity when using a LUN to store the guests.  I get the impression the answer might be "don't worry about it, but still make an effort to back up the guest operating systems and the data held within them."

I understand there are preventive measures to keep the following failures from happening, so there is no need to spend time discussing how to prevent them.  The scenarios also lead into HA discussions, and it is not necessary to cover HA solutions.  The one exception is how guest integrity is maintained when a single ESXi host and network storage server, which cannot provide fault tolerance on their own, are restarted.  Tips on configuration to maintain guest integrity are welcome.

The hypothetical setup is the free ESXi 6.0u1 hypervisor on a server with a LUN on network storage, most likely connected over iSCSI.

Some failure scenarios as talking points:

  1. Power failure or hard power/off to both the host and network storage and the host was not powered down and guests were active.  (preventing unexpected power outage is understood so it can be omitted in the discussion)
  2. Network storage power failure/off without shutting down ESXi host or guest VMs. (HA is not the concern here.)
  3. Host power failure/off without shutting down ESXi host or guests.
  4. Host and Network Storage are online but the connection between them is lost.  The loss of connection could be intermittent but long enough to exceed the timeouts that connection fault handling can tolerate.
2 Replies
deepaknegiee
Contributor

I'm not sure I've fully understood your question; however, I understand that it is specific to failure scenarios in an ESXi environment.

  1. Power failure or hard power/off to both the host and network storage and the host was not powered down and guests were active.  (preventing unexpected power outage is understood so it can be omitted in the discussion)

The VMs will be restarted once the heartbeat between your ESXi hosts is lost (default 15 seconds). FDM takes care of this operation regardless of vCenter availability, since HA runs independently of vCenter.

  2. Network storage power failure/off without shutting down ESXi host or guest VMs. (HA is not the concern here.)

  3. Host power failure/off without shutting down ESXi host or guests.

HA will restart the virtual machines on other ESXi hosts, provided they have enough capacity.

  4. Host and Network Storage are online but the connection between them is lost.  The loss of connection could be intermittent but long enough to exceed the timeouts that connection fault handling can tolerate.

If you have

VMware KB:    Understanding High Availability Host Isolation Response with Network Attached Storage

Host network isolation occurs when a host is still running but it can no longer communicate with other hosts in the cluster and it cannot ping the configured isolation addresses. When the HA agent on a host loses contact with the other hosts, it will ping the isolation addresses. If the pings fail, the host will declare itself isolated.
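As a rough illustration of the rule quoted above, the isolation check can be sketched in Python. This is only a model of the described behavior, not the HA agent's actual code: the function name, its parameters, and the `ping` callback are hypothetical, and the addresses in the usage note are made up.

```python
from typing import Callable, Iterable


def is_host_isolated(
    sees_other_hosts: bool,
    isolation_addresses: Iterable[str],
    ping: Callable[[str], bool],
) -> bool:
    """Model of the isolation rule: a host declares itself isolated only
    when it has lost contact with the other cluster hosts AND every ping
    to its configured isolation addresses fails."""
    if sees_other_hosts:
        # Still in contact with the cluster, so no isolation check is needed.
        return False
    # One reachable isolation address is enough to rule out isolation.
    return not any(ping(address) for address in isolation_addresses)
```

For example, with a hypothetical gateway address `192.168.1.1`, `is_host_isolated(False, ["192.168.1.1"], lambda a: False)` reports isolation, while the same call with a `ping` that succeeds does not.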


HA Response Time

In VMware vSphere 5.x, if the agent is a master, then isolation is declared in 5 seconds. If it is a slave, isolation is declared in 30 seconds.

In vSphere 4.x, isolation is declared in 12 seconds after heartbeats have ceased to arrive. 15 seconds after the start of the isolation event, other hosts in the cluster consider that the isolated host has failed and will initiate the isolation response workflow. You can change these default timeout values using VMware HA advanced options in VMware vCenter Server. The default isolation response is set to "shutdown".
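The timings quoted above can be summarized in a small lookup. This is a sketch built only from the numbers in the quoted KB text; the function name and the version and role strings are hypothetical labels, and versions the text does not mention (including 6.x) are deliberately rejected rather than guessed.

```python
def isolation_declaration_delay(version: str, role: str = "slave") -> int:
    """Return the seconds before a host declares itself isolated,
    using only the default values quoted in the text above."""
    if version == "4.x":
        # vSphere 4.x: 12 seconds after heartbeats stop arriving.
        return 12
    if version == "5.x":
        # vSphere 5.x: the delay depends on the agent's role.
        delays = {"master": 5, "slave": 30}
        if role not in delays:
            raise ValueError(f"unknown role: {role!r}")
        return delays[role]
    # The quoted text gives no figures for other versions.
    raise ValueError(f"no documented default for version {version!r}")
```

These are defaults only; as the text notes, they can be changed through the HA advanced options.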

HA Response Types

Leave powered on – When a network isolation occurs on the host, the state of the virtual machines remains unchanged and the virtual machines on the isolated host continue to run even if the host can no longer communicate with other hosts in the cluster. This setting also reduces the chances of a false positive. A false positive in this case is an isolated heartbeat network, but a non-isolated virtual machine network and a non-isolated iSCSI/NFS network. Should the host become unresponsive or fail and can no longer access/run the virtual machines, the virtual machines will be registered and powered on by another running host in the cluster. By default, the isolated host leaves its virtual machines powered on.

Power off – When a network isolation occurs, all virtual machines are powered off and restarted on another ESXi host. It is a hard stop. A power off response is initiated on the fourteenth second and a restart is initiated on the fifteenth second.

Shut down – When a network isolation occurs, all virtual machines running on that host are shut down via VMware Tools and restarted on another ESXi host. If this is not successful within 5 minutes, a power off response type is executed.
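The three response types can be compared side by side with a short Python model. It is a sketch of the behavior described above, not VMware code: the enum, the function, and the `guest_shutdown_seconds` parameter are hypothetical names, and the only figure taken from the text is the 5-minute shutdown limit.

```python
from enum import Enum
from typing import List, Optional


class IsolationResponse(Enum):
    LEAVE_POWERED_ON = "leave powered on"
    POWER_OFF = "power off"
    SHUT_DOWN = "shut down"


SHUTDOWN_TIMEOUT_SECONDS = 5 * 60  # limit quoted for the "Shut down" response


def vm_actions(
    response: IsolationResponse,
    guest_shutdown_seconds: Optional[float] = None,
) -> List[str]:
    """List what happens to a VM on an isolated host for each response.

    guest_shutdown_seconds is how long the guest takes to shut down via
    VMware Tools; None means the guest shutdown never completes.
    """
    if response is IsolationResponse.LEAVE_POWERED_ON:
        return ["keep running on the isolated host"]
    if response is IsolationResponse.POWER_OFF:
        return ["hard power off", "restart on another host"]
    # SHUT_DOWN: try a clean guest shutdown first.
    actions = ["guest shutdown via VMware Tools"]
    timed_out = (
        guest_shutdown_seconds is None
        or guest_shutdown_seconds > SHUTDOWN_TIMEOUT_SECONDS
    )
    if timed_out:
        # Fall back to a hard stop once the 5-minute limit is exceeded.
        actions.append("hard power off")
    actions.append("restart on another host")
    return actions
```

From a guest-integrity point of view, the difference is that only "Shut down" gives the guest operating system a chance to flush and close its disks cleanly; "Power off" is equivalent to pulling the plug on the VM.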

For more information, see:

vSphere High Availability (HA) Technical Deepdive - Yellow Bricks

virtualInceptio
Contributor

Thank you for taking the time to answer the question.  Your answer assumed that vCenter was in use and that more than one ESXi server was providing HA.  I was wondering what happens with no vCenter and only one ESXi server, i.e. the free version of the hypervisor.  I am going to assume that VMware has taken this into consideration and that VM disk transactions are handled well enough to maintain VM integrity.  If they were not, the VM would fail to boot on the same or another ESXi host in an HA environment.
