Benb_2007
Contributor

VI3 HA and locked vmdk

Hello,

I need some help understanding the HA process.

When a VM runs from a vmdk on a SAN, ESX places a lock on that file to prevent corruption by other processes. If the host crashes, the vmdk file is still locked, yet the VM starts on another host!

So how does HA work? What does it do to free the vmdk?

Thanks,

MR-T
Immortal

VMFS has a mechanism for handling exactly this situation.

In VMFS, such cross-host synchronisation is handled by the distributed locking mechanism through the use of an ‘on-disk heartbeat structure’.

The heartbeat structure maintains the lock states as well as the pointer to the journal information for a host to be replayed in the event that a host crashes.

In order to deal with possible crashes of hosts, the distributed locks are implemented as lease-based. Each host that uses a LUN has its own heartbeat region. The idea behind heartbeat is to indicate “liveness” of a host, so that if a host dies whilst holding a lock, the lock can be released by another host.
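To make the lease idea concrete, here is a minimal toy sketch in Python. It is not VMware code and does not reflect the real VMFS on-disk layout: the class `SharedVolume`, its methods, and the `LEASE_SECONDS` value are all hypothetical names and numbers chosen for illustration. It only shows the principle described above: a lock held by a host whose heartbeat has gone stale can be taken over by another host.

```python
from typing import Dict, Optional

# Hypothetical lease length for this toy model; NOT an actual VMFS value.
LEASE_SECONDS = 10.0


class SharedVolume:
    """Toy stand-in for a shared LUN: one heartbeat slot per host plus file locks."""

    def __init__(self) -> None:
        self.heartbeats: Dict[str, float] = {}          # host -> time of last heartbeat
        self.lock_owner: Dict[str, Optional[str]] = {}  # file path -> owning host

    def heartbeat(self, host: str, now: float) -> None:
        """A live host periodically refreshes its own heartbeat slot."""
        self.heartbeats[host] = now

    def is_alive(self, host: str, now: float) -> bool:
        """A host counts as alive only while its last heartbeat is within the lease."""
        last = self.heartbeats.get(host)
        return last is not None and (now - last) < LEASE_SECONDS

    def acquire(self, path: str, host: str, now: float) -> bool:
        """Try to lock `path` for `host`; break the lock if its owner's lease expired."""
        owner = self.lock_owner.get(path)
        if owner is None or owner == host:
            self.lock_owner[path] = host
            return True
        if self.is_alive(owner, now):
            return False  # owner is still heartbeating, so the lock is honoured
        # Owner's heartbeat is stale: per the description above, a real file
        # system would replay the dead host's journal before taking the lock.
        self.lock_owner[path] = host
        return True


vol = SharedVolume()
vol.heartbeat("esx1", now=0.0)
vol.acquire("vm1.vmdk", "esx1", now=0.0)          # esx1 holds the lock
blocked = vol.acquire("vm1.vmdk", "esx2", now=5.0)   # esx1 still within its lease
taken = vol.acquire("vm1.vmdk", "esx2", now=20.0)    # esx1 stopped heartbeating
```

In this sketch `blocked` is False and `taken` is True, which mirrors the HA case: once the crashed host stops updating its heartbeat, another host can release the stale lock and power on the VM.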

esiebert7625
Immortal

Here you go...if you find this post helpful, please award points using the Helpful/Correct buttons...thanks

How does the HA (High Availability) feature work?

VMware HA continuously monitors all ESX Server hosts in a cluster and detects failures. An agent placed on each host maintains a "heartbeat" with the other hosts in the cluster, and loss of a heartbeat initiates the process of restarting all affected virtual machines on other hosts.

You create and manage clusters using VirtualCenter. The VirtualCenter Management Server places an agent on each host in the cluster so that each host can communicate with the other hosts to maintain state information and know what to do if another host fails. (The VirtualCenter Management Server is not a single point of failure.) If the VirtualCenter Management Server host goes down, HA functionality changes as follows: HA clusters can still restart virtual machines on other hosts in case of failure; however, the information about what extra resources are available will be based on the state of the cluster before the VirtualCenter Management Server went down.

HA monitors whether sufficient resources are available in the cluster at all times in order to be able to restart virtual machines on different physical host machines in the event of host failure. Safe restart of virtual machines is made possible by the locking technology in the ESX Server storage stack, which allows multiple ESX Servers to have access to the same virtual machine files simultaneously.

Host failure detection occurs 15 seconds after the HA service on a host has stopped sending heartbeats to the other hosts in the cluster. A host stops sending heartbeats if it is isolated from the network. At that time, other hosts in the cluster treat this host as failed, while this host declares itself isolated from the network. By default, the isolated host powers off its virtual machines. These virtual machines can then successfully fail over to other hosts in the cluster. If the isolated host has SAN access, it retains the disk lock on the virtual machine files, and any attempt to fail over the virtual machine to another host fails. The virtual machine continues to run on the isolated host. VMFS disk locking prevents simultaneous write operations to the virtual machine disk files and potential corruption.

If the network connection is restored before 12 seconds have elapsed, other hosts in the cluster will not treat this as a host failure. In addition, the host with the transient network connection problem does not declare itself isolated from the network and continues running. In the window between 12 and 14 seconds, the clustering service on the isolated host declares itself as isolated and starts powering off virtual machines with default isolation response settings. If the network connection is restored during that time, the virtual machine that had been powered off is not restarted on other hosts because the HA services on the other hosts do not consider this host as failed yet. As a result, if the network connection is restored in this window between 12 and 14 seconds after the host has lost connectivity, the virtual machines are powered off but not failed over.
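The timing windows described above can be summarised in a small Python sketch. This is only an illustration of the 12, 14, and 15 second figures quoted in this post, assuming the default isolation response (power off); the function name `outage_outcome` and the result labels are hypothetical, and the text does not say what happens between 14 and 15 seconds, so the sketch reports that range as unspecified rather than guessing.

```python
ISOLATION_START = 12.0   # isolated host starts declaring itself isolated
ISOLATION_END = 14.0     # end of the window quoted in the text
FAILURE_DETECTION = 15.0 # other hosts treat the host as failed


def outage_outcome(restored_after: float) -> str:
    """Classify a network outage by how many seconds passed before it was restored.

    Assumes the default isolation response (power off virtual machines).
    """
    if restored_after < ISOLATION_START:
        # Neither side reacts; the host keeps running its virtual machines.
        return "no_action"
    if restored_after <= ISOLATION_END:
        # Host powered off its VMs, but the other hosts do not yet see a failure.
        return "powered_off_not_failed_over"
    if restored_after >= FAILURE_DETECTION:
        # Other hosts treat the host as failed and start restarting its VMs.
        return "treated_as_failed"
    # The source text does not describe the 14 to 15 second range.
    return "unspecified"


for seconds in (5, 13, 14.5, 20):
    print(seconds, outage_outcome(seconds))
```

The 12 to 14 second case is the one to watch: the virtual machines end up powered off without being restarted anywhere.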

For more information on HA see http://download3.vmware.com/vmworld/2006/tac9413.pdf and http://kb.vmware.com/KanisaPlatform/Publishing/894/2956923_f.SAL_Public.html and http://www.vmware.com/pdf/vmware_ha_wp.pdf

Benb_2007
Contributor

Thank you both for these answers.

-> MR-T, do you have a link or more information about lock leasing?

-> I'm also looking for the VMFS-3 on-disk structure. Is the format open? Is there a way to get all the fields of the VMFS-3 structure?

Thanks,
