dsolerdelcampo
Enthusiast

Tuning HA

We have two main buildings with an extended VLAN between them. We have a different group of ESX servers in each building, and we have defined an HA cluster with the ESX servers in each building.

We have had three big problems in the last month with this configuration, and today we have temporarily removed HA. Every time, the problem occurred after a LAN problem. The first time was due to an upgrade of the DWDM used to connect the LAN and SAN in both buildings. The second and third times were due to a problem with the main LAN core in one building. The result was that all the virtual machines in that building were powered off and powered on again over a period of between 10 and 30 minutes. But we also had problems with the ESX servers in the other building, where no LAN problem had occurred. All the LAN adapters of an ESX server in one building are connected to the LAN core in the same building.

What do I want? I would like to hear about your experience with a configuration similar to ours, and we would also like to know whether it is possible to fine-tune the HA cluster parameters in order to delay the cluster's failure detection. With 2.5.x we did not have HA, and a LAN problem lasting 1 minute was almost harmless, but now the solution is worse than the problem. We have changed the Isolation Response to the Leave powered on option, but I do not know if that is the best solution.

I do not know; maybe we do not have the right configuration.

Any suggestions will be appreciated. Thanks in advance.

esiebert7625
Immortal

No, it is not possible to change the default connection timeout of 12 seconds that triggers HA. If you have frequent LAN problems that result in lost connectivity for more than 10 seconds, then HA will not work that well for you. This is not a common scenario in the enterprise production environments that HA was designed for. If you leave the Isolation Response as Leave powered on, you run the risk of having two instances of the same machine powered on at the same time when connectivity is restored. Below is a description of how HA works.

VMware HA continuously monitors all ESX Server hosts in a cluster and detects failures. An agent placed on each host maintains a "heartbeat" with the other hosts in the cluster, and loss of a heartbeat initiates the process of restarting all affected virtual machines on other hosts. You create and manage clusters using VirtualCenter. The VirtualCenter Management Server places an agent on each host in the cluster so each host can communicate with other hosts to maintain state information and know what to do in case of another host's failure. (The VirtualCenter Management Server is not a single point of failure.) If the VirtualCenter Management Server host goes down, HA functionality changes as follows. HA clusters can still restart virtual machines on other hosts in case of failure; however, the information about what extra resources are available will be based on the state of the cluster before the VirtualCenter Management Server went down. HA monitors whether sufficient resources are available in the cluster at all times in order to be able to restart virtual machines on different physical host machines in the event of host failure. Safe restart of virtual machines is made possible by the locking technology in the ESX Server storage stack, which allows multiple ESX Servers to have access to the same virtual machine files simultaneously.

Host failure detection occurs 15 seconds after the HA service on a host has stopped sending heartbeats to the other hosts in the cluster. A host stops sending heartbeats if it is isolated from the network. At that time, other hosts in the cluster treat this host as failed, while this host declares itself as isolated from the network. By default, the isolated host powers off its virtual machines. These virtual machines can then successfully fail over to other hosts in the cluster. If the isolated host has SAN access and leaves its virtual machines running, it retains the disk lock on the virtual machine files, and attempts to fail over the virtual machine to another host fail. The virtual machine continues to run on the isolated host. VMFS disk locking prevents simultaneous write operations to the virtual machine disk files and potential corruption.

If the network connection is restored before 12 seconds have elapsed, other hosts in the cluster will not treat this as a host failure. In addition, the host with the transient network connection problem does not declare itself isolated from the network and continues running. In the window between 12 and 14 seconds, the clustering service on the isolated host declares itself as isolated and starts powering off virtual machines with default isolation response settings. If the network connection is restored during that time, the virtual machine that had been powered off is not restarted on other hosts because the HA services on the other hosts do not consider this host as failed yet. As a result, if the network connection is restored in this window between 12 and 14 seconds after the host has lost connectivity, the virtual machines are powered off but not failed over.
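
The timing described above can be sketched as a small model. This is only an illustration of the thresholds quoted in the description (12 and 15 seconds), not VMware code; the names `ha_outcome` and `Outcome` are hypothetical, and the behaviour between 14 and 15 seconds is an assumption, since the description only names the 12 to 14 second window.

```python
from dataclasses import dataclass

# Thresholds taken from the description above (defaults, in seconds).
ISOLATION_AFTER_S = 12  # host declares itself isolated
FAILURE_AFTER_S = 15    # other hosts declare the host failed


@dataclass
class Outcome:
    host_isolated: bool            # host considers itself isolated
    host_declared_failed: bool     # other hosts treat it as failed
    vms_powered_off: bool          # isolation response powered the VMs off
    vms_restarted_elsewhere: bool  # VMs failed over to another host


def ha_outcome(outage_s: float, power_off_on_isolation: bool = True) -> Outcome:
    """Hypothetical model of what happens to VMs on a host that loses
    its network for outage_s seconds (shared storage on a SAN)."""
    isolated = outage_s >= ISOLATION_AFTER_S
    failed = outage_s >= FAILURE_AFTER_S
    powered_off = isolated and power_off_on_isolation
    # A restart on another host needs both: the host declared failed and
    # the disk locks released, which only happens if the VMs were powered off.
    restarted = failed and powered_off
    return Outcome(isolated, failed, powered_off, restarted)


if __name__ == "__main__":
    for seconds in (5, 13, 60):
        print(seconds, ha_outcome(seconds))
    print(60, ha_outcome(60, power_off_on_isolation=False))
```

With these assumptions, a 13 second outage is the worst case for the default setting: the VMs are powered off but not restarted anywhere. With the isolation response set to leave the VMs powered on, the model never powers off or restarts anything, which matches the SAN case in the description where the isolated host keeps its disk locks.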

For more information on HA see http://download3.vmware.com/vmworld/2006/tac9413.pdf and http://kb.vmware.com/KanisaPlatform/Publishing/894/2956923_f.SAL_Public.html and http://www.vmware.com/pdf/vmware_ha_wp.pdf

dsolerdelcampo
Enthusiast

Hi,

Thank you for your answer. I supposed there was no way to change the default parameter of 12 seconds.

I agree with you that the network problems we have had are not normal in a production environment; they occurred after two planned general blackouts, once we turned the servers on again. (I need to explain this, because otherwise it looks like we have a poor network, and that is not the case.)

Back to my HA questions: I have read that the default configuration is the recommended one if you are using iSCSI or a NAS, but I have not read anything similar for a SAN. I suppose that if I choose not to power off the VMs and we have a LAN problem, the HA cluster cannot start the VMs on another ESX server. Is this option better than having all your VMs powered off? I am afraid of not being able to start a VM if it is powered off abruptly.

I would like to be able to specify a longer period of time before HA initiates the VM power-off process. I think it would be a nice feature in future releases.

Thanks again

esiebert7625
Immortal

Also, if you know of a planned network outage, you can always disable HA temporarily and then enable it again afterwards. If you are using iSCSI or NAS, then a loss of network connectivity will also affect the ESX servers' ability to see shared storage; in a SAN environment that is not the case. Power off is the best isolation response to use for iSCSI/NAS.

See this post also...

http://www.vmware.com/community/thread.jspa?messageID=581281&#581281

fyi...if you find these posts helpful, please award points using the Helpful/Correct buttons...thanks

dsolerdelcampo
Enthusiast

I have not seen that thread till now; thank you for the reference.

I also agree with you about disabling the HA feature whenever a network outage is planned.

Nicodemus
Contributor

So how is it that HA is not using VMotion? My Installation & Configuration manual from class states that HA does not use VMotion but DRS does. I can easily understand DRS using VMotion, but how is it that HA does not if it is also starting VMs on other hosts?

Can someone clear this up for me?

Thanks, - Nicodemus

mcowger
Immortal

HA doesn't use VMotion because HA-moved VMs don't move 'live'; they get fully rebooted on the new host. It's like an automated cold migration.

--Matt

--Matt VCDX #52 blog.cowger.us

korpy
Enthusiast

Hi Nicodemus,

>So how is it that HA is not using VMotion?

Because the HA power-on response is a reaction to a failed host. And when a host has failed, there is no way to migrate a VM with VMotion....

regards -frank-

Nicodemus
Contributor

OK... so I think I understand now...

It only counts as VMotion if it is done 'live' .. i.e. moving the memory, switching the VM to another Host without an outage...

But HA is really a 'cold migration'.. a restart of the VM on a new Host, meaning it had to have been powered down first...

That pretty much it?

- Nicodemus

mike-p
Enthusiast

But why not Shut down guest instead of Power off?

Power off is a little bit faster, but every OS will recognize the failure and start its local file system and database recovery algorithms.

In the end it takes longer to restart the VMs, and in unlucky cases you will have unrecoverable errors.

jbogardus
Hot Shot

Assuming that the host really is entirely isolated, the VMs don't have network access. In this sort of situation many network applications may not shut down properly or quickly, and may hang and not shut down at all. Doing a power off tends to be fairly safe. OSes and applications have been designed to deal with unexpected power outages, since even in the best situations they sometimes happen.
