Solved: Re: The HA debate

jprevett · ‎10-17-2007

I wanted to get a consensus from people who use this site, as I do, for valuable insight into VMware virtualization.

I've not been able to find anything conclusive about how HA behaves in VirtualCenter 2.0.1 when it comes to moving VMs and how the recovery happens. It's been my experience that when a host disconnects and the HA event begins, the VMs that were being managed by that host are DMotion'd off to another host in the cluster. In 1.4, we had a "swing" server or "HA" server: spare capacity sitting in waiting within the cluster that held either test machines or nothing at all, which was a huge waste of resources but a necessary evil to ensure a timely recovery. I'm led to believe things are different in VI3: you no longer need a swing server, and all hosts should have enough spare capacity to handle an HA event. Well, when I've experienced HA in 3.0.1 with VC 2.0 and 2.0.1, all the VMs involved in the HA event are moved to one host in the cluster. That host usually doesn't have enough spare capacity to fire up all the VMs, so the VMs that are looking for a home end up moving around the cluster until they find spare capacity on a host. This is time consuming, and while they eventually find a home, the customer has been impacted enough to start generating calls to the help desk.

I assume this has something to do with our lack of capacity on a per-host basis, but I wonder if HA just isn't that smart. While we tell the customer that they should experience only a reboot when we lose a host, they can in actuality experience a 15 - 30 outage, and in some instances longer, before the VM is connected again.

Is my logic correct with HA or am I missing something here?

Thanks for any feedback.

virtualdud3 · ‎10-17-2007

Well, let's see where to start...

An important thing to keep in mind is that HA does NOT use/rely on VMotion; many people mistakenly state that it does.

By its very nature, HA comes into play with ESX host failure (or loss of Service Console connectivity). So, if an ESX host fails, the VMs running on the host are also going to "go down". Or, if the ESX host's service console loses the ability to communicate with the other ESX hosts in the HA-enabled cluster, the VMs running on that host can be configured to power down so that they can be powered up on another ESX server (this is the default behavior). By default, the ESX host "checks" whether it has lost its ability to communicate with the other ESX hosts in the HA cluster by pinging its default gateway. If the ESX host is unable to ping its default gateway for 15 seconds (this timeframe can be changed in the latest version of VirtualCenter), it treats itself as isolated. You can also specify a different IP address to ping by setting the "das.isolationaddress" parameter.
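A minimal sketch of that isolation check, written as a plain simulation rather than anything taken from ESX itself. The function name, the `(timestamp, reply_received)` sample format, and the exact rule for when the clock starts are assumptions made for illustration; only the 15-second default and the idea of pinging the gateway (or the "das.isolationaddress" target) come from the description above.

```python
from typing import Iterable, Optional, Tuple

# Default window described in the post; treated here as a tunable parameter.
ISOLATION_TIMEOUT_S = 15.0


def isolation_time(pings: Iterable[Tuple[float, bool]],
                   timeout_s: float = ISOLATION_TIMEOUT_S) -> Optional[float]:
    """Return the timestamp at which a host would treat itself as isolated.

    `pings` is a hypothetical, time-ordered series of
    (timestamp_seconds, reply_received) samples against the isolation
    address. Returns None if the host never goes `timeout_s` seconds
    without a reply.
    """
    last_ok: Optional[float] = None
    for t, ok in pings:
        if last_ok is None:
            last_ok = t  # assumption: the clock starts at the first sample
        if ok:
            last_ok = t  # any reply resets the window
        elif t - last_ok >= timeout_s:
            return t     # no reply for the whole window: isolated
    return None


# Gateway answers at t=0, then goes silent: isolation is declared at t=15.
outage = [(0.0, True)] + [(float(t), False) for t in range(1, 21)]
print(isolation_time(outage))

# A short blip that recovers within the window never triggers isolation.
blip = ([(0.0, True)] + [(float(t), False) for t in range(1, 11)]
        + [(11.0, True)] + [(float(t), False) for t in range(12, 21)])
print(isolation_time(blip))
```

The second case is the reason for choosing the isolation address with care: a target that drops replies for longer than the window would cause healthy hosts to power down their VMs.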

In this scenario, the reason the VMs need to be powered down is so that their "original" ESX server will release its "lock" on the .vmdk files, freeing the VMs to be powered up on another ESX host. The point of all this is that since the VMs are NOT running/powered on when they are moved, by definition VMotion does not come into play.

Typically, if an ESX cluster is enabled for HA, the cluster is also enabled for DRS (Distributed Resource Scheduler). If an HA "failover" occurs within an ESX cluster that is DRS-enabled, then DRS can use VMotion to automatically migrate the running VMs onto the appropriate ESX host to maximize the use of available resources. Depending on the configuration, DRS can also be used to automatically migrate/VMotion running VMs as necessary. Keep in mind that while the two technologies (HA and DRS) complement each other, they are separate technologies/features and, of the two, only DRS relies on VMotion.

Your question on host capacity is a very "involved" question; the answer depends greatly on your environment, desires, and availability of resources.

For reference, I'd look at the Resource Management guide at

############### Under no circumstances are you to award me any points. Thanks!!!

jprevett · ‎10-17-2007

Thanks for your feedback; everything you said is true. My reference was not to VMotion but to D Motion; however, that was also incorrectly stated. Perhaps I didn't do a good job of stating my question in all this. I do understand HA and the unlocking of the .vmdk in order to move the VM to another host. I also understand that DRS will do its best to load balance once the VMs are powered on. My question is about how HA recovers the VMs. As stated previously, it is my understanding that the 2.5.x best practice for HA in a cluster was to have a "swing" host assigned.

With VI3, when you enable the cluster, there is no prompt for assigning the "primary" failover host for HA but one is assigned. How is that determined and/or manipulated?

GBromage · ‎10-17-2007

You generally can't specify a "primary" failover server - because you never know which server will be the one that goes down.

Instead, the servers will restart wherever there is sufficient capacity. So this is another area where DRS can complement it. You don't have to have a "spare" server sitting idle; its resources can be contributed to the cluster, and the "spare" capacity distributed across all the hosts.
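As a rough worked example of what "spare capacity distributed across all the hosts" implies, here is a small sketch. It rests on simplifying assumptions that are not stated in this thread: all hosts are identical, and the failed hosts' load can be spread evenly over the survivors. The function name is hypothetical.

```python
def max_safe_utilization(num_hosts: int, host_failures: int = 1) -> float:
    """Highest average per-host utilization (0.0 to 1.0) at which the
    surviving hosts can still absorb the load of `host_failures` hosts.

    Assumes identical hosts and evenly divisible load; real clusters
    need extra headroom for uneven VM sizes.
    """
    if num_hosts < 1 or not 0 <= host_failures < num_hosts:
        raise ValueError("need 0 <= host_failures < num_hosts")
    # Total load U * N must fit on the N - F surviving hosts:
    # U * N <= N - F, so U <= (N - F) / N.
    return (num_hosts - host_failures) / num_hosts


for n in (2, 4, 8):
    print(n, "hosts ->", max_safe_utilization(n))
```

Under these assumptions a four-host cluster can run each host at up to 75% and still survive the loss of one host, which is the same spare capacity as one idle "swing" server but spread across hosts that are all doing useful work.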

Regarding your DMotion question, no DMotion is needed (or possible). Like VMotion, HA does rely on the vmx and vmdk files being visible to all the hosts in the cluster. By definition, this must be SAN or NAS disk, not direct attached. If a host goes down and the VM files are on local storage, the VM will not be restarted because none of the failover hosts will be able to see it.

I hope this information helps you. If it does, please consider awarding points with the 'Helpful' or 'Correct' buttons. If it doesn't help you, please ask for clarification!

mreferre · ‎10-17-2007

> Instead, the servers will restart wherever there is sufficient capacity

I think this is part of the problem. If I remember correctly, I read somewhere that when the failure occurs, the HA service determines a single host to move all the VMs to (perhaps the least hammered at that point?). This assumes that DRS will then redistribute the VMs according to the hosts' actual usage across the cluster. Of course, if this is the behaviour (and I am not certain it is), the host that was the least hammered at that point immediately becomes the MOST hammered if DRS can't keep pace.

Again ... I vaguely remember this was the failover methodology being used. After all, I think HA has never been (so far) the most sophisticated piece of technology on earth.

Massimo.

Massimo Re Ferre' VMware vCloud Architect twitter.com/mreferre www.it20.info

admin · ‎10-18-2007

You can specify a single default failover host using the HA advanced options (das.defaultFailoverHost = <host shortname>). HA will try to fail over all VMs to this host first; if that host is not available or does not have enough resources, some other host will be chosen. When picking a host to fail over to, HA picks the host with the most unreserved capacity, that is, the capacity of the host minus the reservations of any VMs running on that host. To avoid flooding a host with all the VMs being failed over, you should assign reasonable reservations to the VMs in your cluster.

Elisha
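The selection rule described in this answer can be sketched roughly as follows. This is an illustrative model, not VMware's implementation: the `Host` class, both function names, the CPU-only (MHz) accounting, and the assumption that VMs are placed one at a time with reservations updated after each placement are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Host:
    name: str
    capacity_mhz: int
    reservations_mhz: List[int] = field(default_factory=list)
    available: bool = True

    @property
    def unreserved_mhz(self) -> int:
        # Capacity minus the reservations of VMs already running here.
        return self.capacity_mhz - sum(self.reservations_mhz)


def pick_failover_host(hosts: List[Host], vm_reservation_mhz: int,
                       default_host: Optional[str] = None) -> Optional[Host]:
    """Prefer the default failover host if it fits; otherwise pick the
    available host with the most unreserved capacity."""
    candidates = [h for h in hosts
                  if h.available and h.unreserved_mhz >= vm_reservation_mhz]
    for h in candidates:
        if h.name == default_host:
            return h
    return max(candidates, key=lambda h: h.unreserved_mhz, default=None)


def place_vms(hosts: List[Host], vm_reservations_mhz: Dict[str, int],
              default_host: Optional[str] = None) -> Dict[str, Optional[str]]:
    """Place failed-over VMs one by one, recording each reservation."""
    placement: Dict[str, Optional[str]] = {}
    for vm, reservation in vm_reservations_mhz.items():
        target = pick_failover_host(hosts, reservation, default_host)
        if target is not None:
            target.reservations_mhz.append(reservation)
        placement[vm] = target.name if target else None
    return placement


def fresh_hosts() -> List[Host]:
    return [Host("esx-a", 10000), Host("esx-b", 8000)]


vms = ["vm1", "vm2", "vm3", "vm4"]
# No reservations: unreserved capacity never changes, so one host gets everything.
print(place_vms(fresh_hosts(), {vm: 0 for vm in vms}))
# With reservations, each placement shrinks the target's unreserved capacity.
print(place_vms(fresh_hosts(), {vm: 3000 for vm in vms}))
```

In this model, VMs with no reservations all land on the same host because its unreserved capacity never shrinks, which matches the flooding behavior reported earlier in the thread. Giving each VM a reservation makes successive placements alternate between hosts.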

jprevett · ‎10-18-2007

Thanks for the comments. I believe you've answered my question.
