VMware Cloud Community
mreferre
Champion

VMware HA: broken or limited ?

I was facing this issue in a customer deployment and I wanted to share it over here.

Without getting into the details of what we are doing in this space (I can if you want), I wanted to share a very basic scenario that is a show-stopper for an HA / DR strategy we are implementing with a big customer.

Say, for example, you have 4 servers in a cluster with a bunch of VMs running on each node. If for any reason all of these servers crash at once (due to a mere power issue or similar) AND one or more of them does not come up again (e.g. due to a hardware failure or an ESX local disk corruption), the virtual machines that were running on these dead horses will not be restarted on the surviving nodes. So if you have 4 hosts running 10 VMs each, all of them crash, and 3 of them survive, at the next reboot only 30 virtual machines will come up again while the 10 VMs hosted on the dead horse will stay down.
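To make the arithmetic concrete, here is a toy Python sketch of the scenario. It is only a model of the behaviour I am describing, not of how VMware HA is implemented: the host names (`esx1` to `esx4`), the `vms_restarted` function and the `cluster_wide` switch are all made up for illustration, and capacity on the surviving hosts is ignored.

```python
from typing import Dict, List, Set


def vms_restarted(placement: Dict[str, List[str]],
                  surviving: Set[str],
                  cluster_wide: bool) -> List[str]:
    """Return the VMs that come back after every host crashed at once.

    placement    -- host name -> VMs that were running there before the crash
    surviving    -- hosts that boot again after the outage
    cluster_wide -- False: a host only restarts its own VMs (what I observe)
                    True:  survivors also restart the VMs of dead hosts
                           (what I would expect; capacity is ignored here)
    """
    restarted: List[str] = []
    for host, vms in placement.items():
        host_is_back = host in surviving
        adopted_by_cluster = cluster_wide and len(surviving) > 0
        if host_is_back or adopted_by_cluster:
            restarted.extend(vms)
    return restarted


# 4 hosts with 10 VMs each; esx4 is the "dead horse" that never comes back.
placement = {f"esx{i}": [f"vm{i}-{j}" for j in range(10)] for i in range(1, 5)}
surviving = {"esx1", "esx2", "esx3"}

observed = vms_restarted(placement, surviving, cluster_wide=False)
expected = vms_restarted(placement, surviving, cluster_wide=True)
print(len(observed), len(expected))
```

With `cluster_wide=False` the model gives the 30 VMs that come back in my scenario; with `cluster_wide=True` it gives the 40 I would expect from an HA product.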

What do you think ? Would you be looking at this as a limitation or do you consider it to be broken ?

I will start by sharing my opinion: it's completely broken! I see no reason why an HA product would not bring all managed objects back up regardless of the sequence in which the hosts failed. As far as I can tell, every HA product provides the capability to bring its objects back up after a complete cluster failure.

Thoughts / Comments ?

Massimo.

Massimo Re Ferre' VMware vCloud Architect twitter.com/mreferre www.it20.info
37 Replies
Ken_Cline
Champion

Interesting scenario...I agree with your assessment - it's not a limitation, it's a problem. I can understand where the problem likely lies (since all nodes died at the same time, there was no "master node" to remember the started state of the VMs in the cluster and they all lost their memory on reboot...so much for persistence!!)

Ken Cline VMware vExpert 2009 VMware Communities User Moderator Blogging at: http://KensVirtualReality.wordpress.com/
mreferre
Champion

Ken,

Thanks. Yes, I agree about the likely cause; however, I also think that what a product does should be a function of the marketing/technical requirements and not a function of how it was implemented. The implementation should be the means by which you provide the features that make sense, and not vice versa.

Also, consider that HA is not "free" ..... if you know what I mean .... ;)

Massimo.

STS
Enthusiast

I agree it's interesting, but your scenario depends on the fact that only four servers are in the HA farm. If it was an 8-server farm and 4 failed then everything would be OK.

The likelihood of 4 servers, or a whole farm, all failing at the same time is rare, at least for me. If they did, we would call that DR and not HA :)

mreferre
Champion

I am not saying it's common, I guess I am saying it could happen. :)

And of course in my scenario if you had 8 servers they would all go down (as in a power outage).

Massimo.

STS
Enthusiast

Then that's DR, not HA :)

mreferre
Champion

Ok....

(as in a power outage).

should have read:

(as in a temporary power outage)

:)

Massimo.

Ken_Cline
Champion

Yes...but if HA didn't understand the problem then it shouldn't have started ANY of the VMs rather than starting only a portion of them. Since HA did start the VMs on all but one host, that indicates that there's a problem in the implementation.

It should be all or nothing :D

admin
Immortal

Would HA really restart any of the VMs in this scenario? I wouldn't have thought it would... isn't that down to the auto-startup/shutdown settings for the ESX host, rather than HA?

I mean, if everything goes down, it's not High Availability, as nothing is available to run any VMs. All we're talking about here is an automatic start of VMs when some resources do become available again.

mreferre
Champion

Well, I can tell you that this is what the customer told me and what I have replicated in the lab.

Actually, in my lab I configured 2 hosts (A and B) with 2 VMs on each. After shutting down both hosts, all VMs were gone (obviously). When I restarted Host A, VMs A1 and A2 came on-line, but not B1 and B2.

After a couple of hours I restarted Host B and, surprisingly, only B1 came on-line. I had to start B2 manually.

So this appears to be broken in multiple respects ...... :)

I have thought about using autostart as well, but it is my understanding that autostart lets you state which virtual machines start on a given host, and how (a very fixed relationship between a host and ITS VMs). Now that the VMs are bound not to a specific host but rather to a cluster, I didn't find a very intuitive way to do the same thing cluster-wise. Well .... actually I did find a way .... and that was HA .... that is (to me at least), HA should be doing for the cluster what Autostart was doing for the host.

Makes sense ?

Massimo.

jdaunt
Enthusiast

In my opinion there are a few problems with both Auto-start as well as HA.

As already mentioned, auto-start is a very fixed relationship between the host and the VM(s). In fact, it has to be configured at the host level and not at the cluster level. So as virtual machines get migrated via DRS, they lose the auto-start policy configured (this is especially a problem in the lab, where we occasionally take all hosts down and I have to manually power on 100 selected virtual machines).
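Here is a toy Python sketch of why a per-host auto-start list stops matching reality once VMs move. It is only an illustration: the dictionaries, the host and VM names, and the `migrate` / `starts_on_boot` helpers are hypothetical and do not reflect how ESX actually stores its auto-start settings.

```python
from typing import List

# Per-host auto-start lists, configured host by host (hypothetical names).
autostart = {"esxA": {"vm1", "vm2"}, "esxB": {"vm3"}}

# Where each VM is currently registered.
location = {"vm1": "esxA", "vm2": "esxA", "vm3": "esxB"}


def migrate(vm: str, dest: str) -> None:
    """Move a VM to another host; the per-host auto-start lists are NOT updated."""
    location[vm] = dest


def starts_on_boot(host: str) -> List[str]:
    """VMs a host would auto-start: present on the host AND in that host's own list."""
    return sorted(vm for vm, h in location.items()
                  if h == host and vm in autostart[host])


migrate("vm1", "esxB")          # e.g. a DRS migration
print(starts_on_boot("esxA"))   # vm1 has left esxA
print(starts_on_boot("esxB"))   # vm1 now lives on esxB but is not in esxB's list
```

In this model `vm1` is not auto-started anywhere after the migration, which is the situation I end up fixing by hand in the lab.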

This being said, if you were to have a power outage, automatic startup would take over and not HA.

My first problem with the High Availability implementation in VI3 is the fact that a failure is detected by a Service Console network heartbeat. This can lead to far too many false positives, and in an isolation response scenario, the default response is to power down all hosted virtual machines. The other thing that I have noticed is that if a host fails, all of its virtual machines are recovered on the same alternate ESX host, and only then is DRS initiated. This leads to performance problems on the host recovering these systems, as it becomes a bottleneck before the migrations take place.

Of course, VI3 is an amazing product, but I feel there are still a few wrinkles to be ironed out. End of my rant.

mreferre
Champion

Good rant.

Massimo.

sbeaver
Leadership

I think that sums it up for a lot of us :)

Steve Beaver
VMware Communities User Moderator
VMware vExpert 2009 - 2020
VMware NSX vExpert - 2019 - 2020
====
Co-Author of "VMware ESX Essentials in the Virtual Data Center"
(ISBN:1420070274) from Auerbach
Come check out my blog: http://www.virtualizationpractice.com/blog/
Come follow me on twitter http://www.twitter.com/sbeaver

**The Cloud is a journey, not a project.**
dpomeroy
Champion

I've had a few problems with HA, and as of right now we do not have it enabled. My initial testing with just two hosts and a couple of VMs was positive, but when it was enabled on production clusters with 4+ ESX servers I began to have issues such as:

1. An ESX server became isolated for no apparent reason, resulting in the VMs being shut down (default behavior). I could never find out why it thought it was isolated, we couldn't find any network issues and none of the other ESX servers on the same subnet had any issues.

2. A cluster would all of a sudden get an alert saying there was a problem with the HA agent. Sometimes reconfiguring for HA on every server worked; other times I had to disable HA and re-enable it. This has happened 3-4 times, and VMware support could not find any reason why. All DNS and network configs were triple-checked.

3. ESX server crashed (PSOD) and none of the VMs were restarted on the other 3 hosts.

I know many people have it running without issue, but IMO it's not quite enterprise ready.

mreferre
Champion

Well this was not supposed to be a "VMware HA is completely crap" thread ....

Mr VMware I swear it ...... !

Massimo.

pdrace
Hot Shot

> My first problem with the High Availability implementation in VI3 is the fact that a failure is detected by a Service Console network heartbeat. This can lead to far too many false positives, and in an isolation response scenario, the default response is to power down all hosted virtual machines.

Another issue is that the heartbeat may be fine but there could still be another problem that takes vms offline. I got a call at 2 am last week because a server had an issue with storage and all vms became inaccessible.

I had to shut down the host to get the VMs running on the other server.

That's not nearly robust enough to be considered HA in my opinion.

sbeaver
Leadership

Yes but at least it is out in the open now and hopefully we can expect some of our concerns to be addressed in future revisions.

jdaunt
Enthusiast

At this point I leave HA enabled with the Isolation response of "Leave Powered On". I will say that our environment did have a power issue, and all virtual machines were powered on in the alternate datacenter within 3 minutes (management was very pleased).

The problem lies in the fact that if you promise a high availability solution, it needs to be highly available and work as stated. If a host powers off, or has faulty hardware leading to a crash, HA will work as promised. But if you have a link failure on your HBAs, the network heartbeat is still up, so you are left with VMs unavailable.
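The gap can be sketched in a few lines of Python. This is a toy model under my own assumptions, not VMware's detection logic: `HostStatus`, the `storage_ok` flag and both functions are hypothetical names, and the storage check is something I would like to see rather than something described anywhere in this thread as existing.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HostStatus:
    name: str
    heartbeat_ok: bool  # Service Console network heartbeat is being received
    storage_ok: bool    # host can still reach its VM storage (hypothetical check)


def failed_by_heartbeat_only(hosts: List[HostStatus]) -> List[str]:
    """Hosts declared failed when only the network heartbeat is considered."""
    return [h.name for h in hosts if not h.heartbeat_ok]


def failed_with_storage_check(hosts: List[HostStatus]) -> List[str]:
    """Hosts declared failed when a lost storage path also counts as a failure."""
    return [h.name for h in hosts if not (h.heartbeat_ok and h.storage_ok)]


hosts = [
    HostStatus("esx1", heartbeat_ok=True, storage_ok=True),    # healthy
    HostStatus("esx2", heartbeat_ok=True, storage_ok=False),   # HBA link down
    HostStatus("esx3", heartbeat_ok=False, storage_ok=True),   # host crashed
]

print(failed_by_heartbeat_only(hosts))
print(failed_with_storage_check(hosts))
```

In this model the heartbeat-only check never flags `esx2`, so its VMs stay unavailable until someone intervenes by hand.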

Some of these items should definitely be brought to light, as VMware now knows that adjustments can be made to bring even more functionality to the product.

admin
Immortal

These are all valid and interesting points. I think most of the issues exist because Legato AAM was bought and made to fit ESX as best as possible in a short amount of time. A lot of the ideas mentioned in this thread should be quite easy to implement, but I'm guessing there wasn't time to do anything more with AAM for the initial VI3 release than get it working as it is.

At the London VMUG a few weeks back, Richard Garsthagen (of www.run-virtual.com) said that they had purchased the code from Legato, and thus the rights to develop it, which they are doing, and that we would see some fairly big changes to HA in future releases. I know that's not much help right now, but it does show that they are aware of some of the problems/limitations and will be working on improving it in the future. :)

Shawzer
Contributor

king@it.ibm.com,

Nice thread burner. I usually stay low. However, I understand your scenario. And Yes, I’d like to know more detail on the config. Email, pm, etc.

Have you tested a cascading failure to see what happens?

For example, 3 hosts with 2 VMs apiece:

Host A: VMa1, VMa2
Host B: VMb1, VMb2
Host C: VMc1, VMc2

Host A dies, and Host B takes on Host A’s load. Then Host B and Host C die, and only Host B and Host C come back online. Do Host A’s VMs start?
