VMware Communities > VMTN > VMware Infrastructure™ > VI: ESX 3.5 > Discussions

Questions from the CTO - HA related

Nov 2, 2007 1:42 PM

311bobby311 (Novice, 18 posts since Aug 14, 2006)

Our CTO has some questions about HA that I cannot find answers for. They are kind of obscure, but valid nonetheless, so here we go:

1. How does HA determine the primary agent in the cluster? Does ESX pick the most powerful host (CPU, RAM)? Is it the first host added to the cluster? Does this ever change, or will the primary agent always be the primary agent?

2. What would happen in this scenario? You have two hosts in a cluster using HA. One ESX host (ESX1) gets a memory failure that does not take down the host. Do all VMs on ESX1 migrate to the other host (ESX2), assuming it has enough resources? Then ESX2, with all of the VMs now on it, has a hardware failure that is also not severe enough to take down the host. So both ESX hosts have some kind of minor hardware issue. Do the VMs constantly flip-flop between hosts? Or does one ESX host take itself out of the cluster at that point, making HA unavailable, so that all the VMs are stuck on ESX2?

3. Does ESX have built-in parameters that determine when to fail over VMs to a new host (not DRS)? Does ESX rate hardware failures at different levels? For example, is a memory error rated higher than a CPU error?

I have searched the forum, looked through the book, etc. CTO would like answers, he pays my salary.....you know how it is!

thanks!!


Re: Questions from the CTO - HA related
Nov 2, 2007 2:08 PM
esiebert7625 (Guru, Moderator, 6,786 posts since Oct 23, 2006)

HA relies on a heartbeat service, so any loss of that heartbeat will trigger it. Typically this is a loss of network communication on the service console (SC), which severs the heartbeat and isolates the host. Memory or CPU errors that do not take the whole host down will not trigger HA; only a catastrophic failure that causes the host to crash or lose network connectivity on the SC will trigger it. The docs below go into detail on HA...

Automating High Availability (HA) Services with VMware HA - http://www.vmware.com/pdf/vmware_ha_wp.pdf
Effective DRS and HA in Production - http://download3.vmware.com/vmworld/2006/tac9413.pdf
Choosing the HA host destination - http://www.vmware.com/community/thread.jspa?messageID=563006
VMware HA with 2 ESX hosts - http://www.vmware.com/community/thread.jspa?messageID=605107
Knocking Out Downtime with Two Punches: VMotion & VMware HA - http://www.vmware-tsx.com/download.php?asset_id=45
A Practical Guide to HA - http://www.vmware-tsx.com/download.php?asset_id=29
Das.isolationaddress - http://www.vmware.com/community/thread.jspa?messageID=719673
HA restart order of VM's with the same priority - http://www.vmware.com/community/thread.jspa?messageID=664502
Setting Failure and Isolation Detection Timeout and Multiple Isolation Response Addresses - http://kb.vmware.com/kb/1002080
HA Technical Best Practices - http://kb.vmware.com/Platform/Publishing/attachments/1002080_fHA_Tech_Best_Practices.pdf
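
To make the heartbeat point concrete, here is a minimal sketch in Python of the decision described above. It is only an illustration of the idea, not how the HA agent is actually implemented: the `Host` class, the `needs_failover` function, and the 15-second timeout are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass, field

# Illustrative timeout only (an assumption for this sketch).
FAILURE_DETECTION_SECONDS = 15.0


@dataclass
class Host:
    """Hypothetical stand-in for an ESX host as seen by the cluster."""
    name: str
    last_heartbeat: float = 0.0  # time the last heartbeat was received
    hardware_warnings: list = field(default_factory=list)


def record_heartbeat(host: Host, now: float) -> None:
    """Note that a heartbeat arrived from the host at time `now`."""
    host.last_heartbeat = now


def needs_failover(host: Host, now: float,
                   timeout: float = FAILURE_DETECTION_SECONDS) -> bool:
    """Return True only when heartbeats have been missing longer than `timeout`.

    Hardware warnings are deliberately ignored: in this model a host that
    is still heartbeating is treated as alive, whatever else is wrong.
    """
    return (now - host.last_heartbeat) > timeout


esx1 = Host("ESX1")
record_heartbeat(esx1, now=100.0)

# A non-fatal hardware problem is logged, but heartbeats are still recent.
esx1.hardware_warnings.append("memory error (host still running)")
print(needs_failover(esx1, now=110.0))  # False: last heartbeat 10 s ago

# The host crashes or loses its service console network: heartbeats stop.
print(needs_failover(esx1, now=120.0))  # True: last heartbeat 20 s ago
```

In this model the VMs in question 2 would not flip-flop: as long as both hosts keep sending heartbeats, neither one is considered failed.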

-=-=-=-=-=-=-=-=-=-=-==-=-=-=-=-=-=-=-=-=-=-=-
Thanks, Eric
Visit my website: http://vmware-land.com
-=-=-=-=-=-=-=-=-=-=-==-=-=-=-=-=-=-=-=-=-=-=-

Re: Questions from the CTO - HA related
Nov 2, 2007 4:13 PM
Rumple (Master, 1,125 posts since Jan 6, 2005)

3. Does ESX have built-in parameters that determine when to fail over VMs to a new host (not DRS)? Does ESX rate hardware failures at different levels? For example, is a memory error rated higher than a CPU error?

In reality this is not going to happen. Memory errors are not passive events that simply show up and get bypassed by the system. When you have memory errors you are going to have a multitude of issues such as host crashes, VM crashes, and extensive data corruption (typically leading immediately to host crashes).

Disk errors can be more benign and manifest themselves over a longer period before causing issues, but bad memory almost always causes havoc on a system, which is why the 72 hour burn-in is so important.

If you have a CPU error, your whole host is typically going to crash, as ESX will not handle a physical CPU going away, even if the hardware is designed to let you hot-swap CPUs (e.g. some Sun boxes will do that).
