Hello all,
We experienced some issues with one of our ESX clusters and I'm hoping someone can either validate or correct my understanding.
Environment:
4 - ESX 4.0 Update 2 Servers
Cluster Features HA and DRS are both turned on.
vCenter is a VM running in the cluster.
HA settings are typically default: Enable Host Monitoring, Host Failure cluster tolerates = 1, Host Isolation response = shutdown, default
Here's what happened:
We experienced a network issue in which 1 of the 4 ESX hosts went into a Host Isolation response. On that ESX server, HA powered down all the VMs that were running. This included vCenter, which just happened to be running on that particular host. HA did not power up any of the VMs on any of the other ESX hosts. When the network issue was fixed about an hour later, the vCenter VM was started manually along with the other VMs. (Some VMs did appear to start automatically after vCenter was started, but I wasn't there at the time so I'm going by what I've been able to piece together.)
Prior to this, my understanding was that HA would operate independently of vCenter. But in this situation, it appears that HA wasn't going to restart anything until vCenter became available.
What am I missing?
Thanks for your help....
If you have a network issue that causes an HA isolation event on a host in your cluster and you have HA configured to power off your VMs, HA will power off the VMs and the HA agent will be disabled. The HA agent will remain disabled until it can talk to vCenter, which will reconfigure / start the HA agent on the host. Once HA is running it will start powering on your VMs.
If vCenter is a VM on the host that failed and is set to power down, your VMs won’t be powered back on by HA until vCenter is up and running so it can reconfigure/start the HA agent on the hosts. Depending on your environment, it may make sense to set the vCenter VM’s isolation response to ‘Leave powered On’.
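To picture the per-VM override suggested above, here is a minimal Python sketch of how a VM-level isolation response could take precedence over the cluster default. It is only a model for reasoning, not the vSphere API: the function names and the VM names (`vcenter`, `app01`, `db01`) are hypothetical and do not come from this thread.

```python
from enum import Enum
from typing import Dict, List


class IsolationResponse(Enum):
    LEAVE_POWERED_ON = "Leave powered on"
    POWER_OFF = "Power off"
    SHUT_DOWN = "Shut down"


def effective_isolation_response(
    vm_name: str,
    cluster_default: IsolationResponse,
    vm_overrides: Dict[str, IsolationResponse],
) -> IsolationResponse:
    """Return the per-VM override if one exists, else the cluster default."""
    return vm_overrides.get(vm_name, cluster_default)


def vms_stopped_on_isolation(
    vms: List[str],
    cluster_default: IsolationResponse,
    vm_overrides: Dict[str, IsolationResponse],
) -> List[str]:
    """List the VMs that would be powered off or shut down on host isolation."""
    return [
        vm
        for vm in vms
        if effective_isolation_response(vm, cluster_default, vm_overrides)
        is not IsolationResponse.LEAVE_POWERED_ON
    ]


# Hypothetical inventory: cluster default is "Shut down", vCenter is overridden.
stopped = vms_stopped_on_isolation(
    ["vcenter", "app01", "db01"],
    IsolationResponse.SHUT_DOWN,
    {"vcenter": IsolationResponse.LEAVE_POWERED_ON},
)
```

With the override in place, only the two non-vCenter VMs end up in `stopped`; the vCenter VM stays up on the isolated host.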
James,
Thank you for the quick response and bear with me as I'm trying to make sure I understand this. So, with HA, if an ESX host becomes isolated (or just fails for some reason) and that ESX host has the vCenter vm, HA won't restart anything until vCenter is back up?
In other words, vCenter must be available for HA to operate?
Thanks again, for your help...
VMware vCenter is required for VMware HA configuration. During this process, HA agents are deployed to each of the hosts. Hence the agents and failover will work even if vCenter is down. That is why some people run vCenter in a VM: it gets protected by VMware HA too.
Well, HA requires vCenter to configure the agent. I wouldn't say that it requires it for full operation. The HA agent just does what it is told, and since you told it to shut down VMs, it did. When the host was brought back online the HA agent couldn't be configured, so nothing started back up because the host couldn't get its orders from vCenter.
James Bowling
Thanks for your response and bear with me as I'm trying to make sure I understand this.
I came across these other posts as well, which also provided some good info:
http://communities.vmware.com/message/1102384
http://communities.vmware.com/thread/292203;jsessionid=8585124FE1294CC0F708F41431F7A342?tstart=0
I think the point I'm missing is the difference between a host isolation event and a host failure event. Based on what you've told me and what I've read:
- If a clustered ESX host that contains the vCenter VM experiences a failure (e.g. loses power), HA will restart the vCenter VM on another ESX host in the cluster. This has nothing to do with the host isolation response setting.
- If a clustered ESX host that contains the vCenter VM experiences a host isolation failure (with the response set to shut down the VMs), the vCenter VM comes down with all the others and HA won't do anything until it (vCenter) comes back up.
So, in the scenario where vCenter is a VM within the cluster, either the Host Isolation response at the cluster level should be set to Leave Powered On or, at the very least, the vCenter VM setting should be set to Leave Powered On.
Correct?
Correct.
James Bowling
Yes 😃
That's not quite correct. VC should not be required to restart VMs after a host isolation. VC is only required to reconfigure the isolated host when it rejoins the network.
Elisha
Actually, his statements are correct for his scenario. Because vCenter was shut down along with the other VMs on the isolated host, HA can't be reconfigured on that host once it comes out of isolation, so the HA agent there will not know to restart the VMs that were shut down until the vCenter VM is brought back online. In a failed-host scenario the VMs would be restarted automatically on another host in the cluster, but that is not his scenario.
James Bowling
And specifically, vCenter is required for the HA agent to be reconfigured on the isolated host, which in his case is not possible because vCenter is shut down.
When host A is isolated, its VMs will be restarted on another host in the cluster before that host comes out of isolation. VC is not needed for this restart. The reconfigure of the host when it comes out of isolation is only needed so that HA can fail over VMs to that host if another host fails at some later point.
Elisha
If you have four ESX servers and one of them got isolated, based on your settings:
1. The isolated host should have shut down the running VMs
2. The three non-isolated hosts should have detected that the isolated host left the cluster and, after a 15 second timeout, should have attempted to restart the VMs
Check out the blog below, particularly the section on isolation response gotchas. Sounds like it may have been a timing issue.
http://www.yellow-bricks.com/vmware-high-availability-deepdiv/#HA-isolationresponse
HTH,
-Kyle
I was about to point out the same portion of Duncan's blog. Good show!
Thanks for referring to my blog! I have tested this many times and HA should even restart vCenter. HA works, after the initial configuration, independently of vCenter and should restart all VMs regardless of what happens... UNLESS your network issue was solved and the remaining hosts received a heartbeat or could ping the isolated host at the 15th second. Then they wouldn't trigger a restart and yes, you would be left sitting there with all VMs down and no restart... The chances of that happening are really slim, though.
Duncan (VCDX)
Available now on Amazon: vSphere 4.1 HA and DRS technical deepdive
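The timing gotcha described above can be laid out as a small Python model. This is a sketch for reasoning only, not how the HA agent is implemented. The 15-second failure detection time is the figure quoted in this thread; the moment at which the isolated host triggers its isolation response is passed in as a parameter, and the 13.0 used in the examples is a hypothetical value chosen purely for illustration.

```python
from enum import Enum
from typing import Optional

FAILURE_DETECTION_S = 15.0  # the 15-second timeout quoted in this thread


class Outcome(Enum):
    VMS_KEEP_RUNNING = "network back before isolation response, nothing happens"
    VMS_DOWN_NO_RESTART = "VMs shut down, remaining hosts never trigger a restart"
    VMS_RESTARTED_ELSEWHERE = "VMs shut down, restarted on the remaining hosts"


def isolation_outcome(
    isolation_response_at_s: float,
    network_restored_at_s: Optional[float] = None,
    failure_detection_s: float = FAILURE_DETECTION_S,
) -> Outcome:
    """Model what happens to the VMs on a host that loses its network at t=0.

    isolation_response_at_s: when the isolated host acts on its isolation
        response (assumed to be earlier than failure_detection_s).
    network_restored_at_s: when heartbeats resume, or None if they never do.
    """
    if network_restored_at_s is not None:
        if network_restored_at_s < isolation_response_at_s:
            # Heartbeats resumed before the host decided it was isolated.
            return Outcome.VMS_KEEP_RUNNING
        if network_restored_at_s <= failure_detection_s:
            # VMs are already down, but the other hosts see the host again
            # at or before the 15th second and so never declare it dead.
            return Outcome.VMS_DOWN_NO_RESTART
    # Host still unreachable at the detection timeout: restart elsewhere.
    return Outcome.VMS_RESTARTED_ELSEWHERE


# Hypothetical isolation response time of 13 s, used only for illustration.
short_blip = isolation_outcome(13.0, network_restored_at_s=14.5)
hour_long_outage = isolation_outcome(13.0, network_restored_at_s=3600.0)
```

In this model `short_blip` lands in the narrow window where the VMs are down and nothing restarts them, while `hour_long_outage` ends with the VMs restarted on the remaining hosts. By that reasoning, an outage of about an hour like the one in the original post should have produced a restart, which is why the missing restart is not explained by timing alone in this model.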
Thanks everyone for your replies. It definitely helps. Part of my confusion is that I couldn't find any logs to indicate HA was attempting to restart any of the VMs on any of the other hosts. /var/log/messages had an entry where the "bad" node experienced a Ping Isolation Address: Failure. hostd.log showed the VMs being shut down at the time of the isolation event. Nothing seemed to indicate that anything came back online until about an hour later.
I did notice that there were additional entries in the /var/log/vmware/aam/vmware_ESX3.log file for several seconds after the VMware HA Agent Isolated message. (ESX3 is the ESX host that was isolated.) Could this indicate that it might in fact have been a timing issue? The initial network issue occurred at 10:43. Here's an excerpt:
===================================
Info FT Fri Dec 17 10:43:45 CST 2010
By: isolationScript on Node: ESX3
MESSAGE: user ESX3 VMware HA Agent Isolated, Notifying VPXA
===================================
Info FT Fri Dec 17 10:43:46 2010
By: ftProcMon on Node: ESX4
MESSAGE: Node (null) has started receiving heartbeats from node ESX1.
===================================
Info FT Fri Dec 17 10:43:47 2010
By: ftProcMon on Node: ESX2
MESSAGE: Node (null) has started receiving heartbeats from node ESX4.
===================================
Info FT Fri Dec 17 10:43:47 2010
By: ftProcMon on Node: ESX4
MESSAGE: Node (null) has started receiving heartbeats from node ESX2.
===================================
...
===================================
Warning FT Fri Dec 17 10:43:49 2010
By: ftProcMon on Node: ESX3
MESSAGE: Node (null) has stopped receiving heartbeats from Primary node ESX1 1/5. Declaring node as unresponsive.
===================================
Info PROC Fri Dec 17 10:43:49 2010
By: FT/Agent on Node: ESX3
MESSAGE: Shutdown stopping process VMAP/VMap_ESX3
===================================
Info RULE Fri Dec 17 10:43:49 2010
By: FT/Agent on Node: ESX2
MESSAGE: Rule RuleMonitor submitted to run on node ESX2.
===================================
Warning FT Fri Dec 17 10:43:49 2010
By: ftProcMon on Node: ESX3
MESSAGE: Node (null) has stopped receiving heartbeats from Primary node ESX4 4/5. Declaring node as unresponsive.
===================================
Info RULE Fri Dec 17 10:43:49 2010
By: FT/Rule Manager on Node: ESX2
MESSAGE: Rule RuleMonitor is enabled on ESX2.
===================================
Info RULE Fri Dec 17 10:43:51 2010
By: FT/Agent on Node: ESX3
MESSAGE: Rule VMWareClusterManager submitted to run on node ESX2.
===================================
Info PROC Fri Dec 17 10:43:51 2010
By: FT/Agent on Node: ESX3
MESSAGE: Process VMap_ESX3 on ESX3 stopped [pid = 9850]
===================================
Info RULE Fri Dec 17 10:43:51 2010
By: FT/Agent on Node: ESX3
MESSAGE: Rule RuleMonitor submitted to run on node ESX3.
===================================
Info RULE Fri Dec 17 10:43:51 2010
By: FT/Agent on Node: ESX3
MESSAGE: Rule VMWareClusterManager submitted to run on node ESX3.
===================================
Info RULE Fri Dec 17 10:43:51 2010
By: FT/Rule Manager on Node: ESX3
MESSAGE: Rule RuleMonitor is enabled on ESX3.
===================================
Info RULE Fri Dec 17 10:43:51 2010
By: FT/Rule Interpreter on Node: ESX3
MESSAGE: Rule RuleMonitor is enabled on ESX3.
===================================
Info RULE Fri Dec 17 10:43:52 2010
By: FT/Rule Manager on Node: ESX3
MESSAGE: Rule VMWareClusterManager is enabled on ESX3.
===================================
Info RULE Fri Dec 17 10:43:52 2010
By: FT/Rule Interpreter on Node: ESX3
MESSAGE: Rule VMWareClusterManager is enabled on ESX3.
=========================================================================
Primary Agent version 5.1 running on Linux 2.6
Restarted at Fri Dec 17 11:43:24(CST) 2010
Events posted before this process started may not be found in this log file.
Check other agent log files.
===================================
Info FT Fri Dec 17 11:43:26 2010
By: ftProcMon on Node: ESX3
MESSAGE: Node (null) has started receiving heartbeats from node ESX1.
===================================
Info FT Fri Dec 17 11:43:27 2010
By: ftProcMon on Node: ESX3
MESSAGE: Node (null) has started receiving heartbeats from node ESX4.
===================================
...
...
Thanks again for all your help...
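For anyone who wants to line up entries like the ones in the excerpt above, here is a minimal Python sketch that parses the three-line entry layout as it appears in this post (header line, `By: ... on Node: ...` line, `MESSAGE:` line) and finds the largest silence between entries. The layout is inferred only from the excerpt, so treat the regular expressions as assumptions; the function names are hypothetical and this is not a VMware tool.

```python
import re
from datetime import datetime
from typing import List, NamedTuple, Optional, Tuple


class Entry(NamedTuple):
    when: datetime
    severity: str
    category: str
    source: str
    node: str
    message: str


# Header such as "Info FT Fri Dec 17 10:43:45 CST 2010" (time zone optional).
HEADER = re.compile(
    r"^(?P<severity>\w+)\s+(?P<category>\w+)\s+\w{3} (?P<mon>\w{3}) +(?P<day>\d{1,2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2})(?: [A-Z]{2,5})? (?P<year>\d{4})$"
)
BY = re.compile(r"^By: (?P<source>.+) on Node: (?P<node>\S+)$")


def parse_aam_log(text: str) -> List[Entry]:
    """Collect every header/By/MESSAGE triple; skip separators and other lines."""
    lines = [ln.strip() for ln in text.splitlines()]
    entries: List[Entry] = []
    i = 0
    while i + 2 < len(lines):
        h, b = HEADER.match(lines[i]), BY.match(lines[i + 1])
        if h and b and lines[i + 2].startswith("MESSAGE:"):
            # %b assumes English month abbreviations; the time zone is ignored.
            stamp = f"{h['mon']} {h['day']} {h['year']} {h['time']}"
            when = datetime.strptime(stamp, "%b %d %Y %H:%M:%S")
            message = lines[i + 2][len("MESSAGE:"):].strip()
            entries.append(Entry(when, h["severity"], h["category"],
                                 b["source"], b["node"], message))
            i += 3
        else:
            i += 1
    return entries


def largest_gap(entries: List[Entry]) -> Optional[Tuple[Entry, Entry]]:
    """Return the pair of consecutive entries with the longest silence between them."""
    ordered = sorted(entries, key=lambda e: e.when)
    return max(zip(ordered, ordered[1:]),
               key=lambda pair: pair[1].when - pair[0].when, default=None)


# Three entries copied from the excerpt in this post.
SAMPLE = """\
===================================
Info FT Fri Dec 17 10:43:45 CST 2010
By: isolationScript on Node: ESX3
MESSAGE: user ESX3 VMware HA Agent Isolated, Notifying VPXA
===================================
Warning FT Fri Dec 17 10:43:49 2010
By: ftProcMon on Node: ESX3
MESSAGE: Node (null) has stopped receiving heartbeats from Primary node ESX1 1/5. Declaring node as unresponsive.
===================================
Info FT Fri Dec 17 11:43:26 2010
By: ftProcMon on Node: ESX3
MESSAGE: Node (null) has started receiving heartbeats from node ESX1.
"""
```

Running `largest_gap(parse_aam_log(SAMPLE))` on these three entries returns the 10:43:49 and 11:43:26 entries, which makes the roughly one-hour silence in the log easy to spot. Filtering the parsed entries by `node` gives a per-host timeline in the same way.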
It is difficult to say without the full logs and the tools to analyze them. Did you try calling support to find out why you did not get the expected response?
Duncan (VCDX)