Hello VMware community,
I'm testing the new ESX 4 and its HA behavior (2-node cluster). I'm having a hard time with VMware HA - the VM doesn't fail over to the other ESX server in the cluster.
My environment:
hostA, hostB: 2x ProLiant DL580 G5, each with Intel(R) Xeon(R) CPU X7460 @ 2.66GHz and 32 GB RAM
1x shared storage - HP EVA - both servers see this shared storage
client: 1x standalone Windows machine with the vSphere Client installed
ESX 4 - evaluation copy (business licenses on the way)
All 3 servers are on the same networks. I use the vSphere Client to manage both ESX servers. I put them into a cluster (HA has been configured with default values). I have two test VMs (Windows 2003) inside the cluster. I tested VMotion - it worked perfectly - I was able to move VMs from hostA to hostB and vice versa.
The problem is, when I reboot either hostA or hostB (to test failover), no failover happens. I see hostA in a "not responding" state and the VM in a disconnected state. The first time I tried this (after cluster configuration) I got no error on the console. The second time I got an "insufficient resources to satisfy HA failover" error. I checked the HA configuration:
Current Failover Capacity: 1
Configured Failover Capacity: 1
The VM has only 1024MB configured (no limits set) and it's located on shared storage (20GB LUN, 5GB allocated for the system). I tried several things (reconfigured the cluster again, turned off admission control, checked DNS between ESX/vSphere, checked that VMware HA is among the licensed features, ...) - nothing helped.
I'm kind of stuck here and would appreciate any hints on what I've (probably) missed.
Thanks!
I am from engineering, and we have identified a problem. I'd like to confirm that the problem identified is the one that you are experiencing.
If the management network (Service Console on ESX, Management checkbox checked for the port group on ESXi) is on a public network, and there is no other service console or management network on a private network, this may lead to a false conclusion that the failed node is still online, preventing a failover.
Please let me know if this matches your configuration. If so, a work-around would be to define a private network for management.
A KB article addressing this issue is in the works.
Feel free to write to me with any questions.
Marc
Does the VM ever restart? Also, can you confirm that it remains on the original host? Check the summary page of the VM, which will show where the VM is registered.
If you find this or any other answer useful please consider awarding points by marking the answer correct or helpful
Yes, it does. I have attached the cluster summary. My cluster status is as follows:
hostA:
VM1
hostB:
VM2
As I mentioned, I can move VMs from one host to the other and vice versa. When I reboot hostA I get an error message: "HA agent on hostA is not responding" (which is expected). Then I see hostA not responding and the VM in a disconnected state (the power status is "powered on" in the 'Virtual Machines' tab). The VM is not reachable. Everything gets fixed once hostA is up again - the cluster reforms and the VM is started on that host.
So this is also an answer to Virtual_Bee - the HA agent seems to be working, as the cluster reforms upon boot.
The VM status page always shows the host it was running on before the failover test (meaning it never gets switched to another host in the cluster and stays in a disconnected state; it becomes available once the host is up again).
I also attached a picture of how it looks when the server is rebooted (before the host is declared not responding).
Is it possible to see (find out) which resource is ESX complaining about ?
I'm not sure, but it seems like weird HA behavior to me. As I posted, the initial status of the cluster looks OK - the configured and current failover capacity are the same. When I initiate a reboot of a host, at first I only see the host not responding, the current failover capacity decreases to 0 (vSphere still thinks the VM is OK), and I'm stuck with the insufficient resources error. After a while, the VM's state turns to disconnected.
I checked the network settings again: I was able to ping the VMkernel IPs from the vSphere client box, the IPs of the consoles are pingable from the client, and the ESX hosts can ping each other too.
According to /var/log/vmware/aam/vmware_hostA.log this is what happened during reboot:
Warning FT Fri Jun 19 13:39:51 2009
By: ftProcMon on Node: hostB
MESSAGE: Node hostB has stopped receiving heartbeats from Primary node hostA 1/31. Declaring node as unresponsive.
===================================
Info FT Fri Jun 19 13:39:51 2009
By: ftProcMon on Node: hostB
MESSAGE: This node is not network isolated.
===================================
Info RULE Fri Jun 19 13:39:51 2009
By: FT/Rule Manager on Node: hostB
MESSAGE: Rule RuleMonitor is enabled on hostB.
===================================
Info RULE Fri Jun 19 13:39:51 2009
By: FT/Rule Manager on Node: hostB
MESSAGE: Rule VMWareClusterManager is enabled on hostB.
===================================
Info RULE Fri Jun 19 13:39:52 2009
By: FT/Rule Interpreter on Node: hostB
MESSAGE: Rule RuleMonitor is enabled on hostB.
===================================
Info RULE Fri Jun 19 13:39:52 2009
By: FT/Rule Interpreter on Node: hostB
MESSAGE: Rule VMWareClusterManager is enabled on hostB.
===================================
Info NODE Fri Jun 19 13:43:22 2009
By: FT/Agent on Node: hostB
MESSAGE: Agent on hostA has started.
===================================
Info NODE Fri Jun 19 13:43:23 2009
By: FT/Agent on Node: hostB
MESSAGE: Node hostA is running.
===================================
Info FT Fri Jun 19 13:43:23 2009
By: ftProcMon on Node: hostB
MESSAGE: Node hostB has started receiving heartbeats from node hostA.
===================================
Info FT Fri Jun 19 13:43:27 2009
By: ftProcMon on Node: hostA
MESSAGE: Node hostA has started receiving heartbeats from node hostB
I can only see that it took hostA around 4 minutes to join the cluster, though there is no information about the VMs themselves. Please, is there a log where I can find the error regarding the VM failover failure?
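In case it helps anyone reproduce this, the heartbeat/isolation lines can be pulled out of the AAM log with a simple grep filter. This is just a sketch; the patterns are taken from the message strings in the excerpt above:

```shell
# Keep only the heartbeat / unresponsive / isolation messages (sketch).
filter_ha() {
  grep -E 'heartbeats|unresponsive|network isolated'
}

# Demonstrated on two of the messages quoted above; only the first matches:
printf '%s\n' \
  'MESSAGE: Node hostB has stopped receiving heartbeats from Primary node hostA 1/31. Declaring node as unresponsive.' \
  'MESSAGE: Rule RuleMonitor is enabled on hostB.' | filter_ha
```

Against the real file that would be `filter_ha < /var/log/vmware/aam/vmware_hostA.log`.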
This looks to be normal behavior for an HA cluster - the host status goes to not responding, as does the VM's; capacity drops to zero because you now have a single host; and then the VM should restart on the remaining host. Does anything show in the events tab for the cluster in the VI Client? If the VM is not restarting, then it is relying on some component that is associated with the first ESX server.
This sounds like normal HA failover behavior - when rebooting a host in an HA cluster, eventually vCenter realizes the host is no longer responding and the status changes to Disconnected, as does the status of all the VMs running on that server; once a VM restarts on the remaining node(s) of the cluster its status will change, and once the rebooted host comes back up it will return as well. Is there anything recorded in the events tab of the cluster in the VI Client? I am wondering if something is stopping the VM from restarting on the other node, like it is using some resource assigned only to the first host.
thanks for reply,
I've attached the event log from the time period when I rebooted hostA - there's no indication of why the VM was not switched over. I checked whether the VMs can be moved from one host to the other by VMotion - as that worked OK, I assumed (and checked) there are no 'local' resources assigned to the VM, i.e. some ISO image in the CD-ROM from a local datastore, etc.
Hi guys,
We have exactly the same issue over here. HA is configured properly (according to vCenter), but when one host becomes unavailable it doesn't start the HA-protected VM on another host, even though there are more than enough failover hosts and resources available.
Thanks!
Hi matoo --
If you want to tar up and send me the contents of /var/log/vmware/ from hostA and hostB, I'll try and figure out what's going on, i.e. why the VM isn't failing over, as it should.
-- Ron Passerini
HA Team
Hi Ron,
Sure, thank you. I've attached logs (/var/log/vmware/) from hostA and hostB.
Due to our policy I had to substitute the actual IP addresses/hostnames with fake ones (done by a sed script; timestamps were preserved).
Naming goes as follows:
hostA,B: ESX servers
VM1,2: virtual machines
vsphere: vSphere server
A.B.C/16 - console NW
X.Y.Z/22 - vmkernel NW
D.N.S. - DNS servers
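For reference, the substitution itself was nothing fancy - something along these lines (a sketch; the left-hand hostnames and address prefixes here are made-up placeholders, since the real values are exactly what was scrubbed):

```shell
# Anonymize hostnames and IP prefixes in the logs, leaving timestamps untouched.
# The patterns being replaced are hypothetical stand-ins for the real values.
anonymize() {
  sed -e 's/esx-prod-01/hostA/g' \
      -e 's/esx-prod-02/hostB/g' \
      -e 's/10\.1\.2\./A.B.C./g' \
      -e 's/10\.2\.4\./X.Y.Z./g'
}

echo 'Node esx-prod-01 (10.1.2.11) is running.' | anonymize
# -> Node hostA (A.B.C.11) is running.
```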
thanks
-m-
Hi Matoo,
I had exactly the same behavior. See http://communities.vmware.com/message/1325539#1325539
The solution was to create a second service console on a private address. I did that and disabled/enabled HA at the cluster level to get HA working properly.
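For anyone wanting to do the same from the ESX service console, the commands look roughly like this. This is a sketch only - the vSwitch name, spare uplink, and private address are placeholders for your environment:

```shell
# Create a second service console on a private subnet (placeholder names/IPs).
esxcfg-vswitch -a vSwitch2                       # create a new vSwitch
esxcfg-vswitch -L vmnic2 vSwitch2                # attach a spare uplink NIC
esxcfg-vswitch -A "Service Console 2" vSwitch2   # add a port group for it
esxcfg-vswif -a vswif1 -p "Service Console 2" -i 192.168.100.11 -n 255.255.255.0
```

After that, disable and re-enable HA at the cluster level (as described above) so the agents pick up the new network.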
Hope that helps,
Brian
hi bfredette,
thanks, yep - this solution worked. I've created another service console on another subnet and it worked perfectly. Great success!
A new patch for vCenter is available that addresses the issue. You can find the details here: http://kb.vmware.com/kb/1013013
From the KB:
Symptoms
You experience these symptoms:
In vCenter 4.0, VMware HA might not failover virtual machines when a host failure occurs.
When the ESX host's IP address in a VMware HA enabled cluster is configured with certain IP addresses, the node failure detection algorithm fails.
You are susceptible to this issue when all of your Service Console Port(s) or Management Network IP address(s) on your ESX host fall within the following range:
3.x.x.x - 9.x.x.x
26.x.x.x - 99.x.x.x
Note: You are not affected if one of Service Console Port(s) or Management Network IP address(s) on your ESX host falls outside of this range.
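If you want a quick way to check whether a host is in the affected range, the first-octet test from the KB can be sketched as a small shell function (octet boundaries taken from the ranges quoted above):

```shell
# Print "affected" if the first octet falls in 3-9 or 26-99 (per KB 1013013),
# "not affected" otherwise.
affected() {
  first=${1%%.*}
  if { [ "$first" -ge 3 ] && [ "$first" -le 9 ]; } || \
     { [ "$first" -ge 26 ] && [ "$first" -le 99 ]; }; then
    echo "affected"
  else
    echo "not affected"
  fi
}

affected 62.10.0.5      # -> affected
affected 192.168.1.10   # -> not affected
```

Remember the KB's note: you are only exposed if all of your Service Console / Management Network addresses fall inside the range.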
This is by design.
Rebooting a host prompts you to confirm that you are sure you want to reboot it, then accepts this as confirmation that you are aware of the impact.
I suspect that if you simply pulled the power on the ESX host, you would get different behaviour. Issuing a reboot alerts the other hosts in the cluster of your intentions.
Sounds like a valid and logical thought.
However, the VMware Infrastructure 3: Install and Configure course (Module 10 - Lab for Lesson 2, "Using VMware HA") describes selecting "Reboot" as a valid way to test HA. Granted, this may have changed with vSphere, but then I'd have to ask: why does the workaround resolve the problem?
I'm not sure how your comment fits into the thread, but I wanted to clarify something. There should be no difference (HA-wise) between powering off a host and rebooting it through the UI. The reboot option does not cause an announcement of the intention to the other hosts in the cluster.
The same failover behavior should occur.