Dear Community, we have problems with two of our clusters.
We updated our VirtualCenter 2.5 to Update 6 (Build 227637). Now, every morning and evening, HA fails on all hosts in both clusters (10 and 11 hosts per cluster). After disabling and re-enabling HA, with timeouts on some hosts during the disable step, everything goes back to normal. We also updated all our hosts to the current build (ESX Server 3.5.0, 238493) and did a firmware update on all our hosts (HP). We also evacuated all hosts and brought them back into the cluster. We have 9 clusters overall, and only these two show the problem. Hardware and build in the other clusters are the same. These two clusters are the largest in our environment. Any ideas what else we can do? Should we split the clusters?
I don't think splitting the clusters is going to help you. We have a cluster with more than 12 nodes and it runs fine.
It is interesting that you mention they fail in the morning and evening. Are your DNS servers set properly?
Do you see HA problems on all the nodes or only on a few specific nodes?
Thanks
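One quick way to follow up on the DNS question is to check that every host name in the cluster resolves. The sketch below is only an illustration: the host names are made up, and the `resolve` parameter is a hypothetical hook so the check can be tried without a live DNS server. By default it uses Python's standard `socket.gethostbyname`.

```python
import socket


def unresolved_hosts(hostnames, resolve=socket.gethostbyname):
    """Return a dict of {hostname: error text} for names that do not resolve.

    `resolve` defaults to the system resolver; pass another callable to
    test the logic without touching the network.
    """
    failures = {}
    for name in hostnames:
        try:
            resolve(name)
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            failures[name] = str(exc)
    return failures


# Hypothetical host names, replace with the real ESX hosts of the cluster.
CLUSTER_HOSTS = ["esx01.example.com", "esx02.example.com"]
```

Calling `unresolved_hosts(CLUSTER_HOSTS)` on a machine that uses the same DNS servers as the ESX hosts should return an empty dict if resolution is fine. Running it around the times the failures occur may show whether name resolution is involved.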
I had the same problem and did the following:
1. Disabled HA and DRS at the cluster level
2. Enabled HA only (this took several minutes with many hosts attached)
3. Once this completed, went in and enabled DRS
That seemed to work for me, so maybe it will work for you. However, it seems you've already pretty much done this.
Thx for the suggestions. We solved the problem with VMware support. We had changed the advanced HA options and added a "das.isolationaddress2" value, choosing a host/gateway in our storage network for this address. VMware's best practice is to use a second gateway in the management network for this value, or to leave the advanced HA options at their defaults with no "das.isolationaddress2" entry. We reset the values to the defaults and our problems were gone.
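For anyone who wants to sanity-check their own settings, here is a minimal sketch of the rule described above: an isolation address should sit in the management network. The subnets and addresses are invented for illustration, and the function name is hypothetical. It only tests subnet membership with Python's standard `ipaddress` module; it does not read anything from VirtualCenter and does not check that the address actually answers pings.

```python
import ipaddress


def isolation_address_warnings(management_network, isolation_addresses):
    """Return warnings for isolation addresses outside the management network.

    management_network:  CIDR string, e.g. "10.0.0.0/24" (assumed value)
    isolation_addresses: dict of advanced option name -> IP address string
    """
    network = ipaddress.ip_network(management_network, strict=False)
    warnings = []
    for option, address in isolation_addresses.items():
        if ipaddress.ip_address(address) not in network:
            warnings.append(
                f"{option}={address} is outside management network {network}"
            )
    return warnings


# Assumed example: management network 10.0.0.0/24, storage gateway 192.168.50.1.
EXAMPLE_OPTIONS = {"das.isolationaddress2": "192.168.50.1"}
```

With these example values, `isolation_address_warnings("10.0.0.0/24", EXAMPLE_OPTIONS)` returns one warning, which matches the misconfiguration we had.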
FYI, this location is extremely helpful when tracking down HA issues: /var/log/vmware/aam.
The log vmware_nameofesxserver.log in that directory has running entries on heartbeat information and will show you exactly when things went wrong.
We had sporadic HA issues on some clusters and eventually pinned them down to back-end chassis networking work going on at the time of the HA failures.
You want to look for messages like this:
===================================
Info NODE Thu Jul 16 20:10:11 2009
By: FT/Agent on Node: servernameesx
MESSAGE: Agent on servernameesx has stopped
===================================
Info NODE Thu Jul 16 20:10:11 2009
By: FT/Agent on Node: servernameesx
MESSAGE: Agent on node servernameesx has been shutdown.
===================================
Error FT Thu Jul 16 20:10:13 2009
By: FT/Agent on Node: servernameesx
MESSAGE: ShutdownNode: Error from PPShutdowNode.
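
If the log is long, a small script can pull out these entries and their timestamps. This is a rough sketch that assumes every entry follows the three-line layout shown above (a header with severity, category and timestamp, then a "By:" line, then a "MESSAGE:" line). The function names are my own, and the filter for "stopped" / "shutdown" is just a guess at what is worth flagging.

```python
import re
from datetime import datetime

SEPARATOR = re.compile(r"^=+$")
HEADER = re.compile(
    r"^(?P<severity>\w+)\s+(?P<category>\w+)\s+"
    r"(?P<timestamp>\w{3} \w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2} \d{4})$"
)
SOURCE = re.compile(r"^By:\s*(?P<source>.+?)\s+on Node:\s*(?P<node>\S+)$")


def parse_entries(text):
    """Parse log text in the layout shown above into a list of dicts."""
    entries = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or SEPARATOR.match(line):
            continue
        header = HEADER.match(line)
        if header:
            # A header line starts a new entry.
            current = {
                "severity": header.group("severity"),
                "category": header.group("category"),
                "timestamp": datetime.strptime(
                    header.group("timestamp"), "%a %b %d %H:%M:%S %Y"
                ),
                "source": None,
                "node": None,
                "message": "",
            }
            entries.append(current)
            continue
        if current is None:
            continue  # ignore anything before the first header
        source = SOURCE.match(line)
        if source:
            current["source"] = source.group("source")
            current["node"] = source.group("node")
        elif line.startswith("MESSAGE:"):
            current["message"] = line[len("MESSAGE:"):].strip()
    return entries


def agent_problems(entries):
    """Keep entries that are errors or mention the agent stopping."""
    keywords = ("stopped", "shutdown")
    return [
        e for e in entries
        if e["severity"] == "Error"
        or any(k in e["message"].lower() for k in keywords)
    ]
```

Reading the file with `open(path).read()` and passing the text to `parse_entries` gives a list you can sort or filter by timestamp, which makes it easier to line up HA failures with other events such as network maintenance.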