VMware Cloud Community
tjaster
Enthusiast

HA Problems in Cluster after update to Virtual Center 2.5 Update 6

Dear Community, we have problems with two of our clusters.

We updated our VirtualCenter 2.5 to Update 6 (build 227637). Now, every morning and evening, HA fails on all hosts in these two clusters (10 and 11 hosts per cluster). After disabling and re-enabling HA, with timeouts on some hosts during the disable step, everything goes back to normal. We also updated all our hosts to the current build (ESX Server 3.5.0, 238493) and did a firmware update on all our hosts (HP). We also evacuated all hosts and brought them back into the cluster. We have 9 clusters overall and only these two show the problem. Hardware and builds in the other clusters are the same. These two clusters are the largest in our environment. Any ideas what else we can do? Should we split the clusters?

--------- VMware Certified Professional 3/4
4 Replies
vcpguy
Expert

I don't think splitting the clusters is going to help you. We have a cluster with more than 12 nodes and it runs fine.

It is interesting that you mention they fail in the morning and evening. Are your DNS servers properly set?

Do you see HA problems on all the nodes or only on a few specific nodes?

Thanks

----------------------------------------------------------------------------- Please don't forget to reward Points for helpful hints; answers; suggestions. My blog: http://vmwaredevotee.com
gsommers
Contributor

I had the same problem and did the following:

1. Disabled HA and DRS at the cluster level

2. Enabled HA only - Took several minutes with many hosts attached

3. Once this completed, went in and enabled DRS

That seemed to work for me; maybe it will work for you. However, it seems you've already pretty much done this.

tjaster
Enthusiast

Thanks for the suggestions. We solved the problem with VMware support. We had changed the advanced HA options and entered a value for "das.isolationaddress2". We had chosen a host/gateway in our storage network for this address. VMware's best practice is to use a second gateway in the management network for this value, or to leave the advanced HA options at their defaults with no "das.isolationaddress2" entry. We changed the values back to the defaults and our problems were gone. :)
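
The check behind that recommendation can be sketched in a few lines of Python. This is only an illustration: the function name and every address below are made up for the example and are not taken from our environment. It simply tests whether a candidate isolation address lies inside the management subnet rather than somewhere else, such as a storage subnet.

```python
import ipaddress


def in_management_network(candidate: str, mgmt_cidr: str) -> bool:
    """Return True if the candidate isolation address lies in the management subnet."""
    # strict=False also accepts a host address with a prefix, e.g. 10.0.0.5/24
    network = ipaddress.ip_network(mgmt_cidr, strict=False)
    return ipaddress.ip_address(candidate) in network


# Hypothetical values: one management subnet and two candidate addresses.
MGMT = "10.0.0.0/24"
for addr in ("10.0.0.254", "192.168.50.1"):
    verdict = "ok" if in_management_network(addr, MGMT) else "outside management network"
    print(addr, verdict)
```

A check like this says nothing about whether the address actually answers pings from the hosts; it only catches the kind of mistake we made, where the value pointed into the wrong network.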

--------- VMware Certified Professional 3/4
mark_chuman
Hot Shot

FYI, this location is extremely helpful when tracking down HA issues: /var/log/vmware/aam.

The log vmware_nameofesxserver.log in that directory has running entries on heartbeat information and will show you exactly when things went wrong.

We had sporadic HA issues on some clusters and eventually pinned them down to back-end chassis/networking work going on at the time of the HA failures.

You want to look for messages like this:

===================================

Info NODE Thu Jul 16 20:10:11 2009

By: FT/Agent on Node: servernameesx

MESSAGE: Agent on servernameesx has stopped

===================================

Info NODE Thu Jul 16 20:10:11 2009

By: FT/Agent on Node: servernameesx

MESSAGE: Agent on node servernameesx has been shutdown.

===================================

Error FT Thu Jul 16 20:10:13 2009

By: FT/Agent on Node: servernameesx

MESSAGE: ShutdownNode: Error from PPShutdowNode.
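
If you have many hosts to check, a small script can pull these entries out for you. The following is a rough sketch in Python that assumes the log entries look exactly like the excerpt above (a severity/source/timestamp line, a "By:" line, and a "MESSAGE:" line). The function names, the keyword list, and the "Warning" severity are my own assumptions, not anything documented for the aam logs.

```python
import re
from datetime import datetime

# Header line as seen in the excerpt, e.g. "Info NODE Thu Jul 16 20:10:11 2009".
# "Warning" is an assumed severity; only Info and Error appear in the excerpt.
HEADER = re.compile(
    r"^(Info|Warning|Error)\s+(\S+)\s+"
    r"(\w{3}\s+\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}\s+\d{4})$"
)
NODE = re.compile(r"on Node:\s*(\S+)")


def parse_entries(text):
    """Split log text into dicts with severity, source, time, node and message."""
    entries = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or set(line) == {"="}:
            continue  # skip blank lines and the "=====" separators
        header = HEADER.match(line)
        if header:
            stamp = " ".join(header.group(3).split())  # normalise spacing
            current = {
                "severity": header.group(1),
                "source": header.group(2),
                "time": datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y"),
                "node": None,
                "message": "",
            }
            entries.append(current)
        elif current is not None and line.startswith("By:"):
            node = NODE.search(line)
            if node:
                current["node"] = node.group(1)
        elif current is not None and line.startswith("MESSAGE:"):
            current["message"] = line[len("MESSAGE:"):].strip()
    return entries


def suspicious(entries, keywords=("stopped", "shutdown")):
    """Keep Error entries and entries whose message mentions a keyword."""
    return [
        e for e in entries
        if e["severity"] == "Error"
        or any(k in e["message"].lower() for k in keywords)
    ]


if __name__ == "__main__":
    import sys

    # Usage: python scan_aam.py /var/log/vmware/aam/<logfile>
    with open(sys.argv[1], errors="replace") as handle:
        for entry in suspicious(parse_entries(handle.read())):
            print(entry["time"], entry["severity"], entry["node"], entry["message"])
```

Comparing the timestamps it prints across hosts against your change calendar is what pointed us at the networking work.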
