Apologies if there's a solution out there but I couldn't find one.
We are having an issue with DPM and HA in our environment. We are currently growing into a new infrastructure, so we have a number of blade servers that DPM puts into standby on a conservative setting. We are running HA on this cluster, with one blade dedicated to HA failover. Roughly once a day, typically when backups run (still a lot of host-based backups going on, I know...), one host is powered back on. At that point we almost always see an HA configuration error, which sometimes spreads to other hosts in the cluster. If we manually reconfigure HA, it cleans it up. The host then goes back to standby later, and the cycle repeats the next day.
Basically, when hosts come out of standby, HA freaks out. If we leave the hosts on for a week, HA never goes bad. HA also never seems to fail when we just enter maintenance mode and reboot. It seems to happen only when a host comes out of standby.
Obviously this should not be happening. Any suggestions on things to look at? I have a case open, but sometimes the community wins out on the solution. Obviously one solution is to turn off DPM, but we'd like to solve it first.
Hi there, did you ever get a solution for your problem?
We are experiencing the same issue with ESXi 4.0, DPM and HA.
Once a host exits standby, HA goes bananas.
It takes forever to configure the host for HA and it almost always ends with:
HA agent has an error : cmd addnode failed for primary node: Internal AAM Error - agent could not start. : Unknown HA error
Without DPM enabled, HA doesn't seem to have this issue.
Any help would be appreciated.
I've had this problem from time to time. I only have five ESX hosts with two basically on permanently. When DPM was turning on a third, or when I had to do patching and have them all on, HA usually gave me fits. When I was at VMworld 2010, I asked a VMware engineer about this.
What I was told was that in any HA setup, there is a maximum of five primary hosts. Any hosts beyond those five are not primaries. A primary holds some sort of configuration data, and that data is kept up to date on operational hosts. When hosts stay in standby for a long period of time, that information gets outdated and HA will complain about it.
Since I have five hosts and they are all primaries, the way I fix it is to have all five hosts on and reconfigure HA at that time. I've done it with only three or four on, but the last host will cause the problem unless all of them are done at the same time.
If you have more than five hosts, you may need to find all five of your primaries and reconfigure HA (by turning it off and then on) with all of them powered on. I don't know how to tell a primary from a non-primary since I haven't needed to, nor do I know what the expiration time is.
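To make that check concrete, here is a rough Python sketch. It is only an illustration under assumptions: the Host record, its field names, the host names, and which hosts are primaries are all made up, and it does not talk to vCenter or any VMware API. It just flags primaries that are not powered on, since those are the ones I'd want up before reconfiguring HA.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Host:
    """Hypothetical inventory record, not a VMware API object."""
    name: str
    power_state: str  # assumed values: "poweredOn" or "standBy"
    is_primary: bool  # assumed to be known; I don't know how to look this up


def hosts_blocking_reconfigure(hosts: List[Host]) -> List[str]:
    """Return the names of primary hosts that are not currently powered on."""
    return [h.name for h in hosts
            if h.is_primary and h.power_state != "poweredOn"]


def safe_to_reconfigure(hosts: List[Host]) -> bool:
    """True only when every primary host is powered on."""
    return not hosts_blocking_reconfigure(hosts)


# Made-up cluster: five primaries, one of them parked in standby by DPM,
# plus a sixth non-primary host that is also in standby.
cluster = [
    Host("esx01", "poweredOn", True),
    Host("esx02", "poweredOn", True),
    Host("esx03", "standBy", True),
    Host("esx04", "poweredOn", True),
    Host("esx05", "poweredOn", True),
    Host("esx06", "standBy", False),
]

if safe_to_reconfigure(cluster):
    print("All primaries are on; go ahead and reconfigure HA.")
else:
    print("Power these on first:", hosts_blocking_reconfigure(cluster))
```

In this made-up cluster only esx03 gets flagged; esx06 is in standby too, but it is not a primary, so it doesn't hold anything up.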
Sorry, I don't have any documentation to link to for verification, but given my experience, it makes sense.