Aggy
Contributor

DPM Breaks HA

Hi everyone

I'm running 4 hosts, each on ESXi 4.1, with vSphere Enterprise licensing and a separate physical vCenter box.

I'm experimenting with DPM as we don't need all 4 hosts running all the time. However, whenever one of the hosts comes out of standby it gets stuck on enabling HA. The exact error in the logs says:

HA agent has an error : cmd addnode failed for primary node: Internal AAM Error - agent could not start. : Unknown HA error
error
02/02/2011 10:29:35
Exit standby mode
After this occurs, the other 3 hosts also report HA errors, and HA is then disabled on them.
To solve this I have to disable HA on the cluster and then re-enable it.
Any ideas?
depping
Leadership

It seems that more and more people are experiencing these types of issues. Would you be so kind as to create a support request so that our engineers can look into this?

Duncan (VCDX)

Available now on Amazon: vSphere 4.1 HA and DRS technical deepdive

Aggy
Contributor

OK, I've submitted a support request and disabled DPM for the moment, as it breaks HA whenever the server comes back up.

Thanks for the advice


In the meantime, if you want to do any analysis on this yourself, try the following:

1. Remove the ESX host from VC and add it back again. Doing this uninstalls and reinstalls the AAM agent automatically.

2. Once you have removed the ESX host from VC, the AAM agent should be uninstalled. If not, you can uninstall it manually.

The command "rpm -qa | grep -i aam" will give you the list of agents installed.
The command "rpm --erase <agent>" will uninstall them.

3. While you are adding the host back to VC, the AAM agent will be installed automatically.
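
The manual check in step 2 could be scripted roughly as follows. This is only a sketch: it assumes a classic ESX host with a service console where rpm is available, so it would not apply to ESXi. The DRY_RUN variable and the run, filter_aam and remove_aam_agents names are made up for illustration. By default it only prints what it would erase.

```shell
#!/bin/sh
# Sketch only: find installed AAM (HA agent) packages and erase them.
# DRY_RUN, run, filter_aam and remove_aam_agents are illustrative names,
# not part of any VMware tooling.
DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands instead of running them

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Keep only package names that mention "aam" (case-insensitive),
# the same filter as "grep -i aam" in the post above.
filter_aam() {
    grep -i aam
}

# Reads package names on stdin, one per line, and erases the AAM ones.
remove_aam_agents() {
    filter_aam | while read -r pkg; do
        run rpm --erase "$pkg"
    done
}

# On the host itself:  rpm -qa | remove_aam_agents
```

Run it once with the default DRY_RUN=1 to see which packages match, and only then with DRY_RUN=0 if the list looks right.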

Thanks,

Ganesh

~GaneshNetworks™~ If you found this or other information useful, please consider awarding points for "Correct" or "Helpful".
Aggy
Contributor

I found this on another forum post, and it seems to have solved the issue:

Turn off HA for the cluster.

Then, from the ESXi 4.1 SSH console (you can enable SSH from Configuration > Security Profiles > Properties), run the uninstall script:

/opt/vmware/aam/VMware-aam-ha-uninstall.sh

Then restart the services:

services.sh stop

services.sh start

Re-enable HA for the cluster and click "Reconfigure for VMware HA" if it doesn't do it automatically.

Reference: kb.vmware.com/kb/1007234
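
For anyone repeating this on several hosts, the host-side part of the steps above could be wrapped in a small script like the one below. It is a sketch, not a tested fix: the script path and the services.sh calls are taken from the post as written, while DRY_RUN, run and reset_ha_agent are names invented for illustration. Turning HA off and on again for the cluster still has to be done in the vSphere Client before and after.

```shell
#!/bin/sh
# Sketch only: reset the HA (AAM) agent on one ESXi 4.1 host.
# Assumes HA has already been turned off for the cluster.
# DRY_RUN, run and reset_ha_agent are illustrative names.
DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands instead of running them

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

reset_ha_agent() {
    # Uninstall the AAM agent; stop here if the uninstall fails.
    run /opt/vmware/aam/VMware-aam-ha-uninstall.sh || return 1
    # Restart the services, as in the steps above.
    run services.sh stop || return 1
    run services.sh start
}

reset_ha_agent
```

With the default DRY_RUN=1 this only prints the three commands. Set DRY_RUN=0 on the host to actually run them.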

alecprior
Enthusiast

We have the same problem. I raised a support call and, after a week of no contact, was told by an uninterested tech that it was a bug in VC and to wait for a patch to be released. Call closed.

Aggy
Contributor

I've had the same response from VMware, and they said they'll patch it soon, so keep an eye out.

However, the above method does seem to work, though I'm not sure I'm willing to trust it for production!

depping
Leadership

Can all of you please reply with the SR number so that I can give it to the engineers? Thanks,

Duncan

kesparlat
Enthusiast

Hi all,

I'm experiencing the same issue. I thought it could be caused by a primary/secondary misconfiguration when the hosts come back up, so I tried applying Duncan's primary node election ;) in "Advanced Settings". Unfortunately, it didn't work :(

Regards,

Jose Manuel Carballo

alecprior
Enthusiast

Just to update: I re-raised this with support after 4.1 U1 didn't fix it. I've had a very good tech this time and, after some time spent looking at network configurations and hosts files, it's been forwarded to engineering (PR?).

depping
Leadership

Thanks for that!

Duncan (VCDX)

alecprior
Enthusiast

I've just had this through from Support as a possible resolution to this issue:

http://kb.vmware.com/kb/1037773

alecprior
Enthusiast

Applied the fix in that KB and we still have the same problem.
