VMware Cloud Community
mattwilson
Contributor

HA Agent Errors

Just curious whether anyone else has seen cluster nodes randomly report an error with HA. As a result, the cluster is degraded. Right now multiple nodes in a 3-node cluster show the same error, which nullifies HA for the entire cluster. Ironically, DRS still appears to work for the nodes experiencing HA problems. The unfortunate part is that the interface simply states that the agent has an error and provides no information beyond that.

Thus far I have been unable to correct the issue (other than rebuilding the node).

I have an open case with VMware Support but we haven't found a resolution and I have just been handed off to another technician.

25 Replies
KnowItAll
Hot Shot

On the ESX host:

tail -f /var/log/vmware/vpx/vpxa.log

In VC, right-click the host that shows the HA problem and click Reconfigure for HA.

Watch the log and post the error message.
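If the log scrolls too fast to read while the reconfigure runs, a small filter can help. This is only a rough sketch: the keywords (error, fail, vmap) are my assumption about what is worth catching, not a documented list of HA agent messages, and it assumes Python 3 is available wherever you run it.

```python
import re

# Keywords are an assumption; adjust them to whatever your agent actually logs.
PATTERN = re.compile(r"error|fail|vmap", re.IGNORECASE)


def matching_lines(lines, pattern=PATTERN):
    """Yield each line that matches the pattern, without its trailing newline."""
    for line in lines:
        if pattern.search(line):
            yield line.rstrip("\n")
```

To use it live, pipe the tail output into a script that loops over matching_lines(sys.stdin) and prints each result.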

admin
Immortal

The usual problems with HA configuration are:

1) DNS issues - make sure you can resolve the short hostname (without the domain name) of each ESX host from every other ESX host in the cluster.

2) FQDN too long - make sure the fully qualified domain name of every host is less than 39 characters.

Could you look at /opt/LGTOaam512/log/aam_config_util_addnode.log? It may give an indication of what went wrong.
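The two checks above could be scripted roughly as follows. Treat it as a sketch under assumptions: the host names in the usage note are made up, the default limit of 39 comes from this post (another reply in this thread mentions 29, so it is left as a parameter), and it assumes Python 3, which may not be what is installed on the ESX service console.

```python
import socket


def short_name(fqdn):
    """Return the host part of a fully qualified name, e.g. 'esx01'."""
    return fqdn.split(".", 1)[0]


def fqdn_within_limit(fqdn, limit=39):
    """True when the FQDN is strictly shorter than `limit` characters."""
    return len(fqdn) < limit


def unresolvable_short_names(fqdns, resolver=socket.gethostbyname):
    """Return the short names that the resolver cannot look up.

    `resolver` is injectable so the check can be exercised without real DNS.
    """
    failed = []
    for fqdn in fqdns:
        name = short_name(fqdn)
        try:
            resolver(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            failed.append(name)
    return failed
```

Run it on each host with the full list of cluster members, for example unresolvable_short_names(["esx01.example.com", "esx02.example.com"]), and anything returned is a short name that host cannot resolve.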

mstahl75
Virtuoso

I think that is shorter than 29 characters. That issue, though, should be fixed in the first patch, according to an SR I filed on it.

Keith_Aulson
Contributor

In case anyone searches and finds this.

I was having the same issue. I tried restarting VC and the affected ESX boxes, to no avail. For some reason I couldn't even put the hosts into maintenance mode. Reconfiguring HA on each host didn't work either.

However, before calling support, I decided to uncheck HA and DRS on this cluster, and then check them again, and shockingly, everything was happy again.

When I did the tail -f /blahblahblah, what I was seeing was a VMap error. I'm not sure whether that's generic or not, but since this worked for me, I figured I'd post it.

- Keith

Illaire
Hot Shot

> However, before calling support, I decided to uncheck HA and DRS on this cluster, and then check them again, and shockingly, everything was happy again.

Same issue with one of my ESX 3.0.1 (35804) hosts.

I tried a lot of things with no results. I read your post and unchecked "HA" on the cluster, then added it again after the reconfiguration. HA is working again.

Somehow, HA seems a wild child.

lhedrick
Enthusiast

I have seen the same issue a couple of times. If I turn HA off then back on everything magically works again...

lightfighter
Enthusiast

Yeah, most of the time it's DNS issues.

nonnau
Contributor

Thank you for your suggestion. It is so simple. Just click reconfigure for HA!

Much better than disabling and enabling HA on a cluster.

It seems that after a reboot the host does not automatically rejoin the HA cluster.

Thanks again

Christopher_J__
Contributor

This is a ridiculous problem, to be honest. I see it all the time, and we do NOT have any problems with our DNS setup. I have hosts drop off HA quite frequently. Oh sure, reconfiguring for HA seems to resolve the issue, but this isn't how it should be. I shouldn't have to reconfigure every time one host seems to take an HA nap.

netlinecg
Contributor

We had this error when the DNS server entry of one of our ESX hosts was not set correctly. I then added the FQDN and the short name of each of our ESX hosts to the /etc/hosts file on all hosts. After that everything worked fine.
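For anyone with more than a handful of hosts, the entries can be generated rather than typed. A minimal sketch, assuming Python 3; the addresses and names in the usage note are placeholders, not values from this thread. Each line follows the usual hosts-file layout of address, canonical name, then alias.

```python
def hosts_lines(hosts):
    """Build /etc/hosts lines from (ip, fqdn) pairs.

    Each line carries the IP, the FQDN, and the short name as an alias.
    """
    lines = []
    for ip, fqdn in hosts:
        short = fqdn.split(".", 1)[0]
        lines.append("%s\t%s\t%s" % (ip, fqdn, short))
    return lines
```

For example, hosts_lines([("192.0.2.11", "esx01.example.com")]) gives one tab-separated line you can append to /etc/hosts on every cluster member after reviewing it.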

Algernon
Enthusiast

I agree, our DNS is working just fine and this Legato agentry is buggy. We *just* put this cluster into production and these agents are already creating problems; great feature, but I'm not thrilled to be beta-testing it in production. I've just disabled it for now.

jimmydarts
Contributor

I can reproduce this error and clear it again.

Error configuring HA.

" Internalerror : vmap_HSLESX01 process failed to stop "

Not much help is available for this, but it consistently occurs when the public (data) network cable is unplugged. I tried to reconfigure HA and it fails at 72% with the error above.

Remove HA and DRS from the cluster (takes forever...), then add the services back in.

This has worked consistently for me. I can recreate the issue and fix it again this way...

Crazy...

Jae_Ellers
Virtuoso

We had a planned network outage last week. When I checked on the cluster after the outage 3 of 4 hosts in the cluster were showing the same HA Configuration Error or similar generic issue. Uncheck and recheck of HA did the trick, but still doesn't seem reliable. I mean, if a network cable being unplugged causes the cluster to fail it's pretty useless.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=- http://blog.mr-vm.com http://www.vmprofessional.com -=-=-=-=-=-=-=-=-=-=-=-=-=-=-

ring_zero
Contributor

We are running ESX 3.0.1, 34176 on 12 Hosts and we also see this issue quite often. In our environment, it is definitely not a name resolution issue. To fix the issue I usually just reconfigure HA on the host it has failed on and wait. The host eventually (4 hrs or so) clears the issue itself, like magic?

Anyway, while we are waiting for VMware to fix this issue, it would be nice if they would give us the ability to set up an alarm to notify us if HA fails on a given host. Currently, I do not see a trigger for HA?

CWedge
Enthusiast

> We are running ESX 3.0.1, 34176 on 12 Hosts and we also see this issue quite often. In our environment, it is definitely not a name resolution issue. To fix the issue I usually just reconfigure HA on the host it has failed on and wait. The host eventually (4 hrs or so) clears the issue itself, like magic?
>
> Anyway, while we are waiting for VMware to fix this issue, it would be nice if they would give us the ability to set up an alarm to notify us if HA fails on a given host. Currently, I do not see a trigger for HA?

Set up the hosts file and you'll be all set.

I have a 12-node cluster and I haven't had a blip since I set that up many months ago.

ring_zero
Contributor

Thanks for the reply, but the hosts files are already populated with the FQDNs and short names of all the other ESX servers, as well as the FQDNs and short names of all the DNS servers.

And '/etc/nsswitch.conf' is set up correctly.

So, it is definitely not a name resolution issue.

CWedge
Enthusiast

OK, my DR site had problems I couldn't resolve either.

I just installed VC Patch 2 (although it sucked to install).

My HA errors went away at my DR site.

ring_zero
Contributor

I'll upgrade to VMware VirtualCenter 2.0.1 Patch 2 and see what happens.

thanks ...

mcvmwaresupport
Contributor

I just set up a new production cluster and was seeing a lot of the same issues discussed here (e.g. HA activation fails with AAM agent unable to start, unable to reach primary cluster server, VMAP errors, etc.)

I configured the /etc/hosts file with entries for the VIS, DNS and cluster members, to no avail.

I seem to have the problem contained at this point. I loaded DNS on my VIS and set that as the primary DNS on the cluster members, rebooted the members and added them to the HA cluster.

I had updated to Patch 2 earlier, and was still seeing the problem until this point. I don't know if it was a coincidence, but there it is. Hope it helps.
