Solved: Re: HA Agent has an Error - Failed attempting to lock file esx.conf

bwhouse · ‎05-07-2008

Hi,

I am seeing the "HA agent on ESX_HOST in cluster CLUSTER_NAME has an error" event in VC. It clears almost instantly, but it recurs almost 10 times a day on each host in the cluster.

I have tried every step I can find to fix this problem, including removing all hosts from the cluster, removing the cluster, and re-creating it from scratch. This seemed to solve the problem for 3 days, but it is now back.

I have double-checked DNS, and every host resolves fine with both its short name and its FQDN. I have also added each host to the others' hosts files in the following format:

[root@coa-esx03 vpx]# cat /etc/hosts

#Do not remove the following line, or various programs

#that require network functionality will fail.

127.0.0.1 localhost.localdomain localhost

Each time I check the /var/log/vmware/vpx/vpxa.log after the error the following entry appears. I have had no luck finding any reference to this on the forums so far.

Error trying to perform list Error interacting with configuration file /etc/vmware/esx.conf: Failed attempting to lock file. Another process has locked the file for more than 10 seconds. The process holding the lock is /usr/bin/perl-w/usr/sbin/esxcfg-boot-o (1014). This operation will complete if it is run again after the lock is released.

kjb007 · ‎05-08-2008

Make sure you have at least the hostname and IP address of the local machine in your /etc/hosts file. From the error message, it appears that the ESX host does not know who it is.

Make sure your hostname is correct when you type 'hostname', in the file /etc/sysconfig/network, and possibly also in /etc/sysconfig/network-scripts/ifcfg-vswif0 (it does not have to be in that file).

That hostname should be in /etc/hosts file with the correct IP address in both short and fully qualified form.
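A minimal sketch of that check, for anyone who wants to script it. The function name `hosts_entry_ok` and its argument order are made up for illustration and are not part of ESX; it only confirms that one non-comment line of the given hosts file starts with the IP and lists both the fully qualified and the short name.

```shell
# Hypothetical helper (not an ESX tool): succeed only if FILE has a
# non-comment line that starts with IP and lists both FQDN and SHORT.
hosts_entry_ok() {
    file=$1 ip=$2 fqdn=$3 short=$4
    awk -v ip="$ip" -v fqdn="$fqdn" -v short="$short" '
        /^[ \t]*#/ { next }                  # skip comment lines
        $1 == ip {
            has_fqdn = 0
            has_short = 0
            for (i = 2; i <= NF; i++) {
                if ($i == fqdn)  has_fqdn = 1
                if ($i == short) has_short = 1
            }
            if (has_fqdn && has_short) found = 1
        }
        END { exit(found ? 0 : 1) }
    ' "$file"
}

# Example use on the host itself (substitute your own values):
#   hosts_entry_ok /etc/hosts 10.168.4.12 coa-esx03.domain.com coa-esx03 \
#       && echo "entry present" || echo "entry missing"
```

This only inspects the file. It does not replace checking the output of 'hostname' and /etc/sysconfig/network.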

-KjB

vExpert/VCP/VCAP vmwise.com / @vmwise -KjB

bwhouse · ‎05-08-2008

Sorry, I should have provided more detail. My hosts file already contains the short name and FQDN of every ESX host. I have also checked the sysconfig/network file. I have included the outputs below. This one is starting to really confuse me; any help would be greatly appreciated.

[root@coa-esx03 vpx]# cat /etc/hosts

#Do not remove the following line, or various programs

#that require network functionality will fail.

127.0.0.1 localhost.localdomain localhost

10.168.4.10 coa-esx01.domain.com coa-esx01

10.168.4.11 coa-esx02.domain.com coa-esx02

10.168.4.12 coa-esx03.domain.com coa-esx03

10.168.4.13 coa-esx04.domain.com coa-esx04

10.168.4.14 coa-esx05.domain.com coa-esx05

10.168.4.15 coa-esx06.domain.com coa-esx06

10.168.4.16 coa-esx07.domain.com coa-esx07

10.168.4.17 coa-esx08.domain.com coa-esx08

10.168.4.18 coa-esx09.domain.com coa-esx09

10.168.4.19 coa-esx10.domain.com coa-esx10

10.168.4.20 coa-esx11.domain.com coa-esx11

10.168.4.21 coa-esx12.domain.com coa-esx12

10.168.4.22 coa-esx13.domain.com coa-esx13

10.168.4.23 coa-esx14.domain.com coa-esx14

10.168.4.24 coa-esx15.domain.com coa-esx15

10.168.4.25 coa-esx16.domain.com coa-esx16

10.168.4.26 coa-esx17.domain.com coa-esx17

10.168.4.27 coa-esx18.domain.com coa-esx18

10.168.4.28 coa-esx19.domain.com coa-esx19

10.168.4.29 coa-esx20.domain.com coa-esx20

10.168.4.30 coa-esx21.domain.com coa-esx21

10.168.4.31 coa-esx22.domain.com coa-esx22

10.168.4.32 coa-esx23.domain.com coa-esx23

10.168.4.33 coa-esx24.domain.com coa-esx24

10.168.4.34 coa-esx25.domain.com coa-esx25

10.168.4.35 coa-esx26.domain.com coa-esx26

10.168.4.36 coa-esx27.domain.com coa-esx27

10.168.4.37 coa-esx28.domain.com coa-esx28

[root@coa-esx03 vpx]# hostname

coa-esx03.domain.com

[root@coa-esx03 vpx]# cat /etc/sysconfig/network

NETWORKING=yes

GATEWAYDEV=vswif0

HOSTNAME=coa-esx03.domain.com

GATEWAY=10.168.4.8

[root@coa-esx03 vpx]#

kjb007 · ‎05-08-2008

What ESX version are you using? Are all those hosts part of your cluster?

-KjB

mike_laspina · ‎05-08-2008

Hi,

This is an interesting issue.

The esxcfg-boot command runs every hour from cron; you can find the job in /etc/cron.hourly/refreshrd.

It's used to update the system configuration. It should be the only process doing this operation.

You need to find out if anything else is accessing the esx.conf file or if the previous esxcfg-boot cron process locked up and held the file open.

You can use lsof to see what files are open.

Are all of the hosts in a single cluster?

The HA process needs to read the esx.conf file to find the host name. With that many HA hosts, it's going to hit that file heavily with read requests.

You can do a tcpdump on the active console interface and monitor the HA polling activity to get a feel for the amount of load it has.

http://blog.laspina.ca/ vExpert 2009

bwhouse · ‎05-08-2008

Hi,

ESX 3.5 Update 1 and VC 2.5 Update 1. The patches released on the 10th have also been installed on each host.

The hosts listed above are split into 4 clusters: 2 x 10-host and 2 x 4-host clusters. The problem seems to appear randomly on both of the 10-host clusters (HA is not running on the 4-host clusters).

So far I have not seen anything else accessing the esx.conf file; however, I am not too familiar with the lsof command.

kjb007 · ‎05-09-2008

Can you check this file (/etc/opt/vmware/aam/FT_HOSTS) on all your hosts, to make sure the contents include only those servers that are part of the same cluster?
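A rough way to script that comparison is sketched below. The helper name `missing_from_file` is hypothetical, and nothing is assumed about the layout of FT_HOSTS beyond it being plain text: the function just prints each expected address or name that does not appear as a whole word anywhere in the file.

```shell
# Hypothetical helper: print every expected entry that does not appear
# as a whole word anywhere in FILE. The layout of the file is not assumed.
missing_from_file() {
    file=$1
    shift
    for want in "$@"; do
        grep -F -w -q -- "$want" "$file" || echo "$want"
    done
}

# Example: list the cluster members you expect (replace with your own):
#   missing_from_file /etc/opt/vmware/aam/FT_HOSTS 10.168.4.10 10.168.4.11
```

Run it on every host in the cluster with the same list of expected members. An empty result means every expected entry was found; it will not flag extra entries from other clusters.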

-KjB

mike_laspina · ‎05-09-2008

You could run the lsof command every second like this.

lsof -r 1 | grep "/etc/vmware/esx.conf"

This will show any process that has the file open at the moment lsof takes one of its one-second samples.
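If you want a timestamp next to each hit so it can be lined up with the VC event, something like the following might help. It is only a sketch: `log_holders` is a made-up name, and it assumes lsof's default column layout, where the first two columns are COMMAND and PID and the last column is the file name.

```shell
# Hypothetical filter: read default-format lsof output on stdin and print
# "date time command pid" for each row whose last column (NAME) is PATH.
log_holders() {
    path=$1
    stamp=$(date '+%Y-%m-%d %H:%M:%S')
    awk -v path="$path" -v stamp="$stamp" \
        '$NF == path { print stamp, $1, $2 }'
}

# Example: sample once per second until interrupted with Ctrl-C.
#   while sleep 1; do
#       lsof /etc/vmware/esx.conf 2>/dev/null | log_holders /etc/vmware/esx.conf
#   done
```

Redirect the loop's output to a file and leave it running across at least one occurrence of the error.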

admin · ‎05-13-2008

HA appears to indirectly access the esx.conf file. There is a test for network configuration prior to the HA liveness test. It makes a call to "esxcfg-vswif -l" which might be the reason for the attempt to grab the lock.

If you want to try an experiment, you can try to edit a file called /opt/vmware/aam/ha/aam_config_util.pl and comment out the line that invokes "&verify_network_configuration()" and see if you no longer get the error. If so, please file a support request to look into this further.
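For the experiment, one cautious way to make the edit is sketched here. The thread does not show the exact text of the line, so this assumes the call sits at the start of a line after optional indentation; `comment_out_verify_call` is a hypothetical name. It keeps an untouched copy of the script next to it so the change is easy to undo.

```shell
# Hypothetical helper: keep a copy of SCRIPT as SCRIPT.orig, then comment out
# lines that begin (after optional indentation) with the call named above.
comment_out_verify_call() {
    script=$1
    cp -p "$script" "$script.orig" || return 1
    sed 's/^\([[:space:]]*\)\(&verify_network_configuration()\)/\1# \2/' \
        "$script.orig" > "$script"
}

# Example (run as root on the ESX host, at your own risk):
#   comment_out_verify_call /opt/vmware/aam/ha/aam_config_util.pl
#   diff /opt/vmware/aam/ha/aam_config_util.pl.orig /opt/vmware/aam/ha/aam_config_util.pl
```

Check the diff before relying on it. To revert, copy the .orig file back over the script.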

bwhouse · ‎05-13-2008

Interesting: the /etc/opt/vmware/aam/FT_HOSTS file contains some of the hosts in the cluster but not all, and which ones are present seems to vary depending on the host I check it on.

What generates this file? Can I manually add entries?

mike_laspina · ‎05-13-2008

That file should only contain the IPs of each SC for the hosts that belong in a cluster.

I don't think you should edit that file, as it is generated automatically.

The HA configuration scripts create it using ftcli.

bwhouse · ‎05-13-2008

The odd thing is that some of the hosts in the cluster are missing from each other's FT_HOSTS files.

Is there any way to kick off the automatic generation of these files?

kjb007 · ‎05-13-2008

That would be a problem. If the hosts don't know of each other, that could cause problems. You've probably heard this already, but make sure resolution works for all hosts. You can disable HA and make sure that file is gone from all hosts. Then re-enable HA, make sure the installation succeeds on all cluster members, and re-check this file to make sure all hosts are listed in it.

-KjB

admin · ‎05-13-2008

There may be hosts missing from this file if they were added to the cluster since that HA agent was last restarted. If you restart the HA agent, the file will get repopulated with all nodes in the cluster. You can kill the ftAgent process, and it will get re-generated.

This file is a hostname-to-IP cache that is only needed if you don't have DNS configured, or entries in the /etc/hosts file for the other nodes in the cluster.

The intermittent HA agent error reported is not related to the contents of this file. It is due to the cron job that runs hourly which may cause the HA liveness test to incorrectly report that the agent has a problem, when really the liveness test failed due to the concurrent execution of the cron job. You can ignore that transient error.
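One way to test that explanation against your own logs is to see whether the lock failures cluster at the same minute of each hour. The sketch below is only that: `lock_errors_by_minute` is a hypothetical name, and it assumes each matching log line carries an HH:MM:SS time somewhere in it.

```shell
# Hypothetical helper: count lock-failure lines in LOG by minute of the hour.
# Assumes each matching line carries an HH:MM:SS time somewhere in it.
lock_errors_by_minute() {
    log=$1
    grep -F 'Failed attempting to lock file' "$log" |
    awk 'match($0, /[0-9][0-9]:[0-9][0-9]:[0-9][0-9]/) {
             count[substr($0, RSTART + 3, 2)]++
         }
         END { for (m in count) print m, count[m] }' |
    sort
}

# Example:
#   lock_errors_by_minute /var/log/vmware/vpx/vpxa.log
```

Each output line is a minute of the hour followed by the number of failures logged in that minute. If nearly all of them fall in the minute or two when /etc/cron.hourly runs on your hosts, that supports the cron explanation.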

mike_laspina · ‎05-13-2008

For ESX 3.5 you can just right-click the host in the VC GUI and select Reconfigure for VMware HA.

I cannot remember if that is available on 3.0.2.

admin · ‎05-14-2008

Let's not lose track of the problem we're trying to solve. Reconfigure HA will not populate the FT_HOSTS file on each host with every node in the cluster, since reconfigure is basically a shortcut for "remove from cluster" + "add to cluster", with the same end result: the most recently added node will not be in the FT_HOSTS file for other agents until those agents have been restarted.

Regardless, the root cause of the "HA Agent Has an Error" transient condition is the timing of the hourly backup of the esx.conf file interfering with the HA liveness test on occasion. This can be remedied by removing the &verify_network_configuration() call at the beginning of the /opt/vmware/aam/ha/aam_config_util.pl script.

kjb007 · ‎05-14-2008

I would be very wary of removing any part of a standard script that exists on everyone's system, yet causes problems for only some. If there is a timing problem, then we need to work on correcting the timing, not on removing pieces of a script. If it were done for testing purposes, that's one thing, but I disagree with removing it completely. If the job is running hourly via cron, then we can adjust the cron job to see if this problem goes away. But again, is anyone else having this issue? Maybe they are and are just ignoring it, which would support your view that it is a transient error.

Meanwhile, you should be able to restart your aam agent, to see if we can throw off the timing with that alone.

-KjB

admin · ‎05-14-2008

Fair enough to be wary. The verification of the network configuration is most important at the time of configuring HA. But it gets called for every operation, including the HA Agent liveness testing. You can safely comment out that call, or better yet, move it under the statement

elsif ( $cmd eq "addnode") {

This way, it gets called where it is needed.

There is already a bug reported for this transient error, and the permanent fix will be to properly deal with the locked config file when this condition occurs.

bwhouse · ‎05-15-2008

I would be interested to know whether other people are experiencing this issue, if it is in fact a bug within ESX 3.5.

I probably should have mentioned in more detail that it is very much a transient error that resolves itself almost instantaneously. Unless you check the log files or the Events tab in VC, you would never know the problem exists.

msevigny, is this the solution that VMware is currently recommending to solve the problem?

admin · ‎05-15-2008

There is no official recommendation at this time. We just became aware of the issue and will work on a resolution. If you are affected by it, please file a Support Request so that you can be notified of the status of a resolution.
