Hi,
I am experiencing the "HA agent on ESX_HOST in cluster CLUSTER_NAME has an error" event in VC. It clears almost instantly but recurs almost 10 times a day on each host in the cluster.
I have tried all the steps I can find to fix this problem, including removing all hosts from the cluster, removing the cluster, and re-creating it from scratch. This seemed to solve the problem for 3 days; however, it is now back.
I have double-checked DNS and every host resolves fine with both the short name and the FQDN. I have also added each host to every other host's hosts file in the following format:
root@coa-esx03 vpx# cat /etc/hosts
#Do not remove the following line, or various programs
#that require network functionality will fail.
127.0.0.1 localhost.localdomain localhost
Each time I check the /var/log/vmware/vpx/vpxa.log after the error the following entry appears. I have had no luck finding any reference to this on the forums so far.
Error trying to perform list Error interacting with configuration file /etc/vmware/esx.conf: Failed attempting to lock file. Another process has locked the file for more than 10 seconds. The process holding the lock is /usr/bin/perl-w/usr/sbin/esxcfg-boot-o (1014). This operation will complete if it is run again after the lock is released.
Make sure you have at least the hostname and IP address of the local machine in your /etc/hosts file. From the error message, it appears that the ESX host does not know who it is.
Make sure your hostname is correct when you type 'hostname' and in the file /etc/sysconfig/network, and also possibly (it does not have to be in this file) in /etc/sysconfig/network-scripts/ifcfg-vswif0.
That hostname should be in the /etc/hosts file with the correct IP address, in both short and fully qualified form.
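A minimal sketch of that check, in case it helps. The function name check_hosts_entry is made up for illustration, and it assumes a plain hosts file with one address followed by names on each line. It prints the IP address of the first line that carries both the FQDN and the short name, and returns non-zero if no such line exists.

```shell
# Hypothetical helper: verify that a hosts file maps one address to both
# the fully qualified name and the short name of a host.
check_hosts_entry() {
    hosts_file=$1
    fqdn=$2
    short=${fqdn%%.*}    # everything before the first dot
    awk -v fqdn="$fqdn" -v short="$short" '
        /^[ \t]*#/ { next }                  # skip comment lines
        {
            has_fqdn = 0; has_short = 0
            for (i = 2; i <= NF; i++) {      # field 1 is the address
                if ($i == fqdn)  has_fqdn = 1
                if ($i == short) has_short = 1
            }
            if (has_fqdn && has_short) { print $1; found = 1; exit }
        }
        END { exit(found ? 0 : 1) }
    ' "$hosts_file"
}

# Example (not run here):
#   check_hosts_entry /etc/hosts "$(hostname)" || echo "local host entry missing"
```

If 'hostname' returns only the short name on your system, pass the FQDN explicitly instead.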
-KjB
Sorry, I should have provided more detail. My hosts file already contains the short name and FQDN of every ESX host. I have also checked the sysconfig/network file. I have included the outputs below. This one is starting to really confuse me; any help would be gratefully appreciated.
root@coa-esx03 vpx# cat /etc/hosts
#Do not remove the following line, or various programs
#that require network functionality will fail.
127.0.0.1 localhost.localdomain localhost
10.168.4.10 coa-esx01.domain.com coa-esx01
10.168.4.11 coa-esx02.domain.com coa-esx02
10.168.4.12 coa-esx03.domain.com coa-esx03
10.168.4.13 coa-esx04.domain.com coa-esx04
10.168.4.14 coa-esx05.domain.com coa-esx05
10.168.4.15 coa-esx06.domain.com coa-esx06
10.168.4.16 coa-esx07.domain.com coa-esx07
10.168.4.17 coa-esx08.domain.com coa-esx08
10.168.4.18 coa-esx09.domain.com coa-esx09
10.168.4.19 coa-esx10.domain.com coa-esx10
10.168.4.20 coa-esx11.domain.com coa-esx11
10.168.4.21 coa-esx12.domain.com coa-esx12
10.168.4.22 coa-esx13.domain.com coa-esx13
10.168.4.23 coa-esx14.domain.com coa-esx14
10.168.4.24 coa-esx15.domain.com coa-esx15
10.168.4.25 coa-esx16.domain.com coa-esx16
10.168.4.26 coa-esx17.domain.com coa-esx17
10.168.4.27 coa-esx18.domain.com coa-esx18
10.168.4.28 coa-esx19.domain.com coa-esx19
10.168.4.29 coa-esx20.domain.com coa-esx20
10.168.4.30 coa-esx21.domain.com coa-esx21
10.168.4.31 coa-esx22.domain.com coa-esx22
10.168.4.32 coa-esx23.domain.com coa-esx23
10.168.4.33 coa-esx24.domain.com coa-esx24
10.168.4.34 coa-esx25.domain.com coa-esx25
10.168.4.35 coa-esx26.domain.com coa-esx26
10.168.4.36 coa-esx27.domain.com coa-esx27
10.168.4.37 coa-esx28.domain.com coa-esx28
root@coa-esx03 vpx# hostname
coa-esx03.domain.com
root@coa-esx03 vpx# cat /etc/sysconfig/network
NETWORKING=yes
GATEWAYDEV=vswif0
HOSTNAME=coa-esx03.domain.com
GATEWAY=10.168.4.8
root@coa-esx03 vpx#
What ESX version are you using? Are all those hosts part of your cluster?
-KjB
Hi,
This is an interesting issue.
The esxcfg-boot command runs every hour in the cron jobs and you can find it in /etc/cron.hourly/refreshrd
It's used to update the system configuration. It should be the only process doing this operation.
You need to find out if anything else is accessing the esx.conf file or if the previous esxcfg-boot cron process locked up and held the file open.
You can use lsof to see what files are open.
Are all of the hosts in a single cluster?
The HA process needs to read the esx.conf file to find the host name; with that many HA hosts, it's going to hit that file heavily with read requests.
You can do a tcpdump on the active console interface and monitor the HA polling activity to get a feel for the amount of load it has.
Message was edited by: mike.laspina - added question.
Hi,
ESX 3.5 Update 1 and VC 2.5 update 1. The patches released on the 10th have also been installed on each host.
The hosts listed above are split into 4 clusters: 2 x 10-host and 2 x 4-host clusters. The problem seems to appear randomly on both of the 10-host clusters (HA is not running on the 4-host clusters).
So far I have not seen anything else accessing the esx.conf file; however, I am not too familiar with the lsof command.
Can you check this file (/etc/opt/vmware/aam/FT_HOSTS) on all your hosts, to make sure the contents include only those servers that are part of the same cluster?
-KjB
You could run the lsof command every second like this.
lsof -r 1 | grep "/etc/vmware/esx.conf"
This will reveal anything that accessed the file for more than 1 second.
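If you want to leave that running across one of the hourly cron runs, a rough sketch along these lines adds a timestamp to each matching line so the hits can be compared with the times of the events in VC. The function name stamp_matches and the log path are only illustrative.

```shell
# Hypothetical helper: read lsof output on stdin and print only the lines
# that mention the given path, each prefixed with the current time.
stamp_matches() {
    pattern=$1
    while IFS= read -r line; do
        case $line in
            *"$pattern"*)
                printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$line"
                ;;
        esac
    done
}

# Example (not run here):
#   lsof -r 1 | stamp_matches /etc/vmware/esx.conf >> /tmp/esxconf-lsof.log
```

The timestamp is the time the line was read from the pipe, so treat it as approximate.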
HA appears to indirectly access the esx.conf file. There is a test for network configuration prior to the HA liveness test. It makes a call to "esxcfg-vswif -l" which might be the reason for the attempt to grab the lock.
If you want to try an experiment, you can try to edit a file called /opt/vmware/aam/ha/aam_config_util.pl and comment out the line that invokes "&verify_network_configuration()" and see if you no longer get the error. If so, please file a support request to look into this further.
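For the experiment, something like the following keeps a copy of the original script so the change is easy to undo. This is only a sketch: the function name comment_out_call and the .orig suffix are made up, and it assumes the call sits at the start of a line (after optional whitespace), which you should confirm in your copy of the script first.

```shell
# Hypothetical helper: back up a script, then comment out any line that
# begins (after optional whitespace) with &verify_network_configuration().
comment_out_call() {
    script=$1
    # Keep the untouched original next to the script.
    cp -p "$script" "$script.orig" || return 1
    # Rewrite from the backup; lines that are already commented do not match.
    sed 's/^\([[:space:]]*\)\(&verify_network_configuration()\)/\1# \2/' \
        "$script.orig" > "$script"
}

# Example (not run here):
#   comment_out_call /opt/vmware/aam/ha/aam_config_util.pl
# To undo:
#   cp -p /opt/vmware/aam/ha/aam_config_util.pl.orig /opt/vmware/aam/ha/aam_config_util.pl
```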
Interesting: the /etc/opt/vmware/aam/FT_HOSTS file contains some of the hosts in the cluster but not all, and this seems to vary depending on which host I check it on.
What generated this file? Can I manually add entries?
That file should only contain the IPs of each SC for the hosts that belong in a cluster.
I don't think you should edit that file, as it is generated automatically.
The HA configuration scripts create it using ftcli.
The odd thing is that some of the hosts in the cluster are missing from each other's FT_HOSTS file.
Is there any way to kick off the automatic generation of these files?
That would be a problem. If the hosts don't know of each other, that could cause problems. You've probably heard this already, but make sure resolution works for all hosts. You can disable HA and make sure that file is gone from all hosts. Then re-enable HA, make sure the installation succeeds on all cluster members, and re-check this file to make sure all hosts are listed in it.
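For the re-check, a small sketch like this lists which expected members are absent from the file. The function name missing_from_ft_hosts is made up, and since the exact layout of FT_HOSTS is not shown in this thread, it simply looks for each name or IP as a whole word anywhere in the file.

```shell
# Hypothetical helper: print every expected member (name or IP) that does
# not appear in the given file; returns non-zero if any are missing.
missing_from_ft_hosts() {
    ft_file=$1
    shift
    status=0
    for member in "$@"; do
        # -F: fixed string, -w: whole word, so 10.168.4.1 does not match 10.168.4.10
        if ! grep -q -F -w -- "$member" "$ft_file"; then
            echo "$member"
            status=1
        fi
    done
    return $status
}

# Example (not run here), with the members of one cluster:
#   missing_from_ft_hosts /etc/opt/vmware/aam/FT_HOSTS 10.168.4.10 10.168.4.11 10.168.4.12
```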
-KjB
There may be hosts missing from this file if they were added to the cluster since that HA agent was last restarted. If you restart the HA agent, the file will get repopulated with all nodes in the cluster. You can kill the ftAgent process, and it will get re-generated.
This file is a hostname-to-IP cache that is only needed if you don't have DNS configured, or entries in the /etc/hosts file for the other nodes in the cluster.
The intermittent HA agent error reported is not related to the contents of this file. It is due to the cron job that runs hourly which may cause the HA liveness test to incorrectly report that the agent has a problem, when really the liveness test failed due to the concurrent execution of the cron job. You can ignore that transient error.
For ESX 3.5 you can just right-click the host in the VC GUI and select Reconfigure for VMware HA.
I cannot remember if that is available on 3.0.2.
Let's not lose track of the problem we're trying to solve. Reconfigure HA will not populate the FT_HOSTS file with every node in the cluster on each host, since reconfigure is basically a shortcut for "remove from cluster" + "add to cluster", with the same end result: the most recently added node will not be in the FT_HOSTS file for other agents until those agents have been restarted.
Regardless, the root cause of the "HA Agent Has an Error" transient condition is the timing of the hourly backup of the esx.conf file interfering with the HA liveness test on occasion. This can be remedied by removing the &verify_network_configuration() call at the beginning of the /opt/vmware/aam/ha/aam_config_util.pl script.
I would be very wary of removing any part of a standard script that exists on everyone's system, yet causes problems for only some. If there is a timing problem, then we need to work on correcting the timing problem, not removing pieces of a script. If it were done for testing purposes, that's one thing, but I disagree with removing it completely. If the job is running hourly via cron, then we can adjust the cron job to see if this problem goes away, but again, is anyone else having this issue? Maybe they are and are just ignoring it, which would support your thought of it being a transient error.
Meanwhile, you should be able to restart your aam agent, to see if we can throw off the timing with that alone.
-KjB
Fair enough to be wary. The verification of the network configuration is most important at the time of configuring HA. But it gets called for every operation, including the HA Agent liveness testing. You can safely comment out that call, or better yet, move it under the statement
elsif ( $cmd eq "addnode") {
This way, it gets called where it is needed.
There is already a bug reported for this transient error, and the permanent fix will be to properly deal with the locked config file when this condition occurs.
I would be interested to know if other people are experiencing this issue, if it is in fact a bug within ESX 3.5.
I probably should have mentioned in more detail that it is very much a transient error that resolves itself almost instantaneously; unless you check the log files or the Events tab in VC, you would never know the problem exists.
msevigny, is this the solution that VMware is currently recommending to solve the problem?
There is no official recommendation at this time. We just became aware of the issue and will work on a resolution. If you are affected by it, please file a Support Request so that you can be notified of the status of a resolution.