VMware Cloud Community
mattwilson
Contributor

HA Agent Errors

Just curious whether anyone else has been seeing cluster nodes randomly report an error with HA, leaving the cluster degraded. Right now multiple nodes in a 3-node cluster show the same error, which nullifies HA for the entire cluster. Ironically, DRS still appears to work for the nodes experiencing HA problems. The unfortunate part is that the interface simply states that the agent has an error and provides no information beyond that.

Thus far I have been unable to correct the issue (other than rebuilding the node).

I have an open case with VMware Support but we haven't found a resolution and I have just been handed off to another technician.

25 Replies
Svante
Enthusiast

Keith,

Thank you for posting that! I was indeed searching, and was going through the exact same thing: rebooting, checking for DNS issues, running "Reconfigure for HA" on the hosts, etc. Simply disabling HA on the cluster and re-enabling it solved this issue for me!

EDIT: This was on VI 3 with the latest patches applied, so it seems the issue is still there.

Bill_Morton
Contributor

Not trying to thread-jack, but I am running into a very similar issue. Our ESX setup is pre-production while I wait on a new switch for the iSCSI traffic, VLANs, etc.

Anyway, I get the errors described above when trying to set up HA, and I am pretty sure that it is a DNS issue. I need some quick help with resolv.conf and nsswitch.conf.

I do not have a DNS server set up in the test environment (yes, I know it is a 'requirement' for HA), and I want to set everything up so that it works without one, so that in the future it can withstand a DNS server failure. I only have 2 ESX boxes, so it is very manageable.

I have added the short name and FQDN for each ESX box and for VirtualCenter on both ESX servers. They can all ping each other.

The resolv.conf only has one line "search xxx.edu"

The nsswitch.conf has "hosts: files dns"
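
Since the goal here is name resolution that survives without DNS, the order on the `hosts:` line is what matters: `files` has to come before `dns` so that /etc/hosts is consulted first. A minimal sketch of that check follows; the function names `hosts_lookup_order` and `files_before_dns` are made up for illustration, and the parsing assumes the usual `hosts: source source ...` layout with a space after the colon.

```shell
#!/bin/sh
# Sketch only: these helpers are hypothetical, not part of ESX or any VMware tool.

# Print the lookup sources from the "hosts:" line of an nsswitch.conf-style
# file, one per line. Defaults to /etc/nsswitch.conf when no file is given.
hosts_lookup_order() {
    awk '$1 == "hosts:" { for (i = 2; i <= NF; i++) print $i; exit }' \
        "${1:-/etc/nsswitch.conf}"
}

# Succeed if "files" is listed, and listed before "dns" (or "dns" is absent).
files_before_dns() {
    hosts_lookup_order "$1" | awk '
        $0 == "files" && !f { f = NR }   # position of the first "files"
        $0 == "dns"   && !d { d = NR }   # position of the first "dns"
        END { exit !(f && (!d || f < d)) }
    '
}
```

Run as `files_before_dns && echo "hosts file is consulted first"` on each ESX box. With the `hosts: files dns` line quoted above, the check passes.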

Networking-wise, I have the Console and Kernel traffic each on its own NIC and separate IP block, with no routing between them. I have added a VMkernel port on the Console switch for NFS access.

In the logs I see a few errors:

[2007-07-23 05:40:47.491 'App' 12782512 warning] Fault Msg: "Unable to change license state as the license server is not available."

[2007-07-23 05:41:07.859 'App' 9935792 error] [VpxaVMAP::Invoke] Command /usr/bin/perl /opt/LGTOaam512/vmware/aam_config_util.pl -z -cmd=listnodes -domain=vmware failed with error 1

[2007-07-23 05:41:07.859 'App' 9935792 error] [VpxaVMAP::Invoke] Command output:

KEY: -z VAL: 1

KEY: domain VAL: vmware

KEY: cmd VAL: listnodes

CMD: hostname -s

RESULT:

----


vmesxdl3950

CMD: /opt/LGTOaam512/bin/ft_gethostbyname vmesxdl3950 |grep FAILED

RESULT:

----


ft_gethostbyname(vmesxdl3950) FAILED!

VMwareerrortext=failed to resolve hostname/ip using short hostname vmesxdl3950

VMwareerrorcat=hostmisconfigured

VMwareresult=failure

Total time for script to complete: 0 minute(s) and 0 second(s)

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

So, can any gurus spot the problem right off? I have no clue why the licensing server issue is listed ...
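
In output like the above, the verdict sits in the `VMwareerrortext`, `VMwareerrorcat` and `VMwareresult` lines near the end, and it is easy to miss among the KEY/CMD/RESULT noise. A small sketch for pulling out just those lines, assuming the command output has been saved or piped in as plain text; `aam_summary` is a made-up name, not a VMware tool.

```shell
#!/bin/sh
# Sketch only: aam_summary is a hypothetical helper name.

# Read aam_config_util.pl output on stdin and keep only the verdict lines.
aam_summary() {
    grep -E '^VMware(errortext|errorcat|result)='
}
```

Fed the output above, this leaves the three lines that name the failure: the short hostname vmesxdl3950 could not be resolved. The lookup itself can be repeated by hand with the command shown in the log, `/opt/LGTOaam512/bin/ft_gethostbyname vmesxdl3950`.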

Bill_Morton
Contributor

Aww Frign @#$!

I just realized the hostname on one of the servers is wrong ....

It should be vmesxdl3850, not vmesxdl3950. Is there any good way to change the hostname? Remove and re-add to VC?

Bill_Morton
Contributor

Alright it was the hostname issue =(

Solved by reconfiguring DNS and Routing on the server and rebooting.

VMdawg
Enthusiast

What's the command, as root, to check the name search? I am having a brain cramp.
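
For reference, the name search list is the `search` line in /etc/resolv.conf, so `cat /etc/resolv.conf` shows it directly. A small sketch that prints only the relevant entries; the function names `search_domains` and `nameservers` are made up for illustration.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical.

# Print each domain on the "search" line(s) of a resolv.conf-style file.
# Defaults to /etc/resolv.conf when no file is given.
search_domains() {
    awk '$1 == "search" { for (i = 2; i <= NF; i++) print $i }' \
        "${1:-/etc/resolv.conf}"
}

# Print the address on each "nameserver" line of the same file.
nameservers() {
    awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}
```

On a box whose resolv.conf holds only `search xxx.edu`, as described earlier in the thread, `search_domains` prints `xxx.edu` and `nameservers` prints nothing, which confirms that no DNS server is configured.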

Timber_Wolf
Contributor

I had the same problem. This is what I did; try it, maybe it will help.

I made a list of everything I need to check prior to clustering my servers, to standardise the network configuration on all my hosts. Hope this helps somebody.

Ensure the network configuration is correct in the following config files:

1. PuTTY into the ESX host

2. Log on as root

The following you can copy and paste as is:

3. vi /etc/sysconfig/network

4. vi /etc/hosts

5. vi /etc/resolv.conf

6. service network restart

7. In the VI Client, go to Configuration, Software - Routing. Ensure that the configuration matches what you changed above.

PS: It would seem that case does matter.

8. Migrate all guests off the host.

9. Restart the ESX server

Any suggestions or comments are welcome.
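
Part of the checklist above can be scripted, so that the same comparison is made on every host before clustering. The sketch below is only a rough illustration: the function names are made up, and it assumes the hostname is set by a `HOSTNAME=` line in /etc/sysconfig/network, as on Red Hat style systems. It checks that the configured name and its short form both appear in /etc/hosts, comparing case-sensitively because case seems to matter.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical, not part of ESX.

# Print the value of HOSTNAME= from a sysconfig-style network file.
configured_hostname() {
    sed -n 's/^HOSTNAME=//p' "$1" | tr -d '"' | head -n 1
}

# Succeed if NAME ($2) appears, with exact case, as a hostname or alias
# on a non-comment line of the hosts file ($1).
name_in_hosts() {
    awk -v want="$2" '
        { sub(/#.*/, "") }                                   # drop comments
        { for (i = 2; i <= NF; i++) if ($i == want) found = 1 }
        END { exit !found }
    ' "$1"
}

# Check that the configured hostname and its short form are in the hosts file.
check_host_config() {
    network_file=${1:-/etc/sysconfig/network}
    hosts_file=${2:-/etc/hosts}
    full=$(configured_hostname "$network_file")
    if [ -z "$full" ]; then
        echo "no HOSTNAME= line in $network_file"
        return 1
    fi
    short=${full%%.*}                 # text before the first dot
    names=$full
    if [ "$short" != "$full" ]; then names="$full $short"; fi
    status=0
    for name in $names; do
        if name_in_hosts "$hosts_file" "$name"; then
            echo "ok: $name"
        else
            echo "MISSING from $hosts_file: $name"
            status=1
        fi
    done
    return $status
}
```

Running `check_host_config` as root on each host, after steps 3 to 5, gives a quick yes/no before HA is enabled. It does not replace step 7, since the VI Client settings still have to be compared by eye.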
