VMware Cloud Community
Stephen_Murphy
Contributor

HA Agents

I cannot get HA to work.

I have two DL585 servers, both connected to an MSA1000 SAN.

DNS works on both ESX servers.

If I try to enable HA in a newly created cluster, I get the following error:

/opt/LGTOaam512/bin/ft_startup failed

on both ESX machines!

So I searched the forum but found nothing except this command:

perl /opt/LGTOaam512/vmware/aam_config_util.pl -z -cmd=addnode -traceon=1 > addnode_output.txt

So I get a text file, but I still don't know why I can't start the HA agents.

Here is the output of the text file:

CMD: hostname -s
RESULT:
----
acn049ffmesx301

CMD: /opt/LGTOaam512/bin/ft_gethostbyname acn049ffmesx301 |grep FAILED
RESULT:
----

CMD: /opt/LGTOaam512/bin/ftcli -domain vmware -connect acn049ffmesx301 -port 8042 -timeout 60 -cmd "listnodes"
RESULT:
----

add_aam_node

CMD: cp -f /opt/LGTOaam512/samples/host.cfg /opt/LGTOaam512/config/acn049ffmesx301.cfg
RESULT:
----

This is the primary agent -- 1st node in cluster.
Primary agent: acn049ffmesx301

CMD: cp /opt/LGTOaam512/vmware/vmware_first_node.pl /opt/LGTOaam512/bin/runInit
RESULT:
----

CMD: /opt/LGTOaam512/bin/ft_setup -domain=vmware -upgrade=n -noprompt=y -hostname=acn049ffmesx301 -port1=8042 -licensekey=AMCFNEET-4YRDDN53CTHMBDSJ -mailserver=none -primaryagent=acn049ffmesx301
RESULT:
----
Legato Automated Availability Manager setup script.
Setting environment from /opt/LGTOaam512/config/agent_env.Linux
Setting up the Legato Automated Availability Manager agent for domain vmware
Welcome to Automated Availability Manager. (Release 5.1 )
Configuring Agent for current node: acn049ffmesx301
Enter the name of your domain [vmware]:
Using comand line argument domain of : vmware
A previous installation has been detected in this directory.
Is this a software upgrade? (y/n) :
Upgrade command line argument: n
WARNING: your previous configuration and database will be overwritten.
Do you want to continue? (y/n) :
Configuration requires the node name of a primary agent. If you
are configuring the first node in the domain, enter the name
of this node. (i.e. acn049ffmesx301) If this is a subsequent installation
enter the name of an existing primary agent node.
Enter the name of a Primary Agent Node:
Using input argument of acn049ffmesx301 for Primary Agent
Performing a primary node configuration.
Agents require the use of 4 network ports through which to
communicate. These port numbers must be available and consistent
across each of the nodes in the domain. If you are unsure about
specifying port numbers or defining primary nodes please read the
appropriate sections of the user documentation provided with this
product.
Specify the first of the 4 port numbers: [8042]
Using argument for port1: 8042
Ports 8042, 8043, 8044 and 8045 will be used.
Enter your license key: Version: 51
Expires: Permanent License
Features: Site Permanent
Enter the name of your SMTP mail server (optional):
Installation for this node is complete.
To start the Agent run the "ft_startup" command.

VMwareprogress=0

CMD: cp /tmp/aam/*.incarn /opt/LGTOaam512/log/backbone/
RESULT:
----

VMwareprogress=20
VMwareprogress=22
VMwareprogress=22
VMwareprogress=25

CMD: cp -f /opt/LGTOaam512/config/ftbb.prm /opt/LGTOaam512/config/ftbb.prm.bck
RESULT:
----

Waiting for /opt/LGTOaam512/bin/ft_startup to complete
VMwareprogress=25
VMwareprogress=25

CMD: /opt/LGTOaam512/bin/ft_startup
RESULT:
----
Legato Automated Availability Manager startup script.
Setting environment from /opt/LGTOaam512/config/agent_env.Linux
Starting agent for domain vmware
Starting Backbone...
...
Backbone started successfully.
Starting Agent...
Agent startup failed.
Unexplained fatal error. No $FT_DIR/log/agent/acn049ffmesx301_fatal.out file found.

VMwareprogress=39
ft_startup_monitor: elasped time 0 minute(s) and 22 second(s)
VMwareprogress=39
Waiting for /opt/LGTOaam512/bin/ft_startup to complete
VMwareprogress=39

CMD: /opt/LGTOaam512/bin/ft_startup
RESULT:
----
Legato Automated Availability Manager startup script.
Setting environment from /opt/LGTOaam512/config/agent_env.Linux
Starting agent for domain vmware
Bind info: Address already in use
Backbone's network ports are in use.
Assuming the backbone is running.
Starting Agent...
Agent startup failed.
Unexplained fatal error. No $FT_DIR/log/agent/acn049ffmesx301_fatal.out file found.

val: 14228 root 14228 1 0 03:28 pts/0 00:00:00 /opt/LGTOaam512/bin/ftbb -S/opt/LGTOaam512/config/vmware-sites -R/opt/LGTOaam512/config/ftbb.rc
val: 14230 root 14230 14228 0 03:28 pts/0 00:00:00 -d. -P1:2:50 -S/opt/LGTOaam512/config/vmware-sites
List: 14228 14230

VMwareerrortext=/opt/LGTOaam512/bin/ft_startup failed
VMwareerrorcat=internalerror
Copying /opt/LGTOaam512/config/vmware-sites to /opt/LGTOaam512/log/aam_config_util_addnode.log
VMwareresult=failure
Total time for script to complete: 0 minute(s) and 27 second(s)
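
A trace this long is easier to read once the failure markers are pulled out. The following is only a minimal sketch, assuming the trace was saved as addnode_output.txt as in the command above. The function name summarize_aam_trace is made up for illustration, and the patterns are simply the strings that appear in this particular trace, not anything taken from AAM documentation.

```shell
# Print only the lines of an aam_config_util.pl trace that report a failure,
# prefixed with their line numbers in the file.
# The patterns are the strings seen in the trace in this post.
summarize_aam_trace() {
    grep -n -E '^VMware(errortext|errorcat|result)=|startup failed|Address already in use' "$1"
}

# Example: summarize_aam_trace addnode_output.txt
```

In the trace above this would surface the two "Agent startup failed." lines, the "Address already in use" line from the second ft_startup attempt, and the final VMwareerrortext, VMwareerrorcat and VMwareresult lines.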

40 Replies
mirko_vogel
Contributor

I had the same issue in a test environment. After I plugged in a device with the IP address of my gateway, everything went fine. Hope this is helpful!

SergioB
Contributor

I had the same error on my test install of ESX 3.0.1. After joining two ESX servers to a cluster, I got the "failed to resolve hostname/ip by using short hostname" error.

My solution:

Add the short hostname of every cluster member to the /etc/hosts file. After doing so, everything worked out well!
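
A quick way to see which cluster members still lack an /etc/hosts entry is sketched below. This is an illustration only: missing_from_hosts is a hypothetical helper, and the host names in the usage line are placeholders for your own short names.

```shell
# Report which of the given short host names have no entry in a hosts file.
# Usage: missing_from_hosts /etc/hosts esx01 esx02
missing_from_hosts() {
    hosts_file=$1
    shift
    for name in "$@"; do
        # Drop comments, then look for the name as a whole field after the IP column.
        if ! awk -v n="$name" '
            { sub(/#.*/, "") }
            { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
            END { exit found ? 0 : 1 }
        ' "$hosts_file"; then
            echo "$name"
        fi
    done
}
```

Running it on every host with the full list of cluster members should print nothing once all entries are in place.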

slobodandjordje
Contributor

I did the same, and it helped me.

DHD
Contributor

I had the same problem: I tried to connect a disconnected ESX host and got the error "An error occurred during the configuration of the HA Agent on the host."

The cause: somebody had changed the firewall rules, so I was unable to ping the default gateway :( A simple solution, but finding it cost me almost a day :(

CWedge
Enthusiast

I had the same problem.

I fixed it by adding ALL ESX hosts to the hosts file in the /etc directory on each host.

For some reason DNS wasn't fast enough for the ESX hosts, and any delay caused HA errors. I've been HA error free for 3 months and counting.

kyoo
Contributor

I had this problem, and after running through the standard checks, I played around with one of the commands logged in the /opt/LGTOaam512/log directory:

/opt/LGTOaam512/bin/ft_gethostbyname

This command should return the same results on all hosts. I found that on one of my clustered ESX hosts (one that wasn't having a problem enabling HA) this command was resolving the host that was having a problem to its old IP address. I had to tell this host to reconfigure its HA, and then it started resolving the other host properly. Once that worked, the host with the initial problem was reconfigured for HA just fine.

usage:

/opt/LGTOaam512/bin/ft_gethostbyname
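
To check that every host returns the same result, the lookup output from each host can be saved to a file and the files compared. The sketch below assumes exactly that workflow; lookups_agree is a hypothetical helper and it only compares files byte for byte, it knows nothing about the output format of ft_gethostbyname.

```shell
# Compare name-resolution results gathered from several hosts.
# Each argument is a file holding one host's lookup output for the same name.
lookups_agree() {
    first=$1
    shift
    for f in "$@"; do
        cmp -s "$first" "$f" || { echo "mismatch: $first vs $f"; return 1; }
    done
    echo "all lookups agree"
}
```

On each host the file could be produced with the lookup in the same form the trace in the first post uses, for example /opt/LGTOaam512/bin/ft_gethostbyname acn049ffmesx301 > lookup.$(hostname -s) (the output file name is an arbitrary choice), and then copied to one place for comparison.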

Goliath222
Contributor

Great help, thanks a lot...

Best regards

Oliver

Josb
Contributor

I had the same problem with several installations. So far I have resolved every case by editing the hosts file and configuring DNS and the gateway properly. Greetings!!

lprotti
Contributor

We had the same problem: HA configuration was impossible.

We just added a PTR record in the reverse lookup zone for each of our servers' names and that did the trick. HA configuration went fine.

Hope this helps
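
For anyone checking their own reverse zone: the PTR record for an IPv4 address lives under the in-addr.arpa name built from the address's octets in reverse order. A small sketch for working out which record name to look for follows; ptr_name is a hypothetical helper and the address is a placeholder.

```shell
# Print the reverse-zone record name for a dotted-quad IPv4 address.
ptr_name() {
    old_ifs=$IFS
    IFS=.
    set -- $1          # split the address into its four octets
    IFS=$old_ifs
    echo "$4.$3.$2.$1.in-addr.arpa"
}

# ptr_name 192.168.10.21 prints 21.10.168.192.in-addr.arpa
```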

admin
Immortal

For each host in the cluster, do this:

1) hostname -s should return the right short host name of that host

2) hostname -i should return the right ip address

3) /opt/LGTOaam512/bin/ft_gethostbyname Advanced Options (as an ip address). Note that this has to be some reliable pingable address, that is not too many hops away from the hosts in the cluster (since you are using the ping to test for network connectivity).

6) The 29 character FQDN limitation has been resolved in VC2.0.1. You don't need to muck around with /etc/hosts, unless you are not using DNS for name resolution. If you are using /etc/hosts, ensure that the name resolution works alright from all hosts, using the above tests.
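
The first two checks can be scripted so they are easy to repeat on every host. This is a rough sketch rather than an official procedure: classify_host_ip is a hypothetical helper, and treating an empty or loopback answer as wrong is an assumption about what "the right ip address" means here.

```shell
# Classify the answer from `hostname -i`.
# Assumption: an empty or loopback (127.x.x.x) answer is not the address
# the other cluster members would use to reach this host.
classify_host_ip() {
    case "$1" in
        ""|127.*)  echo "suspect: '$1'"; return 1 ;;
        *.*.*.*)   echo "ok: $1" ;;
        *)         echo "suspect: '$1'"; return 1 ;;
    esac
}

# Example, run on each host in the cluster:
#   hostname -s
#   classify_host_ip "$(hostname -i)"
```

An "ok" here only means the answer has the shape of a non-loopback address; it still has to be compared by eye with the address the host is supposed to have.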

billy05
Contributor

Can anyone tell me what to check if hostname -i does not return the correct IP? I've checked /etc/hosts and ../network-scripts as well.

Thanks

admin
Immortal

You'll want to check

/etc/hosts

/etc/resolv.conf

/etc/sysconfig/network

/etc/sysconfig/network-scripts/ifcfg-vswif0 (and/or ifcfg-vswif1,2,etc)
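
One way to go through those files in a single pass is to search each of them for the host name or the address in question. The helper below is a sketch with a made-up name (show_name_settings); it just prints matching lines and says so when a file has none or cannot be read.

```shell
# Show every line in the given files that mentions a host name or IP address.
# The search is a case-insensitive fixed-string match, so dots are literal.
show_name_settings() {
    pattern=$1
    shift
    for f in "$@"; do
        if [ -r "$f" ]; then
            grep -H -i -F -- "$pattern" "$f" || echo "$f: no line mentions $pattern"
        else
            echo "$f: not readable"
        fi
    done
}

# Example (acn049ffmesx301 is the host name from the first post):
#   show_name_settings acn049ffmesx301 /etc/hosts /etc/resolv.conf \
#       /etc/sysconfig/network /etc/sysconfig/network-scripts/ifcfg-vswif0
```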

Flan5ter
Contributor

Removing the cluster and recreating it, using the short DNS name instead of the host's IP address, worked for me. You can tell it's working because when you add a host, its name is expanded to the FQDN.

Regards

mitvix
Enthusiast

I had the same problem, and resolved it by completely removing the VirtualCenter Management Server and reinstalling it.

Alexander Manfrin VCP - VMware Certified Professional Owner www.vmworld.com.br +55 61 8110 2665 - Brasilia - Brazil
Oczkov
Enthusiast

Hello,

Did you guys read this doc: http://download3.vmware.com/vmworld/2006/tac9413.pdf

It might be of some help for this case.

Cheers,

Oczkov

nick1234
Contributor

To solve this issue I put the server names into the hosts file (/etc/hosts) and it came up just fine.

P.S. In v3.5 the log files mentioned earlier may have moved to /var/log/vmware/aam.

rainer_schumach
Contributor

Hi,

After months of running HA without any issues, one of my servers dropped out of the cluster. After trying many things, I decided to disable HA on the cluster and re-enable it after a few minutes. After this, all my servers joined the HA cluster without any errors. Hope this will help.

Rainer

jc-rush
Contributor

I was having the same problem as described above.

We had changed the IP address of the server because it had been misconfigured. Apparently, when you change the IP of the service console, /etc/hosts is not updated. I changed the entry in /etc/hosts and after that, HA was all good.
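
A stale entry like that can be spotted by printing the address /etc/hosts currently holds for the host and comparing it with the address actually configured on the service console. The sketch below covers the first half only; hosts_ip_for is a hypothetical helper name.

```shell
# Print the IP address that a hosts file maps a given name to (first match wins).
# Usage: hosts_ip_for <name> <hosts-file>
hosts_ip_for() {
    awk -v n="$1" '
        { sub(/#.*/, "") }                                   # ignore comments
        { for (i = 2; i <= NF; i++) if ($i == n) { print $1; exit } }
    ' "$2"
}

# Example, run on the host whose address was changed:
#   hosts_ip_for "$(hostname -s)" /etc/hosts
```

If the printed address is still the old one, the entry needs to be edited by hand.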

Kerberos49
Contributor

I had the "Internalerror: Internal AAM error - agent could not start" error. Upon further investigation, I found a log entry on one of my ESX servers that pointed to DNS. I only found this when I used VC to connect directly to the ESX server rather than to the cluster. In the logs for that particular ESX server, I found that it could not resolve its own name. After moving my VMs over to the other ESX server, I simply put the troubled ESX server into maintenance mode and then restarted it. I watched the screen on the actual ESX box to make sure there were no errors or problems, and then started up the VC to my license server. Once the troubled ESX server was seen again, I took it out of maintenance mode, and VC then determined that it needed to reconfigure it for HA. Once that completed, there were no further log entries saying that the HA agent could not start.

Unlike what one of the earlier posters stated, I have not found HA to be "highly unavailable." I have been very happy with it. It has been very good at balancing the resources between all my VMs and suggesting migrations when needed. Other than this recent issue with the agent, it has been very smooth and stable. The agent issue wasn't even HA's fault; it was the result of not restarting the troubled ESX server after we changed DNS servers.

burdweiser
Enthusiast

I followed a lot of the steps in this thread, but nothing worked for me. What did it for me was this:

I had two host servers in an existing cluster and needed to add two new hosts to it. I had everything configured correctly, but nothing I did would let the HA agent run on the two new hosts. What I did was: create a new cluster (without EVC mode) and add both of the hosts. I had to turn off HA. I then took a guest server from my first cluster (keep in mind that all 4 of my host servers are connected to the same shared storage) and vMotioned it to my second cluster. Once the guest server had migrated, I enabled HA and PPOOOWWW!!! HA worked on both of the host servers. Weird, I know, but those are the steps that worked for me.
