I cannot get HA to work.
I have two DL585s, both connected to an MSA1000 SAN.
DNS works on both ESX Servers.
If I try to enable HA in a newly created cluster, I get the following error:
/opt/LGTOaam512/bin/ft_startup failed
on both ESX machines!
So I searched the forum but found nothing except this command:
perl /opt/LGTOaam512/vmware/aam_config_util.pl -z -cmd=addnode -traceon=1 > addnode_output.txt
So I get a txt file, but I still don't know why I can't start the HA agents.
Here is the output of the txt file:
CMD: hostname -s
RESULT:
----
acn049ffmesx301
CMD: /opt/LGTOaam512/bin/ft_gethostbyname acn049ffmesx301 |grep FAILED
RESULT:
----
CMD: /opt/LGTOaam512/bin/ftcli -domain vmware -connect acn049ffmesx301 -port 8042 -timeout 60 -cmd "listnodes"
RESULT:
----
add_aam_node
CMD: cp -f /opt/LGTOaam512/samples/host.cfg /opt/LGTOaam512/config/acn049ffmesx301.cfg
RESULT:
----
This is the primary agent -- 1st node in cluster.
Primary agent: acn049ffmesx301
CMD: cp /opt/LGTOaam512/vmware/vmware_first_node.pl /opt/LGTOaam512/bin/runInit
RESULT:
----
CMD: /opt/LGTOaam512/bin/ft_setup -domain=vmware -upgrade=n -noprompt=y -hostname=acn049ffmesx301 -port1=8042 -licensekey=AMCFNEET-4YRDDN53CTHMBDSJ -mailserver=none -primaryagent=acn049ffmesx301
RESULT:
----
Legato Automated Availability Manager setup script.
Setting environment from /opt/LGTOaam512/config/agent_env.Linux
Setting up the Legato Automated Availability Manager agent for domain vmware
Welcome to Automated Availability Manager. (Release 5.1 )
Configuring Agent for current node: acn049ffmesx301
Enter the name of your domain [vmware]:
Using comand line argument domain of : vmware
A previous installation has been detected in this directory.
Is this a software upgrade? (y/n) :
Upgrade command line argument: n
WARNING: your previous configuration and database will be overwritten.
Do you want to continue? (y/n) :
Configuration requires the node name of a primary agent. If you
are configuring the first node in the domain, enter the name
of this node. (i.e. acn049ffmesx301) If this is a subsequent installation
enter the name of an existing primary agent node.
Enter the name of a Primary Agent Node:
Using input argument of acn049ffmesx301 for Primary Agent
Performing a primary node configuration.
Agents require the use of 4 network ports through which to
communicate. These port numbers must be available and consistent
across each of the nodes in the domain. If you are unsure about
specifying port numbers or defining primary nodes please read the
appropriate sections of the user documentation provided with this
product.
Specify the first of the 4 port numbers: [8042]
Using argument for port1: 8042
Ports 8042, 8043, 8044 and 8045 will be used.
Enter your license key: Version: 51
Expires: Permanent License
Features: Site Permanent
Enter the name of your SMTP mail server (optional):
Installation for this node is complete.
To start the Agent run the "ft_startup" command.
VMwareprogress=0
CMD: cp /tmp/aam/*.incarn /opt/LGTOaam512/log/backbone/
RESULT:
----
VMwareprogress=20
VMwareprogress=22
VMwareprogress=22
VMwareprogress=25
CMD: cp -f /opt/LGTOaam512/config/ftbb.prm /opt/LGTOaam512/config/ftbb.prm.bck
RESULT:
----
Waiting for /opt/LGTOaam512/bin/ft_startup to complete
VMwareprogress=25
VMwareprogress=25
CMD: /opt/LGTOaam512/bin/ft_startup
RESULT:
----
Legato Automated Availability Manager startup script.
Setting environment from /opt/LGTOaam512/config/agent_env.Linux
Starting agent for domain vmware
Starting Backbone...
...
Backbone started successfully.
Starting Agent...
Agent startup failed.
Unexplained fatal error. No $FT_DIR/log/agent/acn049ffmesx301_fatal.out file found.
VMwareprogress=39
ft_startup_monitor: elasped time 0 minute(s) and 22 second(s)
VMwareprogress=39
Waiting for /opt/LGTOaam512/bin/ft_startup to complete
VMwareprogress=39
CMD: /opt/LGTOaam512/bin/ft_startup
RESULT:
----
Legato Automated Availability Manager startup script.
Setting environment from /opt/LGTOaam512/config/agent_env.Linux
Starting agent for domain vmware
Bind info: Address already in use
Backbone's network ports are in use.
Assuming the backbone is running.
Starting Agent...
Agent startup failed.
Unexplained fatal error. No $FT_DIR/log/agent/acn049ffmesx301_fatal.out file found.
val: 14228 root 14228 1 0 03:28 pts/0 00:00:00 /opt/LGTOaam512/bin/ftbb -S/opt/LGTOaam512/config/vmware-sites -R/opt/LGTOaam512/config/ftbb.rc
val: 14230 root 14230 14228 0 03:28 pts/0 00:00:00 -d. -P1:2:50 -S/opt/LGTOaam512/config/vmware-sites
List: 14228 14230
VMwareerrortext=/opt/LGTOaam512/bin/ft_startup failed
VMwareerrorcat=internalerror
Copying /opt/LGTOaam512/config/vmware-sites to /opt/LGTOaam512/log/aam_config_util_addnode.log
VMwareresult=failure
Total time for script to complete: 0 minute(s) and 27 second(s)
I had the same issue in a test environment. After I plugged in a device with the IP of my gateway, everything went fine. Hope this is helpful!
I had the same error on my test install of ESX 3.01. After joining 2 ESX servers to a cluster, I got the "failed to resolve hostname/ip by using short hostname" error.
My solution:
Add the short hostname of all cluster members to the /etc/hosts file. After doing so, everything worked out well!
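To see quickly which cluster members are still missing from a hosts file, something like the following sketch may help. The function name `missing_hosts_entries` and the host names in the usage line are made up for illustration; it only reads the file and changes nothing.

```shell
# Print every given host name that has no entry in the given hosts file.
# Usage: missing_hosts_entries HOSTS_FILE NAME...
missing_hosts_entries() {
    hosts_file=$1
    shift
    for name in "$@"; do
        # Drop comments, then look for the name as a whole field after the IP.
        if ! awk -v n="$name" '
            { sub(/#.*/, "") }
            { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
            END { exit (found ? 0 : 1) }
        ' "$hosts_file"; then
            echo "$name"
        fi
    done
}
```

Example call on a host (names are placeholders): `missing_hosts_entries /etc/hosts esx1 esx2`. No output means every name was found.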
I did the same, and it helped me.
I had the same problem: I tried to connect a disconnected ESX host and got the error "An error occurred during the configuration of the HA Agent on the host."
Solution: somebody had changed the firewall rules, so I was unable to ping the default gateway. A simple fix, but finding it cost me almost a day.
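Since more than one poster traced the failure to an unreachable gateway, a quick ping test from the service console is cheap to run. This is a minimal sketch, assuming the console has the usual `route -n` and `ping`. The helper name `default_gateway` is made up here.

```shell
# Read `route -n` output on stdin and print the default gateway address.
# The default route is the line whose destination is 0.0.0.0.
default_gateway() {
    awk '$1 == "0.0.0.0" { print $2; exit }'
}
```

On a host you would run something like `gw=$(route -n | default_gateway); ping -c 3 "$gw"` and check that the replies come back.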
I had the same problem.
I fixed it by adding ALL ESX hosts to the hosts file in the /etc directory on each host.
For some reason DNS wasn't fast enough for the ESX hosts, and any delay causes HA errors. I've been HA error free for 3 months and counting.
I had this problem, and after running through the standard checks, I played around with one of the commands logged in the /opt/LGTOaam512/log directory:
/opt/LGTOaam512/bin/ft_gethostbyname
This command should return the same results on all hosts. I found that on one of my clustered ESX hosts (one that wasn't having a problem enabling HA) this command was resolving the host that was having a problem to its old IP address. I had to tell this host to reconfigure its HA, and then it started resolving the other host properly. Once that worked, the host with the initial problem was reconfigured for HA just fine.
usage:
/opt/LGTOaam512/bin/ft_gethostbyname
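One way to compare the results across hosts is to save each host's output to a file and diff them. The sketch below treats the output as opaque text, since its exact format is not shown in this thread. The function name `same_output` and the file names are made up.

```shell
# Compare saved command output from several hosts.
# Usage: same_output FILE...   (one file per host)
# Prints the first file that differs from the first one and returns non-zero.
same_output() {
    first=$1
    shift
    for f in "$@"; do
        cmp -s "$first" "$f" || { echo "$f differs from $first"; return 1; }
    done
}
```

For example, on each host run `/opt/LGTOaam512/bin/ft_gethostbyname acn049ffmesx301 > /tmp/$(hostname -s).out` (the same form as in the trace at the top of the thread), copy the files to one place, and call `same_output esx1.out esx2.out`.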
Great help, thanks a lot...
Best regards
Oliver
I had the same problem with several installations. So far I have resolved every case by editing the hosts file and configuring DNS and the gateway properly. Greetings!
We had the same problem, HA configuration impossible.
We just added a PTR record in the reverse lookup zone for our servers' names and that did the trick. HA configuration went fine.
Hope this helps
For each host in the cluster, do this:
1) hostname -s should return the right short host name of that host
2) hostname -i should return the right ip address
3) /opt/LGTOaam512/bin/ft_gethostbyname Advanced Options (as an ip address). Note that this has to be some reliable pingable address, that is not too many hops away from the hosts in the cluster (since you are using the ping to test for network connectivity).
6) The 29 character FQDN limitation has been resolved in VC2.0.1. You don't need to muck around with /etc/hosts, unless you are not using DNS for name resolution. If you are using /etc/hosts, ensure that the name resolution works alright from all hosts, using the above tests.
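The first two checks in the list above can be scripted. This is only a sketch: the function name `check_identity` is made up, and treating a dotted name or a 127.x address as a problem is my assumption about what "right" means here, not something the AAM documentation states.

```shell
# Report obvious problems with the values printed by `hostname -s` and `hostname -i`.
# Usage: check_identity SHORT_NAME IP
check_identity() {
    short=$1
    ip=$2
    status=0
    case $short in
        "")  echo "short name is empty"; status=1 ;;
        *.*) echo "short name contains a dot: $short"; status=1 ;;
    esac
    case $ip in
        "")    echo "IP address is empty"; status=1 ;;
        127.*) echo "IP address is loopback: $ip"; status=1 ;;
    esac
    return $status
}
```

On each host: `check_identity "$(hostname -s)" "$(hostname -i)"`. It prints nothing and returns 0 when neither check trips.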
Can anyone tell me what to check if hostname -i does not return the correct IP? I've checked /etc/hosts and ../network-scripts as well.
Thanks
You'll want to check
/etc/hosts
/etc/resolv.conf
/etc/sysconfig/network
/etc/sysconfig/network-scripts/ifcfg-vswif0 (and/or ifcfg-vswif1,2,etc)
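To look through all of those files in one go for an address or name, a small wrapper around grep is enough. The name `grep_config` is made up. It does a fixed-string match, so searching for 10.0.0.11 would also hit 10.0.0.110.

```shell
# Search several config files for a fixed string and print file:line:text.
# Usage: grep_config PATTERN FILE...
grep_config() {
    pattern=$1
    shift
    for f in "$@"; do
        [ -r "$f" ] || { echo "$f: not readable"; continue; }
        # The extra /dev/null argument makes grep print the file name.
        grep -n -F -- "$pattern" "$f" /dev/null || true
    done
}
```

For example: `grep_config "$(hostname -s)" /etc/hosts /etc/resolv.conf /etc/sysconfig/network /etc/sysconfig/network-scripts/ifcfg-vswif0`.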
Removing the cluster and recreating it, using the short DNS name instead of the host's IP address, worked for me. You will see that it's working because when you add a host, it expands its name to the FQDN.
Regards
I had the same problem and resolved it by completely removing the VirtualCenter Management Server and reinstalling it.
Hello,
Did you guys read this doc: http://download3.vmware.com/vmworld/2006/tac9413.pdf
It might be of some help for this case.
Cheers,
Oczkov
To solve this issue I put the server names into the hosts file (/etc/hosts) and it came up just fine.
PS. In v3.5 the log files mentioned earlier may have moved to /var/log/vmware/aam
Hi,
after months of running HA without any issues one of my servers went out of the cluster. After trying many things I decided to disable HA on the cluster and reenable it after a few minutes. After this all my servers joined the HA-Cluster without any errors. Hope this will help.
Rainer
I was having the same problem as described above.
We had changed the IP address of the server due to a misconfig of it. Apparently, when you change the IP of the service console, it doesn't update /etc/hosts. I changed the entry in /etc/hosts and after that, HA was all good.
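A stale entry like that is easy to spot if you print what the hosts file currently says for a name and compare it with the address you actually assigned. A rough sketch, with a made-up function name `hosts_ip_for`:

```shell
# Print the IP address(es) a hosts file lists for a given name.
# Usage: hosts_ip_for HOSTS_FILE NAME
hosts_ip_for() {
    awk -v n="$2" '
        { sub(/#.*/, "") }                      # ignore comments
        { for (i = 2; i <= NF; i++)             # names follow the IP in field 1
              if ($i == n) { print $1; break } }
    ' "$1"
}
```

Run `hosts_ip_for /etc/hosts "$(hostname -s)"` after an IP change. If it prints the old address, or more than one line, the file needs editing.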
I had the "Internalerror: Internal AAM error - agent could not start" error. Upon further investigation, I found a log entry on one of my ESX servers that pointed to DNS. I only found this when I used VC to connect directly to the ESX server rather than to the cluster. In the logs for that particular ESX server, I found it could not resolve its own name. After moving my VMs over to the other ESX server, I simply put the troubled ESX server into maintenance mode and then restarted it. I watched the screen on the actual ESX box to make sure there were no errors or problems, and then started up the VC to my license server. Once the troubled ESX server was seen again, I took it out of maintenance mode and the VC determined it needed to reconfigure it for HA. Once that completed, there were no further log entries saying that the HA agent could not start.
Unlike what one of the earlier posters stated, I have not found HA to be "highly unavailable." I have been very happy with it. It has been very good on balancing the resources between all my VMs and suggesting migrations when needed. Other than this recent issue with the agent, it has been very smooth and stable. The agent issue wasn't even the HA's problem, but rather was the result of not restarting the troubled ESX server after we changed DNS servers.
I followed a lot of the steps in this post, but nothing worked for me. What did it for me was this:
I had two host servers in an existing cluster. I needed to add two new hosts to the cluster. I had everything configured correctly, but nothing I did would let the HA agent run on the two new hosts in the cluster. What I did was: created a new cluster (without EVC mode) and added both of the hosts. I had to turn off HA. I then took a guest server from my first cluster (keep in mind that all 4 of my host servers are connected to the same shared storage) and vMotion'ed it to my second cluster. Once the guest server migrated, I then enabled HA and PPOOOWWW!!! HA worked on both of the host servers. Weird, I know, but those are the steps that worked for me.