Hello,
We've an issue with a ESXi 3.5 Cluster with 6 nodes.
One node had an issue with HA agent, and we have an error on the Reconfigure for HA tasks :
HA agent has an error : cmd addnode failed for primary node: /opt/vmware/aam/bin/ft_startup failed to complete within 3 minutes : Unknown HA error
We try to reboot the host, but after it was rebooted, we exit the maintenance mode, and the same error shown again
(I've posted the addnode and listnodes logs from the host with issues)
aam_config_util_listnodes.log
... 11/02/10 11:11:28 [print_args ] Invoked command: 11/02/10 11:11:28 [print_args ] /opt/vmware/aam/bin/ftPerl /opt/vmware/aam/ha/aam_config_util.pl -z -shortname= x06 -uname=VMkernel -cmd=listnodes -domain=vmware 11/02/10 11:11:28 [print_args ] Environment: 11/02/10 11:11:28 [print_args ] FT_DIR=/opt/vmware/aam 11/02/10 11:11:28 [print_args ] FT_ISOLATION_TIME=1 11/02/10 11:11:28 [print_args ] GREP=/bin/grep 11/02/10 11:11:28 [print_args ] FT_CONFIG_DIR=/var/lib/vmware/aam 11/02/10 11:11:28 [print_args ] RPCINFO=/bin/rpcinfo 11/02/10 11:11:28 [print_args ] LD_LIBRARY_PATH=/lib:/usr/lib:/opt/vmware/aam/lib:/opt/vmware/vpxa/vpx: 11/02/10 11:11:28 [print_args ] PS=/bin/ps 11/02/10 11:11:28 [print_args ] FT_PERSISTED_CONFIG_DIR=/etc/opt/vmware/aam 11/02/10 11:11:28 [print_args ] PWD=/var/log/vmware/vpx 11/02/10 11:11:28 [print_args ] PS_OPTIONS= 11/02/10 11:11:28 [print_args ] FT_NO_CONSOLE_TRACE=1 11/02/10 11:11:28 [print_args ] PATH=/sbin:/usr/sbin:/bin:/usr/bin:/opt/vmware/aam/bin:/bin 11/02/10 11:11:28 [print_args ] FT_LOG_DIR=/var/log/vmware/aam 11/02/10 11:11:28 [print_args ] FT_DOMAIN=vmware 11/02/10 11:11:28 [print_args ] Parsed arguments: 11/02/10 11:11:28 [print_args ] cmd=listnodes 11/02/10 11:11:28 [print_args ] -z=1 11/02/10 11:11:28 [print_args ] uname=VMkernel 11/02/10 11:11:28 [print_args ] shortname= x06 11/02/10 11:11:28 [print_args ] domain=vmware 11/02/10 11:11:28 [issue_cmd ] CMD: /opt/vmware/aam/bin/ft_gethostbyname x06 |grep FAILED 11/02/10 11:11:28 [issue_cmd ] STATUS: 1 11/02/10 11:11:28 [issue_cmd ] RESULT: 11/02/10 11:11:28 [issue_cmd ] 11/02/10 11:11:28 [list_nodes ] 11/02/10 11:11:28 [issue_cli_cmd ] command is '/opt/vmware/aam/bin/ftcli -domain vmware -port 8042 -timeout 15 -cmd listnodes' 11/02/10 11:11:29 [issue_cmd ] CMD: /opt/vmware/aam/bin/ftcli -domain vmware -port 8042 -timeout 15 -cmd listnodes 11/02/10 11:11:29 [issue_cmd ] STATUS: 0 11/02/10 11:11:29 [issue_cmd ] RESULT: 11/02/10 11:11:29 [issue_cmd ] Node Type State 11/02/10 11:11:29 [issue_cmd ] ----------------------- ------------ -------------- 11/02/10 11:11:29 [issue_cmd ] x01 Primary Agent Failed 11/02/10 11:11:29 [issue_cmd ] x02 Primary Agent Running 11/02/10 11:11:29 [issue_cmd ] x03 Primary Agent Running 11/02/10 11:11:29 [issue_cmd ] x04 Primary Agent Running 11/02/10 11:11:29 [issue_cmd ] x05 Primary Agent Running 11/02/10 11:11:29 [issue_cmd ] x06 Secondary No Agent Run 11/02/10 11:11:29 [issue_cmd ] 11/02/10 11:11:29 [myexit ] VMwareresult=success 11/02/10 11:11:29 [elapsed_time ] Total time for script to complete: 0 minute(s) and 1 second(s) ...
aam_config_util_addnode.log
... 11/02/10 11:10:47 [print_args ] Invoked command: 11/02/10 11:10:47 [print_args ] /opt/vmware/aam/bin/ftPerl /opt/vmware/aam/ha/aam_config_util.pl -z -shortname=x06 -uname=VMkernel -cmd=addnode -iso=10.172.186.254 -primaryagent=x02 -hostNet=10.172.186.99/255.255.252.0 -primaryNet=10.172.190.175/255.255.255.0: 11/02/10 11:10:47 [print_args ] Environment: 11/02/10 11:10:47 [print_args ] FT_DIR=/opt/vmware/aam 11/02/10 11:10:47 [print_args ] FT_ISOLATION_TIME=1 11/02/10 11:10:47 [print_args ] GREP=/bin/grep 11/02/10 11:10:47 [print_args ] FT_CONFIG_DIR=/var/lib/vmware/aam 11/02/10 11:10:47 [print_args ] RPCINFO=/bin/rpcinfo 11/02/10 11:10:47 [print_args ] LD_LIBRARY_PATH=/lib:/usr/lib:/opt/vmware/aam/lib:/opt/vmware/vpxa/vpx: 11/02/10 11:10:47 [print_args ] PS=/bin/ps 11/02/10 11:10:47 [print_args ] FT_PERSISTED_CONFIG_DIR=/etc/opt/vmware/aam 11/02/10 11:10:47 [print_args ] PWD=/var/log/vmware/vpx 11/02/10 11:10:47 [print_args ] PS_OPTIONS= 11/02/10 11:10:47 [print_args ] FT_NO_CONSOLE_TRACE=1 11/02/10 11:10:47 [print_args ] PATH=/sbin:/usr/sbin:/bin:/usr/bin:/opt/vmware/aam/bin:/bin 11/02/10 11:10:47 [print_args ] FT_LOG_DIR=/var/log/vmware/aam 11/02/10 11:10:47 [print_args ] FT_DOMAIN=vmware 11/02/10 11:10:47 [print_args ] Parsed arguments: 11/02/10 11:10:47 [print_args ] cmd=addnode 11/02/10 11:10:47 [print_args ] hostnet=10.172.186.99/255.255.252.0 11/02/10 11:10:47 [print_args ] uname=VMkernel 11/02/10 11:10:47 [print_args ] primarynet=10.172.190.175/255.255.255.0: 11/02/10 11:10:47 [print_args ] primaryagent=x02 11/02/10 11:10:47 [print_args ] -z=1 11/02/10 11:10:47 [print_args ] iso=10.172.186.254 11/02/10 11:10:47 [print_args ] shortname=x06 11/02/10 11:10:47 [print_args ] domain=vmware 11/02/10 11:10:47 [active_primary_ftcli] active primary is '' 11/02/10 11:10:47 [active_primary_ftcli] command is 'listnodes' 11/02/10 11:10:47 [find_active_primary ] attempting to find an active primary. 11/02/10 11:10:47 [issue_cli_cmd ] command is '/opt/vmware/aam/bin/ftcli -domain vmware -cmd "la -l"' 11/02/10 11:10:47 [issue_cli_cmd ] Trying to talk to self ('x06'), yet there is no sites file. 11/02/10 11:10:47 [find_active_primary ] Could not connect to any node to find an active primary . 11/02/10 11:10:47 [active_primary_ftcli] could not find an active primary on which to execute the command. 11/02/10 11:10:47 [add_aam_node ] 11/02/10 11:10:47 [issue_cmd ] CMD: cp /opt/vmware/aam/bin/generateConfigBackup.pl /var/log/vmware/aam/zgenerateConfigBackup.pl 11/02/10 11:10:47 [issue_cmd ] STATUS: 0 11/02/10 11:10:47 [issue_cmd ] RESULT: 11/02/10 11:10:47 [issue_cmd ] 11/02/10 11:10:47 [issue_cmd ] CMD: cp /var/log/vmware/aam/zgenerateConfigBackup.pl /var/lib/vmware/aam/runOnce 11/02/10 11:10:47 [issue_cmd ] STATUS: 0 11/02/10 11:10:47 [issue_cmd ] RESULT: 11/02/10 11:10:47 [issue_cmd ] 11/02/10 11:10:47 [setup_ft_hosts ] setup_ft_hosts adding host x02 IP info: 10.172.190.175/255.255.255.0: 11/02/10 11:10:47 [setup_ft_hosts ] Added shortname 'x02' to /etc/hosts 11/02/10 11:10:47 [setup_ft_hosts ] Correct 'order' line found in /etc/host.conf 11/02/10 11:10:47 [setup_ft_hosts ] Correct 'hosts' line found in /etc/nsswitch.conf 11/02/10 11:10:47 [issue_cmd ] CMD: cp /opt/vmware/aam/ha/vmware_subsequent_node.pl /var/lib/vmware/aam/runOnce 11/02/10 11:10:47 [issue_cmd ] STATUS: 0 11/02/10 11:10:47 [issue_cmd ] RESULT: 11/02/10 11:10:47 [issue_cmd ] 11/02/10 11:10:47 [setup_ft_hosts ] setup_ft_hosts adding host x06 IP info: 10.172.186.99/255.255.252.0 11/02/10 11:10:47 [setup_ft_hosts ] Shortname 'x06' was found in /etc/hosts 11/02/10 11:10:47 [setup_ft_hosts ] Correct 'order' line found in /etc/host.conf 11/02/10 11:10:47 [setup_ft_hosts ] Correct 'hosts' line found in /etc/nsswitch.conf 11/02/10 11:10:48 [issue_cmd ] CMD: /opt/vmware/aam/bin/ft_gethostbyname x06 |grep FAILED 11/02/10 11:10:48 [issue_cmd ] STATUS: 1 11/02/10 11:10:48 [issue_cmd ] RESULT: 11/02/10 11:10:48 [issue_cmd ] 11/02/10 11:10:48 [issue_cmd ] CMD: /opt/vmware/aam/bin/ft_gethostbyname x02 |grep FAILED 11/02/10 11:10:48 [issue_cmd ] STATUS: 1 11/02/10 11:10:48 [issue_cmd ] RESULT: 11/02/10 11:10:48 [issue_cmd ] 11/02/10 11:10:48 [add_aam_node ] ports 8042-8045 are free for use. 11/02/10 11:10:51 [issue_cmd ] CMD: /opt/vmware/aam/bin/ft_setup -domain=vmware -upgrade=n -noprompt=y -hostname=x06 -port1=8042 -licensekey=AMCFNEET-4YRDDN53CTHMBDSJ -mailserver=none -primaryagent=x02 -isos=10.172.186.254 -nics=10.172.186.99/255.255.252.0 11/02/10 11:10:51 [issue_cmd ] STATUS: 0 11/02/10 11:10:51 [issue_cmd ] RESULT: 11/02/10 11:10:51 [issue_cmd ] AAM setup script. 11/02/10 11:10:51 [issue_cmd ] 11/02/10 11:10:51 [issue_cmd ] Setting environment from /opt/vmware/aam/config/agent_env.VMkernel 11/02/10 11:10:51 [issue_cmd ] 11/02/10 11:10:51 [issue_cmd ] Setting up the AAM agent for domain vmware 11/02/10 11:10:51 [issue_cmd ] 11/02/10 11:10:51 [issue_cmd ] Welcome to VMware HA Agent. (Release 5.1 ) 11/02/10 11:10:51 [issue_cmd ] 11/02/10 11:10:51 [issue_cmd ] Configuring Agent for current node: x06 11/02/10 11:10:51 [issue_cmd ] 11/02/10 11:10:51 [issue_cmd ] Configuration requires the node name of a primary agent. If you 11/02/10 11:10:51 [issue_cmd ] are configuring the first node in the domain, enter the name 11/02/10 11:10:51 [issue_cmd ] of this node. (i.e. x06) If this is a subsequent installation 11/02/10 11:10:51 [issue_cmd ] enter the name of an existing primary agent node. 11/02/10 11:10:51 [issue_cmd ] 11/02/10 11:10:51 [issue_cmd ] Enter the name of a Primary Agent Node [x06] : 11/02/10 11:10:51 [issue_cmd ] Using input argument of x02 for Primary Agent 11/02/10 11:10:51 [issue_cmd ] 11/02/10 11:10:51 [issue_cmd ] Agents require the use of 4 network ports through which to 11/02/10 11:10:51 [issue_cmd ] communicate. These port numbers must be available and consistent 11/02/10 11:10:51 [issue_cmd ] across each of the nodes in the domain. If you are unsure about 11/02/10 11:10:51 [issue_cmd ] specifying port numbers or defining primary nodes please read the 11/02/10 11:10:51 [issue_cmd ] appropriate sections of the user documentation provided with this 11/02/10 11:10:51 [issue_cmd ] product. 11/02/10 11:10:51 [issue_cmd ] 11/02/10 11:10:51 [issue_cmd ] Specify the first of the 4 port numbers: [8042] 11/02/10 11:10:51 [issue_cmd ] Using argument for port1: 8042 11/02/10 11:10:51 [issue_cmd ] 11/02/10 11:10:51 [issue_cmd ] Ports 8042, 8043, 8044 and 8045 will be used. 11/02/10 11:10:51 [issue_cmd ] 11/02/10 11:10:51 [issue_cmd ] Installation for this node is complete. 11/02/10 11:10:51 [issue_cmd ] 11/02/10 11:10:51 [issue_cmd ] To start the Agent run the "ft_startup" command. 11/02/10 11:10:51 [issue_cmd ] 11/02/10 11:10:51 [issue_cmd ] CMD: /bin/ping -c 1 10.172.186.254 11/02/10 11:10:51 [issue_cmd ] STATUS: 0 11/02/10 11:10:51 [issue_cmd ] RESULT: 11/02/10 11:10:51 [issue_cmd ] PING 10.172.186.254 (10.172.186.254): 56 data bytes 11/02/10 11:10:51 [issue_cmd ] 64 bytes from 10.172.186.254: icmp_seq=0 ttl=64 time=1.0 ms 11/02/10 11:10:51 [issue_cmd ] 11/02/10 11:10:51 [issue_cmd ] --- 10.172.186.254 ping statistics --- 11/02/10 11:10:51 [issue_cmd ] 1 packets transmitted, 1 packets received, 0% packet loss 11/02/10 11:10:51 [issue_cmd ] round-trip min/avg/max = 1.0/1.0/1.0 ms 11/02/10 11:10:51 [issue_cmd ] 11/02/10 11:10:51 [issue_cmd ] CMD: cp -f /var/lib/vmware/aam/ftbb.prm /var/lib/vmware/aam/ftbb.prm.bck 11/02/10 11:10:51 [issue_cmd ] STATUS: 0 11/02/10 11:10:51 [issue_cmd ] RESULT: 11/02/10 11:10:51 [issue_cmd ] 11/02/10 11:10:51 [ft_startup_monitor ] Waiting for /opt/vmware/aam/bin/ft_startup to complete 11/02/10 11:15:55 [issue_cli_cmd ] command is '/opt/vmware/aam/bin/ftcli -domain vmware -cmd "stop VMap_x06 -wait=15"' 11/02/10 11:15:55 [issue_cmd ] CMD: /opt/vmware/aam/bin/ftcli -domain vmware -cmd "stop VMap_x06 -wait=15" 11/02/10 11:15:55 [issue_cmd ] STATUS: 1 11/02/10 11:15:55 [issue_cmd ] RESULT: 11/02/10 11:15:55 [issue_cmd ] Connection with AAM Agent 10.172.190.174 has been lost. 11/02/10 11:15:55 [issue_cmd ] [Err:5] Input/output error 11/02/10 11:15:55 [issue_cmd ] [Err:13013] Error receiving data 11/02/10 11:15:55 [issue_cmd ] 11/02/10 11:15:55 [issue_cli_cmd ] command is '/opt/vmware/aam/bin/ftcli -domain vmware -cmd "forceProcState VMap_x06 stopped"' 11/02/10 11:15:55 [issue_cmd ] CMD: /opt/vmware/aam/bin/ftcli -domain vmware -cmd "forceProcState VMap_x06 stopped" 11/02/10 11:15:55 [issue_cmd ] STATUS: 1 11/02/10 11:15:55 [issue_cmd ] RESULT: 11/02/10 11:15:55 [issue_cmd ] Connection with AAM Agent 10.172.190.174 has been lost. 11/02/10 11:15:55 [issue_cmd ] [Err:5] Input/output error 11/02/10 11:15:55 [issue_cmd ] [Err:13013] Error receiving data 11/02/10 11:15:55 [issue_cmd ] 11/02/10 11:15:55 [shut_down_vmap_proce] attempt to stop VMap_x06 failed. 11/02/10 11:16:03 [issue_cmd ] CMD: /opt/vmware/aam/bin/ft_shutdown -b -ppid=2988 11/02/10 11:16:03 [issue_cmd ] STATUS: 0 11/02/10 11:16:03 [issue_cmd ] RESULT: 11/02/10 11:16:03 [issue_cmd ] AAM setup script. 11/02/10 11:16:03 [issue_cmd ] 11/02/10 11:16:03 [issue_cmd ] Setting environment from /opt/vmware/aam/config/agent_env.VMkernel 11/02/10 11:16:03 [issue_cmd ] 11/02/10 11:16:03 [issue_cmd ] Shutting down the AAM agent for domain vmware 11/02/10 11:16:03 [issue_cmd ] 11/02/10 11:16:03 [issue_cmd ] Shutting down the Agent... 11/02/10 11:16:03 [issue_cmd ] 11/02/10 11:16:03 [issue_cmd ] Shutting down the Backbone... 11/02/10 11:16:03 [issue_cmd ] Skipping /opt/vmware/aam/bin/ftPerl (4527) because it's this script. 11/02/10 11:16:03 [issue_cmd ] Adding /opt/vmware/aam/bin/ftPerl (3138) to the kill list (AAM) 11/02/10 11:16:03 [issue_cmd ] Skipping /opt/vmware/aam/bin/ftPerl (2988) because it is a parent script. 11/02/10 11:16:03 [issue_cmd ] Killing PIDs: 3138 11/02/10 11:16:03 [issue_cmd ] After kills issued: 11/02/10 11:16:03 [issue_cmd ] PID: PROC: 11/02/10 11:16:03 [issue_cmd ] PID: PROC: 11/02/10 11:16:03 [issue_cmd ] PID: 4546 PROC: ps 11/02/10 11:16:03 [issue_cmd ] PID: 4527 PROC: /opt/vmware/aam/bin/ftPerl 11/02/10 11:16:03 [issue_cmd ] PID: 4451 PROC: sleep 11/02/10 11:16:03 [issue_cmd ] PID: 2988 PROC: /opt/vmware/aam/bin/ftPerl 11/02/10 11:16:03 [issue_cmd ] PID: 2786 PROC: getty 11/02/10 11:16:03 [issue_cmd ] PID: 2782 PROC: /sbin/dcui 11/02/10 11:16:03 [issue_cmd ] PID: 2781 PROC: /bin/ash 11/02/10 11:16:03 [issue_cmd ] PID: 2600 PROC: -ash 11/02/10 11:16:03 [issue_cmd ] PID: 2581 PROC: dropbear 11/02/10 11:16:03 [issue_cmd ] PID: 1972 PROC: /sbin/sfcbd 11/02/10 11:16:03 [issue_cmd ] PID: 1968 PROC: /sbin/sfcbd 11/02/10 11:16:03 [issue_cmd ] PID: 1817 PROC: /sbin/sfcbd 11/02/10 11:16:03 [issue_cmd ] PID: 1816 PROC: /sbin/sfcbd 11/02/10 11:16:03 [issue_cmd ] PID: 1659 PROC: /sbin/sfcbd 11/02/10 11:16:03 [issue_cmd ] PID: 1658 PROC: /sbin/sfcbd 11/02/10 11:16:03 [issue_cmd ] PID: 1639 PROC: /sbin/sfcbd 11/02/10 11:16:03 [issue_cmd ] PID: 1637 PROC: /sbin/sfcbd 11/02/10 11:16:03 [issue_cmd ] PID: 1635 PROC: /sbin/sfcbd 11/02/10 11:16:03 [issue_cmd ] PID: 1577 PROC: /opt/vmware/vpxa/vpx/vpxa 11/02/10 11:16:03 [issue_cmd ] PID: 1559 PROC: /bin/sh 11/02/10 11:16:03 [issue_cmd ] PID: 1529 PROC: /bin/sh 11/02/10 11:16:03 [issue_cmd ] PID: 1496 PROC: /sbin/openwsmand 11/02/10 11:16:03 [issue_cmd ] PID: 1486 PROC: /bin/sh 11/02/10 11:16:03 [issue_cmd ] PID: 1428 PROC: /sbin/slpd 11/02/10 11:16:03 [issue_cmd ] PID: 1414 PROC: hostd 11/02/10 11:16:03 [issue_cmd ] PID: 1404 PROC: /bin/sh 11/02/10 11:16:03 [issue_cmd ] PID: 1398 PROC: /sbin/ntpd 11/02/10 11:16:03 [issue_cmd ] PID: 1383 PROC: crond 11/02/10 11:16:03 [issue_cmd ] PID: 1377 PROC: inetd 11/02/10 11:16:03 [issue_cmd ] PID: 1343 PROC: /usr/sbin/vmkiscsid 11/02/10 11:16:03 [issue_cmd ] PID: 1321 PROC: /usr/sbin/vmkiscsid 11/02/10 11:16:03 [issue_cmd ] PID: 1164 PROC: vmklogger 11/02/10 11:16:03 [issue_cmd ] PID: 1162 PROC: syslogd 11/02/10 11:16:03 [issue_cmd ] PID: 1076 PROC: /sbin/init 11/02/10 11:16:03 [issue_cmd ] 11/02/10 11:16:03 [issue_cmd ] Backing up the AAM configuration to persistent storage 11/02/10 11:16:03 [issue_cmd ] 11/02/10 11:16:03 [stop_aam ] copying /var/lib/vmware/aam/vmware-sites to /var/log/vmware/aam/aam_config_util_addnode.log FULLTIME_SITES_TID 00000036 - 1:8042,8042,8043 x06 vmware #FT_Agent_Port=8045 + 2:8042,8042,8043 x05 vmware + 3:8042,8042,8043 x02 vmware + 4:8042,8042,8043 x03 vmware + 5:8042,8042,8043 x01 vmware + 6:8042,8042,8043 x04 vmware 11/02/10 11:16:03 [vpxa_respond ] VMwareerrortext=/opt/vmware/aam/bin/ft_startup failed to complete within 3 minutes 11/02/10 11:16:03 [vpxa_respond ] VMwareerrorcat=internalerror 11/02/10 11:16:03 [myexit ] copying /var/lib/vmware/aam/vmware-sites to /var/log/vmware/aam/aam_config_util_addnode.log FULLTIME_SITES_TID 00000036 - 1:8042,8042,8043 x06 vmware #FT_Agent_Port=8045 + 2:8042,8042,8043 x05 vmware + 3:8042,8042,8043 x02 vmware + 4:8042,8042,8043 x03 vmware + 5:8042,8042,8043 x01 vmware + 6:8042,8042,8043 x04 vmware 11/02/10 11:16:03 [myexit ] Failure location: 11/02/10 11:16:03 [myexit ] function main::myexit called from line 1835 11/02/10 11:16:03 [myexit ] function main::ft_startup_monitor called from line 2154 11/02/10 11:16:03 [myexit ] function main::start_agent called from line 1193 11/02/10 11:16:03 [myexit ] function main::add_aam_node called from line 192 11/02/10 11:16:03 [myexit ] VMwareresult=failure 11/02/10 11:16:03 [elapsed_time ] Total time for script to complete: 5 minute(s) and 16 second(s) ...
Ce message a été modifié par: fredericmartin
Has HA ever worked with this host, or did it suddenly stop? Some things to consider by following this kb:
The issue seems to be that host x01 had aam agent failed.
When we reboot the x01 host, the Reconfigure HA tasks successfully ended on host x06
What we can't explain is that on the addnode log, we see that x06 host uses the x02 hosts as primary aam agent but yet he still try to connect on the x01 host...
Remove all the hosts from the cluster and add them one by one.
If you found this information useful, please consider awarding points for "Correct" or "Helpful". Thanks!!! Regards, Mohammad Wasim