This cluster has two nodes with default HA settings. One (or maybe even both) of the hosts became isolated and all VMs were shut down. I want to understand what happened. I have included the logs from /var/log/vmware/aam/vmware_servername.log on each host; as you can see, they are very different:
On server1
===================================
Info FT Fri Dec 10 09:55:12 UTC 2010
By: isolationScript on Node: ESXserver01
MESSAGE: user ESXserver01 VMware HA Agent Isolated, Notifying VPXA
===================================
Info RULE Fri Dec 10 09:55:13 2010
By: FT/Agent on Node: ESXserver01
MESSAGE: Rule RuleMonitor submitted to run on node ESXserver01.
===================================
Info RULE Fri Dec 10 09:55:13 2010
By: FT/Agent on Node: ESXserver01
MESSAGE: Rule VMWareClusterManager submitted to run on node ESXserver01.
===================================
Error RULE Fri Dec 10 09:55:13 2010
By: FT/Agent on Node: ESXserver01
MESSAGE: Unable to activate all triggers for rule VMWareClusterManager Address of target node is unknown. Agent may not be running.
===================================
Info RULE Fri Dec 10 09:55:13 2010
By: FT/Rule Manager on Node: ESXserver01
MESSAGE: Rule RuleMonitor is enabled on ESXserver01.
===================================
Info RULE Fri Dec 10 09:55:13 2010
By: FT/Rule Manager on Node: ESXserver01
MESSAGE: Rule VMWareClusterManager is enabled on ESXserver01.
===================================
Info RULE Fri Dec 10 09:55:13 2010
By: FT/Rule Interpreter on Node: ESXserver01
MESSAGE: Rule RuleMonitor is enabled on ESXserver01.
===================================
Info RULE Fri Dec 10 09:55:13 2010
By: FT/Rule Interpreter on Node: ESXserver01
MESSAGE: Rule VMWareClusterManager is enabled on ESXserver01.
===================================
Info PROC Fri Dec 10 09:55:14 2010
By: FT/Agent on Node: ESXserver01
MESSAGE: Shutdown stopping process VMAP/VMap_ESXserver01
===================================
Info PROC Fri Dec 10 09:55:15 2010
By: FT/Agent on Node: ESXserver01
MESSAGE: Process VMap_ESXserver01 on ESXserver01 stopped
===================================
Error NODE Fri Dec 10 09:55:16 2010
By: FT/Agent on Node: ESXserver01
MESSAGE: Node ESXserver02 has failed. Ping Node results: 192.168.10.3=DEAD
=========================================================================
Primary Agent version 5.1 running on Unknown 4.1
Restarted at Fri Dec 10 11:26:58(UTC) 2010
Events posted before this process started may not be found in this log file.
Check other agent log files.
===================================
Info RULE Fri Dec 10 11:26:59 2010
By: FT/Agent on Node: ESXserver01
MESSAGE: Rule RuleMonitor submitted to run on node ESXserver01.
===================================
Info RULE Fri Dec 10 11:26:59 2010
By: FT/Agent on Node: ESXserver01
MESSAGE: Rule VMWareClusterManager submitted to run on node ESXserver01.
===================================
Info RULE Fri Dec 10 11:26:59 2010
By: FT/Rule Manager on Node: ESXserver01
MESSAGE: Rule RuleMonitor is enabled on ESXserver01.
===================================
Info RULE Fri Dec 10 11:27:00 2010
By: FT/Rule Manager on Node: ESXserver01
MESSAGE: Rule VMWareClusterManager is enabled on ESXserver01.
===================================
Info RULE Fri Dec 10 11:27:00 2010
By: FT/Rule Interpreter on Node: ESXserver01
MESSAGE: Rule RuleMonitor is enabled on ESXserver01.
===================================
Info RULE Fri Dec 10 11:27:00 2010
By: FT/Rule Interpreter on Node: ESXserver01
MESSAGE: Rule VMWareClusterManager is enabled on ESXserver01.
===================================
Info NODE Fri Dec 10 11:27:00 2010
By: FT/Agent on Node: ESXserver01
MESSAGE: Node ESXserver01 is running.
===================================
Info StateMon Fri Dec 10 11:27:00 2010
By: ftStateMon on Node: ESXserver01
MESSAGE: Node ESXserver01 ftStateMon initialized.
===================================
Info PROC Fri Dec 10 11:27:03 2010
By: FT/Agent on Node: ESXserver01
MESSAGE: Started process VMap_ESXserver01 on ESXserver01
===================================
Info FT Fri Dec 10 11:27:17 2010
By: ftProcMon on Node: ESXserver02
MESSAGE: Node (null) has started receiving heartbeats from node ESXserver01.
===================================
Info NODE Fri Dec 10 11:27:17 2010
By: FT/Agent on Node: ESXserver01
MESSAGE: Node ESXserver02 is running.
===================================
Info FT Fri Dec 10 11:27:18 2010
By: ftProcMon on Node: ESXserver01
MESSAGE: Node (null) has started receiving heartbeats from node ESXserver02.
On server2
Info FT Fri Dec 10 09:55:12 UTC 2010
By: isolationScript on Node: ESXserver02
MESSAGE: user ESXserver02 VMware HA Agent Isolated, Notifying VPXA
=========================================================================
Primary Agent version 5.1 running on Unknown 4.1
Restarted at Fri Dec 10 11:27:17(UTC) 2010
Events posted before this process started may not be found in this log file.
Check other agent log files.
===================================
Info FT Fri Dec 10 11:27:17 2010
By: ftProcMon on Node: ESXserver02
MESSAGE: Node (null) has started receiving heartbeats from node ESXserver01.
===================================
Info NODE Fri Dec 10 11:27:17 2010
By: FT/Agent on Node: ESXserver01
MESSAGE: Node ESXserver02 is running.
===================================
Info FT Fri Dec 10 11:27:18 2010
By: ftProcMon on Node: ESXserver01
MESSAGE: Node (null) has started receiving heartbeats from node ESXserver02.
I also found the following in /var/log/vmware/aam/aam_config_util_listprimaries.log, which seems to indicate that ESXserver02 was the failed host at that moment, but I still can't be sure ESXserver01 didn't fail as well.
12/10/10 11:27:06 Invoked command:
12/10/10 11:27:06 /opt/vmware/aam/bin/ftPerl /opt/vmware/aam/ha/aam_config_util.pl -z -shortname=ESXserver01 -uname=VMkernel -cmd=listprimaries -domain=vmware
12/10/10 11:27:06 Environment:
12/10/10 11:27:06 FT_DIR=/opt/vmware/aam
12/10/10 11:27:06 FT_ISOLATION_TIME=1
12/10/10 11:27:06 GREP=/bin/grep
12/10/10 11:27:06 FT_CONFIG_DIR=/var/lib/vmware/aam
12/10/10 11:27:06 RPCINFO=/bin/rpcinfo
12/10/10 11:27:06 LD_LIBRARY_PATH=/lib:/usr/lib:/opt/vmware/aam/lib:/opt/vmware/vpxa/vpx:
12/10/10 11:27:06 FT_PERSISTED_CONFIG_DIR=/etc/opt/vmware/aam
12/10/10 11:27:06 PWD=/var/log/vmware/vpx
12/10/10 11:27:06 FT_NO_CONSOLE_TRACE=1
12/10/10 11:27:06 PATH=/sbin:/usr/sbin:/bin:/usr/bin:/opt/vmware/aam/bin:/bin
12/10/10 11:27:06 FT_LOG_DIR=/var/log/vmware/aam
12/10/10 11:27:06 FT_DOMAIN=vmware
12/10/10 11:27:06 Parsed arguments:
12/10/10 11:27:06 cmd=listprimaries
12/10/10 11:27:06 uname=VMkernel
12/10/10 11:27:06 shortname=ESXserver01
12/10/10 11:27:06 domain=vmware
12/10/10 11:27:06 CMD: /opt/vmware/aam/bin/ft_gethostbyname ESXserver01 |grep FAILED
12/10/10 11:27:06 command is '/opt/vmware/aam/bin/ftcli -domain vmware -timeout 15 -cmd listnodes'
12/10/10 11:27:06 CMD: /opt/vmware/aam/bin/ftcli -domain vmware -timeout 15 -cmd listnodes
12/10/10 11:27:06 *** Node ESXserver01 is the master primary ***
12/10/10 11:27:06 Node         Type     State
12/10/10 11:27:06 ESXserver01  Primary  Agent Running
12/10/10 11:27:06 ESXserver02  Primary  Node Failed
12/10/10 11:27:06 VMwareresult=success
12/10/10 11:27:06 Total time for script to complete: 0 minute(s) and 0 second(s)
The annoying thing is that I don't know which host the VMs were running on. It is quite possible they were all running on one host, as DRS is not enabled.
Can anyone help me decipher what happened or point me where to look elsewhere?
Thanks in advance.
If I had to guess, I would say ESXserver02 failed, based on:
===================================
Error NODE Fri Dec 10 09:55:16 2010
By: FT/Agent on Node: ESXserver01
MESSAGE: Node ESXserver02 has failed. Ping Node results: 192.168.10.3=DEAD
=========================================================================
...but I don't know why all of your VMs would have been powered down.
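One way to narrow it down is to put both hosts' AAM logs on a single timeline so the isolation and failure events can be read in order. The following is only a rough Python sketch, not a VMware tool. It assumes the entry layout seen in the logs above (a header line with severity, category and timestamp, then a "By: ... on Node: ..." line, then a "MESSAGE:" line) and an English locale for the day and month names. The function names parse_aam_log and merged_timeline are made up for illustration, and the sample text is copied from the excerpts in this thread.

```python
import re
from datetime import datetime

# Header line, e.g. "Info FT Fri Dec 10 09:55:12 UTC 2010" or
# "Error NODE Fri Dec 10 09:55:16 2010" (the "UTC" token is optional).
HEADER = re.compile(
    r"^(?P<severity>\w+)\s+(?P<category>\S+)\s+"
    r"(?P<stamp>\w{3} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2})(?: UTC)? (?P<year>\d{4})$"
)
# Second line of an entry, e.g. "By: FT/Agent on Node: ESXserver01".
BY = re.compile(r"^By: (?P<source>.+?) on Node: (?P<node>\S+)$")


def parse_aam_log(text):
    """Return a list of dicts, one per three-line log entry found in text."""
    lines = [ln.strip() for ln in text.splitlines()]
    entries = []
    for i, line in enumerate(lines):
        header = HEADER.match(line)
        if not header or i + 2 >= len(lines):
            continue
        by = BY.match(lines[i + 1])
        if not by or not lines[i + 2].startswith("MESSAGE:"):
            continue
        when = datetime.strptime(
            f"{header['stamp']} {header['year']}", "%a %b %d %H:%M:%S %Y"
        )
        entries.append({
            "when": when,
            "severity": header["severity"],
            "category": header["category"],
            "source": by["source"],
            "node": by["node"],
            "message": lines[i + 2][len("MESSAGE:"):].strip(),
        })
    return entries


def merged_timeline(logs):
    """logs maps a host label to raw log text; returns rows sorted by time."""
    rows = []
    for host, text in logs.items():
        for e in parse_aam_log(text):
            rows.append((e["when"], host, e["severity"], e["node"], e["message"]))
    return sorted(rows)


# Excerpts copied from the two logs posted above.
SAMPLE_SERVER1 = """\
===================================
Info FT Fri Dec 10 09:55:12 UTC 2010
By: isolationScript on Node: ESXserver01
MESSAGE: user ESXserver01 VMware HA Agent Isolated, Notifying VPXA
===================================
Error NODE Fri Dec 10 09:55:16 2010
By: FT/Agent on Node: ESXserver01
MESSAGE: Node ESXserver02 has failed. Ping Node results: 192.168.10.3=DEAD
"""

SAMPLE_SERVER2 = """\
Info FT Fri Dec 10 09:55:12 UTC 2010
By: isolationScript on Node: ESXserver02
MESSAGE: user ESXserver02 VMware HA Agent Isolated, Notifying VPXA
"""

if __name__ == "__main__":
    logs = {"server1": SAMPLE_SERVER1, "server2": SAMPLE_SERVER2}
    for when, host, severity, node, message in merged_timeline(logs):
        print(when.isoformat(), host, severity, node, message)
```

Run against the full files from both hosts, this should make it easier to see that both isolationScript entries carry the same 09:55:12 timestamp, which is worth checking against the vCenter event history for the same period.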