fatbobsufc
Contributor

Determining which host became isolated in a two host cluster

OK, so this particular cluster has two nodes with default HA settings. One (or maybe even both) of the hosts became isolated and all VMs were shut down. I want to understand what happened. I have included the logs from /var/log/vmware/aam/vmware_servername.log on each host and, as you can see, they are very different:

On server1

===================================

Info FT Fri Dec 10 09:55:12 UTC 2010

By: isolationScript on Node: ESXserver01

MESSAGE: user ESXserver01 VMware HA Agent Isolated, Notifying VPXA

===================================

Info RULE Fri Dec 10 09:55:13 2010

By: FT/Agent on Node: ESXserver01

MESSAGE: Rule RuleMonitor submitted to run on node ESXserver01.

===================================

Info RULE Fri Dec 10 09:55:13 2010

By: FT/Agent on Node: ESXserver01

MESSAGE: Rule VMWareClusterManager submitted to run on node ESXserver01.

===================================

Error RULE Fri Dec 10 09:55:13 2010

By: FT/Agent on Node: ESXserver01

MESSAGE: Unable to activate all triggers for rule VMWareClusterManager Address of target node is unknown. Agent may not be running.

===================================

Info RULE Fri Dec 10 09:55:13 2010

By: FT/Rule Manager on Node: ESXserver01

MESSAGE: Rule RuleMonitor is enabled on ESXserver01.

===================================

Info RULE Fri Dec 10 09:55:13 2010

By: FT/Rule Manager on Node: ESXserver01

MESSAGE: Rule VMWareClusterManager is enabled on ESXserver01.

===================================

Info RULE Fri Dec 10 09:55:13 2010

By: FT/Rule Interpreter on Node: ESXserver01

MESSAGE: Rule RuleMonitor is enabled on ESXserver01.

===================================

Info RULE Fri Dec 10 09:55:13 2010

By: FT/Rule Interpreter on Node: ESXserver01

MESSAGE: Rule VMWareClusterManager is enabled on ESXserver01.

===================================

Info PROC Fri Dec 10 09:55:14 2010

By: FT/Agent on Node: ESXserver01

MESSAGE: Shutdown stopping process VMAP/VMap_ESXserver01

===================================

Info PROC Fri Dec 10 09:55:15 2010

By: FT/Agent on Node: ESXserver01

MESSAGE: Process VMap_ESXserver01 on ESXserver01 stopped

===================================

Error NODE Fri Dec 10 09:55:16 2010

By: FT/Agent on Node: ESXserver01

MESSAGE: Node ESXserver02 has failed. Ping Node results: 192.168.10.3=DEAD

=========================================================================

Primary Agent version 5.1 running on Unknown 4.1

Restarted at Fri Dec 10 11:26:58(UTC) 2010

Events posted before this process started may not be found in this log file.

Check other agent log files.

===================================

Info RULE Fri Dec 10 11:26:59 2010

By: FT/Agent on Node: ESXserver01

MESSAGE: Rule RuleMonitor submitted to run on node ESXserver01.

===================================

Info RULE Fri Dec 10 11:26:59 2010

By: FT/Agent on Node: ESXserver01

MESSAGE: Rule VMWareClusterManager submitted to run on node ESXserver01.

===================================

Info RULE Fri Dec 10 11:26:59 2010

By: FT/Rule Manager on Node: ESXserver01

MESSAGE: Rule RuleMonitor is enabled on ESXserver01.

===================================

Info RULE Fri Dec 10 11:27:00 2010

By: FT/Rule Manager on Node: ESXserver01

MESSAGE: Rule VMWareClusterManager is enabled on ESXserver01.

===================================

Info RULE Fri Dec 10 11:27:00 2010

By: FT/Rule Interpreter on Node: ESXserver01

MESSAGE: Rule RuleMonitor is enabled on ESXserver01.

===================================

Info RULE Fri Dec 10 11:27:00 2010

By: FT/Rule Interpreter on Node: ESXserver01

MESSAGE: Rule VMWareClusterManager is enabled on ESXserver01.

===================================

Info NODE Fri Dec 10 11:27:00 2010

By: FT/Agent on Node: ESXserver01

MESSAGE: Node ESXserver01 is running.

===================================

Info StateMon Fri Dec 10 11:27:00 2010

By: ftStateMon on Node: ESXserver01

MESSAGE: Node ESXserver01 ftStateMon initialized.

===================================

Info PROC Fri Dec 10 11:27:03 2010

By: FT/Agent on Node: ESXserver01

MESSAGE: Started process VMap_ESXserver01 on ESXserver01

===================================

Info FT Fri Dec 10 11:27:17 2010

By: ftProcMon on Node: ESXserver02

MESSAGE: Node (null) has started receiving heartbeats from node ESXserver01.

===================================

Info NODE Fri Dec 10 11:27:17 2010

By: FT/Agent on Node: ESXserver01

MESSAGE: Node ESXserver02 is running.

===================================

Info FT Fri Dec 10 11:27:18 2010

By: ftProcMon on Node: ESXserver01

MESSAGE: Node (null) has started receiving heartbeats from node ESXserver02.

On server2

Info FT Fri Dec 10 09:55:12 UTC 2010

By: isolationScript on Node: ESXserver02

MESSAGE: user ESXserver02 VMware HA Agent Isolated, Notifying VPXA

=========================================================================

Primary Agent version 5.1 running on Unknown 4.1

Restarted at Fri Dec 10 11:27:17(UTC) 2010

Events posted before this process started may not be found in this log file.

Check other agent log files.

===================================

Info FT Fri Dec 10 11:27:17 2010

By: ftProcMon on Node: ESXserver02

MESSAGE: Node (null) has started receiving heartbeats from node ESXserver01.

===================================

Info NODE Fri Dec 10 11:27:17 2010

By: FT/Agent on Node: ESXserver01

MESSAGE: Node ESXserver02 is running.

===================================

Info FT Fri Dec 10 11:27:18 2010

By: ftProcMon on Node: ESXserver01

MESSAGE: Node (null) has started receiving heartbeats from node ESXserver02.

I also found the following in /var/log/vmware/aam/aam_config_util_listprimaries.log, which seems to indicate that ESXserver02 was the failed host at that point in time (11:27:06), but I still can't be sure ESXserver01 didn't fail as well.

12/10/10 11:27:06 Invoked command:

12/10/10 11:27:06 /opt/vmware/aam/bin/ftPerl /opt/vmware/aam/ha/aam_config_util.pl -z -shortname=ESXserver01 -uname=VMkernel -cmd=listprimaries -domain=vmware

12/10/10 11:27:06 Environment:

12/10/10 11:27:06 FT_DIR=/opt/vmware/aam

12/10/10 11:27:06 FT_ISOLATION_TIME=1

12/10/10 11:27:06 GREP=/bin/grep

12/10/10 11:27:06 FT_CONFIG_DIR=/var/lib/vmware/aam

12/10/10 11:27:06 RPCINFO=/bin/rpcinfo

12/10/10 11:27:06 LD_LIBRARY_PATH=/lib:/usr/lib:/opt/vmware/aam/lib:/opt/vmware/vpxa/vpx:

12/10/10 11:27:06 PS=/bin/ps

12/10/10 11:27:06 FT_PERSISTED_CONFIG_DIR=/etc/opt/vmware/aam

12/10/10 11:27:06 PWD=/var/log/vmware/vpx

12/10/10 11:27:06 PS_OPTIONS=

12/10/10 11:27:06 FT_NO_CONSOLE_TRACE=1

12/10/10 11:27:06 PATH=/sbin:/usr/sbin:/bin:/usr/bin:/opt/vmware/aam/bin:/bin

12/10/10 11:27:06 FT_LOG_DIR=/var/log/vmware/aam

12/10/10 11:27:06 FT_DOMAIN=vmware

12/10/10 11:27:06 Parsed arguments:

12/10/10 11:27:06 cmd=listprimaries

12/10/10 11:27:06 -z=1

12/10/10 11:27:06 uname=VMkernel

12/10/10 11:27:06 shortname=ESXserver01

12/10/10 11:27:06 domain=vmware

12/10/10 11:27:06 CMD: /opt/vmware/aam/bin/ft_gethostbyname ESXserver01 |grep FAILED

12/10/10 11:27:06 STATUS: 1

12/10/10 11:27:06 RESULT:

12/10/10 11:27:06

12/10/10 11:27:06

12/10/10 11:27:06 command is '/opt/vmware/aam/bin/ftcli -domain vmware -timeout 15 -cmd listnodes'

12/10/10 11:27:06 CMD: /opt/vmware/aam/bin/ftcli -domain vmware -timeout 15 -cmd listnodes

12/10/10 11:27:06 STATUS: 0

12/10/10 11:27:06 RESULT:

12/10/10 11:27:06 *** Node ESXserver01 is the master primary ***

12/10/10 11:27:06 Node Type State

12/10/10 11:27:06 ---- ---- -----

12/10/10 11:27:06 ESXserver01 Primary Agent Running

12/10/10 11:27:06 ESXserver02 Primary Node Failed

12/10/10 11:27:06

12/10/10 11:27:06 VMwareresult=success

12/10/10 11:27:06 Total time for script to complete: 0 minute(s) and 0 second(s)
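For anyone wanting to pull the node states out of that listnodes output without reading it by eye, here is a minimal Python sketch. It assumes each node row is a timestamp followed by node name, type and state, and that the type is either Primary or Secondary; the function name `parse_listnodes` is made up for illustration and is not part of any VMware tool.

```python
import re

# Assumed row layout: "<MM/DD/YY HH:MM:SS> <node> <Primary|Secondary> <state...>"
NODE_ROW = re.compile(
    r"^\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\s+(\S+)\s+(Primary|Secondary)\s+(.+?)\s*$"
)

def parse_listnodes(text):
    """Return {node_name: (node_type, state)} for every node row found."""
    nodes = {}
    for line in text.splitlines():
        match = NODE_ROW.match(line.strip())
        if match:
            name, node_type, state = match.groups()
            nodes[name] = (node_type, state)
    return nodes

if __name__ == "__main__":
    sample = """\
12/10/10 11:27:06 *** Node ESXserver01 is the master primary ***
12/10/10 11:27:06 Node Type State
12/10/10 11:27:06 ESXserver01 Primary Agent Running
12/10/10 11:27:06 ESXserver02 Primary Node Failed
12/10/10 11:27:06 VMwareresult=success
"""
    for name, (node_type, state) in parse_listnodes(sample).items():
        print(name, node_type, state)
```

Note that this only reflects what ESXserver01 believed at 11:27:06, so it cannot by itself rule out that ESXserver01 was isolated earlier.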

The annoying thing is that I don't know which host the VMs were running on. It is quite possible they were all running on one host, as DRS is not enabled.

Can anyone help me decipher what happened, or point me to where else I should look?

Thanks in advance.

1 Reply
Troy_Clavell
Immortal

If I had to guess, I would say ESXserver02 failed, based on:

===================================
Error NODE Fri Dec 10 09:55:16 2010
By: FT/Agent on Node: ESXserver01
MESSAGE: Node ESXserver02 has failed. Ping Node results: 192.168.10.3=DEAD

=========================================================================

...but I don't know why all of your VMs would have been powered down.
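
If it helps to line the two hosts' logs up on one timeline, below is a minimal Python sketch that parses entries in the format pasted above and keeps only the isolation and node-failure messages. The function names `parse_events` and `key_events` are hypothetical, the `Warning` severity is an assumption (only `Info` and `Error` appear in the pasted logs), and the two match strings are simply taken from the messages shown in this thread.

```python
import re
from datetime import datetime

# Header line such as "Error NODE Fri Dec 10 09:55:16 2010" (optional "UTC").
# "Warning" is an assumed severity; the pasted logs only show Info and Error.
HEADER = re.compile(
    r"^(Info|Error|Warning)\s+(\S+)\s+"
    r"(\w{3} \w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2})(?: UTC)? (\d{4})$"
)
BY_LINE = re.compile(r"^By: (.+) on Node: (\S+)$")

def parse_events(text):
    """Turn raw AAM log text into a list of event dicts."""
    events, current = [], None
    for raw in text.splitlines():
        line = raw.strip()
        header = HEADER.match(line)
        if header:
            stamp = datetime.strptime(
                f"{header.group(3)} {header.group(4)}", "%a %b %d %H:%M:%S %Y"
            )
            current = {"severity": header.group(1), "category": header.group(2),
                       "time": stamp, "source": None, "node": None, "message": None}
            events.append(current)
        elif current is not None:
            by = BY_LINE.match(line)
            if by:
                current["source"], current["node"] = by.group(1), by.group(2)
            elif line.startswith("MESSAGE:"):
                current["message"] = line[len("MESSAGE:"):].strip()
    return events

def key_events(events):
    """Keep isolation and node-failure messages, ordered by timestamp."""
    hits = [e for e in events
            if e["message"] and ("Isolated" in e["message"]
                                 or "has failed" in e["message"])]
    return sorted(hits, key=lambda e: e["time"])

if __name__ == "__main__":
    sample = """\
===================================
Info FT Fri Dec 10 09:55:12 UTC 2010
By: isolationScript on Node: ESXserver01
MESSAGE: user ESXserver01 VMware HA Agent Isolated, Notifying VPXA
===================================
Error NODE Fri Dec 10 09:55:16 2010
By: FT/Agent on Node: ESXserver01
MESSAGE: Node ESXserver02 has failed. Ping Node results: 192.168.10.3=DEAD
"""
    for event in key_events(parse_events(sample)):
        print(event["time"], event["node"], event["message"])
```

Running both hosts' logs through `parse_events`, concatenating the lists and passing the result to `key_events` gives a single ordered view. In the logs posted here, both hosts report "VMware HA Agent Isolated" at 09:55:12, which is worth noting before concluding that only ESXserver02 was affected.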