Solved: Re: HA cluster failed over an ESX (why?)

SBaldridge · ‎02-20-2008

This morning one of my ESX 3.5.0 hosts in an HA cluster (VC 2.5) failed and recovered immediately. The guests were vacated to other ESX hosts, which is good.

I'd like to learn more about why the host failed. Where is the best place to look to see why the host was considered "failed"? We noticed a very brief connectivity loss, but I'd like to be sure why VC determined that the host "went down".

Thanks!!

Scott

chukarma · ‎02-20-2008

Scott,

Check all your VMware logs. HA usually considers a host failed when its heartbeat cannot be monitored between the hosts. You should check your host's network connection during the time of the outage, specifically the SC (Service Console) connection, since the heartbeat is checked over that connection.

HTH,

Daniel
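
To make the heartbeat idea above concrete, here is a minimal sketch of timeout-based failure detection. It is a toy model, not VMware's implementation: the class name, the method names, and the 15-second window are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatMonitor:
    """Toy model of heartbeat-based failure detection (not VMware's code)."""

    timeout_s: float          # hypothetical detection window, in seconds
    last_seen_s: float = 0.0  # time the most recent heartbeat arrived

    def record_heartbeat(self, now_s: float) -> None:
        # Remember when the peer was last heard from.
        self.last_seen_s = now_s

    def is_considered_failed(self, now_s: float) -> bool:
        # The host is declared failed once no heartbeat has arrived
        # within the detection window.
        return (now_s - self.last_seen_s) > self.timeout_s


monitor = HeartbeatMonitor(timeout_s=15.0)  # 15 s is an assumed value
monitor.record_heartbeat(now_s=100.0)
print(monitor.is_considered_failed(now_s=110.0))  # False: 10 s of silence
print(monitor.is_considered_failed(now_s=120.0))  # True: 20 s of silence
```

The point of the sketch is that a host can be declared failed by a short network gap on the heartbeat path even though the host itself never stopped running.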

SBaldridge · ‎02-21-2008

Is the vmware-log found on the ESX host? I searched through my vpxd-x log (Virtual Center > Administration > System logs) but I didn't find a clear indication of what happened.

Thanks

mikepodoherty · ‎02-21-2008

On the host, look in /var/log/vmware. vmkwarning and vmkernel are both logs that offer good information and should help you track down the loss of communications.
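
One way to narrow those logs down to the outage window is to filter by timestamp prefix and a few keywords. The sketch below assumes standard syslog-style lines; the keyword list and the function name `suspicious_lines` are illustrative guesses, and the sample lines are copied from the log excerpt posted later in this thread.

```python
import re
from typing import Iterable, List

# Keywords are illustrative guesses at what a connectivity loss might log.
KEYWORDS = re.compile(r"timeout|timed out|dropped|failing", re.IGNORECASE)


def suspicious_lines(lines: Iterable[str], time_prefix: str) -> List[str]:
    """Return lines whose timestamp starts with time_prefix
    (e.g. "Feb 20 09:2") and that mention a connectivity keyword."""
    return [
        line.rstrip("\n")
        for line in lines
        if line.startswith(time_prefix) and KEYWORDS.search(line)
    ]


sample = [
    "Feb 20 09:27:47 cerberus vmkernel: 62:11:31:07.971 cpu0:1076)iSCSI: "
    "session 0xbe2c1b0 to iqn.1992-08.com.netapp:sn.101190946 dropped",
    "Feb 20 09:27:47 cerberus vmkernel: 62:11:31:07.971 cpu0:1076)iSCSI: "
    "bus 0 target 1 trying to establish session 0xbe2c1b0 to portal 0, "
    "address 192.168.61.86 port 3260 group 3",
]
for hit in suspicious_lines(sample, "Feb 20 09:2"):
    print(hit)
```

Because the function accepts any iterable of strings, an open file object for a copy of the log can be passed in place of `sample`.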

SBaldridge · ‎02-21-2008

Thanks Mike.

I found that the host 'cerberus' had an iSCSI connectivity issue with our NetApp SAN. That's pretty bad.

Question: On 'cerberus', I have my main SC on one subnet (192.168.60.x) and my iSCSI and its associated SC on another subnet (192.168.61.x). If the iSCSI subnet of 192.168.61.x has a connectivity issue, is this event likely to be severe enough for VC to remove the ESX host from the HA cluster and take action to move or shut down ALL 'cerberus' guests?

Thanks!!

Log entry:

Feb 20 09:27:47 cerberus vmkernel: 62:11:31:07.971 cpu2:1070)iSCSI: 65 second timeout expired for session 0xbe2c1b0, rx 539820301, ping 539826300, now 539826801

Feb 20 09:27:47 cerberus vmkernel: 62:11:31:07.971 cpu2:1070)iSCSI: 65 second timeout expired for session 0xbe180a0, rx 539820301, ping 539826300, now 539826801

Feb 20 09:27:47 cerberus vmkernel: 62:11:31:07.971 cpu2:1070)iSCSI: 65 second timeout expired for session 0xbe402c0, rx 539820301, ping 539826300, now 539826801

Feb 20 09:27:47 cerberus vmkernel: 62:11:31:07.971 cpu0:1076)iSCSI: session 0xbe2c1b0 to iqn.1992-08.com.netapp:sn.101190946 dropped

Feb 20 09:27:47 cerberus vmkernel: 62:11:31:07.971 cpu0:1076)iSCSI: session 0xbe2c1b0 for (1 0 1 *) rx thread 1076handled timeout, notify tx

Feb 20 09:27:47 cerberus vmkernel: 62:11:31:07.971 cpu3:1074)iSCSI: session 0xbe180a0 to iqn.1992-08.com.netapp:sn.101190790 dropped

Feb 20 09:27:47 cerberus vmkernel: 62:11:31:07.971 cpu3:1074)iSCSI: session 0xbe180a0 for (1 0 2 *) rx thread 1074handled timeout, notify tx

Feb 20 09:27:47 cerberus vmkernel: 62:11:31:07.971 cpu0:1076)iSCSI: bus 0 target 1 trying to establish session 0xbe2c1b0 to portal 0, address 192.168.61.86 port 3260 group 3

Feb 20 09:27:47 cerberus vmkernel: 62:11:31:07.971 cpu3:1074)iSCSI: bus 0 target 2 trying to establish session 0xbe180a0 to portal 0, address 192.168.61.85 port 3260 group 1

Feb 20 09:27:47 cerberus vmkernel: 62:11:31:07.971 cpu2:1078)iSCSI: session 0xbe402c0 to iqn.1992-08.com.netapp:sn.101190946 dropped

Feb 20 09:27:47 cerberus vmkernel: 62:11:31:07.971 cpu2:1078)iSCSI: session 0xbe402c0 for (1 0 3 *) rx thread 1078handled timeout, notify tx

Feb 20 09:27:47 cerberus vmkernel: 62:11:31:07.971 cpu2:1078)iSCSI: bus 0 target 3 trying to establish session 0xbe402c0 to portal 0, address 192.168.61.84 port 3260 group 1

Feb 20 09:27:48 cerberus vmkernel: 62:11:31:08.971 cpu2:1077)iSCSI: session 0xbe402c0 timed out:drop 539826801, now 539826901, failing all commands

Feb 20 09:27:48 cerberus vmkernel: 62:11:31:08.971 cpu0:1075)iSCSI: session 0xbe2c1b0 timed out:drop 539826801, now 539826901, failing all commands
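
To see which network the failing sessions were using, the reconnect lines above can be parsed and each portal address matched against the two subnets. This is a rough sketch: the /24 masks are an assumption inferred from the "x" in the last octet, and the regular expression and function names are made up for this example.

```python
import ipaddress
import re

# Assumption: both subnets are /24, inferred from "192.168.60.x" / "192.168.61.x".
SUBNETS = {
    "main SC": ipaddress.ip_network("192.168.60.0/24"),
    "iSCSI": ipaddress.ip_network("192.168.61.0/24"),
}

# Matches the "trying to establish session" lines from the vmkernel log.
RECONNECT = re.compile(
    r"target (?P<target>\d+) trying to establish session (?P<session>0x[0-9a-f]+)"
    r" to portal \d+, address (?P<address>[\d.]+) port (?P<port>\d+)"
)


def classify(address: str) -> str:
    """Name the subnet an address belongs to, or "unknown"."""
    ip = ipaddress.ip_address(address)
    for name, network in SUBNETS.items():
        if ip in network:
            return name
    return "unknown"


def reconnect_attempts(lines):
    """Yield (target, session, address, subnet name) for each reconnect line."""
    for line in lines:
        match = RECONNECT.search(line)
        if match:
            address = match.group("address")
            yield (int(match.group("target")), match.group("session"),
                   address, classify(address))


log = [
    "Feb 20 09:27:47 cerberus vmkernel: 62:11:31:07.971 cpu0:1076)iSCSI: bus 0 target 1 trying to establish session 0xbe2c1b0 to portal 0, address 192.168.61.86 port 3260 group 3",
    "Feb 20 09:27:47 cerberus vmkernel: 62:11:31:07.971 cpu3:1074)iSCSI: bus 0 target 2 trying to establish session 0xbe180a0 to portal 0, address 192.168.61.85 port 3260 group 1",
    "Feb 20 09:27:47 cerberus vmkernel: 62:11:31:07.971 cpu2:1078)iSCSI: bus 0 target 3 trying to establish session 0xbe402c0 to portal 0, address 192.168.61.84 port 3260 group 1",
]
for attempt in reconnect_attempts(log):
    print(attempt)
```

All three reconnect attempts fall in the iSCSI subnet, so this excerpt documents a storage-path failure. It does not by itself show whether the main SC connection on 192.168.60.x was affected at the same time.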

SBaldridge · ‎02-25-2008

Any thoughts on the above question?
