VMware Cloud Community
SBaldridge
Contributor

HA cluster failed over an ESX (why?)

This morning one of my ESX 3.5.0 hosts in an HA cluster (VC 2.5) failed and recovered immediately. The guests were moved to other ESX hosts, which is good.

I'd like to learn more about why the host failed. Where is the best place to look to see why the host was considered "failed"? We noticed a very brief connectivity loss, but I'd like to be sure why VC determined the host "went down".

Thanks!!

Scott

chukarma
Enthusiast

Scott,

Check all your VMware logs. HA usually considers a host failed when the heartbeat can no longer be monitored between the hosts. You should check the host's network connection at the time of the outage, specifically the Service Console (SC) connection, since the heartbeat is checked over that connection.

HTH,

Daniel
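The heartbeat idea described above can be sketched in a few lines. This is only an illustration of a timeout-based failure check, not how HA is actually implemented: the function name `find_heartbeat_gaps` is hypothetical, and the 15-second threshold is an assumed value used for the example, so check the failure detection setting of your own cluster.

```python
def find_heartbeat_gaps(heartbeat_times, threshold_seconds=15.0):
    """Return (last_seen, next_seen, gap) for every silence longer than the threshold.

    heartbeat_times: seconds at which a heartbeat was received (hypothetical input).
    threshold_seconds: assumed failure detection time, not read from any real config.
    """
    ordered = sorted(heartbeat_times)
    gaps = []
    # Compare each heartbeat with the one that follows it.
    for previous, current in zip(ordered, ordered[1:]):
        gap = current - previous
        if gap > threshold_seconds:
            gaps.append((previous, current, gap))
    return gaps


# Heartbeats arrive every second, then nothing is seen for 20 seconds.
sample = [0, 1, 2, 3, 23, 24, 25]
for last_seen, next_seen, gap in find_heartbeat_gaps(sample):
    print(f"no heartbeat between t={last_seen}s and t={next_seen}s ({gap}s)")
```

A brief network interruption on the Service Console that lasts longer than the threshold would show up as one such gap, even if the host itself never stopped running.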

SBaldridge
Contributor

Is the vmware-log found on the ESX host? I searched through my vpxd-x log (Virtual Center > Administration > System Logs) but I didn't find a clear indication of what happened.

Thanks

mikepodoherty
Expert (accepted solution)

On the host, look in /var/log/vmware. vmkwarning and vmkernel are both logs that offer good information and should help you track down the loss of communications.
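When the approximate time of the outage is known, it can help to pull out only the log lines around that moment. The following is a minimal sketch, assuming the log uses the usual syslog "Mon DD HH:MM:SS" prefix; the helper names `parse_syslog_time` and `lines_near` are hypothetical, and because syslog stamps carry no year, the year passed in is a placeholder.

```python
from datetime import datetime


def parse_syslog_time(line, year):
    """Parse the leading 'Mon DD HH:MM:SS' stamp; the year must be supplied."""
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def lines_near(lines, center, window_seconds=120, keywords=()):
    """Keep lines stamped within window_seconds of center, optionally by keyword."""
    selected = []
    for line in lines:
        try:
            stamp = parse_syslog_time(line, center.year)
        except ValueError:
            continue  # Not a syslog-style line; skip it.
        if abs((stamp - center).total_seconds()) > window_seconds:
            continue
        if keywords and not any(k.lower() in line.lower() for k in keywords):
            continue
        selected.append(line.rstrip("\n"))
    return selected


# Two lines in the format of a vmkernel log; the year 2000 is a placeholder.
log = [
    "Feb 20 09:27:47 cerberus vmkernel: 62:11:31:07.971 cpu0:1076)iSCSI: "
    "session 0xbe2c1b0 to iqn.1992-08.com.netapp:sn.101190946 dropped",
    "Feb 20 09:27:48 cerberus vmkernel: 62:11:31:08.971 cpu0:1075)iSCSI: "
    "session 0xbe2c1b0 timed out:drop 539826801, now 539826901, failing all commands",
]
outage = datetime(2000, 2, 20, 9, 27, 48)
for hit in lines_near(log, outage, window_seconds=0, keywords=("iscsi",)):
    print(hit)
```

To run it against a real file, read the lines with `open(path)` and pass them in; the path depends on where the log lives on your host.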

SBaldridge
Contributor

Thanks Mike.

I found that the host 'cerberus' had an iSCSI connectivity issue with our NetApp SAN. That's pretty bad :)

Question: On 'cerberus', I have my main SC on one subnet (192.168.60.x) and my iSCSI and its associated SC on another subnet (192.168.61.x). If the iSCSI subnet (192.168.61.x) has a connectivity issue, is this event likely to be severe enough for VC to remove the ESX host from the HA cluster and take action to move or shut down ALL 'cerberus' guests?

Thanks!!

Log entry:

Feb 20 09:27:47 cerberus vmkernel: 62:11:31:07.971 cpu2:1070)iSCSI: 65 second timeout expired for session 0xbe2c1b0, rx 539820301, ping 539826300, now 539826801

Feb 20 09:27:47 cerberus vmkernel: 62:11:31:07.971 cpu2:1070)iSCSI: 65 second timeout expired for session 0xbe180a0, rx 539820301, ping 539826300, now 539826801

Feb 20 09:27:47 cerberus vmkernel: 62:11:31:07.971 cpu2:1070)iSCSI: 65 second timeout expired for session 0xbe402c0, rx 539820301, ping 539826300, now 539826801

Feb 20 09:27:47 cerberus vmkernel: 62:11:31:07.971 cpu0:1076)iSCSI: session 0xbe2c1b0 to iqn.1992-08.com.netapp:sn.101190946 dropped

Feb 20 09:27:47 cerberus vmkernel: 62:11:31:07.971 cpu0:1076)iSCSI: session 0xbe2c1b0 for (1 0 1 *) rx thread 1076handled timeout, notify tx

Feb 20 09:27:47 cerberus vmkernel: 62:11:31:07.971 cpu3:1074)iSCSI: session 0xbe180a0 to iqn.1992-08.com.netapp:sn.101190790 dropped

Feb 20 09:27:47 cerberus vmkernel: 62:11:31:07.971 cpu3:1074)iSCSI: session 0xbe180a0 for (1 0 2 *) rx thread 1074handled timeout, notify tx

Feb 20 09:27:47 cerberus vmkernel: 62:11:31:07.971 cpu0:1076)iSCSI: bus 0 target 1 trying to establish session 0xbe2c1b0 to portal 0, address 192.168.61.86 port 3260 group 3

Feb 20 09:27:47 cerberus vmkernel: 62:11:31:07.971 cpu3:1074)iSCSI: bus 0 target 2 trying to establish session 0xbe180a0 to portal 0, address 192.168.61.85 port 3260 group 1

Feb 20 09:27:47 cerberus vmkernel: 62:11:31:07.971 cpu2:1078)iSCSI: session 0xbe402c0 to iqn.1992-08.com.netapp:sn.101190946 dropped

Feb 20 09:27:47 cerberus vmkernel: 62:11:31:07.971 cpu2:1078)iSCSI: session 0xbe402c0 for (1 0 3 *) rx thread 1078handled timeout, notify tx

Feb 20 09:27:47 cerberus vmkernel: 62:11:31:07.971 cpu2:1078)iSCSI: bus 0 target 3 trying to establish session 0xbe402c0 to portal 0, address 192.168.61.84 port 3260 group 1

Feb 20 09:27:48 cerberus vmkernel: 62:11:31:08.971 cpu2:1077)iSCSI: session 0xbe402c0 timed out:drop 539826801, now 539826901, failing all commands

Feb 20 09:27:48 cerberus vmkernel: 62:11:31:08.971 cpu0:1075)iSCSI: session 0xbe2c1b0 timed out:drop 539826801, now 539826901, failing all commands
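The entries above can be summarized mechanically. The sketch below is an illustration only: the regular expressions are written against the exact wording of these lines, and the function name `summarize` is hypothetical. One thing the numbers suggest: `now - rx` is 6500 for a 65 second timeout, so the counters appear to advance 100 times per second. That is an inference from this log, not a documented fact. All three sessions also report the same `rx` value, meaning they stopped receiving at the same moment.

```python
import re

# Patterns matched against the wording of the log lines quoted in this thread.
TIMEOUT = re.compile(
    r"iSCSI: (\d+) second timeout expired for session (0x[0-9a-f]+), "
    r"rx (\d+), ping (\d+), now (\d+)"
)
DROPPED = re.compile(r"iSCSI: session (0x[0-9a-f]+) to (\S+) dropped")


def summarize(lines):
    """Collect per-session timeout counters and the target each dropped session used."""
    timeouts, dropped = {}, {}
    for line in lines:
        match = TIMEOUT.search(line)
        if match:
            seconds, session, rx, ping, now = match.groups()
            timeouts[session] = {
                "timeout_s": int(seconds),
                "ticks_since_rx": int(now) - int(rx),
                "ticks_since_ping": int(now) - int(ping),
            }
        match = DROPPED.search(line)
        if match:
            dropped[match.group(1)] = match.group(2)
    return timeouts, dropped


log = [
    "Feb 20 09:27:47 cerberus vmkernel: 62:11:31:07.971 cpu2:1070)iSCSI: 65 second "
    "timeout expired for session 0xbe2c1b0, rx 539820301, ping 539826300, now 539826801",
    "Feb 20 09:27:47 cerberus vmkernel: 62:11:31:07.971 cpu0:1076)iSCSI: session "
    "0xbe2c1b0 to iqn.1992-08.com.netapp:sn.101190946 dropped",
]
timeouts, dropped = summarize(log)
info = timeouts["0xbe2c1b0"]
print(info["ticks_since_rx"])                      # 6500
print(info["ticks_since_rx"] / info["timeout_s"])  # 100.0
print(dropped["0xbe2c1b0"])
```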

SBaldridge
Contributor

Any thoughts on the above question?
