VMware Cloud Community
nelo_1990
Contributor

Re: Hosts lost connection from iSCSI SAN storage

Hi all, I'm running into a strange issue: my hosts keep losing their connection to the iSCSI SAN storage. I really need help with this. The setup is as follows.

ESXi 5.0

2 hosts: HP ProLiant DL380 G7

iSCSI SAN storage: Netgear ReadyNAS 3100

Four virtual machines are installed separately on the hosts: a database server, an application server, a web server, and a vCenter server.

When the issue occurs, vmhba35 is down on both hosts and the hosts lose their connection to the iSCSI SAN storage. In "Recent Tasks", the virtual machines keep resetting by themselves. vSphere shows the virtual machines as running, but I cannot access them by Remote Desktop or console. After I reboot the hosts, the system runs again. I wonder whether this is a VMware issue. Please help! Thanks a lot!

30 Replies
nielse
Expert

Have you checked the ESXi host logs? Could you attach any logs if possible?

@nielsengelen - http://foonet.be - VCP4/5
nelo_1990
Contributor

Yes, here are the event logs; see whether they are useful. Thanks!

nielse
Expert

Could you provide us the host logs (File -> Export -> Export System Logs)? These will provide more information :)

@nielsengelen - http://foonet.be - VCP4/5
Gkeerthy
Expert

From the screenshots I can see there are a lot of latency errors.

- What is your multipathing policy?

- What is your teaming policy?

- Is any EtherChannel configured?

- How many NICs are there for iSCSI?

- What are the MTU settings?

Also check whether there are any APD/PDL conditions in the environment; refer to the links below.

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=103098...

http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&docType=kc&docTypeID=DT_KB_1_1&e...

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=200468...

http://blogs.vmware.com/vsphere/2011/08/all-path-down-apd-handling-in-50.html

Please don't forget to award point for 'Correct' or 'Helpful', if you found the comment useful. (vExpert, VCP-Cloud. VCAP5-DCD, VCP4, VCP5, MCSE, MCITP)
nelo_1990
Contributor

The system log files are attached. Thanks!

nelo_1990
Contributor

The teaming policy is not enabled. Only one Broadcom iSCSI adapter, vmhba35, is used for iSCSI. The MTU is set to 1500. From vmkernel.log, it looks like there is an APL condition in the system.

gman18480
Enthusiast

I see that your iSCSI interface is on the same subnet as your management network and vMotion network. I would highly recommend that you at least separate iSCSI and vMotion traffic into separate VLANs, or ideally put iSCSI on its own network or its own dedicated switches.

Garret DeWulf Professional Services / VMware Consultant / VCP 4&5 / www.veristor.com
nelo_1990
Contributor

I think the management network and the iSCSI interface are on different subnets.

gman18480
Enthusiast

My mistake, nelo. I missed your subnet mask and now see it is a different subnet. How about running vmkping "your iSCSI target" from the console? Are you able to at least get connectivity? vmkping uses your VMkernel interface to ping your array.

Garret DeWulf Professional Services / VMware Consultant / VCP 4&5 / www.veristor.com
nielse
Expert

Could you provide us some more information on the network?

I can see in the logs that you have these errors:

2013-01-28T05:15:24Z iscsid: Login Failed: iqn.2012-11.SAN01:san if=bnx2i-441ea1380118@vmk1 addr=10.121.253.70:3260 (TPGT:1 ISID:0x2) Reason: 00040000 (Initiator Connection Failure)
2013-01-28T05:15:24Z iscsid: Notice: Reclaimed Channel (H35 T0 C1 oid=1)
2013-01-28T05:15:25Z iscsid: DISCOVERY: Pending=1 Failed=1
2013-01-28T05:15:26Z iscsid: DISCOVERY: Pending=1 Failed=1
2013-01-28T05:15:28Z iscsid: Login Failed: iqn.2012-11.SAN01:san if=bnx2i-441ea1380118@vmk1 addr=10.121.253.66:3260 (TPGT:1 ISID:0x1) Reason: 00080000 (Initiator Connection Failure)
2013-01-28T05:15:28Z iscsid: Notice: Reclaimed Channel (H35 T0 C0 oid=1)
2013-01-28T05:15:28Z iscsid: Notice: Reclaimed Target (H35 T0 oid=1)

According to http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=201217... this means there is a network timeout or problem.

Maybe you have duplicate IPs or a wrong subnet...
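
If the exported logs are long, a small script can pull out just the login failures. This is only a rough Python sketch: the regular expression is written against the two "Login Failed" lines quoted above, and the names parse_login_failures and SAMPLE are made up for illustration. Other iscsid message layouts are not handled.

```python
import re
from collections import Counter

# Pattern based only on the "Login Failed" lines quoted in this thread
# (an assumption: other iscsid message formats may differ).
LOGIN_FAILED = re.compile(
    r"^(?P<time>\S+) iscsid: Login Failed: (?P<iqn>\S+) "
    r"if=(?P<iface>\S+) addr=(?P<addr>[\d.]+):(?P<port>\d+) "
    r".*Reason: (?P<reason>[0-9A-Fa-f]+) \((?P<text>[^)]*)\)"
)


def parse_login_failures(lines):
    """Return one dict per 'Login Failed' line; all other lines are skipped."""
    failures = []
    for line in lines:
        match = LOGIN_FAILED.match(line.strip())
        if match:
            failures.append(match.groupdict())
    return failures


# The log excerpt from this post, copied verbatim.
SAMPLE = """\
2013-01-28T05:15:24Z iscsid: Login Failed: iqn.2012-11.SAN01:san if=bnx2i-441ea1380118@vmk1 addr=10.121.253.70:3260 (TPGT:1 ISID:0x2) Reason: 00040000 (Initiator Connection Failure)
2013-01-28T05:15:24Z iscsid: Notice: Reclaimed Channel (H35 T0 C1 oid=1)
2013-01-28T05:15:25Z iscsid: DISCOVERY: Pending=1 Failed=1
2013-01-28T05:15:26Z iscsid: DISCOVERY: Pending=1 Failed=1
2013-01-28T05:15:28Z iscsid: Login Failed: iqn.2012-11.SAN01:san if=bnx2i-441ea1380118@vmk1 addr=10.121.253.66:3260 (TPGT:1 ISID:0x1) Reason: 00080000 (Initiator Connection Failure)
2013-01-28T05:15:28Z iscsid: Notice: Reclaimed Channel (H35 T0 C0 oid=1)
2013-01-28T05:15:28Z iscsid: Notice: Reclaimed Target (H35 T0 oid=1)
"""

failures = parse_login_failures(SAMPLE.splitlines())
# Count failures per target address to see whether one SAN port stands out.
failures_per_target = Counter(f["addr"] for f in failures)
for addr, count in sorted(failures_per_target.items()):
    print(addr, count)
```

Run over the full log instead of SAMPLE, the per-target counts would show whether both SAN ports fail equally or one is worse than the other.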

@nielsengelen - http://foonet.be - VCP4/5
nelo_1990
Contributor

The SAN has two Ethernet ports, configured as 10.121.253.66 and 10.121.253.70. The iSCSI ports of the hosts are configured as 10.121.253.67 and 10.121.253.68, respectively. The SAN is connected to a switch, which is connected to the two hosts. I don't think there are duplicate IPs or a wrong subnet. I just wonder why the latency is so high on both hosts when they access data on the SAN. I think that is what causes the iSCSI adapters to go down and the hosts to lose their connection to the SAN. Is the problem in the SAN or somewhere else? And is there any way to reduce the latency? Thanks a lot!
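
For what it's worth, the duplicate-IP and subnet question can be checked mechanically. Below is a minimal Python sketch using the four addresses listed above. The prefix lengths are assumptions, since the actual subnet mask is not written out in this post; substitute the real one. The function names are made up for illustration.

```python
import ipaddress
from collections import defaultdict

# Addresses as described in this post.
ADDRESSES = [
    "10.121.253.66",  # SAN port
    "10.121.253.70",  # SAN port
    "10.121.253.67",  # host iSCSI port
    "10.121.253.68",  # host iSCSI port
]

# Hypothetical prefix lengths, for illustration only; the real mask
# is not stated in the text of the thread.
ASSUMED_PREFIXES = (24, 30)


def find_duplicates(addresses):
    """Return a sorted list of addresses that appear more than once."""
    seen, dupes = set(), set()
    for addr in addresses:
        if addr in seen:
            dupes.add(addr)
        seen.add(addr)
    return sorted(dupes)


def group_by_network(addresses, prefix_len):
    """Group addresses by the network they fall into under prefix_len."""
    groups = defaultdict(list)
    for addr in addresses:
        iface = ipaddress.ip_interface(f"{addr}/{prefix_len}")
        groups[str(iface.network)].append(addr)
    return dict(groups)


print("duplicates:", find_duplicates(ADDRESSES))
for prefix in ASSUMED_PREFIXES:
    print(f"/{prefix}:", group_by_network(ADDRESSES, prefix))
```

If every address lands in a single group under the real mask, the initiators and targets share a subnet. More than one group would mean some initiator and target pairs are not on the same network.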

nelo_1990
Contributor

I tried running vmkping against 10.121.253.66 and 10.121.253.70 (the SAN) from the host, but I could not get connectivity.

gman18480
Enthusiast

Are the iSCSI NICs and the SAN on their own dedicated switch, or sharing one with VM data, management, and vMotion? If so, what model is your switch?

Garret DeWulf Professional Services / VMware Consultant / VCP 4&5 / www.veristor.com
gman18480
Enthusiast

That usually means the SAN is not accessible on your network. Are you able to ping the SAN IP from your workstation?
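
If ping is filtered, a TCP connection test against the iSCSI port is another option from the workstation. Here is a small Python sketch, assuming port 3260 as shown in the iscsid log lines earlier in the thread. tcp_port_open is a made-up helper name. This is not a replacement for vmkping, because it does not go through the host's VMkernel interface.

```python
import socket

ISCSI_PORT = 3260  # port shown in the iscsid log lines earlier in the thread


def tcp_port_open(host, port=ISCSI_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False


def report(targets, port=ISCSI_PORT):
    """Print one reachable/unreachable line per target address."""
    for target in targets:
        state = "reachable" if tcp_port_open(target, port) else "unreachable"
        print(f"{target}:{port} {state}")
```

For the addresses in this thread you would call report(["10.121.253.66", "10.121.253.70"]) from a machine that has a route to the iSCSI network.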

Garret DeWulf Professional Services / VMware Consultant / VCP 4&5 / www.veristor.com
nelo_1990
Contributor

Now usually only one of the hosts is down, and the other host can still access the SAN via the switch.

nelo_1990
Contributor

Actually, I thought the problem was the switch (a Cisco Gigabit 8-port unmanaged switch), but after I set up a VLAN on another switch, the issue still exists. So I don't think the switch is the problem. My system diagram is attached.

gman18480
Enthusiast

I'm not familiar with ReadyNAS. However, I do know that on some iSCSI arrays you have to enable multiple-initiator access to the volume so that more than one initiator can connect at the same time. Ideally, best practice is to have a separate physical switch dedicated to iSCSI. However, if that is not possible, you can fall back to separate VLANs. Are you running only iSCSI on that 8-port switch?

Garret DeWulf Professional Services / VMware Consultant / VCP 4&5 / www.veristor.com
nelo_1990
Contributor

Yes, that 8-port switch is used only for iSCSI.

gman18480
Enthusiast

Are 253.66 and 253.70 both failing to respond to vmkping from both hosts, or from just one of the two? Also, what native multipathing policy are you currently using?

Garret DeWulf Professional Services / VMware Consultant / VCP 4&5 / www.veristor.com