GeneNZ
Enthusiast

Intermittent Dropouts to iSCSI Storage

Hi there,

I have a Dell PowerEdge 2950 (running ESX4 Classic with all the latest patches) connected to an MD3000i. The host intermittently drops its connection to the storage, then regains it a moment later. I know this because it sends out alarms that it has lost connection, but when I check vCenter there is no problem other than the errors in the event log.

Further investigation in /var/log shows the following:

vmkiscsid.log

2009-08-26-13:03:02: iscsid: Nop-out timedout after 10 seconds on connection 3:0 state (3). Dropping session.

2009-08-26-13:03:06: iscsid: connection3:0 is operational after recovery (2 attempts)

vmkernel:

Aug 26 13:03:02 esx-beta vmkernel: 1:21:23:40.354 cpu3:4238)WARNING: iscsi_vmk: iscsivmk_StopConnection: vmhba33:CH:4 T:0 CN:0: iSCSI connection is being marked "OFFLINE"

Aug 26 13:03:02 esx-beta vmkernel: 1:21:23:40.354 cpu3:4238)WARNING: iscsi_vmk: iscsivmk_StopConnection: Sess

Aug 26 13:03:02 esx-beta vmkernel: 1:21:23:40.354 cpu3:4238)WARNING: iscsi_vmk: iscsivmk_StopConnection: Conn

Aug 26 13:03:05 esx-beta vmkernel: 1:21:23:42.748 cpu2:4098)NMP: nmp_CompleteCommandForPath: Command 0x12 (0x410002019e40) to NMP device "mpx.vmhba1:C0:T0:L0" failed on physical path "vmhba1:C0:T0:L0" H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.

Aug 26 13:03:05 esx-beta vmkernel: 1:21:23:42.748 cpu2:4098)ScsiDeviceIO: 747: Command 0x12 to device "mpx.vmhba1:C0:T0:L0" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.

Aug 26 13:03:06 esx-beta vmkernel: 1:21:23:43.712 cpu3:4238)WARNING: iscsi_vmk: iscsivmk_StartConnection: vmhba33:CH:4 T:0 CN:0: iSCSI connection is being marked "ONLINE"

Aug 26 13:03:06 esx-beta vmkernel: 1:21:23:43.712 cpu3:4238)WARNING: iscsi_vmk: iscsivmk_StartConnection: Sess

Aug 26 13:03:06 esx-beta vmkernel: 1:21:23:43.712 cpu3:4238)WARNING: iscsi_vmk: iscsivmk_StartConnection: Conn

Aug 26 13:03:08 esx-beta vmkernel: 1:21:23:45.920 cpu3:4099)NMP: nmp_CompleteCommandForPath: Command 0x12 (0x41000205fe80) to NMP device "mpx.vmhba1:C0:T0:L0" failed on physical path "vmhba1:C0:T0:L0" H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.

Aug 26 13:03:08 esx-beta vmkernel: 1:21:23:45.920 cpu3:4099)ScsiDeviceIO: 747: Command 0x12 to device "mpx.vmhba1:C0:T0:L0" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
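
For anyone reading the excerpt above: the failed-command lines can be decoded with the standard SCSI code tables. The following is a minimal sketch, not a VMware tool. The function name `decode_scsi_failure` and the lookup dictionaries are made up for illustration, and the tables only cover the values that appear in this log.

```python
import re

# Partial lookup tables from the SCSI standards; only the values that
# appear in the log excerpt are listed (assumption: nothing else is needed).
OPCODES = {0x12: "INQUIRY"}
DEVICE_STATUS = {0x0: "GOOD", 0x2: "CHECK CONDITION"}
SENSE_KEYS = {0x5: "ILLEGAL REQUEST"}
ASC_ASCQ = {(0x24, 0x00): "INVALID FIELD IN CDB"}

# Matches both the NMP and the ScsiDeviceIO variants of the failure line.
LINE_RE = re.compile(
    r'Command (0x[0-9a-fA-F]+).*?device "([^"]+)".*?'
    r'H:(0x[0-9a-fA-F]+) D:(0x[0-9a-fA-F]+) P:(0x[0-9a-fA-F]+) '
    r'Valid sense data: (0x[0-9a-fA-F]+) (0x[0-9a-fA-F]+) (0x[0-9a-fA-F]+)'
)


def decode_scsi_failure(line):
    """Return a dict describing a failed-command log line, or None."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    opcode, host, dev, plugin, key, asc, ascq = (
        int(m.group(i), 16) for i in (1, 3, 4, 5, 6, 7, 8)
    )
    return {
        "device": m.group(2),
        "command": OPCODES.get(opcode, hex(opcode)),
        "host_status": host,
        "device_status": DEVICE_STATUS.get(dev, hex(dev)),
        "plugin_status": plugin,
        "sense_key": SENSE_KEYS.get(key, hex(key)),
        "additional_sense": ASC_ASCQ.get((asc, ascq), f"{asc:#x}/{ascq:#x}"),
    }


SAMPLE = (
    'Aug 26 13:03:05 esx-beta vmkernel: 1:21:23:42.748 cpu2:4098)NMP: '
    'nmp_CompleteCommandForPath: Command 0x12 (0x410002019e40) to NMP device '
    '"mpx.vmhba1:C0:T0:L0" failed on physical path "vmhba1:C0:T0:L0" '
    'H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.'
)

info = decode_scsi_failure(SAMPLE)
for field, value in info.items():
    print(f"{field}: {value}")
```

Read this way, the line describes an INQUIRY that a device on vmhba1 rejected with ILLEGAL REQUEST (invalid field in CDB). That is a different adapter from the iSCSI one (vmhba33), so these two messages may well be unrelated to the dropout itself.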

vmkwarning:

Aug 26 13:03:02 esx-beta vmkernel: 1:21:23:40.354 cpu3:4238)WARNING: iscsi_vmk: iscsivmk_StopConnection: vmhba33:CH:4 T:0 CN:0: iSCSI connection is being marked "OFFLINE"

Aug 26 13:03:02 esx-beta vmkernel: 1:21:23:40.354 cpu3:4238)WARNING: iscsi_vmk: iscsivmk_StopConnection: Sess

Aug 26 13:03:02 esx-beta vmkernel: 1:21:23:40.354 cpu3:4238)WARNING: iscsi_vmk: iscsivmk_StopConnection: Conn

Aug 26 13:03:06 esx-beta vmkernel: 1:21:23:43.712 cpu3:4238)WARNING: iscsi_vmk: iscsivmk_StartConnection: vmhba33:CH:4 T:0 CN:0: iSCSI connection is being marked "ONLINE"

Aug 26 13:03:06 esx-beta vmkernel: 1:21:23:43.712 cpu3:4238)WARNING: iscsi_vmk: iscsivmk_StartConnection: Sess

Aug 26 13:03:06 esx-beta vmkernel: 1:21:23:43.712 cpu3:4238)WARNING: iscsi_vmk: iscsivmk_StartConnection: Conn

Effectively, from what I can gather, it loses its connection and regains it 4 seconds later. The problem is I don't know why it's doing this. Is there any way I can get further information about these dropouts? I don't think it's the MD3000i, since we have two other PE 2950s configured identically to the problem PE 2950 and connected to the same MD3000i, and they show absolutely no faults at the times this machine reports an error. My configuration is identical to what is shown here: and I have jumbo frames enabled.
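
One way to get a bit more out of the existing logs is to pair up the OFFLINE and ONLINE messages and measure each outage. This is only a rough sketch under a few assumptions: the line format is exactly as in the excerpts above, the month names are in English, and the year (2009, taken from the vmkiscsid.log timestamps) has to be supplied because syslog lines don't carry one. The function name `dropout_durations` is made up for illustration.

```python
import re
from datetime import datetime

# Captures the syslog timestamp, the connection name and the new state.
EVENT_RE = re.compile(
    r'^(\w{3} +\d+ \d\d:\d\d:\d\d) .*?'
    r'(vmhba\d+:CH:\d+ T:\d+ CN:\d+): iSCSI connection is being marked '
    r'"(OFFLINE|ONLINE)"'
)


def dropout_durations(lines, year=2009):
    """Return (connection, offline_time, seconds_offline) per outage."""
    went_offline = {}  # connection name -> time it was marked OFFLINE
    outages = []
    for line in lines:
        m = EVENT_RE.match(line)
        if m is None:
            continue  # unrelated or truncated line
        stamp, conn, state = m.groups()
        when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        if state == "OFFLINE":
            went_offline[conn] = when
        elif conn in went_offline:
            start = went_offline.pop(conn)
            outages.append((conn, start, (when - start).total_seconds()))
    return outages


SAMPLE_LINES = [
    'Aug 26 13:03:02 esx-beta vmkernel: 1:21:23:40.354 cpu3:4238)WARNING: '
    'iscsi_vmk: iscsivmk_StopConnection: vmhba33:CH:4 T:0 CN:0: '
    'iSCSI connection is being marked "OFFLINE"',
    'Aug 26 13:03:02 esx-beta vmkernel: 1:21:23:40.354 cpu3:4238)WARNING: '
    'iscsi_vmk: iscsivmk_StopConnection: Sess',
    'Aug 26 13:03:06 esx-beta vmkernel: 1:21:23:43.712 cpu3:4238)WARNING: '
    'iscsi_vmk: iscsivmk_StartConnection: vmhba33:CH:4 T:0 CN:0: '
    'iSCSI connection is being marked "ONLINE"',
]

outages = dropout_durations(SAMPLE_LINES)
for conn, start, seconds in outages:
    print(f"{conn} offline at {start} for {seconds:.0f} s")
```

Run over the whole vmkernel log in /var/log, something like this would show how often the drops occur and whether they cluster at particular times of day, which could then be compared against load on the host or the network.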

Thanks in advance for your help.

Gene

8 Replies
neurosis89
Contributor

Hi,

I had the same problem last Thursday. It is resolved now, but I don't know why it occurred.

I use an HP DL380 G6 server with a Dell EqualLogic PS6000. I configured 2 pNICs for the iSCSI traffic, round robin with 6 VMkernel ports, and jumbo frames are enabled. I followed the best practices for the configuration, and I have 2 other ESX4 servers with the same config that work fine.

When I had the problem I lost 4 of the 6 connections, and there were errors and warnings on both the EqualLogic and the VMware side. I put the host in maintenance mode, removed the SAN VMkernel ports and recreated them.

It works fine for the moment.

Thanks for any reply.

actixsupport
Contributor

Hi guys,

Did you find a solution to this? I'm seeing the same thing with vSphere on 2 x HP blades and 3 x EqualLogic boxes. I've set them up with multiple VMkernel connections as per the EQL doc.

Cheers

Ray

GeneNZ
Enthusiast

In the end we solved the issue by replacing the switches entirely.

We were using very low-end network switches (D-Link switches - see above). We purchased new Cisco C3560s as replacements, and so far the problem hasn't recurred. I suspect that as we started to use the virtual infrastructure more heavily, the switches reached their tipping point, particularly in terms of memory usage. What I think happened is that the D-Link switch couldn't cope with the amount of traffic and would drop the interface momentarily (for about 5 seconds), resulting in vCenter firing a datastore dropout alarm and the host servers failing over to their secondary interface. The D-Link switches have about 512 KB of buffer memory versus 128 MB on the C3560s.

I can't prove the above, since the D-Link switches are unmanaged, but I can say that the new Cisco switches still only average 1% bandwidth usage and about 20% memory usage.
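
To put the buffer sizes in perspective, here is a back-of-the-envelope calculation of how long each buffer could absorb a burst. It is only a sketch under assumptions that are not from our measurements: 1 Gb/s ports, the whole buffer available to one congested port, and traffic arriving 1 Gb/s faster than the port can send it (for example two senders at line rate into one receiver). The function name `seconds_to_fill` is made up for illustration.

```python
LINK_BPS = 1_000_000_000  # assumption: 1 Gb/s ports


def seconds_to_fill(buffer_bytes, excess_bps=LINK_BPS):
    """Time for a buffer to fill when traffic arrives `excess_bps`
    bits per second faster than the egress port can drain it."""
    return buffer_bytes * 8 / excess_bps


small = seconds_to_fill(512 * 1024)          # about 512 KB of buffer
large = seconds_to_fill(128 * 1024 * 1024)   # about 128 MB of buffer

print(f"512 KB buffer absorbs the burst for {small * 1000:.1f} ms")
print(f"128 MB buffer absorbs the burst for {large * 1000:.1f} ms")
print(f"ratio: {large / small:.0f}x")
```

Under those assumptions the small buffer is full after a few milliseconds, while the large one lasts on the order of a second, which fits the idea that short bursts of storage traffic could overwhelm the old switches even at low average utilisation.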

It's one thing we overlooked when we initially set up our infrastructure. We had already spent so much on servers, licenses and SANs that we didn't really consider the switches. In retrospect that was stupid, given they are just as important as the other components. One thing I have noted from my research is that the C3560 switches aren't even considered 'good enough' for an enterprise VMware solution. Our virtualisation infrastructure is small, so it's not a big deal, but I know the Cisco C3750s and above were considered the minimum (in a perfect world).

Andy_Banta
Hot Shot

Ray, Folks,

There was a networking issue affecting iSCSI this way that was recently fixed and should be available as a patch soon. If you're working with VMware support, mention that these might be symptoms of PR #484220.

Enjoy,

Andy

Edificom
Contributor

There is also another thread following this, with a suggested workaround:

http://communities.vmware.com/message/1460882#1460882

s1xth
VMware Employee

Andy... Do you have any more information as to when this patch is going to be available? We have been hearing about a patch for 3+ months now...

All others: please read the above thread for the workaround configuration.

http://www.virtualizationimpact.com http://www.handsonvirtualization.com Twitter: @jfranconi
J1mbo
Virtuoso

Are the ESX hosts configured for queue depth throttling?

Please award points to any useful answer.

sunshine
Contributor

I know this post is aging a little, but I'm having similar issues and we plan to upgrade our switches. We are currently using 2 Cisco 2650G switches with 4 GB uplinks between the switches. Does anyone know of a guide that gives the minimum requirements for switches used for iSCSI?

Thanks,

Sunshine Baines
