mcwill
Expert

ESX4 swiscsi MPIO to Equallogic dropping

We've updated to ESX4 and have implemented round-robin MPIO to our EQL boxes (we didn't use round robin under 3.5). However, I'm seeing 3 - 4 entries per day in the EQL log that indicate a dropped connection. See the logs below for the EQL and vCenter views of the event.

EQL Log Entry

INFO 10/06/09 23:50:32 EQL-Array-1

iSCSI session to target '192.168.2.240:3260, iqn.2001-05.com.equallogic:0-8a0906-bc6459001-cf60002a3a648493-vm-exchange' from initiator '192.168.2.111:58281, iqn.1998-01.com.vmware:esxborga-2b57cd4e' was closed.

iSCSI initiator connection failure.

Connection was closed by peer.

vCenter Event

Lost path redundancy to storage device naa.6090a018005964bc9384643a2a0060cf.

Path vmhba34:C1:T3:L0 is down. Affected datastores: "VM_Exchange".

warning

6/10/2009 11:54:47 PM

I'm aware that the EQL box will shuffle connections from time to time, but those appear in the logs as follows (although vCenter will still display a "Lost path redundancy" event):

INFO 10/06/09 23:54:47 EQL-Array-1

iSCSI session to target '192.168.2.245:3260, iqn.2001-05.com.equallogic:0-8a0906-bc6459001-cf60002a3a648493-vm-exchange' from initiator '192.168.2.126:59880, iqn.1998-01.com.vmware:esxborgb-6d1c1540' was closed.

Load balancing request was received on the array.

Should we be concerned, or is it now normal operation for the ESX iSCSI initiator to drop and re-establish connections?

179 Replies
paithal
VMware Employee

If the initiator gets the load balancing event (i.e. an async logout request) from the array, then the initiator has to honor it by dropping and re-establishing the connection. If the connection drop is not due to an async logout event, then it is a problem.

mcwill
Expert

Thanks for the response, I have reverted from round robin to fixed and will monitor to see if that solves the problem.

I understand EqualLogic are developing their own MPIO module for vSphere, so if the above works I will probably wait for that to be released.
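For anyone making the same change from the ESX4 command line, a rough sketch is below. The naa device ID is the one from the vCenter event earlier in the thread; substitute your own volume's ID and double-check the esxcli syntax on your build before running it.

```shell
# Show the current path selection policy for the EQL volume
# (naa ID taken from the "Lost path redundancy" event above)
esxcli nmp device list -d naa.6090a018005964bc9384643a2a0060cf

# Switch the volume from round robin back to fixed
esxcli nmp device setpolicy -d naa.6090a018005964bc9384643a2a0060cf --psp VMW_PSP_FIXED
```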

Regards,

Iain

AndreTheGiant
Immortal

I understand EqualLogic are developing their own MPIO module for vSphere, so if the above works I will probably wait for that to be released.

True. See this thread for some info:

Andre

**if you found this or any other answer useful please consider allocating points for helpful or correct answers

Andre | http://about.me/amauro | http://vinfrastructure.it/ | @Andrea_Mauro
depping
Leadership

Yes they are. The beta has just started; I received an invitation a week ago :)

Duncan

VMware Communities User Moderator | VCP | VCDX


If you find this information useful, please award points for "correct" or "helpful".

AndreTheGiant
Immortal

The beta has just started; I received an invitation a week ago

Good for you ;)

I sent a mail last week asking to join the beta program...

But still no reply...

Andre

carrmic
Contributor

I was told the new EqualLogic MPIO module for VMware will require the new VMware Enterprise Plus license though. That could limit its usage to those willing to upgrade, though there do seem to be a couple of nice features with the Plus license.

*Edit*

Thanks.

AndreTheGiant
Immortal

I was told the new EqualLogic MPIO module for VMware will require the new VMware Enterprise Plus license though.

Actually, the only third-party module so far is the Cisco Nexus one, but in that case there is also a technical requirement: distributed vSwitch support (available only in Enterprise Plus).

We'll have to wait for the product to see where it can be applied ;)

Andre

dwilliam62
Enthusiast

Are the clocks sync'd? The vCenter event lines up with the load balancing event. However, the connection failure shows up earlier.

Best thing is to open a case with Equallogic and let them look at the array diags and try to line up the events. Having the servers and array sync'd to an NTP server would be helpful as well.

Don

mcwill
Expert

Are the clocks sync'd? The vCenter event lines up with the load balancing event. However, the connection failure shows up earlier.

Thanks, yes the clocks are NTP synced. The vCenter event is always the same whether it is a load-balancing event or a connection failure.

I've pulled more detailed vmkernel logs (attached) from one of the hosts that relate to the following sequence of EQL logs...

INFO 14/06/09 14:06:00 EQL-Array-1 iSCSI session to target '192.168.2.245:3260, iqn.2001-05.com.equallogic:0-8a0906-350eb8f01-25d000000484a27a-vm-vcenter' from initiator '192.168.2.126:61651, iqn.1998-01.com.vmware:esxborgb-6d1c1540' was closed. iSCSI initiator connection failure. No response on connection for 6 seconds.

INFO 14/06/09 14:06:11 EQL-Array-1 iSCSI session to target '192.168.2.245:3260, iqn.2001-05.com.equallogic:0-8a0906-07a459001-9cc0005391b48e48-vm-store-workstation' from initiator '192.168.2.126:61160, iqn.1998-01.com.vmware:esxborgb-6d1c1540' was closed. iSCSI initiator connection failure. Connection was closed by peer.

INFO 14/06/09 14:06:35 EQL-Array-1 iSCSI login to target '192.168.2.241:3260, iqn.2001-05.com.equallogic:0-8a0906-07a459001-9cc0005391b48e48-vm-store-workstation' from initiator '192.168.2.126:55046, iqn.1998-01.com.vmware:esxborgb-6d1c1540' successful, using standard frame length.

INFO 14/06/09 14:06:53 EQL-Array-1 iSCSI login to target '192.168.2.242:3260, iqn.2001-05.com.equallogic:0-8a0906-350eb8f01-25d000000484a27a-vm-vcenter' from initiator '192.168.2.126:57993, iqn.1998-01.com.vmware:esxborgb-6d1c1540' successful using standard-sized frames. NOTE: More than one initiator is now logged in to the target.
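One quick way to triage logs like these is to filter out the routine load-balancing logouts so only the real failures remain. A minimal sketch follows; the sample lines are abridged from this thread (the `/tmp` file name and '...' target details are placeholders), so point the grep at your own exported event log instead.

```shell
# Abridged sample of EQL event-log lines from this thread; replace with
# your own exported log file.
cat > /tmp/eql-events.log <<'EOF'
INFO 14/06/09 14:06:00 EQL-Array-1 iSCSI session to target '...' was closed. iSCSI initiator connection failure. No response on connection for 6 seconds.
INFO 10/06/09 23:54:47 EQL-Array-1 iSCSI session to target '...' was closed. Load balancing request was received on the array.
EOF

# Keep only session closures that were NOT routine load-balancing logouts
grep 'was closed' /tmp/eql-events.log | grep -v 'Load balancing request'
```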

dwilliam62
Enthusiast

Thanks.

The '6 second timeouts' mean that the array and server couldn't communicate and the initiator didn't respond to the EQL Keepalive Packets. That typically is a problem on the network.

What kind of switches are you using? If more than one, how are they interconnected?

Best thing to do is open a case and they'll review the diags from those arrays.

mcwill
Expert

The '6 second timeouts' mean that the array and server couldn't communicate and the initiator didn't respond to the EQL Keepalive Packets. That typically is a problem on the network.

What kind of switches are you using? If more than one, how are they interconnected?

It's a stack of two ProCurve 2900s with a single connection from each ESX host to each switch (2 iSCSI ports per ESX host), and the 2 EQL boxes are connected to both switches.

Yes probably best to throw it at EQL.

Thanks.

carrmic
Contributor

It's a stack of two ProCurve 2900s with a single connection from each ESX host to each switch (2 iSCSI ports per ESX host), and the 2 EQL boxes are connected to both switches.

Yes probably best to throw it at EQL.

Thanks.

I have a very similar setup with 2 ProCurve 2824 switches. I have 2 PS100E arrays and 4 NIC ports connected to the switches from each of my 3 ESX servers. Most likely overkill at this point. We also have a 3 Gb trunk set up between the switches (though it should be larger). I just implemented the multipathing last night and I am not seeing any problems with connections dropping yet.

This may be a dumb question but do you have your 2 2900 switches linked together? The EqualLogic team also has lots of good tips for how to optimize the ProCurve configurations but a 6 second timeout would seem to be more than just a lack of an optimized configuration. I could compare configs with you though.

dwilliam62
Enthusiast

The inter-switch trunking is VERY important for proper operation. How many ports have you trunked between the two 2900s? With two arrays, four would be the minimum.

Also, is flow control enabled on the switch ports used in the SAN (arrays and servers)?

dwilliam62
Enthusiast

On your 2824s, have you enabled #qos-passthrough-mode one-queue? If you have the latest HP firmware, that setting will improve performance. By default, the 2824/2848s divide the buffer memory into four pools for QoS, so you only get 1/4 of the available buffers for iSCSI. Enabling passthrough realigns them into one large memory pool. You have to reboot to make the change effective (same with firmware upgrades). Increasing the trunked ports would definitely be a good idea.
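Pulling the switch settings discussed in this thread together, here is a sketch of how they would look on the ProCurve CLI. The port ranges are illustrative only, not anyone's actual layout; verify the commands against your switch model's manual, and note the one-queue change needs a reboot.

```shell
# ProCurve CLI sketch -- run from the "configure" context.
# Port numbers below are examples, not a real layout.

# Flow control on the SAN-facing ports (ESX hosts and EQL arrays)
interface 1-8 flow-control

# 2824/2848 only: collapse the four QoS buffer pools into one
# (requires a switch reboot to take effect)
qos-passthrough-mode one-queue

# LACP trunk between the two switches for inter-switch capacity
trunk 21-24 trk1 lacp
```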

Don

mcwill
Expert

No problem with trunk capacity; the 2900s are stacked via 2 x 10Gb connections.

And yes, flow control is enabled on all ESX & EQL ports, but no jumbo frames.

Regards,

Iain

dwilliam62
Enthusiast

Then definitely, please open a case. They'll need diags from both members. If you have a map of the connections, that would be helpful.

Don

mcwill
Expert

Posted in case anyone with a similar problem discovers this thread. We believe we have finally resolved this problem, thanks to Arnaud in VMware tech support.

It appears that Dynamic Discovery was the source of the problem. No errors have appeared in the 48 hours since we removed the EQL host IP from the dynamic discovery configuration screen.

Regards,

Iain

johnz333
Contributor

Hi Iain,

Did you leave your MPIO setting at fixed or go back to round robin? I am having similar issues, but removing the Dynamic Discovery SAN IP did not help.

thx

Riku100
Contributor

Hi

We are having the same issues here too. Yesterday VMware tech support changed our MPIO to fixed, but they left the SAN IP in Dynamic Discovery.

This didn't fix our problem.

Have you experienced any data loss because of this?
