kwolton
Contributor

iSCSI LUNs getting dropped under heavy load


Hi There,


I have found a few forum threads on this, but no answer as yet.

I have an HP LeftHand 4300 cluster for storage, connected by iSCSI to 2 x DL360 G7 ESXi 4.1 (HP edition) hosts through a gigabit ProCurve switch dedicated to iSCSI. Whenever I run a backup, the LUNs drop off for a few seconds (typically 5, although it has been as much as a minute).

Things I have tried:

Updated the LeftHand software to the latest version

Updated all HP firmware on the hosts

Updated to the latest HP Broadcom drivers on the ESXi hosts

Changed the timeout to 14000 (esxcfg-advcfg -s 14000 /VMFS3/HBTokenTimeout)

These servers go into production in the new year. Please, can someone help me fix this?

I have calls open with both VMware and HP, and neither has been particularly helpful. VMware suggested I call HP and update the NIC drivers; HP simply said it was a network issue and closed the call....thanks!

I have installed monitoring on the switches and cannot see any dropped ports etc. All the equipment is brand new and the LAN cables are Cat6.

Really don't know where to go from here!

The exact error I get is:

Lost access to volume 4cf7b1ac-5fdcc218-5c2e-1cc1dee48e30 (S1-SP1OS) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.
info
12/27/2010 1:06:06 AM
192.168.32.2

Please help me!

Kind Regards

Kris

1 Solution

Accepted Solutions
sevenp
Enthusiast

We recently changed some hosts that also had problems with their iSCSI connections to the SAN under heavy load. They previously used hwiscsi (TOE) on the Broadcom adapters, as recommended by several vendors, and we recently switched them to swiscsi (with Jumbo Frames).

Since the change there have been no more dropped or reset connections from these hosts, and throughput from the VMs is much faster (tested with IOmeter).

We have also seen much better performance (throughput) on hosts with NICs configured for swiscsi compared to hosts with an iSCSI HBA (QLogic 4062).

We use VMware ESXi 4.1 (latest patches).

We had the issues with Broadcom TOE (hwiscsi) on HP DL360 G6 servers and also on Dell M610 blades.

SAN: EqualLogic PS6500, PS6000, PS5500, PS5000

Switches: Juniper, with Jumbo Frames and Flow Control.

So I advise you to use swiscsi on those Broadcom adapters as well.

If you search for Broadcom issues with VMware, you will find others with the same experience.


37 Replies
idle-jam
Immortal

Did you set up load balancing or multipathing for iSCSI as shown below?

http://www.yellow-bricks.com/2009/03/18/iscsi-multipathing-with-esxcliexploring-the-next-version-of-...

JRink
Enthusiast

It might help if you can detail your iSCSI setup in VMware and on the SAN (single path, multipath, IPs, etc.).

The good news is, if it's not production, you can do your tests during the day. Make it fail while talking with a ProCurve tech and have them diagnose whether it's a network issue or not (probably not). If they say the network piece is fine, go back to the HP SAN tech, tell them you don't have issues at the network level, and ask to escalate the issue there.

HP Procurve tech support is awesome.  HP server/san support (IMHO) is horrendous.

JayCardin
Contributor

What model HP switch are you using for iSCSI?

Are you using Jumbo Frames and/or Flow Control?

I have a similar setup (P4500 and BL490s connected to a ProCurve 2810-24G) and am having similar problems. While I haven't got to the point of LUNs dropping, I get iSCSI latency errors from vFoglight (VM monitoring software) when running backups.

I have also been monitoring the switches and don't see packets being dropped either.  While I am inclined to point the finger at the 2810-24G switch, I find it odd that these problems cropped up after upgrading to SANIQ 8.5 and ESXi 4.1

kwolton
Contributor

Thanks for all your replies. Sorry, I have been away for a couple of days seeing family (I am meant to be off this week, however I am lumbered with trying to get this working 😞).

Please find attached a Visio diagram of the setup. I will upload some screenshots of the network config shortly.

kwolton
Contributor

Hi Jay,

I am using an 1810G switch; however, these are similar, I believe. We are not using jumbo frames, the switch firmware is the current version and I have reset it to factory defaults.

I am glad I am not the only one with this problem. What have you done for troubleshooting?

FYI - I have SAN/iQ 9.0 and VMware vSphere 4.1.

Were HP/VMware any help to you? Have you tried ProCurve?

Kris

kwolton
Contributor

Hi Guys,

I am attaching some screenshots and a network schematic. Hopefully this will answer a few questions.

We are using hardware iSCSI and fixed path selection on the VM hosts. On the LeftHands we are using adaptive load balancing.

Kris

kwolton
Contributor

More Screen shots

kwolton
Contributor

Network Schematic.

Would really appreciate your help on this one, I'm beginning to struggle! I logged a case with ProCurve this morning as well, just to see if they could shed any light on it; however, I cannot imagine it being the switch....

Kris

JayCardin
Contributor

Hi Kris,

In my case, VMware techs blamed the ProCurve and the LeftHand; HP techs blamed VMware. Typical...

The one problem with the switches that you and I are using is the lack of Flow Control and Jumbo Frame support at the same time.  The other problem is the small packet buffer size.  The 1810G only has 512KB of buffer memory, the 2810 only has 750KB.
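To put those buffer sizes in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes a worst case that is not taken from HP documentation: two gigabit senders bursting at line rate into a single gigabit egress port, with the whole packet buffer available to that one port. Real switches share and partition their buffers differently, so treat the result as an order-of-magnitude guide only. The function and variable names are made up for illustration.

```python
GBIT = 1_000_000_000  # bits per second on a gigabit link

# Buffer sizes quoted above (assumed here to be fully usable by one port).
BUFFERS = {"1810G": 512 * 1024, "2810": 750 * 1024}


def buffer_fill_time_ms(buffer_bytes, ingress_bps, egress_bps):
    """Milliseconds until the buffer is full when ingress exceeds egress."""
    excess_bps = ingress_bps - egress_bps
    if excess_bps <= 0:
        return float("inf")  # the port drains at least as fast as it fills
    return buffer_bytes * 8 / excess_bps * 1000


def fill_times():
    """Fill time per switch for two 1 Gbit/s senders into one 1 Gbit/s port."""
    return {
        model: buffer_fill_time_ms(size, 2 * GBIT, GBIT)
        for model, size in BUFFERS.items()
    }


if __name__ == "__main__":
    for model, ms in fill_times().items():
        print(f"{model}: buffer full after about {ms:.1f} ms")
```

Under those assumptions either buffer is exhausted within a few milliseconds of a sustained burst, after which the switch has to send pause frames (with Flow Control on) or drop packets (with it off).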

It might be worth verifying that Flow Control is on end-to-end (LeftHand, 1810G and the ESX NICs) but leaving Jumbo Frames off. I have the opposite configuration now (Flow Control off but Jumbo Frames on) and it is not performing great.

How are you doing your backups (when the drop out occurs)?

I have latency errors when a VM is copying 40GB of data from a LUN on SAN1 to a LUN on SAN2. These LUNs are presented to the VMs via iSCSI using the Microsoft initiator.

kwolton
Contributor

Haha, yeah, I had exactly the same response from both of them!! Very helpful! Thanks for your posts, it's comforting to know I am not the only person with this problem; now all we have to do is solve it! Is yours running better with Jumbo Frames enabled?

I have enabled Flow Control on the 3 devices and tested again by running a disk-to-disk backup in Tapeware. The problem still exists, however it seems better.

The more I get into this, the more I think it is actually a switch issue. Looking at the switch, I found the support file download option. The results are attached; I see a large number of "Transmitted Pause Frames", which looks like the switch being unable to transmit the data fast enough.

What are your thoughts?

JayCardin
Contributor

That sure looks like the switch is telling the devices on ports 2, 14 & 17 to slow down.  Are those your storage nodes or the ESX hosts?

It's odd that HP shows all that info but doesn't list Dropped Packets anywhere.

I think that enabling Jumbo Frames may make the problem worse. If the switch can't forward the standard sized packets fast enough, asking it to forward larger packets could really hurt performance. (Vendors only seem to publish the performance of their switches using 64-byte packets, so it's hard to judge how fast it can switch jumbo frames.)
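As a rough illustration of the trade-off, the Python sketch below compares standard and jumbo frames on a gigabit link: how many full-size frames a 512KB buffer could hold, and how many frames per second the switch must forward at line rate. It assumes untagged Ethernet framing and a buffer that is allocated byte-for-byte to whole frames, which is a simplification (real switches often allocate in fixed-size cells), so the figures are indicative only. All names are illustrative.

```python
ETH_HEADER_FCS = 18  # 14-byte Ethernet header + 4-byte FCS, no VLAN tag
WIRE_OVERHEAD = 38   # header + FCS + 8-byte preamble/SFD + 12-byte inter-frame gap


def frames_in_buffer(buffer_bytes, mtu):
    """Whole full-size frames that fit in the buffer (simplified model)."""
    return buffer_bytes // (mtu + ETH_HEADER_FCS)


def line_rate_fps(link_bps, mtu):
    """Full-size frames per second needed to saturate the link."""
    return link_bps / (8 * (mtu + WIRE_OVERHEAD))


if __name__ == "__main__":
    buffer_bytes = 512 * 1024  # the 1810G figure quoted earlier in the thread
    for mtu in (1500, 9000):
        held = frames_in_buffer(buffer_bytes, mtu)
        fps = line_rate_fps(1_000_000_000, mtu)
        print(f"MTU {mtu}: {held} frames fit in the buffer, {fps:,.0f} frames/s at line rate")
```

With jumbo frames the switch forwards far fewer frames per second, but each queued frame takes roughly six times as much buffer, so a small buffer absorbs a much shorter burst in terms of frame count. Which effect dominates on a given switch is something only testing will show.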

When looking at the CMC, how is the Average Queue Depth in the performance monitor when this is happening?

One other thing to look at is the ESX server: run esxtop and press 'n' to see whether you are seeing Dropped Packets.
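If watching the interactive screen during a backup window is impractical, esxtop can also be run in batch mode (for example `esxtop -b -d 5 -n 60 > esxtop.csv`) and the CSV checked afterwards. The Python sketch below is one possible way to do that. It assumes the drop counters can be recognised by the word "Dropped" in the column header; the exact header text may differ between versions, so check it against your own capture. The function name is made up for illustration.

```python
import csv
import io


def max_drop_counters(csv_text, keyword="dropped"):
    """Return {column header: peak value} for non-zero drop counters.

    Columns are selected by a case-insensitive keyword match on the header,
    which is an assumption about how the counters are named.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    cols = [i for i, name in enumerate(header) if keyword in name.lower()]
    peaks = {header[i]: 0.0 for i in cols}
    for row in reader:
        for i in cols:
            try:
                value = float(row[i])
            except (ValueError, IndexError):
                continue  # skip blank or truncated samples
            peaks[header[i]] = max(peaks[header[i]], value)
    # Only report counters that were non-zero at some point.
    return {name: peak for name, peak in peaks.items() if peak > 0}
```

Reading the capture with `open("esxtop.csv").read()` and passing the text to `max_drop_counters` returns an empty dictionary when no drops were recorded, which would point away from the host NICs and back towards the switch or the storage.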

JayCardin
Contributor

Do you have any Windows clients running the Lefthand DSM for MPIO?  Are they attached to the same LUNs that the ESX hosts are?

If so, this can cause a problem:  http://kb.vmware.com/kb/1030129

kwolton
Contributor

Yeah, the P4300 and the hosts are on those ports. I spoke to ProCurve support, who weren't as helpful as I thought (hoped) they would be. The usual "upgrade the firmware and test for 24 hours" line.

It definitely looks as though there is a throughput issue here, however I am struggling to see how I'm going to resolve it.

kwolton
Contributor

I don't think this one applies to me....I have the management software installed on a separate physical machine.

Josh26
Virtuoso

Hi,

If I'm reading this right, you have the iSCSI vmks bound to the Broadcom NICs to offload your iSCSI to the hardware. Whilst this should be ideal, it appears to be a newer method, and one where the bugs aren't ironed out.

There's a tonne of discussion around this being less than ideal because it doesn't support Jumbo Frames, and lots of "not worth the effort of doing it in hardware because of lacking Jumbo Frames" style advice. It's entirely possible there are more bugs in this setup than we realised.

I would recommend using the

esxcli swiscsi nic add

command to bind the Port Group operating your iSCSI to the software driver (usually vmhba38 or vmhba37) and see what happens.
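For anyone unsure of the full syntax: on ESX(i) 4.x the command takes the VMkernel port with `-n` and the iSCSI adapter with `-d`. The small Python helper below just prints the command lines to run on the host for a given adapter and set of VMkernel ports. The names `vmk1`, `vmk2` and `vmhba38` are placeholders, not values taken from this setup; check the real ones in the vSphere Client before running anything.

```python
def swiscsi_bind_commands(adapter, vmknics):
    """Build the ESX(i) 4.x commands that bind VMkernel ports to an iSCSI adapter."""
    if not vmknics:
        raise ValueError("at least one VMkernel port is required")
    # One 'nic add' per VMkernel port that should carry iSCSI traffic.
    commands = [f"esxcli swiscsi nic add -n {vmk} -d {adapter}" for vmk in vmknics]
    # Finish with a listing so the resulting bindings can be verified.
    commands.append(f"esxcli swiscsi nic list -d {adapter}")
    return commands


if __name__ == "__main__":
    # Placeholder names: substitute your own adapter and VMkernel ports.
    for line in swiscsi_bind_commands("vmhba38", ["vmk1", "vmk2"]):
        print(line)
```

The final `nic list` line is there only to confirm the bindings took effect; nothing here changes path selection policy.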

kwolton
Contributor

Hi Jay,

I have spoken to ProCurve Support and they confirm that the switch is struggling like hell to keep up with the traffic that is being loaded on to it.

They are bouncing ideas around with the level two team at the moment to see if they can come up with a solution. How is your bonding set up on the SAN/hosts?

I have adaptive load balancing on the SANs. On the VM hosts I have experimented with several options; at the moment I have just enabled one NIC and put the other on standby. There seem to be several conflicting articles on the best way of setting this up.

Kris

kwolton
Contributor

Thanks Josh. I read that the best way was to use hardware, and in my mind that seemed like the best as well. Since setting it up, however, I have heard several people say that software is better. What are the implications of changing from hardware to software iSCSI for my VMs? Will it break anything? I am fairly new to the CLI and VMware; what is the exact command used? Also, how would you load balance the NICs? Is round robin OK? At the moment I have active/standby and obviously no load balancing, which I think is half the problem.

Edited to remove language

dquintana
Virtuoso

In my experience, every case of LUNs disconnecting came down to one of two causes: 1) a fault in the firmware of the storage or switches, or 2) more operations than the LUN could handle. Maybe you should upgrade the firmware first to check whether the problem disappears.

Ing. Diego Quintana - VMware Communities Moderator - Co Founder & CEO at Wetcom Group - vEXPERT From 2010 to 2020- VCP, VSP, VTSP, VAC - Twitter: @daquintana - Blog: http://www.wetcom.com-blog & http://www.diegoquintana.net - Enjoy the vmware communities !!!

Josh26
Virtuoso

Hi,

The load balance algorithm chosen is VMware's default for your hardware. Unless your particular SAN vendor gives you specific instructions, please don't assume you'll be smarter than VMware and attempt to set up any sort of balancing.

I don't anticipate any breakages with the VMs.
