VMware Cloud Community
kwolton
Contributor

iSCSI LUNs getting dropped under heavy load

Hi There,


I have found a few forums on this...but not an answer as yet.

I have an HP LeftHand 4300 cluster for storage, connected by iSCSI to 2 x DL360 G7 ESXi 4.1 (HP edition) hosts and a gigabit ProCurve switch dedicated to iSCSI. Whenever I run a backup the LUNs drop off for a few seconds (typically 5, though it has been as much as a minute).

Things I have tried...

Updated the LeftHand software to the latest version

Updated all HP firmware on the hosts

Updated to the latest HP Broadcom drivers on the ESXi hosts

Changed the heartbeat token timeout to 14000 (esxcfg-advcfg -s 14000 /VMFS3/HBTokenTimeout) - quick check below
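
For reference, the value that ends up applied can be read straight back from the host; a minimal check using the same advanced option as the command above:

# show the current heartbeat token timeout on the ESXi host
esxcfg-advcfg -g /VMFS3/HBTokenTimeout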

These servers go into production in the new year, so please can someone help me fix this?

I have calls open with both VMware and HP, but neither has been particularly helpful. VMware suggested I call HP and update the NIC drivers; HP simply said it was a network issue and closed the call...thanks!

I have installed monitoring on the switches and cannot see any dropped ports etc. All the equipment is brand new and the LAN cables are Cat6.

Really don't know where to go from here!

The exact error I get is...

Lost access to volume 4cf7b1ac-5fdcc218-5c2e-1cc1dee48e30 (S1-SP1OS) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly.
info
12/27/2010 1:06:06 AM
192.168.32.2
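
For anyone chasing the same symptom, the matching vmkernel messages around that timestamp are worth pulling from the host as well. A rough sketch only - on ESXi 4.1 the vmkernel output normally ends up in /var/log/messages, and the search strings are just examples:

# look for the volume-loss and iSCSI messages around the time of the drop
grep -i "lost access" /var/log/messages | tail -20
grep -i iscsi /var/log/messages | tail -50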

Please help me!

Kind Regards

Kris

37 Replies
kwolton
Contributor

Hi Jay,

Happy New Year etc etc....

Have you got anywhere with this problem? HP say the switch needs to be upgraded to a much better model to cope with the traffic, however I know of people running up to 60 VMware machines on much worse switches.

Have you installed ProCurve Manager? If you look at the utilisation on my iSCSI switch, it rarely goes above about 50% even when running a backup.

Something weird happened though: two devices turned up in PCM listed under manufacturer as McAfee. We have no McAfee devices on our network, so I investigated a bit further and it seems PCM is seeing the two SANs (it's only meant to see switches!) as McAfee devices for some reason. No big deal...however it shows two NICs connected at 1000Mb and one called bond0 connected at 10Mb. If that is indeed the case, it's no wonder there is a bottleneck.

I am not sure if this is just being wrongly reported though; it might just be a glitch! When running a backup, it appears the 10Mb device goes to 100% utilisation and stays there!!

I would be really interested to see if yours does the same thing with PCM. This could be the answer.

I am a little stuck as to where to go from here though!

JayCardin
Contributor

On Sunday I replaced the two 2810-24Gs with two 3500yl-48s (J8693A).  Not only does this double the amount of port buffer memory, it also allows Flow Control and Jumbo Frames to be enabled at the same time.

So far the results have been positive.  The latency errors have almost completely disappeared (but I still have some more changes to make).  I am still in the process of enabling Flow Control on the ESXi hosts and the storage nodes (why can't I enable Flow Control on the LeftHand nodes without breaking the bond first?!).
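
For reference, flow control on the ESXi uplinks can be checked and set from Tech Support Mode with ethtool, assuming it is present on your build; vmnic2 below is just a placeholder for whichever uplink carries iSCSI, and the change may not survive a reboot on its own:

# show current pause (flow control) settings on an iSCSI uplink
ethtool -a vmnic2
# enable receive and transmit pause frames on that uplink
ethtool -A vmnic2 rx on tx on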

I will let you know how it goes.

kwolton
Contributor

Thanks for the update....

How many VMs are you running? Which backup software are you running? I should think with that switch you shouldn't be getting any latency!!

Do you see the same results as me for utilisation on the ports? Do you have ProCurve Manager (it's free)?

Thanks again for your continued posts.


JayCardin
Contributor

We are running 50 VMs on 6 ESXi hosts and 6 P4500 nodes (only 2 nodes are G2s).  Attached is a diagram of the new configuration with the 3500YL switches.  Red lines are Gigabit Ethernet and blue are 10Gb Ethernet.

We use three different backup methods for data on our VMs.

1) vRanger Pro 4.5 on a physical DL360 G5 with a single Gigabit iSCSI connection to one of the 3500YL switches.  This is how we back up the C drives on the VMs.  These backups aren't a problem and never cause latency errors.

2) BackupExec 11d on the same physical server as vRanger Pro.  This uses the BE Remote Agent to back up files on some VMs (those VMs have iSCSI connections and use the LH DSM for MPIO).  These backups aren't a problem either.

3) Robocopy or Xcopy.  One 2008 R2 VM with a DSM MPIO connection to the SAN will initiate a Robocopy session to another 2008 R2 server.  These copy from LUN to LUN over the 10Gb LAN on the backplane of the C7000 blade chassis.  These backups cause the latency errors.

I have ProCurve Manager but I haven't updated it for the new network config yet.

I continue to get latency errors when copying data from LUN to LUN:

[Fatal] ESX Host ul-c7k-01-b01 Total Command Latency (time taken during the collection interval to process a SCSI command issued by the Guest OS to the virtual machine. The sum of kernelLatency and deviceLatency) to extent: 59.00 have exceeded the threshold of 40 ms. Virtual machines on this ESX Host may be experiencing performance problems.
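
For what it's worth, the device latency behind those alerts can be watched live with esxtop (press u for the disk device view and compare DAVG/cmd against KAVG/cmd), or captured during a copy for later analysis. A rough capture sketch, with the interval and iteration counts purely as examples:

# record esxtop in batch mode every 5 seconds for 10 minutes while the copy runs
esxtop -b -d 5 -n 120 > /tmp/esxtop-during-copy.csv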

Yesterday, while trying to enable Flow Control on the storage nodes, I found that when 1 storage node (out of 4) disconnected, all of the LUNs dropped!!!  HP has escalated the case but I haven't heard back yet.

I am starting to wish I had gone with NetApp...

kwolton
Contributor

Ok, I have raised calls with HP Storage, HP Networking, VMware, and another IT company that specialises in VMware...I am running out of options now and hoping one of you guys on here will come up with the answer for me. If I had a phone number for God, I would ask him as well!!

I have tried changing the switch, however this does not change a thing.

The old switch used to see a lot of paused frame transmissions, and after reading a VMware article this appears to be a problem with flow control, so I disabled it on the hosts, switches and SANs; still no difference.
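
In case it helps anyone comparing notes, the hosts' own pause-frame counters can be read per uplink with ethtool, assuming it is available in Tech Support Mode; vmnic2 is a placeholder and the exact counter names depend on the NIC driver:

# dump NIC statistics and pick out the pause-related counters
ethtool -S vmnic2 | grep -i pause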

Please can someone help me with this? Not sure where to go from here.

Kind Regards

Kris

kwolton
Contributor

Another thing that is weird is that the switch utilization never goes above 45%...surely a network timeout shouldn't be occurring at only 45% utilization on the switches??

mzahedi
Contributor

I had the same issue with a Dell MD300i; there were a couple of things I did which helped me improve the service:

1) Is the switch a gigabit switch, and how are the ports set?

2) How are your NIC cards set? Are you using 1Gb or 100Mb?

kwolton
Contributor

Hi,

Switches are gigabit; each port is set to auto and negotiating at 1Gb. I have tried flow control on and off.

NICs are set to auto and negotiate at 1Gb.
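
For anyone following along, the negotiated speed and duplex on the host side can be confirmed straight from the ESXi console (nothing host-specific assumed, it is just a listing command):

# list physical NICs with driver, link state, speed, duplex and MTU
esxcfg-nics -l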

LarryBlanco2
Expert

I'd go one layer down and make sure your cables are good.   If you have changed switches, are still having the same issue, and the switch is only at 45% utilization, it would not hurt to rule the cabling out.

Just a thought.

Larry

mzahedi
Contributor

How is your network set up in VMware?

I will send you some information by the end of the day which should help you.

Majid Zahedi

Network Engineer

49er shops, Inc

California State University Long Beach

6049 East 7th street

Long Beach , Ca 90840

Tel: 562-985-7727

Cell: 949-275-7072

kwolton
Contributor

Hi Majid,

Thanks for your update; sorry, I have been out of the office for the last few days.

The network is using a separate VMkernel port for iSCSI traffic with round robin. iSCSI is using the hardware offload on the NICs; we are not using software iSCSI.
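
In case the detail matters to anyone comparing setups, the multipathing policy per volume can be confirmed from the ESXi console; a minimal sketch, where the naa identifier is only a placeholder for one of the LeftHand volumes:

# list devices with their current path selection policy
esxcli nmp device list
# set round robin explicitly on one device (placeholder ID)
esxcli nmp device setpolicy --device naa.6000eb3xxxxxxxxxxxxxxxxxxxxxxxxx --psp VMW_PSP_RR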

Did you say you had some info that could help me?

Kind Regards

Kris

sevenp
Enthusiast

Hi Kris,


We had similar issues in our VMware (ESXi 4.1) environment. We use different hosts, switches and SAN, but also use Broadcom NICs (NetXtreme II 5708/5709) for iSCSI with offloading (and without Jumbo, because that's not supported).

On hosts with QLogic HBAs we don't have issues (no dropped connections).

So I suspect the problems will be gone when you use software iSCSI (with Jumbo) on those adapters.

Or use other adapters like Intel or QLogic.
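
If it helps, switching to the software initiator on ESX/ESXi 4.1 is roughly the sequence below, assuming the esxcfg-swiscsi wrapper is available on your build (otherwise the same can be done from the vSphere Client); vmhba33 and vmk1 are placeholders for the software adapter and the iSCSI VMkernel port:

# enable the software iSCSI initiator
esxcfg-swiscsi -e
# bind the iSCSI VMkernel port(s) to the software adapter
esxcli swiscsi nic add -n vmk1 -d vmhba33
# confirm the binding
esxcli swiscsi nic list -d vmhba33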

Please let me know if this is a solution for you.

Regards,

sevenp
Enthusiast

We recently changed some hosts that also had issues under heavy load with their iSCSI connections to the SAN. We used hwiscsi (TOE) on the Broadcom adapters before (it was a recommendation from several vendors), and recently changed this to swiscsi (with Jumbo).

No dropped or reset connections any more from these hosts, and much faster throughput from the VMs (tested with IOmeter).

We have also seen much better performance (throughput) on hosts with NICs configured for swiscsi compared to hosts with an iSCSI HBA (QLogic 4062).

We use VMware ESXi 4.1 (latest patches).

We saw the issues with Broadcom TOE (hwiscsi) on HP DL360 G6 servers and also on Dell M610 blades.

SAN: EqualLogic PS6500, PS6000, PS5500, PS5000

Switches: Juniper, with Jumbo Frames and Flow Control.

So I advise you to use swiscsi on those Broadcom adapters as well.
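
For completeness, enabling Jumbo on the ESXi 4.1 side means setting the MTU on both the vSwitch and the iSCSI VMkernel port (which in 4.x has to be deleted and recreated), then testing end to end. A rough sketch only - vSwitch1, the port group name and the addresses are placeholders:

# raise the MTU on the iSCSI vSwitch
esxcfg-vswitch -m 9000 vSwitch1
# recreate the iSCSI VMkernel port with a 9000-byte MTU
esxcfg-vmknic -d "iSCSI-1"
esxcfg-vmknic -a -i 192.168.32.50 -n 255.255.255.0 -m 9000 "iSCSI-1"
# test a jumbo-sized payload (9000 minus 28 bytes of headers) against the SAN address
vmkping -s 8972 192.168.32.1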

If you search for Broadcom issues with VMware, you will find others with the same experiences.

kwolton
Contributor

Hi,

Yes, the answer was to use software iSCSI. This solved the problem instantly. I have not enabled jumbo frames but have flow control enabled on all devices. It seems fine even under heavy load.
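
For anyone landing on this thread later, it is easy to double-check that the software initiator is really the one in use; a quick sketch, with vmhba33 as a placeholder for the software iSCSI adapter name:

# confirm the software iSCSI initiator is enabled
esxcfg-swiscsi -q
# list storage adapters and their drivers (the software iSCSI adapter shows up here)
esxcfg-scsidevs -a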

Kind Regards

Kris

kwolton
Contributor

PS - Thanks to everyone that has posted.

idle-jam
Immortal

Glad to see that it has finally been resolved. 😃

sevenp
Enthusiast

Kris,

Nice to see the problems are solved; you're welcome. 🙂

We replaced many hardware components before we found that the problems were caused by the hwiscsi on the Broadcoms (5709).

mzahedi
Contributor

Sorry, I have been out of the country for the last month. It is good to see your problem has been resolved.
