VMware Cloud Community
JacquesCronje
Enthusiast

Disgustingly poor performance from iSCSI luns on NetApp

I have been suffering with this issue for some time now. I have two NetApp 3020s clustered: one head with FC disks and one head with SATA disks, both on the same level of ONTAP. The LUN on the FC head performs really well, but the LUNs on the SATA head perform really badly. My backups from SATA using esXpress run at 1 MB/s or less, while on the FC LUNs they come in at 18 MB/s, which is what the throttle is set to. My guests on SATA perform poorly too.

One major point of difference between the two (besides one being FC and one SATA) is these infernal messages I keep getting in the ESX console since the upgrade to 3.5 and the new iSCSI initiator. They come non-stop, and I don't get them on the FC head. Does anyone have any ideas?

Wed Apr 1 19:08:58 NZDT : ISCSI: New session from initiator iqn.1998-01.com.vmware:akesx03-18778d0c at IP addr 172.18.250.33

Wed Apr 1 19:09:00 NZDT : ISCSI: New session from initiator iqn.1998-01.com.vmware:akesx06-077d73bf at IP addr 172.18.250.36

Wed Apr 1 19:09:37 NZDT : ISCSI: New session from initiator iqn.1998-01.com.vmware:akesx04-1d9d5218 at IP addr 172.18.250.34

Wed Apr 1 19:09:58 NZDT : ISCSI: New session from initiator iqn.1998-01.com.vmware:akesx03-18778d0c at IP addr 172.18.250.33

Wed Apr 1 19:11:00 NZDT : ISCSI: New session from initiator iqn.1998-01.com.vmware:akesx06-077d73bf at IP addr 172.18.250.36

Wed Apr 1 19:12:15 NZDT : ISCSI: New session from initiator iqn.1998-01.com.vmware:akesx05-2ee7bcca at IP addr 172.18.250.35

Wed Apr 1 19:12:37 NZDT : ISCSI: New session from initiator iqn.1998-01.com.vmware:akesx04-1d9d5218 at IP addr 172.18.250.34

Wed Apr 1 19:14:00 NZDT : ISCSI: New session from initiator iqn.1998-01.com.vmware:akesx06-077d73bf at IP addr 172.18.250.36

Wed Apr 1 19:14:18 NZDT : ISCSI: New session from initiator iqn.1998-01.com.vmware:akesx07-08308b02 at IP addr 172.18.250.37

Wed Apr 1 19:16:15 NZDT : ISCSI: New session from initiator iqn.1998-01.com.vmware:akesx05-2ee7bcca at IP addr 172.18.250.35

Wed Apr 1 19:16:37 NZDT : ISCSI: New session from initiator iqn.1998-01.com.vmware:akesx04-1d9d5218 at IP addr 172.18.250.34

Wed Apr 1 19:17:00 NZDT : ISCSI: New session from initiator iqn.1998-01.com.vmware:akesx06-077d73bf at IP addr 172.18.250.36

Wed Apr 1 19:17:18 NZDT : ISCSI: New session from initiator iqn.1998-01.com.vmware:akesx07-08308b02 at IP addr 172.18.250.37

Wed Apr 1 19:17:58 NZDT : ISCSI: New session from initiator iqn.1998-01.com.vmware:akesx03-18778d0c at IP addr 172.18.250.33

Wed Apr 1 19:18:00 NZDT : ISCSI: New session from initiator iqn.1998-01.com.vmware:akesx06-077d73bf at IP addr 172.18.250.36

Wed Apr 1 19:18:18 NZDT : ISCSI: New session from initiator iqn.1998-01.com.vmware:akesx07-08308b02 at IP addr 172.18.250.37

Wed Apr 1 19:18:37 NZDT : ISCSI: New session from initiator iqn.1998-01.com.vmware:akesx04-1d9d5218 at IP addr 172.18.250.34

Wed Apr 1 19:19:37 NZDT : ISCSI: New session from initiator iqn.1998-01.com.vmware:akesx04-1d9d5218 at IP addr 172.18.250.34

Wed Apr 1 19:20:00 NZDT : ISCSI: New session from initiator iqn.1998-01.com.vmware:akesx06-077d73bf at IP addr 172.18.250.36
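For what it's worth, a quick sketch for quantifying that churn from the console log (assuming the line format quoted above). A healthy initiator should log a new session once and keep it; the same IQN reappearing every couple of minutes indicates it is timing out and reconnecting:

```python
import re
from collections import Counter

# Tally "New session" messages per initiator IQN from a filer console log.
# Repeated entries for the same IQN over a short window suggest the
# initiator is dropping and re-establishing its session.
SESSION_RE = re.compile(r"New session from initiator (\S+) at IP addr (\S+)")

def count_sessions(log_lines):
    """Return a Counter mapping initiator IQN -> number of new sessions."""
    return Counter(m.group(1)
                   for m in (SESSION_RE.search(line) for line in log_lines)
                   if m)

if __name__ == "__main__":
    sample = [
        "Wed Apr 1 19:08:58 NZDT : ISCSI: New session from initiator "
        "iqn.1998-01.com.vmware:akesx03-18778d0c at IP addr 172.18.250.33",
        "Wed Apr 1 19:09:58 NZDT : ISCSI: New session from initiator "
        "iqn.1998-01.com.vmware:akesx03-18778d0c at IP addr 172.18.250.33",
    ]
    for iqn, n in count_sessions(sample).most_common():
        print(iqn, n)
```

Anything much above one session per initiator over a window like the one above is worth investigating.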

12 Replies
paul_xtravirt
Expert

Hi there,

I would be interested to know whether you had performance problems before upgrading ESX. The message points to a known issue on the NetApp, where the response between the NetApp and the iSCSI initiator on the ESX server times out; the message shows that the connection has been re-established after the timeout period expired.

The problem has been fixed, but the fix requires an ONTAP upgrade. What version are you currently running?

It is fixed in ONTAP 7.1RC1 (first fixed), 7.1.2.1 (GD), 7.2.4 (GD), and 7.2.4L1 (GD).

I would suggest contacting NetApp support to ensure that you get the correct version for the upgrade.

If you found this helpful, please consider awarding points
RParker
Immortal

one head with FC disks and one head with SATA disks.

That's the problem. In addition to what Paul said, there are other factors on the NetApp you should be looking at, but SATA disks are great for single-purpose desktop use, not for a SAN. Case in point.

JacquesCronje
Enthusiast

I'm afraid SATA being SATA is not what's causing this issue. I am aware of the limitations of SATA disks, but I also believe they are suitable for low-I/O applications.

Paul, you make a good point; I'll get in touch with NetApp. I was just really perplexed that the FC head (the other member of the cluster) isn't showing those errors, despite using the same ESX software initiator and being on the same version of ONTAP (7.2.4).

To answer your first question, they were working and performing fine on the old Cisco initiator. I believe it's the constant reconnecting that's causing my performance issues.

RParker
Immortal

I'm afraid the fact that SATA is SATA is not causing this issue - I am aware of the limitations of SATA disks, but I also believe that they are suitable for low I/O applications.

Gee, that's funny; perhaps you can explain the difference in this statement that YOU posted? Because that's EXACTLY what it shows:

the lun on the FC head is performing real well, but the luns on the SATA head is really bad. My backups from SATA using esXpress is 1meg or less a second, but on the FC luns it comes at 18meg/sec, which is what the throttle is set at. My guests on SATA perform poorly too.

JacquesCronje
Enthusiast

No need to shout and be condescending; I was just looking for some constructive dialogue. I also said:

"To answer your first question, they were working and performing fine on the old Cisco initiator."

I didn't have the issue before the upgrade to ESX 3.5 and the software initiator change.

paul_xtravirt
Expert

That's where you are now seeing the issue: between the iSCSI software initiator and your hardware. I would stake a good bet that once you upgrade as previously mentioned, you will be fine. It would be good to hear back, though, so make sure you post your results here :)

If you found this helpful, please consider awarding points
JohnADCO
Expert

I am looking at all those different IP addresses...

Do you actually have a network card for each of those? Or is there a DHCP server serving these up each time a new session starts?

I won't even set two up on the same subnet, let alone the slew of different IP addresses I see there.

I think your issue lies there somehow, someway. How many subnets are you running in your iSCSI network? How many NICs are you using for the connections on the host? How many connections are available on the storage device?

I am having a hard time grasping your iSCSI network architecture.

PS: I don't think it's the SATA. I mean, they are slow, but not usually disgustingly slow; mildly irritatingly slow, maybe. :)

JacquesCronje
Enthusiast

Each host is set up with a team of two NICs and a dedicated VMkernel port for iSCSI; each NIC is connected to one of two separate Foundry x448 switches. These two switches form a dedicated storage LAN that is isolated from the production network. The NetApp cluster (FC head and SATA head) presents a virtual network interface (vif), consisting of two Ethernet interfaces, on the storage LAN.

I have five ESX hosts connecting to the SATA head (NetApp 3020). Port utilisation on the ESX hosts, the switches and the NetApp is negligible. The FC head's iSCSI uses the same ESX NICs, the same switches and the same vif setup, and it screams along.

I guess it's worth mentioning that I'm running A-SIS (deduplication) on the VMware volumes; that was implemented almost a year ago with no issues. The more I read and research, the more I lean toward an ONTAP upgrade. But I certainly welcome any suggestions and will update the post for those interested.

JohnADCO
Expert

Pretty basic setup....

All your hosts are doing it; that is what I wasn't grasping.

So the only difference is the SATA thing, then? Suspicious. It's hard to believe the drive scheme could cause what is happening.

Maybe an IOmeter summary would prove more valuable in troubleshooting efforts. Not sure.

dilidolo
Enthusiast

We have a 3070 cluster; we put production VMs on an aggregate with 15K RPM FC disks and test VMs on an aggregate with SATA disks. Yes, SATA is very slow, but we are not using iSCSI; we use NFS.

Performance really depends on how many disks are in the aggregate, how many datastores are on that aggregate, and how many VMs are in each datastore.
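To put rough numbers on the disk-count point, here's a back-of-the-envelope sketch. The per-disk figures (~75 IOPS for 7.2K SATA, ~175 IOPS for 15K FC) are common rules of thumb, not measurements from either of our setups:

```python
# Rough aggregate IOPS estimate: data disks x per-disk random IOPS.
# Per-disk numbers below are rule-of-thumb assumptions, not benchmarks.
SATA_7200_IOPS = 75   # typical 7.2K RPM SATA spindle
FC_15K_IOPS = 175     # typical 15K RPM FC spindle

def aggr_iops(data_disks, per_disk_iops):
    """Crude ceiling for random IOPS an aggregate can sustain."""
    return data_disks * per_disk_iops

sata = aggr_iops(12, SATA_7200_IOPS)  # 12-disk SATA aggregate
fc = aggr_iops(12, FC_15K_IOPS)       # 12-disk 15K FC aggregate
print(sata, fc)  # 900 2100
```

Same spindle count, well over twice the IOPS headroom on FC, which is why the same VM load can feel fine on one aggregate and saturated on the other.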

JacquesCronje
Enthusiast

Hi all - I thought I'd post a follow-up for those interested:

The performance issues went away after upgrading from ONTAP 7.2.4 to 7.3.1.

My esXpress backups went from 1 MB/s back to 10-18 MB/s!

The iSCSI messages didn't stop, though, and further investigation in the vmkernel logs showed:

"Apr 25 01:01:47 akesx03 vmkernel: 1:19:13:05.300 cpu3:1074)<5>iSCSI: session 0x96180a0 iSCSI: session 0x96180a0 retrying all the portals again, since the portal list got exhausted"

The post below seemed to explain the problem.

http://vikashkumarroy.blogspot.com/2009/03/iscsi-error-for-netapp.html

naimhb
Contributor

You may want to try running the following from the command-line interface:

filer> priv set advanced

filer> statit -b
(wait 30 seconds)

filer> statit -e

You will see a large amount of data spew out to the screen. Look at the disks belonging to your VMware server in the relevant RAID group. If the disk I/O latency is above 50 ms, the SATA drives are definitely causing the problem. SATA disks perform well below this threshold; anything higher will cause performance issues.
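A small sketch of that threshold check. The disk names and msec-per-transfer values here are hypothetical, copied by hand from statit output rather than parsed automatically:

```python
# Flag disks whose average service time exceeds a latency threshold.
# Input is (disk_name, msec_per_transfer) pairs transcribed from statit;
# the 50 ms cutoff follows the suggestion above.
THRESHOLD_MS = 50.0

def slow_disks(samples, threshold=THRESHOLD_MS):
    """Return the (disk, msec) pairs whose latency exceeds the threshold."""
    return [(disk, ms) for disk, ms in samples if ms > threshold]

# Hypothetical readings for three disks in one RAID group.
samples = [("0a.16", 12.4), ("0a.17", 68.9), ("0a.18", 55.1)]
for disk, ms in slow_disks(samples):
    print(f"{disk}: {ms} ms, over threshold")
```

If most spindles in the VMware RAID group show up here during the 30-second sample, the disks themselves are the bottleneck rather than the iSCSI path.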
