_VR_
Contributor

Equallogic Performance Issues

A CALL FOR HELP

I've spent a week trying to troubleshoot an issue with a new Equallogic PS4100X. A case was opened with Dell a week ago, and after multiple escalations it has gone absolutely nowhere. I wanted to see if anyone can add some insight.

IOMeter test result:

SERVER TYPE: Windows 2008 R2
HOST TYPE: DL380 G7, 72GB RAM; 2x XEON E5649 2.53 GHz 6-Core
SAN Type: Equallogic PS4100X / Disks: 600GB 10k SAS / RAID LEVEL: Raid50 / 22 Disks / iSCSI
TEST NAME                    Av. Resp. Time (ms)   Av. IOPS   Av. MB/sec
Max Throughput-100%Read              18              3217        101
RealLife-60%Rand-65%Read             13              3438         27
Max Throughput-50%Read               19              3199        100
Random-8k-70%Read                    13              3463         27

DESCRIPTION OF PROBLEM:

The PS4100X has a bottleneck that limits total throughput to 100MB/s. When a single host is connected over a single path, either eth0 or eth1 on the PS4100X can individually max out at 1Gbit/s. When multiple hosts or multiple paths are connected (tested with 2-8 concurrent paths and 2-6 host NICs), the throughput of eth0 and eth1 each drops to half that speed (500Mbit/s). The combined throughput of both Ethernet adapters never exceeds 1Gbit/s. The unit has been upgraded to the latest firmware (v5.2.1).
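A quick sanity check on the IOMeter numbers above. Assuming the standard community test profile these result templates normally use (32 KiB transfers for the Max Throughput runs, 8 KiB for the RealLife/Random runs; the block sizes are an assumption, not stated in the post), IOPS multiplied by block size reproduces the MB/s column almost exactly:

```shell
# Assumed block sizes: 32 KiB for the Max Throughput runs, 8 KiB for the 8k runs.
echo $(( 3217 * 32 / 1024 ))   # Max Throughput-100%Read  -> prints 100 (MB/s)
echo $(( 3438 * 8 / 1024 ))    # RealLife-60%Rand-65%Read -> prints 26  (MB/s)
```

In other words, the array is not IOPS-bound in these tests; the MB/s figures are pinned at roughly what a single 1Gbit/s port can carry.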

SEE TEST RESULTS HERE:

1. Shows eth1 being maxed out in single path, then the connection switches to multipath
2. Shows eth0 being maxed out in single path, then the connection switches to multipath
3. Shows two concurrent tests from two separate test hosts

RULING OUT NETWORK ISSUES:

I'm able to replicate the above problem in the following configurations:
Test host connected to PS4100X via Cisco 6509
Test host connected to PS4100X directly via crossover cable (two active iSCSI paths set up manually)
Test host connected to PS4100X via dedicated unmanaged Netgear switch
I can further prove that the Cisco 6509 is functioning properly because I'm able to get 180MB/s+ to the production PS6000XV and the production PS4000E.

RULING OUT HOST ISSUES:

Tested from a host running Windows 2008 R2 and another host running Windows 2003. Both test hosts encounter the issue described above. Both hosts show speeds of 180MB/s+ when running tests against the two Equallogics in production.

DEALING WITH DELL-EQUALLOGIC SUPPORT HELL:

The analyst I'm currently dealing with says the PS4100X is working as expected. He refuses to do any further troubleshooting because some of the blades in the Cisco 6509 carry QoS and VoIP. The blade the SAN and test hosts are connected to has no QoS or VoIP configured.

56 Replies
steffan
Contributor

Hi KetchAdmin,

which switches do you use for iSCSI in your setup?

I asked Dell about the IOPS threshold, and they told me it is not supported. Do you have an official answer from VMware or Dell regarding that setting?

Kind regards,

Steffan

0 Kudos
sparrowangelste
Virtuoso

This thread is VERY interesting. I'm surprised Equallogic was not able to get you guys fixed...


--------------------- Sparrowangelstechnology : Vmware lover http://sparrowangelstechnology.blogspot.com
0 Kudos
dwilliam62
Enthusiast

re: IOPS value.   Dell/EQL, HP, EMC, VMware, etc. agreed on some general principles about iSCSI configuration. One of them was to change the IOs per path from 1000 to 3. If you google "multivendor iscsi post" you will find it on several blogs.

http://en.community.dell.com/techcenter/storage/w/wiki/2671.aspx

One thing about Delayed ACK: you have to verify that the change actually took place. If you just change the setting, it appears that only newly discovered LUNs inherit the disabled value. I find that, while in maintenance mode, removing the discovery address and any discovered targets (in the static discovery tab), then disabling Delayed ACK, re-adding the discovery address, and rescanning resolves this.

At the ESX console run:   vmkiscsid --dump-db | grep Delayed   All the entries should end with ='0' for disabled.

I just posted this to another thread on EQL perf issues:

http://communities.vmware.com/message/2078917#2078917

Common causes of performance issues that generate that alert are:

1.)  Delayed ACK is enabled.

2.)  Large Receive Offload (LRO) is enabled

3.)  MPIO pathing is set to FIXED

4.)  MPIO is set to VMware Round Robin but the IOs per path is left at default of 1000.  Should be 3.

5.)  VMs with more than one VMDK (or RDM) are sharing one Virtual SCSI adapter.  Each VM can have up to four Virtual SCSI adapters.

6.)  iSCSI switch not configured correctly or not designed for iSCSI SAN use.

This thread has some specific instructions on how to disable Delayed ACK as well.
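For reference, a minimal sketch of what fixing items 2-4 looks like from the ESXi 5.x shell. This is not taken from the thread, and the device ID naa.xxx is a placeholder for your EQL volume:

```shell
# Items 3/4: set the path policy to Round Robin on the device (naa.xxx is a
# placeholder), then lower the IOs-per-path switch point from the default 1000 to 3.
esxcli storage nmp device set --device naa.xxx --psp VMW_PSP_RR
esxcli storage nmp psp roundrobin deviceconfig set --device naa.xxx --type iops --iops 3

# Item 2: disable Large Receive Offload via the host-wide advanced setting.
esxcli system settings advanced set --option /Net/TcpipDefLROEnabled --int-value 0
```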

Regards,

0 Kudos
dwilliam62
Enthusiast

Re: Storage Heartbeat VMK port.  (Mentioned in EQL iSCSI config guides for ESX)

This will never affect performance.   The lowest VMK port in each IP subnet is the default device for that subnet. With iSCSI MPIO, when VMK ports are on the same subnet, this can cause problems when the link associated with that VMK port goes down (cable pull; rebooted, failed, or powered-off switch; etc.): ESX will still use that port to reply to ICMP and jumbo-frame SYN packets. During the iSCSI login process, the EQL array pings the source port from the array port it wants to use to handle that session request. If that ping fails, the login fails. The Storage Heartbeat makes sure that default VMK port is always able to respond to the ping. That VMK port must NOT be bound to the iSCSI adapter.
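A hedged sketch of what creating that Storage Heartbeat port looks like on ESXi 5.x (the portgroup name, vmk number, and address below are placeholders, not from this thread):

```shell
# Create an extra VMK port on the iSCSI subnet and leave it UNBOUND from the
# iSCSI software adapter. The config guides have you make this the
# lowest-numbered VMK on the iSCSI subnet, so it becomes the subnet's default
# interface and can always answer the array's login-time ping.
esxcli network ip interface add --interface-name vmk1 --portgroup-name iSCSI-Heartbeat
esxcli network ip interface ipv4 set --interface-name vmk1 \
    --ipv4 10.10.10.9 --netmask 255.255.255.0 --type static
```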

Regards,

0 Kudos
faceytime
Contributor

I had to update the NIC driver for the Broadcoms on our 620s to get it decent. After the firmware update to 5.2.4h1, latency went to god-high numbers (20k+) and, until the driver update, would just stay there nice and high. Horrible, horrible performance.

We are a 100% virtualized shop, so it was killing us. Needless to say, L1 support was less than useful.

0 Kudos
alex555550
Enthusiast

Hi all,

with the new Equallogic firmware 6.0.1 it seems better. Also, Cisco released IOS update 15.0(2)SE, which solves the flow control issue.

0 Kudos
alex_wu
Enthusiast

hello,

have you resolved the flow control issue?

I've hit the same issue. My Cisco IOS is 12.2(58)SE2 on a Cisco 2960S switch.

Switch# show flowcontrol interface gigabitEthernet g1/0/21

Port       Send FlowControl  Receive FlowControl  RxPause TxPause
           admin    oper     admin    oper
---------  -------- -------- -------- --------    ------- -------
Gi1/0/21   Unsupp.  Unsupp.  desired  off         0       0

Port 21 is connected to the EQL, but the "oper" value under Receive FlowControl is always "off".

0 Kudos
chimera
Contributor

Re: Storage Heartbeat VMK port.  (Mentioned in EQL iSCSI config guides for ESX)

This will never affect performance. 

Saying it will "never" affect performance is incorrect.  Without storage heartbeat, performance can still be affected.  If storage heartbeat is not set up and a NIC (with the lowest vmk mapping) fails, then "the Equallogic will not be able to accurately determine connectivity during the login process, and therefore suboptimal placement of iSCSI sessions will occur". If a storage heartbeat vmk is set up, it gets tied to both pNICs, so if a NIC fails, the array uses the second NIC to determine optimal placement of iSCSI sessions. So the heartbeat vmk in itself won't directly affect performance, but to say it will never affect performance is incorrect.

0 Kudos
dwilliam62
Enthusiast

Sorry that I wasn't clear.

Under normal conditions the SHB VMK port has no impact, since no iSCSI traffic passes through it. However, a network outage isn't what I would consider a "performance" problem: without the SHB you can end up with a total outage, not just degraded performance. I've seen it happen. That's what I was referring to.

FYI: This issue is resolved in ESXi v5.1.  So you no longer need the SHB VMK port.

0 Kudos
0v3rc10ck3d
Enthusiast

Sorry to ask the dumb questions, but I just want to clarify what the problem is.

You have an array with two 1Gb links connected on the active controller, and your hosts have multiple 1Gb links (how many?) connected to the switching.

You are seeing a 100MB/s limit on IOMeter running within a single VM.

You say for the array "The combined throughput of both ethernet adapters can never exceed 1Gbit/s"

So for a single 1 gigabit link, the maximum theoretical throughput is roughly 108 megabytes per second.
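The arithmetic behind that figure, for anyone following along (the roughly 13% deduction for protocol framing is a rule of thumb, not a measured value):

```shell
# One 1 Gbit/s link expressed in MB/s, before protocol overhead:
echo $(( 1000000000 / 8 / 1000000 ))   # prints 125
# Knock off roughly 13% for Ethernet/IP/TCP/iSCSI framing and you land
# near the ~108 MB/s practical ceiling quoted above.
```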

If you have two separate VMs on two separate hosts running IOMeter, are you seeing 50MB/s throughput on each?

VCIX6 - NV | VCAP5 - DCA / DCD / CID | vExpert 2014,2015,2016 | http://www.vcrumbs.com - My Virtualization Blog!
0 Kudos
_VR_
Contributor

>> If you have two separate VM's running on two separate hosts running I/O meter you are seeing 50MB/s throughput on each?
Correct.
Here is what it looks like on the EQL side:
eth0 enabled, eth1 disabled = 100 MB/s
eth0 disabled, eth1 enabled = 100 MB/s
eth0 enabled, eth1 enabled = 100MB/s (50MB/s on eth0, 50MB/s on eth1)
It doesn't matter how many hosts are running concurrent tests, how many NICs each host has, or how many paths are set up. The results are always the same.
0 Kudos
dwilliam62
Enthusiast

Do you still have a support case open with Dell?

Don

0 Kudos
Kaeon
Contributor

Hi KetchAdmin,

I would very much like to know what settings you used to get your Equallogic speed up. Also, what tool do you use to get your performance numbers? We are using Equallogic storage with vSphere 5.1 and MEM configured.

BR

0 Kudos
dtwilley
Contributor

I know its been a while but I found this while trying to diagnose a fault of our own.

Environment

PS4100X (24x600GB SAS), 2 x Cisco 3560X (IOS 12.2(55)), 3 x HP DL360p (2 x quad-port Broadcom 5719 NICs), VMware 5.1 U2 (HP custom image).

Problem

We see 50MB/s in total from the Equallogic regardless of how many hosts and NICs are used: if we test from two hosts the speed halves, and if we test from two NICs we see 25MB/s on each NIC, or 50MB/s if only one NIC is tested.

I followed advice from this post as well as following all the usual best practice guides.

  • Delayed ACK Disabled
  • IOPS threshold changed from 1000 to 3
  • Login Timeout changed to 60
  • Equallogic Firmware updated from 6.0.2 to 6.0.4 then eventually to 7.0.2
  • Ran the Hard drive firmware upgrade utility
  • Broadcom TG3 drivers upgraded from 3.12.x to 3.13.x (both 50.1 and 50.2)
  • Firmware upgraded on the hosts using the built in HP Intelligent Provisioning Utility
  • Cisco config has flow control set to desired and confirmed working on all iSCSI ports, portfast enabled, storm control confirmed disabled, and VLAN separation in place, again as per the guides
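A quick way to spot-check several of the items above from the ESXi shell (a sketch; exact output formats vary by build):

```shell
# Delayed ACK: every returned line should end with ='0' (disabled).
vmkiscsid --dump-db | grep Delayed

# LRO: "Int Value" should read 0 after disabling.
esxcli system settings advanced list --option /Net/TcpipDefLROEnabled

# Path policy: expect VMW_PSP_RR with policy=iops,iops=3 in the device config.
esxcli storage nmp device list | grep -i "path selection policy"
```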

I logged a call with Dell, who wanted diags and SAN HQ output; they said the SAN was fine.  I logged a ticket with VMware and they looked at DAVG in esxtop, which was showing values of 100-1000 during IO operations, so VMware pointed me at the switches, since they said it wasn't VMware and Dell had already said it wasn't the SAN.

I also found millions of oversize packets on the iSCSI ports. It's almost as if the Equallogic isn't correctly discovering the path MTU and the Cisco is having to fragment packets before transmitting them to the hosts, maybe inducing the lag.
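One way to test that MTU theory from a host (assuming jumbo frames are intended end to end; the group IP below is a placeholder):

```shell
# 8972 = 9000-byte MTU minus 20 bytes of IP header and 8 bytes of ICMP header.
# -d sets don't-fragment, so any link with a smaller MTU fails the ping
# instead of silently fragmenting.
vmkping -d -s 8972 10.10.10.20
```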

I checked all the switch config and confirmed flow control was operational on the ports set to desired (as per this post), thanks.

Eventually I went to the datacentre, disconnected all the cables to the SAN, plugged in a laptop, ran IOMeter, and did some basic file copies between SAN volumes, and again could only get 50MB/s.

I got Dell on a remote session via the wireless on the laptop so they could see it directly, and they've agreed to send out a specialist engineer Monday morning, so we'll see what happens next.  They're also sending some spares, but as yet I don't know what they are.

This whole process started on the Monday and it took me until Friday to convince Dell of the problem.

I'll update this post when we know more in case anyone else has to go through the same hell I have.

Dave Twilley VCP5

Datek Solutions Ltd.

RamzyMasarweh
Contributor


Thanks a lot, Dave. Please keep us posted.

regards,

Ramzy

0 Kudos
dtwilley
Contributor

So I met the Dell Skytech guy at the datacentre this morning. His laptop gets 100MB/s either direct to the SAN or via the switches, so it looks like my laptop test was just an anomaly!  He spoke to some guys at base and they said to try an earlier TG3 driver; that made no difference, but it's interesting that the one on record as tested/working with EQL is the 3.12xx driver, which is known to corrupt datastores!

Anyway, we're now escalating it with Dell as a Broadcom NIC issue, as we have nowhere left to go.  I have an Intel quad-port NIC on standby.

In the meantime I'm sticking vSphere 5.5 on the host and will see what happens.  What's annoying is we see 100-200MB/s on the NICs when doing vMotion, so this is either limited to just iSCSI or it's not a Broadcom issue.

0 Kudos