VMware Cloud Community
_VR_
Contributor

EqualLogic Performance Issues

A CALL FOR HELP

I've spent a week trying to troubleshoot an issue with a new EqualLogic PS4100X. A case was opened with Dell a week ago, and after multiple escalations it has gone absolutely nowhere. I wanted to see if anyone could add some insight.

IOMeter test result:

SERVER TYPE: Windows 2008 R2
HOST TYPE: DL380 G7, 72GB RAM; 2x XEON E5649 2.53 GHz 6-Core
SAN Type: Equallogic PS4100X / Disks: 600GB 10k SAS / RAID LEVEL: Raid50 / 22 Disks / iSCSI
TEST NAME                        Av. Resp. Time (ms)   Av. IOPS   Av. MB/s
Max Throughput - 100% Read                18              3217        101
RealLife - 60% Rand / 65% Read            13              3438         27
Max Throughput - 50% Read                 19              3199        100
Random 8k - 70% Read                      13              3463         27

DESCRIPTION OF PROBLEM:

The PS4100X has a system bottleneck that limits throughput to 100 MB/s. When a single host is connected with a single path, either eth0 or eth1 on the PS4100X can max out at 1 Gbit/s. When multiple hosts or multiple paths are connected (tested with 2-8 concurrent paths and 2-6 host NICs), the throughput of eth0 and eth1 each drops to half that speed (500 Mbit/s). The combined throughput of both Ethernet adapters never exceeds 1 Gbit/s. The unit has been upgraded to the latest firmware, v5.2.1.

SEE TEST RESULTS HERE:

1. Shows eth1 being maxed out in single path, then the connection switches to multipath
2. Shows eth0 being maxed out in single path, then the connection switches to multipath
3. Shows two concurrent tests from two separate test hosts

RULING OUT NETWORK ISSUES:

I'm able to replicate the above problem in the following configurations:
Test host connected to PS4100X via Cisco 6509
Test host connected to PS4100X directly via crossover cable (two active iSCSI paths set up manually)
Test host connected to PS4100X via dedicated unmanaged netgear switch
I can further prove that the Cisco 6509 is functioning properly because I'm able to achieve speeds of 180 MB/s+ to the production PS6000XV and the production PS4000E.

RULING OUT HOST ISSUES:

Tested from a host running Windows 2008 R2 and another host running Windows 2003. Both test hosts encounter the issue described above. Both hosts show speeds of 180MB/s+ when running tests against the two Equallogics in production.

DEALING WITH DELL-EQUALLOGIC SUPPORT HELL:

The analyst I'm currently dealing with says the PS4100X is working as expected. He refuses to do any further troubleshooting because some of the blades in the Cisco 6509 carry QoS and VoIP. The blade the SAN and test hosts are connected to has no QoS or VoIP configured.

56 Replies
alex555550
Enthusiast

Thanks for the update. And what was the resolution for the problem in your first thread? The KB is for Intel NICs.

_VR_
Contributor

Unresolved

alex555550
Enthusiast

You have a PM.

alex555550
Enthusiast

Could anybody with the same problem as _VR_ and a new EqualLogic please check on the switch whether flow control is up? At the moment it looks like flow control never comes up on the new PS4100.

This is the output when it's not working; the flow control oper state must be on:

Switch# show flowcontrol interface gigabitEthernet 5/5
Port       Send FlowControl  Receive FlowControl  RxPause TxPause
           admin    oper     admin    oper
---------  -------- -------- -------- --------    ------- -------
Gi5/5      off      off      desired  off         0       0
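
For reference, this is roughly how I would force receive flow control on the storage-facing port so it is not left at "desired" (a minimal sketch only; the interface number is just the example from above, and whether oper actually comes up still depends on what the array negotiates):

Switch# configure terminal
Switch(config)# interface gigabitEthernet 5/5
Switch(config-if)# flowcontrol receive on
Switch(config-if)# end
Switch# show flowcontrol interface gigabitEthernet 5/5
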
CraigD
Enthusiast

Is there any update on this situation?  I am about to place an order for either a 4100X or a 4100XV and wonder if I am going to experience the same problems.  My three hosts are IBM x3650 M2 boxes with two built-in Broadcom NIC ports and a quad-port Intel 82571EB NIC.

I will probably replace my current physically-dedicated SAN switches with a pair of HP 2510-24G units.

alex555550
Enthusiast

No, still not fixed at my site, and the cause is unknown.

Real-life performance is OK, but read performance is still poor.

Message edited by alex555550

JNixERG
Contributor

I'm having the exact same issue you're seeing.  Alex - I've shot you a PM with more details...

JNixERG
Contributor

So I've been working on this a lot today and I've made a little progress.  Equallogic support sent me this doc:

http://www.equallogic.com/WorkArea/DownloadAsset.aspx?id=10799

Specifically, it mentions the storage heartbeat VMkernel port for iSCSI. I did not have this configured (I am using the MEM). For a while my retransmits in SAN HQ have been above normal (under 1%, but above the 0.1% they should stay below). I was attributing this to the flow control issue that Alex pointed out. After I added the Storage Heartbeat VMkernel port, my retransmits fell to under 0.1% and my transfer rate is much higher. I would suggest trying this if you haven't already.
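
In case it helps anyone, this is roughly how the heartbeat port can be added from the ESXi shell (a sketch only; the vSwitch name, port group name, vmk number, IP and MTU below are placeholders for your own values, and the doc's requirement that the heartbeat be the lowest-numbered vmk on that vSwitch still applies):

esxcli network vswitch standard portgroup add --portgroup-name=iSCSI-Heartbeat --vswitch-name=vSwitch1
esxcli network ip interface add --interface-name=vmk2 --portgroup-name=iSCSI-Heartbeat
esxcli network ip interface set --interface-name=vmk2 --mtu=9000
esxcli network ip interface ipv4 set --interface-name=vmk2 --ipv4=10.10.10.50 --netmask=255.255.255.0 --type=static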

On a side note, I have a replica environment for failover with a PS6100 and Dell R710 hosts with Intel PT quad-port cards (I'm using the same cards in the primary, except it's a PS6500). That site has the same switches (Cisco 3750) but slightly older firmware (12.2.55 vs 12.2.58), and it is reporting proper flow control on the storage ports. I'm not sure if the switch firmware has anything to do with it, but I plan on opening a TAC case soon to see what they say.

alex555550
Enthusiast

Hi,

thanks for the update. I have this problem with all my Ciscos.

alex555550
Enthusiast

Hi,

OK, Cisco finally found a bug in IOS.

CISCO BUG ID: CSCty55093

JNixERG
Contributor

This is good news. I am in the process of renewing our SmartNet agreements to discuss this with Cisco myself. What's interesting is that I'm not using IOS 15. I have used IOS 15 in a few environments in the past, and in every single one of them I've had switch crashes in the middle of the night (about one every few months). I do NOT see a flow control problem with switches running 12.2(55)SE5, but I am seeing the problem with 12.2(58)SE2 and 15. These are all 3750G switches.

I've sent you a PM as well.  Thanks for the update on this.

alex555550
Enthusiast

The crashes also happen on IOS 15 and C3750-X units in a stack. No crash log is written.

KetchAdmin
Contributor

I am having the exact same issue as well, and the sad thing is I am using all Dell equipment: Dell R720 servers and a pair of stacked 6248 switches. I've changed the IOPS parameter on all the iSCSI interfaces from 1000 to 3, set them to default to Round Robin, and turned off LRO and Delayed ACK, but still no change in throughput. The maximum I've been able to get has been around 110 MB/s. Did anybody find a solution for this issue? Any help would be very much appreciated.
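
For anyone comparing notes, these are roughly the esxcli commands involved in the Round Robin / IOPS change (a sketch only; the naa device ID below is just a placeholder for your EqualLogic volume, and you can check which SATP your array uses with "esxcli storage nmp satp list" first):

esxcli storage nmp device list
esxcli storage nmp device set --device naa.6090a0XXXXXXXXXXXXXXXXXXXXXXXXXX --psp VMW_PSP_RR
esxcli storage nmp psp roundrobin deviceconfig set --device naa.6090a0XXXXXXXXXXXXXXXXXXXXXXXXXX --type iops --iops 3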

steffan
Contributor

Hi


We have been seeing this problem for years now. I have tried a lot with Dell, VMware and also IBM, but I have the same issues you have...

We're on VMware 5.1, 2 EQL groups (PS6000 and PS5000), firmware 5.2.2, using onboard Broadcom and quad-port Intel cards.

I have never seen more than 80-100 MB/s....

steffan
Contributor

We should maybe try to share our VM network setups, as this could be a big factor in performance.

I have 3 NICs in a vSwitch and use 3 VMkernel ports. On top I have two VM networks.

So the ESXi servers use 3x 1 Gbit NICs to the groups (switches: Dell 6248, stacked).

The VMs (as seen in the screenshot): ISCSI1 uses VMNIC3, ISCSI2 uses VMNIC4 (and VMNIC5 as standby).

Inside the VMs I use VMXNET3 with default settings (except that I have set jumbo frames to 9000).

I haven't changed anything in ESXi (with regard to TCO, TSO, LRO, etc.). We use MEM 1.1.
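
One thing I still want to double-check is that jumbo frames really pass end to end from the iSCSI VMkernel ports to the group. A quick test from the ESXi shell (the address is just a placeholder for the group IP; 8972 bytes of payload plus headers makes a full 9000-byte frame, and -d forbids fragmentation):

vmkping -d -s 8972 10.10.10.10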

So if some of you see trouble with my setup above, feel free to give me a hint. Everything was set up according to the manual.

But the problem is that the performance is quite bad....

_VR_
Contributor

Has anyone tested V5.2.4 firmware yet?

steffan
Contributor

I have not, but I don't think it will solve the problem.

_VR_
Contributor

steffan,

It's not a VMware or network configuration issue. I've tested performance using multiple different physical servers as well as 3 different types of network switches. All configurations experienced the same issue.

At the same time I have two EQL devices running 5.0.7 that are not seeing any issues.

alex555550
Enthusiast

The problem with iSCSI in ESXi 5 U1 only shows up after a reboot. To avoid this bug, please set Failback to Yes on the iSCSI vSwitch. Before U1 the solution was to set Failback to No.
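
If you prefer the ESXi shell over the client, the setting can be checked and changed roughly like this (a sketch; vSwitch1 is just a placeholder name for your iSCSI vSwitch):

esxcli network vswitch standard policy failover get --vswitch-name=vSwitch1
esxcli network vswitch standard policy failover set --vswitch-name=vSwitch1 --failback=true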

KetchAdmin
Contributor

After working on this issue for over a month with EqualLogic Level 3 support, the Dell PowerEdge group (responsible for VMware issues within Dell) and directly with VMware, I think we might have found a resolution to this issue... That being said, there are a number of things that can cause this problem, so I'll try to include some troubleshooting steps that might be helpful in narrowing down the cause in your particular environment...

I would highly recommend you carefully read "Configuring iSCSI Connectivity with VMware vSphere 5 and Dell EqualLogic PS Series Storage"... When we first started working on this issue, this document was not even available (they still only had the best practices for vSphere 4.1)... I would highly recommend you set up the Storage Heartbeat port as per the recommendations outlined in this document (it needs to be the lowest-numbered VMkernel port on the vSwitch, which means this port needs to be created first on the vSwitch; also enable jumbo frames on it if you are using jumbo frames in your environment)...

There still seem to be some weird issues with the VMware ESXi 5 software iSCSI initiator even after Update 1 (build 5.0.0, 623860)... So it might not be a bad idea to start with a new vSwitch (especially because the Storage Heartbeat needs to be the lowest-numbered VMkernel port on that vSwitch). You can move the NICs one at a time to the new vSwitch if you cannot afford to bring down the host...
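
If you go that route, the basic esxcli steps look roughly like this (a sketch only; vSwitch2, vmnic3 and the MTU are placeholders, and you would move one uplink at a time so the old vSwitch keeps a live path):

esxcli network vswitch standard add --vswitch-name=vSwitch2
esxcli network vswitch standard set --vswitch-name=vSwitch2 --mtu=9000
esxcli network vswitch standard uplink remove --uplink-name=vmnic3 --vswitch-name=vSwitch1
esxcli network vswitch standard uplink add --uplink-name=vmnic3 --vswitch-name=vSwitch2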

As one of the first troubleshooting tips, enable SSH on the host and run the esxtop command, then press n to display the live performance data of only the network interfaces on that ESX host... But before you do this, please note which vmnics you have assigned for iSCSI traffic... I am almost certain you will only see traffic flowing through one of the assigned vmnics, hence the kind of performance numbers we all have been seeing (a little over 100 MB/s throughput and around 3200+ IOPS, which by the way is approximately the theoretical maximum of a single gigabit port at this block size)... This was happening despite the fact that we had Round Robin enabled and it was showing both paths as active (I/O)...

Also verify this from the controller side of things as well... If you have EqualLogic SAN HQ running in your environment, look under Network -> Ports... You should see roughly the same amount of data sent and received on both of the iSCSI gigabit ports on the controller...

If I am not mistaken, when properly configured, each VMkernel port assigned to iSCSI traffic on the host creates an individual iSCSI connection to the volumes on the PS Series array... Please verify that there are indeed two connections for each volume on the array (this can be done via the EQL Group Manager or via the array console)...

If you are using the MEM and the HIT Kit, I think a lot of the tuning and performance optimizations are taken care of for you, but in our environment we do not have VMware Enterprise licensing, so we had to use Round Robin. There are a few performance tuning changes (change the IOPS threshold for Round Robin from 1000 to 3, disable Delayed ACK and LRO) you can make on the ESXi host that make a substantial difference, especially in response times... Let me know if anybody is interested in any of these settings and I can post them as well...
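
To get you started, disabling LRO on the host can be done from the ESXi shell roughly like this (a sketch; Delayed ACK can be unchecked under the software iSCSI adapter's Advanced settings in the vSphere Client, and the Round Robin IOPS change uses the same esxcli commands mentioned earlier in the thread):

esxcli system settings advanced set --option /Net/TcpipDefLROEnabled --int-value 0
esxcli system settings advanced list --option /Net/TcpipDefLROEnabled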

Currently I am getting a maximum throughput (100% read, 0% random, 32 KB request size) of ~234 MB/s, 7485 IOPS, with an average response time of 8.5 ms.

Hopefully this will give you a starting point in diagnosing and finding a resolution to the performance issues…
