VMware Cloud Community
HansdeJongh
Enthusiast

Low throughput and high latency on read I/Os on EqualLogic storage

Hello,

I have multiple Dell PowerEdge R610/R620 servers connected through a stack of four PowerConnect 6224/6248 switches to four EqualLogic arrays (one PS5000E, two PS4000Es, and one PS6100XV).

The PS5000E is in pool 1.

The two PS4000Es are in pool 2 (holds 2 datastores, presented to all 7 ESX hosts).

The PS6100XV is in pool 3 (holds 7 datastores, presented to all 7 ESX hosts).

I have tested against both pool 2 and pool 3. Most of the problems show up on pool 3, but it also happens on pool 2.

Each ESX host is configured the way EqualLogic recommends: two NICs dedicated to iSCSI traffic, used purely for VMFS.

The vmknics are bound correctly to the software iSCSI adapter, and I don't use those NICs for anything else.
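For reference, this is how the binding can be checked from the CLI (a sketch for ESXi 5.x; vmhba33 is a placeholder for the software iSCSI adapter, yours will differ):

# esxcli iscsi adapter list
# esxcli iscsi networkportal list -A vmhba33

The second command should list both iSCSI vmknics bound to the adapter.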

Windows guest VMs also have access to the SAN through the Microsoft iSCSI initiator and two separate vmnics on all ESX hosts.

Windows guest iSCSI traffic does NOT experience the same problem.

The problem is easily reproduced: when I run random or sequential read I/Os from IOMeter with a 64 KB I/O size, I get latency of 140+ ms and throughput of around 28 MB/s.


I have been working on this case for more than a week now and have, as far as I know, tried everything.

I have created a document in which I have put all my thoughts and results:

Problems:

2 paths slow

Throughput per path is 14 MB/s and latency is 141 ms per path (230 IOPS)

If I disable one path, throughput stays the same but latency goes up to 300 ms

Question: why does the latency double when disabling one path?

1 path slow and 1 path fast

First path:

1600 IOPS / 100 MB/s / 19 ms latency

Second path:

230 IOPS / 14 MB/s / 142 ms latency

Question: why is the latency on the bad path only 142 ms and not 300 ms?

2 paths fast

Both paths: 1600+ IOPS / 100 MB/s / 17 ms per path

I/O size comparison (see pictures and xlsx sheet):

  • Paths never automatically become fast again once they have gone “bad”.
  • Paths go bad after a while.
  • If I reset the switchport of the affected path, the problem is gone for that path.
  • Sometimes, if I reset the switchport of an affected path, the problem on the other path also gets solved.
  • The problem only occurs with big I/Os.
  • On Broadcom NICs the problem also shows up, but it’s a lot better (50 MB/s and latency of around 70-90 ms).
  • If the paths are good, I can keep the test running for 24 hours without any issues, but if I stop the test and restart it soon afterwards (within an hour?) the problem suddenly occurs.
  • It is always either bad or good right from the moment I start the test.
  • If I move my test volume from pool 3 to pool 2 and back, the problem is gone for that test volume. When I only run the 64 KB test, it takes a lot longer before the paths go bad than when I first run through 0.5/4/16/32/64/128 KB; almost always the paths go bad after the first test run…
  • When it’s bad for volume 1 on host 1, it isn’t necessarily bad for volume 1 on host 2.
  • It seems to take longer for my test volume to go bad (maybe because it is used less?).
  • Disabling all ACLs doesn’t make any difference.
  • DoS prevention on the switch is disabled.
  • QoS is not active.
  • Flow control, whether active or inactive, doesn’t make any difference.
  • There is a difference between the latency the SAN sees and what esxtop reports: the SAN sees about 10% less than esxtop. On very small I/Os this can go up to 50% (2 ms on the SAN compared to 4 ms in esxtop).
  • I created two port groups (A and B) on the same vSwitch I use for iSCSI, each bound to the same NICs as the vmknics, one per port. From within my test VM, running the tests against an ESX datastore triggers the problem right away, but I have no problem when I access a Windows (guest iSCSI) volume.
  • Disabling the NIC in ESX (esxcli network nic down -n vmnic5) and then enabling it again solves the problem for that path (see the sketch below).
  • Wireshark traces show the same latency when I use Statistics > Service Response Time > SCSI.

I have captured this both on the virtual switch (port group in promiscuous mode) and by mirroring both the uplink of the ESX host and the uplink of the SAN.
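For completeness, this is the bounce that clears a path (ESXi 5.x syntax; vmnic5 is just my iSCSI uplink, adjust for your own):

# esxcli network nic down -n vmnic5
# esxcli network nic up -n vmnic5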

Test

I have reset all counters on the PowerConnect switch (9:35 CET). On ESX09 the time is 07:39 UTC.

Then I started a test run; it is running on datastore “ESX-SAS-03”.

One hour later, I don’t see any dropped frames on the interfaces in use, nor any pause frames being sent or received, and there are no warnings or errors on the EQL or VMware side.

There were no other VMs running on the host at that time.

Please help :( because Dell/EqualLogic support isn’t taking me seriously. They blame my switches because I route iSCSI traffic between VLANs and have ACLs (but that iSCSI traffic is not affected), and they tell me to break down the stack. They didn’t even talk to us; everything happened through email. There was no WebEx whatsoever...

Regards

Hans

3 Replies
HansdeJongh
Enthusiast

This is what Wireshark sees when I look at scsi.request_frame:

wireshark.jpg

dwilliam62
Enthusiast

Something to try: make sure that Delayed ACK and Large Receive Offload (LRO) are disabled.

Here's a VMware KB on how to disable Delayed ACK:

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=100259...
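If you prefer the command line, something like this should work on ESXi 5.x (vmhba36 is a placeholder for your software iSCSI adapter, check esxcli iscsi adapter list first; existing iSCSI sessions may need to be re-established before the change takes effect):

# esxcli iscsi adapter param set -A vmhba36 -k DelayedAck -v false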

Disabling LRO:

HOWTO: Disable Large Receive Offload (LRO) in ESX v4/v5
Within VMware, the following command will query the current LRO value.

# esxcfg-advcfg -g /Net/TcpipDefLROEnabled

To set the LRO value to zero (disabled):

# esxcfg-advcfg -s 0 /Net/TcpipDefLROEnabled

NOTE: a server reboot is required.


Info on changing LRO in the guest network:

http://docwiki.cisco.com/wiki/Disable_LRO
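For a Linux guest this can usually be done at runtime with ethtool (eth0 is an assumed interface name; making it persistent depends on your distro):

# ethtool -K eth0 lro off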

What version of ESX?  What's the build #?


What MPIO pathing are you using?   I.e.  Fixed, Round Robin,  Dell MEM?   (If MEM, what version?)

If you are using Round Robin, then you are probably using the default IOs-per-path value of 1000. That should be changed to 3. Depending on which version of ESX you're on, I have a script that will change that for you.
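As a sketch of what that script does on ESXi 5.x (it assumes your EqualLogic volumes show up with the naa.6090a0 device prefix; verify with esxcfg-scsidevs -c before running):

# for dev in `esxcfg-scsidevs -c | awk '{print $1}' | grep naa.6090a0`; do esxcli storage nmp psp roundrobin deviceconfig set -d $dev --iops 3 --type iops; done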

This link has some agreed-upon suggestions about iSCSI settings:

http://virtualgeek.typepad.com/virtual_geek/2009/09/a-multivendor-post-on-using-iscsi-with-vmware-vs...

Super6VCA
Expert

Hans,

Have you found the answer to this issue? I have pretty much the same issue: very low read IOPS and extremely high latency. Very frustrating! Have you talked with Dell to get any answers? What do you see when you look at ESXTOP? Do you see any issues there?
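By ESXTOP I mean the disk views (standard keystrokes, should apply on ESX 4/5):

# esxtop
(press 'd' for the HBA view or 'u' for the device view; DAVG/cmd is latency at or below the driver, i.e. fabric plus array, while KAVG/cmd is time spent inside the VMkernel)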

Thank you, Perry