I have multiple Dell PowerEdge R610/R620 servers connected through a stack of 4 PowerConnect 6224/6248 switches to 4 EqualLogic arrays (1x PS5000E, 2x PS4000E, 1x PS6100XV).
The PS5000E is in pool 1.
The 2 PS4000Es are in pool 2 (holds 2 datastores, all presented to 7 ESX hosts).
The PS6100XV is in pool 3 (holds 7 datastores, all presented to 7 ESX hosts).
I have tested against both pool 2 and pool 3. Most of the problems are on pool 3, but it also happens on pool 2.
Each ESX host is configured the way EqualLogic recommends: 2 NICs dedicated to iSCSI traffic, purely for VMFS.
The vmnics are bound correctly, and I don't use those NICs for anything else.
Windows guest VMs also have access to the SAN through the Microsoft iSCSI initiator and 2 separate vmnics on all ESX hosts.
Windows guest iSCSI traffic does NOT experience the same problem.
The problem is easily reproduced: when I run random or sequential read I/Os from Iometer with a 64 KB I/O size, I get a latency of 140+ ms and a throughput of 28 MB/s.
I have been working on this case for more than a week now and, as far as I know, I have tried everything.
I have put all my ideas and results into a document:
2 paths slow
Throughput per path is 14 MB/s and latency is 141 ms per path (230 IOPS).
If I disable one path, throughput stays the same but latency goes up to 300 ms.
Question: why does the latency double when I disable one path?
1 path slow and 1 path fast
1600 IOPS / 100 MB/s / 19 ms latency
30 IOPS / 14 MB/s / 142 ms latency
Question: why is the latency on the bad path only 142 ms and not 300 ms?
2 paths fast
Both paths: 1600+ IOPS / 100 MB/s / 17 ms per path
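One way to sanity-check the numbers in the scenarios above is Little's law: I/Os in flight = IOPS x latency, and throughput = IOPS x I/O size. The Python sketch below plugs in the measured figures. The total of 64 outstanding I/Os is an assumption (the Iometer queue depth is not stated here); it is picked only because 230 IOPS at 141 ms works out to roughly 32 I/Os in flight per path. Under that assumption the latency doubling would not mean the remaining path got slower, only that the same queue now drains through one slow path instead of two.

```python
# Little's law for storage: I/Os in flight = IOPS * latency (in seconds).
# Figures come from the scenarios above, except TOTAL_OUTSTANDING, which is
# an ASSUMPTION: the queue depth used in the Iometer test is not stated.

IO_SIZE_KB = 64
TOTAL_OUTSTANDING = 64  # assumed total I/Os kept in flight by the test


def throughput_mb_s(iops, io_size_kb=IO_SIZE_KB):
    """Throughput in MB/s (1 MB = 1024 KB) for a given IOPS and I/O size."""
    return iops * io_size_kb / 1024


def in_flight(iops, latency_ms):
    """Average number of I/Os in flight implied by Little's law."""
    return iops * latency_ms / 1000


def expected_latency_ms(outstanding, iops):
    """Average latency implied by Little's law for a fixed queue depth."""
    return outstanding / iops * 1000


if __name__ == "__main__":
    print(f"slow path, 230 IOPS  -> {throughput_mb_s(230):.1f} MB/s")
    print(f"fast path, 1600 IOPS -> {throughput_mb_s(1600):.1f} MB/s")
    print(f"slow path in flight  -> {in_flight(230, 141):.1f} I/Os")
    print(f"fast path in flight  -> {in_flight(1600, 19):.1f} I/Os")
    # Two slow paths share the assumed queue, half each.
    half = TOTAL_OUTSTANDING / 2
    print(f"2 slow paths -> {expected_latency_ms(half, 230):.0f} ms per path")
    # One slow path carries the whole assumed queue on its own.
    print(f"1 slow path  -> {expected_latency_ms(TOTAL_OUTSTANDING, 230):.0f} ms")
```

If the assumption holds, this would also fit the mixed case: each path still holds about 30 I/Os, so the bad path stays near 142 ms because only its own share of the queue waits on it. Note that 14 MB/s at 64 KB corresponds to roughly 224 IOPS, so the 30 IOPS figure for the slow path in the mixed case does not line up with its throughput.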
IO size comparison: (see pictures and xlsx sheet.)
- Paths never automatically become fast again once they have gone “bad”
- Paths go bad after a while
- If I reset the switch port of the affected path, the problem is gone for that path.
- Sometimes when I reset the switch port for an affected path, the problem on the other path is solved as well.
- The problem only occurs with large I/Os.
- On Broadcom NICs the problem also shows up, but it is a lot less severe (50 MB/s and a latency of around 70-90 ms).
- If the paths are good I can keep the test running for 24 hours without any issues, but if I stop the test and start it again soon afterwards (1 hour?), the problem suddenly occurs.
- A path is always either bad or good right from the moment I start the test.
- If I move my test volume from pool 3 to pool 2 and back, the problem is gone for that test volume. It seems that when I only run the 64 KB test, it takes a lot longer before the paths go bad than when I first run 0.5/4/16/32/64/128 KB. The paths almost always go bad after the first test run.
- When it's bad for volume 1 on host 1, that doesn't necessarily mean it's bad for volume 1 on host 2.
- It seems to take longer for my test volume to go bad (maybe because it is used less?).
- Disabling all ACLs doesn't make any difference.
- DDoS prevention on the switch is disabled.
- QoS is not active.
- Enabling or disabling flow control doesn't make any difference.
- There seems to be a difference in the latency the SAN sees: the SAN reports about 10% less than esxtop. On very small I/Os this can go up to 50% (2 ms compared to 4 ms in esxtop).
- I created 2 port groups, A and B, on the same vSwitch I use for iSCSI. They are bound to the same NICs as the vmknics, one port group per NIC. Then from within my test VM I ran the tests against an ESX datastore, and the problem occurred right away. But I have no problem when I access a Windows volume.
- Disabling the NIC in ESX (esxcli network nic down -n vmnic5) and then enabling it again solves the problem for that path.
- Wireshark traces show the same latency when I use Statistics \ Service Response Time \ SCSI.
I have tested this both on the virtual switch (port group in promiscuous mode) and by mirroring both the uplink of the ESX host and the uplink of the SAN.
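The NIC down/up workaround from the list above can be wrapped in a small script so the same steps are repeated each time a path goes bad. This is only a sketch: it assumes a Python interpreter and esxcli are available in the shell it runs from, the function names and the 5-second pause are made up for illustration, and it defaults to a dry run because taking an iSCSI uplink down interrupts that path.

```python
import subprocess
import time


def bounce_commands(nic):
    """Return the esxcli commands that take `nic` down and bring it back up."""
    return [
        ["esxcli", "network", "nic", "down", "-n", nic],
        ["esxcli", "network", "nic", "up", "-n", nic],
    ]


def bounce_nic(nic, pause_s=5.0, dry_run=True):
    """Bounce one uplink; with dry_run=True only print what would be run."""
    commands = bounce_commands(nic)
    for i, cmd in enumerate(commands):
        if dry_run:
            print("would run:", " ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
            if i == 0:
                # Hypothetical settle time between "down" and "up".
                time.sleep(pause_s)
    return commands


if __name__ == "__main__":
    # vmnic5 is the uplink named in the list above; dry run only.
    bounce_nic("vmnic5")
```

Calling bounce_nic("vmnic5", dry_run=False) would actually run the two commands, so it is best tried on a host where the other path is known to be healthy.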
I reset all counters on the PowerConnect switch (9:35 CET).
On ESX09 the time is 07:39 UTC 2012.
Then I started a test run; it is running on datastore “ESX-SAS-03”.
One hour later, I don't see any dropped frames on the interfaces in use. Neither do I see any pause frames being sent or received, nor are there any warnings or errors on the EQL or in VMware.
There were no other VMs running on the host at that time.
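When lining up the switch counters with the ESX logs, it helps to put both timestamps on one clock. The sketch below converts the 9:35 counter reset to UTC and compares it with the 07:39 UTC shown on ESX09. It uses fixed UTC offsets, and the calendar date is a placeholder because only the year is given above. Both are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc
CET = timezone(timedelta(hours=1), "CET")    # Central European (winter) time
CEST = timezone(timedelta(hours=2), "CEST")  # Central European summer time

# PLACEHOLDER date: only the year 2012 is known, not the day.
DAY = (2012, 1, 1)


def to_utc(hour, minute, tz, day=DAY):
    """Return the wall-clock time hour:minute in zone `tz` as a UTC datetime."""
    return datetime(*day, hour, minute, tzinfo=tz).astimezone(UTC)


def offset_minutes(a, b):
    """Signed difference a - b in whole minutes."""
    return int((a - b).total_seconds() // 60)


esx_clock = datetime(*DAY, 7, 39, tzinfo=UTC)

if __name__ == "__main__":
    for tz in (CET, CEST):
        reset = to_utc(9, 35, tz)
        diff = offset_minutes(esx_clock, reset)
        print(f"counter reset 09:35 {tz.tzname(None)} = {reset:%H:%M} UTC, "
              f"ESX09 clock differs by {diff:+d} min")
```

Read literally as CET (UTC+1), the reset falls at 08:35 UTC, almost an hour after the ESX09 timestamp. If the switch was actually on summer time (UTC+2), the two are 4 minutes apart, which is worth confirming before comparing counters against host logs.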
Please help :( because Dell/EqualLogic support isn't taking me seriously. They blame my switches because I route iSCSI traffic between VLANs and have ACLs, but that iSCSI traffic is not affected. They tell me to break up the stack. They didn't even talk to us; everything happened through email. There was no WebEx whatsoever.