Hello everybody,
The old thread has become sooooo looooong - so, after a discussion with our moderator oreeh (thanks, Oliver), I decided to start a new thread here.
Oliver will add a few links between the old thread and this one and then close the old one.
Thanks for joining in.
Reg
Christian
The SAN is a Dell R710 (1x L5600-series CPU, 3GB RAM) running Windows Server 2008 R2 and Microsoft iSCSI Software Target 3.3. IOMeter was run directly on the R710 (not from a VM through the hypervisor's iSCSI initiator). The data volume is a GPT partition formatted NTFS with a 64K allocation unit size; the RAID stripe element size on the PERC 6/i is also 64K.
Access Specification | IOPS | MB/s | Avg Resp Time (ms) |
---|---|---|---|
Max Throughput-100%Read | 26,818.24 | 838.07 | 2.22 |
RealLife-60%Rand-65%Read | 1,070.44 | 8.36 | 41.94 |
Max Throughput-50%Read | 22,213.96 | 694.19 | 2.66 |
Random-8k-70%Read | 946.45 | 7.39 | 47.99 |
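For reference, matching the NTFS allocation unit to the 64K stripe element is just the standard format switch - the drive letter and volume label below are placeholders for my data volume:
format E: /FS:NTFS /A:64K /V:SAN-Data /Q
fsutil fsinfo ntfsinfo E:
The fsutil output reports the allocation unit as "Bytes Per Cluster", which should read 65536 if the 64K size took.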
I'm happy with the sequential I/O results, but the random I/O results are just OK. I'm guessing random I/O results would be much better with more spindles, but I'm limited to 6 drive bays.
Comments appreciated on how I can improve random I/O results, if at all. Also, would StarWind target with read cache help at all with random I/O results? MS iSCSI target doesn't do caching...
If you want to keep Server 2008 R2 as the host OS, then yeah, StarWind iSCSI will get you caching, although you'll need more than 3GB of RAM for it to be noticeable. On the face of it, though, that's not bad for 6 SATA spindles.
What does performance look like across the wire?
Thanks for your advice re: StarWind, mikeyb79. Based on benchmarking I've done with IOMeter against a StarWind iSCSI target with caching enabled on the LUN, you are right - a small RAM cache makes little difference to read performance. I'm guessing that even a larger cache (e.g. 16GB) won't help that much with random IO performance. What do you think? I'm not familiar enough with the StarWind caching algorithms to know for sure, and StarWind themselves haven't been very forthcoming: http://www.starwindsoftware.com/forums/starwind-f5/how-properly-benchmark-caching-benefits-t3123.htm...
When I run IOMeter within a Windows 7 64-bit VM against a direct-attached VMDK that resides on a LUN exposed by the MS iSCSI target, performance is pretty close to "native" for smaller IO sizes. However, as I haven't yet implemented MPIO, network throughput is currently bottlenecked by a single Gig NIC. I'll rectify that this weekend.
In general, I'm pretty happy with this performance given the relative low cost of the hardware. It would be nice to get some SSD caching involved for better random IO performance, but I don't think that's going to be possible.
I'll post back when I have some "in VM" results to share...
Thanks!
I'm still figuring all this out so take this with a grain of salt...
One thing: memory bus speed on that box may pick up by about 70% if you fill out all your DIMM slots with 1333MHz sticks, so there could be potential for a better-than-expected cache bonus from even a moderate increase in memory, provided it's balanced across all 3 channels.
I think you could get a 2-3x random I/O boost with ZFS on a *nix box, but you may need 12-16x the RAM. If you think about doing it, it'd be interesting to see how much additional memory by itself improves your results on Windows Server with StarWind, and then what additional boost, if any, you get from ZFS.
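If you do go the ZFS route, a quick sanity check that the extra RAM is actually being used is to watch the ARC kstats during an IOMeter run (Solaris/illumos kstat naming - adjust for your platform):
kstat -p zfs:0:arcstats:size
kstat -p zfs:0:arcstats:hits zfs:0:arcstats:misses
The first shows the current ARC size in bytes; the hit/miss counters tell you how much of the random read load is really being served from cache.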
Access Specification | IOPS | MB/s | Avg Resp Time (ms) |
---|---|---|---|
Max Throughput-100%Read | 3398.88 (NIC on SAN almost saturated) | 106.22 | 9.09 |
RealLife-60%Rand-65%Read | 1069.34 | 8.35 | 26.16 |
Max Throughput-50%Read | (saturates NIC on SAN) | n/a | n/a |
Random-8k-70%Read | 994.77 | 7.77 | 28.66 |
Results from within a Windows 7 64-bit VM with 1.5GB of RAM and 1 vCPU with iSCSI software initiator running on ESXi 4.1 U2 host. Target VMDK resides on a LUN exposed by StarWind that has no caching:
Access Specification | IOPS | MB/s | Avg Resp Time (ms) |
---|---|---|---|
Max Throughput-100%Read | 3,247.61 (NIC on SAN almost saturated) | 101.49 | 9.46 |
RealLife-60%Rand-65%Read | 1118.26 | 8.74 | 24.26 |
Max Throughput-50%Read | (saturates NIC on SAN) | n/a | n/a |
Random-8k-70%Read | 975.25 | 7.62 | 28.10 |
Here are the results using the MS iSCSI target instead of StarWind, IOMeter run within the same test Windows 7 64-bit VM:
Access Specification | IOPS | MB/s | Avg Resp Time (ms) |
---|---|---|---|
Max Throughput-100%Read | 3,353.83 (NIC on SAN almost saturated) | 104.81 | 17.83 |
RealLife-60%Rand-65%Read | 1177.83 | 9.20 | 51.24 |
Max Throughput-50%Read | (saturates NIC on SAN) | n/a | n/a |
Random-8k-70%Read | 1040.72 | 8.13 | 57.96 |
As far as I know, the MS iSCSI target does no caching, so these results suggest that the 1GB Starwind cache provides little if any benefit.
Based on the "native" performance for max throughput (both 100% read and 50/50 read/write), the network throughput is clearly the limiting factor. With over 26,000 IOPS and over 800 MB/s throughput for reads, I would need 8 NICs in order to keep up! Pretty crazy...
I had planned on using (2) pNICs on the SAN and on each ESXi server and then have two iSCSI subnets and bind the vmknic to a single pNIC. Basically, a standard MPIO setup with round robin and switching between storage paths with each IO operation.
Can anyone out there advise on what's involved in using more than (2) pNICs with MPIO? Is it just a matter of creating additional vmknics and then binding each to a pNIC? Is it worth bothering?
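As far as I can tell, adding a third path on ESXi 4.1 would look roughly like this (the portgroup name, IP, vmk number, and vmhba number are just placeholders for my setup - corrections welcome):
esxcfg-vswitch -A iSCSI3 vSwitch1
esxcfg-vmknic -a -i 10.0.3.11 -n 255.255.255.0 iSCSI3
esxcli swiscsi nic add -n vmk3 -d vmhba33
esxcli swiscsi nic list -d vmhba33
The portgroup's NIC teaming would be overridden in the vSphere Client so only one pNIC is active, then the new vmknic gets bound to the software iSCSI adapter and verified with the list command.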
About the only sequential IO in our environment that would generate sustained iSCSI traffic would be PHD backups (virtual fulls) and I'm guessing that a good chunk of the job time is spent on hashing, compression, and verification rather than on IO operations, so backup job times likely wouldn't be reduced by a huge amount with faster disk throughput.
Any thoughts?
A bit of Unix-side network tuning today. Tremendous gains on the sequential tests! In case you made the same mistake I did on your Solaris-based ZFS storage rig, reconfigure your network with ipadm instead of ifconfig and see if you get a huge boost too (rough example below the table). (Btw, this is a 4GB test, so all the reads come from the ARC; write-back cache is also enabled.)
Test name | Latency | Avg iops | Avg MBps | cpu load |
---|---|---|---|---|
Max Throughput-100%Read | 8.26 | 7072 | 221 | 10% |
RealLife-60%Rand-65%Read | 3.68 | 12254 | 95 | 20% |
Max Throughput-50%Read | 8.71 | 6771 | 211 | 16% |
Random-8k-70%Read | 4.07 | 13111 | 102 | 20% |
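For anyone curious, the ipadm change was essentially the following - the interface name and address are just examples, and the jumbo-frame MTU is a datalink property, so it goes through dladm before the interface is plumbed:
dladm set-linkprop -p mtu=9000 ixgbe0
ipadm create-ip ixgbe0
ipadm create-addr -T static -a local=10.0.0.10/24 ixgbe0/iscsi
ipadm show-addr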
Hi guys,
We just bought an HUS 110 with 16 SSDs and 21 15K SAS drives, and it looks like we have some performance issues.
Tested from a VDI VM on an HP blade system, through two Cisco MDS 9100 4Gb FC switches that are also carrying a low load from an old EVA4400 array.
This test is on a clean disk array.
SERVER TYPE: W7 on ESX 4.1
CPU TYPE / NUMBER: 4 vCPU
HOST TYPE: HP BL460c G1, 52GB RAM, 2x Xeon E5450 @ 3.00GHz
STORAGE TYPE / DISK NUMBER / RAID LEVEL: HUS110 / 16x 200GB SSD, 3x(4+1)+1P / RAID5
Test name | Latency | Avg iops | Avg MBps | cpu load |
---|---|---|---|---|
Max Throughput-100%Read | 4.78 | 11944 | 373 | 72% |
RealLife-60%Rand-65%Read | 40.45 | 1301 | 10 | 59% |
Max Throughput-50%Read | 30.21 | 1657 | 51 | 51% |
Random-8k-70%Read | 33.27 | 1511 | 11 | 49% |
SERVER TYPE: W7 on ESX 4.1
CPU TYPE / NUMBER: 4 vCPU
HOST TYPE: HP BL460c G1, 52GB RAM, 2x Xeon E5450 @ 3.00GHz
STORAGE TYPE / DISK NUMBER / RAID LEVEL: HUS110 / 20x 300GB 15K SAS, 4x(4+1)+1P / RAID5
Test name | Latency | Avg iops | Avg MBps | cpu load |
---|---|---|---|---|
Max Throughput-100%Read | 4.78 | 11964 | 373 | 71% |
RealLife-60%Rand-65%Read | 116.92 | 501 | 3 | 18% |
Max Throughput-50%Read | 30.38 | 1645 | 51 | 50% |
Random-8k-70%Read | 125.57 | 468 | 3 | 34% |
If I test on a single VMware 4.1 host, without any other VDI load and outside of vCenter, the latency goes down. It looks like something is wrong on the first ESX host. Could it be a load balancing setting? What should I check?
SERVER TYPE: W7 on ESX 4.1
CPU TYPE / NUMBER: 4 vCPU
HOST TYPE: HP BL460c G1, 52GB RAM, 2x Xeon E5405 @ 2.00GHz
STORAGE TYPE / DISK NUMBER / RAID LEVEL: HUS110 / 15x 200GB SSD / RAID5 4+1
Test name | Latency | Avg iops | Avg MBps | cpu load |
---|---|---|---|---|
Max Throughput-100%Read | 4.56 | 12286 | 383 | 84% |
RealLife-60%Rand-65%Read | 9.76 | 4073 | 31 | 2% |
Max Throughput-50%Read | 5.86 | 9441 | 295 | 72% |
Random-8k-70%Read | 5.56 | 8453 | 66 | 2% |
Why are the SSD and SAS read results so similar? Is the FC HBA the bottleneck? How else could I test this?
Thank you so much for any advice
Host:
IBM x3650 M4
(2) Intel 2.9 GHZ 8 core
192 GB
(2) IBM 6Gb SAS HBA
SAN:
IBM DS3524 2 GB Cache
(2) EXP3524
(2) LSI LSISAS6160 Switches
(2) 200 GB SSD
(10) 15k 146 GB
(48) 10k 600 GB
Guest:
Windows 2008R2 Enterprise
2 CPU 8 Core each
8 GB
Raid 10 (46) 10k 600GB cache on
Test Name | IOps | MBps | AVG Resp | CPU % |
---|---|---|---|---|
Max Throughput-100%Read | 33249.68 | 1039.053 | 1.701312 | 6.304418 |
RealLife-60%Rand-65%Read | 6893.556 | 53.8559 | 6.713703 | 3.512941 |
Max Throughput-50%Read | 12531.99 | 391.6246 | 4.712662 | 6.004028 |
Random-8k-70%Read | 6779.728 | 52.96663 | 6.216324 | 3.666074 |
Raid 10 (46) 10k 600 GB cache off
Test Name | IOps | MBps | AVG Resp | CPU % |
---|---|---|---|---|
Max Throughput-100%Read | 16346.72449 | 510.83514 | 3.745185 | 3.113469 |
RealLife-60%Rand-65%Read | 4153.403377 | 32.448464 | 10.89437 | 2.76599 |
Max Throughput-50%Read | 1395.445359 | 43.607667 | 35.306443 | 2.438852 |
Random-8k-70%Read | 5466.856889 | 42.709819 | 8.081162 | 3.035924 |
Diskpool (48) 10k 600 GB cache on 3 drive preservation
Test Name | IOps | MBps | AVG Resp | CPU % |
---|---|---|---|---|
Max Throughput-100%Read | 33363.02947 | 1042.594671 | 1.689026 | 6.452065 |
RealLife-60%Rand-65%Read | 3814.127886 | 29.797874 | 12.072742 | 3.061749 |
Max Throughput-50%Read | 13355.51381 | 417.359807 | 4.470258 | 3.003937 |
Random-8k-70%Read | 3589.380777 | 28.042037 | 12.029601 | 3.344783 |
Diskpool (48) 10K 600 GB cache off 3 drive preservation
Test Name | IOps | MBps | AVG Resp | CPU % |
---|---|---|---|---|
Max Throughput-100%Read | 15781.88086 | 493.183777 | 3.876935 | 4.768079 |
RealLife-60%Rand-65%Read | 1533.440557 | 11.980004 | 31.202513 | 2.326186 |
Max Throughput-50%Read | 992.594997 | 31.018594 | 49.800264 | 2.324266 |
Random-8k-70%Read | 1669.930696 | 13.046334 | 28.862594 | 2.284067 |
Raid10 (10) 146 GB 15k cache on
Test Name | IOps | MBps | AVG Resp | CPU % |
---|---|---|---|---|
Max Throughput-100%Read | 33245.00003 | 1038.906251 | 1.706747 | 6.508344 |
RealLife-60%Rand-65%Read | 4250.683429 | 33.208464 | 10.56322 | 3.420131 |
Max Throughput-50%Read | 12514.10222 | 391.065696 | 4.726333 | 2.948779 |
Random-8k-70%Read | 3762.80457 | 29.396911 | 11.635182 | 3.516433 |
Raid10 (10) 146 GB 15k cache off
Test Name | IOps | MBps | AVG Resp | CPU % |
---|---|---|---|---|
Max Throughput-100%Read | 16380.95056 | 511.904705 | 3.75275 | 3.269551 |
RealLife-60%Rand-65%Read | 2871.883938 | 22.436593 | 17.253888 | 2.256621 |
Max Throughput-50%Read | 1355.161177 | 42.348787 | 38.087507 | 2.204567 |
Random-8k-70%Read | 3083.466497 | 24.089582 | 16.046979 | 2.290914 |
Raid1 (2) 200 GB SSD cache on
Test Name | IOps | MBps | AVG Resp | CPU % |
---|---|---|---|---|
Max Throughput-100%Read | 36207.60053 | 1131.487517 | 1.601896 | 6.78024 |
RealLife-60%Rand-65%Read | 10244.94536 | 80.038636 | 5.712712 | 2.466072 |
Max Throughput-50%Read | 12445.03385 | 388.907308 | 4.77228 | 2.912787 |
Random-8k-70%Read | 10911.60675 | 85.246928 | 5.29515 | 2.564752 |
Raid1 (2) 200 GB SSD cache off
Test Name | IOps | MBps | AVG Resp | CPU % |
---|---|---|---|---|
Max Throughput-100%Read | 9446.244212 | 295.195132 | 6.229075 | 2.668895 |
RealLife-60%Rand-65%Read | 10015.22017 | 78.243908 | 5.636935 | 2.71507 |
Max Throughput-50%Read | 4008.127503 | 125.253984 | 14.681326 | 1.695191 |
Random-8k-70%Read | 10665.43192 | 83.323687 | 5.331753 | 2.764539 |
Hi,
The test was done on a single ESXi 5.1u1 host with 2 physical 10Gb NICs, in a Dell M1000e chassis with two 10Gb Dell PowerConnect M8024 switches. The PC M8024 switches are connected to two Dell Force10 10Gb switches and finally to the EqualLogic PS6010 storage array. This is an isolated storage network.
SERVER TYPE: Dell M710
CPU TYPE / NUMBER: Intel Xeon X5680 Processor (3.33Ghz, 6C, 12M Cache, 6.40GT/s QPI, 130W TDP, Turbo, HT) / 1
HOST TYPE: ESXi 5.1u1 (Dell MEM Installed) - VMware I/O Analyzer 1.5.1 (Disk2=8GB)
STORAGE TYPE / DISK NUMBER / RAID LEVEL / CONNECTIVITY: Dell Equallogic PS6010 (FW 6.0.2) / 16 SAS 600GB 15KRpm / RAID 10 / 2 iSCSI path
TEST NAME | LATENCY/rd | IOPS/cmds | MBPS/reads | VM CPU LOAD |
---|---|---|---|---|
Max Throughput 512k 100% Read | 20.11 | 1583.53 | 782.46 | ~58 |
SQL Server 64k 100% Rand - 66% Read | 10.05 | 2224.87 | 90.92 | ~23 |
OLTP 8K 70% Read - 100% Rand | 5.72 | 3685.85 | 20.6 | ~26 |
MAX IOPS 0.5k 100% Read | 0.75 | 36949.00 | 18.15 | ~100 |
I had a hard time understanding the numbers (see screenshot below) - are they good, or do we need to look this over again?
Lab setup to evaluate what the Intel S3700 SSDs are capable of. Throughput was bottlenecked by the P410 controller. Tests were performed in steady state.
2x 800GB Intel S3700 SSDs / Raid 0 / 2x X5560 @ 2.80 Ghz / HP P410 Controller w/ 512MB Cache
Test | Latency | Avg Iops | Avg MBps | CPU Load |
---|---|---|---|---|
Max Throughput-100% Read | 4.48 | 13667 | 427 | 4.04% |
RealLife-60%Rand-65% Read | 3.21 | 17614 | 138 | 4.95% |
Max Throughput-50% Read | 1.33 | 41920 | 1310 | 6.33% |
Random-8k-70% Read | 3.34 | 17102 | 134 | 10.56% |
Retested without the bottleneck
2x 800GB Intel S3700 SSDs / Raid 0 / 2x E5-2690 @ 2.90 Ghz / HP P420 Controller w/ 2GB Cache / 76GB iobw.tst
Test | Latency | Avg Iops | Avg MBps | CPU Load |
---|---|---|---|---|
Max Throughput-100% Read | 0.10 | 112799 | 3525 | 6.39% |
RealLife-60%Rand-65% Read | 0.96 | 53584 | 418 | 5.71% |
Max Throughput-50% Read | 0.33 | 118880 | 3715 | 7.18% |
Random-8k-70% Read | 1.02 | 50190 | 392 | 4.88% |
Hey guys, I know this thread isn't the most active place on the internets but I'm hoping maybe a storage networking guru can help me out. I'm troubleshooting poor performance on our iSCSI SAN and seeing some interesting IOmeter results:
Array: FreeNAS, 27 SATA disks, 2x 1Gb links
Access Specification Name | IOps | MBps | Latency(ms) |
---|---|---|---|
Max Throughput-100%Read | 3563.99 | 111.37 | 16.80 |
RealLife-60%Rand-65%Read | 984.89 | 7.69 | 60.35 |
Max Throughput-50%Read | 5800.14 | 181.25 | 10.11 |
Random-8k-70%Read | 1692.67 | 13.22 | 35.09 |
Now, looking past these fairly mediocre results, one thing that has come up consistently is that the 100% random 8K 70% read test is consistently faster than the RealLife test, sometimes up to 2x as "fast" and with considerably less latency (although both are bad, I know). In my mind, the 100% randomness should result in slower performance... I'm wondering if this is a symptom of some sort of misconfiguration on our networking gear. I've isolated it to our iSCSI "core", which is a stack of four PowerConnect 8024s. If I isolate this array to a single switch in the stack and run the tests directly from my laptop, these are the results (I think I'm hitting some bottlenecks related to the laptop NIC):
Access Specification Name | IOps | MBps | Latency(ms) |
---|---|---|---|
Max Throughput-100%Read | 1798.61 | 56.21 | 25.48 |
RealLife-60%Rand-65%Read | 5447.46 | 42.56 | 6.20 |
Max Throughput-50%Read | 1757.25 | 54.91 | 21.81 |
Random-8k-70%Read | 5245.12 | 40.98 | 6.10 |
Anyway, I've reviewed the 8024s' config and they're set to Dell's recommended best practices (jumbo frames, flow control, unicast storm control disabled). Nothing looks obviously wrong in the stack - CPU/memory utilization is fine, no stack-port errors, etc.
Someone want to throw an idea my way? I'm out of ideas. Thanks!
If removing one of the switches from the configuration results in that significant a performance increase, then I suspect there may be something up with the stacking. How are the switches stacked? Can you take a look at the port status for collisions or dropped packets? It may be worth going back through an EqualLogic-on-PowerConnect configuration guide and comparing the configuration details, as they may be highly relevant to your situation. Also, what does the path selection policy look like in VMware?
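For that last point, something like this run on the host should show the PSP and path state per device (ESXi 5.x syntax; swap in your volume's naa ID):
esxcli storage nmp device list
esxcli storage nmp device list -d naa.xxxxxxxx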
Thanks for the reply!
I didn't actually remove the switch from the stack for my 2nd test - I carved out a new interface & iSCSI target on the array and connected it via a single link to one of the member switches. Then I literally plugged my laptop into another port on the same switch and ran IOmeter from that. The switches are stacked using their 10Gb interfaces via DAC cables.
Path selection is round robin, with iSCSI-bound vmks, etc. I have another test I ran against this same array from a host that, again, lives on the same switch as the array. The results were similar (better actually, as my ESXi hosts are using multipathing and whatnot). It seems to be only traffic that traverses the core stack that gets screwed up, yet the stack ports themselves show no errors/drops.
I wonder if maybe we've hit a firmware bug of some sort. We're running a fairly old firmware revision, in fact the one that first supported ethernet port stacking, 4.2.2.3. Problem is the upgrade process would be site-wide downtime - so it'd be nice to know it's not a config issue of some sort.
Compellent SC8000 - The SMALLEST config around!
So it was a good week so far. We decided to buy some new, dedicated storage for VMware Replication, replicating our production environment to our DR location. We spoke with a large number of vendors to find something we were really happy with, and Dell brought Compellent to the table as there was a promo on for 2x SC8000 controllers and 1x SC200 disk enclosure with 12x 600GB 15k drives at EXTREMELY reasonable pricing. We loved how it optimized data placement and how the controllers were isolated from the drives themselves, and my boss was quite fond of the fact that adding drives in small increments and re-striping entire tiers is quite easy. Another local company we've dealt with before has 400TB on Compellent and provided a glowing review. So we went ahead and ordered the bundle, along with an additional shelf with 8x 4TB 7k drives for "cold" data. This was mostly to get a feel for the product and see if it would be a good fit on the production side, where we would require much more performance.
I basically followed the guide and got it up and running no problem. Our config is 1GbE iSCSI, with a single dual-port HBA per controller for the time being. Our controllers are also the 16GB variety, but Dell spiffed us a pair of 64GB cache upgrade kits. Still waiting for those to come in. That means this is basically the lowest performance you could expect out of a Compellent Storage Center. The storage profile for this test was the default profile, and the data was spread across 11 drives (the twelfth drive is a hot spare). It was connected through an HP ProCurve 5412XL switch with two dedicated VLANs for iSCSI, flow control enabled but no jumbo frames.
Needless to say, I am quite impressed with how many IOPS it can squeeze from 11 measly spindles (389 IOPS/spindle at best), and how consistent the results are. It really wants to deliver 4000 IOPS no matter what kind of workload you throw at it - good enough for me.
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
TABLE OF RESULTS
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
SERVER TYPE: Windows 2008 R2, 1 vCPU, 4GB RAM, 40GB hard disk
CPU TYPE / NUMBER: Intel E5-2660, single vCPU
HOST TYPE: Dell PowerEdge R720, 256GB RAM; 2x E5-2660, 2.2 GHz
STORAGE TYPE / DISK NUMBER / RAID LEVEL: Compellent SC8000, 11 data disks in Tier 1 (RAID10), 600GB 15k
Test Name | Resp. Time (ms) | Avg IO/sec | MB/sec |
---|---|---|---|
Max Throughput-100%Read | 15.58 | 3844.90 | 120.15 |
RealLife-60%Rand-65%Read | 11.34 | 3993.28 | 31.20 |
Max Throughput-50%Read | 12.73 | 3784.41 | 118.26 |
Random-8k-70%Read | 10.72 | 4283.04 | 33.50 |
Further testing continues today as I have time in between a number of other projects. One simple tweak:
esxcli storage nmp psp roundrobin deviceconfig set --type=iops --iops 1 --device naa.xxxxxx
where "device" is the 500GB Compellent volume has resulted in a reasonably dramatic performance improvement on the 100% Read tests, but nothing worth noting on the others. Going to keep playing with this to find the best performing combination for this storage array, next testing an IOPS policy of 3 then possibly enabling jumbo frames end-to-end and trying a "bytes" policy of 8800.
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
TABLE OF RESULTS
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
SERVER TYPE: Windows 2008 R2, 1 vCPU, 4GB RAM, 40GB hard disk
CPU TYPE / NUMBER: Intel E5-2660, single vCPU
HOST TYPE: Dell PowerEdge R720, 256GB RAM; 2x E5-2660, 2.2 GHz
STORAGE TYPE / DISK NUMBER / RAID LEVEL: Compellent SC8000, 11 data disks in Tier 1 (RAID10), 600GB 15k (IOPS policy set to 1)
Test Name | Resp. Time (ms) | Avg IO/sec | MB/sec |
---|---|---|---|
Max Throughput-100%Read | 10.07 | 5944.75 | 185.77 |
RealLife-60%Rand-65%Read | 11.38 | 3944.52 | 30.82 |
Max Throughput-50%Read | 13.01 | 2893.33 | 90.42 |
Random-8k-70%Read | 10.61 | 4327.21 | 33.81 |
If the storage is iSCSI, the recommendation I have seen is to not use IOPS as the path-switching limit, but to definitely use jumbo frames if possible and set the BYTES limit to the maximum jumbo payload size minus the header overhead. I am not sure what that BYTES value is, but I am sure it is elsewhere in this very long thread.
Yes, it should be set to 8800 bytes. I have it set now and the benchmark is cooking away.
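If anyone wants to confirm what a volume is currently using, the matching get command reports the active policy and limits (placeholder device ID again):
esxcli storage nmp psp roundrobin deviceconfig get --device naa.xxxxxx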