VMware Cloud Community
birnenschnitzel
Contributor

InfiniBand SRP Latency

Hi,

A few days ago we installed 10 Gb InfiniBand cards in our Linux storage server and one of our ESXi machines. After a successful setup we were able to run some NFS over IP over IB benchmarks. These were far from what we expected. Because we are using old InfiniHost cards, the MTU is limited to 2044 bytes, which results in a maximum bandwidth of about 150-200 MB/s when performing fully cached reads. Not much better than the 100 MB/s we get over 1 Gb Ethernet.
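For reference, this is roughly how we check the IPoIB side on the Linux box (a minimal Python sketch; "ib0" is only a placeholder for whatever the interface is actually called, and the mode file is specific to IPoIB interfaces):

# Quick check of the IPoIB interface settings via sysfs (Linux only).
# "ib0" is a placeholder for the actual IPoIB interface name.
from pathlib import Path

IFACE = "ib0"
base = Path("/sys/class/net") / IFACE

mtu = (base / "mtu").read_text().strip()
# "mode" exists only for IPoIB: "datagram" caps the MTU at the IB MTU minus
# 4 bytes (2044 with a 2048-byte IB MTU), while "connected" allows a much
# larger MTU on HCAs that support it.
mode = (base / "mode").read_text().strip()

print(f"{IFACE}: mode={mode}, mtu={mtu}")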

Unhappy with that, we switched to SRP and installed the SCST SRP target driver on the Linux box, and suddenly transfer rates rose beyond expectations. 700 MB/s of throughput is more than our hard disks can supply at the moment. Nevertheless, one aspect does not scale as expected: 512 B fully cached random reads still show the same latency as over plain 1 Gb Ethernet.

Performance numbers are:

NFS over 1 Gb Ethernet - 512 B random reads, one thread: ~4000 I/Os per second => 0.25 ms per I/O

SRP over 10 Gb IB - 512 B random reads, one thread: ~4000 I/Os per second => 0.25 ms per I/O
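For the record, these numbers come from single-threaded 512 B random reads issued from a Linux client. A minimal sketch of such a probe (the device path is only a placeholder for the SRP LUN or a large test file; O_DIRECT keeps the client page cache out of the measurement):

# Single-threaded 512 B random-read latency probe (illustrative sketch).
import mmap, os, random, time

DEV = "/dev/sdX"      # placeholder for the SRP-attached LUN or a big test file
BLOCK = 512           # read size used for the numbers above
COUNT = 20000         # number of random reads to time

fd = os.open(DEV, os.O_RDONLY | os.O_DIRECT)   # bypass the client page cache
size = os.lseek(fd, 0, os.SEEK_END)
buf = mmap.mmap(-1, BLOCK)                     # page-aligned buffer, as O_DIRECT requires

start = time.perf_counter()
for _ in range(COUNT):
    offset = random.randrange(size // BLOCK) * BLOCK   # block-aligned random offset
    os.preadv(fd, [buf], offset)
elapsed = time.perf_counter() - start

print(f"avg latency: {elapsed / COUNT * 1e6:.1f} us, IOPS: {COUNT / elapsed:.0f}")
os.close(fd)

At ~4000 IOPS the average works out to roughly 250 µs per read, which is where the figures above come from.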

Recompiling the Linux kernel with the required SCST patches did not help either. It seems as if there is some kind of interrupt coalescing on the InfiniBand cards, so small transfers do not get any faster. If I understand the specs correctly, the latency of our adapters should be around 3 µs, and each of the four serial InfiniBand lanes runs at 2.5 Gb/s. So I would at least expect small I/O rates to double compared to Ethernet.

Does anybody know of a setting to work around this odd behaviour?

Thanks in advance.

3 Replies
BustedTyre
Contributor

What numbers were you expecting from the benchmark? In other words, what would you expect if the storage were local?

4000 random IOPS isn't bad at all if it's an HDD array. The best disks provide ~170 random IOPS per drive, and that doesn't scale up linearly as you add disks to the array.

If the array is SSD-based, it's an entirely different story, of course. There the SAS controller is usually the main bottleneck.

Also, at these speeds you will notice a difference depending on whether you benchmark a bare server or a VM. Going through the VMware storage stack, the limit is about 30K random IOPS combined across many threads, read or write, and, surprisingly, regardless of whether the disks are RDM or VMFS. If your SRP/iSER initiators run inside the VMs, that is reportedly a lot faster, but we have yet to test it.

I wonder how it fared in the end. We've just finished testing a monster DAS configuration and are about to put it on the network (iSER or SRP over QDR IB).

mcowger
Immortal

The card-to-card latency of 3 µs sounds about right for IB; however, there's more to it than that. 0.25 ms is 250 microseconds, so besides the raw round-trip time (around 6-10 µs through your switches), you've got roughly 240 microseconds to explain. To be honest, 0.25 ms is a fantastic time for moving data through a storage system - well under what's normally achieved (consider that top-end FC arrays costing millions, filled with SSDs, have trouble doing better than 2-3 ms).
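To put rough numbers on that (a back-of-the-envelope sketch using the figures above; the split is only illustrative):

# Rough latency budget for one 512 B read over SRP.
io_latency_us  = 250.0     # observed: 4000 IOPS, one thread -> 1/4000 s per I/O
hca_latency_us = 3.0 * 2   # ~3 us per HCA traversal, request + response
switch_rtt_us  = 8.0       # ~6-10 us round trip through the switches

# Wire time for the payload: 4 lanes x 2.5 Gb/s signalling, 8b/10b encoding
# leaves ~8 Gb/s of usable bandwidth, i.e. roughly 1 GB/s.
wire_time_us = 512 / 1e9 * 1e6     # ~0.5 us, effectively negligible

software_us = io_latency_us - hca_latency_us - switch_rtt_us - wire_time_us
print(f"fabric + wire: ~{hca_latency_us + switch_rtt_us + wire_time_us:.1f} us")
print(f"left for initiator/target software: ~{software_us:.0f} us")

Which is the point: at this block size the link speed is almost irrelevant, and nearly all of the time goes into the host and target software paths.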

There are definitely some interrupts happening, and there's the time needed for the data to move from main memory through the kernel, get wrapped in IB frames, put onto the wire, and so on.

Honestly, I don't think anything is wrong...

--Matt

VCP, VCDX #52, Unix Geek, Storage Nerd


--Matt VCDX #52 blog.cowger.us
BustedTyre
Contributor

Assuming it's SSDs, the array is not RAID5 or RAID6, and the storage box is not caching the requests, the best latency you get off a single SLC SSD is in the range of 100 µs. Putting them in RAID10 or RAID0 won't improve single-thread latency, but it does improve combined performance for multiple threads.

That extra 150 µs might indeed be explained by the wrapping/unwrapping of traffic if the initiators sit in VMs behind hypervisors, or within the hypervisors themselves - which in this case we don't know.

What's more interesting, though, is that in a DAS scenario there is no difference between RDM disks and VMFS storage made of the same SSDs. Bare-server performance with 16 Intel X25-E SSDs in RAID10 is about 300K/80K random read/write IOPS on 4K blocks, while it's no more than 30K/30K for VMs running in the VMware hypervisor on the same server off the same array - combined across multiple threads over a few VMs.
