VMware Cloud Community
peter76
Contributor

High latency on LUNs

Hi experts,

I have a problem with latencies where I need some help.

My setup:

I have a blade chassis with several blade servers from Fujitsu. The chassis contains two 10GbE switches with multiple external ports.

For the ESXi 5 hosts this means each server has two 10GbE NICs.

Each blade switch has one 10Gbit uplink to our LAN (each going to a different Cisco switch in a stack configuration).

As shared storage we have an EMC CX4-120 (latest FLARE code). Each storage processor is connected via iSCSI to each of the blade switches. Each storage processor has two ports, meaning each SP is connected with port A to blade switch 1 and with port B to blade switch 2. So overall it's a fully redundant setup and everything has been working fine so far.

I think I can describe my problem best with our backup scenario (using Veeam B&R 6).

Veeam is running as a VM and has some dedicated RAID groups (SATA storage) on the EMC system as its backup repository.

Now when I run a job which backs up a VM located on "LUN-A" (FC 15k storage) to "LUN-C" (SATA storage, the backup repository), I see high write latency on LUN-C. Of course that's normal because the slow SATA storage is the bottleneck in this setup, so its write performance is at nearly 100% all the time (which is not a problem for us because all the jobs are still fast enough for our environment).

Now what I don't really get is why I also see high write latencies on all my other (around 20) LUNs connected to the ESXi hosts. The latency is not as high as on the backup LUN with its ~100ms avg, but it still increases to ~30-40ms avg (with much higher peaks). These LUNs are physically completely separated (different RAID group, different spindles).

In fact I am only getting a write performance of 40MB/s on my slow SATA RAID 6, so I don't understand why the whole environment is affected. Any ideas where I can start troubleshooting?

I already checked the load of the storage processors, which are far from being overloaded.

I can see the same scenario when I clone a VM, for example. There I also see high latency on all my LUNs and not just on the LUNs involved in the clone process (source & target).
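
For reference, this is how I plan to quantify the per-LUN latency: esxtop in batch mode on one host, then a small Python script to average the per-device write-latency columns. The counter names matched by the regex are an assumption about the batch CSV header format, so check your own capture's header row and adjust:

```python
# Sketch: summarize per-device write latency from an esxtop batch capture.
# Capture on the host first:   esxtop -b -d 2 -n 60 > esxtop.csv
# ASSUMPTION: headers contain strings like
#   "\\host\Physical Disk SCSI Device(naa....)\Average Driver MilliSec/Write"
import csv
import re
from collections import defaultdict

MATCH = re.compile(r"Physical Disk[^(]*\((?P<dev>[^)]+)\).*MilliSec/Write")

sums = defaultdict(float)
counts = defaultdict(int)

with open("esxtop.csv", newline="") as f:
    reader = csv.reader(f)
    header = next(reader)
    # Map column index -> device name for every write-latency column.
    cols = {i: m.group("dev")
            for i, h in enumerate(header)
            if (m := MATCH.search(h))}
    for row in reader:
        for i, dev in cols.items():
            try:
                sums[dev] += float(row[i])
                counts[dev] += 1
            except (ValueError, IndexError):
                continue  # skip empty or truncated samples

# Print average write latency per device, worst first.
for dev in sorted(sums, key=lambda d: sums[d] / counts[d], reverse=True):
    print(f"{dev}: {sums[dev] / counts[dev]:.1f} ms avg write latency")
```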

Thanks for every tip!

Regards,

Peter

3 Replies
peter76
Contributor

I forgot to mention that normally I have an avg. latency of ~5-6ms.

kastlr
Expert

Hi Peter,

welcome to the community.

There's no easy answer to your question.

But I assume the following.

Based on your description you're copying a large amount of data to the host and back to the array.

This might end in a scenario where the array cache must destage data more or less immediately, so you're back to pure disk performance.

I'm not familiar with Veeam as a backup tool, but usually backup software uses large IO block sizes.

This will increase throughput at the cost of response times, which is fine for backups.

Handling large IOs could also stress the storage processors more than usual, simply because they need to "split" each IO into their internal stripe/cache size.
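
As a rough back-of-the-envelope sketch (the page and IO sizes below are assumptions for illustration, not actual CX4 internals), you can see how a single large backup IO multiplies into many internal cache-page operations:

```python
# Illustrative only: the cache page and IO sizes are assumptions, not
# CX4 specifics; the point is the multiplication of internal work.
CACHE_PAGE_KB = 64      # assumed internal cache/stripe unit
BACKUP_IO_KB = 512      # assumed large backup write
GUEST_IO_KB = 8         # assumed typical guest write

def internal_ops(io_kb: int, page_kb: int = CACHE_PAGE_KB) -> int:
    """Cache-page operations a single host IO is split into."""
    return -(-io_kb // page_kb)  # ceiling division

print(f"{BACKUP_IO_KB} KB backup write -> {internal_ops(BACKUP_IO_KB)} internal ops")
print(f"{GUEST_IO_KB} KB guest write   -> {internal_ops(GUEST_IO_KB)} internal op")

# At Peter's 40 MB/s that is ~80 backup IOs/s on the wire, but ~640
# cache-page operations/s inside the SPs, competing with every LUN.
```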

So I would bet that your activities stress some kind of shared resource on the array, which ends in the effects you recognized.

Kind regards,

Ralf


Hope this helps a bit.
Greetings from Germany. (CEST)
mcowger
Immortal

kastlr is exactly correct.

Just because the disk groups aren't shared doesn't mean there aren't other shared resources on your array.

When doing that heavy-duty write, you will be stressing the ability of the backend buses to destage that data (a resource shared between all disks), stressing the write cache (because it's probably full from ingesting the writes, and is a shared resource), stressing the CPUs (a shared resource, and not that fast on a 120), and stressing the frontend port queues (from handling all the large-block writes).
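
A toy model makes the cache effect visible (every figure below except the 40MB/s destage rate you measured is an assumption): once the ingest rate exceeds the destage rate, the shared write cache fills, and from then on every LUN's writes queue behind the same backlog:

```python
# Toy model of a shared write cache; all figures are assumptions except
# the 40 MB/s destage rate measured on the SATA RAID 6.
CACHE_MB = 512           # assumed usable write cache on the SP pair
DESTAGE_MBPS = 40.0      # measured backend destage rate (SATA RAID 6)
INGEST_MBPS = 120.0      # assumed host-side backup write rate

fill_rate = INGEST_MBPS - DESTAGE_MBPS   # cache backlog grows at this rate
seconds_until_full = CACHE_MB / fill_rate

print(f"cache full after ~{seconds_until_full:.0f} s "
      f"(+{fill_rate:.0f} MB/s backlog growth)")

# Before: any LUN's write is acked from mirrored cache (~1 ms, assumed).
# After:  any LUN's write waits for cache pages to free up, i.e. it is
#         gated by SATA destage speed even if its own spindles are idle.
for phase, latency_ms in (("cache not full", 1), ("cache full", 30)):
    print(f"{phase}: every LUN sees ~{latency_ms} ms write latency")
```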

--Matt VCDX #52 blog.cowger.us