VMware Cloud Community
honda2good
Contributor

DAVG values over 200 ms on EMC 960 array SAN

Hi,

I currently have a rather nasty performance issue on my EMC-based SAN. Here is the configuration:

23 ESX servers running 4.1

Using QLogic 2462 HBAs at 4 Gb

Using four EMC CX4-960 arrays running FLARE code 30

All ESX clusters can access all four EMC arrays

We have four ESX clusters

Using MRU as our pathing policy

Using QLogic SAN switches at 4 Gb

When we run esxtop on any of the ESX servers (which are DL380 G7 servers), we see DAVG times of between 120 ms and 200 ms. VMware support says it should be around 10 ms. VMware support has also deemed this to be an EMC storage issue and is not providing much help to rectify the problem.

1.) I think we just have too many HBAs/servers accessing the arrays -- do others agree?

2.) It was suggested that we could reduce disk latency by implementing adaptive queue depth settings on each ESX server -- do others agree?

3.) Have others encountered this issue with ESX clusters essentially overloading the attached disk arrays?
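For reference, the adaptive queue depth mechanism that was suggested to us appears to be the one enabled through the Disk.QFullSampleSize / Disk.QFullThreshold advanced settings. A sketch of what we would run on each host -- the values here are illustrative examples, not tuned recommendations:

```shell
# Enable adaptive queue depth throttling on an ESX 4.1 host.
# QFullSampleSize > 0 turns the feature on: when that many QUEUE FULL /
# BUSY conditions are seen, the LUN queue depth is throttled down, then
# gradually restored. Values below are illustrative only -- test first.
esxcfg-advcfg -s 32 /Disk/QFullSampleSize
esxcfg-advcfg -s 4  /Disk/QFullThreshold

# Confirm the settings took effect
esxcfg-advcfg -g /Disk/QFullSampleSize
esxcfg-advcfg -g /Disk/QFullThreshold
```

These settings would need to be applied consistently on every host sharing the LUNs, or the throttled hosts get starved by the unthrottled ones.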

Thanks for your help,

3 Replies
f10
Expert

Hi,

A DAVG value of 200+ ms is definitely not good, and before you make any changes it's very important to understand what these values indicate. DAVG reflects latency at the storage array, so that is what you need to fix first. Check the storage processors, read/write cache, any failed disks in the RAID groups, errors on the switches, etc.

I don't think it's an issue with the number of paths: since you are using MRU, I would presume it's an Active/Passive array, so I/O won't be active on all the paths. The default queue depth is 32, and before you modify this value it's important to study the effects by running Iometer tests, etc.
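Before touching queue depths, a quick sanity check is to compare the aggregate queue depth your hosts can throw at one array front-end port against what the port can absorb. A back-of-envelope sketch -- every number here is an assumed example, so substitute your own host/LUN counts and confirm the actual CX4 port queue limit with EMC:

```shell
# Back-of-envelope fan-in check for a single array front-end port.
# All values are assumed examples, not measurements from this environment.
HOSTS=23          # ESX hosts zoned to the port
LUNS=10           # LUNs presented through the port to each host
QDEPTH=32         # default per-LUN queue depth on the HBA
PORT_QLIMIT=1600  # assumed front-end port queue limit -- verify with EMC

# Worst case: every host fills every LUN queue at once
AGGREGATE=$((HOSTS * LUNS * QDEPTH))
echo "Worst-case outstanding I/Os at the port: $AGGREGATE"
if [ "$AGGREGATE" -gt "$PORT_QLIMIT" ]; then
    echo "Oversubscribed -- QUEUE FULL conditions and high DAVG are plausible under load"
fi
```

If the worst-case number is several times the port limit, as with the example figures above, the fan-in alone can explain the queuing even before any array-side problem.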

Happy troubleshooting, and have a good weekend!

Regards,
Arun

If you found this or other information useful, please consider awarding points for "Correct" or "Helpful". Regards, Arun VCP3/4, HPCP, HP UX CSA http://kb.vmware.com/

mcowger
Immortal

How many VMs are running? That's really the driving factor. How many IOPS? How many disks behind the RAID groups?

Also, if you are running FLARE 30, you can probably use Failover Mode 4 and Active/Active pathing, which can help with this as well.
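A rough outline of the host side of that change, using ESX 4.1 esxcli syntax -- treat this as a sketch, not a procedure, since the array-side initiator records have to be set to failover mode 4 (ALUA) in Navisphere before the hosts will claim the devices with the ALUA SATP:

```shell
# See which SATP/PSP each device is currently claimed with
esxcli nmp device list

# Once the array initiator records are in failover mode 4 (ALUA),
# make Round Robin the default PSP for the CLARiiON ALUA SATP
esxcli nmp satp setdefaultpsp --satp VMW_SATP_ALUA_CX --psp VMW_PSP_RR
```

A rescan or reboot is then needed for existing devices to pick up the new claim rules.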

--Matt VCDX #52 blog.cowger.us
broylesd27
Contributor

I'd agree with f10 here. Jump over to the array immediately and look for things to be concerned about. Start with the storage processors and look for clues: high utilization on one but not the other (trespassed or imbalanced LUNs), or a high percentage of dirty pages that never drops (improper watermarks, improper read/write cache ratio). Verify write cache is enabled at all. If the poor performance is only being experienced with certain datastores, look at the statistics for the specific LUNs associated with them. 10 ms is nice, but from personal experience anything getting put to work will be around 20-30 ms.

Troubleshooting DAVG from the host side will usually be futile, and if all your hosts are having trouble, start with the components they have in common. Also, with FLARE 30 you might consider moving to ALUA (failover mode 4), which lets you use Round Robin as your PSP.
