Help! Good disk IOs but latency is high?

Hello, we are running 4 hosts on ESX 3.5 Update 4 and Update 5. The back end is an EMC Celerra NS20 using iSCSI, and we have around 35 VMs. Overall the back end and the ESX hosts seem to perform well, but latency seems to be a little high at times, or so I am told. I found this out through 2 problems.

1) We have been experimenting with Pano devices and virtualized desktops. The Panos would lose connection (although the VM would stay up), and the local rep is telling us our disk latency is a little high. When the disconnect occurs, the DAVG/cmd counter on the LUN approaches 20 or 30 ms. But these are just spikes, so it doesn't sound that bad to me?

2) We have a virtualized SQL 2005 server running Windows Server 2003 32-bit. The programmers complain about the performance of the box, saying an overnight report sometimes takes up to 3 hours to complete. During the day when I view the DAVG/cmd counter I see spikes as high as 100 ms, but again these are just spikes, so they don't sound that bad. Throughput on the iSCSI NIC shows up to 40 MB/s. That doesn't sound bad.
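
One way to check whether these really are "just spikes" is to summarize the samples instead of eyeballing them. Below is a minimal sketch in Python, assuming the DAVG/cmd values have already been exported as a plain list of numbers in milliseconds. The function name `summarize_spikes` and the 20 ms default threshold are made up for illustration, not taken from any VMware guidance.

```python
def summarize_spikes(samples_ms, threshold_ms=20.0):
    """Return (fraction of samples above threshold, longest consecutive run above it).

    samples_ms   -- latency samples in milliseconds (assumed already extracted)
    threshold_ms -- illustrative cutoff; pick whatever value matters for the workload
    """
    above = 0      # total samples over the threshold
    longest = 0    # longest streak of consecutive samples over the threshold
    current = 0    # length of the streak currently in progress
    for value in samples_ms:
        if value > threshold_ms:
            above += 1
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    fraction = above / len(samples_ms) if samples_ms else 0.0
    return fraction, longest
```

If only a small fraction of samples is over the threshold and the longest run is 1 or 2 samples, the latency is probably isolated spikes. A long run would suggest sustained latency during that window.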

We are looking at replacing the Celerra next year and have had performance evaluations done by EMC and EqualLogic. The evaluations have been beneficial because they told us the total IO and throughput hitting the Celerra, and thus the number of spindles needed to handle our total IOs on the back end (around 1600). What they don't show is the latency. EMC told me the cost effectiveness of iSCSI throughput tops out at 60 MB/s. The SQL server I referenced above is pushing 36 MB/s by itself according to the analysis.

Are we reaching the limits of iSCSI? I found a VMware knowledge base article on troubleshooting storage issues, but it says that if the DAVG counter is over 5000 you will see iSCSI timeouts in the vmkernel log. We are not seeing that.

Are we being misled, and where do I go from here? I have 24 hours of ESXTOP and perfmon statistics that show the spikes and which VMs were the busiest, so I know I can find out which processes (backups, SQL jobs) were running at those times, but that doesn't help me decrease the latency or increase throughput. I want to try to fix this. Suggestions?
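
For matching the spikes in the 24-hour capture against backup and SQL job schedules, a small script can list the timestamps where device latency crosses a threshold. This is only a sketch. It assumes the ESXTOP data was captured in batch mode as a perfmon-style CSV with the timestamp in the first column, and that the DAVG/cmd columns contain the text "Average Driver MilliSec/Command" in their header. That column name is an assumption and should be checked against the real header row. The function name and threshold are illustrative.

```python
import csv
import io


def find_latency_spikes(csv_text,
                        column_hint="Average Driver MilliSec/Command",
                        threshold_ms=20.0):
    """List (timestamp, column name, value) for every sample above threshold_ms.

    column_hint is an ASSUMED substring of the DAVG/cmd column headers;
    adjust it to match the header row of the actual capture.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    # Indexes of every column whose header contains the hint
    wanted = [i for i, name in enumerate(header) if column_hint in name]
    spikes = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        timestamp = row[0]  # assumed: first column holds the sample time
        for i in wanted:
            try:
                value = float(row[i])
            except (ValueError, IndexError):
                continue  # skip missing or non-numeric cells
            if value > threshold_ms:
                spikes.append((timestamp, header[i], value))
    return spikes
```

Reading the file with `open(path, newline="").read()` and passing the text in gives a list of spike times that can be lined up against the job history.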


