VMware Cloud Community
BecaVince
Contributor

High Disk Queue Length on SQL2005

We have a virtual SQL 2005 server on Windows 2003 which is getting very high Average Disk Queue Lengths on the database drive. It can spike up to 450, and sit around 50-100 for several minutes. Just wondering if anyone has come across this and has any ideas.

Host Machine: HP Proliant BL480c G1, 8CPU E5320 @ 1.86Ghz, 32Gb RAM

Virtual Machine: 2003 Standard 32bit, 2vCPU, 3.5Gb RAM

Database drive is a SAN fibre-attached LUN, 3-disk RAID 5, 50Gb. Separate physical log and OS drives, which perform normally.

Database is 8Gb and no other databases or logs reside on this drive. It has a peak throughput of 10MB/s.

CPU and Memory usage low.

VMware Tools is installed.

Any suggestions appreciated

5 Replies
RParker
Immortal

First of all, you should look at the disk latency before wasting any time on average disk queue length in a SAN environment. But it really does depend on what you are trying to figure out. You'll see why in a moment.

The explanation of this counter (from Perfmon) is "Avg. Disk Queue Length is the average number of both read and write requests that were queued for the selected disk during the sample interval." This is either a physical disk or logical disk counter, so the number that you get depends on how the underlying storage is presented to the operating system.

If the Avg. Disk Queue Length says 10, then the OS queued 10 I/Os to the disk array. Theoretically, that could be one I/O to each of the 10 mirrored sets in the stripe, or it could have been 10 I/Os all to one disk. You have no way of knowing which it was.

This is where the missing information comes in. Let's assume your stripe size is 64KB, your write size is 8KB, and you're going to do a whole chunk of sequential writes. This is a typical scenario for SQL Server storage activity. In this case, there is a good chance that eight of the I/Os all went to the first disk, and the next two I/Os went to the next disk. So, if you're trying to work out the disk queue length per disk for this scenario, it is 8 for the first disk, 2 for the second disk, and 0 for the remaining eight disks in the array.


Now let's change the scenario to a theoretical stripe size of 8KB and a write block size of 64KB, and keep the disk queue length at 10. In this case, each 64KB block gets spread over 8 disks, so one I/O is written to 8 disks and the 10 queued I/Os are spread across 80 disk writes over all 10 disks in the array. If you're trying to calculate the disk queue length per disk in the array, it's going to be 8 for each disk in the array.
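The arithmetic in both scenarios can be sketched with a tiny model. This is purely illustrative: it assumes sequential writes, a simple round-robin stripe layout, and no mirroring; the function name and parameters are made up for this sketch.

```python
def per_disk_queue(stripe_kb, write_kb, n_writes, n_disks):
    """Distribute a run of sequential queued writes across a striped array.

    Returns a list of per-disk I/O counts. A hypothetical model of how an
    OS-level queue depth might map onto individual spindles, not a real tool.
    """
    counts = [0] * n_disks
    offset_kb = 0
    for _ in range(n_writes):
        # Each write touches every stripe unit it overlaps.
        first_unit = offset_kb // stripe_kb
        last_unit = (offset_kb + write_kb - 1) // stripe_kb
        for unit in range(first_unit, last_unit + 1):
            counts[unit % n_disks] += 1
        offset_kb += write_kb
    return counts

# Scenario 1: 64KB stripe, 8KB writes, 10 queued I/Os, 10 disks
print(per_disk_queue(64, 8, 10, 10))   # [8, 2, 0, 0, 0, 0, 0, 0, 0, 0]

# Scenario 2: 8KB stripe, 64KB writes, 10 queued I/Os, 10 disks
print(per_disk_queue(8, 64, 10, 10))   # [8, 8, 8, 8, 8, 8, 8, 8, 8, 8]
```

Both results match the figures above: an OS-level queue depth of 10 can mean anything from one hot disk with 8 outstanding I/Os to a perfectly even 8-per-disk spread.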


Let's be realistic and add another level of uncertainty into the picture. In most cases, your SAN storage is going to be connected to the server using one or more HBAs in the server, a number of fibers to connect the HBA to the SAN, a number of ports on the front end of the SAN, and perhaps some kind of fiber switch in the fabric between the server and the SAN. Then we get into the internal architecture of the buses inside the SAN, and how the disks are connected to the ports on the front of the SAN.


Any queuing you see reported in Perfmon can be a symptom of high latency or queuing at any of these points between where the OS measures its disk queue length and the surface of the disks. This is why you should look at the latency and base any decisions on that counter rather than the average disk queue length.

http://technet.microsoft.com/en-us/magazine/cc162466.aspx

RParker
Immortal

A statement was published many years ago that sustained disk queue length greater than 2 is an indication of an I/O bottleneck. This statement is still true if the application is not designed to handle the situation. SQL Server is designed to push disk queue lengths above 2 when it is appropriate.

SQL Server uses async I/O to help maximize resource usage. SQL Server understands that it can hand off an I/O request to the I/O subsystem and continue with other activity. Let's look at an example of this.

SQL Server checkpoint posts up to 100 I/O requests and monitors the I/O response time in order to properly throttle checkpoint impact. When the I/O response time exceeds the target, the number of outstanding I/Os is throttled. The disk queue length can easily exceed 2 and not be an indication of a subsystem problem. SQL Server is attempting to maximize the I/O channel.
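The checkpoint behaviour described above amounts to a feedback loop: post I/Os, watch response time, back off when it exceeds a target. A minimal sketch of that idea, assuming a made-up target latency and halve/increment policy (this is not SQL Server's actual algorithm or its real internal values):

```python
import time

TARGET_LATENCY_MS = 20   # hypothetical target; SQL Server's internal value differs
MAX_OUTSTANDING = 100    # checkpoint posts up to ~100 requests at a time

def checkpoint_writes(pages, issue_io):
    """Throttle the allowed number of in-flight checkpoint I/Os based on
    observed response time. `issue_io` is a stand-in for an async write call;
    this is an illustrative model, not SQL Server's implementation."""
    outstanding_cap = MAX_OUTSTANDING
    for page in pages:
        start = time.perf_counter()
        issue_io(page)
        latency_ms = (time.perf_counter() - start) * 1000
        if latency_ms > TARGET_LATENCY_MS:
            # Response time exceeded the target: back off sharply.
            outstanding_cap = max(1, outstanding_cap // 2)
        else:
            # Healthy response times: recover toward the maximum.
            outstanding_cap = min(MAX_OUTSTANDING, outstanding_cap + 1)
    return outstanding_cap
```

The point of the sketch is the shape of the loop: queue depth is something SQL Server deliberately drives high and then regulates by latency, which is exactly why queue depth alone is a poor bottleneck signal.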

SQL Server does read-ahead and tempdb spooling actions. The workers seldom directly wait for the I/O to complete. Instead, the workers continue with other activities, taking advantage of CPU processing power rather than waiting for the I/O response. You would not want SQL Server to read a page, add it to a sort, and wait on the sort-page write during a large sort. Instead, SQL Server posts the sort I/O, returns to read the next page, and continues sorting. The sort operations track the outstanding I/Os and check completion status at later intervals.
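The post-and-check-later pattern above can be sketched with a thread pool. The names and the 4-worker pool are arbitrary choices for this sketch; real SQL Server sort spills use its own async I/O machinery, not Python threads.

```python
from concurrent.futures import ThreadPoolExecutor

def sort_with_async_spill(chunks, write_chunk):
    """Sort chunks, posting each spill write asynchronously and continuing
    with the next chunk instead of waiting. `write_chunk` is a stand-in for
    an I/O call; illustrative only."""
    pending = []
    results = []
    with ThreadPoolExecutor(max_workers=4) as pool:
        for chunk in chunks:
            sorted_chunk = sorted(chunk)
            # Hand the write to the I/O subsystem and move on immediately.
            pending.append(pool.submit(write_chunk, sorted_chunk))
            results.append(sorted_chunk)
        # Check completion status later, in batch, not after each write.
        for fut in pending:
            fut.result()
    return results
```

While writes are in flight, the worker keeps sorting; the outstanding writes show up as queue depth even though nothing is actually blocked on them.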

The next time someone looks at the disk counters on the system and states that the disk queue length is above 2 and that this is a problem, take a minute to look at other data points. The key for SQL Server is the I/O response times. Start by looking at the Average Disk Seconds Per Transfer for the same I/O path. Then look at the SQL Server sysprocesses and waitstats for buffer-based I/O wait information. You want to see evidence from SQL Server itself that the disk queue length is related to an I/O bottleneck. Don't just take the depth of 2 at face value.

BecaVince
Contributor

Many thanks for your reply, some good reading there, though somewhat outside my core skillset.

The latency is also high: 400ms Average Disk sec/Transfer during heavy periods. I was reading it should be below 50ms.

Stripe size is 128KB, the default.

Maybe this vmware forum is now not the best place for this.

BecaVince
Contributor

I changed the array from a 3-disk RAID 5 to a 6-disk RAID 1 and the latency and queue length returned to normal.

I've read conflicting opinions on RAID configurations for SQL DBs, but this resolved the issue.

Thanks for your help RParker.

ChrisDearden
Expert

I've recently come across this issue in the physical world. Depending on the type of database you are running, best practice for SQL is RAID 1 storage for log files, tempdb, and sequential-write-heavy (OLTP-type) environments. If you are running a more query-based workload (OLAP), then RAID 5 seems to be the better option (the RAID 5 disk group on our EVAs is also much larger, so it actually runs faster than the RAID 1 group).

If this post has been useful, please consider awarding points. @chrisdearden http://jfvi.co.uk http://vsoup.net