Hello,
I'm not sure how many problems are lurking in my setup but I'll try to explain what I am experiencing.
I have a few ESX hosts with several VMs on them. (I'll update this post soon with all the hardware and VM specs)
After running for days, one of the VMs suddenly started showing a very high Avg. Disk Read Queue Length, in the thousands (~1000-4000), continuously. Total %PT is low and memory is fine. That VM was killing my application, which runs across several VMs: the other VMs access the poorly performing VM, so they end up spinning while they wait for data that takes forever to be read and transferred. That is the problem in a nutshell.
There are two disks on the ESX host; the poorly performing VM runs on one of them, and 3 other VMs run on the second disk. The 3 VMs are fine, so I initially thought it was a disk read head problem. I migrated the VM to another ESX host, but the problem persisted. I then noticed that the VM was first created with thin storage provisioning (256GB), so I migrated it once again to another ESX host, this time with thick provisioning (256GB), but the problem still persisted.
I ran the Windows disk fragmentation analysis within the VM's OS and it showed 5% fragmentation.
I'm really not sure what else to look at. Please ask me questions about the setup and I'll get the answers; I totally understand that you need the full picture.
Thanks in advance.
What storage are you running on: FC, iSCSI or NFS? What vendor and model? What version of ESX? What OS is the underperforming guest running, and what applications?
These are starter questions.
Have you looked at the Queue Length value before you noticed that the VM performed badly? That is, do you know whether this high value is new, or is it possible that it was like this before?
The value does seem very high, although the Disk Queue Length is not the most important value to look at, especially when you have multiple disks in some kind of RAID or when you are using virtual disk files.
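To put a queue length in the thousands into perspective, here is a rough sketch based on Little's law (average queue length = IO rate x average time per IO). The 200 reads per second used below is an assumed figure for illustration only, not something measured on the VM in question, and the function name is made up for this example.

```python
def implied_latency_seconds(avg_queue_length, ios_per_second):
    """Little's law rearranged: time per IO = queue length / IO rate."""
    if ios_per_second <= 0:
        raise ValueError("ios_per_second must be positive")
    return avg_queue_length / ios_per_second


# Queue length of 2000 (within the range reported in the first post)
# and an ASSUMED rate of 200 reads per second.
latency = implied_latency_seconds(2000, 200)
print(f"{latency:.1f} s per read")  # prints "10.0 s per read"
```

If the real read rate is anywhere near that, each read would be taking seconds rather than milliseconds, which is why the latency counters are usually the more telling thing to check.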
Often it is better to study the latency of the IOs and see the time it actually takes for a read or write to complete. Check Avg. Disk sec/Transfer for the overall average, and then Avg. Disk sec/Read and Avg. Disk sec/Write separately. You could also check the throughput you get and the average IO size. See this for the specific counters. Make sure you also view the values over some time, for example over four hours, to see how the system performs during work. Are there any peaks, or are the values similar throughout?
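If you export the counters to a log, something like the following sketch could summarise them over the sampling window. The dictionary keys and the sample numbers are made up for illustration; they stand in for Avg. Disk sec/Transfer, Disk Transfers/sec and Disk Bytes/sec, and you would have to feed in your own values.

```python
from statistics import mean


def summarise(samples):
    """Summarise disk counter samples taken over a period of time.

    Each sample is a dict with hypothetical keys:
      sec_per_transfer  (Avg. Disk sec/Transfer)
      transfers_per_sec (Disk Transfers/sec)
      bytes_per_sec     (Disk Bytes/sec)
    """
    latencies_ms = [s["sec_per_transfer"] * 1000 for s in samples]
    total_bytes = sum(s["bytes_per_sec"] for s in samples)
    total_ios = sum(s["transfers_per_sec"] for s in samples)
    # Average IO size = bytes moved / number of IOs over the same window.
    avg_io_kb = total_bytes / total_ios / 1024 if total_ios else 0.0
    return {
        "avg_latency_ms": mean(latencies_ms),
        "peak_latency_ms": max(latencies_ms),
        "avg_io_size_kb": avg_io_kb,
    }


# Made-up samples, NOT measurements from the system discussed here.
example = [
    {"sec_per_transfer": 0.010, "transfers_per_sec": 100, "bytes_per_sec": 6553600},
    {"sec_per_transfer": 0.030, "transfers_per_sec": 300, "bytes_per_sec": 19660800},
]
print(summarise(example))
```

A peak that sits far above the average suggests bursts during work, while a uniformly high average suggests the storage is slow all the time.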
I know this is a really old thread, but I wanted to post a possible fix in case someone is having the same issue. We ran into this same problem after we did a P-to-V migration of our ERP system. Right after the migration, we started running into application performance issues. When we checked the avg disk queue length, it was abnormally high. I'm guessing that if we had also checked read and write times, those would have been high too. In this case, the VM needed more RAM in the virtual environment than it did in the physical environment. As soon as we allocated the additional RAM to the VM, our disk queue length dropped back down to less than one and the application performance issues went away. Our ERP system uses a Unidata database for the back-end, so mileage may vary with this solution for other implementations. Just wanted to throw it out there as a possible solution.