Storage Queues and Performance

Version 11

    Introduction

    VMware recently published a paper titled Scalable Storage Performance that delivered a wealth of information on storage with respect to the ESX Server architecture. This paper contains details about the storage queues that are a mystery to many of VMware's customers and partners. I wanted to start a wiki article on some aspects of this paper that may be interesting to storage enthusiasts and performance freaks.

    Two Important Queues

    Let's use the following figure as a starting point for this discussion.

    [Figure: the ESX Server storage stack, showing the kernel queue above the device driver queue.]

    For the purposes of this paper, I'm going to call the two different queue types the "kernel queue" and the "device driver queue". The device driver queue is specified in the device itself and has historically been configured through Linux-like module commands in the console operating system. More on that in "Changing Queue Depth" below. The kernel queue should be thought of as infinitely long, for all practical purposes. Any time the device driver queue gets full, commands to the storage will queue up in the kernel.

    Note that each LUN gets its own queue. This means that when you change the queue depth in the device driver, you're changing the depth of many queues at once. The underlying device (HBA) also has a hard limit on the number of active commands it will allow at one time, which should be considered when setting queue depth. If your HBA can support only 2,000 active commands but it is addressing 40 LUNs, a specified queue depth of 64 won't allow that many commands to all LUNs, because 64 * 40 = 2,560, which is more than the 2,000-command maximum. In practice this is rarely a concern, though: it is uncommon for so many LUNs to be addressed simultaneously through a single HBA with that many outstanding commands issued to each.
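    The arithmetic above generalizes into a quick budget check. A minimal sketch, using the hypothetical numbers from this paragraph (2,000 HBA commands, 40 LUNs, a requested depth of 64):

```shell
# Hypothetical HBA command limit and LUN count from the example above.
hba_max_cmds=2000
lun_count=40
queue_depth=64

# Worst case: every LUN fills its device driver queue at the same time.
requested=$((queue_depth * lun_count))

# Largest per-LUN depth the HBA could honor for all LUNs simultaneously.
safe_depth=$((hba_max_cmds / lun_count))

echo "Worst-case outstanding commands: $requested"   # 2560, over the 2000 limit
echo "Depth the HBA can always honor:  $safe_depth"  # 50
```

    As the text notes, all LUNs rarely saturate at once, so the "safe" depth is a floor for guaranteed concurrency, not a hard requirement.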

    Device Driver Queue Function

    The device driver queue is used for low-level interaction with the storage device. It controls how many active, or "in flight", commands there can be at any one time. This is effectively the concurrency of the storage stack. Set the device queue to 1 and each storage command becomes sequential: each one must complete before the next starts.

    But if the device queue is left at its default of 32, as an example, up to 32 commands will be processed concurrently by the storage system. All 32 will be shipped off to the storage device by the kernel, and a new command is issued each time a completion arrives.

    Kernel Queue Function

    The kernel queue can be thought of as a kind of overflow queue for the device driver queues. But it's not just an overflow queue. ESX Server contains all kinds of cool optimizations to get the most out of your storage, and these features apply only to commands in the kernel queue. Here are some examples of features provided to commands held in the kernel queue:

    1. Multi-pathing for failover and load balancing.
    2. Prioritization of storage activities based on VM and cluster shares.
    3. Optimizations to improve efficiency for long sequential operations.

    There are others, as well.

    Impacts of Queue Depths

    So, increasing the queue depth in the device driver can greatly improve storage performance at the device level. Decreasing the device driver queue depth will increase usage of the kernel queues. This decreases device efficiency, but introduces opportunities for optimizations across multiple VMs and devices. So, what's the right balance between these two depths? We think the sweet spot lies at a device driver queue depth of 32. That's why we've set 32 as the default device driver queue length.

    But your configuration and workloads may benefit from a change to this default queue depth. I'll refer you to the aforementioned storage paper for information on when you might want to change the driver queue depth. I'll just point out a couple of broad observations here:

    • With a few very high-I/O VMs on a host, larger queues at the device driver will improve performance.
    • As the VM count grows and storage performance features--like shares, load balancing, and failover--become more important, the default queue depth is best.
    • If too many servers each use overly large device queues, your storage array can easily be overloaded and its performance will suffer.
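    The last point is easy to quantify with back-of-the-envelope arithmetic. A minimal sketch using entirely hypothetical numbers (an array port that can service 4,096 concurrent commands, shared by ten hosts):

```shell
# All numbers here are hypothetical, for illustration only.
array_port_max=4096   # concurrent commands the array port can service
hosts=10              # servers sharing the port
luns_per_host=20
queue_depth=64        # per-LUN device driver queue depth on each host

# Worst case: every host fills every LUN queue at the same time.
total=$((hosts * luns_per_host * queue_depth))

echo "Worst-case outstanding commands at the array: $total"
if [ "$total" -gt "$array_port_max" ]; then
  echo "Over budget by $((total - array_port_max)) commands"
fi
```

    Even though the worst case rarely happens, a budget that can be exceeded by a factor of three is a sign the per-host depths are too large for the shared array.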

    Improving Storage Performance

    Now that we've covered how storage queuing works, you may be wondering how you can monkey around with these queue sizes for optimal performance. I can tell you, as someone who has been involved with many, many performance analysis projects, that changing queue size is rarely the fix for an acute storage performance problem. You should first go through the analysis techniques in Storage Performance Analysis and Monitoring. That may or may not lead to changing queue depths.

    But, in the event that you do end up changing queue depths...

    Changing Queue Depth

    We have a helpful knowledge base article that describes the process of changing the device driver queue on ESX. For ESXi, you will need to modify the queue using the vMA. First find the HBA module name (as the first command below does), then change the queue depth for the matching module name using the second command:

    esxcfg-module -list | grep qla
    vicfg-module -s ql2xmaxqdepth=64 <module_name>