vStorage APIs for Array Integration (VAAI)

Version 6


    Starting with vSphere 4.1, a new set of vStorage APIs has been added: the vStorage APIs for Array Integration (VAAI).

    For the other vStorage APIs, see this diagram:


    VAAI can offload specific storage operations to compliant storage hardware, resulting in less CPU, memory, and storage fabric bandwidth consumption.

    From the previous diagram, you can notice that the APIs are on the storage side, so you must have a storage array compliant with them (or a new storage firmware that is compliant).

    These APIs introduce different primitives (as well documented in Mike Laverick's article on TechTarget):

    Currently only the first three primitives have been implemented. Note that the Write Zero and Full Copy APIs can bring performance improvements, while the Locking API is designed to increase scalability.


    See also:

    Overview - http://www.vmware.com/products/vstorage-apis-for-array-integration/overview.html

    How vStorage APIs for Array Integration change the way storage is handled - http://blogs.vmware.com/kb/2010/11/how-vstorage-apis-for-array-integration-change-the-way-storage-is-handled.html

    Great post - http://virtualgeek.typepad.com/virtual_geek/2010/07/vsphere-41---what-do-the-vstorage-apis-for-array-integration-mean-to-you.html

    VMware VAAI pros and cons and the hidden fourth primitive - http://searchvmware.techtarget.com/tip/0,289483,sid179_gci1516821,00.html

    VMware vSphere 4.1 vStorage APIs for Array Integration (VAAI) understanding - http://geeksilver.wordpress.com/2010/08/02/vmware-vsphere-4-1-vstorage-apis-for-array-integration-vaai-understanding/




    Enable VAAI

    VAAI requires vSphere 4.1 (Enterprise or Enterprise Plus edition: http://www.vmware.com/products/vsphere/buy/editions_comparison.html) and a block storage array (NFS is currently not supported) that supports storage-based hardware acceleration.

    For the storage compatibility list, you can use:


    If both requirements are met, then VAAI is enabled by default and can be disabled by setting these advanced settings to 0:

    • DataMover/HardwareAcceleratedMove - for the Full Copy API

    • DataMover/HardwareAcceleratedInit - for the Write Zero API

    • VMFS3/HardwareAcceleratedLocking - for the Locking API
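    A sketch of how the settings above can be queried and changed from the ESX console with esxcfg-advcfg (assuming the vSphere 4.1 command syntax; no reboot is required):

```shell
# Check the current value of each VAAI-related advanced setting (1 = enabled)
esxcfg-advcfg -g /DataMover/HardwareAcceleratedMove     # Full Copy
esxcfg-advcfg -g /DataMover/HardwareAcceleratedInit     # Write Zero
esxcfg-advcfg -g /VMFS3/HardwareAcceleratedLocking      # Locking

# Disable a primitive by setting its value to 0 (here, Full Copy)
esxcfg-advcfg -s 0 /DataMover/HardwareAcceleratedMove

# Re-enable it
esxcfg-advcfg -s 1 /DataMover/HardwareAcceleratedMove
```

    The same settings are also reachable from the vSphere Client under Host > Configuration > Advanced Settings.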


    To see if VAAI is working, you can use the GUI (Host > Configuration > Storage; the Hardware Acceleration status is shown on the right side of the right panel) or use this command:

    esxcfg-scsidevs -l | egrep "Display Name:|VAAI Status:"



    The status values are Unknown, Supported, and Not Supported. The initial value is Unknown. The status changes to Supported after the host successfully performs the basic offload operations. If the offload operation fails, the status changes to Not Supported.

    This is very important: the status changes after you perform the first operation, not after some time!


    To determine whether your storage device supports VAAI, you need to test the basic operations.

    To easily test the basic operations, use the vSphere Client and browse the datastore. Copy and paste a virtual disk of at least 4 MB (one that is not in use). The Hardware Acceleration status changes to Supported or Not Supported.

    Creating a virtual machine with at least one vDisk or cloning a virtual machine also tests the basic operations.
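    The same check can also be done from the console: a small clone made with vmkfstools is enough to trigger the offload and update the status. A sketch, where the datastore and VM paths are placeholders for your own:

```shell
# Clone a small, unused virtual disk on the same datastore;
# this exercises the Full Copy offload and updates the VAAI status
vmkfstools -i /vmfs/volumes/myDatastore/testvm/testvm.vmdk \
           /vmfs/volumes/myDatastore/testvm/testvm-clone.vmdk

# Then check the status again
esxcfg-scsidevs -l | egrep "Display Name:|VAAI Status:"
```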


    Is this a datastore property or is it related to the host? From some tests, it seems that you have to perform at least one operation from each host, so it is host related.


    See also:

    vStorage APIs for Array Integration FAQ - http://kb.vmware.com/kb/1021976




    Full Copy improvements

    With this primitive (also called SAN Data Copy Offload), operations like VM cloning, deploying from template, and Storage vMotion can be improved:

    • reduce the time for many VMware storage-related tasks by 25% or more (in some cases up to 10x)

    • reduce the amount of CPU load on the ESX host and the array by 50% or more (in some cases up to 10x)

    • reduce the traffic on the storage connections by 99% for those tasks.


    Most people may consider the reduced time for each operation the most attractive aspect, but I think the other two aspects are more interesting.

    The time reduction is usually around 25%, so not very much... your life does not change because of it.

    But the CPU and storage I/O reductions are really cool: you save host resources, and (IMHO) this could be even more interesting on iSCSI-based storage with a software initiator, where the storage I/O traffic also has an impact on the CPU.


    Note that this primitive can also handle vmdk type conversion in the right way... I've tested some conversions during cloning and Storage vMotion, and VAAI works fine.


    Simple test

    I've done a simple test with an old EqualLogic (PS5000X) array with the new 5.0.2 firmware and, of course, with vSphere 4.1.

    I freed a host (just to have a simple way to see the load difference) and performed a simple clone operation of a (powered off) VM on the same datastore.

    This task took 6' 14" with VAAI and 8' 14" without it... so a "simple" time reduction of about 24%.


    More interesting is the benefit on the CPU side (in my case I used a software iSCSI initiator with jumbo frames).

    These are the esxtop values during the task using VAAI:

    PCPU USED(%): 2.0 0.1 0.2 0.1 0.1 0.1 0.0 0.0 AVG: 0.3 
    PCPU UTIL(%): 2.2 0.1 0.3 0.2 0.1 0.2 0.1 0.1 AVG: 0.4


    These are the values without VAAI:

    PCPU USED(%):  18  15  16  16  18  17  15  16 AVG:  16 
    PCPU UTIL(%):  22  20  21  21  22  22  19  20 AVG:  21


    You can save a lot of CPU power (in my case a 1:10 ratio!) and this means more resources for VMs (or for intra-vSwitch traffic, or for software iSCSI).


    But the best part is the storage I/O traffic on the host side. In my case I monitored the iSCSI NICs and the iSCSI vmhba (the task without VAAI is around 20:05 and the one with VAAI is after 20:25... do not consider the peak between 20:15 and 20:25, because it is an interrupted task):

    In this case the I/O saving is quite impressive... this means more bandwidth, but also less latency...


    Similar improvements occur during Storage vMotion or copies across datastores (with the same block size)... in those cases the time reduction is near 40%.




    Write Zero improvements

    This primitive (also called SAN Zero Offload) enables storage arrays to zero out a large number of blocks. For example, it can be used during the initialization of a vmdk for clustering or VMware FT (an eagerzeroedthick disk), which is usually really time consuming.

    With VAAI this can be done by the storage array, which can save a lot of time and some CPU and storage I/O resources.


    It is not clear whether the conversion of a vmdk to enable FT works with this primitive... On some blogs it is written that the "eagerzeroedthick" conversion is done by Full Copy, but I think it is correct to assume that the right primitive is Write Zero.


    Simple test

    In this case the test was very simple: just the creation of a new VM with a single vmdk (5 GB) enabled for FT/clustering.

    The time reduction was from 26" (without VAAI) to 8" (with VAAI).

    In this case a real and great time improvement!
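    The same eagerzeroedthick disk can also be created directly from the console with vmkfstools (the datastore path below is a placeholder); with VAAI the zeroing should be offloaded to the array:

```shell
# Create a 5 GB eagerzeroedthick vmdk (the format required by FT/clustering);
# with the Write Zero primitive the zeroing is performed by the storage array
vmkfstools -c 5g -d eagerzeroedthick /vmfs/volumes/myDatastore/testvm/test-ft.vmdk
```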

    Regarding NIC and storage I/O savings, the results are quite similar to the previous test, except for the write latency which, without VAAI, was very high (max 171 ms!), probably because of the RAID level used (RAID5); but this also confirms that this kind of task is not only time consuming but also I/O consuming!




    Locking improvements

    This primitive (also called Scalable Lock Management or Hardware Assisted Locking) is designed to improve the scalability of the VMFS locking mechanisms. In some cases it can also revolutionize LUN design (for example, because more VMs can be put in the same LUN). But it also depends on the type of storage... so check how your storage can benefit from this kind of API.

    The following are examples of VMFS operations that require locking metadata:

    • Creating a VMFS datastore

    • Expanding a VMFS datastore onto additional extents

    • Powering on a virtual machine

    • Acquiring a lock on a file

    • Creating or deleting a file

    • Creating a template

    • Deploying a virtual machine from a template

    • Creating a new virtual machine

    • Migrating a virtual machine with VMotion

    • Growing a file, for example, a Snapshot file or a thin provisioned Virtual Disk

    A simple way to see this improvement could be to test simultaneously powering on many virtual machines, but due to lack of time I have not run this kind of test.




    VAAI and Equallogic

    Equallogic VAAI Demo


    How VAAI Helps Equallogic - http://www.2vcps.com/2010/10/07/how-vaai-helps-equallogic/

    Equallogic, VAAI and the Fear of Queues - http://www.2vcps.com/2010/10/29/equallogic-vaai-and-the-fear-of-queues/

    Equallogic: Firmware 5 vStorage APIs - http://marcmalotke.net/2010/07/02/equallogic-firmware-5-vstorage-apis/

    Equallogic Firmware 5.0.2, MEM, VAAI and ESX Storage Hardware Accleration - http://www.modelcar.hk/?p=2771


    Note that some people have reported great improvements with their EqualLogic storage, more than 25% in copy operations.

    This difference may depend on the RAID level, but also on the storage CPU and memory.




    This new feature could be very interesting for improving the performance and scalability of your environment.

    IMHO the best parts are the CPU and I/O reduction (which can save a lot of resources) and the locking improvement (which is not really visible, but does its job behind the scenes). The time saving, in most cases, is not that much, and is just a side effect.


    Do LUN design best practices change with the new VAAI? Does it allow you to have one huge volume vs. the standard recommendation of more smaller volumes while still maintaining the same performance? At least on EqualLogic, the answer is NO... and there is a good explanation in the notes of http://www.2vcps.com/2010/10/07/how-vaai-helps-equallogic/


    And what about snapshots? Can VAAI also improve them? It seems so, but in this case we are talking about storage snapshots... From http://blogs.vmware.com/kb/2010/11/how-vstorage-apis-for-array-integration-change-the-way-storage-is-handled.html the sentence is "VAAI uses a modified version of the SCSI EXTENDED COPY command to initiate cloning of LUNs or sub-LUNs. This offloads the task for de-duplication and with snapshot-capable storage because the hardware can start by using proprietary mechanisms to mark cloned destination extents as duplicates of source extents."

    If this is true, it can really change a lot of things... for example, VMware View Composer could be replaced by this storage feature.

    But even without this storage snapshot feature, View can get some improvement from VAAI, for example during pool deployment.


    One curious question is whether storage vendor tools could use VAAI without going through vCenter's VAAI... It sounds strange... but it could be a way to extend some VAAI features (like pool deployment) to the other vSphere editions (other than Enterprise and Enterprise Plus). Theoretically it could be possible... For EqualLogic I cannot say more until the new versions of ASM/VE and HIT for VMware are released.





    Sometimes the task starts with VAAI disabled... I cannot reproduce this issue... but this is the reason for the "false peak" in my diagram. Maybe it is only related to the continuous changes in the advanced values... but it must be investigated.

    The strange thing is that the datastore status says Enabled, and this command also confirms that the VAAI filter is loaded:

    esxcli corestorage plugin list --plugin-class=Filter
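    Besides the Filter class, the VAAI plugin class can be listed the same way (a sketch, assuming the vSphere 4.1 esxcli syntax):

```shell
# List the VAAI plugins claiming the devices (vSphere 4.1 esxcli syntax)
esxcli corestorage plugin list --plugin-class=VAAI
```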


    A bigger problem is instead related to the storage load... If you offload some I/O operations from the hosts to the storage array, there is (or there could be) no way for the hosts and vCenter to use features like Storage I/O Control (SIOC). I/O control must be handled in the right way by the storage array (or must be reported by the storage array to the vCenter Server).

    I've done a simple test... initialization of a 40 GB "FT" vmdk (with VAAI), and during this task a simple dd inside a Linux guest VM (on the same host and datastore) to perform a 1 GB write operation. The result was really not good... I got 69" (yes, seconds!) of latency on the write operation!
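    The guest-side write test was a plain dd; a sketch of it (the file name is illustrative and the exact flags are my reconstruction: conv=fdatasync forces the data to disk, so the elapsed time reflects the real storage write latency rather than the guest page cache):

```shell
# Write 1 GB of zeros from inside the Linux guest, syncing data to disk
# before dd exits so the elapsed time includes the real write latency
time dd if=/dev/zero of=/tmp/vaai-write-test.bin bs=1M count=1024 conv=fdatasync
```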

    This means that VAAI must be improved on the storage side to avoid putting hidden load on it.

    Note that I got this problem only with the Write Zero primitive... with Full Copy (repeating the test during a cloning task) the dd result was about the same (slightly worse) as without the running task.


    Another big problem that I have to investigate is VAAI with FT enablement... I had a task running for 3 hours at 41% (Configuring primary VM for Fault Tolerance: scrubbing the VM's disk to make it thick-eagerzeroed and setting other VM configurations)... The only solution was to restart the vCenter Server to remove this task... This also has to be investigated.

    This could be a serious issue in the Write Zero implementation (probably in the EqualLogic firmware), or simply a problem with my environment (I have several other EqualLogic plugins enabled, some also in beta).

    The most important thing is that this problem does not exist if the vmdk is already in the right format. In that case FT starts normally.


    Another improvement could be to keep the VAAI status on the vCenter side, not on the host side... so it could become a datastore property, global for all hosts.

    And also, why not implement a simple (and maybe automatic) test procedure to check the status (instead of waiting for the first operation)? But probably, for new datastores, this is not a problem (I have not tested it, but the datastore initialization process could probably be enough to check the status).


    The block size problem

    As documented in the "vStorage APIs for Array Integration FAQ" (KB 1021976), one case where hardware offload is not used is when:

    The source and destination VMFS volumes have different block sizes

    And this is true: a clone or a Storage vMotion across datastores with different block sizes will work in the "old" way, without the Full Copy primitive.

    So the idea could be to standardize on a single block size for all the datastores. But which size? Some notes are also available in the VMFS block size document.

    Usually performance does not change with different block sizes, and space is not wasted too much with a large block size. So it seems that the maximum size (8 MB) could be the best solution?!
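    Note that the VMFS3 block size is fixed at datastore creation time, so standardizing it means (re)creating the datastore. A hedged sketch with vmkfstools, where the datastore name and the device partition path are placeholders for your own LUN:

```shell
# Create a VMFS3 datastore with an 8 MB block size; the block size
# cannot be changed later without recreating the datastore
vmkfstools -C vmfs3 -b 8m -S myDatastore /vmfs/devices/disks/naa.xxxxxxxxxxxxxxxx:1
```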

    But does the Full Copy primitive depend on the block size? It may seem strange, but the performance improvements do seem to depend on it.

    I've tested the same copy operation (with VAAI) with block sizes of 1 MB and 2 MB, and it seems that on my storage, with the smaller blocks, the overall time is reduced by 37%.

    Could this depend on the block size (or the chunk size) on the storage side, or more on the storage CPU and cache?

    Currently I cannot run a similar test on a different model (maybe with more storage cache), so this may not be an objective value.