1 Reply Latest reply on Aug 15, 2017 2:50 AM by operando

    Poor performance w/ SATA disks & LSI SAS HBA (SOLVED - Finally!)

    GoC_Dave Lurker

      I just solved a long-standing storage performance issue when using cheap consumer SATA disks for ESXi 5.5.0 datastores through an LSI 9201-16i SAS HBA. Hopefully this helps somebody else.

      Symptoms:




      • Sudden, extreme disk latency during I/O-heavy operations, like:
        • Copying a large file in a VM with a freshly created thick lazy-zeroed VMDK
        • Creating thick eager-zeroed VMDKs
        • Creating storage pools in Server 2012's "Storage Spaces" feature
        • SMB shares under heavy write load would "disappear"
        • Windows resource monitor reporting 100% Disk Active Time but zero MB/sec
        • Using SSH/SCP to copy files to datastores


      • Disk I/O errors and degraded performance messages in /var/log/vmkernel.log
      • Disks will "disappear" completely from ESXi during high I/O, then eventually re-appear when the I/O stops
      • Only occurs with cheap SATA spinning disks (not SSDs or enterprise SAS)
      • The same disks work fine when connected to onboard AHCI SATA (e.g. Intel ICH), but choke when connected via the LSI HBA.
      • Controller and disks work fine with a non-ESXi OS (e.g. Windows Server) on the bare metal.


      Finally after a lot of pain I discovered how to fix it. As with so many things in IT, when you find the root cause it's very satisfying.


      Root Cause:


      • The VAAI (vStorage APIs for Array Integration) storage acceleration feature in ESXi uses a special SCSI command, 0x93 WRITE_SAME, to zero out blocks.
      • Cheap SATA disks often do not support WRITE_SAME.
      • When a 0x93 WRITE_SAME command hits such a SATA disk, the disk hiccups and flushes its buffer, which causes a huge latency spike.
      • For whatever reason, SAS HBAs pass the 0x93s through to the disks (which then start choking), but AHCI SATA controllers do not. (Not sure why.)
      • If the 0x93s come fast & heavy, the disk will "disappear" momentarily from ESXi, and eventually be discovered again when the 0x93s stop.
      • I suspect if the disks were connected with hardware RAID, the controller might think they're "failed" and start populating a hot spare.
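
      If you want to check whether your host is hitting the same thing, one rough approach is to count the lines in /var/log/vmkernel.log that mention opcode 0x93, grouped by device. This is only a sketch. It assumes the opcode is logged as a bare hex token and that the device appears as dev "...", which may not match every ESXi build. The function name is made up for illustration.

```python
import re
from collections import Counter

# Assumption: the opcode is logged as a bare hex token such as "0x93".
OPCODE = re.compile(r"\b0x93\b")
# Assumption: the device id is logged in the form: dev "naa.xxxx"
DEVICE = re.compile(r'dev "([^"]+)"')


def count_write_same_lines(lines):
    """Return a Counter mapping device id -> number of lines mentioning 0x93."""
    hits = Counter()
    for line in lines:
        if OPCODE.search(line):
            match = DEVICE.search(line)
            # Lines without a recognisable device id are grouped as "unknown".
            hits[match.group(1) if match else "unknown"] += 1
    return hits


# Example use on a log copied off the host (path taken from the post):
#   with open("/var/log/vmkernel.log", errors="replace") as log:
#       for device, count in count_write_same_lines(log).most_common():
#           print(device, count)
```

      A pile of hits against the cheap SATA disks and none against the SSDs or SAS disks would be consistent with the root cause above, but it is not proof on its own.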




      Solution:


      Disable VAAI on the ESXi host under Configuration -> Advanced Settings:


      • Set DataMover -> DataMover.HardwareAcceleratedInit = 0
      • Set DataMover -> DataMover.HardwareAcceleratedMove = 0
      • Set VMFS3 -> VMFS3.HardwareAcceleratedLocking = 0


      Depending on your setup, you may only need to change some of these settings. A reboot of the host is not required. See KB1033665.
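
      The same three settings can also be changed from the ESXi shell instead of the GUI. Here is a small sketch that just prints the esxcli command lines for them, so you can review them before running anything. The option paths are the settings above written in esxcli form, and the command syntax is what I understand KB1033665 to describe, so verify it against your ESXi version. The helper name vaai_commands is made up for illustration.

```python
# The three advanced options from the post, written as esxcli option paths.
VAAI_OPTIONS = [
    "/DataMover/HardwareAcceleratedInit",   # block zeroing (WRITE_SAME)
    "/DataMover/HardwareAcceleratedMove",   # full copy offload
    "/VMFS3/HardwareAcceleratedLocking",    # atomic test-and-set locking
]


def vaai_commands(value=0, options=VAAI_OPTIONS):
    """Build one esxcli command line per option (0 disables, 1 re-enables)."""
    if value not in (0, 1):
        raise ValueError("value must be 0 or 1")
    return [
        f"esxcli system settings advanced set --int-value {value} --option {opt}"
        for opt in options
    ]


# Print the commands for review; nothing is executed here.
for command in vaai_commands():
    print(command)
```

      Since the root cause is WRITE_SAME, HardwareAcceleratedInit is the setting most directly tied to it. Passing a shorter options list lets you try that one alone first, and vaai_commands(1) gives you the lines to turn everything back on.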


      Hopefully somebody will find this useful - Or perhaps somebody will tell me something I missed...