VMware Cloud Community
GoC_Dave
Contributor

Poor performance w/ SATA disks & LSI SAS HBA (SOLVED - Finally!)

I just solved a long-standing storage performance issue when using cheap consumer SATA disks for ESXi 5.5.0 datastores through an LSI 9201-16i SAS HBA. Hopefully this helps somebody else.

Symptoms:

  • Sudden, extreme disk latency during I/O-heavy operations, such as:
    • Copying a large file in a VM with a freshly created thick provision lazy-zeroed VMDK
    • Creating thick provision eager-zeroed VMDKs
    • Creating storage pools with the "Storage Spaces" feature in Server 2012
    • Writing heavily to SMB shares, which would then "disappear"
    • Copying files to datastores over SSH/SCP
  • Windows Resource Monitor reporting 100% Disk Active Time but 0 MB/sec during the stalls

  • Disk I/O errors and degraded performance messages in /var/log/vmkernel.log
  • Disks "disappear" completely from ESXi during heavy I/O, then reappear once the I/O stops
  • Only occurs with cheap spinning SATA disks (not SSDs or enterprise SAS disks)
  • The same disks work fine when connected to onboard AHCI SATA (e.g. Intel ICH), but choke when connected via the LSI HBA
  • The controller and disks work fine under a non-ESXi OS (e.g. Windows Server) on bare metal
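
To reproduce the stalls repeatably, something like the following Python sketch can be run inside a guest whose virtual disk sits on the suspect datastore. It is an illustration, not a VMware tool: it writes a file in 1 MB chunks, forces each chunk to disk with fsync, and records how long each one took. The 500 ms stall threshold, the chunk size and the function name are arbitrary assumptions; adjust them to taste.

```python
import os
import sys
import tempfile
import time


def probe_write_latency(path, total_mb=64, chunk_mb=1, stall_ms=500.0):
    """Write total_mb of data in chunk_mb pieces, fsync after each piece, and
    return (latencies_ms, stall_indices), where stall_indices lists the chunks
    that took longer than stall_ms (an arbitrary threshold) to complete."""
    # Non-zero fill pattern, so the test is not writing all-zero blocks.
    chunk = b"\xa5" * (chunk_mb * 1024 * 1024)
    latencies_ms = []
    with open(path, "wb") as f:
        for _ in range(total_mb // chunk_mb):
            start = time.perf_counter()
            f.write(chunk)
            f.flush()             # push Python's buffer to the OS
            os.fsync(f.fileno())  # ask the OS to push it to the device
            latencies_ms.append((time.perf_counter() - start) * 1000.0)
    stalls = [i for i, ms in enumerate(latencies_ms) if ms > stall_ms]
    return latencies_ms, stalls


if __name__ == "__main__":
    # Pass a path on the suspect disk; the temp-dir default is only a placeholder.
    target = sys.argv[1] if len(sys.argv) > 1 else os.path.join(
        tempfile.gettempdir(), "latency_probe.bin")
    lat, stalls = probe_write_latency(target)
    print("chunks: %d  max: %.1f ms  chunks over threshold: %d"
          % (len(lat), max(lat), len(stalls)))
    os.remove(target)
```

Once a region of a lazy-zeroed VMDK has been written, ESXi no longer needs to zero it, so a repeat run over the same file may look healthy; a freshly created disk is the best place to see the problem.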

Finally, after a lot of pain, I discovered how to fix it. As with so many things in IT, finding the root cause is very satisfying.

Root Cause:

  • The VAAI (vStorage APIs for Array Integration) storage acceleration feature in ESXi uses the SCSI command WRITE_SAME (opcode 0x93, i.e. WRITE SAME(16)) to zero blocks on the storage device.
  • Cheap SATA disks often do not support WRITE_SAME.
  • When a 0x93 WRITE_SAME command reaches the SATA disk, it hiccups, flushes its buffer, and stalls with huge latency.
  • For whatever reason, the SAS HBA passes the 0x93s through to the disks (which choke on them), while the AHCI SATA controller does not.
  • If the 0x93s arrive fast and heavy, the disk "disappears" momentarily from ESXi and is rediscovered once the 0x93s stop.
  • I suspect that if the disks were behind a hardware RAID controller, it might mark them as failed and start rebuilding onto a hot spare.
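
VAAI actually consists of several block primitives, each carried by its own SCSI command and each switched off by its own ESXi advanced option. The Python table below is a small sketch of that mapping for the three options used in the fix, handy for turning an opcode such as the 0x93 above into the option to change. The opcode values come from the T10 SCSI command sets and the option names from KB1033665; the table and helper names are hypothetical.

```python
# Hypothetical lookup table: SCSI opcode -> (command, VAAI primitive, ESXi option).
# Opcodes per the T10 SBC/SPC command sets; option names per VMware KB1033665.
VAAI_PRIMITIVES = {
    0x93: ("WRITE SAME(16)", "Block Zeroing", "/DataMover/HardwareAcceleratedInit"),
    0x83: ("EXTENDED COPY", "Full Copy (XCOPY)", "/DataMover/HardwareAcceleratedMove"),
    0x89: ("COMPARE AND WRITE", "Atomic Test & Set (ATS)", "/VMFS3/HardwareAcceleratedLocking"),
}


def describe_opcode(opcode):
    """Accept an int or a hex string such as "0x93" and name the option that disables it."""
    if isinstance(opcode, str):
        opcode = int(opcode, 16)  # int() accepts the "0x" prefix with base 16
    entry = VAAI_PRIMITIVES.get(opcode)
    if entry is None:
        return "0x%02X: not one of the VAAI primitives listed here" % opcode
    command, primitive, option = entry
    return "0x%02X %s (%s): set %s = 0 to disable" % (opcode, command, primitive, option)


if __name__ == "__main__":
    print(describe_opcode("0x93"))
```

For the failure described here only the WRITE SAME row matters, which may be why disabling HardwareAcceleratedInit alone is enough on some setups.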

Solution:

Disable VAAI on the ESXi host under Configuration -> Software -> Advanced Settings in the vSphere Client:

  • Set DataMover -> DataMover.HardwareAcceleratedInit = 0
  • Set DataMover -> DataMover.HardwareAcceleratedMove = 0
  • Set VMFS3 -> VMFS3.HardwareAcceleratedLocking = 0

Depending on your setup, only some of these may need to be changed. A host reboot is not required. See VMware KB1033665.
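
The same options can also be changed from the ESXi Shell or over SSH with `esxcli system settings advanced set`, and checked with `esxcli system settings advanced list`. The Python sketch below only prints those commands by default; the `--apply` flag (run them via subprocess) and the `--enable` flag (set the options back to 1) are hypothetical conveniences of this script, not part of ESXi, and `--apply` assumes `esxcli` is on the PATH.

```python
import subprocess
import sys

# The three advanced options from the solution, in esxcli's path notation.
VAAI_OPTIONS = (
    "/DataMover/HardwareAcceleratedInit",
    "/DataMover/HardwareAcceleratedMove",
    "/VMFS3/HardwareAcceleratedLocking",
)


def build_set_commands(value, options=VAAI_OPTIONS):
    """esxcli argument lists that set each option to value (0 = off, 1 = on)."""
    if value not in (0, 1):
        raise ValueError("value must be 0 or 1")
    return [["esxcli", "system", "settings", "advanced", "set",
             "--int-value", str(value), "--option", opt] for opt in options]


def build_list_commands(options=VAAI_OPTIONS):
    """esxcli argument lists that show each option's current value."""
    return [["esxcli", "system", "settings", "advanced", "list",
             "--option", opt] for opt in options]


def main(argv):
    value = 1 if "--enable" in argv else 0
    for cmd in build_set_commands(value) + build_list_commands():
        print(" ".join(cmd))
        if "--apply" in argv:
            # Hypothetical --apply flag: actually run the command on this host.
            subprocess.check_call(cmd)


if __name__ == "__main__":
    main(sys.argv[1:])
```

Run with no arguments, it prints three `set` commands followed by three `list` commands, which can be pasted into an SSH session on the host.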

Hopefully somebody will find this useful, or perhaps somebody will tell me something I missed...

Regards,

Dave

1 Reply
operando
Enthusiast

Thanks for the info!

I've got the same issue with the Micron 5100 MAX, an enterprise-grade SSD listed on the HCL for my SAS3008-based HBA. It doesn't disappear, but zeroing through VAAI is 4 times slower than with HardwareAcceleratedInit=0, and other workloads suffer from I/O delays.
