We just set up a 3-node cluster to pilot vSAN for an upcoming project. We are seeing disk latency routinely spike to 70 ms with no real I/O load on the hosts, and spike to 300 ms when we place an I/O load of about 500 IOPS on them.
We are using three Dell R720s, each with 7 x 280 GB HDDs and 1 x 179 GB SSD. We are using dedicated 10 Gb/s network cards connected to Cisco Nexus switches for the vSAN connectivity.
Has anyone else run into this latency issue / found a solution?
Hi Friend,
Which disk controller are you using?
The blog post below gives insight into queue depths across different vendors' disk controllers.
http://www.yellow-bricks.com/2014/04/17/disk-controller-features-and-queue-depth/
It's a PERC H310.
Are the drives SATA?
I'm running a lab on a C6105 w/ whatever terrible SATA controller comes with it and SATAII drives. I have a queue depth of 31 for the entire onboard HBA, and none of the gear is on the HCL.
However, the specs may be similar to yours. I'm nowhere near fully utilizing the NICs (and I'm only running 1GbE), and I'm seeing the same thing as you: latency spikes when I exceed 500 IOPS. I've looked at everything and I'm fairly sure I'm saturating the controllers.
I think you should look at upgrading your HBA. Keep in mind that SATA drives with native command queuing only have a queue depth of 32 per drive as well.
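A quick back-of-the-envelope check (not from the thread, just Little's Law) shows why a shallow shared queue would produce spikes in exactly the range described: once the queue stays full, average latency is simply outstanding I/Os divided by throughput.

```python
def latency_at_saturation_ms(queue_depth, iops):
    """Little's Law: with the queue pinned full, avg latency = outstanding I/Os / IOPS.

    Returns the implied average latency in milliseconds.
    """
    return queue_depth / iops * 1000.0

# A shared HBA queue of 31 slots, fully backed up at 500 IOPS:
print(latency_at_saturation_ms(31, 500))  # -> 62.0 ms, right around the spikes reported
```

The numbers in the call are the ones from this thread (queue depth 31, ~500 IOPS); swap in your own controller's queue depth to see what latency floor saturation would imply.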
You should open up esxtop and look at the queues of your devices to see if they are filling up... I think it is safe to assume that is your problem.
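For anyone following along, a sketch of how to do that check (assumes SSH/shell access to the ESXi host; field names are from the esxtop disk-device view and may vary slightly by ESXi build):

```shell
# Per-device max queue depth as the host sees it:
esxcli storage core device list | grep -iE "Display Name|Queue Depth"

# Live queue stats: launch esxtop, press 'u' for the disk-device view,
# then watch DQLEN (device queue depth), ACTV/QUED (active vs. queued
# commands), and DAVG/KAVG latency. QUED staying above 0 for sustained
# periods means the queue is saturating.
esxtop
```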
We ended up trying the PERC H710P, which improved performance, but we still saw the latency spikes. Unfortunately the SSDs were still hitting queue saturation. VMware support felt the saturation periods were too brief to cause concern, but we didn't have time to continue testing. Our project timeline was at risk, so we ended up using a VNXe instead... so no happy ending to this story. I assume we will see some good reference architectures in the coming year and we can try again on a future project. We had this in mind for our remote-office use cases.
@BTB2809 - did you create an SR? If so, can you share the SR #?
Thanks.
Kiran
If you don't care about Dell support, you can flash this controller with updated LSI 2008 IT-mode firmware, which will increase the queue depth to 600.
I saw the same problems in my lab testing with my LSI 2008-based PIKE cards until we updated them.
LSI 2008 Dell H310 VSAN rebuild performance concerns - Virtual Ramblings
For some reason half of Dell's reference VSAN nodes use this worthless controller. Since their sales people are telling customers that VMware will not support production use of VSAN, I suspect it's some misguided desire to convince people not to use it...
Reference configs seem to be out of date already.
Dell has both the PERC H710 and H710P on the HCL, and both support a queue depth of 975.
As for your last comment, I doubt Dell is highly motivated to create a strong reference config that would ultimately cannibalize their EqualLogic and Compellent revenue, so you may be correct...
Starting in July 2014, the VSAN HCL was modified. All controllers with a queue depth < 256 were removed, including the H310 / LSI 2004/2008.
VMware KB: Storage Controllers previously supported for VSAN that are no longer supported
Best regards,
Joerg