What exact version of ESXi and vSAN are you running?
We are running the following.
VMware ESXi, 6.0.0, 4600944
For some 10GbE NICs, a new driver was recently released that improves performance, so it's worth checking whether yours is one of them. Also, 6.0 U3 includes a number of vSAN improvements, which I would recommend.
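To confirm which driver and firmware a host is currently running, esxcli can report both per NIC (vmnic0 here is just an example; substitute whichever uplink carries your vSAN traffic):
# esxcli network nic get -n vmnic0
The "Driver Info" section of the output shows the driver name, driver version, and firmware version, which you can check against the HCL.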
Thank you very much for the advice. I updated to the latest firmware and drivers for the NIC, and I also upgraded to 6.0 U3. I'm still seeing the errors and latencies. Overall performance is good; I'm just trying to understand some inconsistencies and where I may be falling short or hitting a bottleneck. I'm attaching some more screenshots from vSAN Observer taken during a large file copy within vSAN. Any advice or direction is greatly appreciated.
Do you experience any congestion during the high write latency?
You will also want to see how much of the write cache on the SSD is already used up during the latency event. This is easier to see with SexiGraf (a free tool, just search Google), which shows the stats for all the SSDs in a cluster on one page. In vSAN Observer, if you deep dive into an SSD, one of the graphs will show this too, but only for a single SSD at a time.
Just a theory, but maybe the write buffer on the SSD fills up and destaging to the capacity disks becomes the bottleneck (this will show as congestion). Congestion artificially introduces latency as a result, which may be what you are seeing.
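If you want to know which devices to watch, you can list the cache-tier SSDs for each disk group from any host before deep diving in Observer or SexiGraf:
# esxcli vsan storage list
Each device entry shows "Is SSD: true/false" and the disk group UUID it belongs to, so you can map the SSD you see spiking in the graphs to a physical device.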
Possibly unrelated but maybe worth checking (as without detailed packet analysis I would not make any assumptions)
- Is RX and TX Flow Control disabled as per best practice?
Check if it is enabled using:
# ethtool -a vmnic<N>
(where vmnic<N> is the NIC in use for vSAN traffic)
Set it to off with:
# ethtool --pause vmnic<N> tx off rx off
(checked and set on every host in the vSAN cluster of course)
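If it helps, here is a quick way to check every uplink on a host in one pass (a minimal sketch; it assumes vmnic0 and vmnic1 are the vSAN uplinks, so adjust the names to match your hosts):
# for nic in vmnic0 vmnic1; do echo "=== $nic ==="; ethtool -a $nic; done
Run the same loop with the --pause form if you decide to turn it off everywhere.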
What does your configuration look like (server hardware)? And how are you testing performance? (A file copy is usually not how people test performance, in my experience.)
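For what it's worth, a synthetic workload generator gives far more repeatable numbers than a file copy. A minimal example with fio inside a Linux test VM might look like this (all parameters are illustrative; it assumes fio is installed in the guest and /mnt/test sits on a vSAN-backed disk):
# fio --name=vsan-mixed --ioengine=libaio --direct=1 --rw=randrw --rwmixread=70 --bs=4k --iodepth=32 --numjobs=4 --size=10G --runtime=120 --time_based --group_reporting --filename=/mnt/test/fio.dat
A 70/30 random read/write mix at 4k is a reasonable stand-in for a general-purpose VM workload; adjust the block size and queue depth to match what your real workloads actually do.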
We are running a 4 node hybrid cluster with HP DL380 G9's.
3 disk groups per host. Each group has one 200GB SSD (HP MO0200JEFNV, Mainstream Endurance SFF) and four 900GB 10K SAS magnetic disks (EG0900JFCKB), behind a P440 array controller.
Flow control is enabled. I tried disabling it before and it did not seem to have any effect. I'm not seeing any congestion, and the write buffers are not filling up. A colleague is noticing performance inconsistency when doing SQL backups from local disk to local disk on a Windows server. We aren't having any real problems with the environment; I'm just trying to see if there's anything that can be done to alleviate these concerns. Is there a recommended performance testing tool for vSAN?
I never reached a resolution on this.
"Possibly unrelated but maybe worth checking (as without detailed packet analysis I would not make any assumptions)
- Is RX and TX Flow Control disabled as per best practice?"
That very same design guide advises keeping it enabled (pages 28 and 138), and the document does not mention disabling it anywhere.
I've tried it both ways and we've still found the performance to be inconsistent. We ended up getting a storage array to use for our high performance servers.
That link now points to the new Networking guide (which didn't exist at the time of my post). I can't seem to find a copy of the old one locally, but maybe you can find one online if you want to clarify what it did or didn't say back then.
Edit: Found it, Google "VMware® vSAN™ Network Design-OLD - VMware Storage Hub":
"vSAN manages congestion by introducing artificial latency to prevent cache/buffer exhaustion. Since vSAN has built-in congestion management, disabling flow control on VMkernel interfaces tagged for vSAN traffic is recommended. Note Flow Control is enabled by default on all physical uplinks. For further information on Flow Control see KB1013413. VMware Recommends: Disable flow control for vSAN traffic."
And yes, good point that the recommendation has since changed; nowadays we only advise disabling this with the switch vendor's blessing.