Having some vSAN write performance issues, I would appreciate your thoughts.
The basic spec:
5x vSAN ready nodes, 2x AMD EPYC 7302 16-Core Processor, 2TB RAM, 20x NVMe disks across 4 disk groups. 4x Mellanox 25GbE Networking, Jumbo frames configured E2E.
When running any workload, including HCIBench, we are observing really poor write performance. See below: 30 minutes of 30+ms write latency. Reads are through the roof at 400k+ IOPS, but writes sit between 20k and 40k IOPS depending on parameters. Took 12 hours to consolidate a 10TB snapshot the other day!
Things I have tried:
Any ideas, please? We really expected better.
It seems that you've been doing some deep troubleshooting, so let me ask about some things you didn't mention.
- Are your hosts compatible with ESXi 7.0?
- Are your disks on on-disk format version 11?
- Did you check that your NIC and HBA firmware and driver versions are up to date? I believe they are, since you mentioned that hardware and software are up to date.
- Can you run proactive tests? https://docs.vmware.com/en/VMware-vSphere/7.0/com.vmware.vsphere.vsan-monitoring.doc/GUID-B88B5900-3...
- Did you run HCIBench tests with different block sizes?
- Do you have the vSAN VLAN and VMkernel adapter separated from management and vMotion?
Are you running 7.0 or 7.0 U1 here?
"Took 12 hours to consolidate a 10TB snapshot the other day!"
Are you running HCIBench alongside a production workload? This is not in any way advisable and actually means the benchmarks are not a valid baseline (as they are contending for the resources of the other workloads and available storage).
Why would anyone have 10TB snapshots lying around in their environment?
Specifically which model of NVMe are in use for Cache and Capacity-tiers? (and what driver + firmware combination)
When you tested (anything) with FTT=0, were you configuring it so it ran/deployed the FTT=0 vmdk(s) only on the node where these data were stored? (otherwise it isn't really a good test as the IO still has to traverse inter-node network to commit writes).
Just so that you are aware (and in general for all storage): not all IOPS are equal, so stating that X is expected to do Y IOPS doesn't paint the full picture. For example, 100,000 4K IOPS is the same throughput as 6,250 64K IOPS. From the throughput in the picture you shared, this looks to be ~64K block size (but there is no way of telling whether it is half 4K and half 512K either).
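To make the IOPS vs. throughput point concrete, here is a minimal sketch in Python. The function name is only illustrative, and it assumes a uniform block size with 1K taken as 1024 bytes.

```python
def throughput_bytes_per_s(iops: int, block_size_kib: int) -> int:
    """Throughput in bytes per second, assuming every I/O has the same block size."""
    return iops * block_size_kib * 1024


small_blocks = throughput_bytes_per_s(100_000, 4)   # 100,000 IOPS at 4K
large_blocks = throughput_bytes_per_s(6_250, 64)    # 6,250 IOPS at 64K

# Both workloads move the same amount of data per second.
print(small_blocks, large_blocks, small_blocks == large_blocks)
# prints: 409600000 409600000 True
```

So an IOPS figure only means something once the block size (or the throughput) is quoted next to it.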
I would advise opening a Support Request with vSAN GSS, we have a dedicated team for Performance cases.
@lucasbernadsky thanks for helping here.
Thanks for taking a look at this:
Looking forward to hearing from you soon.
First of all, you should check the backend metrics (select host - Monitor - vSAN - Performance - Backend/Disk/etc.) to find out where the bottleneck appears and where the latency spikes. This schema could help you:
Secondly, I suggest running HCIBench with these parameters: 3 VMs per host, 5 vmdks per VM, 30GB per vmdk, 4K 100% write, 30 min warmup and a 1h test, with initialization on (zero if there is no DD&C on the cluster, random if DD&C is enabled). After the test you can save the Observer data ("save results" button), which contains information from each vSAN module. Screenshots from it could help to find the issue.
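For a sense of scale, the suggested layout can be totalled up with a small Python sketch. The host count of 5 comes from the spec in the first post; the function name and the `copies` parameter are illustrative assumptions (1 full copy for FTT=0, 2 for FTT=1 with RAID-1 mirroring), not anything HCIBench itself exposes.

```python
def hcibench_footprint(hosts, vms_per_host, vmdks_per_vm, vmdk_gb, copies=1):
    """Return (vm_count, vmdk_count, total_gb) for a test layout.

    copies is the number of full data copies kept by the storage policy
    (assumption: 1 for FTT=0, 2 for FTT=1 with RAID-1 mirroring).
    """
    vm_count = hosts * vms_per_host
    vmdk_count = vm_count * vmdks_per_vm
    total_gb = vmdk_count * vmdk_gb * copies
    return vm_count, vmdk_count, total_gb


# 5 hosts, 3 VMs per host, 5 vmdks per VM, 30GB per vmdk.
print(hcibench_footprint(5, 3, 5, 30))            # (15, 75, 2250)
print(hcibench_footprint(5, 3, 5, 30, copies=2))  # (15, 75, 4500)
```

That is 75 vmdks and roughly 2.25TB of test data before any mirroring overhead, which should spread I/O across all hosts and disk groups.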
Can you update the disks to ODF v13? I suggest this because in any vSAN update that introduces a new on-disk format, the vast majority of performance enhancements only come into effect once the disks have been upgraded to that version (this is why these format versions exist at all, aside from cases where one enables a specific feature, e.g. Encryption in v5).
Thanks for clarifying that you aren't running HCIBench while other workloads are running, but are you running this on an empty vsandatastore? If not, this can have implications. The main two are that the caches already have data on them, and that data already stored on the Disk-Groups may limit (and/or dictate) where test data can be placed. In an extreme case (e.g. if utilisation in the cluster or on certain disks was relatively high), the test data could in theory push individual disk utilisation above 80% (the default CLOM rebalance threshold), and the test would then be in contention with a reactive rebalance. Are you using flush-cache between tests, and have you checked the per-disk storage utilisation via RVC during these tests?
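Once you have the per-disk utilisation figures, a trivial Python sketch like the one below can flag which disks sit above the 80% threshold. It does not query RVC or any vSAN API; the function name, the disk names, and the percentages are made-up placeholders that you would replace with the values you read off yourself.

```python
def disks_over_threshold(used_pct_by_disk, threshold_pct=80.0):
    """Return names of disks whose used percentage exceeds threshold_pct, fullest first."""
    over = {disk: pct for disk, pct in used_pct_by_disk.items() if pct > threshold_pct}
    return sorted(over, key=over.get, reverse=True)


# Hypothetical sample values, not measurements from this cluster.
sample = {"disk-a": 62.5, "disk-b": 81.2, "disk-c": 79.9, "disk-d": 90.0}
print(disks_over_threshold(sample))  # ['disk-d', 'disk-b']
```

If anything shows up in that list during a test run, the results are probably being skewed by rebalance traffic.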
If you are running this alongside other data and these cannot be moved off temporarily, if you have the resources available to evacuate one node, you could test it as a 1-node vSAN (I know, only FTT=0 then but will give a good idea of per-host capabilities).
As an aside on how long a snapshot of X size takes to consolidate: this isn't just a case of how much data the cluster can write. As you are aware, this isn't the only VM using the cluster, and what can determine the duration even more is the VM's usage of the snapshot and base-disk data during this time (and other sources of contention such as backups).
Regarding the FTT=0 tests done already - I haven't played with HCIBench in quite some time but I do recall at one point there being some issue with placement of FTT=0 data not being 'pinned' to the respective host(s) as it is supposed to (or at least expected to).
What VM layout and numbers were you running during these tests? Is it possible that it was just pushing I/O against a very limited number of components of a very limited number of vmdk Objects on a very limited number of disks?
@NikolayKulikov makes some good points and you should be aiming to dig deeper and not just focus on one set of graphs in isolation, vSAN Observer data and esxtop data can also help with this.
My guess at 64k is nothing to be impressed about - average IO size = throughput divided by IOPS 😄
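As a quick illustration of that formula, here is a minimal Python sketch. The function name and the numbers are illustrative only (not taken from the graphs in this thread), and 1K is taken as 1024 bytes.

```python
def average_io_size_kib(throughput_bytes_per_s: float, iops: float) -> float:
    """Average I/O size in KiB, given throughput (bytes/s) and IOPS."""
    return throughput_bytes_per_s / iops / 1024


# A uniform 64K workload: 6,250 IOPS moving 409,600,000 bytes/s.
print(average_io_size_kib(409_600_000, 6_250))  # 64.0

# A mixed workload: 1,000 IOPS at 4K plus 1,000 IOPS at 124K.
mixed_throughput = (1_000 * 4 + 1_000 * 124) * 1024
print(average_io_size_kib(mixed_throughput, 2_000))  # 64.0
```

Both cases report a 64K average, which is why the average alone cannot tell you what block sizes are actually in the mix.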