Having some vSAN write performance issues, I would appreciate your thoughts.
The basic spec:
5x vSAN ready nodes, 2x AMD EPYC 7302 16-Core Processor, 2TB RAM, 20x NVMe disks across 4 disk groups. 4x Mellanox 25GbE Networking, Jumbo frames configured E2E.
When running any workloads, including HCIBench we are observing really poor write performance. See below, 30 minutes of 30+ms write latency. Reads are through the roof 400k+ IOPS, writes between 20-40k IOPS depending on parameters. Took 12 hours to consolidate a 10TB snapshot the other day!
Things I have tried:
Any ideas, please? We really expected better.
It seems that you've been doing some deep troubleshooting so let me ask you some things you didn't mention.
- Are your hosts compatible with ESXi 7.0?
- Are your disks on on-disk format version 11?
- Did you check if your NIC and HBA firmware and driver versions are up to date? I believe they are since you mentioned hardware and software is up-to-date.
- Can you run proactive tests? https://docs.vmware.com/en/VMware-vSphere/7.0/com.vmware.vsphere.vsan-monitoring.doc/GUID-B88B5900-3...
- Did you run HCIBench tests with different block sizes?
- Do you have the vSAN VLAN and VMkernel separated from management and vMotion?
Are you running 7.0 or 7.0 U1 here?
"Took 12 hours to consolidate a 10TB snapshot the other day!"
Are you running HCIBench alongside a production workload? This is not in any way advisable and actually means the benchmarks are not a valid baseline (as they are contending for the resources of the other workloads and available storage).
Why would anyone have 10TB snapshots lying around in their environment?
Specifically which model of NVMe are in use for Cache and Capacity-tiers? (and what driver + firmware combination)
When you tested (anything) with FTT=0, were you configuring it so it ran/deployed the FTT=0 vmdk(s) only on the node where these data were stored? (otherwise it isn't really a good test as the IO still has to traverse inter-node network to commit writes).
Just so that you are aware (and in general for all storage): not all IOPS are equal, so stating that X is expected to do Y IOPS doesn't paint the full picture. For example, pushing 100,000 4K IOPS is the same throughput as 6,250 64K IOPS. From the throughput in the picture you shared, this looks to be ~64K block size (but there is no way of telling whether it is half 4K and half 512K either).
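To make the arithmetic above concrete, here is a minimal Python sketch. The function names are my own, and the 112/15 mix of 4K and 512K I/Os is a made-up illustration chosen so the average lands on exactly 64K. It is not taken from the graphs in this thread.

```python
def throughput_kib_per_s(iops: float, block_size_kib: float) -> float:
    """Throughput is simply IOPS multiplied by the I/O size."""
    return iops * block_size_kib


def average_io_size_kib(throughput_kib_s: float, iops: float) -> float:
    """Average I/O size = throughput per second divided by IOPS."""
    if iops <= 0:
        raise ValueError("iops must be positive")
    return throughput_kib_s / iops


# 100,000 x 4K and 6,250 x 64K move the same amount of data per second.
small_blocks = throughput_kib_per_s(100_000, 4)
large_blocks = throughput_kib_per_s(6_250, 64)
print(small_blocks, large_blocks)

# An average hides the mix: 112 I/Os of 4K plus 15 I/Os of 512K
# (hypothetical numbers) also average out to 64K.
mixed_throughput = throughput_kib_per_s(112, 4) + throughput_kib_per_s(15, 512)
print(average_io_size_kib(mixed_throughput, 112 + 15))
```

So an average of ~64K read off a throughput graph is only a starting point; the actual block size distribution has to come from the workload definition or from per-I/O statistics.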
I would advise opening a Support Request with vSAN GSS, we have a dedicated team for Performance cases.
@lucasbernadsky thanks for helping here.
Thanks for taking a look at this:
Looking forward to hearing from you soon.
First of all, you should check the backend metrics (select host - monitor - vsan - performance - backend/disk/etc) to find out where the bottleneck appears and where the latency spikes. This schema could help you:
Secondly, I'd suggest running HCIBench with the following parameters: 3 VMs per host, 5 vmdks per VM, 30GB per vmdk, 4K 100% write, 30 min warmup and 1h test, with initialization on (zero if there is no DD&C on the cluster, random if DD&C is enabled). After the test you can save the observer data ("save results" button), which contains information from each vSAN module. Screenshots from it could help to find the issue.
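As a rough sanity check on what that test profile means in terms of data footprint, here is a small Python sketch. It assumes the 5-host cluster from the original post, and the function and field names are purely illustrative. It only counts the raw vmdk size and ignores any storage policy overhead.

```python
def hcibench_footprint(hosts: int, vms_per_host: int,
                       vmdks_per_vm: int, vmdk_size_gb: int) -> dict:
    """Return VM count, vmdk count and raw working set for a test profile."""
    total_vms = hosts * vms_per_host
    total_vmdks = total_vms * vmdks_per_vm
    return {
        "vms": total_vms,
        "vmdks": total_vmdks,
        "working_set_gb": total_vmdks * vmdk_size_gb,
    }


# 5 hosts (from the original post), 3 VMs per host, 5 x 30GB vmdks per VM.
profile = hcibench_footprint(hosts=5, vms_per_host=3,
                             vmdks_per_vm=5, vmdk_size_gb=30)
print(profile)
```

Comparing the working set against the total cache-tier capacity gives an idea of whether the test fits in cache or forces destaging to the capacity tier.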
Can you update the disks to ODF v13? I suggest this as in any vSAN update (where these are present), the vast majority of performance enhancements actually only come into effect once the disks have been updated for the new version introduced (and thus why we have these at all aside from where it is for a specific feature-enablement e.g. Encryption in v5).
Thanks for clarifying that you aren't running HCIBench while other workloads are running, but are you running this on an empty vsandatastore? If not, this can have implications, the main two being that the caches already have data on them and that data stored on the Disk-Groups may limit (and/or dictate) where test data can be placed. In an extreme case (e.g. if the utilisation in the cluster or on certain disks was relatively high), the test data could in theory push individual disk utilisation >80% (the default CLOM rebalance threshold), and the test would then be in contention with a reactive rebalance. Are you using flush-cache between tests, and have you checked the per-disk storage utilisation via RVC during these tests?
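The per-disk check described above could be sketched like this in Python. The disk names and usage figures are hypothetical placeholders, not output from RVC, and the only value taken from this thread is the 80% threshold.

```python
REBALANCE_THRESHOLD_PCT = 80.0  # default CLOM rebalance threshold mentioned above


def disks_over_threshold(disks: dict, threshold_pct: float = REBALANCE_THRESHOLD_PCT) -> dict:
    """Return {disk: utilisation %} for disks strictly above the threshold.

    `disks` maps a disk name to a (used_gb, capacity_gb) tuple.
    """
    flagged = {}
    for name, (used_gb, capacity_gb) in disks.items():
        utilisation = 100.0 * used_gb / capacity_gb
        if utilisation > threshold_pct:
            flagged[name] = round(utilisation, 1)
    return flagged


# Hypothetical per-disk figures, e.g. copied by hand from a utilisation report.
sample = {
    "host1-dg1-cap1": (2900, 3500),
    "host1-dg1-cap2": (1500, 3500),
    "host2-dg3-cap4": (2800, 3500),
}
flagged = disks_over_threshold(sample)
print(flagged)
```

Running the same check before and during a benchmark shows whether the test data itself is what pushes a disk over the line.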
If you are running this alongside other data and these cannot be moved off temporarily, if you have the resources available to evacuate one node, you could test it as a 1-node vSAN (I know, only FTT=0 then but will give a good idea of per-host capabilities).
As an aside relating to how long a snapshot of X size takes to consolidate - this isn't just a case of how much data the cluster can write, as you are aware this isn't the only VM using the cluster and what can determine this even more is the VMs usage of the snapshot and base-disk data during this time (and other sources of contention such as backups).
Regarding the FTT=0 tests done already - I haven't played with HCIBench in quite some time but I do recall at one point there being some issue with placement of FTT=0 data not being 'pinned' to the respective host(s) as it is supposed to (or at least expected to).
What is the VM layout and numbers you were running during these tests? Is it possible that it was just pushing I/O against a very limited number of components of a very limited amount of vmdk Objects on a very limited number of disks?
@NikolayKulikov makes some good points and you should be aiming to dig deeper and not just focus on one set of graphs in isolation, vSAN Observer data and esxtop data can also help with this.
My guess at 64k is nothing to be impressed about - average IO size = throughput/s divided by iops 😄
Not really, upgrading to 7.0.2 helped. In the end Dell arranged for a VMware SME to engage and he called out some minor tweaks but ultimately said it's correctly configured and working as expected. Given the all NVMe and 4x 25GbE networking I still think it's not running A1.
We reduced write IOPS to vSAN and now it's fine, but we are waiting for 25Gbit NICs so we can switch vSAN from 10G to 25G, which I hope will help. Then I can try the previous setup and let you know.
OK, interesting. Have you set the PVSCSI to the recommended values mentioned here: https://kb.vmware.com/s/article/2053145 ? If so, set it back and test again with SQL.
What kind of workload profile have you tested with HCIBench?
We had SPLUNK running on Linux, and by default Linux runs with max sector at 512K. That means a single IO can be as large as 512K, and SPLUNK was writing at this size. vSAN will take this and split it into 64K blocks. In addition, we had the max PVSCSI settings. After changing max sector to 64K, cmd_per_lun to 32 and ring_pages to 8, we went from 77ms down to 3ms at the VM layer and everything worked flawlessly.
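To illustrate why the large guest I/O size matters, here is a minimal Python sketch of the splitting described above. It assumes a fixed 64K chunk size as stated in this post and is not a model of vSAN internals. The function name is my own.

```python
def split_io(size_kib: int, chunk_kib: int = 64) -> list:
    """Split one I/O of size_kib into chunks of at most chunk_kib."""
    if size_kib <= 0 or chunk_kib <= 0:
        raise ValueError("sizes must be positive")
    full_chunks, remainder = divmod(size_kib, chunk_kib)
    chunks = [chunk_kib] * full_chunks
    if remainder:
        chunks.append(remainder)
    return chunks


# One 512K write from the guest becomes eight 64K pieces,
# while a 64K write passes through as a single piece.
print(len(split_io(512)), len(split_io(64)))
```

With deep PVSCSI queues on top of that, each outstanding 512K guest I/O multiplies into several backend operations, which fits with the latency drop seen after lowering both the I/O size and the queue settings.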
We had major write latency with vSAN. After logging a call with VMware, they highlighted that it is a bug that has been fixed in 7.0U3f:
Thank you for sharing the requested details.
I have reviewed the logs and below are my findings.
High write latency on a disk group.
6 node vSAN all flash cluster with deduplication and compression enabled.
ESXi version: ESXi 7.0 Update 3e (build-19482537)
We could see that one of the disk groups on host '<hostname>' was reporting high log congestion bandwidth, causing latency on the disk group, which could impact multiple VMs that have components placed on this disk group.
We could see the host '<hostname>' was taken into maintenance mode on 18th July at 11:16 UTC which resolved the issue.
2022-07-18T09:05:33.743Z: [GenericCorrelator] 5641261302076us: [vob.user.maintenancemode.entering] The host has begun entering maintenance mode
2022-07-18T11:16:19.384Z: [GenericCorrelator] 5649106943053us: [vob.user.maintenancemode.entered] The host has entered maintenance mode
We have a known issue with the current ESXi build when 'unmapFairness' and 'GuestUnmap' are enabled.
Please find the KB# below:
Update to vSAN/ESXi 7.0 U3f which contains the code fix for this issue.
To disable unmap, SSH into each host in the cluster and run the following commands:
# esxcfg-advcfg -s 0 /VSAN/GuestUnmap
# esxcfg-advcfg -s 0 /LSOM/unmapFairness
Place the host into maintenance mode with "Ensure accessibility" and reboot the host to make the new setting active.
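For a cluster with several hosts, the per-host command list can be generated rather than typed by hand. The Python sketch below only builds the command strings and does not run anything. The host names are hypothetical, and the `-g` read-back step is my own addition (it uses the get flag of `esxcfg-advcfg`) and is not part of the support engineer's instructions.

```python
# Hypothetical host names, for illustration only.
HOSTS = ["esx01.example.local", "esx02.example.local"]

# Advanced settings named in the workaround above.
SETTINGS = ["/VSAN/GuestUnmap", "/LSOM/unmapFairness"]


def disable_commands() -> list:
    """Commands that set each option to 0, as given in the workaround."""
    return [f"esxcfg-advcfg -s 0 {path}" for path in SETTINGS]


def verify_commands() -> list:
    """Assumed extra step: read each option back with -g to confirm it."""
    return [f"esxcfg-advcfg -g {path}" for path in SETTINGS]


def per_host_plan(hosts: list) -> dict:
    """Map each host to the ordered list of commands to run over SSH."""
    return {host: disable_commands() + verify_commands() for host in hosts}


for host, commands in per_host_plan(HOSTS).items():
    print(host)
    for command in commands:
        print("  " + command)
```

Each host still needs the maintenance mode and reboot step afterwards, one host at a time.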