vSAN 7.0 poor write performance and high latency w...

MrPowerEdge · ‎11-09-2020

Hi All,

Having some vSAN write performance issues, I would appreciate your thoughts.

The basic spec:

5x vSAN ready nodes, 2x AMD EPYC 7302 16-Core Processor, 2TB RAM, 20x NVMe disks across 4 disk groups. 4x Mellanox 25GbE Networking, Jumbo frames configured E2E.

When running any workload, including HCIBench, we are observing really poor write performance. See below: 30 minutes of 30+ ms write latency. Reads are through the roof at 400k+ IOPS, while writes are between 20k and 40k IOPS depending on parameters. It took 12 hours to consolidate a 10TB snapshot the other day!

Things I have tried:

Disabled vSAN checksum - this gave a 2k IOPS improvement.
AMD tuning guide: NPS=1, which is the default but suits the workload.
Increased the stripe width from 1 to 2; this improved reads but made writes worse.
No de-dupe or compression enabled.
Tried mirroring and FTT=0: some small improvement but nothing significant.
All patched up on both hardware and software.

Notes:

vSAN insight shows no issues.
Really expected 60K+ write IOPS.

Any ideas, please? We really expected better.

lucasbernadsky · ‎11-09-2020

Hi there!
It seems that you've been doing some deep troubleshooting so let me ask you some things you didn't mention.

- Are your hosts compatible with ESXi 7.0?
- Are your disks on on-disk format version 11?
- Did you check if your NIC and HBA firmware and driver versions are up to date? I believe they are since you mentioned hardware and software is up-to-date.
- Can you run proactive tests? https://docs.vmware.com/en/VMware-vSphere/7.0/com.vmware.vsphere.vsan-monitoring.doc/GUID-B88B5900-3...
- Did you run HCIBench tests with different block sizes?
- Do you have the vSAN VLAN and VMkernel separated from management and vMotion?

TheBobkin · ‎11-09-2020

Hi,

Are you running 7.0 or 7.0 U1 here?

"Took 12 hours to consolidate a 10TB snapshot the other day!"
Are you running HCIBench alongside a production workload? This is not in any way advisable and actually means the benchmarks are not a valid baseline (as they are contending for the resources of the other workloads and available storage).
Why would anyone have 10TB snapshots lying around in their environment?

Specifically which model of NVMe are in use for Cache and Capacity-tiers? (and what driver + firmware combination)

When you tested (anything) with FTT=0, were you configuring it so it ran/deployed the FTT=0 vmdk(s) only on the node where these data were stored? (otherwise it isn't really a good test as the IO still has to traverse inter-node network to commit writes).

Just so that you are aware (and in general for all storage): IOPS do not equal all other IOPS and thus stating X is expected to do X IOPS isn't really painting the full picture, e.g. if one is pushing 100,000 4K IOPS this is the same throughput as 6,250 64K IOPS - from the throughput in the picture you shared this looks to be ~64K block size (but no way of telling if it is half 4K and half 512K either).
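
To make that arithmetic concrete, here is a minimal Python sketch. The function names are made up for illustration, and block sizes are taken in KiB.

```python
def throughput_kib_per_s(iops, block_kib):
    """Throughput is simply IOPS multiplied by the size of each IO."""
    return iops * block_kib


def equivalent_iops(iops, from_block_kib, to_block_kib):
    """IOPS at a different block size that moves the same amount of data."""
    return iops * from_block_kib / to_block_kib


small = throughput_kib_per_s(100_000, 4)   # 100,000 IOPS at 4K
large = throughput_kib_per_s(6_250, 64)    # 6,250 IOPS at 64K
print(small, large, small == large)
print(equivalent_iops(100_000, 4, 64))
```

Both workloads move 400,000 KiB/s, which is why an IOPS figure only means something alongside its block size.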

I would advise opening a Support Request with vSAN GSS, we have a dedicated team for Performance cases.

MrPowerEdge · ‎11-10-2020

@lucasbernadsky thanks for helping here.

Yes all hosts on the HCL from a major vendor, they are vSAN ready nodes.
Yes disk on v11 and data redistributed.
Support have checked the firmware is all up to date as per the VMW HCL
Proactive tests report no issues.
We ran easyrun and made our own 60/40 64k parameter file.
vSAN network is dedicated pNICs and VLAN.

MrPowerEdge · ‎11-10-2020

@TheBobkin

Thanks for taking a look at this:

It started at 7.0 and is now 7.0.1 Update 1 (build-16850804) Kernel 7.0.1 (x86_64)
No, we are not running HCIBench alongside anything; that would be madness.
10TB snapshot - Long story, basically someone created a snapshot when the VM was built last week and forgot about it. Then the DBA restored 10TB of data!
Cache disks are all Dell Express Flash PM1725b 1.6TB SFF F/W 1.1.0
Capacity disks are Dell Express Flash PM1725b 3.2TB SFF F/W 1.1.0
The FTT=0 test was to eliminate the vSAN network in case that was the issue; I presumed all writes would start local to the host?
It was a 64K test, so I'm impressed you can see that from the image.

Looking forward to hearing from you soon.

NikolayKulikov · ‎11-10-2020

Hi,

First of all, you should check the backend metrics (select host - Monitor - vSAN - Performance - Backend/Disks/etc.) to find out where the bottleneck and latency spikes appear. This schema could help you:

Secondly, I'd suggest running HCIBench with these parameters: 3 VMs per host, 5 vmdks per VM, 30GB per vmdk, 4K 100% write, 30 min warm-up and 1h test, initialization on (zero if there is no DD&C on the cluster, random if DD&C is enabled). After the test you can save the observer data ("save results" button), which contains information from each vSAN module. Screenshots from it could help to find the issue.
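
As a rough sizing aid for the parameters above, here is a minimal Python sketch of the test footprint. It assumes the 5 hosts from the original post and an FTT=1 mirroring policy (two copies of the data). The function and parameter names are illustrative only; they are not HCIBench settings.

```python
def hcibench_footprint(hosts, vms_per_host, vmdks_per_vm, vmdk_gb, copies=2):
    """Estimate how much data a test run places on the cluster.

    copies=2 is an assumption (FTT=1 mirroring); use copies=1 for FTT=0.
    """
    total_vms = hosts * vms_per_host
    total_vmdks = total_vms * vmdks_per_vm
    logical_gb = total_vmdks * vmdk_gb
    return {
        "total_vms": total_vms,
        "total_vmdks": total_vmdks,
        "logical_gb": logical_gb,        # what the guest VMs see
        "raw_gb": logical_gb * copies,   # what lands on the capacity tier
    }


# 5 hosts, 3 VMs per host, 5 vmdks per VM, 30GB per vmdk
footprint = hcibench_footprint(hosts=5, vms_per_host=3, vmdks_per_vm=5, vmdk_gb=30)
print(footprint)
```

With these inputs the run covers 15 VMs and 75 vmdks, which is 2,250GB of logical data and 4,500GB raw under mirroring.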

TheBobkin · ‎11-11-2020

@MrPowerEdge

Can you update the disks to ODF v13? I suggest this because in any vSAN update that introduces a new on-disk format version, the vast majority of performance enhancements only come into effect once the disks have been upgraded to that version (which is why these versions exist at all, aside from those introduced for a specific feature-enablement, e.g. Encryption in v5).

Thanks for clarifying that you aren't running HCIBench while other workloads are running, but are you running this on an empty vsandatastore? If not, this can have implications, the main two being that the caches already have data on them, and that data stored on the Disk-Groups may limit (and/or dictate) where test data can be placed. In an extreme case (e.g. if utilisation in the cluster or on certain disks was relatively high), the test data could in theory push individual disk utilisation >80% (the default CLOM rebalance threshold), leaving the test in contention with a reactive rebalance. Are you using flush-cache between tests, and have you checked the per-disk storage utilisation via RVC during these tests?
If you are running this alongside other data that cannot be moved off temporarily, and you have the resources available to evacuate one node, you could test it as a 1-node vSAN (I know, only FTT=0 then, but it will give a good idea of per-host capabilities).
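
The 80% concern can be checked with simple arithmetic once per-disk usage is known. Below is a minimal Python sketch; the disk names and usage figures are invented for illustration, and it assumes the test data spreads evenly across disks, which real placement may not do.

```python
REBALANCE_THRESHOLD = 0.80  # default CLOM rebalance threshold quoted above


def disks_over_threshold(disks, extra_gb_per_disk=0.0, threshold=REBALANCE_THRESHOLD):
    """Return disks whose utilisation would exceed the threshold.

    disks: mapping of disk name -> (used_gb, capacity_gb)
    extra_gb_per_disk: test data expected to land on each disk (assumed even)
    """
    flagged = {}
    for name, (used_gb, capacity_gb) in disks.items():
        utilisation = (used_gb + extra_gb_per_disk) / capacity_gb
        if utilisation > threshold:
            flagged[name] = round(utilisation, 3)
    return flagged


# Hypothetical 3.2TB capacity disks with existing data on them
disks = {"disk-a": (2000, 3200), "disk-b": (2500, 3200)}
print(disks_over_threshold(disks))                         # before the test
print(disks_over_threshold(disks, extra_gb_per_disk=150))  # with test data added
```

In this made-up case neither disk is over 80% at rest, but 150GB of test data per disk pushes disk-b over the threshold, which is the situation where a rebalance could run during the benchmark.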

As an aside relating to how long a snapshot of X size takes to consolidate: this isn't just a case of how much data the cluster can write. As you are aware, this isn't the only VM using the cluster, and what can determine this even more is the VM's usage of the snapshot and base-disk data during this time (and other sources of contention such as backups).
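
For scale, the figure from the original post (10TB consolidated in 12 hours) works out to a fairly modest average rate. A minimal Python sketch, assuming decimal terabytes and a steady rate over the whole period (function names are illustrative):

```python
def average_rate_mb_per_s(data_tb, hours):
    """Average data rate in MB/s for moving data_tb (decimal TB) in the given hours."""
    total_bytes = data_tb * 10**12
    seconds = hours * 3600
    return total_bytes / seconds / 10**6


def iops_at_block_size(rate_mb_per_s, block_kib):
    """IOPS needed to sustain rate_mb_per_s with IOs of block_kib KiB."""
    return rate_mb_per_s * 10**6 / (block_kib * 1024)


rate = average_rate_mb_per_s(10, 12)
print(round(rate, 1), "MB/s")
print(round(iops_at_block_size(rate, 64)), "IOPS at 64K")
```

That is roughly 231 MB/s on average, far below what the benchmark figures in this thread suggest the cluster can write, which fits the point that consolidation time depends on more than raw write capability.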

Regarding the FTT=0 tests done already - I haven't played with HCIBench in quite some time but I do recall at one point there being some issue with placement of FTT=0 data not being 'pinned' to the respective host(s) as it is supposed to (or at least expected to).

What is the VM layout and numbers you were running during these tests? Is it possible that it was just pushing I/O against a very limited number of components of a very limited amount of vmdk Objects on a very limited number of disks?

@NikolayKulikov makes some good points and you should be aiming to dig deeper and not just focus on one set of graphs in isolation, vSAN Observer data and esxtop data can also help with this.

My guess at 64k is nothing to be impressed about - average IO size = throughput per second divided by IOPS 😄
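
As a quick sketch of that calculation in Python: the numbers below are hypothetical (the graph from the original post is not reproduced here), chosen only to show how a 64K average falls out of a throughput and an IOPS reading.

```python
def average_io_size_kib(throughput_mib_per_s, iops):
    """Average IO size in KiB = throughput per second divided by IOPS."""
    return throughput_mib_per_s * 1024 / iops


# Hypothetical readings: 1,875 MiB/s at 30,000 IOPS
print(average_io_size_kib(1875, 30_000))
```

Note that this only gives the average; it cannot tell a uniform 64K workload apart from a mix of smaller and larger IOs with the same mean.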

JakubEm · ‎11-07-2021

Hello, did you solve this problem? We have a similar issue with All-Flash SAS on a 10Gbit network.

4 hosts, 3 disk groups

MrPowerEdge · ‎11-08-2021

Hi,

Not really, upgrading to 7.0.2 helped. In the end Dell arranged for a VMware SME to engage and he called out some minor tweaks but ultimately said it's correctly configured and working as expected. Given the all NVMe and 4x 25GbE networking I still think it's not running A1.

JakubEm · ‎11-08-2021

Same here. I'm trying 4x 10Gbit LACP but still see high latency and sometimes congestion. We are considering buying a 25Gbit switch, but I don't know if it will help.

Fred_vBrain · ‎07-14-2022

Do you still have this issue, or is it solved?