VMware Cloud Community
MrPowerEdge
Contributor

vSAN 7.0 poor write performance and high latency with NVMe

Hi All,

Having some vSAN write performance issues, I would appreciate your thoughts.

The basic spec;

5x vSAN ready nodes, 2x AMD EPYC 7302 16-core processors, 2TB RAM, 20x NVMe disks across 4 disk groups, 4x Mellanox 25GbE networking. Jumbo frames configured end-to-end.
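As a sanity check on the jumbo frame path (vmk1 and the target address below are just placeholders for a vSAN VMkernel port and a neighbouring host's vSAN IP), a don't-fragment vmkping at full MTU can confirm the 9000 MTU end-to-end:

# vmkping -I vmk1 -d -s 8972 <remote vSAN vmk IP>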

When running any workload, including HCIBench, we are observing really poor write performance. See the attached screenshot: 30 minutes of 30+ ms write latency. Reads are through the roof at 400k+ IOPS, but writes sit between 20-40k IOPS depending on parameters. It took 12 hours to consolidate a 10TB snapshot the other day!
Screenshot 2020-11-09 194510.jpg

Things I have tried:

  • Disabled vSAN checksum - this gave roughly a 2k IOPS improvement.
  • Followed the AMD tuning guide: NPS=1, which is the default and suits the workload.
  • Increased the stripe width from 1 to 2; this improved reads but made writes worse.
  • Deduplication and compression are not enabled.
  • Tried mirroring and FTT=0; some small improvement but nothing significant.
  • Fully patched, both hardware (firmware) and software.

Notes:

  • vSAN insight shows no issues.
  • Really expected 60K+ write IOPS.

Any ideas, please? We really expected better.

20 Replies
lucasbernadsky
Hot Shot

Hi there!
It seems you've been doing some deep troubleshooting, so let me ask about a few things you didn't mention.

- Are your hosts compatible with ESXi 7.0?
- Are your disks on on-disk format version 11?
- Did you check whether your NIC and HBA firmware and driver versions are up to date? I believe they are, since you mentioned hardware and software are up to date.
- Can you run proactive tests? https://docs.vmware.com/en/VMware-vSphere/7.0/com.vmware.vsphere.vsan-monitoring.doc/GUID-B88B5900-3...
- Did you run the HCIBench tests with different block sizes?
- Do you have the vSAN VLAN and VMkernel interface separated from management and vMotion? (A quick way to check is sketched below.)
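If it helps, a quick way to confirm which VMkernel port carries vSAN traffic and what it is attached to (assuming SSH access to a host; output fields vary slightly by build):

# esxcli vsan network list
# esxcli network ip interface list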

TheBobkin
Champion

Hi,

Are you running 7.0 or 7.0 U1 here?


"Took 12 hours to consolidate a 10TB snapshot the other day!"
Are you running HCIBench alongside a production workload? This is not in any way advisable and actually means the benchmark results are not a valid baseline (as they are contending with the other workloads for resources and available storage).
Why would anyone have 10TB snapshots lying around in their environment?


Specifically which model of NVMe are in use for Cache and Capacity-tiers? (and what driver + firmware combination)


When you tested (anything) with FTT=0, were you configuring it so that the FTT=0 vmdk(s) ran/were deployed only on the node where their data were stored? (Otherwise it isn't really a good test, as the IO still has to traverse the inter-node network to commit writes.)
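One way to confirm where the FTT=0 components actually landed is RVC on the vCenter appliance (a rough sketch; the path below is a placeholder for your own datacenter and test VM):

rvc> vsan.vm_object_info /localhost/<datacenter>/vms/<test-VM>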


Just so that you are aware (and this applies to all storage): not all IOPS are equal, so stating that X is expected to do Y IOPS doesn't paint the full picture. For example, pushing 100,000 4K IOPS is the same throughput as 6,250 64K IOPS. From the throughput in the picture you shared, this looks to be roughly a 64K block size (but there is no way of telling whether it is half 4K and half 512K either).
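As a rough illustration with made-up numbers (nothing here is taken from your screenshot):

# 100,000 IOPS x 4 KiB  = ~390 MiB/s
#   6,250 IOPS x 64 KiB = ~390 MiB/s
# average IO size = throughput / IOPS, e.g. ~2,500 MiB/s at 40,000 IOPS:
# awk 'BEGIN { printf "%.0f KiB\n", (2500 * 1024) / 40000 }'     (prints 64 KiB)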


I would advise opening a Support Request with vSAN GSS, we have a dedicated team for Performance cases.

MrPowerEdge
Contributor

@lucasbernadsky thanks for helping here.

  • Yes, all hosts are on the HCL from a major vendor; they are vSAN ready nodes.
  • Yes, disks are on v11 and the data has been redistributed.
  • Support have checked that the firmware is all up to date as per the VMware HCL.
  • Proactive tests report no issues.
  • We ran easyrun and also made our own 60/40 64K parameter file.
  • The vSAN network is on dedicated pNICs and its own VLAN.
MrPowerEdge
Contributor

@TheBobkin 

Thanks for taking a look at this:

  • It started on 7.0 and is now 7.0 Update 1 (build-16850804), kernel 7.0.1 (x86_64).
  • No, we are not running HCIBench alongside anything; that would be madness.
  • 10TB snapshot - long story: someone created a snapshot when the VM was built last week and forgot about it, then the DBA restored 10TB of data!
  • Cache disks are all Dell Express Flash PM1725b 1.6TB SFF, F/W 1.1.0.
  • Capacity disks are Dell Express Flash PM1725b 3.2TB SFF, F/W 1.1.0.
  • The FTT=0 test was to eliminate the vSAN network in case that was the issue; I presumed all writes would start local to the host?
  • It was a 64K test, so I'm impressed you could see that from the image.

Looking forward to hearing from you soon.

NikolayKulikov
Contributor

Hi,

First of all, you should check the metrics on the backend (select host - Monitor - vSAN - Performance - Backend/Disk/etc.) to find out where the bottleneck appears and the latency spikes originate. This schema could help you:

 

Secondly, I'd suggest running HCIBench with these parameters: 3 VMs per host, 5 VMDKs per VM, 30GB VMDK size, 4K 100% write, 30 min warm-up and a 1h test, with initialization on (zero if there is no dedupe & compression on the cluster, random if DD&C is enabled). After the test you can save the observer data ("save results" button); it contains information from every vSAN module, and screenshots from it could help to find the issue.
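For reference, a roughly equivalent fio command line is sketched below, purely to illustrate the workload shape (HCIBench generates its own parameter files; the device path and job count here are placeholders):

# fio --name=vsan-4k-write --ioengine=libaio --direct=1 --rw=randwrite --bs=4k --iodepth=8 --numjobs=5 --size=30g --ramp_time=1800 --runtime=3600 --time_based --filename=/dev/sdX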

TheBobkin
Champion


@MrPowerEdge 

Can you update the disks to on-disk format (ODF) v13? I suggest this because in any vSAN update (where a new on-disk format is introduced), the vast majority of the performance enhancements only come into effect once the disks have been updated to the new version - which is why these format versions exist at all, aside from specific feature enablement (e.g. encryption in v5).
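If you want to check or trigger the format upgrade from the command line rather than the UI, something along these lines should do it (the RVC cluster path is a placeholder):

# esxcli vsan storage list | grep -i "format version"
rvc> vsan.ondisk_upgrade /localhost/<datacenter>/computers/<cluster>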

 

Thanks for clarifying that you aren't running HCIBench while other workloads are running, but are you running it on an empty vsandatastore? If not, this can have implications, the main two being that the caches already have data on them, and that data stored on the disk groups may limit (and/or dictate) where the test data can be placed. In an extreme case (e.g. if utilisation in the cluster or on certain disks is relatively high), the test data could in theory push individual disk utilisation above 80% (the default CLOM rebalance threshold), and then the test is in contention with a reactive rebalance. Are you flushing the cache between tests, and have you checked the per-disk storage utilisation via RVC during these tests? (Example below.)
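For the per-disk utilisation check mentioned above, RVC on the vCenter appliance can show it (the cluster path is a placeholder):

rvc> vsan.disks_stats /localhost/<datacenter>/computers/<cluster>
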
If you are running this alongside other data that cannot be moved off temporarily, and you have the resources available to evacuate one node, you could test that node as a 1-node vSAN (I know, only FTT=0 is possible then, but it will give a good idea of per-host capabilities).

 

As an aside, regarding how long a snapshot of a given size takes to consolidate: this isn't just a question of how much data the cluster can write. As you are aware, this isn't the only VM using the cluster, and what can determine consolidation time even more is the VM's usage of the snapshot and base-disk data during that time (along with other sources of contention such as backups).

 

Regarding the FTT=0 tests done already - I haven't played with HCIBench in quite some time, but I do recall at one point there being an issue with the placement of FTT=0 data not being 'pinned' to the respective host(s) as it is supposed to be (or at least expected to be).

 

What VM layout and numbers were you running during these tests? Is it possible that it was just pushing I/O against a very limited number of components, of a very limited number of vmdk Objects, on a very limited number of disks?


@NikolayKulikov makes some good points; you should be aiming to dig deeper rather than focusing on one set of graphs in isolation. vSAN Observer data and esxtop data can also help with this.

 

My guess of 64K is nothing to be impressed about - average IO size = throughput per second divided by IOPS 😄

JakubEm
Contributor

Hello, did you solve this problem? We have a similar issue with all-flash SAS on a 10Gbit network.

4 hosts, 3 disk groups.

 

MrPowerEdge
Contributor

Hi,

Not really; upgrading to 7.0.2 helped. In the end Dell arranged for a VMware SME to engage, and he called out some minor tweaks but ultimately said it's correctly configured and working as expected. Given the all-NVMe disks and 4x 25GbE networking, I still think it's not running as well as it should.

JakubEm
Contributor

Same here. I'm trying 4x 10Gbit with LACP but still seeing high latency and sometimes congestion. We are considering buying a 25Gbit switch, but I don't know if it will help.

Fred_vBrain
Enthusiast

Do you still have this issue, or is it solved?

Fred | vBrain.info | vExpert 2014-2022
JakubEm
Contributor

We reduced the write IOPS going to vSAN and now it's fine, but we are waiting for 25Gbit NICs to move vSAN from 10G to 25G, so I hope that will help. Then I can try the previous setup and let you know.

Fred_vBrain
Enthusiast

What kind of workload are you running? What guest OS?

Or do you just run HCIBench against the environment?

Fred | vBrain.info | vExpert 2014-2022
MrPowerEdge
Contributor

Windows, SQL workloads. We also used HCIBench.

Fred_vBrain
Enthusiast

OK, interesting. Have you set the PVSCSI queue depths to the recommended values mentioned here: https://kb.vmware.com/s/article/2053145? If so, set them back to the defaults and test again with SQL.
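For anyone following along, the Windows guest-side change from that KB is a registry value along these lines (a sketch only; check the KB for the exact values for your OS, and deleting the value restores the defaults):

> reg add HKLM\SYSTEM\CurrentControlSet\services\pvscsi\Parameters\Device /v DriverParameter /t REG_SZ /d "RequestRingPages=32,MaxQueueDepth=254"
(reboot the VM afterwards for the change to take effect)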

What kind of workload profile have you tested with HCIBench?

Fred | vBrain.info | vExpert 2014-2022
MrPowerEdge
Contributor

Thanks, yes, VMware tried all of these. I have since moved away from this project. Hope you get yours sorted.

Fred_vBrain
Enthusiast

Ah OK, no worries. I just wanted to see if you had the same issue as I had; I sorted mine out and just wanted to help.

Fred | vBrain.info | vExpert 2014-2022
JakubEm
Contributor

What was your issue?

Fred_vBrain
Enthusiast

We had Splunk running on Linux, and by default Linux had the maximum sector size at 512K. That means an IO can be as large as 512K, and Splunk was writing at that size; vSAN then takes each one and splits it into 64K blocks. In addition, we had the PVSCSI queue settings at their maximums. After changing the max sector size to 64K, cmd_per_lun to 32 and ring_pages to 8, we went from 77ms down to 3ms latency at the VM layer and everything worked flawlessly.
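In case it helps anyone hitting the same thing, the Linux-side changes were along these lines (sdX is a placeholder for the data device; the sysfs change does not persist across reboots, and the module options need an initramfs rebuild and reboot to apply):

# cat /sys/block/sdX/queue/max_sectors_kb
# echo 64 > /sys/block/sdX/queue/max_sectors_kb
# echo "options vmw_pvscsi cmd_per_lun=32 ring_pages=8" > /etc/modprobe.d/pvscsi.conf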

Fred | vBrain.info | vExpert 2014-2022
vsann00b
Contributor

We had major write latency with vSAN. After logging a call with VMware, they highlighted that it is a bug that has been fixed in 7.0 U3f:

Thank you for sharing the requested details.

I have reviewed the logs and below are my findings.
Issue:
High write latency on a disk group.

Assessment:
6 node vSAN all flash cluster with deduplication and compression enabled.
ESXi version: ESXi 7.0 Update 3e (build-19482537)

We could see that one of the disk groups on host '<hostname>' was reporting high log congestion, causing latency on the disk group, which could impact multiple VMs having components placed on this disk group.


We could see the host '<hostname>' was taken into maintenance mode on 18th July at 11:16 UTC, which resolved the issue.

2022-07-18T09:05:33.743Z: [GenericCorrelator] 5641261302076us: [vob.user.maintenancemode.entering] The host has begun entering maintenance mode
2022-07-18T11:16:19.384Z: [GenericCorrelator] 5649106943053us: [vob.user.maintenancemode.entered] The host has entered maintenance mode

There is a known issue with the current ESXi build when 'unmapFairness' and 'GuestUnmap' are enabled.
Please find the KB below:
https://kb.vmware.com/s/article/88832?lang=en_us

Resolution
Update to vSAN/ESXi 7.0 U3f which contains the code fix for this issue.

Workaround
To disable unmap, SSH into each host in the cluster and run the following commands:
# esxcfg-advcfg -s 0 /VSAN/GuestUnmap
# esxcfg-advcfg -s 0 /LSOM/unmapFairness
Place the host into maintenance mode with 'Ensure accessibility' and reboot the host to make the new settings active.
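To confirm the settings took effect after the reboot, the current values can be read back with:

# esxcfg-advcfg -g /VSAN/GuestUnmap
# esxcfg-advcfg -g /LSOM/unmapFairness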