VMware Cloud Community
BB9193
Enthusiast

All flash vSAN performance expectations?

We just deployed an all-flash vSAN cluster consisting of 4 Dell R640 ready nodes.  Each node contains:

2 Intel Xeon Gold 6246 @ 3.30 GHz
382 GB RAM
1 Intel Optane P4800x for cache
4 NVMe PM1725B for capacity
1 disk group per node

The vSAN traffic runs over a 25 Gb core.  Deduplication and compression are disabled, as is encryption.  We're running vSAN 6.7 U3, with all firmware and drivers up to date.  The storage policy is RAID 1 with FTT=1.

I've deployed HCIBench and am currently running test workloads with it.  The datastore is empty except for the HCIBench VMs.  The Easy Run workload of 4K / 70% read / 100% random produced the following results:

I/O per Second: 189042.27 IO/S
Throughput: 738.00 MB/s
Read Latency: 1.48 ms
Write Latency: 1.15 ms
95th Percentile Read Latency: 3.00 ms
95th Percentile Write Latency: 2.00 ms
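
As a quick sanity check on those numbers (just back-of-the-envelope arithmetic on my part, not additional HCIBench output), the reported IOPS and throughput are consistent for a 4 KiB block size:

    # Rough consistency check: throughput should be roughly IOPS x block size.
    # Numbers are taken from the Easy Run summary above.
    iops = 189042.27
    block_size_bytes = 4 * 1024                      # 4 KiB blocks

    throughput_mib = iops * block_size_bytes / (1024 ** 2)
    print(f"Expected throughput: {throughput_mib:.0f} MiB/s")   # ~738, matches the reported 738.00 MB/s

    # The 70/30 read/write mix implies roughly:
    print(f"Read IOPS:  {iops * 0.70:,.0f}")         # ~132,330
    print(f"Write IOPS: {iops * 0.30:,.0f}")         # ~56,713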

What should I be shooting for in the HCIBench results to verify that all is well before I begin moving my production workload onto vSAN?  I'm currently running the other 3 Easy Run workloads and can post any of those results if needed.

BB9193
Enthusiast

VMware has reviewed our cluster end to end, and while they originally suspected a networking problem, they eventually identified the bug.

TomIvone
Contributor

@BB9193 Do you have a service request or bug number we can reference on the Dell or VMware side? I believe we might be having the same issue with our 3-node Dell all-flash ready node vSAN environment.

The environment is everything we expect it to be and more in terms of performance, except for single-VM burst IO requirements.

VMware cannot find any issues and has confirmed the performance is as expected, but when testing high burst IO on a single VM with HammerDB, our performance is terrible.

From our testing, comparing a single host with the same spec against the 3-node vSAN cluster, we are seeing a TPM performance drop of at least 70%, and in some cases higher.

While we expect some drop in performance going from a single host to HCI, we do not believe the burst IO performance drops we are seeing are expected.

The test we are running:

HammerDB - SQL TPC-C (cloned VMs, so identical)

1 warehouse, 10 users

Single host - 335,000 TPM

vSAN - 95,000 TPM
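
For reference, those two numbers work out to the "minimum 70%" drop mentioned above (simple arithmetic, nothing more):

    # Relative TPM drop from the single-host baseline to vSAN (numbers from this post).
    single_host_tpm = 335_000
    vsan_tpm = 95_000

    drop = (single_host_tpm - vsan_tpm) / single_host_tpm
    print(f"TPM drop: {drop:.1%}")   # ~71.6%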

BB9193
Enthusiast

The Dell SR is 1044143886 and I believe the VMware SR is 20175210611.  I also recommend running HCI Bench to get full benchmarks for your entire cluster.  

Watch the performance graphs per VM during normal day-to-day operations.  What we see are random latency spikes throughout the day on all VMs running on vSAN.  They typically range anywhere from 5 ms to 30 ms.  We actually saw a spike of 130 ms recently.
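
If you'd rather not eyeball the graphs all day, something along these lines can flag the spikes from an exported set of per-VM latency samples (the CSV layout here is a hypothetical example I made up, not something vSAN exports natively):

    import csv

    # Flag per-VM latency samples above a threshold.
    # Assumes a hypothetical CSV with columns: timestamp, vm_name, latency_ms.
    SPIKE_THRESHOLD_MS = 5.0    # spikes we see range from ~5 ms up to 130 ms

    def find_spikes(csv_path, threshold_ms=SPIKE_THRESHOLD_MS):
        spikes = []
        with open(csv_path, newline="") as f:
            for row in csv.DictReader(f):
                latency = float(row["latency_ms"])
                if latency >= threshold_ms:
                    spikes.append((row["timestamp"], row["vm_name"], latency))
        return spikes

    for ts, vm, lat in find_spikes("vsan_vm_latency.csv"):
        print(f"{ts}  {vm}  {lat:.1f} ms")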

This latency issue was totally unexpected for us, as we anticipated that all-flash with Optane would have extremely low latency.  Dell was caught off guard as well.

srodenburg
Expert

Install a free appliance called "SexiGraf" and configure it to talk to your vCenter. After an hour or so, you can select the dashboards for the various vSAN views and find out what latency, at which level, is affecting you: Disk, Client, or Owner.  It's a very good tool for this sort of thing and easier to use than VMware's own IOInsight. It also runs all the time, so you can look at historical data.

QuickStart – SexiGraf

Everything you need to download, and all the info, is on this QuickStart page.

BB9193
Enthusiast

@TomIvone You guys make any progress on this?

Sharantyr3
Enthusiast

Hello

That explains why I can't reproduce the huge latency spikes I see on my all-flash vSAN cluster using IO Analyzer; I use small disks on those appliances and get good results...

On the other hand, production VMs running a sustained write stream show very poor performance:

[screenshot: Sharantyr3_0-1617285061496.png]

(this is a PFTT=1, SFTT=1, RAID 1 VM)

On my side, I had no success with VMware support (I got tired of running tests and debugging on our production environment when the issue is undeniable).

My personal feeling is that vSAN does not work well with big IO sizes (>256 KB), but your issue is with 4 KB IOs...

Following this thread for news.

 

I would be happy to run benchmarks and share results here to compare performance and help, if you tell me which benchmarks you'd like to run.

Our current setup is a stretched 7+7 all-flash cluster (I can run tests with a non-stretched storage policy); each node has 3 disk groups (6+1, 5+1, 5+1), 25 Gbps Ethernet, and Dell S5248F-ON switches.

BB9193
Enthusiast

My HCI Bench test results and parameters are posted early on in this thread if you want to try any of those for comparison.

I'm told these issues will possibly be addressed in 7.0 U3.

Sharantyr3
Enthusiast

Can you confirm the benchmark target you want me to test?

4K 100% Write 100% Random - 7 workers, each one on a different ESXi host

PFTT=0, SFTT=1 - RAID 1 (non-stretched)

 

Here are my results from IO Analyzer:

[screenshot: Sharantyr3_0-1617290741602.png]

Write latency is about 5 ms.

Same test, but 100% read instead of 100% write:

[screenshot: Sharantyr3_1-1617291201135.png]

Read latency is about 0.5 ms.

 

Please note that, unlike you, I have no NVMe, only regular SAS SSDs, with write-intensive drives for cache.

My IO Analyzer is pretty old and doesn't generate graphs or latencies anymore; I can set up HCIBench if needed.

TomIvone
Contributor

We were unable to 100% prove this was our issue, and we determined that the workload requirement was unsuitable for vSAN; tests within a Dell lab using Optane-based storage were unable to reach the performance we require in a single-VM use case.

At this stage we are still working out whether to keep vSAN, since it meets 99% of our requirements, and use a single host plus replication for the remaining 1%; the other option would be to ditch vSAN and go back to the SAN + RAID 10 model.

BB9193
Enthusiast

@Sharantyr3 Yes, you would need to use HCI Bench.  Here are some of my results from previous runs:

My 4K 100% Read 100% Random results are:

Number of VMs: 8
I/O per Second: 330801.05 IO/S
Throughput: 1292.00 MB/s
Read Latency: 0.82 ms
Write Latency: 0.00 ms
95th Percentile Read Latency: 1.00 ms
95th Percentile Write Latency: 0.00 ms

Here are my 4K 100% Write 100% Random results:

Number of VMs: 8
I/O per Second: 104066.28 IO/S
Throughput: 406.00 MB/s
Read Latency: 0.00 ms
Write Latency: 2.63 ms
95th Percentile Read Latency: 0.00 ms
95th Percentile Write Latency: 8.00 ms

BB9193
Enthusiast

@TomIvone What size was the vmdk on the test VM?  vSAN has throughput limitations per vmdk.

Supposedly they have made decent performance improvements in the later versions of 7.x.  I'm hoping there will be a business-stable release of 7.x ready for later this year.

TomIvone
Contributor

120GB and 100GB.

My requirement for one VM in this environment is our main issue.

Using HammerDB and MS SQL, it must be able to reach 200,000 TPM with the following settings:

Number of warehouses: 1

Virtual users to build schema: 1

Virtual users: 10

BB9193
Enthusiast

@TomIvone We are also disappointed with our SQL performance in vSAN.

Sharantyr3
Enthusiast

Hello,

I tried HCIBench, but the pressure on our vSAN cluster was too high; it's the first time I've seen the "congestions" counter rising. I had to abort testing when the vSAN performance service hung and alarms about host communication problems with vCenter started appearing.

Also, you didn't mention which HCIBench run you did (Easy Run or custom?). I chose custom, as I don't like "auto" things running on their own, but maybe 7 VMs with 4 disks each was too much.

How many VMs per ESXi host, and how many disks of what size per test VM, did you run?

Also, what "Working-Set Percentage" did you choose?

I can run new tests during off hours.

 

 

Edit: a benchmark with reduced load on vSAN (by reducing the number of vmdks per VM to 1):

Case Name | Job | VMs | VMs Finished Early | IOPS | Throughput (MB/s) | Read Lat (ms) | Write Lat (ms) | Read 95th %ile (ms) | Write 95th %ile (ms) | Block Size | Read % | Total Outstanding IO | Physical CPU | Physical Memory | vSAN CPU
fio-1vmdk-100ws-4k-0rdpct-100randompct-2threads-1617721835 | job0 | 7 | 0 | 12393.06 | 48 | 0 | 1.13 | 0 | 1 | 4K | 0% | 14 | 0.0% | 44.93% | 0.0%
fio-1vmdk-100ws-4k-100rdpct-100randompct-2threads-1617722455 | job0 | 7 | 0 | 38420.11 | 150 | 0.37 | 0 | 0 | 0 | 4K | 100% | 14 | N% | N% | -
fio-1vmdk-100ws-512k-0rdpct-100randompct-2threads-1617723057 | job0 | 7 | 0 | 4067.45 | 2033 | 0 | 3.65 | 0 | 4 | 512K | 0% | 14 | 0.0% | 44.93% | 0.0%
fio-1vmdk-100ws-512k-100rdpct-100randompct-2threads-1617723569 | job0 | 7 | 0 | 8413.18 | 4206 | 1.73 | 0 | 3 | 0 | 512K | 100% | 14 | 0.0% | 44.93% | 0.0%

Working set %: 100

1 VM per ESXi host, 1 disk of 10 GB, 2 threads per VM

7 VMs total, RAID 1

 

Weird: even with reduced load on vSAN, the vSAN performance service seems to get hammered by HCIBench and becomes unresponsive while benchmarks are running (no more performance graphs in the vCenter vSAN tab).

 

I need your exact test specifications to run the same here for comparison.

Edit 2:

Case Name | Job | VMs | VMs Finished Early | IOPS | Throughput (MB/s) | Read Lat (ms) | Write Lat (ms) | Read 95th %ile (ms) | Write 95th %ile (ms) | Block Size | Read % | Total Outstanding IO | Physical CPU | Physical Memory | vSAN CPU
fio-4vmdk-100ws-4k-0rdpct-100randompct-2threads-1617737533 | job0 | 7 | 0 | 49954.22 | 195 | 0 | 1.12 | 0 | 1 | 4K | 0% | 56 | N% | N% | -
fio-4vmdk-100ws-4k-100rdpct-100randompct-2threads-1617738028 | job0 | 7 | 0 | 146658.57 | 572 | 0.38 | 0 | 0 | 0 | 4K | 100% | 56 | 15.69% | 45.0% | 1.61%
fio-4vmdk-100ws-512k-0rdpct-100randompct-2threads-1617739020 | job0 | 7 | 0 | 10684.99 | 5342 | 0 | 5.34 | 0 | 9 | 512K | 0% | 56 | N% | N% | -
fio-4vmdk-100ws-512k-100rdpct-100randompct-2threads-1617739833 | job0 | 7 | 0 | 26186.92 | 13093 | 2.16 | 0 | 3 | 0 | 512K | 100% | 56 | 0.0% | 45.0% | 0.0%

 

Working set %: 100

1 VM per ESXi host, 4 disks of 100 GB, 2 threads per VM

7 VMs total, RAID 1

 

 

Looking at my numbers, I find the read/write IOPS ratio quite similar to yours (you have faster, but fewer, cache disks).

What numbers were you expecting (what did the vendor say pre-sales)?

akaddour
Contributor

I too am finding some unexpectedly "average" performance on a 5-node all-flash vSAN cluster I'm working on for a client (vCenter 6.7), vSAN on-disk format version = 10.

The environment is a private cloud, so I don't have full visibility of the back-end network; however, I can see that the hardware specs of the host servers are very good and modern.

I use a combination of tools to assess performance: HCIBench for cluster-level performance, but for 'real-world' performance tests I use simple SSD read/write utilities in a Windows guest VM. I won't go into the exact numbers, but basically the guest VM on vSAN performs worse than it does on VMware Workstation installed on my laptop with a single consumer-grade SSD. The performance is slightly better than my testing lab server, which has 8x Samsung EVO 850s in RAID 10 on an old LSI MegaRAID 9261-8i controller.

I have a case open with VMware and the private cloud vendor, so we are trying to work out what is causing the poor write performance and high latency spikes.

ManuelDB
Enthusiast

For me, it was because I was not using LACP on the vSAN NICs, the two ports were configured as active/active instead of active/passive, and the Aruba switch had awful intra-switch performance.

Sharantyr3
Enthusiast

Same here, and I'll tell you why: benchmarks use small to average IO sizes, so you get good performance on vSAN.

When you use a file copy or any real-world use case like a database dump, you may get poor performance (by poor, I mean not what you would expect from SSDs). I say "may" because it depends on the OS, the filesystem, etc., but in the end, if your IO sizes are >1 MB, performance is poor.

Everyone here will just tell you "file copy is not a real benchmark," end of discussion.

But like you, I think that while it may not be a benchmark, it is a real-world use case, and the performance is not OK.

And I can't find any bottleneck in the chain, and neither could VMware support, so I gave up on my support request.

Overall, the servers work just fine because 90% of IOs are small. Just don't be surprised that you get poor performance when you do a file copy; it's by design.

BB9193
Enthusiast

We had an escalation ticket open about this for months, and support eventually confirmed that with vSAN 6.7 there are low throughput limits per vmdk.  They actually suggested we break up all vmdks to no more than 50 GB each to get around these limits, which obviously is ridiculous.

If you have tickets open, ask support about the vmdk throughput limit as I don't recall the specifics.

kastlr
Expert

Hi,

Using a single file copy job on any shared storage array will always result in lower performance.

This is by design, as any shared storage is built to be accessed by multiple hosts (different OSes, different applications, different IO profiles, multiple threads).

So if your use case requires high single-threaded IO performance, it will require tuning on the VM side.

Here are some recommendations to increase Windows single-threaded IO performance:

  • disable the Windows write cache
  • align the NTFS file system
  • if possible, configure the guest OS to use a maximum IO size of 64 KB
  • use multiple PVSCSI controllers per VM (up to 4)
  • create multiple vmdks and use an SPBM policy with a proper number of stripes
  • spread those vmdks evenly across the PVSCSI controllers
  • use Windows dynamic disks to create a striped volume out of those vmdks

This should increase the performance of those single-threaded applications; a rough sketch of the vmdk/controller split follows below.
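
To illustrate the last three bullets, here is a small sketch (my own example with made-up names and sizes, not an official recommendation) of splitting one large data disk into several vmdks and spreading them round-robin across the PVSCSI controllers before striping them inside the guest:

    # Illustrative only: split one logical data disk into N vmdks and assign them
    # round-robin across up to 4 PVSCSI controllers, per the list above.
    # Sizes, counts, and names are made-up examples.

    def plan_vmdk_layout(total_gb, vmdk_count, pvscsi_controllers=4):
        per_vmdk_gb = total_gb / vmdk_count
        return [
            {
                "vmdk": f"data_{i}.vmdk",                        # hypothetical file name
                "size_gb": per_vmdk_gb,
                "pvscsi_controller": i % pvscsi_controllers,     # round-robin placement
            }
            for i in range(vmdk_count)
        ]

    # Example: a 400 GB data disk split into 8 vmdks across 4 controllers,
    # later combined into one striped volume inside the guest OS.
    for disk in plan_vmdk_layout(total_gb=400, vmdk_count=8):
        print(disk)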

Just a side note.

Windows Explorer file copy tasks use different IO sizes for reads and writes, but both are larger than 64 KB per IO.

vSAN is optimized for 64 KB IOs.

vSAN has to split larger IOs into smaller chunks, and that activity increases the latency of the IO.

And when using only a single vmdk (or a few vmdks) with the default SPBM stripe setting of 1, you might end up working with a single cache device instead of spreading the load across multiple cache devices.
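
To make the splitting concrete, here is the simple arithmetic for how many sub-IOs a large guest IO turns into under the 64 KB assumption described above (illustrative math only, not measured vSAN internals):

    import math

    VSAN_CHUNK_KB = 64   # IO size vSAN is optimized for, per the note above

    for io_size_kb in (64, 256, 512, 1024):
        sub_ios = math.ceil(io_size_kb / VSAN_CHUNK_KB)
        print(f"{io_size_kb:>5} KB guest IO -> {sub_ios:>2} x {VSAN_CHUNK_KB} KB sub-IOs")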

 

But I would be really interested in a reference or KB article regarding the vmdk throughput limit.


Hope this helps a bit.
Greetings from Germany. (CEST)
Sharantyr3
Enthusiast

Hello,

What you are writing is partially right. Have you ever tested a file copy on a low-end array?

I get more throughput using 60 magnetic disks in sequential read/write than in an all-flash vSAN cluster.

But it's as you said: vSAN is there to serve many "customers," not only my benchmark, whereas the low-end array will go all in for me.

 

I think the main problem here is accepting the fact that, in a vSAN cluster, you may get less performance on one job, with hundreds of flash disks working, than you would get from a single direct-attached flash disk (or even a local RAID 5 card with 3 flash disks). It's disappointing, but it's by design.

And it's good to know it: one VM can't compromise the whole vSAN storage cluster.

 

What's most frustrating, I think, is that no bottleneck is visible anywhere in the chain:

Check data disks: latency OK

Check cache disks: latency OK and not filled up

Check network cards: way below full bandwidth

Check switches: underutilized

Check vSAN: no bottleneck in the graphs

 

Also, some graphs don't make sense to me, and that is frustrating too.

But I believe the global vSAN graphs are not accurate, and it's related to our discussion: the vSAN graphs show the average performance of the VMs doing IO.

And if, at a specific point in time, only one VM is doing high IOPS, that VM will make the graph "averages" go crazy.

A typical example: backups.

Look at this (I removed writes, as they are not spiking, which makes it easier to read and see my point):

[screenshot: Sharantyr3_0-1635144761627.png]

 

We see huge spikes in read latency while read IOPS don't rise, but throughput does rise, which is the sign of larger IO sizes coming into vSAN.

It's typical of backup jobs reading big chunks.

You can see later a read IOPS spike where latency doesn't increase nearly as much.

 

At first I thought there was a latency problem in my vSAN, but in fact there is no problem: just one VM doing a lot of work, resulting in a "false average" latency graph.

I believe that if, during this backup window, I had many VMs doing "normal" IO (at least 3 or 4 times as much as the backup job), my graphs would show "normal" latency, not such spikes.
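
To illustrate the "false average" effect, here is a tiny worked example with invented numbers (an IOPS-weighted average latency, which is roughly what an aggregate graph reflects):

    # Invented numbers: one backup job doing large, slow reads vs. the same job
    # running alongside several VMs doing small, fast IO.

    def weighted_avg_latency(workloads):
        """workloads: list of (iops, latency_ms) tuples."""
        total_iops = sum(iops for iops, _ in workloads)
        return sum(iops * lat for iops, lat in workloads) / total_iops

    backup_only = [(2_000, 20.0)]                                # backup reading big blocks
    backup_plus_normal = [(2_000, 20.0)] + [(2_000, 1.0)] * 4    # plus 4 VMs doing small IO

    print(f"Backup alone:        {weighted_avg_latency(backup_only):.1f} ms")        # 20.0 ms
    print(f"Backup + 4 busy VMs: {weighted_avg_latency(backup_plus_normal):.1f} ms") # 4.8 ms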

I don't know how this could be fixed.
