BB9193
Contributor

All flash vSAN performance expectations?

We just deployed an all flash vSAN cluster comprised of 4 Dell R640 ready nodes.  Each node is comprised of:

2 Intel Xeon Gold 6246 @ 3.30 GHz
382 GB RAM
1 Intel Optane P4800x for cache
4 NVMe PM1725B for capacity
1 disk group per node

The vSAN traffic is running over a 25 Gb core.  Dedup and compression are disabled, as is encryption.  We're using 6.7 U3.  All firmware and drivers are up to date.  Storage policy is R1 FTT1.

I've deployed HCIBench and am currently running test workloads with it.  The datastore is empty except for the HCIBench VMs.  The Easy Run workload of 4K/70% Read/100% Random produced the following results:

I/O per Second: 189042.27 IO/S
Throughput: 738.00 MB/s
Read Latency: 1.48 ms
Write Latency: 1.15 ms
95th Percentile Read Latency: 3.00 ms
95th Percentile Write Latency: 2.00 ms

What should I be shooting for with regard to HCIBench results to be able to verify all is well and I can begin moving my production workload into vSAN?  I'm currently testing the other 3 Easy Run workloads and can post any of those results if needed.
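As a quick sanity check on the numbers above, the reported throughput should simply be IOPS multiplied by the block size. A minimal sketch, assuming HCIBench's "MB/s" is really MiB/s (1024-based); the function name is just illustrative:

```python
def throughput_mib_s(iops: float, block_kib: float) -> float:
    """Throughput in MiB/s implied by an IOPS figure at a fixed block size."""
    return iops * block_kib / 1024.0


reported_iops = 189042.27  # "I/O per Second" from the 4K/70% read run above
reported_mb_s = 738.00     # "Throughput" from the same run

implied = throughput_mib_s(reported_iops, block_kib=4)
print(f"implied: {implied:.1f} MiB/s, reported: {reported_mb_s:.1f} MB/s")
```

The two figures agree to within rounding, so the IOPS and throughput lines are describing the same measurement rather than two independent ones.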

srodenburg
Expert

I don't have HCIBench numbers for you, but my lab has 12G dual-ported SAS SSDs (highest vSAN HCL performance category "F") and high-end Enterprise SATA SSDs as capacity devices. Everything I do just flies. Super zippy. Cloning a 100 GB VM -> BAM! Done. Working with servers and doing heavy stuff, it goes like a bat out of hell. Your flash hardware is even faster, so you can only expect goodness. The flash devices and the CPUs in my lab are fast enough to consistently max out the 10 Gb links between nodes when I really hammer it.

And I use "compression only" in vSAN 7 U1 and the difference between "no compression" or "with compression" is measurable, but as a human, I don't feel the difference. My fat SQL queries are only fractionally slower with compression turned on, it's almost statistically irrelevant (error margin). With Deduplication+Compression active I noticed a loss in "snappiness" and responsiveness. But "compression only", almost nothing, you could fool me with a placebo.

Honestly, don't get a hard-on about benchmark numbers too much. If it goes like a rocket, it's fast. And vSAN all-flash with proper hardware like you have, goes like a rocket. Trust me.

What can ruin the party though is using crappy switches for vSAN traffic. I've seen people use fat servers connected to cheapskate switches with small per-port buffers (which saturate quickly) and weak packet-forwarding performance in general, and then all your super fast flash storage is slowed down by relatively slow inter-node traffic. Under stress, this gets worse quickly as the switches just can't cope.
It makes a difference for latency whether the vSAN vmkernel ports of two nodes have a 0.6 ms or a 0.2 ms RTT between them. Rule of thumb: switches that "think too much" or are simply not very fast (cheap crap) tend to introduce latency not befitting the super duper NVMe flash devices inside the nodes.

ManuelDB
Enthusiast

From my experience, on HCIBench you can expect roughly these results per node with 2 DGs per node (on 100% read, 100% random, 4K):

NVMe cache: 150-170K IOPS

SAS cache: 110-130K IOPS

SATA cache: 60-70K IOPS

I consider only the cache tier because, if you run the default test, everything will be placed in cache, and that's where you eventually find the bottlenecks.

With 1 DG per node, just divide them by 2. On Optane I think 100% read will be right around the NVMe performance listed (Optane shines on writes and low latencies; on reads it is not much better than NVMe).

So for your configuration with 1 DG per node I'd expect about 350-400K IOPS on 100% read, 100% random, 4K.

What are your results?
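The rule of thumb above can be written down as a small estimator. This is only a sketch: it assumes read IOPS scale linearly with the number of nodes and disk groups, which the post implies but does not state, and the table and function names are made up for illustration.

```python
from typing import Tuple

# Per-node 4K 100% random read IOPS with 2 disk groups, from the figures above.
PER_NODE_2DG_IOPS = {
    "nvme": (150_000, 170_000),
    "sas": (110_000, 130_000),
    "sata": (60_000, 70_000),
}


def cluster_read_iops(cache: str, nodes: int, disk_groups_per_node: int) -> Tuple[int, int]:
    """Scale the 2-DG per-node range linearly (an assumption) to a whole cluster."""
    low, high = PER_NODE_2DG_IOPS[cache]
    scale = nodes * disk_groups_per_node / 2
    return int(low * scale), int(high * scale)


low, high = cluster_read_iops("nvme", nodes=4, disk_groups_per_node=1)
print(f"4 nodes, 1 DG each, NVMe-class cache: {low:,} to {high:,} IOPS")
```

Taken literally, the rule gives 300-340K IOPS for four nodes with one disk group each, slightly below the rounded 350-400K estimate quoted in the post.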

BB9193
Contributor

I posted the results for the 4K/70% Read/100% Random workload in my original post above.  The 256K/0% Read/0% Random workload however has me a little concerned:

Number of VMs: 8
I/O per Second: 11854.95 IO/S
Throughput: 2963.00 MB/s
Read Latency: 0.00 ms
Write Latency: 6.16 ms
95th Percentile Read Latency: 0.00 ms
95th Percentile Write Latency: 12.00 ms

Although, the more research I do, the more I suspect that's just due to the block size?
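One way to look at the block-size question is to put the 256K run next to the 4K run from the original post and compare ratios. A rough sketch only: the two runs are not like for like (the 4K run is 70% read and random, the 256K run is 100% sequential write), and the dictionary keys are illustrative names.

```python
# Figures copied from the two HCIBench results posted above.
small = {"block_kib": 4, "iops": 189042.27, "mb_s": 738.00, "write_ms": 1.15}
large = {"block_kib": 256, "iops": 11854.95, "mb_s": 2963.00, "write_ms": 6.16}


def ratio(key: str) -> float:
    """How many times larger the 256K figure is than the 4K figure."""
    return large[key] / small[key]


print(f"block size:    x{ratio('block_kib'):.0f}")
print(f"IOPS:          x{ratio('iops'):.3f}")
print(f"throughput:    x{ratio('mb_s'):.2f}")
print(f"write latency: x{ratio('write_ms'):.2f}")
```

Each I/O in the 256K run carries 64 times more data, so far fewer I/Os per second still move roughly four times the bytes, and a higher per-I/O latency is what one would expect when each request is that much larger.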

Here are the results for 4K/100% Read/100% Random:

Number of VMs: 8
I/O per Second: 330801.05 IO/S
Throughput: 1292.00 MB/s
Read Latency: 0.82 ms
Write Latency: 0.00 ms
95th Percentile Read Latency: 1.00 ms
95th Percentile Write Latency: 0.00 ms

BB9193
Contributor

So I posted a response here a couple of days ago but it's gone now, not sure what happened.  I'll post it again with additional info.

My 4K 100% Read 100% Random results are:

Number of VMs: 8
I/O per Second: 330801.05 IO/S
Throughput: 1292.00 MB/s
Read Latency: 0.82 ms
Write Latency: 0.00 ms
95th Percentile Read Latency: 1.00 ms
95th Percentile Write Latency: 0.00 ms

I'm good with these and it's what I would expect.  The write side, however, is much lower than I was anticipating.  Here are my 4K 100% Write 100% Random results:

Number of VMs: 8
I/O per Second: 104066.28 IO/S
Throughput: 406.00 MB/s
Read Latency: 0.00 ms
Write Latency: 2.63 ms
95th Percentile Read Latency: 0.00 ms
95th Percentile Write Latency: 8.00 ms

VMware has initially said this is to be expected due to the redundancy of vSAN.  It doesn't get better than Optane for the cache tier, so I'm confused by this.  We've pushed back on VMware for further verification.

Do these write results look optimal?
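A useful cross-check on the read and write runs above is Little's law: average outstanding I/Os = IOPS x average latency. The sketch below applies it to the posted figures. The actual HCIBench thread and queue-depth settings are not given in this thread, so the concurrency here is inferred, not configured.

```python
def outstanding_ios(iops: float, latency_ms: float) -> float:
    """Little's law: average number of I/Os in flight."""
    return iops * latency_ms / 1000.0


read_in_flight = outstanding_ios(330801.05, 0.82)   # 4K 100% read run
write_in_flight = outstanding_ios(104066.28, 2.63)  # 4K 100% write run

print(f"read run:  ~{read_in_flight:.0f} I/Os in flight")
print(f"write run: ~{write_in_flight:.0f} I/Os in flight")
```

Both runs come out at roughly the same number of I/Os in flight (about 270), which suggests the benchmark was driving the same concurrency in each case. At fixed concurrency, IOPS and latency are two views of the same thing, so the lower write IOPS is the flip side of the higher per-write latency (2.63 ms versus 0.82 ms), and that latency is the number worth chasing.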

ManuelDB
Enthusiast

Have you tried with more VMs? I think at least 4 per host, so 16 in total, with 4 cores each (you have a lot of cores) and 8 vmdks each (this is the default number).

In the Proactive Test network test, do you get 10 Gbps?

One thing you can do to test whether the network is the bottleneck is this:

- Create a VM with 1 disk with FTT0
- Check the placement of the vmdk (VM -> Monitor -> VSAN disk placement)
- vMotion the VM and test it on the same host as its disk (best performance expected), then on every other host, in order to test the performance of the network
- For the test itself, CrystalDiskMark will be enough. On sequential read, you should get the full performance on the same host, and performance capped by the 25 Gbps link on the others. Can you share those results? Both read and write.

BB9193
Contributor

The number and size of VMs I'm running is the recommendation by HCIBench based on my configuration.  The Proactive test shows the full 10 Gb, but I think this only tests the VM Network, not actual vSAN traffic.  Our vSAN traffic has 25 Gb dedicated to it.

I created a new policy with FTT0, but I'm not following how to vMotion the actual disk as there is only one datastore.

srodenburg
Expert

"but I'm not following how to vMotion the actual disk as there is only one datastore."

He is not talking about Storage vMotion. He means: find out where the single disk component lives (with FTT=0 there is only one data component, as it's not mirrored) and do a normal vMotion of the VM to that host. That way, the physical disk is on the same host where the VM is running, so the network is out of the equation.

BB9193
Contributor

Gotcha.  I've vMotioned the VM to the same host where its hard disk resides.  The VM Home and VM Swap components are still on other hosts.

Now how am I supposed to test this, with CrystalDiskMark?

BB9193
Contributor

CrystalDiskMark results.

VM on same host as disk:

[Read]
SEQ 1MiB (Q= 8, T= 1): 2220.505 MB/s [ 2117.6 IOPS] < 3775.22 us>
SEQ 128KiB (Q= 32, T= 1): 2430.383 MB/s [ 18542.4 IOPS] < 1724.83 us>
RND 4KiB (Q= 32, T=16): 472.854 MB/s [ 115442.9 IOPS] < 4430.71 us>
RND 4KiB (Q= 1, T= 1): 59.269 MB/s [ 14470.0 IOPS] < 68.94 us>

[Write]
SEQ 1MiB (Q= 8, T= 1): 2158.187 MB/s [ 2058.2 IOPS] < 3878.06 us>
SEQ 128KiB (Q= 32, T= 1): 1902.626 MB/s [ 14515.9 IOPS] < 2200.66 us>
RND 4KiB (Q= 32, T=16): 378.292 MB/s [ 92356.4 IOPS] < 5474.68 us>
RND 4KiB (Q= 1, T= 1): 11.953 MB/s [ 2918.2 IOPS] < 342.32 us>

VM on different host than disk:

[Read]
SEQ 1MiB (Q= 8, T= 1): 1934.667 MB/s [ 1845.0 IOPS] < 4332.42 us>
SEQ 128KiB (Q= 32, T= 1): 1849.289 MB/s [ 14109.0 IOPS] < 2266.81 us>
RND 4KiB (Q= 32, T=16): 418.366 MB/s [ 102140.1 IOPS] < 5007.07 us>
RND 4KiB (Q= 1, T= 1): 36.948 MB/s [ 9020.5 IOPS] < 110.67 us>

[Write]
SEQ 1MiB (Q= 8, T= 1): 2148.930 MB/s [ 2049.4 IOPS] < 3894.09 us>
SEQ 128KiB (Q= 32, T= 1): 2083.816 MB/s [ 15898.3 IOPS] < 2009.56 us>
RND 4KiB (Q= 32, T=16): 336.430 MB/s [ 82136.2 IOPS] < 6192.49 us>
RND 4KiB (Q= 1, T= 1): 10.387 MB/s [ 2535.9 IOPS] < 393.97 us>
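To make the two runs easier to compare, here is a small sketch that expresses each "different host" throughput as a fraction of the "same host" throughput. The numbers are copied from the results above; the labels and function name are just illustrative.

```python
# CrystalDiskMark throughput in MB/s from the results above.
# Each entry is (VM on same host as disk, VM on a different host).
RESULTS = {
    "read SEQ1M Q8T1": (2220.505, 1934.667),
    "read SEQ128K Q32T1": (2430.383, 1849.289),
    "read RND4K Q32T16": (472.854, 418.366),
    "read RND4K Q1T1": (59.269, 36.948),
    "write SEQ1M Q8T1": (2158.187, 2148.930),
    "write SEQ128K Q32T1": (1902.626, 2083.816),
    "write RND4K Q32T16": (378.292, 336.430),
    "write RND4K Q1T1": (11.953, 10.387),
}


def remote_vs_local(results: dict) -> dict:
    """Remote throughput as a fraction of local throughput, per test."""
    return {name: remote / local for name, (local, remote) in results.items()}


for name, fraction in remote_vs_local(RESULTS).items():
    print(f"{name:<20} remote is {fraction:.0%} of local")
```

The biggest relative drop is on the queue-depth-1 4K read, where the extra network hop shows up directly as latency (68.94 us local versus 110.67 us remote). The sequential write results barely move between the two placements.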

 

ManuelDB
Enthusiast

Not bad. For example, in a cluster I'm working on right now with Intel P4610 NVMe as cache, I get around 2200 MB/s read and 1200 MB/s write on the same host, and 1900/1100 on other hosts.

Considering that 25 Gbps can achieve a maximum of about 3 GB/s and that some overhead is expected, I think that my results and your results are consistent with the installed hardware (you have much stronger writes) and that networking is not an issue/bottleneck (you also have only 1 DG per host).
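The "about 3 GB/s" figure is just the line rate divided by 8. A minimal sketch of that conversion, plus how much of the link the remote-host sequential results posted earlier would occupy. It ignores Ethernet/IP/TCP overhead and assumes CrystalDiskMark's MB/s means 10^6 bytes per second.

```python
def link_rate_gb_s(gbit_per_s: float) -> float:
    """Raw line rate in GB/s (decimal), ignoring all protocol overhead."""
    return gbit_per_s / 8.0


def link_utilisation(mb_s: float, gbit_per_s: float) -> float:
    """Fraction of the raw line rate a given throughput would occupy."""
    return (mb_s / 1000.0) / link_rate_gb_s(gbit_per_s)


print(f"25 Gbps = {link_rate_gb_s(25):.3f} GB/s raw")
# Remote-host SEQ 1MiB results from the CrystalDiskMark post above.
print(f"remote seq read:  {link_utilisation(1934.667, 25):.0%} of line rate")
print(f"remote seq write: {link_utilisation(2148.930, 25):.0%} of line rate")
```

On these numbers the remote sequential runs sit well below the raw line rate, which fits the reading that the link itself is not the limiting factor in this test.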

As you can see, anyway, you are achieving about 100K IOPS on 4K random write with a single host, so from a cluster perspective I'd expect around 350K IOPS with FTT0 and obviously a little less than 180K with FTT1 (add the checksum penalty as well and you will get about 150K IOPS, I think).
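The estimate above can be sketched as follows. It assumes write IOPS scale linearly across hosts and that a RAID-1 policy with FTT=n writes n+1 copies of every guest write. The `extra_factor` parameter is a hypothetical knob for things like the checksum penalty; the 150/180 value is simply the ratio implied by the round numbers in the post, not a measured figure.

```python
def cluster_write_iops(per_host_iops: float, hosts: int, ftt: int,
                       extra_factor: float = 1.0) -> float:
    """Rough guest-visible write IOPS for a RAID-1 vSAN policy.

    Assumes linear scaling across hosts and FTT+1 mirror copies per write.
    """
    copies = ftt + 1
    return per_host_iops * hosts / copies * extra_factor


per_host = 92356.4  # single-host RND 4KiB Q32T16 write result with FTT0

print(f"FTT0:            {cluster_write_iops(per_host, 4, ftt=0):,.0f}")
print(f"FTT1:            {cluster_write_iops(per_host, 4, ftt=1):,.0f}")
print(f"FTT1 + checksum: {cluster_write_iops(per_host, 4, ftt=1, extra_factor=150 / 180):,.0f}")
```

With the measured single-host figure this lands close to the round numbers in the post, and noticeably above the roughly 104K write IOPS that HCIBench reported for the cluster.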

With your test on HCIBench you get only 100K (with FTT1?), so something is not working as expected.

Have you noticed congestion or network packet drops during the test? (You need to run vSAN Observer to analyze packet drops properly; the charts in vCenter are totally wrong and show no packet drops even while drops are happening.)

ManuelDB
Enthusiast

Ah, one more question: what switches are you using?

BB9193
Contributor

We're running a pair of Dell S5212F-ON switches dedicated for vSAN traffic.

ManuelDB
Enthusiast

And what about congestion or packet drops on the vSAN NICs?

You need to check them with vSAN Observer:

http://www.vmwarearena.com/how-to-use-vsan-observer/

 

BB9193
Contributor

UPDATE: so after 2.5 months of troubleshooting with Dell and VMware, VMware has finally come back to say there is a vSAN bug limiting vmdk throughput to 40 MB/s and causing the latency spikes we're seeing.  They could give no ETA on a fix or even if a fix was forthcoming.  Their only recommendation was to break up all my vmdk's to no more than 50 GB each, which is not feasible.  A 2 TB file server would require over 40(!) vmdk's.
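For reference, the vmdk count behind that "over 40" figure works out as below. A trivial sketch, assuming 2 TB is counted as 2048 GB; the function name is illustrative.

```python
import math


def vmdks_needed(total_gb: float, max_vmdk_gb: float) -> int:
    """Number of vmdks required if none may exceed max_vmdk_gb."""
    return math.ceil(total_gb / max_vmdk_gb)


print(vmdks_needed(2048, 50))  # 2 TB file server, 50 GB per vmdk
```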

I can't tell you how disappointing this is and we're starting to doubt vSAN as a viable option for our environment.

ManuelDB
Enthusiast

Wow, that's an unexpected finding, but it could also be useful for a cluster that I'm working on.

Can you give me the SR for reference? That way I can ask whether I'm subject to the same issue, considering that my configuration is pretty similar to yours.

In my case all benchmarks seem fine, but I get some strange latency spikes... Thanks!

Anyway, consider that bugs are present in everything. I know that starting out with a bug is annoying, but it could happen with vSAN just as with any other piece of hardware or software.

Recently, for example, a customer had an issue with a physical storage array that corrupted a VMDK every time it did a Storage vMotion...

ManuelDB
Enthusiast

Another thing: they say to break up the vmdks to 50 GB each, but can you obtain the same result using the vSAN policy and number of stripes? Or by changing VSAN.ClomMaxComponentSizeGB to 50 GB until the fix arrives?

Or is it an issue with the size of the VMDK (so not related to the size of the object)?

This could completely change the impact of this bug. If we (I'm including myself in this issue!) can work around it using VSAN.ClomMaxComponentSizeGB, I think it's not a big issue for a small cluster, and we can wait for the fix.

Otherwise, if the solution is to use more VMDKs, I agree with you that this can't be a solution. How do you split a 400 GB database?

ManuelDB
Enthusiast

I have just found that VSAN.ClomMaxComponentSizeGB can't be set lower than 180 GB, but the combination of this + stripes could help if it's related to component size and not vmdk size...

BB9193
Contributor

I don't have the SR for the VMware side as VMware was engaged by Dell through my Dell ticket number.  I'll see if I can get that from Dell.

According to VMware there is no known workaround and my understanding is it can happen to any size vmdk.  The issue affects the way the vSAN side handles an IOPS burst to a single vmdk at a time.  Their logic is that breaking up the vmdk's will distribute the IO's across multiple vmdk's, which makes sense, but it's just not feasible.

I understand that bugs happen and that's just part of the deal, but the frustrating part for us is that we started reporting problems literally as soon as the cluster was deployed and it's taken 2.5 months just for us to get to this point.  We're now in our business critical season so even if I wanted to break up a vmdk it's too late.  We're going to have to run this cluster as is for the next few months.

What I don't understand is what the trigger is for this bug?  I've asked but thus far have not heard back.  VMware says only a very small number of vSAN customers are affected, but there is nothing special about our cluster as it consists of Dell Ready Nodes and was configured by the Dell ProDeploy team.  So I don't understand why we would be affected while the majority of customers are not.

ManuelDB
Enthusiast

My concern is that I'm in a similar situation.

I did a POC with 2 servers and everything was blazing fast on 6.7U3.

Now exactly the same hardware, but with 5 servers, is slow on 7.0U1 (unfortunately I haven't tried 6.7U3 on this configuration).

We have dug into it a lot and blamed the switches, but it could also be that bug. I have just asked support (I have an open case for performance, which is wonderful on HCIBench (small VMDKs) but not as good in production because of latency spikes).
