VMware Cloud Community
Sharantyr3
Enthusiast

vSAN all-flash 100% write performance

Hello there,

I am, in parallel with this thread, going to open an SR to get advice from VMware techs, so this is not an "I have a problem, please help" thread.

Rather, I would just like to know the numbers on your vSAN cluster, to compare them with my own and see whether I actually have a tuning problem somewhere.

And also, sorry for my poor English :)

I will break this thread into the following parts:

  • Our setup
  • Global performances disappointment
  • Specific question regarding network
  • Expected performances
  • Specific case study

Our Setup

Our setup is a stretched cluster, 4+4+1, all flash.

Each ESXi host has 2 disk groups.

Each disk group consists of 7 data disks (Toshiba PX05SR, 3.84 TB, read intensive, SAS) + 1 cache disk (Toshiba PX05SM, 800 GB, write intensive, SAS).

You can see the whole cluster as 56 data SSDs (4*2*7) providing performance for data and 8 cache SSDs (4*2*1) providing performance for write caching.

All of this is replicated to another site (stretched cluster) with the same number of disks.

The network for vSAN replication is isolated on separate network switches, 4 * Dell S5248F-ON (2 on each site).

These switches are linked together with 2*100 Gbps inside the same site, and the cross-site link is 4*25 Gbps.

Each ESXi host has 2 dedicated 25 Gbps vSAN network ports linked to these switches, 1 port active, 1 port passive.

The network card model is QLogic FastLinQ 41262.

Jumbo frames are enabled end to end. The inter-site latency reported in the vSAN health check is around 1.20 ms.
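
For anyone wanting to verify the same thing, an end-to-end jumbo-frame check can be done with a don't-fragment vmkping from the vSAN vmkernel interface (the vmk number and the target IP are placeholders for your environment):

# 8972 = 9000-byte MTU minus IP/ICMP headers; -d sets don't-fragment
vmkping -I vmk2 -d -s 8972 <remote-vsan-vmk-ip>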

All VMs have PFTT=1 and SFTT=1 with erasure coding selected.

Global performances disappointment

I noticed, especially during night time, that latency on the vSAN cluster sometimes went very high, and by high I mean HIGH:

pastedImage_8.png

The IOPS during this time were not that high:

pastedImage_9.png

Neither was throughput:

pastedImage_10.png

The congestion graph is OK, but Outstanding IO is rising:

pastedImage_11.png

The backend seems OK:

pastedImage_12.png

pastedImage_13.png

pastedImage_14.png

After some digging I found out that this global pressure was mostly caused by ONE specific VM.

This VM was running a cron job copying files from disk 1 to disk 2.

At first sight, you don't see that many IOPS:

pastedImage_15.png

But latency, LATENCY ! :

pastedImage_17.png

Scratching my head and digging into the advanced graphs made me understand the problem. It seems the VM is issuing large IOs, as you can see on the specific drive receiving the data being copied:

pastedImage_18.png

pastedImage_19.png

-> around 505 KB per IO

I understood that by looking at the normalized IOPS (the numbers scaled down as if every IO were 32 KB):

pastedImage_20.png
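
For clarity, the average IO size can be derived from those two counters (a small sketch; the numbers below are placeholders, not the exact values from the graphs):

# Derive the average I/O size from raw IOPS vs normalized (32 KB) IOPS
iops=100          # placeholder: raw IOPS read off the graph
norm_iops=1580    # placeholder: normalized (32 KB) IOPS read off the graph
echo "average I/O size: $(( norm_iops * 32 / iops )) KB"   # -> 505 KB with these example values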

My disappointment here is the fact that ONE VM can impact the whole cluster like this.

I know the answer is "IOPS limit", but this is not ideal:

First, the IOPS limit is per object, so per disk.

If you enforce a 3,000 IOPS limit per object, you may think each VM will not consume more than 3,000 IOPS, but that is WRONG.

If that VM has 5 disks, the limit is 3,000 per disk. If the VM goes crazy, it can potentially consume 15,000 IOPS, way beyond your intended 3,000 IOPS limit.
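
To make the multiplication explicit (the disk count and limit here are just example numbers):

# The IOPS limit rule applies per object (per VMDK), not per VM
vmdk_count=5
iops_limit_per_object=3000
echo "worst-case total for the VM: $(( vmdk_count * iops_limit_per_object )) IOPS"   # 15000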

Specific question regarding network

I noticed that during these stress periods, one specific counter raised a red flag for me, TCP congestion:

pastedImage_26.png

This is a graph from one specific host, but others also have this.

There is not much documentation on the internet about TCP Send Zero Window with vSAN (hello, future Googlers!), apart from the fact that it indicates the receiving host cannot process packets fast enough.

It seems like there is a bottleneck somewhere, but I can't see where.
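
For anyone who wants to dig into the same counter, the zero-window segments can be captured directly on the vSAN vmkernel interface with the standard ESXi tools (the vmk number is a placeholder):

# Capture vSAN vmkernel traffic to a file for offline analysis
pktcap-uw --vmk vmk2 -o /tmp/vsan_vmk2.pcap

# Or watch live for TCP segments advertising a zero receive window
# (bytes 14-15 of the TCP header carry the window size)
tcpdump-uw -i vmk2 'tcp[14:2] = 0'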

Expected performances

Looking at the SSD specs:

https://www.dellemc.com/content/dam/uwaem/production-design-assets/en/Storage/science-of-storage/col...

I would expect a little more performance out of an all-flash cluster storage system.

I know there are costs involved here (erasure coding, second-site replica, checksum, etc.).

But still, the overall number of IOPS on the cluster seems rather low compared to the tech specs of just one SSD.

To be continued ... max 20 images / post :)

8 Replies
Sharantyr3
Enthusiast

Specific case study

Would you guys be kind enough to show me some of your numbers?

I use a Windows test VM (I also tried Linux) with a paravirtual SCSI controller, PFTT=1, SFTT=1, RAID 5, no IOPS limit.

The test VM has 1 system disk, 1 "source" data disk, and 1 "destination" data disk.

I copy a set of 36 files totaling 90 GB.

When I do a copy from the source disk to the destination disk, here are my numbers on the destination disk specifically:

pastedImage_29.png

pastedImage_30.png

pastedImage_31.png

pastedImage_32.png

When I apply a 4,000 IOPS limit, with the same data protection:

pastedImage_33.png

pastedImage_34.png

pastedImage_35.png

pastedImage_36.png

I also tried with PFTT=0, SFTT=1, erasure coding, no IOPS limit, to check without the stretched-cluster replication:

pastedImage_0.png

pastedImage_38.png

pastedImage_39.png

pastedImage_40.png

Still, I'm not very impressed by these numbers.

Tests were done during production hours, but there was not much other disk activity:

pastedImage_0.png

So, what do you guys think about all this?

Am I right in thinking there may be a configuration / tuning issue here, or is this what to expect given the number of disk groups and disks I currently have?

Thanks for your inputs, and don't hesitate to share your numbers too!

seamusobr1
Enthusiast

Quite a lot of info, but I will do my best.

I assume you have tried PFTT=1 with RAID 1; I would also try it with SFTT=0.

Erasure coding roughly causes a 40% increase in read/write amplification, depending on your circumstances.

Also, what stats are you getting on your vmnics?

esxcli network nic stats get -n vmnicx

See if you are getting dropped packets or receive errors:

NIC statistics for vmnic2

   Packets received: 0
   Packets sent: 0
   Bytes received: 0
   Bytes sent: 0
   Receive packets dropped: 0
   Transmit packets dropped: 0
   Multicast packets received: 0
   Broadcast packets received: 0
   Multicast packets sent: 0
   Broadcast packets sent: 0
   Total receive errors: 0
   Receive length errors: 0
   Receive over errors: 0
   Receive CRC errors: 0
   Receive frame errors: 0
   Receive FIFO errors: 0
   Receive missed errors: 0
   Total transmit errors: 0
   Transmit aborted errors: 0
   Transmit carrier errors: 0
   Transmit FIFO errors: 0
   Transmit heartbeat errors: 0
   Transmit window errors: 0

Sharantyr3
Enthusiast

Hello there 😃

Yes, I may have overextended a little...

No errors on the NICs :(

Just to try a fresh start on the discussion:

Is there any website referencing various vSAN deployments and their performance?

I'd like to compare with what I got.

Also, if you are running all flash, can you test copying a bunch of big files (~5 GB each, say 100 GB total) and show me the IOPS, normalized IOPS, latencies, and throughput you get on the source disk and destination disk?

Please also specify whether you are on a stretched cluster or not, your PFTT and SFTT levels, erasure coding or not, and how many disk groups, hosts and disks you have.
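
To make the test comparable, here is roughly what I mean (a sketch assuming a Linux test VM; the mount points are placeholders):

# Generate 20 files of ~5 GB each (~100 GB total) on the source disk
for i in $(seq 1 20); do
  dd if=/dev/urandom of=/mnt/source/testfile_$i bs=1M count=5120 oflag=direct
done

# Then time the copy to the destination disk while watching the vSAN performance graphs
time cp /mnt/source/testfile_* /mnt/destination/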

seamusobr1
Enthusiast

It might be worth trawling through all the design guides on StorageHub, as throughput can be impacted even by the type of switches used. Even a small buffer size on the switch port can impact performance.

We had the MTU set to 9000 and we use stretched clusters, but we discovered that the network admin had left it at 1500 on the core, so performance on our stretched cluster was impacted.

We have about 26 stretched clusters with 12 hosts in each site. Each host has 4 disk groups (all flash) with five capacity drives in each.

Storage policy settings depend on the needs of the application. So for Postgres boxes we would do RAID 1 for the database and logs.

I would strongly recommend that you use HCIBench and look at the performance you get with different storage policies.

depping
Leadership

You typically won't find any performance reports where the performance testing is based on a file copy from VM to VM, to be honest. If you want to understand the capabilities of your system, use HCIBench.

depping
Leadership

Also note, you are doing RAID-5; there's a significant write penalty with RAID-5 compared to RAID-1, for instance. You could indeed limit that single VM from an IOPS perspective, but do note that this simply means the copy process will take longer.

bmrkmr
Enthusiast

Actually, I did some similar tests about 2 years ago, in order to find out what could be done for an individual VM requiring a lot of write IO, which seems to be your specific case...

With a somewhat similar setup I can confirm that the figures were quite similar (I remember throughput of about 150 MB/s with latency in the range of 20-30 ms).

The first takeaway at that time was that erasure coding is a write-performance killer. Second, you *may* be able to get slightly better figures with an increased stripe width, and you will get considerably better results only with reduced fault tolerance (with regard to heavy write IO on few VM disks).

Sharantyr3
Enthusiast

Hello!

seamusobr1 wrote:
> It might be worth trawling through all the design guides on StorageHub, as throughput can be impacted even by the type of switches used. Even a small buffer size on the switch port can impact performance.
>
> We had the MTU set to 9000 and we use stretched clusters, but we discovered that the network admin had left it at 1500 on the core, so performance on our stretched cluster was impacted.
>
> We have about 26 stretched clusters with 12 hosts in each site. Each host has 4 disk groups (all flash) with five capacity drives in each.
>
> Storage policy settings depend on the needs of the application. So for Postgres boxes we would do RAID 1 for the database and logs.
>
> I would strongly recommend that you use HCIBench and look at the performance you get with different storage policies.

The MTU is OK end to end (otherwise vSAN health would complain).

What do you think of the buffer size on my switch model, the Dell S5248F-ON? I only found 32 MB listed as "packet buffer" in the spec sheet, with no info on per-port buffer size. But I'm not a network guy, and I couldn't find in the documentation or on the switches themselves the actual buffer status (% filled? is that even a metric for switches?).
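
The only thing I know how to check on the switch side so far is the per-port counters, something like this on OS10 (syntax from memory, so treat it as a sketch; the interface name is a placeholder):

show interface ethernet 1/1/1

That shows the drop/discard counters for the port, but it doesn't tell me how full the buffers actually get.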

bmrkmr wrote:
> Actually, I did some similar tests about 2 years ago, in order to find out what could be done for an individual VM requiring a lot of write IO, which seems to be your specific case...
>
> With a somewhat similar setup I can confirm that the figures were quite similar (I remember throughput of about 150 MB/s with latency in the range of 20-30 ms).
>
> The first takeaway at that time was that erasure coding is a write-performance killer. Second, you *may* be able to get slightly better figures with an increased stripe width, and you will get considerably better results only with reduced fault tolerance (with regard to heavy write IO on few VM disks).

Thanks for the information. My concern with your answer is that I see much higher latencies than 20-30 ms.

I also tested without site replication, RAID 1; the numbers are of course better, but not "wow".

depping wrote:
> You typically won't find any performance reports where the performance testing is based on a file copy from VM to VM, to be honest. If you want to understand the capabilities of your system, use HCIBench.

I did benchmark the whole cluster during pre-production using I/O Analyzer | VMware Flings and got good results (I guess):

The test run was 1 IO worker per ESXi host, 8 workers in total.

All VMs were configured for PFTT=1, SFTT=1, erasure coding.

IO profile: 70% read / 30% write, 80% random / 20% sequential, 4K blocks, 5-minute test run:

pastedImage_2.png

As you can see, the IOPS numbers are good.
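
If someone wants to reproduce a similar profile inside a guest instead of with the appliance, this is roughly the fio equivalent I would try (the target device is a placeholder, fio must be installed in the test VM, and the 80/20 random/sequential mix is approximated as fully random here):

# ~70/30 read/write mix, 4K blocks, 5-minute run, direct I/O
fio --name=vsan-70-30 --filename=/dev/sdb \
    --ioengine=libaio --direct=1 \
    --rw=randrw --rwmixread=70 \
    --bs=4k --iodepth=32 --numjobs=4 \
    --runtime=300 --time_based --group_reporting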

depping wrote:
> Also note, you are doing RAID-5; there's a significant write penalty with RAID-5 compared to RAID-1, for instance. You could indeed limit that single VM from an IOPS perspective, but do note that this simply means the copy process will take longer.

Sorry, but you are stating the obvious. I'm not a fresh newcomer to IT ;)

I do know there is a performance impact because of the stretched cluster, because of RAID 5, etc.

What I am wondering is why one single VM can put such high pressure on vSAN.

I am also trying to get an idea of how it looks elsewhere.

If you want to try a fun thing, try cat /dev/urandom > /to/some/file

You will get terrible performance, like this :) :

pastedImage_8.png

What I suppose, and I think it may be the root cause of my problem, is that the IO size is huge.

Looking at this graph (normalized IOPS is 1983, IOPS is 63), I conclude the IO size of this test is 1983 * 32 KB / 63 ≈ 1 MB per IO.

I've seen poor performance on VMs doing large IOs (>= 512 KB), but with this test it is really visible.
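
A more controlled way to reproduce it is to fix the write size explicitly (same placeholder path as above; oflag=direct bypasses the guest page cache so the block size actually reaches the storage stack):

# Unthrottled large writes; compare e.g. bs=64k vs bs=1M in the vSAN latency graphs
dd if=/dev/urandom of=/to/some/file bs=1M count=10240 oflag=direct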

Do you guys see poor performance with big IO sizes too?

Also, support has seen some warnings about high latencies on the vSAN uplink, so I may actually have a problem somewhere. I need to align my driver version with the supported firmware version first.
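
For reference, the driver and firmware versions can be read per NIC like this (the vmnic name is a placeholder):

# Show driver name, driver version and firmware version for the vSAN uplink
esxcli network nic get -n vmnic2

# List the installed QLogic driver VIB (assuming the qedentv driver family)
esxcli software vib list | grep -i qed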

I will let you know if I find something useful.
