In parallel with this thread, I am going to open an SR to get advice from VMware techs, so this is not an "I have a problem, please help" thread.
Rather, I would just like to know the numbers on your vSAN cluster, to compare with my own and see whether I actually have a tuning problem somewhere.
And also, sorry for my poor English.
I will break this thread into 5 parts:
- Our setup
- Global performance disappointment
- Specific question regarding network
- Expected performance
- Specific case study
Our setup is a stretched cluster, 4+4+1, all flash.
Each ESXi host has 2 disk groups.
Each disk group consists of 7 data disks (Toshiba PX05SR, 3.84 TB, read intensive, SAS) + 1 cache disk (Toshiba PX05SM, 800 GB, write intensive, SAS).
You can see the whole cluster as 56 data SSDs (4*2*7) providing data performance and 8 cache SSDs (4*2*1) providing write caching.
All of this is replicated to the other site (stretched cluster) with the same number of disks.
The vSAN network is isolated on separate switches, 4 * Dell S5248F-ON (2 on each site).
These switches are linked together with 2*100 Gbps inside each site, and the cross-site link is 4*25 Gbps.
Each ESXi host has 2 dedicated 25 Gbps vSAN network ports connected to these switches, 1 port active, 1 port passive.
The network card model is QLogic FastLinQ 41262.
Jumbo frames are enabled end to end. The inter-site link latency reported in the vSAN health check is around 1.20 ms.
All VMs have PFTT=1, SFTT=1, and erasure coding selected.
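As a side note, here is my rough mental model of what that policy costs per guest write; a minimal sketch, assuming worst-case RAID-5 read-modify-write on both sites (the real write buffer coalesces IO, so actual amplification should be lower):

```python
# Back-of-the-envelope cost of PFTT=1 (site mirror) + SFTT=1 with
# RAID-5 erasure coding inside each site. Worst case, illustrative only.

SITES = 2               # PFTT=1 on a stretched cluster mirrors across both sites
RAID5_READS = 2         # read old data + old parity (read-modify-write)
RAID5_WRITES = 2        # write new data + new parity

backend_ios_per_site = RAID5_READS + RAID5_WRITES
total_backend_ios = SITES * backend_ios_per_site
print(f"Backend IOs per guest write (worst case): {total_backend_ios}")  # 8

# Capacity side: RAID-5 (3+1) stores 1.33x per site, mirrored on 2 sites.
capacity_overhead = SITES * 4 / 3
print(f"Raw capacity per TB written: {capacity_overhead:.2f} TB")        # 2.67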
Global performance disappointment
I noticed, especially during the night, that latency on the vSAN cluster sometimes went very high, and by high I mean HIGH:
The IOPS during this time were not that high:
Neither was the throughput:
The congestion graph is OK, but the Outstanding IO is rising:
The backend seems OK:
After some digging, I found out this global pressure was mostly caused by ONE specific VM.
This VM runs a cron job that copies files from disk 1 to disk 2.
At first sight, you don't see that many IOPS:
But the latency, LATENCY!:
Scratching my head and digging into the advanced graphs made me understand the problem. It seems the VM is issuing very large IOs, as you can see on the specific drive receiving the copied data:
-> around 505 KB per IO
I understood that by looking at the normalized IOPS (which recalculates the numbers as if every IO were 32 KB):
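To make the normalization concrete, here is a quick sketch (the 2,000 actual IOPS figure is just an assumed round number for illustration, not a reading from my graphs):

```python
# vSAN "normalized IOPS" counts every IO in 32 KB units, so large IOs
# weigh far more than their raw IOPS suggest.

NORMALIZATION_KB = 32
io_size_kb = 505                          # average IO size observed on the target disk

weight = io_size_kb / NORMALIZATION_KB    # ~15.8 normalized IOPS per actual IO
print(f"One {io_size_kb} KB IO ~= {weight:.1f} normalized IOPS")

actual_iops = 2_000                       # assumed round number for illustration
print(f"{actual_iops} actual IOPS -> ~{actual_iops * weight:,.0f} normalized IOPS")
```

So a copy job that looks modest in raw IOPS can weigh roughly 16 times more once normalized, which fits what I saw in the graphs.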
My disappointment here is the fact that ONE VM can impact the whole cluster like this.
I know the answer is "IOPS limit", but this is not ideal:
First, the IOPS limit is per object, so per disk.
If you enforce a 3,000 IOPS limit per object, you may think each VM will not consume more than 3,000 IOPS, but that is WRONG.
If that VM has 5 disks, the limit is 3,000 per disk. If the VM goes crazy, it can potentially consume 15,000 IOPS, way beyond your intended 3,000 IOPS cap (see the sketch below).
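A tiny sketch of the arithmetic, with hypothetical numbers (the helper function is mine, not a vSAN API):

```python
# Per-object IOPS limits multiply by object count: a "VM limit" in vSAN
# is really a per-VMDK limit. Numbers below are hypothetical.

def worst_case_vm_iops(limit_per_object: int, disk_objects: int) -> int:
    """Ceiling a single VM can reach when every disk hits its own limit."""
    return limit_per_object * disk_objects

print(worst_case_vm_iops(3_000, 5))   # 15000, not 3000

# To truly cap a 5-disk VM at ~3,000 IOPS you would have to set
# 3000 // 5 = 600 per object, which then strangles single-disk workloads.
print(3_000 // 5)                     # 600
```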
Specific question regarding network
I noticed that during these stress periods, one specific counter raised a red flag for me, the TCP congestion:
This graph is from one specific host, but the others show the same thing.
There is not much documentation on the internet about TCP Send Zero Win with vSAN (hello future googlers!), nor about TCP Zero Win in general, apart from it being an indication that the host cannot process the packets fast enough.
It seems there is a bottleneck somewhere, but I can't see where.
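For what it's worth, my understanding is that a zero window is the receiver telling the sender its socket buffer is full, forcing the sender to stall. Here is the back-of-the-envelope reasoning I used; the 256 KB buffer is a hypothetical value, not something I measured on the hosts:

```python
# A TCP Zero Window means the receiver's buffer is full and the sender
# must pause, so per-flow throughput is capped at roughly buffer / RTT.

rtt_s = 0.0012                    # ~1.20 ms inter-site RTT from the health check
buffer_bytes = 256 * 1024         # hypothetical receive buffer size

ceiling_gbps = buffer_bytes * 8 / rtt_s / 1e9
print(f"Per-flow ceiling: {ceiling_gbps:.2f} Gbps")        # ~1.75 Gbps

# Buffer needed to keep one 25 Gbps uplink busy at this RTT (BDP):
bdp_kb = 25e9 / 8 * rtt_s / 1024
print(f"Bandwidth-delay product: {bdp_kb:.0f} KB")         # ~3662 KB
```

If that reasoning holds, zero windows point at the receiving host not draining its buffers fast enough (CPU, driver, or vSAN worker threads) rather than at the links themselves.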
Expected performance
Looking at the SSD specs:
I would expect a bit more performance out of an all-flash storage cluster.
I know there are overheads involved here (erasure coding, second-site replica, checksum, etc.).
But still, the global number of IOPS on the cluster seems rather low compared to the tech specs of just one SSD.
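To turn that gut feeling into a number, here is the naive ceiling calculation I keep coming back to; a sketch only, with a placeholder per-SSD figure (replace it with the real datasheet value) and the worst-case RAID-5 amplification from the policy sketch above:

```python
# Naive frontend write ceiling per site: raw backend IOPS of the capacity
# tier divided by the RAID-5 read-modify-write amplification. Each site
# absorbs its own mirror copy, so the per-site ceiling is the cluster ceiling.

data_disks_per_site = 4 * 2 * 7          # 56 capacity SSDs
ssd_write_iops = 30_000                  # placeholder, NOT the PX05SR datasheet value
raid5_amplification = 4                  # 2 reads + 2 writes per guest write

ceiling = data_disks_per_site * ssd_write_iops / raid5_amplification
print(f"Naive write ceiling: {ceiling:,.0f} IOPS")   # 420,000 with these inputs

# Caveat: writes land on the 8 cache SSDs per site first, so in practice
# the write-buffer devices, not the capacity tier, may set the real limit.
```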
To be continued... (max 20 images per post)