8 Replies Latest reply on May 5, 2020 8:30 AM by Sharantyr3

    vsan all flash 100% write performances

    Sharantyr3 Enthusiast

      Hello there,

       

      In parallel with this thread, I am going to open an SR to get advice from VMware techs, so this is not a "I have a problem, please help" thread.

      Instead, I would just like to see the numbers from your vSAN clusters, compare them with my own, and see whether I actually have a tuning problem somewhere.

       

      Also, sorry for my poor English.

       

      I will break this thread into the following parts:

       

      • Our setup
      • Global performance disappointment
      • Specific question regarding network
      • Expected performance
      • Specific case study

       

       

      Our Setup

       

      Our setup is a stretched cluster, 4+4+1, all flash.

      Each ESXi host has 2 disk groups.

      Each disk group is 7 capacity disks (Toshiba PX05SR, 3.84 TB, read intensive, SAS) + 1 cache disk (Toshiba PX05SM, 800 GB, write intensive, SAS).

      Per site, you can see the cluster as 56 capacity SSDs (4 hosts * 2 disk groups * 7 disks) providing performance for data and 8 cache SSDs (4 * 2 * 1) handling write caching.

      All of this is replicated to the other site (stretched cluster) with the same number of disks.

       

      The vSAN network is isolated on separate switches, 4 * Dell S5248F-ON (2 on each site).

      Within a site, the switches are linked together with 2 * 100 Gbps, and the cross-site link is 4 * 25 Gbps.

       

      Each ESXi host has 2 dedicated 25 Gbps vSAN ports connected to these switches, 1 port active, 1 port passive.

      The network card model is a QLogic FastLinQ 41262.

       

      Jumbo frames are enabled end to end. The inter-site link latency reported in the vSAN health check is around 1.20 ms.
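
      For what it's worth, here is a rough sketch of how I validate the jumbo frames and the cross-site latency from an ESXi shell (vmk2 and the IP below are placeholders for your vSAN vmkernel port and a peer host's vSAN IP):

      # Check that 9000-byte frames pass end to end without fragmentation (sketch).
      # vmk2 and 10.0.0.12 are placeholders for the vSAN vmkernel interface and a peer's vSAN IP.
      vmkping -I vmk2 -d -s 8972 10.0.0.12   # -d = don't fragment, 8972 = 9000 minus IP/ICMP headers
      # Run it again towards a host on the remote site to also see the ~1.2 ms cross-site RTT.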

       

      All VMs have PFTT=1 and SFTT=1 and erasure coding selected.

       

       

      Global performance disappointment

       

      I noticed, especially at night, that latency on the vSAN cluster sometimes went very high, and by high I mean HIGH:

       

      The IOPS during this time were not that high:

      Neither was the throughput:

      The congestion graph is OK, but outstanding IOs are rising:

       

      The backend seems OK:

       

      After some digging, I found out this global pressure was mostly caused by ONE specific VM.

      This VM was running a cron job copying files from disk 1 to disk 2.

      At first sight, there are not that many IOPS:

       

      But latency, LATENCY ! :

       

      Scratching my head and digging into the advanced graphs made me understand the problem. It seems the VM is issuing very large IOs, as you can see on the specific drive receiving the copied data:

      -> around 505 KB per IO

      I figured that out by looking at normalized IOPS (which recalculates the numbers as if every IO were 32 KB, so a single 505 KB IO counts as roughly 16 normalized IOPS):

       

      My disappointment here is the fact that ONE VM can impact the whole cluster like this.

      I know the answer is "IOPS limit", but this is not ideal:

      First, the IOPS limit is per object, so per disk (VMDK).

      If you enforce a 3,000 IOPS limit per object, you may think each VM will not consume more than 3,000 IOPS, but that is WRONG.

      If that VM has 5 disks, the limit is 3,000 per disk. If the VM goes crazy, it can potentially consume 15,000 IOPS, way beyond your intended 3,000 IOPS limit.
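
      Just to make the per-object point explicit, a trivial sketch of the worst case (using the example numbers above):

      # IOPS limits in the storage policy apply per object (per VMDK), not per VM (sketch).
      iops_limit_per_object=3000
      vmdk_count=5
      echo $(( iops_limit_per_object * vmdk_count ))   # 15000 IOPS worst case for a single VM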

       

      Specific question regarding network

      I noticed that during these stress periods, one specific counter raised a red flag for me, TCP congestion:

      This graph is from one specific host, but the others show the same thing.

      There is not much documentation on the internet about "TCP send Zero Win" with vSAN (hello future googlers!), nor about TCP zero window in general, apart from it being an indication that the receiving host cannot process packets fast enough.

      It seems like there is a bottleneck somewhere, but I can't see where.
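
      The only idea I have for digging further is to capture on the vSAN vmkernel port and look for zero-window segments in Wireshark (a sketch; vmk2 and the output path are placeholders, and I have not confirmed this pinpoints the bottleneck):

      # Capture vSAN traffic on the vmkernel port (stop with Ctrl-C, keep the capture short).
      pktcap-uw --vmk vmk2 -o /tmp/vsan-vmk2.pcap
      # Then open the pcap in Wireshark and filter on: tcp.analysis.zero_window
      # to see which side is advertising a zero receive window.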

       

       

      Expected performance

       

      Looking at the SSD specs :

       

      https://www.dellemc.com/content/dam/uwaem/production-design-assets/en/Storage/science-of-storage/collaterals/Dell_PX05_b…

       

      I would expect a bit more performance out of an all-flash cluster storage system.

      I know there are costs involved here (erasure coding, second-site replica, checksum, etc.).

      But still, the number of IOPS globally on the cluster seems rather low compared to the tech specs of just one SSD.

       

       

       

      To be continued ... max 20 images / post

        • 1. Re: vsan all flash 100% write performances
          Sharantyr3 Enthusiast

          Specific case study

           

          Would you guys be kind enough to show me some of your numbers?

          I use a Windows test VM (I also tried Linux) with paravirtual disks, PFTT=1, SFTT=1, RAID 5, no IOPS limit.

          The test VM has 1 system disk, 1 "source" data disk, and 1 "destination" data disk.

          I copy a set of 36 files totaling 90 GB.

           

          When I do a copy from the source disk to the destination disk, here are my numbers on the destination disk specifically:

           

           

          When I apply a 4,000 IOPS limit, with the same data protection:

           

           

           

          I also tried PFTT=0, SFTT=1, erasure coding, no IOPS limit, to check without the stretched cluster:

           

           

          Still, I'm not very impressed by these numbers.

           

          Tests were done during production hours, but with not much other disk activity:

           

           

          So, what do you guys think about all this?

          Am I right in thinking there may be a configuration / tuning issue here, or is this what to expect given the number of disk groups and disks I currently have?

           

          Thanks for your input, and don't hesitate to share your numbers too!

          • 2. Re: vsan all flash 100% write performances
            seamusobr1 Enthusiast

            Quite a lot of info but I will do my best

             

            I assume you have tried PFTT=1 with RAID 1; I would also try it with SFTT=0.

            Erasure coding roughly adds a 40% increase in read/write amplification, depending on your circumstances.

             

            Also, what stats are you getting on your vmnics?

            esxcli network nic stats get -n vmnicx

             

            See if you are getting dropped packets or receive errors.

             

            NIC statistics for vmnic2

               Packets received: 0

               Packets sent: 0

               Bytes received: 0

               Bytes sent: 0

               Receive packets dropped: 0

               Transmit packets dropped: 0

               Multicast packets received: 0

               Broadcast packets received: 0

               Multicast packets sent: 0

               Broadcast packets sent: 0

               Total receive errors: 0

               Receive length errors: 0

               Receive over errors: 0

               Receive CRC errors: 0

               Receive frame errors: 0

               Receive FIFO errors: 0

               Receive missed errors: 0

               Total transmit errors: 0

               Transmit aborted errors: 0

               Transmit carrier errors: 0

               Transmit FIFO errors: 0

               Transmit heartbeat errors: 0

               Transmit window errors: 0
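
            A quick way to sweep every uplink at once (just a sketch, assuming the default esxcli output layout with two header lines):

            # Loop over every uplink and show only the drop/error counters (sketch).
            for nic in $(esxcli network nic list | awk 'NR>2 {print $1}'); do
              echo "== $nic =="
              esxcli network nic stats get -n "$nic" | grep -Ei 'dropped|errors'
            done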

            • 3. Re: vsan all flash 100% write performances
              Sharantyr3 Enthusiast

              Hello there =)

               

              Yes, I may have overextended a little...

              No errors on the NICs.

               

              Just to give the discussion a fresh start:

              Is there any website referencing various vSAN deployments and their performance?

              I'd like to compare with what I got.

               

              Also, if you are running all flash, could you test copying a bunch of big files (~5 GB each, say 100 GB total) and show me the IOPS, normalized IOPS, latencies and throughput you get on the source and destination disks?

              Please also specify whether you are stretched or not, your PFTT and SFTT levels, erasure coding or not, and how many disk groups, hosts and disks you have.
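
              If juggling real files is a pain, I think something like this fio job should approximate the copy workload on a Linux test VM (just a sketch; the path, size and queue depth are my guesses):

              # Large-block sequential write against the "destination" disk, similar to a big file copy (sketch).
              fio --name=bigcopy --filename=/mnt/dest/fio-test.bin \
                  --rw=write --bs=512k --size=90G \
                  --ioengine=libaio --iodepth=8 --direct=1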

              • 4. Re: vsan all flash 100% write performances
                seamusobr1 Enthusiast

                It might be worth trawling through all the design guides on storagehub, as throughput can be impacted even by the type of switches used. Even a small buffer size on the switch port can impact performance.

                We had the MTU set to 9000 and we use stretched clusters, but we discovered that the network admin had left it at 1500 on the core, so performance on our stretched clusters was impacted.

                We have about 26 stretched clusters with 12 hosts on each site. Each host has 4 disk groups (all flash) with five capacity drives in each.

                Storage policy settings depend on the needs of an application, so for Postgres boxes we would do RAID 1 for the database and logs.

                I would strongly recommend that you use HCIBench and look at the performance you get with different storage policies.

                • 5. Re: vsan all flash 100% write performances
                  depping Champion
                  User Moderators, VMware Employees

                  You typically won't find performance reports where the testing is based on a file copy from VM to VM, to be honest. If you want to understand the capabilities of your system, use HCIBench.

                  • 6. Re: vsan all flash 100% write performances
                    depping Champion
                    VMware Employees, User Moderators

                    Also note that you are doing RAID-5; there is a significant write penalty with RAID-5 compared to RAID-1, for instance. You could indeed limit that single VM from an IOPS perspective, but do note that this simply means the copy process will take longer.

                    • 7. Re: vsan all flash 100% write performances
                      bmrkmr Novice

                      Actually, I did some similar tests about 2 years ago to find out what could be done for an individual VM requiring a lot of write IO, which seems to be your specific case...

                      With a somewhat similar setup, I can confirm that the figures were quite similar (I remember throughput of about 150 MB/s with latency in the range of 20-30 ms).

                      The first takeaway at that time was that erasure coding is a write performance killer. Second, you *may* be able to get slightly better figures with increased stripe width, and you will get considerably better results only with reduced fault tolerance (with respect to heavy write IO on few VM disks).

                      • 8. Re: vsan all flash 100% write performances
                        Sharantyr3 Enthusiast

                        Hello!

                        seamusobr1 wrote:
                        It might be worth trawling through all the design guides on storagehub, as throughput can be impacted even by the type of switches used. Even a small buffer size on the switch port can impact performance.

                        We had the MTU set to 9000 and we use stretched clusters, but we discovered that the network admin had left it at 1500 on the core, so performance on our stretched clusters was impacted.

                        We have about 26 stretched clusters with 12 hosts on each site. Each host has 4 disk groups (all flash) with five capacity drives in each.

                        Storage policy settings depend on the needs of an application, so for Postgres boxes we would do RAID 1 for the database and logs.

                        I would strongly recommend that you use HCIBench and look at the performance you get with different storage policies.

                        The MTU is OK end to end (otherwise vSAN health would complain).

                        What do you think of the buffer size of my switch model, the Dell S5248F-ON? I only found 32 MB listed as "packet buffer" in the spec sheet, with no info on per-port buffer size. But I'm not a network guy, and I couldn't find the actual buffer status in either the documentation or on the switches themselves (% filled? is that even a metric for switches?).

                        bmrkmr wrote:
                        Actually, I did some similar tests about 2 years ago to find out what could be done for an individual VM requiring a lot of write IO, which seems to be your specific case...

                        With a somewhat similar setup, I can confirm that the figures were quite similar (I remember throughput of about 150 MB/s with latency in the range of 20-30 ms).

                        The first takeaway at that time was that erasure coding is a write performance killer. Second, you *may* be able to get slightly better figures with increased stripe width, and you will get considerably better results only with reduced fault tolerance (with respect to heavy write IO on few VM disks).

                        Thanks for the information. My concern with your answer is that I see much higher latencies than 20-30 ms.

                        I also tested without site replication, with RAID 1; the numbers are of course better, but not "wow".

                         

                        depping wrote:
                        You typically won't find performance reports where the testing is based on a file copy from VM to VM, to be honest. If you want to understand the capabilities of your system, use HCIBench.

                        I did benchmark the whole cluster during pre-production using I/O Analyzer (a VMware Fling) and got good results (I guess):

                        The test run was 1 IO worker per ESXi host, 8 workers in total.

                        All VMs were configured with PFTT=1, SFTT=1, erasure coding.

                        IO profile: 70% read / 30% write, 80% random / 20% sequential, 4k blocks, 5-minute test run:

                        As you can see, the IOPS numbers are good.
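
                        For reference, this is roughly the same profile expressed as an fio job, in case anyone wants to compare on a plain Linux VM (a sketch; the file path and size are placeholders):

                        # ~ the I/O Analyzer profile: 70/30 read/write, 80% random, 4k blocks, 5 minutes (sketch).
                        fio --name=mixed4k --filename=/mnt/test/fio-test.bin --size=20G \
                            --rw=randrw --rwmixread=70 --percentage_random=80 --bs=4k \
                            --ioengine=libaio --iodepth=32 --direct=1 \
                            --runtime=300 --time_based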

                        depping wrote:
                        Also note that you are doing RAID-5; there is a significant write penalty with RAID-5 compared to RAID-1, for instance. You could indeed limit that single VM from an IOPS perspective, but do note that this simply means the copy process will take longer.

                        Sorry, but you are stating the obvious; I'm not a fresh newcomer to IT.

                        I do know there is a performance impact here because of the stretched cluster, because of RAID 5, etc.

                        What I am wondering is why one single VM can put such high pressure on vSAN.

                        I am also trying to get an idea of what this looks like elsewhere.

                         

                        If you want to try a fun thing, run cat /dev/urandom > /to/some/file

                        You will get terrible performance, like this:

                         

                        What I suppose, and I think it may be the root cause of my problem, is that the IO size is huge.

                        Looking at this graph (normalized IOPS is 1983, IOPS is 53), I conclude that the average IO size in this test is roughly 1983 * 32 KB / 53 ≈ 1.2 MB per IO.
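
                        To check that it really is the IO size (and not /dev/urandom being CPU-bound), I plan to compare the same amount of data written with large and small blocks, something like this (a sketch; the path is a placeholder):

                        # Same 10 GB written with 1 MB blocks vs 32 KB blocks, bypassing the guest page cache (sketch).
                        dd if=/dev/zero of=/mnt/dest/ddtest bs=1M  count=10240  oflag=direct
                        dd if=/dev/zero of=/mnt/dest/ddtest bs=32k count=327680 oflag=direct
                        # If the large-block run shows much worse latency, the IO size is the culprit.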

                         

                        I've seen poor performance on VMs doing large IOs (>= 512 KB) before, but with this test it is really visible.

                        Do you guys also see poor performance with big IO sizes?

                         

                        Also, support has seen some warnings about high latencies on the vSAN uplinks, so I may actually have a problem somewhere. I need to align my driver version with a supported firmware version first.
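
                        In case it helps anyone else, this is how I plan to check the current driver and firmware combination on the FastLinQ cards (a sketch; vmnic2 is a placeholder, and I am assuming the QLogic driver VIB name contains "qed"):

                        # Show driver name, driver version and firmware version for one uplink.
                        esxcli network nic get -n vmnic2 | grep -A5 -i "driver info"
                        # List the installed driver VIB version (assuming the FastLinQ VIB name contains "qed").
                        esxcli software vib list | grep -i qed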

                         

                        I will let you know if I find something useful.