2 Replies Latest reply on Jun 3, 2019 12:16 PM by ManivelR

    VSAN congestion and latency query

    ManivelR Enthusiast

      Hi Team,

       

      I have a doubt about vSAN congestion; from the KB article, I'm not able to understand the vSAN behaviour completely.

       

      My setup is as follows:-

      6 ESXi hosts running 6.7.0 U1. Our storage policy is FTT=2 (RAID 6).

      Each ESXi host has 2 disk groups. The 1st disk group is 1 x 1 TB disk (cache) and 3 x 2 TB disks (capacity), and the 2nd disk group has the same layout.

      Each ESXi host's capacity is 12 TB, so the total capacity of all 6 ESXi hosts is 72 TB (all flash, with dedupe/compression enabled).

       

      On the 1st ESXi host, we had a disk failure (the 1 TB cache disk). To fix this issue, we removed the disk group and recreated a new disk group with the replaced 1 TB disk.

      Soon after creating the disk group, a component resync started in vSAN. It ran continuously for almost 2 days (for 2.4 TB of resync data), and in between (while the resync was still running) we received a "vSAN congestion" error. On checking this congestion (SSD congestion "239"), we found that it came from the same replaced 1 TB cache disk (on host 1).

       

      A case was logged with VMware. As per VMware's advice, we again removed the disk group and recreated a new one with the same disks.

      The resync started again, and after copying 1.5 TB, congestion occurred again with 880 GB left. The resync still continued, but at a low rate. There is a big VM of around 10 TB capacity, and the resync of its components was still going on.

       

      I decided to put the host into maintenance mode and removed the affected vSAN disk group. Now HV1 is in maintenance mode and the other 5 ESXi hosts are serving.

      Some objects are healthy and some other objects are in "Reduced availability - no rebuild".
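
      For reference, the object state can also be checked from RVC with the standard commands below (the cluster path shown is just a placeholder, not my real path):

      <cluster-path>> vsan.check_state .           (lists inaccessible and out-of-sync objects)
      <cluster-path>> vsan.obj_status_report .     (object health summary for the cluster)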

       

      My doubts:

      1) I saw that the VM latency was a little high on the 10 TB VM (both read and write latency). Is this because of the components resyncing at that time?

      2) Now, after putting HV1 into maintenance mode, I see some more latency from all the VMs. Why is it still occurring now? I'm not sure.

      3) I'm not sure why this component resync is taking such a long time. My vSAN network is 2 x 10 Gbps NICs.

      4) When there is congestion on any disk (for example, the 1 x 1 TB cache disk on host 1), will the latency of all other running VMs be a bit high (even with FTT=2)? Is this the default behavior?

       

      Please suggest .

       

      Note: The vSAN VM consumption in the VC GUI and the RVC reports listed below show different values. When I look at this latency through RVC, sometimes it shows high and at other times it is very low (acceptable latency).

       

      +-------------+-----------+-------------+---------------+

      | VM/Object   | IOPS      | Tput (KB/s) | Latency (ms)  |

      +-------------+-----------+-------------+---------------+

      | Client-fmNP | 0.5r/1.5w | 2.2r/14.3w  | 243.3r/796.2w |

      +-------------+-----------+-------------+---------------+

      (54b7317f-4e7c-4d81-ab16-4a579c63d7ca)/vms> vsan.vm_perf_stats Client-fmNP

      2019-05-27 12:58:26 +0000: Querying info about VMs ...

      2019-05-27 12:58:26 +0000: Querying vSAN objects used by the VMs ...

      2019-05-27 12:58:26 +0000: Fetching stats counters once ...

      2019-05-27 12:58:27 +0000: Sleeping for 20 seconds ...

      2019-05-27 12:58:47 +0000: Fetching stats counters again to compute averages ...

      2019-05-27 12:58:47 +0000: Got all data, computing table

      +-------------+-----------+-------------+--------------+

      | VM/Object   | IOPS      | Tput (KB/s) | Latency (ms) |

      +-------------+-----------+-------------+--------------+

      | Client-fmNP | 0.3r/2.5w | 1.4r/29.6w  | 1.5r/16.0w   |

      +-------------+-----------+-------------+--------------+

      (54b7317f-4e7c-4d81-ab16-4a579c63d7ca)/vms> vsan.vm_perf_stats Client-fmNP

      2019-05-27 12:59:00 +0000: Querying info about VMs ...

      2019-05-27 12:59:00 +0000: Querying vSAN objects used by the VMs ...

      2019-05-27 12:59:00 +0000: Fetching stats counters once ...

      2019-05-27 12:59:01 +0000: Sleeping for 20 seconds ...

      2019-05-27 12:59:21 +0000: Fetching stats counters again to compute averages ...

      2019-05-27 12:59:21 +0000: Got all data, computing table

      +-------------+-----------+-------------+--------------+

      | VM/Object   | IOPS      | Tput (KB/s) | Latency (ms) |

      +-------------+-----------+-------------+--------------+

      | Client-fmNP | 0.3r/3.0w | 1.4r/31.7w  | 1.0r/20.8w   |

      +-------------+-----------+-------------+--------------+

       

      Thank you,

      Manivel RR

        • 1. Re: VSAN congestion and latency query
          aNinjaneer Novice

          What kind of congestion are you getting? Congestion can be perfectly normal in some cases. There are six different types of congestion you can have, and the type will help determine where the issue may be, if it's even an issue. To see the type of congestion, you can look at the performance metrics for that particular disk group and it will show the type.
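
          If it helps, RVC (the same tool used for the perf stats above) can show per-disk state to correlate with the congestion type you see in the performance metrics; the cluster path below is a placeholder:

          <cluster-path>> vsan.disks_stats .          (per-disk usage and health overview)
          <cluster-path>> vsan.vm_perf_stats <vm>     (per-VM IOPS/throughput/latency)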

           

          Understanding Congestion

           

          For example, if you're getting SSD congestion, it means your cache tier is able to ingest quicker than you are able to destage to the capacity tier. Log congestion typically means your cache drive is overloaded, and it seems to be more prevalent when deduplication and compression is enabled. If it is becoming an issue where your VMs are slowing down due to resyncing, you can use "resync throttling" to limit the rate at which the resync works.
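
          While a resync is running (throttled or not), its progress and the bytes left to sync can be watched from RVC; again, the cluster path is a placeholder:

          <cluster-path>> vsan.resync_dashboard .     (objects currently resyncing and bytes left)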

           

          Is this an all flash cluster, or hybrid? What specific drives are you using for cache and capacity? How much load is vSAN seeing on average without the resync? It sounds like you may just have a fairly slow configuration, and the resync is causing enough additional load that it is causing your congestion.

          • 2. Re: VSAN congestion and latency query
            ManivelR Enthusiast

            Thanks for the response.

             

            Mine is an all-flash cluster using Micron/ATA SSD disks. It's SSD congestion, and I tried the resync throttling option as well, but no luck.

             

            Finally, we logged a case with VMware. As per their response, it is caused by the RAID controller. We are in the process of replacing the cards.

             

            Thanks,

            Manivel R