4 Replies Latest reply on May 26, 2020 1:36 PM by TheBobkin

    High latency on vsan stretched cluster

    seamusobr1 Enthusiast

      Good afternoon


      We are running a stretched cluster with 16 nodes and 10 Gbps uplinks

      The version is 6.5 update 3

      We had an alarm raised because some of the VMs experienced read/write latencies of about 800ms

      I think I have traced the issue back to a disk group

      All of the disks in the disk group have been showing results like those below


      Not seeing any issues with cache destage rates

      Does anyone know why there would be high physical/firmware layer latency on all disks in the group?


      Thanks in advance

        • 1. Re: High latency on vsan stretched cluster
          TheBobkin Virtuoso
          VMware Employees, vExpert

          Hello Seamus,


          Is there only one Disk-Group on that host? If so it could be an issue on the controller.

          If there are multiple Disk-Groups on the host, then it is more likely an issue with the Cache-tier or, if dedupe is enabled, potentially with one Capacity-tier device.

          What do you see in vmkernel.log and vobd.log at the time of the latency occurring?
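A minimal sketch of how the log lines around the latency event could be pulled out for review. It assumes each log entry starts with a UTC timestamp of the form `2020-05-26T12:00:01.123Z`; the function names, the 5-minute margin, and the placeholder sample lines are illustrative, not taken from this thread.

```python
from datetime import datetime, timedelta, timezone


def parse_timestamp(line):
    """Return the leading UTC timestamp of a log line, or None if there is none.

    Assumption: entries begin with a token like 2020-05-26T12:00:01.123Z.
    """
    token = line.split(" ", 1)[0]
    try:
        parsed = datetime.strptime(token, "%Y-%m-%dT%H:%M:%S.%fZ")
    except ValueError:
        return None
    return parsed.replace(tzinfo=timezone.utc)


def lines_in_window(lines, center, margin_minutes=5):
    """Keep only the lines stamped within +/- margin_minutes of center."""
    lo = center - timedelta(minutes=margin_minutes)
    hi = center + timedelta(minutes=margin_minutes)
    selected = []
    for line in lines:
        ts = parse_timestamp(line)
        # Lines without a parseable timestamp are skipped.
        if ts is not None and lo <= ts <= hi:
            selected.append(line.rstrip("\n"))
    return selected


if __name__ == "__main__":
    # Placeholder lines, not real ESXi log output.
    sample = [
        "2020-05-26T11:50:00.000Z placeholder entry before the event",
        "2020-05-26T12:00:01.123Z placeholder entry during the event",
        "2020-05-26T12:10:00.000Z placeholder entry after the event",
    ]
    event_time = datetime(2020, 5, 26, 12, 0, 0, tzinfo=timezone.utc)
    for entry in lines_in_window(sample, event_time):
        print(entry)
```

The same function could be fed the lines of a copy of vmkernel.log and vobd.log from the affected host, with `event_time` set to the time the latency alarm fired (converted to UTC if the alarm shows local time).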



          • 2. Re: High latency on vsan stretched cluster
            seamusobr1 Enthusiast

            Thanks I will take a look

            • 3. Re: High latency on vsan stretched cluster
              Zifu_invzion Novice



              As TheBobkin said, in vSAN 6.5 dedupe could be a reason for the high latency. If you are able to put the host into maintenance mode and re-create the disk group, that might resolve the problem.


              • 4. Re: High latency on vsan stretched cluster
                TheBobkin Virtuoso
                vExpert, VMware Employees

                Hello Zifu_invzion,


                I don't see how you might get a correlation between dedupe and such issues. For a start, the issue impacted multiple Disk-Groups and vSAN dedupes only per Disk-Group. If you mean device latency/strain from the extra load that enabling/disabling dedupe would cause (as it basically has to read and re-write all data), this would also be ruled out by the fact that the graphs indicate the issue occurred over the course of a few minutes, not over a prolonged duration (and OP likely would have mentioned it if they were performing such activities).

                My assumption would still be a controller issue or potentially a knock-on issue on the controller caused by some misbehaving attached device.

                I also don't really see how wiping and recreating a Disk-Group would help in any way. The issue doesn't appear to have been prolonged and thus was likely dealt with by automated functions as opposed to human intervention.