Solved: Re: High latency on vsan stretched cluster

seamusobr1 · ‎05-22-2020

Good afternoon

We are running a stretched cluster with 16 nodes and 10 Gbps uplinks

The version is 6.5 update 3

We had an alarm raised because some of the VMs experienced read/write latencies of about 800ms

I think I have traced the issue back to a disk group

All of the disks in the disk group have been showing results like those below

Not seeing any issues with cache destage rates

Does anyone know why there would be high physical/firmware layer latency on all disks in the group?

Thanks in advance

TheBobkin · ‎05-22-2020

Hello Seamus,

Is there only one Disk-Group on that host? If so it could be an issue on the controller.

If there are multiple Disk-Groups on the host then it is more likely an issue with the Cache-tier or if dedupe is enabled then potentially one Capacity-tier device.

What do you see in vmkernel.log and vobd.log at the time of the latency occurring?
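To narrow those logs down to the minutes around the alarm, something like the following could help. This is only a minimal sketch: it assumes each log line starts with an ISO 8601 UTC timestamp (as ESXi logs normally do), and the function name `lines_near` and the five-minute window are illustrative choices, not anything from vSAN tooling.

```python
import re
from datetime import datetime, timedelta

# Assumption: log lines begin with a timestamp such as 2020-05-22T13:45:01.123Z
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})")


def lines_near(lines, event_time, window_minutes=5):
    """Return lines whose leading timestamp falls within the window around event_time."""
    delta = timedelta(minutes=window_minutes)
    selected = []
    for line in lines:
        match = TS_RE.match(line)
        if not match:
            continue  # skip lines without a leading timestamp
        stamp = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S")
        if abs(stamp - event_time) <= delta:
            selected.append(line.rstrip("\n"))
    return selected
```

You would pass it an open copy of vmkernel.log or vobd.log and the UTC time of the latency spike as a `datetime`, then read through whatever it returns for device or controller errors.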

Bob

seamusobr1 · ‎05-22-2020

Thanks, I will take a look

Zifu_invzion · ‎05-26-2020

Hi,

As TheBobkin said, in vSAN 6.5 dedupe could be a reason for the high latency. If you are able to put the host in maintenance mode and re-create the disk group, that may resolve the problem.

BR!

TheBobkin · ‎05-26-2020

Hello Zifu_invzion,

I don't see how you might get a correlation between dedupe and such issues - for a start, the issue impacted multiple Disk-Groups and vSAN dedupes only per Disk-Group. If you mean device latency/strain from the extra load that enabling/disabling dedupe would cause (as it basically has to read and re-write all data), this would also be ruled out by the fact that the graphs indicate the issue occurred over the course of a few minutes, not a prolonged duration (and OP likely would have mentioned it if they were performing such activities).

My assumption would still be a controller issue or potentially a knock-on issue on the controller caused by some misbehaving attached device.

I also don't really see how wiping and recreating a Disk-Group would help in any way - the issue doesn't appear to have been prolonged and thus was likely dealt with by automated functions as opposed to human intervention.

Bob