Good afternoon
We are running a stretched cluster with 16 nodes and 10 Gbps uplinks
The version is 6.5 update 3
We had an alarm raised because some of the VMs experienced read/write latencies of about 800 ms
I think I have traced the issue back to a disk group
All of the disks in the disk group have been showing results like those below
Not seeing any issues with cache destage rates
Does anyone know why there would be high physical/firmware layer latency on all disks in the group?
Thanks in advance
Hello Seamus,
Is there only one Disk-Group on that host? If so, it could be an issue on the controller.
If there are multiple Disk-Groups on the host, then it is more likely an issue with the Cache-tier or, if dedupe is enabled, potentially one Capacity-tier device.
What do you see in vmkernel.log and vobd.log at the time of the latency occurring?
Bob
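For anyone following along, here is a rough sketch of one way to pull only the log lines around the latency spike out of vmkernel.log and vobd.log once they have been copied off the host. It assumes each log entry starts with a UTC timestamp of the form 2019-08-01T12:58:30.123Z; the function name lines_near and the 5-minute default window are made up for illustration and are not part of any VMware tooling.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumption: each log entry starts with an ISO-8601 UTC timestamp,
# e.g. "2019-08-01T12:58:30.123Z ...". Lines without one are skipped.
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.\d+)?Z")


def lines_near(lines, event_time, window_minutes=5):
    """Return the log lines whose timestamp is within +/- window_minutes
    of event_time (a timezone-aware UTC datetime).

    lines_near is a hypothetical helper name, not a VMware utility.
    """
    lo = event_time - timedelta(minutes=window_minutes)
    hi = event_time + timedelta(minutes=window_minutes)
    selected = []
    for line in lines:
        match = TS_RE.match(line)
        if not match:
            # Continuation lines or unexpected formats carry no timestamp.
            continue
        ts = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S")
        ts = ts.replace(tzinfo=timezone.utc)
        if lo <= ts <= hi:
            selected.append(line.rstrip("\n"))
    return selected
```

To use it, open the copied log file, pass the file object as lines, and pass the time of the alarm (converted to UTC) as event_time. Running it against both logs with the same window makes it easier to line up controller or device messages with the latency graph.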
Thanks I will take a look
Hi,
As TheBobkin mentioned, in vSAN 6.5 dedupe could be a reason for high latency. If you are able to put the host in maintenance mode and re-create the disk group, that might resolve the problem.
BR!
Hello Zifu_invzion,
I don't see how you would get a correlation between dedupe and such issues - for a start, the issue impacted multiple Disk-Groups, and vSAN dedupes only per Disk-Group. If you mean device latency/strain from the extra load that enabling/disabling dedupe causes (as it basically has to read and re-write all data), this is also ruled out by the fact that the graphs indicate the issue occurred over the course of a few minutes, not a prolonged duration (and the OP would likely have mentioned it if they were performing such activities).
My assumption would still be a controller issue or potentially a knock-on issue on the controller caused by some misbehaving attached device.
I also don't really see how wiping and recreating a Disk-Group would help in any way - the issue doesn't appear to have been prolonged and thus was likely dealt with by automated functions as opposed to human intervention.
Bob