VMware Cloud Community
lesdotcom
Contributor

Disk Loss Causes Read Latency Spike

Earlier this week we had a disk failure on our Compellent SAN, at which point the disk failed over to the hot spare. Several, but not all, of our ESXi hosts experienced a huge spike in read latency, after which the management services in ESXi stopped. I noted several "Failed write command to write-quiesced partition" errors in the log. Rebooting the hosts restarted the management services and stopped the errors. Our SAN administrator opened a ticket with Compellent, who noted that our current firmware version does not play well with VAAI and that we should either update it or disable VAAI. I opened a ticket with VMware, who suggested doing the same and said that VMware logs don't have much insight beyond the HBA, so determining root cause probably isn't possible.

So, here are my questions:

Is there a logical explanation as to why only some of my hosts would experience the latency spike? I would note that the affected hosts were the busiest (highest IOPS) at the time.

Is there anywhere else I can look to investigate this latency spike to determine root cause?

Given what Compellent said, does it sound reasonable to disable VAAI until we can get the SAN firmware upgraded?
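
For context, here is a rough sketch of what disabling VAAI would involve on each host. It assumes the three standard ESXi advanced settings that control the VAAI block primitives and the usual `esxcli system settings advanced set` syntax; the helper name `vaai_commands` is made up for illustration, and the script only prints the commands rather than running them.

```python
# Sketch only: builds the esxcli command lines that toggle the VAAI
# block primitives on an ESXi host. Nothing is executed here.

# Assumed mapping of VAAI primitive -> ESXi advanced setting.
VAAI_OPTIONS = {
    "Full Copy (XCOPY)": "/DataMover/HardwareAcceleratedMove",
    "Block Zeroing (WRITE SAME)": "/DataMover/HardwareAcceleratedInit",
    "Atomic Test & Set (ATS) locking": "/VMFS3/HardwareAcceleratedLocking",
}


def vaai_commands(enable):
    """Return one esxcli command per VAAI primitive (hypothetical helper)."""
    value = 1 if enable else 0
    return [
        f"esxcli system settings advanced set --int-value {value} --option {option}"
        for option in VAAI_OPTIONS.values()
    ]


if __name__ == "__main__":
    # Print the commands that would disable VAAI on a host.
    for command in vaai_commands(enable=False):
        print(command)
```

Calling `vaai_commands(enable=True)` gives the matching commands to turn the primitives back on after the firmware upgrade. Whether disabling ATS locking is safe for a given datastore is something I would want confirmed with the vendor first.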

Thanks.
