VMware Cloud Community
avasu
Contributor

ATS Miscompare seen during heavy load on storage array.

We are currently running ESXi 6.7 Update 1. The hosts flood the logs with ATS miscompare messages whenever the storage array is under heavy load:

ScsiDeviceIO: 3082: Cmd(0x459a60604580) 0x89, CmdSN 0x3cc83 from world 2171298 to dev x failed H:0x0 D:0x2 P:0x0 Valid sense data: 0xe 0x1d 0x0

I have read this article https://cormachogan.com/2017/08/24/ats-miscompare-revisited-vsphere-6-5/ and the related VMware Knowledge Base article to understand more about why miscompares happen. What I understand is this (and I might be completely wrong):

1. ATS compares an in-memory value held by the host against an on-disk value on the storage array. From what I have read, I think the value being written is probably a timestamp. We see miscompare messages if there is a mismatch between these two values (a rough sketch of this compare-and-write step follows this list).

2. Each ESXi host has its own heartbeat region, which it has to update every 3 seconds to maintain its locks on a specific region of the volume or disk. So two hosts can never update the same heartbeat region, though they can observe each other's locks, break them, and so on.
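
To make sure I have the mechanism straight, here is a minimal Python sketch of how I picture the array-side compare-and-write. Everything here (HeartbeatSlot, compare_and_write, the 16-byte image) is invented for illustration and is not how ESXi or the array actually implements ATS:

# Rough model of SCSI ATS (COMPARE AND WRITE) against one heartbeat slot.
# All names and sizes are made up for illustration only.
import threading
import time

class Miscompare(Exception):
    """Raised when the on-disk value no longer matches the expected image."""

class HeartbeatSlot:
    def __init__(self, value=b"\x00" * 16):
        self._value = value            # the "on-disk" heartbeat image
        self._lock = threading.Lock()  # the array serializes ATS per block

    def compare_and_write(self, expected, new):
        # The test and the write happen as one atomic step on the array.
        with self._lock:
            if self._value != expected:
                raise Miscompare(self._value)
            self._value = new

def renew_heartbeat(slot, last_image):
    # The host builds a fresh image (say, a timestamp) and swaps it in,
    # but only if the on-disk image is still the one it wrote last time.
    new_image = time.time_ns().to_bytes(16, "big")
    slot.compare_and_write(expected=last_image, new=new_image)
    return new_image

slot = HeartbeatSlot()
image = renew_heartbeat(slot, b"\x00" * 16)   # first renewal succeeds
image = renew_heartbeat(slot, image)          # so does the next one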

My question is this: how can a heavily loaded array lead to a miscompare? These are the scenarios I can think of:

1. If the ATS operation times out, the host knows the operation timed out. It will retry with the same ATS image and should ultimately succeed. If the operation actually succeeded on the storage array but the host timed out, the next ATS would be a false miscompare, and the operation would still eventually succeed (the sketch after this post walks through this case).

2. If the host timed out and the operation had not succeeded on the storage array, then the retry with the same (timed-out) test pattern would simply succeed.

I am not able to come up with a situation where we would see a genuine MISCOMPARE. To be honest, after thinking out loud, I don't think I can come up with any scenario where we would see one. Can anyone explain the exact purpose of a MISCOMPARE, or point me to any documents that explain it further?
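
For reference, this is the false-miscompare case from scenario 1 as I picture it, written as a toy Python timeline. The timeout handling, image names and delay are invented for illustration, not ESXi internals:

# Toy timeline for the "false miscompare" in scenario 1.
import time

on_disk = b"HB-image-0001"          # current heartbeat image on the array

def array_compare_and_write(expected, new, delay_s=0.0):
    # Array-side ATS: atomic test-and-set, optionally slowed by heavy load.
    global on_disk
    time.sleep(delay_s)             # the array is busy, so the reply is late
    if on_disk != expected:
        return "MISCOMPARE"
    on_disk = new
    return "OK"

# 1. The host issues the ATS but gives up before the (successful) reply arrives.
status = array_compare_and_write(b"HB-image-0001", b"HB-image-0002", delay_s=0.2)
host_saw_timeout = True             # the host aborted the command on its side
assert status == "OK"               # ...yet the array did apply the write

# 2. The host retries with the image it believes is still on disk.
retry = array_compare_and_write(b"HB-image-0001", b"HB-image-0002")
print(retry)                        # -> MISCOMPARE, even though nothing is broken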

daphnissov
Immortal

This is probably due to a defect addressed in 6.5 U2 p2 documented here. Regardless of whether this is the case, I'd strongly advise an update due to the sheer number of other issues fixed.

avasu
Contributor

I thought it might be this, but we are using version 6.7 update 1.

daphnissov
Immortal

Open a support case on this.

continuum
Immortal

> 2. Each ESXi host has its own heartbeat region, which it has to update every 3 seconds to maintain its locks
> on a specific region of the volume or disk. So two hosts can never update the
> same heartbeat region, though they can observe each other's locks, break them, and so on.
Consider a datastore with some large VMDKs in thin provisioned mode, as they are typically used these days.
Thin provisioned VMDKs under heavy load may need to update their allocation on the VMFS volume far more often than every 3 seconds.
With highly fragmented VMDKs, often split into 100,000 or more fragments, I am not surprised at all that the locks can't be updated in time.
If this becomes a problem, I highly recommend reducing the amount of thin provisioning you use.
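
Just to illustrate the idea, here is a back-of-the-envelope Python sketch. The queue model, latencies and names are all made up, not VMFS internals; the point is only that metadata I/O queued ahead of the heartbeat update can easily push it past the 3 second window on a busy array:

# Why a busy array can starve the 3-second heartbeat renewal (toy model).
import random

HEARTBEAT_INTERVAL_S = 3.0

def heartbeat_service_time(queue_depth, per_io_latency_s):
    # Thin-VMDK growth generates extra VMFS metadata updates that sit in the
    # same queue as the heartbeat renewal, so the renewal waits behind them.
    return sum(per_io_latency_s * random.uniform(0.5, 1.5)
               for _ in range(queue_depth))

for depth in (8, 64, 256):
    delay = heartbeat_service_time(depth, per_io_latency_s=0.02)
    late = delay > HEARTBEAT_INTERVAL_S
    print(f"queue depth {depth:4d}: heartbeat serviced after {delay:5.2f}s"
          f" -> {'misses the 3s window' if late else 'on time'}")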


________________________________________________
Do you need support with a VMFS recovery problem? Send a message via Skype: "sanbarrow"
I do not support Workstation 16 at this time ...
