VMware Cloud Community
ArturLorek
Contributor

ESXi 6.7u3 PSOD: Tried to unlock non-locked lock

Dear Community users;

I thought I would be able to find something relevant and sort this out myself, but I am getting nowhere and could not find any similar cases.
I would therefore greatly appreciate some help, suggestions, or brainstorming from people who undoubtedly know more than I do.

I have a single host running ESXi 6.7 Update 3 (no vCenter), which occasionally, and rather randomly, crashes with a PSOD and a lovely message: Tried to unlock non-locked lock.
The only way out is to shut it down and power it back up, as a regular reset does not bring the NVMe datastore back.

I suspect the NVMe drive, which is a consumer model (Integral M Series M.2 2280 1TB - Silicon Motion, Inc. SM2263EN/SM2263XT SSD). It runs courtesy of the old driver "VMW_bootbank_nvme_1.2.1.34-1vmw.670.0.0.8169922" and the trick of downgrading the 6.7u3 driver to an earlier version (a popular solution, brought to light by William Lam and a few others). So far I have not been able to confirm anything.
The drive was faultless for the first 8-9 months (the host was set up in March 2021), and then out of the blue the issues started around late December / early January 2022. According to the S.M.A.R.T. check, there are no bad sectors on the drive, and it is not overheating either: it normally operates at around 40 degrees C. The drive is claimed by HPP, with the standard queue depth of 1022.

The ZDUMP analysis shows that, some milliseconds before the crash, a scan operation on the NVMe is aborted, and from there the panic domino effect starts 😞
I used the VMware ESXi SCSI Sense Code Decoder to decode a warning that also comes up for that drive:
WARNING: HPP: HppThrottleLogForDevice:570: Error status H:0x5 D:0x22 P:0x0 Invalid sense data: 0x0 0x0 0x0.

It basically confirms that the operation was aborted, but gives no hint as to the underlying reason.
My observation is that it happens under heavier load... but not always: it also happened 2-3 times overnight, when the host was idle and the VMs were doing nothing.
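
For readers without the decoder at hand, the H:/D:/P: triplet in the warning above can be split and looked up with a few lines of Python. This is only a minimal sketch: the lookup tables are a partial subset of the host status values from VMware's published sense-code documentation and of the standard SCSI status bytes, and the names decode_status, HOST_STATUS, DEVICE_STATUS and PLUGIN_STATUS are made up for this illustration, not taken from ESXi.

```python
import re

# Partial tables only (assumption: values as listed in VMware's sense-code
# documentation and the SCSI status byte definitions). Unknown codes are
# reported as "unknown" rather than guessed.
HOST_STATUS = {
    0x0: "OK", 0x1: "NO_CONNECT", 0x2: "BUS_BUSY", 0x3: "TIMEOUT",
    0x4: "BAD_TARGET", 0x5: "ABORT", 0x7: "ERROR", 0x8: "RESET",
}
DEVICE_STATUS = {
    0x00: "GOOD", 0x02: "CHECK CONDITION", 0x08: "BUSY",
    0x18: "RESERVATION CONFLICT", 0x22: "COMMAND TERMINATED",
    0x28: "TASK SET FULL", 0x40: "TASK ABORTED",
}
PLUGIN_STATUS = {0x0: "GOOD"}

# Matches the "H:0x.. D:0x.. P:0x.." part of a vmkernel warning line.
STATUS_RE = re.compile(
    r"H:(0x[0-9a-fA-F]+)\s+D:(0x[0-9a-fA-F]+)\s+P:(0x[0-9a-fA-F]+)"
)


def _lookup(table, code):
    """Return the symbolic name for a code, or a marked-unknown string."""
    return table.get(code, f"unknown (0x{code:x})")


def decode_status(line):
    """Decode the host/device/plugin status triplet found in a log line.

    Returns a dict with the three decoded names, or None when the line
    contains no H:/D:/P: triplet.
    """
    match = STATUS_RE.search(line)
    if match is None:
        return None
    host, device, plugin = (int(group, 16) for group in match.groups())
    return {
        "host": _lookup(HOST_STATUS, host),
        "device": _lookup(DEVICE_STATUS, device),
        "plugin": _lookup(PLUGIN_STATUS, plugin),
    }


if __name__ == "__main__":
    warning = (
        "WARNING: HPP: HppThrottleLogForDevice:570: Error status "
        "H:0x5 D:0x22 P:0x0 Invalid sense data: 0x0 0x0 0x0."
    )
    print(decode_status(warning))
```

For the warning quoted above this yields host ABORT and device COMMAND TERMINATED, which matches what the decoder reported: the command was aborted, with no valid sense data to explain why.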

esxcli storage core device vaai status get

t10.NVMe____M_Series_NVMe_SSD_1T____________________20421041760099______00000001
VAAI Plugin Name:
ATS Status: unsupported
Clone Status: unsupported
Zero Status: supported
Delete Status: supported
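
If the same check has to be repeated across several devices, the listing above can be parsed mechanically. The following is a rough sketch, assuming the output keeps the shape shown here (one device identifier line followed by "Key: Value" lines); parse_vaai_status and unsupported_primitives are hypothetical helper names, not part of esxcli.

```python
def parse_vaai_status(text):
    """Parse 'esxcli storage core device vaai status get' style output.

    Assumes a device identifier on its own line, followed by 'Key: Value'
    lines (the value may be empty). Returns {device: {key: value}}.
    """
    devices = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        # Key/value lines contain ': ' or end with ':' (empty value).
        # Device names such as 'mpx.vmhba0:C0:T0:L0' match neither test.
        if ": " in line or line.endswith(":"):
            if current is None:
                continue  # key/value line before any device header
            key, _, value = line.partition(":")
            devices[current][key.strip()] = value.strip()
        else:
            current = line
            devices[current] = {}
    return devices


def unsupported_primitives(status):
    """List the '... Status' keys of one device reported as unsupported."""
    return [
        key for key, value in status.items()
        if key.endswith("Status") and value == "unsupported"
    ]


if __name__ == "__main__":
    sample = """\
t10.NVMe____M_Series_NVMe_SSD_1T____________________20421041760099______00000001
VAAI Plugin Name:
ATS Status: unsupported
Clone Status: unsupported
Zero Status: supported
Delete Status: supported
"""
    for device, status in parse_vaai_status(sample).items():
        print(device, unsupported_primitives(status))
```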

So I disabled ATS, suspecting it might have a bearing on the drive's operation, but that did not solve the issue and seemingly has no effect.

I am attaching an excerpt of the latest ZDUMP, covering the time from shortly before the PSOD up to when the dump is created, as well as a picture I took of one of the PSODs (they always show the same message and differ only in the error stack).

Would someone please have a minute to look at it and suggest something?
I would greatly appreciate your input.

 
