VMware Cloud Community
markotsg80
Enthusiast

MD mechanical positioning error - ESXi not recognising SCSI sense codes - host crashing

We are running vSAN 6.2 on ESXi 6.0.

The hosts are DL380 G9 with P440 controllers (3x per host), 7 MD + 1 SSD per host.

We have had multiple disk failures so far.

In both instances, the host became unresponsive because ESXi does not recognise the SCSI sense code received for a 'mechanical positioning error' as a total hardware failure of the disk, and so does not mark the disk as PDL (permanent device loss). vSAN therefore keeps trying to reclaim the disk, which eventually causes the processes on the host to fail.

The host then becomes unmanageable and disconnects from vCenter.

In one instance we were told that a feature request needs to be raised for this?

Has anyone else experienced this?

6 Replies
TheBobkin
Champion

Hello markotsg80​,

What were the specific sense codes in these instances?

If the disk is spouting vendor-specific codes, then the best thing to do would be either to make a feature request so that ESXi reacts to these codes appropriately, or to engage your hardware vendor to use sense codes which ESXi/vSAN will respond to in the desired manner.

Bob

markotsg80
Enthusiast

I believe it's:

0x4 0x15 0x1

VMware ESXi SCSI Sense Code Decoder | Virten.net

Host Status: [0x0] OK
This status is returned when there is no error on the host side. This is when you will see if there is a status for a Device or Plugin. It is also when you will see "Valid sense data" instead of "Possible sense data".

Device Status: [0x2] CHECK_CONDITION
This status is returned when a command fails for a specific reason. When a CHECK CONDITION is received, the ESX storage stack will send out a SCSI command 0x3 (REQUEST SENSE) in order to get the SCSI sense data (Sense Key, Additional Sense Code, ASC Qualifier, and other bits). The sense data is listed after "Valid sense data" in the order of Sense Key, Additional Sense Code, and ASC Qualifier.

Plugin Status: [0x0] GOOD
No error. (ESXi 5.x / 6.x only)

Sense Key: [0x4] HARDWARE ERROR

Additional Sense Data: 15/01 MECHANICAL POSITIONING ERROR

We are also seeing a lot of sense codes in the vmkernel log file that are illegal requests.
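For anyone who wants to tally these from a saved vmkernel log rather than decode them one by one, here is a minimal Python sketch. It assumes the usual "H:0x.. D:0x.. P:0x.. Valid sense data: 0x.. 0x.. 0x.." status format; the sample line is made up for illustration, not copied from our logs, and the function names (`decode_line`, `count_sense_keys`) are my own. The sense key names are the standard SCSI ones, and the ASC/ASCQ table only holds the one pair discussed here.

```python
import re
from collections import Counter

# Standard SCSI sense key names (subset).
SENSE_KEYS = {
    0x0: "NO SENSE",
    0x1: "RECOVERED ERROR",
    0x2: "NOT READY",
    0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR",
    0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION",
    0xB: "ABORTED COMMAND",
}

# Only the ASC/ASCQ pair from this thread; extend as needed.
ASC_ASCQ = {(0x15, 0x01): "MECHANICAL POSITIONING ERROR"}

HEX = r"(0x[0-9a-fA-F]+)"
STATUS_RE = re.compile(
    rf"H:{HEX} D:{HEX} P:{HEX} (?:Valid|Possible) sense data: {HEX} {HEX} {HEX}"
)


def decode_line(line):
    """Return the decoded status fields of one log line, or None if absent."""
    match = STATUS_RE.search(line)
    if match is None:
        return None
    host, device, plugin, key, asc, ascq = (int(g, 16) for g in match.groups())
    return {
        "host": host,
        "device": device,
        "plugin": plugin,
        "sense_key": SENSE_KEYS.get(key, f"unknown (0x{key:x})"),
        "additional": ASC_ASCQ.get((asc, ascq), f"unknown (0x{asc:02x}/0x{ascq:02x})"),
    }


def count_sense_keys(lines):
    """Count how often each sense key name appears across the given lines."""
    counts = Counter()
    for line in lines:
        decoded = decode_line(line)
        if decoded is not None:
            counts[decoded["sense_key"]] += 1
    return counts


if __name__ == "__main__":
    # Illustrative line only; real vmkernel entries carry more context.
    sample = "H:0x0 D:0x2 P:0x0 Valid sense data: 0x4 0x15 0x1."
    print(decode_line(sample))
```

Feeding the lines of a log file to `count_sense_keys` gives a quick view of how many entries are ILLEGAL REQUEST versus HARDWARE ERROR.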

srodenburg
Expert

I would actually expect "Dying Disk Handling (DDH)" to detect that bad drive regardless of sense code (as it can also act on excessive latency of a disk) and act accordingly.

KB Article about DDH: VMware Knowledge Base

(In older versions this feature was called "Problematic Disk Handling".)

markotsg80
Enthusiast

From what I've seen in Log Insight, there was no latency on this disk prior to the failure. As it was a mechanical positioning error, the disk simply stopped working; ESXi did not recognise the SCSI sense code from the device (disk) and did not treat it as PDL.

vSAN still tries to use the disk, and eventually the ESXi host crashes.

Dying Disk Handling did not pick up this disk as there was no evidence of increased latency.
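To illustrate why this code falls through, here is a rough Python sketch of a PDL check. The set of sense codes is my recollection of the ones VMware documents as PDL indicators, so treat it as an assumption and verify it against the current KB. It is not ESXi's actual implementation, and `is_pdl` is a made-up name.

```python
# Sense codes assumed (from memory of VMware's PDL documentation) to mark
# a device as permanently lost: (sense key, ASC, ASCQ).
PDL_SENSE_CODES = {
    (0x5, 0x25, 0x00),  # LOGICAL UNIT NOT SUPPORTED
    (0x4, 0x4C, 0x00),  # LOGICAL UNIT FAILED SELF-CONFIGURATION
    (0x4, 0x3E, 0x03),  # LOGICAL UNIT FAILED SELF-TEST
    (0x4, 0x3E, 0x01),  # LOGICAL UNIT FAILURE
}


def is_pdl(sense_key, asc, ascq):
    """Return True if the sense triple is in the assumed PDL set."""
    return (sense_key, asc, ascq) in PDL_SENSE_CODES


if __name__ == "__main__":
    # The code from this thread: HARDWARE ERROR, MECHANICAL POSITIONING ERROR.
    print(is_pdl(0x4, 0x15, 0x01))
    # A code from the assumed PDL set: LOGICAL UNIT FAILURE.
    print(is_pdl(0x4, 0x3E, 0x01))
```

Under that assumption, 0x4 0x15 0x1 is a hardware error but is not one of the codes treated as PDL, which matches the behaviour we are seeing.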

srodenburg
Expert

Understood.

Dying Disk Handling did not notice a Dying Disk (pun intended).

Well, then VMware needs to add that sense code to DDH. Did you open a support request? They can't help you anymore of course, but at least they become aware so that engineering can put it on their to-do list.

markotsg80
Enthusiast

Thanks all. I am relatively new to vSAN (love it so far), but I am surprised that ESXi does not recognise these SCSI codes for disks from a well-known and supported vendor such as HPE.

These servers and disks are on the VMware HCL and supported by all vSAN versions.
