Re: How to identify hard disk that shows "Predictive Failure" (HPE server)

peetz · ‎09-24-2018

Greetings,

I have a vSphere 6.5 based hybrid-mode vSAN Cluster using HPE ProLiant DL380 Gen9 nodes.

In the hardware status of one of the hosts, one hard disk is shown with "Predictive Failure" status. The shell command

esxcli ssacli cmd -q "ctrl slot=0 pd all show"

outputs this:

Smart Array P840ar in Slot 0 (Embedded)

   HBA Drives

      physicaldrive 1I:3:1 (port 1I:box 3:bay 1, SAS HDD, 1.2 TB, OK)
      physicaldrive 1I:3:2 (port 1I:box 3:bay 2, SAS HDD, 1.2 TB, OK)
      physicaldrive 1I:3:3 (port 1I:box 3:bay 3, SAS HDD, 1.2 TB, Predictive Failure)
      physicaldrive 1I:3:4 (port 1I:box 3:bay 4, SAS HDD, 1.2 TB, OK)
      physicaldrive 1I:3:5 (port 1I:box 3:bay 5, SAS HDD, 1.2 TB, OK)
      physicaldrive 1I:3:6 (port 1I:box 3:bay 6, SAS HDD, 1.2 TB, OK)
      physicaldrive 1I:3:7 (port 1I:box 3:bay 7, SAS HDD, 1.2 TB, OK)
      physicaldrive 1I:3:8 (port 1I:box 3:bay 8, SAS HDD, 1.2 TB, OK)
      physicaldrive 2I:2:1 (port 2I:box 2:bay 1, SAS HDD, 1.2 TB, OK)
      physicaldrive 2I:2:2 (port 2I:box 2:bay 2, SAS HDD, 1.2 TB, OK)
      physicaldrive 2I:2:3 (port 2I:box 2:bay 3, SAS HDD, 1.2 TB, OK)
      physicaldrive 2I:2:4 (port 2I:box 2:bay 4, SAS HDD, 1.2 TB, OK)
      physicaldrive 2I:2:5 (port 2I:box 2:bay 5, SAS HDD, 1.2 TB, OK)
      physicaldrive 2I:2:6 (port 2I:box 2:bay 6, SAS HDD, 1.2 TB, OK)
      physicaldrive 2I:2:7 (port 2I:box 2:bay 7, SAS SSD, 400 GB, OK)
      physicaldrive 2I:2:8 (port 2I:box 2:bay 8, SAS SSD, 400 GB, OK)
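
With sixteen drives per host, it can be easier to filter this listing than to read it. The following is only a minimal Python sketch, assuming the output has been captured as text and keeps the "physicaldrive ... (..., status)" line layout shown above; the function name drives_not_ok is made up for illustration.

```python
import re

# Matches lines such as:
#   physicaldrive 1I:3:3 (port 1I:box 3:bay 3, SAS HDD, 1.2 TB, Predictive Failure)
PD_LINE = re.compile(r"physicaldrive\s+(\S+)\s+\((.*)\)\s*$")


def drives_not_ok(ssacli_output):
    """Return (drive_id, status) for every physical drive whose status is not OK."""
    flagged = []
    for line in ssacli_output.splitlines():
        match = PD_LINE.search(line)
        if not match:
            continue  # controller header, "HBA Drives", blank lines
        drive_id, details = match.groups()
        # The status is the last comma-separated field inside the parentheses.
        status = details.split(",")[-1].strip()
        if status != "OK":
            flagged.append((drive_id, status))
    return flagged


# Three lines copied from the listing above.
sample = """\
physicaldrive 1I:3:2 (port 1I:box 3:bay 2, SAS HDD, 1.2 TB, OK)
physicaldrive 1I:3:3 (port 1I:box 3:bay 3, SAS HDD, 1.2 TB, Predictive Failure)
physicaldrive 2I:2:7 (port 2I:box 2:bay 7, SAS SSD, 400 GB, OK)
"""
print(drives_not_ok(sample))
```

This only reports the port:box:bay address of the affected drive; it does not yet say which ESXi device that is.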

However, it looks like ESXi has not (yet) identified the disk to be "bad". The vSAN status is still "OK" for all disks.

Now I have a hard time identifying which of the vSAN disks needs to be decommissioned and replaced, because I do not know the naa ID of the bad disk.

The output of

esxcli storage core device list

doesn't tell me anything useful.

The output of

esxcli storage core path list

looks better. For a single disk it outputs something like this:

sas.5001438040e9a460-sas.1438040e9a460-naa.5000c5009f66edeb
   UID: sas.5001438040e9a460-sas.1438040e9a460-naa.5000c5009f66edeb
   Runtime Name: vmhba2:C2:T8:L0
   Device: naa.5000c5009f66edeb
   Device Display Name: Local HP Disk (naa.5000c5009f66edeb)
   Adapter: vmhba2
   Channel: 2
   Target: 8
   LUN: 0
   Plugin: NMP
   State: active
   Transport: sas
   Adapter Identifier: sas.5001438040e9a460
   Target Identifier: sas.1438040e9a460
   Adapter Transport Details: 5001438040e9a460
   Target Transport Details: 1438040e9a460
   Maximum IO Size: 4194304

In fact, all hard disks are shown on adapter vmhba2 and channel 2; only the target number differs, counting from 0 to 15.
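
To get an overview of which target belongs to which device, the path list can be condensed into a small table. This is a rough Python sketch, assuming the output is available as text with the "Key: Value" lines shown above; parse_path_list is a hypothetical helper name. It does not by itself answer the port/box/bay question, it only pairs each runtime name with its device ID.

```python
def parse_path_list(output):
    """Map each 'Runtime Name' to the 'Device' listed in the same path entry."""
    mapping = {}
    runtime = None
    for line in output.splitlines():
        key, sep, value = line.strip().partition(": ")
        if not sep:
            continue  # path header line or blank line
        if key == "Runtime Name":
            runtime = value.strip()
        elif key == "Device" and runtime is not None:
            # "Device Display Name" has a different key and is ignored here.
            mapping[runtime] = value.strip()
            runtime = None
    return mapping


# Excerpt of the entry shown above.
sample = """\
sas.5001438040e9a460-sas.1438040e9a460-naa.5000c5009f66edeb
   UID: sas.5001438040e9a460-sas.1438040e9a460-naa.5000c5009f66edeb
   Runtime Name: vmhba2:C2:T8:L0
   Device: naa.5000c5009f66edeb
   Device Display Name: Local HP Disk (naa.5000c5009f66edeb)
   Target: 8
"""
for runtime_name, device in parse_path_list(sample).items():
    print(runtime_name, device)
```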

Now how do I match the port/box/bay notation of the ssacli tool to the adapter/channel/target notation to find the actual ESXi device id of the bad disk?

Thank you for any pointers ...

Andreas

Twitter: @VFrontDe, @ESXiPatches | https://esxi-patches.v-front.de | https://vibsdepot.v-front.de

RickVerstegen · ‎09-24-2018

Hi,

Maybe you could use the hpssacli utility to see the state of the device and verify that info with iLO.

Was I helpful? Give a kudo for appreciation!
Blog: https://rickverstegen84.wordpress.com/
Twitter: https://twitter.com/verstegenrick

peetz · ‎09-24-2018

Thanks, but that doesn't help to find the naa device id of the failed disk.


peetz · ‎09-24-2018

I did some more digging and finally found the information I needed:

1. Look at the details of the failed disk with

esxcli ssacli cmd -q "ctrl slot=0 pd 1I:3:3 show"

Among other details this outputs the WWID of the disk (5000C5009F67BDA5 in my case). This WWID looks like an naa.id, but I could not find a disk device that has exactly this ID.

2. However, I was able to find a disk that has "almost" this naa.id and only differs in the last digit (naa.5000c5009f67bda7 in my case). By comparing the other disks' WWIDs and looking at the available naa.ids I found that this is the case for all the WWIDs and disks.

So, I'm pretty sure now that this is a good way to match the output of ssacli to the naa.ids and that I found the bad disk.
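
The comparison in step 2 can be sketched in a few lines of Python. This is only an illustration of the observation above (the WWID and the naa ID agree in everything except the last hex digit), not a documented rule; match_wwid_to_naa is a made-up name, and the list of naa IDs is assumed to have been collected beforehand, for example from the device list.

```python
def match_wwid_to_naa(wwid, naa_ids):
    """Return the naa IDs that agree with the WWID in all but the last hex digit.

    An ID that is fully identical to the WWID is returned as well.
    """
    prefix = wwid.lower()[:-1]
    matches = []
    for naa in naa_ids:
        ident = naa.lower()
        if ident.startswith("naa."):
            ident = ident[len("naa."):]
        if len(ident) == len(wwid) and ident[:-1] == prefix:
            matches.append(naa)
    return matches


# WWID reported by ssacli for drive 1I:3:3 and two naa IDs from this thread.
devices = ["naa.5000c5009f66edeb", "naa.5000c5009f67bda7"]
print(match_wwid_to_naa("5000C5009F67BDA5", devices))
```

If more than one ID comes back for a single WWID, the match is ambiguous and should be checked by hand before the disk is removed from the disk group.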


RickVerstegen · ‎09-25-2018

Good to hear that you found the information you needed. I came across this blog post: https://www.perthorn.com/vsan-operations-local-disk-identification/
It also provides a kind of workaround for finding the bad disk.


anubav · ‎09-30-2018

The easiest way to figure out the bad disk would be to identify the "Runtime Name: vmhba2:C2:T8:L0" and then blink the rest of the disks. The one that does not blink would be the one you want to remove and replace.

I have dealt with failed SSDs many times in our environment and that has always worked for me.

peetz · ‎10-02-2018

This helps to physically identify the disk and have it properly replaced, but my intention was to identify the right ESXi disk before it is physically removed, because I want to remove it from the vSAN disk group and evacuate all data from it before I physically replace it.

This is best practice and the safest way when removing a hard disk that is part of vSAN. With SSDs that belong to the vSAN cache tier it is even more important to properly replace it, because removing it invalidates the whole disk group that it belongs to.


Petr_HPE · ‎07-22-2021

Late response, but it may still be useful.

Get the WWN and serial number of the installed drives from the HPE server (acucli, ADU report, ...).

Example

***** Discovered Devices - Additional Information *****
Device ,WWN ,WWN hash, Handle
D000 p0|0x1 [01]P2I:01:05,5000C500589F4955, 0x2D244B,05060005

***** Discovered Devices *****
Device [BoxIndex]Port:BoxOnPort:Bay
Path|Paths ,Type Vendor ,Product ,Rev ,SerialNumber [,misc]
D000 p0|0x1 [01]P2I:01:05,Disk HP ,MM1000FBFVR ,HPD9,9XG6P2SB000094406240,07K,SCFW=11,SCTYPE=1

Run on the ESXi host: esxcfg-scsidevs -l

Example

mpx.vmhba0:C0:T64:L0
   Device Type: Direct-Access
   Size: 1831420 MB
   Display Name: Local ATA Disk (mpx.vmhba0:C0:T64:L0)
   Multipath Plugin: NMP
   Console Device: /vmfs/devices/disks/mpx.vmhba0:C0:T64:L0
   Devfs Path: /vmfs/devices/disks/mpx.vmhba0:C0:T64:L0
   Vendor: ATA Model: VK001920GWEZE Revis: HPGE
   SCSI Level: 6 Is Pseudo: false Status: on
   Is RDM Capable: true Is Removable: false
   Is Local: true Is SSD: true
   Other Names:
      vml.02000000005000C500589F4955b30303139

Resolution

The device WWN 5000C500589F4955 is included in the Other Names entry that starts with vml.02000.

This helps you identify the device mpx.vmhba0:C0:T64:L0.
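
The lookup described above can be sketched in Python as follows. This is only a sketch under assumptions: the esxcfg-scsidevs -l output has been saved as text, each device entry begins with a line starting with naa., mpx., t10. or eui., and the helper name find_device_by_wwn is hypothetical.

```python
# Assumption: device entries start with one of these identifier prefixes.
DEVICE_PREFIXES = ("naa.", "mpx.", "t10.", "eui.")


def find_device_by_wwn(scsidevs_output, wwn):
    """Return the devices whose vml.* 'Other Names' entries contain the WWN."""
    needle = wwn.lower()
    current = None
    found = []
    for raw in scsidevs_output.splitlines():
        line = raw.strip()
        if line.startswith(DEVICE_PREFIXES):
            current = line  # start of a new device entry
        elif line.startswith("vml.") and needle in line.lower():
            if current is not None and current not in found:
                found.append(current)
    return found


# Excerpt of the example entry above.
sample = """\
mpx.vmhba0:C0:T64:L0
   Device Type: Direct-Access
   Display Name: Local ATA Disk (mpx.vmhba0:C0:T64:L0)
   Other Names:
      vml.02000000005000C500589F4955b30303139
"""
print(find_device_by_wwn(sample, "5000C500589F4955"))
```

The match is a case-insensitive substring search, so it is worth confirming the result against the drive's serial number before acting on it.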
