VMware Cloud Community
peetz
Leadership

How to identify hard disk that shows "Predictive Failure" (HPE server)

Greetings,

I have a vSphere 6.5 based hybrid-mode vSAN Cluster using HPE ProLiant DL380 Gen9 nodes.

In the hardware status of one of the hosts, one hard disk is shown with "Predictive Failure" status. The shell command

   esxcli ssacli cmd -q "ctrl slot=0 pd all show"

outputs this

Smart Array P840ar in Slot 0 (Embedded)

   HBA Drives

      physicaldrive 1I:3:1 (port 1I:box 3:bay 1, SAS HDD, 1.2 TB, OK)
      physicaldrive 1I:3:2 (port 1I:box 3:bay 2, SAS HDD, 1.2 TB, OK)
      physicaldrive 1I:3:3 (port 1I:box 3:bay 3, SAS HDD, 1.2 TB, Predictive Failure)
      physicaldrive 1I:3:4 (port 1I:box 3:bay 4, SAS HDD, 1.2 TB, OK)
      physicaldrive 1I:3:5 (port 1I:box 3:bay 5, SAS HDD, 1.2 TB, OK)
      physicaldrive 1I:3:6 (port 1I:box 3:bay 6, SAS HDD, 1.2 TB, OK)
      physicaldrive 1I:3:7 (port 1I:box 3:bay 7, SAS HDD, 1.2 TB, OK)
      physicaldrive 1I:3:8 (port 1I:box 3:bay 8, SAS HDD, 1.2 TB, OK)
      physicaldrive 2I:2:1 (port 2I:box 2:bay 1, SAS HDD, 1.2 TB, OK)
      physicaldrive 2I:2:2 (port 2I:box 2:bay 2, SAS HDD, 1.2 TB, OK)
      physicaldrive 2I:2:3 (port 2I:box 2:bay 3, SAS HDD, 1.2 TB, OK)
      physicaldrive 2I:2:4 (port 2I:box 2:bay 4, SAS HDD, 1.2 TB, OK)
      physicaldrive 2I:2:5 (port 2I:box 2:bay 5, SAS HDD, 1.2 TB, OK)
      physicaldrive 2I:2:6 (port 2I:box 2:bay 6, SAS HDD, 1.2 TB, OK)
      physicaldrive 2I:2:7 (port 2I:box 2:bay 7, SAS SSD, 400 GB, OK)
      physicaldrive 2I:2:8 (port 2I:box 2:bay 8, SAS SSD, 400 GB, OK)

However, it looks like ESXi has not (yet) identified the disk as "bad". The vSAN status is still "OK" for all disks.
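For anyone who wants to script this check, here is a minimal Python sketch that picks the non-OK drives out of a listing like the one above. The function name `drives_not_ok` and the regular expression are my own, and I am assuming the output always has the "physicaldrive X (port ..., type, size, status)" shape shown here.

```python
import re

# Matches lines such as:
#   physicaldrive 1I:3:3 (port 1I:box 3:bay 3, SAS HDD, 1.2 TB, Predictive Failure)
PD_LINE = re.compile(
    r"physicaldrive\s+(?P<id>\S+)\s+\("
    r"port\s+(?P<port>[^:]+):box\s+(?P<box>\d+):bay\s+(?P<bay>\d+),\s*"
    r"(?P<type>[^,]+),\s*(?P<size>[^,]+),\s*(?P<status>[^)]+)\)"
)


def drives_not_ok(ssacli_output):
    """Return (drive id, status) for every physical drive whose status is not OK."""
    flagged = []
    for line in ssacli_output.splitlines():
        match = PD_LINE.search(line)
        if match and match.group("status").strip() != "OK":
            flagged.append((match.group("id"), match.group("status").strip()))
    return flagged


# Shortened copy of the listing from this post.
sample = """\
Smart Array P840ar in Slot 0 (Embedded)
   HBA Drives
      physicaldrive 1I:3:2 (port 1I:box 3:bay 2, SAS HDD, 1.2 TB, OK)
      physicaldrive 1I:3:3 (port 1I:box 3:bay 3, SAS HDD, 1.2 TB, Predictive Failure)
      physicaldrive 2I:2:7 (port 2I:box 2:bay 7, SAS SSD, 400 GB, OK)
"""

print(drives_not_ok(sample))
```

This only tells you the controller's port:box:bay id of the suspect drive; it does not yet give you the ESXi device id.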

Now, I have a hard time identifying which of the vSAN disks needs to be decommissioned and replaced, because I do not know the naa-id of the bad disk.

The output of

  esxcli storage core device list

doesn't tell me anything useful.

The output of

  esxcli storage core path list

looks better. For a single disk it outputs something like this:

sas.5001438040e9a460-sas.1438040e9a460-naa.5000c5009f66edeb
   UID: sas.5001438040e9a460-sas.1438040e9a460-naa.5000c5009f66edeb
   Runtime Name: vmhba2:C2:T8:L0
   Device: naa.5000c5009f66edeb
   Device Display Name: Local HP Disk (naa.5000c5009f66edeb)
   Adapter: vmhba2
   Channel: 2
   Target: 8
   LUN: 0
   Plugin: NMP
   State: active
   Transport: sas
   Adapter Identifier: sas.5001438040e9a460
   Target Identifier: sas.1438040e9a460
   Adapter Transport Details: 5001438040e9a460
   Target Transport Details: 1438040e9a460
   Maximum IO Size: 4194304

In fact, all hard disks are shown on adapter vmhba2 and channel 2; only the target number varies, counting from 0 to 15.
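To get an overview of all targets at once, the path list can be tabulated with a few lines of Python. This is only a sketch: `parse_path_list` and `target_to_device` are names I made up, and I am assuming every record starts with an unindented UID line followed by indented "Key: value" lines, as in the output above.

```python
def parse_path_list(text):
    """Split `esxcli storage core path list` output into one dict per path."""
    records = []
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # An unindented line is the path UID and starts a new record.
            current = {}
            records.append(current)
        elif current is not None and ":" in line:
            # Split on the first colon only, so values like
            # "vmhba2:C2:T8:L0" stay intact.
            key, _, value = line.strip().partition(":")
            current[key.strip()] = value.strip()
    return records


def target_to_device(records):
    """Map the SCSI target number of each path to its ESXi device id."""
    return {
        int(r["Target"]): r["Device"]
        for r in records
        if "Target" in r and "Device" in r
    }


# Shortened copy of the single-disk record from this post.
sample = """\
sas.5001438040e9a460-sas.1438040e9a460-naa.5000c5009f66edeb
   UID: sas.5001438040e9a460-sas.1438040e9a460-naa.5000c5009f66edeb
   Runtime Name: vmhba2:C2:T8:L0
   Device: naa.5000c5009f66edeb
   Adapter: vmhba2
   Channel: 2
   Target: 8
   LUN: 0
"""

print(target_to_device(parse_path_list(sample)))
```

Note that this table by itself does not say which target corresponds to which port/box/bay; it just makes the ESXi side easier to read.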

Now how do I match the port/box/bay notation of the ssacli tool to the adapter/channel/target notation to find the actual ESXi device id of the bad disk?

Thank you for any pointers ...

Andreas

Twitter: @VFrontDe, @ESXiPatches | https://esxi-patches.v-front.de | https://vibsdepot.v-front.de
7 Replies
RickVerstegen
Expert

Hi,

Maybe you could use the hpssacli utility to see the state of the device and verify that info with iLO.

Was I helpful? Give a kudo for appreciation!
Blog: https://rickverstegen84.wordpress.com/
Twitter: https://twitter.com/verstegenrick
peetz
Leadership

Thanks, but that doesn't help to find the naa device id of the failed disk.

peetz
Leadership

I did some more digging and finally found the information I needed:

1. Look at the details of the failed disk with

    esxcli ssacli cmd -q "ctrl slot=0 pd 1I:3:3 show"

Among other details this outputs the WWID of the disk (5000C5009F67BDA5 in my case). This WWID looks like an naa.id, but I could not find a disk device that has exactly this ID.

2. However, I was able to find a disk that has "almost" this naa.id and only differs in the last digit (naa.5000c5009f67bda7 in my case). By comparing the other disks' WWIDs and looking at the available naa.ids I found that this is the case for all the WWIDs and disks.

So, I'm pretty sure now that this is a good way to match the output of ssacli to the naa.ids and that I found the bad disk.
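The matching rule from step 2 can be written down as a small Python sketch. It is purely a heuristic based on what I observed on these disks (WWID and naa.id agree in everything except the last hex digit); the function names are made up, and I would not trust the result unless exactly one device matches.

```python
def _hex_part(naa_id):
    """Strip an optional 'naa.' prefix and normalise to lower case."""
    value = naa_id.strip().lower()
    return value[4:] if value.startswith("naa.") else value


def match_wwid_to_naa(wwid, naa_ids):
    """Return the naa ids that share all but the last hex digit with the WWID."""
    prefix = wwid.strip().lower()[:-1]
    return [n for n in naa_ids if _hex_part(n)[:-1] == prefix]


# WWID reported by ssacli for the failing drive, and two naa ids from this thread.
wwid = "5000C5009F67BDA5"
naa_ids = ["naa.5000c5009f66edeb", "naa.5000c5009f67bda7"]

print(match_wwid_to_naa(wwid, naa_ids))
```

If the list comes back empty or with more than one entry, the assumption does not hold for that hardware and the match should be verified some other way.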

RickVerstegen
Expert

Good to hear that you found the information you needed. I came across this blog post: https://www.perthorn.com/vsan-operations-local-disk-identification/
It also describes a workaround for finding the bad disk.

anubav
Contributor

The easiest way to find the bad disk would be to identify its runtime name (e.g. "Runtime Name: vmhba2:C2:T8:L0") and then blink the LEDs of the remaining disks. The one that does not blink is the one you want to remove and replace.

I have dealt with failed SSDs many times in our environment and that has always worked for me.

peetz
Leadership

This helps to physically identify the disk and have it properly replaced, but my intention was to identify the right ESXi disk before it is physically removed, because I want to remove it from the vSAN disk group and evacuate all data from it before I physically replace it.

This is best practice and the safest way to remove a hard disk that is part of vSAN. With SSDs that belong to the vSAN cache tier, proper replacement is even more important, because removing a cache SSD invalidates the whole disk group that it belongs to.
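To illustrate the difference between the two cases, here is a rough Python sketch that reads output in the style of `esxcli vsan storage list` and reports which devices are affected when a given disk is removed. The field names ("VSAN Disk Group UUID", "Is Capacity Tier") are as I remember them from that command, so please check them on your own host; the function names, the group UUID and the SSD device name in the sample are placeholders, not real values.

```python
def parse_vsan_storage_list(text):
    """Parse `esxcli vsan storage list`-style output into {device: {field: value}}."""
    disks = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # Unindented line: device name, starts a new record.
            current = disks.setdefault(line.strip(), {})
        elif current is not None and ":" in line:
            key, _, value = line.strip().partition(":")
            current[key.strip()] = value.strip()
    return disks


def removal_impact(disks, device):
    """Return (tier, affected devices) for removing `device` from vSAN."""
    info = disks[device]
    if info["Is Capacity Tier"].lower() == "true":
        # A capacity disk can be evacuated and removed on its own.
        return "capacity", [device]
    # A cache disk takes its whole disk group with it.
    group = info["VSAN Disk Group UUID"]
    members = sorted(
        d for d, i in disks.items() if i.get("VSAN Disk Group UUID") == group
    )
    return "cache", members


# Placeholder data: EXAMPLE-GROUP-1 and naa.EXAMPLE-CACHE-SSD are invented.
sample = """\
naa.5000c5009f67bda7
   Device: naa.5000c5009f67bda7
   Is SSD: false
   VSAN Disk Group UUID: EXAMPLE-GROUP-1
   Is Capacity Tier: true
naa.EXAMPLE-CACHE-SSD
   Device: naa.EXAMPLE-CACHE-SSD
   Is SSD: true
   VSAN Disk Group UUID: EXAMPLE-GROUP-1
   Is Capacity Tier: false
"""

disks = parse_vsan_storage_list(sample)
print(removal_impact(disks, "naa.5000c5009f67bda7"))
print(removal_impact(disks, "naa.EXAMPLE-CACHE-SSD"))
```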

Petr_HPE
Contributor

Late response, but it may still be useful.

Get the WWN and serial number of the installed drives from the HPE server (acucli, ADU report, ...).

Example

***** Discovered Devices - Additional Information *****
Device ,WWN ,WWN hash, Handle
D000 p0|0x1 [01]P2I:01:05,5000C500589F4955, 0x2D244B,05060005

***** Discovered Devices *****
Device [BoxIndex]Port:BoxOnPort:Bay
Path|Paths ,Type Vendor ,Product ,Rev ,SerialNumber [,misc]
D000 p0|0x1 [01]P2I:01:05,Disk HP ,MM1000FBFVR ,HPD9,9XG6P2SB000094406240,07K,SCFW=11,SCTYPE=1

Then run on the ESXi host:  esxcfg-scsidevs -l

Example

mpx.vmhba0:C0:T64:L0
   Device Type: Direct-Access
   Size: 1831420 MB
   Display Name: Local ATA Disk (mpx.vmhba0:C0:T64:L0)
   Multipath Plugin: NMP
   Console Device: /vmfs/devices/disks/mpx.vmhba0:C0:T64:L0
   Devfs Path: /vmfs/devices/disks/mpx.vmhba0:C0:T64:L0
   Vendor: ATA       Model: VK001920GWEZE     Revis: HPGE
   SCSI Level: 6  Is Pseudo: false Status: on
   Is RDM Capable: true  Is Removable: false
   Is Local: true  Is SSD: true
   Other Names:
      vml.02000000005000C500589F4955b30303139

Resolution

The device WWN 5000C500589F4955 is included in the vml.02000... entry under Other Names.

This helps you identify mpx.vmhba0:C0:T64:L0.
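If there are many drives, the lookup can be scripted. Below is a small Python sketch that searches `esxcfg-scsidevs -l` output for a WWN inside the vml. names; `find_device_by_wwn` is a made-up name, and I am assuming each device block starts with an unindented device name followed by indented detail lines, as in the example above.

```python
def find_device_by_wwn(scsidevs_output, wwn):
    """Return device names whose vml. alias contains the given WWN."""
    needle = wwn.strip().lower()
    matches = []
    device = None
    for line in scsidevs_output.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # Unindented line: start of a new device block.
            device = line.strip()
        elif line.strip().lower().startswith("vml.") and needle in line.lower():
            if device is not None and device not in matches:
                matches.append(device)
    return matches


# Shortened copy of the example output above.
sample = """\
mpx.vmhba0:C0:T64:L0
   Device Type: Direct-Access
   Display Name: Local ATA Disk (mpx.vmhba0:C0:T64:L0)
   Other Names:
      vml.02000000005000C500589F4955b30303139
"""

print(find_device_by_wwn(sample, "5000C500589F4955"))
```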
