Greetings,
I have a vSphere 6.5 based hybrid-mode vSAN Cluster using HPE ProLiant DL380 Gen9 nodes.
In the hardware status of one of the hosts, one hard disk is shown with a "Predictive Failure" status. The shell command
esxcli ssacli cmd -q "ctrl slot=0 pd all show"
outputs this
Smart Array P840ar in Slot 0 (Embedded)
HBA Drives
physicaldrive 1I:3:1 (port 1I:box 3:bay 1, SAS HDD, 1.2 TB, OK)
physicaldrive 1I:3:2 (port 1I:box 3:bay 2, SAS HDD, 1.2 TB, OK)
physicaldrive 1I:3:3 (port 1I:box 3:bay 3, SAS HDD, 1.2 TB, Predictive Failure)
physicaldrive 1I:3:4 (port 1I:box 3:bay 4, SAS HDD, 1.2 TB, OK)
physicaldrive 1I:3:5 (port 1I:box 3:bay 5, SAS HDD, 1.2 TB, OK)
physicaldrive 1I:3:6 (port 1I:box 3:bay 6, SAS HDD, 1.2 TB, OK)
physicaldrive 1I:3:7 (port 1I:box 3:bay 7, SAS HDD, 1.2 TB, OK)
physicaldrive 1I:3:8 (port 1I:box 3:bay 8, SAS HDD, 1.2 TB, OK)
physicaldrive 2I:2:1 (port 2I:box 2:bay 1, SAS HDD, 1.2 TB, OK)
physicaldrive 2I:2:2 (port 2I:box 2:bay 2, SAS HDD, 1.2 TB, OK)
physicaldrive 2I:2:3 (port 2I:box 2:bay 3, SAS HDD, 1.2 TB, OK)
physicaldrive 2I:2:4 (port 2I:box 2:bay 4, SAS HDD, 1.2 TB, OK)
physicaldrive 2I:2:5 (port 2I:box 2:bay 5, SAS HDD, 1.2 TB, OK)
physicaldrive 2I:2:6 (port 2I:box 2:bay 6, SAS HDD, 1.2 TB, OK)
physicaldrive 2I:2:7 (port 2I:box 2:bay 7, SAS SSD, 400 GB, OK)
physicaldrive 2I:2:8 (port 2I:box 2:bay 8, SAS SSD, 400 GB, OK)
However, it looks like ESXi has not (yet) identified the disk to be "bad". The vSAN status is still "OK" for all disks.
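To check this listing on several hosts without reading it by eye, the non-OK drives can be filtered out of the text. This is only a minimal sketch: it assumes the output has been captured into a string and that every drive line has the "physicaldrive ID (..., status)" shape shown above. The function name non_ok_drives is made up for illustration.

```python
import re

# Matches lines such as:
# physicaldrive 1I:3:3 (port 1I:box 3:bay 3, SAS HDD, 1.2 TB, Predictive Failure)
PD_LINE = re.compile(r"physicaldrive\s+(?P<id>\S+)\s+\((?P<details>.*)\)\s*$")


def non_ok_drives(ssacli_output):
    """Return (drive id, status) for every drive whose status is not OK."""
    result = []
    for line in ssacli_output.splitlines():
        match = PD_LINE.search(line)
        if not match:
            continue  # controller header, "HBA Drives", blank lines
        # The status is assumed to be the last comma-separated field.
        status = match.group("details").rsplit(",", 1)[-1].strip()
        if status != "OK":
            result.append((match.group("id"), status))
    return result


sample = """\
Smart Array P840ar in Slot 0 (Embedded)
HBA Drives
physicaldrive 1I:3:2 (port 1I:box 3:bay 2, SAS HDD, 1.2 TB, OK)
physicaldrive 1I:3:3 (port 1I:box 3:bay 3, SAS HDD, 1.2 TB, Predictive Failure)
physicaldrive 2I:2:7 (port 2I:box 2:bay 7, SAS SSD, 400 GB, OK)
"""
print(non_ok_drives(sample))  # [('1I:3:3', 'Predictive Failure')]
```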
Now, I have a hard time identifying which of the vSAN disks needs to be decommissioned and replaced, because I do not know the naa-id of the bad disk.
The output of
esxcli storage core device list
doesn't tell me anything useful.
The output of
esxcli storage core path list
looks better. For a single disk it outputs something like this:
sas.5001438040e9a460-sas.1438040e9a460-naa.5000c5009f66edeb
UID: sas.5001438040e9a460-sas.1438040e9a460-naa.5000c5009f66edeb
Runtime Name: vmhba2:C2:T8:L0
Device: naa.5000c5009f66edeb
Device Display Name: Local HP Disk (naa.5000c5009f66edeb)
Adapter: vmhba2
Channel: 2
Target: 8
LUN: 0
Plugin: NMP
State: active
Transport: sas
Adapter Identifier: sas.5001438040e9a460
Target Identifier: sas.1438040e9a460
Adapter Transport Details: 5001438040e9a460
Target Transport Details: 1438040e9a460
Maximum IO Size: 4194304
Actually all hard disks are shown to be on Adapter vmhba2 and Channel 2, only the target number counts from 0 to 15.
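To get an overview of which target number belongs to which device, the path list can be turned into a small table. This is a rough sketch only: it assumes the real output has each path name starting in the first column with its attributes indented beneath it (the indentation is lost in the paste above), and parse_path_list is a hypothetical helper name.

```python
def parse_path_list(output):
    """Return one dict per path block of `esxcli storage core path list`."""
    paths = []
    current = None
    for line in output.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # Unindented line: start of a new path block.
            current = {"Path": line.strip()}
            paths.append(current)
        elif current is not None:
            # Indented "Key: value" attribute line.
            key, _, value = line.strip().partition(": ")
            current[key.rstrip(":")] = value
    return paths


sample = """\
sas.5001438040e9a460-sas.1438040e9a460-naa.5000c5009f66edeb
   Runtime Name: vmhba2:C2:T8:L0
   Device: naa.5000c5009f66edeb
   Adapter: vmhba2
   Channel: 2
   Target: 8
   LUN: 0
"""
targets = {p["Target"]: p["Device"] for p in parse_path_list(sample)}
print(targets)  # {'8': 'naa.5000c5009f66edeb'}
```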
Now how do I match the port/box/bay notation of the ssacli tool to the adapter/channel/target notation to find the actual ESXi device id of the bad disk?
Thank you for any pointers ...
Andreas
Hi,
Maybe you could use the hpssacli utility to see the state of the device and verify that information against iLO.
Thanks, but that doesn't help to find the naa device id of the failed disk.
I did some more digging and finally found the information I needed:
1. Look at the details of the failed disk with
esxcli ssacli cmd -q "ctrl slot=0 pd 1I:3:3 show"
Among other details this outputs the WWID of the disk (5000C5009F67BDA5 in my case). This WWID looks like an naa.id, but I could not find a disk device that has exactly this ID.
2. However, I was able to find a disk that has "almost" this naa.id and only differs in the last digit (naa.5000c5009f67bda7 in my case). By comparing the other disks' WWIDs and looking at the available naa.ids I found that this is the case for all the WWIDs and disks.
So, I'm pretty sure now that this is a good way to match the output of ssacli to the naa.ids and that I found the bad disk.
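The comparison from step 2 can be sketched in a few lines. This assumes what I observed on my hosts, namely that the WWID and the naa.id differ only in the last hex digit; that is an observation, not a documented rule, so treat any match as a hint to verify. The function name match_wwid_to_naa is made up for illustration.

```python
def match_wwid_to_naa(wwid, naa_ids):
    """Return the naa IDs that equal the WWID except for the last hex digit."""
    wwid = wwid.strip().lower()
    prefix = wwid[:-1]
    matches = []
    for naa in naa_ids:
        # Drop the leading "naa." and compare case-insensitively.
        hex_part = naa.strip().lower().split("naa.", 1)[-1]
        if len(hex_part) == len(wwid) and hex_part[:-1] == prefix:
            matches.append(naa)
    return matches


# WWID reported by ssacli for drive 1I:3:3, and two naa.ids seen on the host.
naa_ids = ["naa.5000c5009f66edeb", "naa.5000c5009f67bda7"]
print(match_wwid_to_naa("5000C5009F67BDA5", naa_ids))  # ['naa.5000c5009f67bda7']
```

If the list contains more than one match, the heuristic is not sufficient on its own and the remaining disks should be compared as well.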
Good to hear that you found the information you needed. I came across this blog https://www.perthorn.com/vsan-operations-local-disk-identification/
It also provides kind of a workaround to find the bad disk.
The easiest way to find the bad disk would be to identify the "Runtime Name: vmhba2:C2:T8:L0" and then blink the rest of the disks. The one that does not blink would be the one you want to remove and replace.
I have dealt with failed SSDs many times in our environment and that has always worked for me.
This helps to physically identify the disk and have it properly replaced, but my intention was to identify the right ESXi disk before it is physically removed, because I want to remove it from the vSAN disk group and evacuate all data from it before I physically replace it.
This is best practice and the safest way when removing a hard disk that is part of vSAN. With SSDs that belong to the vSAN cache tier it is even more important to properly replace it, because removing it invalidates the whole disk group that it belongs to.
Late response, but it can still be useful.
Get the WWN and serial number of the installed drives from the HPE server (acucli, ADU report ..)
Example
***** Discovered Devices - Additional Information *****
Device ,WWN ,WWN hash, Handle
D000 p0|0x1 [01]P2I:01:05,5000C500589F4955, 0x2D244B,05060005
***** Discovered Devices *****
Device [BoxIndex]Port:BoxOnPort:Bay
Path|Paths ,Type Vendor ,Product ,Rev ,SerialNumber [,misc]
D000 p0|0x1 [01]P2I:01:05,Disk HP ,MM1000FBFVR ,HPD9,9XG6P2SB000094406240,07K,SCFW=11,SCTYPE=1
Run on the ESXi host: esxcfg-scsidevs -l
Example
mpx.vmhba0:C0:T64:L0
Device Type: Direct-Access
Size: 1831420 MB
Display Name: Local ATA Disk (mpx.vmhba0:C0:T64:L0)
Multipath Plugin: NMP
Console Device: /vmfs/devices/disks/mpx.vmhba0:C0:T64:L0
Devfs Path: /vmfs/devices/disks/mpx.vmhba0:C0:T64:L0
Vendor: ATA Model: VK001920GWEZE Revis: HPGE
SCSI Level: 6 Is Pseudo: false Status: on
Is RDM Capable: true Is Removable: false
Is Local: true Is SSD: true
Other Names:
vml.02000000005000C500589F4955b30303139
Resolution
The device WWN 5000C500589F4955 is included in the Other Names entry that starts with vml.02000
This helps you identify mpx.vmhba0:C0:T64:L0.
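The lookup described above could be scripted roughly like this. It is only a sketch: it assumes the real esxcfg-scsidevs -l output starts each device name in the first column and indents the attribute lines below it (the indentation is lost in the paste above), and find_device_by_wwn is a hypothetical helper name.

```python
def find_device_by_wwn(scsidevs_output, wwn):
    """Return the device whose attribute lines (e.g. Other Names) contain the WWN."""
    wwn = wwn.strip().lower()
    current = None
    for line in scsidevs_output.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # Unindented line: name of the next device block.
            current = line.strip()
        elif current and wwn in line.lower():
            return current
    return None


sample = """\
mpx.vmhba0:C0:T64:L0
   Device Type: Direct-Access
   Display Name: Local ATA Disk (mpx.vmhba0:C0:T64:L0)
   Other Names:
      vml.02000000005000C500589F4955b30303139
"""
print(find_device_by_wwn(sample, "5000C500589F4955"))  # mpx.vmhba0:C0:T64:L0
```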