Solved: Re: RAID6 cachecade drive took a dump. Datastore not being detected after.

Ohgodcomeon · ‎02-24-2022

Hey all. I went to bed last night, and during the night, one of my RAIDs took a dump. It's on an LSI MegaRAID controller, in which I installed an SSD to use CacheCade.

Upon waking up the next day, I was unable to access most of the VMs on my server, and when I took a look at it, the activity light on the CacheCade SSD was blinking in a strange fashion. After a reboot, I saw that the RAID seemed to have failed, though the drives were able to properly initialize at their same capacity. My assumption is that something with the CacheCade drive got weird. I'm probably going to remove that soon.

I noticed that storcli showed a UBad state for the SSD shown below:

PD LIST :
=======

------------------------------------------------------------------------
EID:Slt DID State DG      Size Intf Med SED PI SeSz Model            Sp
------------------------------------------------------------------------
252:0   127 UBad  -  465.25 GB SATA SSD Y   N  512B CT500MX500SSD1   U
252:1   146 Onln  0   1.090 TB SAS  HDD N   N  512B STHB1200S5xeN010 U
252:2   152 Onln  0   1.090 TB SAS  HDD N   N  512B HCBF1200S5xeN010 U
252:3   151 Onln  0   1.090 TB SAS  HDD N   N  512B HCBF1200S5xeN010 U
252:4   156 Onln  0   1.090 TB SAS  HDD N   N  512B STHB1200S5xeN010 U
252:5   155 Onln  0   1.090 TB SAS  HDD N   N  512B HCBF1200S5xeN010 U
252:6   154 Onln  0   1.090 TB SAS  HDD N   N  512B HCBF1200S5xeN010 U
252:7   153 Onln  0   1.090 TB SAS  HDD N   N  512B HCBF1200S5xeN010 U
------------------------------------------------------------------------

I set the drive back to good, hoping the datastore would simply be detected again after the recovery. Sadly, it was not. I know the data on this RAID has not disappeared, but I'm not sure how to go about recovering it. It seems like there's a problem with the partition table.

partedUtil shows the following:

[root@vmware:/vmfs/volumes] partedUtil getptbl /vmfs/devices/disks/naa.600605b006882f80253dbff41a5c1a2d
unknown
729279 255 63 11715870720
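For anyone reading that output: the first line is the partition table type (here unknown), and the second is the disk geometry, i.e. cylinders, heads, sectors per track, and total sector count. A quick sanity check on the sector count shows whether the controller is still presenting the whole array. The sketch below assumes 512-byte sectors (the SeSz column in the PD list suggests this, but that is an assumption for the virtual drive), and sectors_to_tib is a made-up helper name.

```shell
#!/bin/sh
# Convert a sector count to TiB, assuming 512-byte sectors.
# (Assumption: check the real logical block size on your own system.)
sectors_to_tib() {
    awk -v s="$1" 'BEGIN { printf "%.3f\n", s * 512 / (1024 ^ 4) }'
}

# Total sector count taken from the partedUtil getptbl output above.
sectors_to_tib 11715870720
```

That works out to roughly 5.46 TiB, in line with the 5.455 TB storcli reports for the RAID6 virtual drive, so the device itself is still being presented at full size even though the partition table can't be read.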

There are no results when running esxcfg-volume -l.

If I want to, I can create a new datastore on the RAID, but I feel like that would be a mistake, since it would overwrite the data.

The vmkernel.log shows the following errors:

2022-02-25T00:08:29.308Z cpu15:2098050)ScsiDeviceIO: 3449: Cmd(0x459a40b136c0) 0x28, CmdSN 0x1961 from world 0 to dev "naa.600605b006882f80253dbff41a5c1a2d" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x2 0x4 0x3.
2022-02-25T00:08:29.308Z cpu25:2101601)Partition: 430: Failed read for "naa.600605b006882f80253dbff41a5c1a2d": I/O error
2022-02-25T00:08:29.308Z cpu25:2101601)Partition: 1108: Failed to read protective mbr on "naa.600605b006882f80253dbff41a5c1a2d" : I/O error
2022-02-25T00:08:29.308Z cpu25:2101601)WARNING: Partition: 1261: Partition table read from device naa.600605b006882f80253dbff41a5c1a2d failed: I/O error
2022-02-25T00:08:29.309Z cpu15:2098050)NMP: nmp_ThrottleLogForDevice:3802: Cmd 0x28 (0x459a40b136c0, 2101601) to dev "naa.600605b006882f80253dbff41a5c1a2d" on path "vmhba2:C2:T1:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x2 0x4 0x3. Act:NONE

Any help would be appreciated. It seems like the MBR got hosed.
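A note on reading those log lines: Cmd 0x28 is a SCSI READ(10), D:0x2 is the device status CHECK CONDITION, and the three values after "Valid sense data:" are sense key, ASC, and ASCQ. Here 0x2 0x4 0x3 decodes to NOT READY, logical unit not ready, manual intervention required. That points at the device refusing reads rather than at a damaged partition table. Below is a rough sketch for pulling the triplet out of a vmkernel.log line; decode_sense is a hypothetical helper, and its lookup table only covers the codes seen in this thread.

```shell
#!/bin/sh
# Decode the "Valid sense data: 0xK 0xA 0xQ" triplet from a vmkernel.log line.
# Only the codes seen in this thread are mapped; anything else is printed raw.
decode_sense() {
    # Pull out the three hex values that follow "Valid sense data:".
    triplet=$(printf '%s\n' "$1" | sed -n 's/.*Valid sense data: \(0x[0-9a-fA-F]*\) \(0x[0-9a-fA-F]*\) \(0x[0-9a-fA-F]*\).*/\1 \2 \3/p')
    [ -n "$triplet" ] || { echo "no sense data found"; return 1; }
    # Split the triplet into $1 (sense key), $2 (ASC), $3 (ASCQ).
    set -- $triplet
    case "$1" in
        0x2) key="NOT READY" ;;
        *)   key="sense key $1" ;;
    esac
    case "$2/$3" in
        0x4/0x3) asc="logical unit not ready, manual intervention required" ;;
        *)       asc="ASC/ASCQ $2/$3" ;;
    esac
    echo "$key: $asc"
}

# Example using the first log line from above (timestamp omitted).
decode_sense 'cpu15:2098050)ScsiDeviceIO: 3449: Cmd(0x459a40b136c0) 0x28, CmdSN 0x1961 from world 0 to dev "naa.600605b006882f80253dbff41a5c1a2d" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x2 0x4 0x3.'
```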

Ohgodcomeon · ‎02-24-2022

Oh, further info:

[root@vmware:/dev/disks] partedUtil getUsableSectors naa.600605b006882f80253dbff41a5c1a2d
Unknown partition table on disk naa.600605b006882f80253dbff41a5c1a2d

Ohgodcomeon · ‎02-24-2022

Alright, I have solved the issue. No idea why, but MegaRAID blocks virtual drive access after a drive failure of this type. Here's how it went:

[root@vmware:/opt/lsi/storcli] ./storcli /c1/v1 show
Controller = 1
Status = Success
Description = None


Virtual Drives :
==============

-------------------------------------------------------------
DG/VD TYPE  State Access Consist Cache Cac sCC     Size Name
-------------------------------------------------------------
0/1   RAID6 Optl  B      Yes     NRWTC -   ON  5.455 TB
-------------------------------------------------------------

Cac=CacheCade|Rec=Recovery|OfLn=OffLine|Pdgd=Partially Degraded|dgrd=Degraded
Optl=Optimal|RO=Read Only|RW=Read Write|HD=Hidden|TRANS=TransportReady|B=Blocked|
Consist=Consistent|R=Read Ahead Always|NR=No Read Ahead|WB=WriteBack|
AWB=Always WriteBack|WT=WriteThrough|C=Cached IO|D=Direct IO|sCC=Scheduled
Check Consistency

Notice the "B" in the Access column for the virtual drive? I had re-enabled the cachecade drive, but the virtual drive itself was still "blocked".
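If you'd rather not eyeball the table, the Access value can be pulled out of the show output with a couple of lines of shell. This is only a sketch: it assumes the column layout shown above, where Access is the fourth whitespace-separated field on the row that starts with the DG/VD numbers, and vd_access is a made-up name.

```shell
#!/bin/sh
# Print "DG/VD Access" for each virtual drive row in `storcli /cX/vY show`
# output read from stdin. Assumes Access is the fourth column, as above.
vd_access() {
    # Data rows start with DG/VD numbers such as "0/1"; headers do not match.
    awk '$1 ~ /^[0-9]+\/[0-9]+$/ { print $1, $4 }'
}

# Example with the row from the output above; on a live host you would pipe
# ./storcli /c1/v1 show into vd_access instead.
printf '%s\n' '0/1   RAID6 Optl  B      Yes     NRWTC -   ON  5.455 TB' | vd_access
```

A value of B here means the virtual drive is blocked, per the legend in the storcli output.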

To fix this, I had to unblock the drive as follows:

[root@vmware:/opt/lsi/storcli] ./storcli /c1/v1 set accesspolicy=rmvblkd
Controller = 1
Status = Success
Description = None

Detailed Status :
===============

------------------------------------------
VD Property  Value   Status  ErrCd ErrMsg
------------------------------------------
 1 AccPolicy RmvBlkd Success     0 -
------------------------------------------


[root@vmware:/opt/lsi/storcli] ./storcli /c1/v1 show
Controller = 1
Status = Success
Description = None


Virtual Drives :
==============

-------------------------------------------------------------
DG/VD TYPE  State Access Consist Cache Cac sCC     Size Name
-------------------------------------------------------------
0/1   RAID6 Optl  RW     Yes     RWBC  -   ON  5.455 TB
-------------------------------------------------------------

Cac=CacheCade|Rec=Recovery|OfLn=OffLine|Pdgd=Partially Degraded|dgrd=Degraded
Optl=Optimal|RO=Read Only|RW=Read Write|HD=Hidden|TRANS=TransportReady|B=Blocked|
Consist=Consistent|R=Read Ahead Always|NR=No Read Ahead|WB=WriteBack|
AWB=Always WriteBack|WT=WriteThrough|C=Cached IO|D=Direct IO|sCC=Scheduled
Check Consistency

Access now shows RW, and wouldn't you know it? After a refresh in the storage menu, my RAID is back to normal with my VMs intact.

Man, am I glad to get this back up and running.
