VSAN disk group failure "Propagated permanent disk failure in disk group"

ManivelR · ‎12-17-2019

Hi All,

I have a vSAN setup with 4 nodes (all-flash vSAN). Each host has only one disk group (1 * 1 TB for cache and 2 * 1 TB for capacity).

On one host we hit the issue "Propagated permanent disk failure in disk group".

In IPMI we are not seeing any disk failures; all 3 * 1 TB disks look good and no errors are reported. When we delete the disk group and recreate it with the same 3 * 1 TB disks, it fails again: after the disk group is created, during component resync, it reports the same "Propagated permanent disk failure in disk group" issue. Any idea how to fix this?

As per the screenshot, it looks like a physical disk issue. How have all 3 of these disks gone bad? Any ideas?

Host cluster screenshot

2019-12-18T06:20:10.373Z: [vSANCorrelator] 1414859421us: [esx.problem.vob.vsan.lsom.diskerror] vSAN device 52223d82-f6f3-940f-18d7-fb732e4f8afa is under permanent error.

2019-12-18T06:20:10.373Z: [vSANCorrelator] 1414850610us: [vob.vsan.lsom.diskerror] vSAN device 52223d82-f6f3-940f-18d7-fb732e4f8afa is under permanent error.

2019-12-18T06:20:10.373Z: [vSANCorrelator] 1414859556us: [esx.problem.vob.vsan.lsom.diskerror] vSAN device 52223d82-f6f3-940f-18d7-fb732e4f8afa is under permanent error.

2019-12-18T06:20:10.374Z: [vSANCorrelator] 1414850692us: [vob.vsan.lsom.diskpropagatedpermerror] vSAN device 5216a066-08a1-76fd-ef8b-a71795499031 is under propagated permanent error.

2019-12-18T06:20:10.374Z: [vSANCorrelator] 1414859604us: [esx.problem.vob.vsan.lsom.diskpropagatedpermerror] vSAN device 5216a066-08a1-76fd-ef8b-a71795499031 is under propagated permanent error.

2019-12-18T06:20:10.374Z: [vSANCorrelator] 1414850711us: [vob.vsan.lsom.diskpropagatedpermerror] vSAN device 52348100-2e98-e5f1-5678-0175ec7b1b37 is under propagated permanent error.

2019-12-18T06:20:10.374Z: [vSANCorrelator] 1414859674us: [esx.problem.vob.vsan.lsom.diskpropagatedpermerror] vSAN device 52348100-2e98-e5f1-5678-0175ec7b1b37 is under propagated permanent error.

2019-12-18T06:20:10.372Z cpu14:2102350)WARNING: PLOG: DDPCompleteDDPWrite:3015: Throttled: DDP write failed I/O error callback PLOGDDPCallbackFn@com.vmware.plog#0.0.0.1, diskgroup 52348100-2e98-e5f1-5678-0175ec7b1b37

2019-12-18T06:20:10.372Z cpu14:2102350)WARNING: PLOG: PLOGDDPCallbackFn:239: Throttled: DDP write failed on device 52223d82-f6f3-940f-18d7-fb732e4f8afa :I/O error

2019-12-18T06:20:10.372Z cpu29:2098599)WARNING: PLOG: PLOGPropagateError:2899: DDP: Propagating error state from original device 52223d82-f6f3-940f-18d7-fb732e4f8afa

2019-12-18T06:20:10.372Z cpu29:2098599)WARNING: PLOG: PLOGPropagateError:2941: DDP: Propagating error state to MDs in device 52348100-2e98-e5f1-5678-0175ec7b1b37

2019-12-18T06:20:10.373Z cpu29:2098599)WARNING: PLOG: PLOGPropagateErrorInt:2840: Permanent error event on 52223d82-f6f3-940f-18d7-fb732e4f8afa

2019-12-18T06:20:10.373Z cpu5:2102401)LSOM: LSOMLogDiskEvent:5668: Disk Event permanent error for MD 52223d82-f6f3-940f-18d7-fb732e4f8afa (naa.600304801c924d01243f90db170293fa:2)

2019-12-18T06:20:10.373Z cpu5:2102401)WARNING: LSOM: LSOMEventNotify:6976: vSAN device 52223d82-f6f3-940f-18d7-fb732e4f8afa is under permanent error.

2019-12-18T06:20:10.373Z cpu5:2102401)LSOM: LSOMLogDiskEvent:5668: Disk Event permanent error propagated for MD 5216a066-08a1-76fd-ef8b-a71795499031 (naa.600304801c924d0123f85a0c2533d4ad:2)

2019-12-18T06:20:10.373Z cpu5:2102401)WARNING: LSOM: LSOMEventNotify:6987: vSAN device 5216a066-08a1-76fd-ef8b-a71795499031 is under propagated permanent error.

2019-12-18T06:20:10.373Z cpu29:2098599)WARNING: PLOG: PLOGPropagateErrorInt:2856: Error/unhealthy propagate event on 5216a066-08a1-76fd-ef8b-a71795499031

2019-12-18T06:20:10.373Z cpu5:2102401)LSOM: LSOMLogDiskEvent:5668: Disk Event permanent error propagated for SSD 52348100-2e98-e5f1-5678-0175ec7b1b37 (naa.600304801c924d0123f859d621fac674:2)

2019-12-18T06:20:10.373Z cpu5:2102401)WARNING: LSOM: LSOMEventNotify:6987: vSAN device 52348100-2e98-e5f1-5678-0175ec7b1b37 is under propagated permanent error.

2019-12-18T06:20:10.373Z cpu29:2098599)WARNING: PLOG: PLOGPropagateErrorInt:2856: Error/unhealthy propagate event on 52348100-2e98-e5f1-5678-0175ec7b1b37

Thanks,

Manivel RR

T180985 · ‎12-17-2019

It's not uncommon to see a difference between what OOB management shows and what vSAN sees. vSAN can judge a disk to be failing if the disk's metrics fall outside its own specific requirements, which may be stricter than the hardware vendor's.
I would raise a support request with your hardware vendor and explain that you're using the disks for vSAN and that they're starting to report as failed.


TheBobkin · ‎12-18-2019

Hello Manivel,

"In IPMI we are not seeing any disk failures; all 3 * 1 TB disks look good and no errors are reported. When we delete the disk group and recreate it with the same 3 * 1 TB disks, it fails again: after the disk group is created, during component resync, it reports the same "Propagated permanent disk failure in disk group" issue. Any idea how to fix this?"

Simply recreating a Disk-Group won't fix a physically impaired disk, which is why it fails again: the underlying problem has not been dealt with.

"How have all 3 of these disks gone bad?"

They almost certainly haven't. The log clearly states that this is a propagated failure (as dedupe is enabled) from I/O errors on a single device (naa.600304801c924d01243f90db170293fa - 52223d82-f6f3-940f-18d7-fb732e4f8afa).
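A minimal sketch of how the events quoted in this thread could be sorted to find the originating device. It assumes the log lines keep the exact "is under permanent error" / "is under propagated permanent error" wording shown above; the function name `classify_devices` and the `sample` list are illustrative and not part of any VMware tool.

```python
import re

# Matches the event lines quoted in this thread, e.g.
# "... vSAN device <uuid> is under permanent error."
# The longer alternative is listed first so "propagated permanent" wins.
EVENT_RE = re.compile(
    r"vSAN device (?P<uuid>[0-9a-f-]{36}) is under "
    r"(?P<kind>propagated permanent|permanent) error",
    re.IGNORECASE,
)


def classify_devices(log_lines):
    """Return (origin, propagated) sets of vSAN device UUIDs.

    origin:     devices that logged a plain "permanent error"
    propagated: devices that only inherited the error state
    """
    origin, propagated = set(), set()
    for line in log_lines:
        match = EVENT_RE.search(line)
        if match is None:
            continue
        if match.group("kind").lower() == "permanent":
            origin.add(match.group("uuid"))
        else:
            propagated.add(match.group("uuid"))
    # A device that failed on its own is not counted as "propagated".
    return origin, propagated - origin


# Three of the LSOMEventNotify lines from the original post.
sample = [
    "2019-12-18T06:20:10.373Z cpu5:2102401)WARNING: LSOM: LSOMEventNotify:6976: "
    "vSAN device 52223d82-f6f3-940f-18d7-fb732e4f8afa is under permanent error.",
    "2019-12-18T06:20:10.373Z cpu5:2102401)WARNING: LSOM: LSOMEventNotify:6987: "
    "vSAN device 5216a066-08a1-76fd-ef8b-a71795499031 is under propagated permanent error.",
    "2019-12-18T06:20:10.373Z cpu5:2102401)WARNING: LSOM: LSOMEventNotify:6987: "
    "vSAN device 52348100-2e98-e5f1-5678-0175ec7b1b37 is under propagated permanent error.",
]

origin, propagated = classify_devices(sample)
print("origin:", sorted(origin))
print("propagated:", sorted(propagated))
```

Run against the lines in the original post, only one UUID lands in `origin`; the other two devices in the disk group appear only in `propagated`.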

Prior to this in the vmkernel.log you will likely see the actual reason we kicked it out (e.g. Sense codes for Medium Error 0x3 0x11 or other hardware failure 0x4 0xXX).

Recreate the Disk-Group without the failing device, replace the failing device at the earliest convenience, and add the new replacement device to the Disk-Group.

Bob

ManivelR · ‎01-01-2020

Sorry for the late reply. I was on vacation.

Thanks very much @T180985 & Bob for your messages.

I saw multiple SCSI sense codes in vmkernel.log:

H:0x0 D:0x2 P:0x0 Valid sense data: 0x3 0x11 0x0 and H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.

2019-12-18T05:19:20.416Z cpu20:2098012)ScsiDeviceIO: 3082: Cmd(0x45a4facec940) 0x28, CmdSN 0x16818d6 from world 0 to dev "naa.600304801c924d01243f90db170293fa" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x3 0x11 0x0.

2019-12-18T05:19:20.416Z cpu26:6494577)WARNING: PLOG: DDPCompleteDDPWrite:3015: Throttled: DDP write failed I/O error callback PLOGDDPCallbackFn@com.vmware.plog#0.0.0.1, diskgroup 5267fed9-a8aa-c7cd-6fb8-a918a803ae99

2019-12-18T05:21:36.523Z cpu17:2098012)ScsiDeviceIO: 3068: Cmd(0x45a4ff3c3300) 0x1a, CmdSN 0x3603 from world 2099168 to dev "naa.600304801c924d0123f859d621fac674" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.

2019-12-18T05:21:36.583Z cpu17:2098012)ScsiDeviceIO: 3068: Cmd(0x45a4ff3c3300) 0x1a, CmdSN 0x3604 from world 2099168 to dev "naa.600304801c924d01243f90db170293fa" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.

2019-12-18T05:21:36.632Z cpu17:2098012)ScsiDeviceIO: 3068: Cmd(0x45a4ff3c3300) 0x1a, CmdSN 0x3605 from world 2099168 to dev "naa.600304801c924d0123f85a0c2533d4ad" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.

2019-12-18T05:22:14.975Z cpu19:6494636)WARNING: LSOM: LSOMVsiGetVirstoInstanceStats:800: Throttled: Attempt to get Virsto stats on unsupported disk5267fed9-a8aa-c7cd-6fb8-a918a803ae99

However, we managed to fix the issue with the same disks.

In IPMI, all the disks were reported as normal (green); however, on the right side of each disk the locator LED was shown as "RED". This was odd, and none of the disks shows a red locator LED now.

These steps resolved the issue:

1) Deleted the disk group after the recurrent disk group failure.

2) Marked all 3 * 1 TB SSD disks as HDD and then marked them back as "FLASH" again.

3) While recreating the disk group, selected the first 2 disks as capacity and the last disk as cache.

4) Component resync started and completed in 5 hours.

In the meantime, we are going to replace all 3 SSD disks ASAP.

Thanks for your time and Happy New Year 2020.

Manivel RR

TheBobkin · ‎01-10-2020

Hello Manivel,

'H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0'

This is a completely benign sense code and can be safely ignored. You don't need to replace all the SSDs, just the one that showed medium errors (0x3 0x11).
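For reference, in standard SCSI terms sense key 0x3 is MEDIUM ERROR and 0x5 is ILLEGAL REQUEST; the failing commands in the log above are opcode 0x28 (READ(10)) for the medium error and 0x1a (MODE SENSE(6)) for the 0x5 0x24 entries. The following is a rough sketch of how the `ScsiDeviceIO` lines could be filtered so that only medium or hardware errors are flagged per device. It assumes the line layout quoted in this thread; the names `suspect_devices` and `SUSPECT_KEYS` are made up for illustration.

```python
import re
from collections import defaultdict

# Assumed layout, taken from the ScsiDeviceIO lines quoted in this thread:
#   Cmd(0x...) 0x28, ... to dev "naa...." failed H:0x0 D:0x2 P:0x0
#   Valid sense data: 0x3 0x11 0x0.
SCSI_RE = re.compile(
    r'Cmd\(0x[0-9a-f]+\) (?P<op>0x[0-9a-f]+),.*?'
    r'to dev "(?P<dev>[^"]+)" failed '
    r'H:0x[0-9a-f]+ D:0x[0-9a-f]+ P:0x[0-9a-f]+ '
    r'Valid sense data: '
    r'(?P<key>0x[0-9a-f]+) (?P<asc>0x[0-9a-f]+) (?P<ascq>0x[0-9a-f]+)',
    re.IGNORECASE,
)

# Sense keys that point at the device itself (0x3 MEDIUM ERROR,
# 0x4 HARDWARE ERROR). 0x5 ILLEGAL REQUEST is deliberately left out.
SUSPECT_KEYS = {0x3, 0x4}


def suspect_devices(log_lines):
    """Map device name -> list of (sense_key, asc, ascq) worth investigating."""
    hits = defaultdict(list)
    for line in log_lines:
        match = SCSI_RE.search(line)
        if match is None:
            continue
        key = int(match.group("key"), 16)
        if key in SUSPECT_KEYS:
            asc = int(match.group("asc"), 16)
            ascq = int(match.group("ascq"), 16)
            hits[match.group("dev")].append((key, asc, ascq))
    return dict(hits)


# Three of the vmkernel.log lines posted earlier in the thread.
sample = [
    '2019-12-18T05:19:20.416Z cpu20:2098012)ScsiDeviceIO: 3082: Cmd(0x45a4facec940) 0x28, '
    'CmdSN 0x16818d6 from world 0 to dev "naa.600304801c924d01243f90db170293fa" failed '
    'H:0x0 D:0x2 P:0x0 Valid sense data: 0x3 0x11 0x0.',
    '2019-12-18T05:21:36.523Z cpu17:2098012)ScsiDeviceIO: 3068: Cmd(0x45a4ff3c3300) 0x1a, '
    'CmdSN 0x3603 from world 2099168 to dev "naa.600304801c924d0123f859d621fac674" failed '
    'H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.',
    '2019-12-18T05:21:36.583Z cpu17:2098012)ScsiDeviceIO: 3068: Cmd(0x45a4ff3c3300) 0x1a, '
    'CmdSN 0x3604 from world 2099168 to dev "naa.600304801c924d01243f90db170293fa" failed '
    'H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.',
]

for dev, errors in suspect_devices(sample).items():
    print(dev, [tuple(hex(v) for v in e) for e in errors])
```

With the lines from this thread, only the device that logged 0x3 0x11 is flagged; the 0x5 0x24 entries on all three devices are skipped.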

Bob

ManivelR · ‎02-21-2020

Thanks Bob for the response.

Much appreciated.
