VMware Cloud Community
ahmad_ali
Contributor

Virtual SAN device is under permanent failure

Hello Everyone,

I need assistance with a vSAN alert.

On one of our clusters we are getting the error "Virtual SAN device is under permanent failure", with the following health checks failing:

- Failed : Physical disk

- Failed : Component metadata health

- Failed : Overall disks health

I have gone through a couple of KB articles and community threads:

VSAN health check - component metadata health

Component metadata health check fails with invalid state error (2145347) | VMware KB

ESXi host :

VMware ESXi 6.0.0 build-3620759

VMware ESXi 6.0.0 Update 2

vSAN Version:

Name        : VMware-vsan-health           Relocations: (not relocatable)

Version     : 6.2.0                             Vendor: VMware, Inc.

Release     : 3547697                       Build Date: Sat Feb 13 03:04:16 2016

Install Date: Thu Oct 13 18:12:01 2016         Build Host: sc-bld-lin1268.eng.vmware.com

Group       : Applications/Management       Source RPM: VMware-vsan-health-6.2.0-3547697.src.rpm

Size        : 52872114                         License: commercial

Signature   : (none)

Summary     : VMware Virtual SAN Health Service

Description :

VMware Virtual SAN Health Service

Distribution: (none)

[Screenshots attached: pastedImage_0.png, pastedImage_1.png, pastedImage_2.png]

vmkernel.log

2017-04-24T10:17:07.853Z cpu16:42460)PLOG: PLOG_QuiesceDevice:8531: : Got quiesce reason 1 on disk naa.600605b00991a3f0202de2c45f900beb:2 5296f94a-d540-efa9-e0e4-d7a2788d97ce

2017-04-24T10:17:07.853Z cpu7:33656)PLOG: PLOG_CleanupElevator:1473: Waiting for Elevator from UUID 5296f94a-d540-efa9-e0e4-d7a2788d97ce

2017-04-24T10:17:07.863Z cpu32:2341680)WARNING: LSOM: LSOMEventNotify:6450: Virtual SAN device 5296f94a-d540-efa9-e0e4-d7a2788d97ce has gone offline.

2017-04-24T10:17:09.857Z cpu4:33662)PLOG: PLOGGarbageCollectDevice:1542: Throttled: Device naa.600605b00991a3f0202de2c45f900beb:1 5296f94a-d540-efa9-e0e4-d7a2788d97ce is prepared to delete

2017-04-24T10:17:09.857Z cpu4:33662)PLOG: PLOG_FreeDevice:325: PLOG in-mem device 0x430cdf26f030 naa.600605b00991a3f0202de2c45f900beb:1 0x419 5296f94a-d540-efa9-e0e4-d7a2788d97ce is being freed SSD 52cec8b9-4703-a9ad-aa5b-eaccb9b6f0e8

2017-04-24T10:17:09.867Z cpu9:33662)PLOG: PLOG_FreeDevice:325: PLOG in-mem device 0x430cdf270070 naa.600605b00991a3f0202de2c45f900beb:2 0x41d 5296f94a-d540-efa9-e0e4-d7a2788d97ce is being freed SSD 52cec8b9-4703-a9ad-aa5b-eaccb9b6f0e8

2017-04-24T10:17:11.369Z cpu36:41665)PLOG: PLOGNotifyDisks:4010: MD 3 with UUID 5296f94a-d540-efa9-e0e4-d7a2788d97ce with state 0 formatVersion 4 backing SSD 52cec8b9-4703-a9ad-aa5b-eaccb9b6f0e8 notified

2017-04-24T10:17:11.418Z cpu0:7034782)PLOG: PLOGGetRecoveredState:6637: Last LSN recoverd 5296f94a-d540-efa9-e0e4-d7a2788d97ce 46544828

2017-04-24T10:17:12.421Z cpu0:7034782)PLOG: PLOG_OpenDevHandles:1228: Registered APD callback for naa.600605b00991a3f0202de2c45f900beb:2 5296f94a-d540-efa9-e0e4-d7a2788d97ce

2017-04-24T10:17:12.424Z cpu0:7034782)PLOG: PLOG_OpenDevHandles:1228: Registered APD callback for naa.600605b00991a3f0202de2c45f900beb:2 5296f94a-d540-efa9-e0e4-d7a2788d97ce

2017-04-24T10:17:12.425Z cpu0:7034782)PLOG: PLOGInitAndAnnounceMD:6987: Successfully announced VSAN MD (naa.600605b00991a3f0202de2c45f900beb:2) with UUID 5296f94a-d540-efa9-e0e4-d7a2788d97ce

2017-04-24T10:17:12.530Z cpu26:43820)WARNING: LSOM: LSOMEventNotify:6440: Virtual SAN device 5296f94a-d540-efa9-e0e4-d7a2788d97ce is under permanent error.

2017-04-24T10:17:07.853Z cpu8:7034742)PLOG: PLOGValidateDiskGroupOpFn:1415: Issuing PLOG Op DISKGROUP UNMOUNT for MD :naa.600605b00991a3f0202de2c45f900beb

2017-04-24T10:17:07.853Z cpu16:42460)PLOG: PLOG_QuiesceDevice:8531: : Got quiesce reason 1 on disk naa.600605b00991a3f0202de2c45f900beb:2 5296f94a-d540-efa9-e0e4-d7a2788d97ce

2017-04-24T10:17:07.853Z cpu32:41665)LSOM: LSOMEventNotify:6413: Throttled: Waiting for component cleanup

2017-04-24T10:17:07.853Z cpu7:33656)PLOG: PLOG_CleanupElevator:1473: Waiting for Elevator from UUID 5296f94a-d540-efa9-e0e4-d7a2788d97ce

2017-04-24T10:17:07.863Z cpu32:2341680)WARNING: LSOM: LSOMEventNotify:6450: Virtual SAN device 5296f94a-d540-efa9-e0e4-d7a2788d97ce has gone offline.

2017-04-24T10:17:07.863Z cpu32:2341680)LSOM: LSOMEventNotify:6519: Throttled: Waiting for open component countto drop to zero

2017-04-24T10:17:07.872Z cpu29:36378)PLOG: PLOGIsPlogUnloading:100: Elevator exit for device is set

2017-04-24T10:17:07.872Z cpu29:36378)PLOG: PLOGElevBaseHandler:617: Elevator exiting due to unload operation

2017-04-24T10:17:07.974Z cpu8:33711)Global: Virsto_DetachInstance:301: INFO: Detaching Virsto Instance 0x430b680a9060 from PLOG device

2017-04-24T10:17:08.855Z cpu21:33659)PLOG: PLOG_CleanupDefence:6346: Waiting for defence task for naa.600605b00991a3f0202de2c45f900beb:1

2017-04-24T10:17:09.856Z cpu21:33659)Destroyed VSAN Slab PLOGIORetry_slab_0000000000 (maxCount=0 failCount=0)

2017-04-24T10:17:09.857Z cpu21:33659)Destroyed VSAN Slab PLOGIORetry_slab_0000000001 (maxCount=1 failCount=0)

2017-04-24T10:17:09.857Z cpu21:33659)ScsiEvents: 353: EventSubsystem: Device Events, Event Mask: 20, Parameter: 0x430cdde547e0, UnRegistered!

2017-04-24T10:17:09.857Z cpu3:7034742)PLOG: PLOGValidateDiskGroupOpFn:1415: Issuing PLOG Op DISKGROUP UNMOUNT for MD :naa.600605b00991a3f0202de2c45f900beb

2017-04-24T10:17:09.857Z cpu4:33662)PLOG: PLOGGarbageCollectDevice:1542: Throttled: Device naa.600605b00991a3f0202de2c45f900beb:1 5296f94a-d540-efa9-e0e4-d7a2788d97ce is prepared to delete

2017-04-24T10:17:09.857Z cpu4:33662)PLOG: PLOG_FreeDevice:325: PLOG in-mem device 0x430cdf26f030 naa.600605b00991a3f0202de2c45f900beb:1 0x419 5296f94a-d540-efa9-e0e4-d7a2788d97ce is being freed SSD 52cec8b9-4703-a9ad-aa5b-eaccb9b6f0e8

2017-04-24T10:17:09.857Z cpu4:33662)PLOG: PLOG_FreeDevice:496: Throttled: Waiting for ops to complete on device: 0x430cdf26f030 naa.600605b00991a3f0202de2c45f900beb:1

2017-04-24T10:17:09.867Z cpu9:33662)PLOG: PLOG_FreeDevice:325: PLOG in-mem device 0x430cdf270070 naa.600605b00991a3f0202de2c45f900beb:2 0x41d 5296f94a-d540-efa9-e0e4-d7a2788d97ce is being freed SSD 52cec8b9-4703-a9ad-aa5b-eaccb9b6f0e8

2017-04-24T10:17:09.867Z cpu9:33662)PLOG: PLOG_FreeDevice:454: Unregistering diskAttrHandle:0x430cdf2708b0 on disk naa.600605b00991a3f0202de2c45f900beb

2017-04-24T10:17:09.867Z cpu9:33662)LSOMCommon: LSOM_UnregisterDiskAttrHandle:136: DiskAttrHandle:0x430cdf2708b0 is removed from moduleID 86 for disk:naa.600605b00991a3f0202de2c45f900beb

2017-04-24T10:17:09.868Z cpu9:33662)Destroyed VSAN Slab PLOGIORetry_slab_0000000000 (maxCount=26 failCount=0)

2017-04-24T10:17:09.868Z cpu9:33662)Destroyed VSAN Slab PLOGIORetry_slab_0000000001 (maxCount=9 failCount=0)

2017-04-24T10:17:09.868Z cpu9:33662)ScsiEvents: 353: EventSubsystem: Device Events, Event Mask: 20, Parameter: 0x430cdf2720d0, UnRegistered!

2017-04-24T10:17:09.906Z cpu28:33528)WARNING: DVFilter: 1181: Couldn't enable keepalive: Not supported

2017-04-24T10:17:09.982Z cpu46:7034760)VSAN Device Monitor: Successfully unmounted failed VSAN disk naa.600605b00991a3f0202de2c45f900beb
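For anyone scanning a longer vmkernel.log for the same pattern, here is a minimal Python sketch that pulls out the LSOM device warnings (offline and permanent error) together with the device UUID. The regular expression is only an assumption based on the line format in the excerpt above, and `find_lsom_events` is a hypothetical helper name, not part of any VMware tool.

```python
import re

# Pattern assumed from the excerpt above, e.g.
# "<timestamp> cpuN:NNN)WARNING: LSOM: LSOMEventNotify:NNNN: Virtual SAN device <uuid> is under permanent error."
LSOM_EVENT = re.compile(
    r"^(?P<ts>\S+)\s+cpu\d+:\d+\)WARNING: LSOM: LSOMEventNotify:\d+: "
    r"Virtual SAN device (?P<uuid>[0-9a-f-]+) "
    r"(?P<state>has gone offline|is under permanent error)\."
)


def find_lsom_events(lines):
    """Return a (timestamp, uuid, state) tuple for each LSOM device warning."""
    events = []
    for line in lines:
        match = LSOM_EVENT.match(line.strip())
        if match:
            events.append((match.group("ts"), match.group("uuid"), match.group("state")))
    return events


# Three lines copied from the log above; the DVFilter warning should be ignored.
sample = [
    "2017-04-24T10:17:07.863Z cpu32:2341680)WARNING: LSOM: LSOMEventNotify:6450: "
    "Virtual SAN device 5296f94a-d540-efa9-e0e4-d7a2788d97ce has gone offline.",
    "2017-04-24T10:17:09.906Z cpu28:33528)WARNING: DVFilter: 1181: "
    "Couldn't enable keepalive: Not supported",
    "2017-04-24T10:17:12.530Z cpu26:43820)WARNING: LSOM: LSOMEventNotify:6440: "
    "Virtual SAN device 5296f94a-d540-efa9-e0e4-d7a2788d97ce is under permanent error.",
]

events = find_lsom_events(sample)
for ts, uuid, state in events:
    print(ts, uuid, state)
```

To run it against a real file, pass the open file object instead of `sample`, since the function only needs an iterable of lines.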

Regards,

Ali

3 Replies
admin
Immortal

Greetings!

This is a drive failure case and you need to replace the faulted drive.


Cheers!

Shivam

SureshKumarMuth
Commander

Check whether the device in question is reported as failed or in predictive failure in the hardware logs. Replace the disk if you see errors at the hardware level.

Ensure the firmware of the devices is supported for vSAN as per the VMware HCL, and update it if required.

Regards,
Suresh
https://vconnectit.wordpress.com/
admin
Immortal

As mentioned above, you need to replace the drive, but make sure you follow the steps from this article:

VMware Virtual SAN Operations: Replacing Disk Devices - Virtual Blocks - VMware Blogs

1. Log in to the vSphere Web Client.

2. Navigate to the Hosts and Clusters view and select the Virtual SAN enabled cluster.

3. Go to the Manage tab and select Disk Management under the Virtual SAN section.

4. Select the disk group with the failed magnetic device.

5. Select the failed magnetic device and click the delete button.

6. Take the failed drive out of the host and replace it. Make sure ESXi detects the new drive, then re-add the newly replaced drive to the disk group.

From your screenshot, you are using a pass-through configuration, so you don't need the extra steps required for RAID 0 devices. The steps above will be enough.
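If you prefer the ESXi shell over the Web Client, the same sequence can be sketched with esxcli. The small Python helper below only builds the command lines so you can review them before running anything on the host; `replacement_commands` is a hypothetical name, and the cache SSD and new disk identifiers are placeholders because neither appears in this thread. Treat it as a rough equivalent of the steps above, not as the procedure from the linked article, and check the options against your ESXi build.

```python
def replacement_commands(failed_disk, cache_ssd, new_disk):
    """Build esxcli command lines for swapping one capacity disk in a disk group.

    All three arguments are device names (naa.* identifiers). Nothing is
    executed here; the function only returns strings for review.
    """
    return [
        # Show current vSAN disks so the failed device and its disk group can be confirmed.
        "esxcli vsan storage list",
        # Remove the failed capacity disk from its disk group.
        f"esxcli vsan storage remove -d {failed_disk}",
        # After physically swapping the drive, rescan so ESXi detects the new device.
        "esxcli storage core adapter rescan --all",
        # Add the new capacity disk to the disk group backed by the existing cache SSD.
        f"esxcli vsan storage add -s {cache_ssd} -d {new_disk}",
    ]


commands = replacement_commands(
    failed_disk="naa.600605b00991a3f0202de2c45f900beb",  # failed disk from the log above
    cache_ssd="naa.CACHE_SSD_ID",                        # placeholder, not from this thread
    new_disk="naa.NEW_DISK_ID",                          # placeholder, not from this thread
)
for command in commands:
    print(command)
```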
