VMware Cloud Community
jackchentoronto
Enthusiast

removing failed diskgroup from vSAN got "Operation timed out"

We have had two "permanent disk failure" incidents in the last three months, on two Micron SSD cards in two different hosts (almost brand-new hardware that complies with the vSAN HCL). Micron's application shows the disks are healthy. So far, the only explanation I have received from VMware is "this appears to have been a temporary hiccup".

Right now I can't even remove the faulty diskgroup (vSAN is configured for manual disk claiming). I put this host into maintenance mode, rebooted it, then tried to remove the "faulty" SSD so that I could remove the diskgroup, but it always fails with "Operation timed out".

Now I remember: the last time I had this "permanent disk failure" problem, I had to physically pull the SSD card out of the host before I could delete the diskgroup. Is this the only way to do it?

I am tempted to follow "VSAN Part 16 - Reclaiming disks for other uses" on CormacHogan.com and just wipe out the vSAN partition now (all VMs were already migrated off this host when I put it into maintenance mode).

Getting very frustrated with vSAN now.

1 Solution

Accepted Solutions
jackchentoronto
Enthusiast

Thanks, CHogan.

The SR number is 16865002201. I just had a long WebEx with Jesse from VMware support, who was very helpful.

We tried VMware's "secret" disk wipe-out weapon, but that didn't work either. Finally we used the Micron application rssdm to secure-erase the disk, and then the diskgroup was rebuilt.

Now our vSAN is back to normal.


8 Replies
zdickinson
Expert

Good morning. Do you have it set to automatically claim disks for vSAN? If so, try setting it to manual. I have seen a removal fail because, while the disk is being removed, vSAN is also trying to reclaim it. Thank you, Zach.

jackchentoronto
Enthusiast

Our vSAN is configured for manual disk claiming.

"esxcli vsan storage remove -s t10.ATA_____Micron_P420m2DMTFDGAR700MAX______________0000000015050F0B8C24" also just hangs.

I think that when vSAN tries to remove the disk, it wants to write something to the disk, and that write just hangs.

SSD disk manager shows the disk is healthy:

"/opt/micron/bin/rssdm -L -n 1"

Drive Id             : 1
Device Name          : mtip_rssd1
Model No             : Micron P420m-MTFDGAR700MAX
Serial No            : 0000000015050F0B8C24
FW-Rev               : B2180108
Total Size           : 700.15GB
Drive Status         : Drive is in good health
PCI Path (B:D.F)     : 44:00.0
Vendor               : Micron
Temp(C)              : 61

Drive Id     : 1
Drive information is retrieved successfully
CMD_STATUS   : Success
STATUS_CODE  : 0
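
Since the drive reports good health while vSAN keeps failing to write to it, it can be handy to pull just the status field out of the rssdm report when comparing several hosts. A minimal shell sketch, assuming the report keeps the "Field : value" layout quoted above; the function name drive_status is made up for illustration and is not part of the Micron tooling:

```shell
# Hypothetical helper: print the value of the "Drive Status" field from an
# rssdm -L report supplied on stdin. Assumes "Field : value" lines as quoted above.
drive_status() {
    awk '/^Drive Status[ \t]*:/ {
        sub(/^[^:]*:[ \t]*/, "")   # drop the field name and the colon
        sub(/[ \t\r]+$/, "")       # trim trailing blanks or a stray CR
        print
        exit
    }'
}

# Usage on the host (command taken from the post above):
#   /opt/micron/bin/rssdm -L -n 1 | drive_status
```

Note that this only reports what the vendor tool says; as this thread shows, a "good health" status does not guarantee that vSAN can write to the device.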

jackchentoronto
Enthusiast

It just doesn't want to go away 😞

I tried to follow CHogan's suggestion in "VSAN Part 16 - Reclaiming disks for other uses" (CormacHogan.com), but got stuck on:

vmkload_mod -u vsan
vmkload_mod -u plog
vmkload_mod -u lsomcommon

The vsan module unloaded successfully, but the other two modules would not unload and gave me these errors:

vmkload_mod: Can not remove module plog: module symbols in use
vmkload_mod: Can not remove module lsomcommon: module symbols in use

and when I tried to wipe out the partition, I got:

partedUtil delete /dev/disks/t10.ATA_____Micron_P420m2DMTFDGAR700MAX______________0000000015050F0B8C24 1
Error: Read-only file system during write on /dev/disks/t10.ATA_____Micron_P420m2DMTFDGAR700MAX______________0000000015050F0B8C24
Unable to delete partition 1 from device /dev/disks/t10.ATA_____Micron_P420m2DMTFDGAR700MAX______________0000000015050F0B8C24

The vmkernel log shows the same warning repeating constantly:
2016-01-27T16:04:10.039Z cpu2:32935)WARNING: LSOMCommon: SSDLOG_WriteLogEntry:592: Throttled: Log has encountered (Maximum kernel-level retries exceeded) error device: t10.ATA_____Micron_P420m2DMTFDGAR700MAX______________0000000015050F0B8C24:2
2016-01-27T16:04:40.043Z cpu3:32884)WARNING: LSOMCommon: SSDLOG_WriteLogEntry:592: Throttled: Log has encountered (Maximum kernel-level retries exceeded) error device: t10.ATA_____Micron_P420m2DMTFDGAR700MAX______________0000000015050F0B8C24:2
2016-01-27T16:05:10.047Z cpu15:32813)WARNING: LSOMCommon: SSDLOG_WriteLogEntry:592: Throttled: Log has encountered (Maximum kernel-level retries exceeded) error device: t10.ATA_____Micron_P420m2DMTFDGAR700MAX______________0000000015050F0B8C24:2

I guess I have to go physical now 😞
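
To see how often the warning quoted above repeats and which device it names, the log can be summarized per device. This is only a sketch: the function name count_ssdlog_throttles is hypothetical, and it assumes the warning keeps the "error device: <name>" wording shown in the excerpt.

```shell
# Hypothetical helper: count "SSDLOG_WriteLogEntry ... Throttled" warnings per
# device in vmkernel log text read from stdin. Prints "<count> <device>" lines.
count_ssdlog_throttles() {
    grep 'SSDLOG_WriteLogEntry' |
        grep 'Throttled' |
        sed -n 's/.*error device: \([^ ]*\).*/\1/p' |
        sort | uniq -c |
        awk '{ print $1, $2 }'   # normalize the padding that uniq -c adds
}

# Usage on the host (assumes the usual ESXi log location):
#   count_ssdlog_throttles < /var/log/vmkernel.log
```

A count that keeps growing for a single device would be consistent with the guess earlier in the thread that writes to that SSD are hanging.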

jackchentoronto
Enthusiast

It's getting more and more unpleasant now 😞

I physically removed the SSD from the host and booted the machine. It showed the SSD as absent, and I was then able to delete the faulty diskgroup. Then I shut the machine down and put the "faulty" SSD back; vSAN recognized it as the already-deleted diskgroup, with 6 absent disks.

I tried to delete it, but it won't go. I tried to manually wipe out the partition, but I am still stuck at:

vmkload_mod -u plog
vmkload_mod -u lsomcommon

2016-01-27T21:09:50.126Z cpu24:37977)WARNING: Mod: 5084: Unload of module plog failed : Busy
2016-01-27T21:11:44.461Z cpu18:38068)WARNING: Mod: 5084: Unload of module lsomcommon failed : Busy

partedUtil delete /dev/disks/t10.ATA_____Micron_P420m2DMTFDGAR700MAX______________0000000015050F0B8C24 1
Error: Read-only file system during write on /dev/disks/t10.ATA_____Micron_P420m2DMTFDGAR700MAX______________0000000015050F0B8C24
Unable to delete partition 1 from device /dev/disks/t10.ATA_____Micron_P420m2DMTFDGAR700MAX______________0000000015050F0B8C24

So vSAN can’t do anything with it, while Micron’s application shows the disk is healthy.

I opened a support ticket with VMware a week ago. So far, apart from a first call from a general engineer (not vSAN support), I have only received several emails and not much helpful information.

CHogan
VMware Employee

Is this node still in the cluster? "esxcli vsan cluster get"

If it is still part of the cluster, use "esxcli vsan cluster leave"

You won't be able to unload those modules if VSAN is still running, and those modules have a hold on the disk so you won't be able to delete the partition.

http://cormachogan.com
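
The order suggested above (check membership, leave the cluster, then unload the modules) can be sketched as a small guard. The helper name still_in_vsan_cluster is hypothetical. It keys off the "VSAN Clustering is not enabled on this host" message that appears later in this thread, and treating any other output as "still a member" is an assumption.

```shell
# Hypothetical helper: succeed (exit 0) if the "esxcli vsan cluster get" text on
# stdin suggests the host is still a cluster member. Only the "not enabled"
# message quoted in this thread is recognized; anything else counts as a member.
still_in_vsan_cluster() {
    if grep -q 'VSAN Clustering is not enabled on this host'; then
        return 1
    fi
    return 0
}

# Usage on the host (review before running; leaving the cluster is disruptive):
#   if esxcli vsan cluster get 2>&1 | still_in_vsan_cluster; then
#       esxcli vsan cluster leave
#   fi
#   vmkload_mod -u vsan
```

The guard only covers the first step; as the next reply shows, plog and lsomcommon can still refuse to unload after the host has left the cluster.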
jackchentoronto
Enthusiast

Did run "esxcli vsan cluster leave" .


esxcli vsan cluster get
VSAN Clustering is not enabled on this host

vmkload_mod -l | grep "vsan\|plog\|comm"
vsanbase                 9    64
vsanutil                 7    160
lsomcommon               4    260
plog                     8    424
vsan                     0    1068

vmkload_mod -u vsan
Module vsan successfully unloaded

vmkload_mod -u plog
vmkload_mod: Can not remove module plog: module symbols in use

vmkload_mod -u lsomcommon
vmkload_mod: Can not remove module lsomcommon: module symbols in use

Maybe there is another module. I just don't know which one.
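
One way to narrow this down is to list only the modules that still show a non-zero use count. A rough sketch follows; the name modules_in_use is made up, and reading the second column of the listing above as the use count is an assumption (it is at least consistent with vsan showing 0 and unloading cleanly).

```shell
# Hypothetical helper: print the names of modules whose use count is non-zero,
# reading "vmkload_mod -l" style lines (name, use count, size) from stdin.
# Treating column 2 as the use count is an assumption based on the listing above.
modules_in_use() {
    awk 'NF >= 3 && $2 ~ /^[0-9]+$/ && $2 + 0 > 0 { print $1 }'
}

# Usage on the host (filter taken from the post above):
#   vmkload_mod -l | grep "vsan\|plog\|comm" | modules_in_use
```

For the listing above, taken before vsan was unloaded, this prints vsanbase, vsanutil, lsomcommon and plog. It does not say what is holding the references, only which modules are still held.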

CHogan
VMware Employee

Well, to be honest, we shouldn't even need to be here.

Can you share the SR number here? I will ask the support guys to follow up with you.

http://cormachogan.com