We have experienced two "permanent disk failure" incidents in the last three months, on two Micron SSD cards in two different hosts (almost brand-new hardware that complies with the vSAN HCL). Micron's application shows the disks are healthy. So far, the only explanation I have got from VMware is "this appears to have been a temporary hiccup".
Right now I can't even remove the faulty diskgroup (vSAN is configured for manual disk claiming). I put this host into maintenance mode, rebooted it, then tried to remove the "faulty" SSD so I could remove the diskgroup, but it always fails with "operation timed out".
Now I remember: the last time I had this "permanent disk failure" problem, I had to physically remove the SSD card from the host before I could delete the diskgroup. Is this the only way to do it?
I'm tempted to follow "VSAN Part 16 - Reclaiming disks for other uses" on CormacHogan.com and just wipe out the vSAN partition now (all VMs were already migrated off this host when I put it into maintenance mode).
Getting very frustrated with vSAN now.
Good morning, do you have it set to automatically claim disks for vSAN? If so, try setting it to manual. I have seen a removal fail because as it's being removed, it's also trying to reclaim the disks for vSAN. Thank you, Zach.
Our vSAN is configured as manual disk claiming.
"esxcli vsan storage remove -s t10.ATA_____Micron_P420m2DMTFDGAR700MAX______________0000000015050F0B8C24" also just hangs.
I think when vSAN tries to remove the disk, it wants to write something to the disk, but that operation just hangs.
SSD disk manager shows the disk is healthy:
"/opt/micron/bin/rssdm -L -n 1"
Drive Id : 1
Device Name : mtip_rssd1
Model No : Micron P420m-MTFDGAR700MAX
Serial No : 0000000015050F0B8C24
FW-Rev : B2180108
Total Size : 700.15GB
Drive Status : Drive is in good health
PCI Path (B:D.F) : 44:00.0
Vendor : Micron
Temp(C) : 61
Drive Id : 1
Drive information is retrieved successfully
CMD_STATUS : Success
STATUS_CODE : 0
It just doesn't want to go away 😞
Tried to follow CHogan's suggestion (VSAN Part 16 - Reclaiming disks for other uses - CormacHogan.com), but got stuck on:
vmkload_mod -u vsan
vmkload_mod -u plog
vmkload_mod -u lsomcommon
The vsan module was unloaded successfully, but the other two modules won't unload and gave me these errors:
vmkload_mod: Can not remove module plog: module symbols in use
vmkload_mod: Can not remove module lsomcommon: module symbols in use
And when I tried to wipe out the partition, I got:
partedUtil delete /dev/disks/t10.ATA_____Micron_P420m2DMTFDGAR700MAX______________0000000015050F0B8C24 1
Error: Read-only file system during write on /dev/disks/t10.ATA_____Micron_P420m2DMTFDGAR700MAX______________0000000015050F0B8C24
Unable to delete partition 1 from device /dev/disks/t10.ATA_____Micron_P420m2DMTFDGAR700MAX______________0000000015050F0B8C24
The vmkernel log shows this entry repeated constantly:
2016-01-27T16:04:10.039Z cpu2:32935)WARNING: LSOMCommon: SSDLOG_WriteLogEntry:592: Throttled: Log has encountered (Maximum kernel-level retries exceeded) error device: t10.ATA_____Micron_P420m2DMTFDGAR700MAX______________0000000015050F0B8C24:2
2016-01-27T16:04:40.043Z cpu3:32884)WARNING: LSOMCommon: SSDLOG_WriteLogEntry:592: Throttled: Log has encountered (Maximum kernel-level retries exceeded) error device: t10.ATA_____Micron_P420m2DMTFDGAR700MAX______________0000000015050F0B8C24:2
2016-01-27T16:05:10.047Z cpu15:32813)WARNING: LSOMCommon: SSDLOG_WriteLogEntry:592: Throttled: Log has encountered (Maximum kernel-level retries exceeded) error device: t10.ATA_____Micron_P420m2DMTFDGAR700MAX______________0000000015050F0B8C24:2
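A quick way to see how often each device is hitting this warning is to count the entries per device. This is only a sketch: the function name count_throttled is made up for illustration, and it assumes the device ID is always the last whitespace-separated field, as in the entries quoted above.

```shell
# Count LSOMCommon "Throttled" warnings per device in vmkernel-style log
# lines read from stdin. Assumption: the device ID is the last field on the
# line, as in the log entries quoted in this thread.
count_throttled() {
    awk '/LSOMCommon: SSDLOG_WriteLogEntry/ && /Throttled/ { n[$NF]++ }
         END { for (d in n) print n[d], d }'
}
```

It could be fed from the host's log, for example `count_throttled < /var/log/vmkernel.log`, assuming the log lives at that path on your build.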
I guess I have to go physical now 😞
It's getting more and more unpleasant now 😞
I physically removed the SSD from the host and booted the machine. It showed the SSD as absent, and I was then able to delete the faulty diskgroup. Then I shut the machine down and put the "faulty" SSD back; vSAN recognized it as the already-deleted diskgroup with 6 absent disks.
Tried to delete it, but it won't go. Tried to manually wipe out the partition, but I'm still stuck at:
vmkload_mod -u plog
vmkload_mod -u lsomcommon
2016-01-27T21:09:50.126Z cpu24:37977)WARNING: Mod: 5084: Unload of module plog failed : Busy
2016-01-27T21:11:44.461Z cpu18:38068)WARNING: Mod: 5084: Unload of module lsomcommon failed : Busy
partedUtil delete /dev/disks/t10.ATA_____Micron_P420m2DMTFDGAR700MAX______________0000000015050F0B8C24 1
Error: Read-only file system during write on /dev/disks/t10.ATA_____Micron_P420m2DMTFDGAR700MAX______________0000000015050F0B8C24
Unable to delete partition 1 from device /dev/disks/t10.ATA_____Micron_P420m2DMTFDGAR700MAX______________0000000015050F0B8C24
So vSAN can't do anything with it. Meanwhile, Micron's application shows the disk is healthy.
I opened a support ticket with VMware a week ago. So far, besides the first call from a general engineer (not vSAN support), I have only received several emails and not much helpful information.
Is this node still in the cluster? "esxcli vsan cluster get"
If it is still part of the cluster, use "esxcli vsan cluster leave"
You won't be able to unload those modules if VSAN is still running, and those modules have a hold on the disk so you won't be able to delete the partition.
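The order of operations described above can be written down as a dry-run checklist. This is a minimal sketch, not a supported procedure: plan_vsan_disk_cleanup is a hypothetical helper that only prints the commands already mentioned in this thread, in sequence, so each one can be reviewed and run by hand.

```shell
# Print (do not run) the cleanup sequence discussed in this thread for one
# device. Nothing here touches the host. The function name and the choice of
# partition 1 are taken from the thread or made up for illustration.
plan_vsan_disk_cleanup() {
    dev="$1"
    if [ -z "$dev" ]; then
        echo "usage: plan_vsan_disk_cleanup <device-id>" >&2
        return 1
    fi
    printf '%s\n' \
        "esxcli vsan cluster get" \
        "esxcli vsan cluster leave" \
        "vmkload_mod -u vsan" \
        "vmkload_mod -u plog" \
        "vmkload_mod -u lsomcommon" \
        "partedUtil delete /dev/disks/$dev 1"
}
```

Printing rather than executing is deliberate: each step can fail or hang on its own, so it is safer to check the result of one before moving to the next.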
I did run "esxcli vsan cluster leave".
esxcli vsan cluster get
VSAN Clustering is not enabled on this host
vmkload_mod -l | grep "vsan\|plog\|comm"
vsanbase 9 64
vsanutil 7 160
lsomcommon 4 260
plog 8 424
vsan 0 1068
vmkload_mod -u vsan
Module vsan successfully unloaded
vmkload_mod -u plog
vmkload_mod: Can not remove module plog: module symbols in use
vmkload_mod -u lsomcommon
vmkload_mod: Can not remove module lsomcommon: module symbols in use
Maybe there is another module. I just don't know which one.
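One way to narrow it down is to list every module that still reports a non-zero count. This is a rough sketch: modules_in_use is a made-up helper, and it assumes the second column of the "vmkload_mod -l" output is the use count, which matches vsan showing 0 just before it unloaded cleanly.

```shell
# List modules whose second column is non-zero, reading "vmkload_mod -l"
# style rows from stdin. Assumption: column 2 is the use count; header or
# malformed rows (non-numeric column 2) are skipped.
modules_in_use() {
    awk 'NF >= 3 && $2 ~ /^[0-9]+$/ && $2 > 0 { print $1 }'
}
```

For example, `vmkload_mod -l | modules_in_use` would print every loaded module still referenced by something else, which may help spot what is holding plog and lsomcommon.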
Well, to be honest, we shouldn't even need to be here.
Can you share the SR number here? I will ask support guys to follow up with you.
Thanks Chogan.
The SR number is 16865002201. I just had a long WebEx with Jesse from VMware support; he was very helpful.
We tried VMware's "secret" disk wipeout weapon, but it also didn't work. Finally we used Micron's rssdm application to secure-erase the disk, and then the diskgroup was rebuilt.
Now our vSAN is back to normal.