VMware Cloud Community
tinnh
Contributor

Operation timed out when removing disk from vSAN

An SSD cache disk in my cluster failed. I tried to remove it so I could deal with it, but the operation gets stuck and ends with an operation timeout. I tried two options, Full data migration and Ensure accessibility, and neither worked.

An event appeared before the timeout event.

How can I resolve this issue, and is it completely safe to physically replace the failed SSD?

Capture.JPG

4 Replies
TheBobkin
Champion

Hello tinnh,

The first thing to do in this situation is to verify the data health e.g. that all Objects that had components residing on the failed Disk-Group have been rebuilt on the remaining nodes/DGs in the cluster (provided there is an adequate number of available Fault Domains and space).

This can be verified via the Web Client - Cluster > Monitor > Health > Data

or via the CLI on the host using cmmds-tool e.g. this prints the number of Objects with each Config Status (state 7 = Healthy):

#cmmds-tool find -f python | grep CONFIG_STATUS -B 4 -A 6 | grep 'uuid\|content' | grep -o 'state\\\":\ [0-9]*' | sort | uniq -c
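If it is easier to inspect the result off-host, the same count can be approximated from a saved copy of the output. This is only a rough sketch, not a VMware tool: the function names are made up for illustration, and it assumes the saved text contains `state": N` (or the escaped `state\": N`) on the `content` lines, which is what the grep pattern above targets.

```python
import re
from collections import Counter

# Matches `state": 7` or the escaped `state\": 7` form that the grep pattern targets.
STATE_RE = re.compile(r'state\\*":\s*(\d+)')


def count_config_states(dump_text, before=4, after=6):
    """Count CONFIG_STATUS states in saved `cmmds-tool find -f python` output.

    Hypothetical helper: it mirrors the grep pipeline (context window of
    4 lines before and 6 after each CONFIG_STATUS line, then only the
    lines mentioning uuid or content).
    """
    lines = dump_text.splitlines()
    keep = set()
    for i, line in enumerate(lines):
        if "CONFIG_STATUS" in line:
            # Same window as `grep -B 4 -A 6`; a set avoids double counting overlaps.
            keep.update(range(max(0, i - before), min(len(lines), i + after + 1)))

    counts = Counter()
    for i in sorted(keep):
        line = lines[i]
        if "uuid" in line or "content" in line:
            for match in STATE_RE.finditer(line):
                counts[int(match.group(1))] += 1
    return counts


def unhealthy_states(counts, healthy_state=7):
    """Return only the states that are not the healthy one (7, as noted above)."""
    return {state: n for state, n in counts.items() if state != healthy_state}
```

For example, after redirecting the `cmmds-tool find -f python` output to a file, pass the file's text to `count_config_states` and then to `unhealthy_states`; an empty result means every counted Object reported state 7.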

If all data has resynced and is healthy, then it *should* be safe to remove the disk-group via alternative methods: either delete the disk-group with 'No Action', or wipe the partitions on the drives via the Web Client (Host > Configure > Storage Devices > Select Device > All Actions > Erase Partitions).

NOTE: CAREFULLY check that the correct drives are being worked on here, as this is PERMANENT.

If neither of the above is possible (hostd can hold locks on badly failed disks) then the remaining option would be to boot the host with the vSAN modules disabled and then wipe the partitions.

If this is not a lab-cluster and if possible, do open a support request with VMware GSS and/or proceed with caution here.

Hope this helps.

Bob

tinnh
Contributor

Hi Bob,

Thanks for your advice. I followed your instructions to check the data health, and it shows the following:

Capture2.JPG

I tried Repair Objects Immediately and then checked the resync status, but it is empty even after a refresh. No resynchronization seems to be happening.

Capture3.JPG

Do you have any idea?

Regards!

TheBobkin
Champion

Hello tinnh,

The cluster may be unable to resync the data because the disk-group with issues leaves insufficient space on the appropriate fault domains.

How many nodes and disk-groups are in this cluster?

What is the space-utilisation per disk as per RVC? (vsan.disks_stats <pathToCluster>)

If no VMs/data is currently inaccessible then it is likely safe to remove this disk-group (via partition-wipe) and rebuild the disk-group - double-check that this disk-group is properly failed using esxtop 'u' and verify that there are 0 IOs to all devices in this disk-group.

As I said previously - if you do have the ability to open an SR with VMware GSS please do this as someone can check this better live via WebEx than I can advise without seeing the cluster live.

Bob

hkg2581
VMware Employee

tinnh

If this is a production cluster, please raise a support ticket with VMware so a TSE can review it. Please refrain from deleting a disk group with no data migration, as this could cause data loss. I see that you have multiple objects with reduced availability and non-compliance.

Thanks, Hareesh K G Personal Blog : http://virtuallysensei.com