VMware Cloud Community
Ram8
Enthusiast

SSD Failure on VSAN

I have an ESXi cluster with 3 hosts, with storage provided by VSAN. Each host contributes 2 SAS disks and 1 SSD to the VSAN cluster. One of the SSDs in a disk group has reported a failure. I need to replace it, but I am warned that removing it will remove the disk group. How should I proceed, given that removing a disk group on one host might lead to data loss?

7 Replies
zdickinson
Expert

Down the host, replace the SSD, power on, and claim the disk for the disk group. It should then start resyncing data.  Thank you, Zach.

jonretting
Enthusiast

Since the SSD is used for all read/write cache, losing it causes the loss of the disk group and all the data in it. Having three hosts is the bare minimum, and carries significant risks when failures do happen. Four should really be the minimum. Yes, you will have to remove that disk group in order to rebuild it after reclaiming the disks.

Best practice would be to first check the availability of all your virtual disks and VMs. Be sure the loss of the one host hasn't caused any degradation to your systems besides the loss of a disk group. Also check the storage policies: they will all be non-compliant, and each should show only one disk in its physical layout marked as absent. Since you have only three hosts in your VSAN, no auto-rebuild will take place until that host is brought back up. This means you are operating in a degraded environment, without much failure capacity.

So yes, go ahead and remove the disk. After doing this I would recommend using "partedUtil" to check for and remove any leftover partition data (relabel the disk). Assuming you are already in "manual" VSAN mode, claim all three disks back into the VSAN. Sometimes it's also good measure to stay in maintenance mode and move the host out of the cluster, then back in. If there are any storage provider registration issues, that should clear them.  Cheers, -Jon
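To put numbers on the three-versus-four-host point, here is a minimal Python sketch. It assumes the mirroring policy with failures to tolerate (FTT) = 1, which this thread does not actually state, and uses the 2 × FTT + 1 host rule for mirrored objects. The function names are made up for illustration and are not part of any VMware tool.

```python
def min_hosts(ftt: int) -> int:
    """Hosts needed for mirrored objects to tolerate `ftt` failures (2 * ftt + 1)."""
    return 2 * ftt + 1


def can_rebuild_after_host_loss(total_hosts: int, ftt: int = 1) -> bool:
    """True if, with one host contributing no storage, enough hosts remain
    to re-create full redundancy elsewhere (assumption: one disk group per host)."""
    return total_hosts - 1 >= min_hosts(ftt)


for hosts in (3, 4):
    print(hosts, "hosts -> rebuild possible:", can_rebuild_after_host_loss(hosts))
```

With 3 hosts the check is false, which matches the point that nothing rebuilds until the host is back; with 4 hosts there is a spare host to rebuild onto.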

mdangel1
Enthusiast

Does the same apply if you lose one of your spinning disks?

jonretting
Enthusiast

If a data disk fails it will be marked absent, and eventually rebuilt to another disk. Thanks, -Jon

zdickinson
Expert

I need to amend my post on replacing the SSD, as we just had to do it.  The first thing to do is delete the SSD from the web interface.  This will effectively delete the disk group.  Note that you need to uncheck the box to evacuate data.  Even though the amount to be evacuated will/should be 0, the operation will fail if you don't uncheck it.  Now you can down the host, replace the SSD, power it back on, and re-create the disk group.  If you're using vSAN 6 I would recommend doing a proactive re-balance; see "VSAN 6.0 Part 9 - Proactive Re-balance" on CormacHogan.com.  Once it is started it will run for 24 hours.  Depending on the size of the disk group you may have to start it again.  Our disk group was 7 TB and I had to start it again; total time to rebalance was about 36 hours.  One note here: Cormac states that you can monitor the status of the data move with vsan.resync_dashboard, but for whatever reason that didn't work for me.  Instead I used vsan.proactive_rebalance_info.
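As a rough planning aid, the restart arithmetic above can be sketched in Python. This assumes each rebalance run stops after the 24-hour window described in this post; the function name is made up for illustration, and the 36-hour figure is just our one data point, not a prediction for other disk groups.

```python
import math


def rebalance_runs(total_hours: float, window_hours: float = 24.0) -> int:
    """Number of times the rebalance must be started to cover `total_hours`,
    assuming each run stops after `window_hours`."""
    return math.ceil(total_hours / window_hours)


print(rebalance_runs(36))
```

For our 36 hours this gives two starts, which is what we saw.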

To expand on what Retting said about replacing a spinning disk, it depends on whether you're using RAID 0 or passthrough.  If using RAID 0, you will need to down the host, replace the drive, re-create the RAID 0, power the host on, and claim the drive.  If using passthrough you can hot swap the drive.  Not sure if you need to reclaim the drive or not.  It might depend on whether you have the cluster set to auto or manual for disk claim.
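The two spinning-disk cases can be written out as a small Python checklist. This only restates the steps in this post: the function name and the mode labels are made up for illustration, and the claim step for passthrough in manual mode is an assumption that is not confirmed anywhere in this thread.

```python
from typing import List


def hdd_replacement_steps(controller_mode: str, claim_mode: str = "manual") -> List[str]:
    """Return the ordered steps for replacing a failed spinning disk.

    controller_mode: "raid0" or "passthrough" (hypothetical labels).
    claim_mode: "manual" or "automatic" disk claiming on the cluster.
    """
    if controller_mode == "raid0":
        return [
            "power down the host",
            "replace the drive",
            "re-create the RAID 0",
            "power the host on",
            "claim the drive",
        ]
    if controller_mode == "passthrough":
        steps = ["hot swap the drive"]
        if claim_mode == "manual":
            # Assumption: manual claim mode needs an explicit claim step.
            steps.append("claim the drive (assumed, not confirmed)")
        return steps
    raise ValueError(f"unknown controller mode: {controller_mode!r}")


for step in hdd_replacement_steps("raid0"):
    print(step)
```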

Thank you, Zach.

Sud2009
Contributor

Hello,

Please check the link below; it may help.

How VSAN handles a disk or host failure

Kisan_VMware
Enthusiast

Hi,

Check the link below; it has step-by-step information.

VMware Virtual SAN Operations: Replacing Disk Devices - Virtual Blocks - VMware Blogs
