VMware Cloud Community
jonathanp
Expert

Replacing a failed capacity device.

Hi,

We have a vSAN cluster with 5 hosts and FTT initially set to 2, and stripe of 1.

I have temporarily set FTT to 1 using the command below, which VMware support told me to use, because I wasn't able to put the host with the failed drive into maintenance mode:

vsan.cluster_set_default_policy . '(("hostFailuresToTolerate" i1) ("forceProvisioning" i1))'

Well, even at FTT=1 it is not working. I am still getting the same error:

Failed to enter maintenance mode in the current VSAN data migration mode due to insufficient nodes or disks in the cluster...

I have a few questions here.

1- I restarted the host and looked in the LSI MegaRAID controller, and the disk is not present there.

    So I want to replace it.

    I need to decommission the failed disk from the disk group, right?

2- Following VMware's steps, which say to delete the disk from the group, I get the following message when I try to delete it:

    Action not available when Deduplication and compression is enabled on cluster.

3- So what is the correct procedure to replace a failed drive on a 5-host cluster (initially set to FTT=2)?

Jonathan


Accepted Solutions
GreatWhiteTec
VMware Employee

When dedupe and compression are enabled, a failed capacity device means you lose the entire disk group, similar to what happens when you lose a cache device. This is because of how dedupe uses pointers to the data blocks within the disk group.

Is vSAN set to Manual mode for disk claiming? If not, change it to Manual and try again. Also make sure that the policy was actually applied and that you truly have FTT=1: if you changed the policy but the VMs are not yet compliant, you still have 3 copies of the data.

  1. Delete the disk group (DG) with full data evacuation
  2. Replace the drive
  3. Recreate the DG
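
To illustrate the FTT point above: with RAID-1 mirroring, vSAN needs 2 × FTT + 1 hosts to place an object's replicas and witness components. Below is a minimal sketch of that arithmetic. The function names are hypothetical, and it assumes mirroring (not erasure coding) and one fault domain per host; it only counts hosts and ignores free capacity.

```python
def hosts_required(ftt):
    """Minimum hosts for RAID-1 mirroring (assumption: one fault domain per host).

    ftt + 1 data replicas, with the remaining hosts used for witness components.
    """
    if ftt < 0:
        raise ValueError("ftt must be non-negative")
    return 2 * ftt + 1


def can_fully_evacuate_one_host(total_hosts, ftt):
    """True if the hosts left after evacuating one can still satisfy the policy.

    This is a host-count check only; it does not look at free capacity.
    """
    return total_hosts - 1 >= hosts_required(ftt)


if __name__ == "__main__":
    for ftt in (2, 1):
        print(
            f"FTT={ftt}: needs {hosts_required(ftt)} hosts, "
            f"full evacuation of 1 of 5 hosts possible: "
            f"{can_fully_evacuate_one_host(5, ftt)}"
        )
```

On a 5-host cluster, objects still at FTT=2 need all 5 hosts, so a full evacuation of one host cannot succeed until the objects are actually compliant with FTT=1.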

3 Replies
TheBobkin
Champion

Hello Jonathan,

Is the disk gone from the disk-group?

Via Web Client: Cluster > Configure (assuming 6.5; it is 'Manage' in 6.0) > vSAN > Disk Management

Via SSH to the host that the disk group is mounted on:

#esxcli vsan storage list
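
If the output is long, it can help to group the devices by disk group. Here is a rough Python sketch that does this for saved output. The sample text is made up, and the field names (`Device`, `VSAN Disk Group UUID`, `Is Capacity Tier`) are assumed from what the command typically prints, so check them against your own output.

```python
from collections import defaultdict

# Made-up sample in the shape the command typically prints; the device
# names and UUIDs are placeholders, not values from this thread.
SAMPLE = """\
naa.cache0
   Device: naa.cache0
   Is SSD: true
   VSAN Disk Group UUID: dg-0001
   Is Capacity Tier: false
naa.cap0
   Device: naa.cap0
   Is SSD: true
   VSAN Disk Group UUID: dg-0001
   Is Capacity Tier: true
"""


def parse_storage_list(text):
    """Split the output into one dict of 'Key: value' fields per device."""
    devices = []
    for line in text.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # An unindented line starts a new device record.
            devices.append({})
        elif devices and ":" in line:
            key, _, value = line.strip().partition(":")
            devices[-1][key.strip()] = value.strip()
    return devices


def group_by_disk_group(devices):
    """Map disk-group UUID -> list of (device, tier) tuples."""
    groups = defaultdict(list)
    for dev in devices:
        tier = "capacity" if dev.get("Is Capacity Tier") == "true" else "cache"
        group_id = dev.get("VSAN Disk Group UUID", "unknown")
        groups[group_id].append((dev.get("Device", "unknown"), tier))
    return dict(groups)


if __name__ == "__main__":
    for group_id, members in group_by_disk_group(parse_storage_list(SAMPLE)).items():
        print(group_id, members)
```

A disk that has dropped out of the disk group simply would not show up among that group's members.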

If it is gone, and rebooting the host doesn't cause ANY VMs to become inaccessible or go offline, then powering the host down and replacing the disk should be okay, but I can't be sure how it will rebuild (see next point).

Otherwise, it looks like you will have to evacuate the entire disk group that this failed disk is in to do this more cleanly, because dedupe and compression work on a per-disk-group basis:

How to Manage Disks in a Cluster with Deduplication and Compression

■ You cannot remove a single disk from a disk group. You must remove the entire disk group to make modifications.

https://pubs.vmware.com/vsphere-65/index.jsp?topic=%2Fcom.vmware.vsphere.virtualsan.doc%2FGUID-3D2D8...

Any chance you could PM me the SR number? (I work there.)

Bob

TheBobkin
Champion

*Supporting documentation for 6.0, to show that the behavior is the same as in later versions:

Adding or Removing Disks when Deduplication and Compression Is Enabled

■ Deduplication and compression is implemented at a disk group level. You cannot remove a capacity disk from the cluster with enabled deduplication and compression. You must remove the entire disk group.

https://pubs.vmware.com/vsphere-60/index.jsp?topic=%2Fcom.vmware.vsphere.virtualsan.doc%2FGUID-AA72C...
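
The rule from this thread can be written down as a small decision function. This is only an illustrative sketch: the function name and the tier labels are hypothetical, and it encodes just the two cases discussed here (cache device loss, and capacity disks under dedupe and compression).

```python
def removal_scope(device_tier, dedup_compression_enabled):
    """Return 'disk' or 'disk group': the smallest unit that has to be removed."""
    if device_tier not in ("cache", "capacity"):
        raise ValueError(f"unknown tier: {device_tier!r}")
    if device_tier == "cache":
        # Losing or removing the cache device affects the whole disk group.
        return "disk group"
    if dedup_compression_enabled:
        # Dedupe and compression are implemented per disk group, so a single
        # capacity disk cannot be removed on its own.
        return "disk group"
    return "disk"


if __name__ == "__main__":
    print(removal_scope("capacity", dedup_compression_enabled=True))
    print(removal_scope("capacity", dedup_compression_enabled=False))
```

For the cluster in the question (failed capacity disk, dedupe and compression enabled), this yields the whole disk group.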
