VMware Cloud Community
Anton_N
Contributor

Removing disk groups

I have a 4-node hybrid VSAN 6.2 cluster with two disk groups on each host. One DG uses 7.2k RPM capacity drives and the other uses faster 10k RPM capacity drives (added recently).

I need to remove the slower DGs from the cluster so that VSAN runs only on the faster drives.

What is the best way to do this with a minimal risk of failure?

And I have some additional questions:

- Is it necessary to put the host into maintenance mode before deleting a DG in VSAN 6.2?

- When the data is being evacuated from the host (or DG), what happens to the evacuation process and the evacuated data when another host in the cluster fails and VSAN starts rebuilding?

- When the data is being evacuated from the host and the host fails, what happens to the evacuated data?

TheBobkin
Champion

Hello Anton

"What is the best way to do this with a minimal risk of failure?"

The safest method is to remove the disk-group with 'Full Data Evacuation' selected. This replicates all the data on the disk-group to other available disks in the cluster; once this has completed, it removes the disks, and the newly created data-components are used instead of the ones on the now-deleted disk-group.

As this is a hybrid cluster and thus not deduplicated, you can do this operation one capacity-drive at a time instead of the entire disk-group, so that each resync is smaller.
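
To make the "smaller resyncs" point concrete, here is a minimal planning sketch in Python. It is not a vSAN tool: the drive names, the used/free sizes and the `min_free_gb` reserve are hypothetical inputs you would read off your own cluster. Each step evacuates one capacity-drive, so the largest single resync is one drive's worth of data rather than the whole disk-group's.

```python
from dataclasses import dataclass


@dataclass
class CapacityDisk:
    # Hypothetical model of one capacity-drive in the disk-group being decommissioned.
    name: str
    used_gb: int


def plan_per_disk_evacuation(disks, target_free_gb, min_free_gb):
    """Return one (drive name, resync GB, free GB left on targets) tuple per removal step.

    target_free_gb: free space on the disks that stay in the cluster (assumed input).
    min_free_gb: free space to keep in reserve after each step (assumed slack policy).
    """
    steps = []
    free_gb = target_free_gb
    for disk in disks:
        # Evacuating one drive moves only that drive's data, so each resync stays small.
        free_gb -= disk.used_gb
        if free_gb < min_free_gb:
            raise ValueError(f"not enough free space to evacuate {disk.name}")
        steps.append((disk.name, disk.used_gb, free_gb))
    return steps


disks = [CapacityDisk("cap-1", 400), CapacityDisk("cap-2", 350), CapacityDisk("cap-3", 300)]
steps = plan_per_disk_evacuation(disks, target_free_gb=3000, min_free_gb=500)
largest_step_gb = max(resync_gb for _, resync_gb, _ in steps)
whole_group_gb = sum(d.used_gb for d in disks)
print(steps)
print(largest_step_gb, whole_group_gb)
```

With these made-up numbers the biggest single resync is 400 GB instead of 1050 GB for the whole disk-group; the total data moved is the same either way.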

"- Is it necessary to put the host into maintenance mode before deleting a DG in VSAN 6.2?"

No, it is not mandatory. If you do put the host into maintenance mode, the other (new) disk-group on that host won't be available as a target for resyncing the data off the disk-group you are decommissioning, so the evacuation may be slower.
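
A simplified Python model of the target set, assuming (as described in this answer) that a host in maintenance mode accepts no new data. The `DiskGroup` type and the host/DG names are hypothetical, and real vSAN placement applies further rules (free space, replica placement) on top of this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiskGroup:
    host: str
    name: str


def eligible_targets(disk_groups, decommissioning, host_in_maintenance=False):
    """Disk-groups that could receive data evacuated from `decommissioning` (simplified)."""
    targets = []
    for dg in disk_groups:
        if dg == decommissioning:
            continue  # never resync onto the disk-group being removed
        if host_in_maintenance and dg.host == decommissioning.host:
            continue  # in this model a host in maintenance mode accepts no new data
        targets.append(dg)
    return targets


cluster = [DiskGroup(f"host{i}", kind) for i in range(1, 5) for kind in ("dg-7.2k", "dg-10k")]
old_dg = DiskGroup("host1", "dg-7.2k")
print(len(eligible_targets(cluster, old_dg)))
print(len(eligible_targets(cluster, old_dg, host_in_maintenance=True)))
```

In this 4-host example, maintenance mode drops host1's own 10k disk-group from the candidates (7 targets become 6).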

"- When the data is being evacuated from the host (or DG), what happens to the evacuation process and the evacuated data when another host in the cluster fails and VSAN starts rebuilding?"

If the target location for data-placement of the resync becomes inaccessible (e.g. PSOD, disk dies etc.), vSAN will start the resync of these components in another available location.
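
A rough Python sketch of that re-targeting step. It assumes a RAID-1 object whose replicas must sit on different hosts, and for simplicity it takes the first candidate that passes both checks; the real placement logic also weighs free space and balance. Host and DG names are hypothetical.

```python
def pick_new_target(candidates, failed_hosts, hosts_with_other_replicas):
    """Return the first (host, disk-group) still usable for the resync, or None."""
    for host, dg in candidates:
        if host in failed_hosts:
            continue  # the original target (PSOD, dead disk) is no longer reachable
        if host in hosts_with_other_replicas:
            continue  # replicas of one object must sit on different hosts
        return host, dg
    return None


candidates = [("host2", "dg-10k"), ("host3", "dg-10k"), ("host4", "dg-10k")]
print(pick_new_target(candidates, failed_hosts={"host2"}, hosts_with_other_replicas={"host3"}))
```

Here the resync was heading to host2, which failed, and host3 already holds the other replica, so the sketch moves the resync to host4.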

"- When the data is being evacuated from the host and the host fails, what happens to the evacuated data?"

Provided the Objects are compliant with their Storage Policy and are at least FTT=1, the Object should remain accessible and the resync will continue, but it will copy the data for the resync only from the remaining data-component(s).
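
A simplified Python model of which replicas a FTT=1 (RAID-1) resync can read from. The vote model is an assumption for illustration: one vote per replica plus one for the witness, with the object accessible only when more than half the votes and at least one intact replica are reachable. Real vSAN can weight votes differently.

```python
def resync_sources(replicas_up, witness_up):
    """Return the replicas a RAID-1 (FTT=1) resync can read from.

    replicas_up maps a replica name to whether it is reachable.
    """
    total_votes = len(replicas_up) + 1  # one vote per replica plus the witness (simplified)
    votes_up = sum(replicas_up.values()) + (1 if witness_up else 0)
    sources = [name for name, up in replicas_up.items() if up]
    if votes_up * 2 <= total_votes or not sources:
        raise RuntimeError("object inaccessible: no quorum or no intact replica")
    return sources


# The host being evacuated fails: only the surviving replica is used as a source.
print(resync_sources({"replica-A": False, "replica-B": True}, witness_up=True))
# Both replicas reachable: both can serve as sources.
print(resync_sources({"replica-A": True, "replica-B": True}, witness_up=True))
```

If a second component (for example the witness) is lost as well, the sketch raises, mirroring the "at least FTT=1" condition.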

Hope this helps.

Bob

Anton_N
Contributor

Thanks, Bob!

"If the target location for data-placement of the resync becomes inaccessible (e.g. PSOD, disk dies etc.), vSAN will start the resync of these components in another available location."

But at the same time we have components on the "dead" host that should also be resynced if the host doesn't come back to work. When does vSAN do this? Immediately after the 60-minute timeout? Or after the previous resync finishes? Which process has the highest privilege - data evacuation from the "good" host or re-creating replicas from the "dead" one?

"Provided the Objects are compliant with their Storage Policy and are at least FTT=1, the Object should remain accessible and the resync will continue, but it will copy the data for the resync only from the remaining data-component(s)."

vSAN copies data from both replicas (when FTT=1) during resync, right?

TheBobkin
Champion

Hello Anton,

"But at the same time we have components on the "dead" host that should also be resynced if the host doesn't come back to work"

No, because once vSAN has completed resyncing and created a new data-replica on a newly selected disk-group, it has no need for the partial data of that Object it wrote to the failed host/DG, and this data will be discarded.

Actually, later vSAN versions (6.5/6.6) are smart about this. If, for example, a host PSODs 50% of the way through resyncing a data-replica to it, vSAN starts resyncing that replica elsewhere; if the PSODed host is then brought back up, vSAN will calculate which requires less effort: continuing the resync in the new location, or cancelling it and finishing the resync of the 50%-complete data-replica.
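
The exact heuristic vSAN uses is not described here, so the following Python sketch uses a hypothetical cost model: compare the gigabytes still to copy in each location, and conservatively count everything written to the object while the host was down against the returned host's stale partial copy.

```python
def choose_resync(object_gb, copied_to_new_gb, copied_to_returned_gb, changed_since_failure_gb):
    """Pick the cheaper way to finish rebuilding a data-replica (hypothetical cost model)."""
    remaining_new = object_gb - copied_to_new_gb
    # Conservatively add everything written while the host was down, since its
    # partial copy may be stale for those blocks.
    remaining_returned = object_gb - copied_to_returned_gb + changed_since_failure_gb
    if remaining_returned < remaining_new:
        return "finish-on-returned-host"
    return "continue-new-location"


# Host returns with 50% of a 200 GB replica; the new location has only 20 GB so far.
print(choose_resync(200, copied_to_new_gb=20, copied_to_returned_gb=100, changed_since_failure_gb=10))
# The new location is already 75% done, so continuing there is cheaper.
print(choose_resync(200, copied_to_new_gb=150, copied_to_returned_gb=100, changed_since_failure_gb=0))
```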

How components become unavailable determines whether they are marked as 'Absent' (starts the CLOM repair timer, default of 60 minutes) or 'Degraded' (resync started immediately).
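
A minimal Python sketch of the two cases. The event labels are hypothetical and the mapping is simplified: failures the host may recover from (PSOD, network partition, a pulled disk) leave components 'Absent' and wait for the repair delay, while a disk reporting a permanent error marks them 'Degraded'. We assume the 60-minute default corresponds to the `VSAN.ClomRepairDelay` advanced setting; check the docs for your version before changing it.

```python
from datetime import datetime, timedelta

# Default CLOM repair delay (assumed to map to the VSAN.ClomRepairDelay advanced setting).
CLOM_REPAIR_DELAY = timedelta(minutes=60)

# Hypothetical event labels mapped to the component state (simplified).
FAILURE_STATE = {
    "host_down": "Absent",          # e.g. PSOD or power loss: the host may come back
    "network_partition": "Absent",
    "disk_pulled": "Absent",        # disk removed without reporting a failure
    "disk_error": "Degraded",       # disk reports a permanent failure
}


def repair_start(event, detected_at, delay=CLOM_REPAIR_DELAY):
    """Return (component state, time at which rebuilding elsewhere begins)."""
    state = FAILURE_STATE[event]
    if state == "Degraded":
        return state, detected_at  # resync starts immediately
    return state, detected_at + delay  # wait in case the component comes back


t0 = datetime(2017, 1, 1, 12, 0)
print(repair_start("disk_error", t0))
print(repair_start("host_down", t0))
```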

Here is a handy doc outlining these scenarios/conditions:

Failure States of Virtual SAN Components

"Which process has the highest privilege - data evacuation from the "good" host or re-creating replicas from the "dead" one?"

Good question. I have a recollection of reading/hearing/seeing that 'availability-related rebuild' has priority over 'data move', and it likely does, but I cannot recall for certain.
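
If that recollection is right, the ordering behaves like a two-level priority queue. The Python sketch below is only an illustration of that assumption: the job kinds and object IDs are hypothetical, and real vSAN runs several resyncs concurrently with throttling rather than strictly one at a time.

```python
import heapq
import itertools

# Lower number = higher priority. This ordering is an unconfirmed assumption.
PRIORITY = {"availability_rebuild": 0, "data_move": 1}


class ResyncQueue:
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # keeps FIFO order within one priority level

    def submit(self, kind, object_id):
        heapq.heappush(self._heap, (PRIORITY[kind], next(self._counter), kind, object_id))

    def next_job(self):
        _, _, kind, object_id = heapq.heappop(self._heap)
        return kind, object_id


queue = ResyncQueue()
queue.submit("data_move", "obj-1")             # evacuation from the decommissioned DG
queue.submit("data_move", "obj-2")
queue.submit("availability_rebuild", "obj-3")  # replica lost on the failed host
print(queue.next_job())
```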

"vSAN copies data from both replicas (when FTT=1) during resync, right?"

If both are accessible, yes.

Bob