VMware Cloud Community
andreaspa
Hot Shot

Change Cache Disk for disk group

Has anyone tried swapping their current cache SSD for a new SSD (with the old one still working)?

Any tips, or common pitfalls?

I've read that you need to do a full data migration, delete the disk group and create it again, then repeat the same on each of the other hosts. Surely there's got to be a simpler way to do this?

The reason for the swap is to gain performance.


Accepted Solutions
TheBobkin
Champion

Hello Andreas,

Yes, evacuation (if possible) and removal of the entire disk-group are necessary for replacing a cache-tier SSD.

The reason for this is that the cache-tier SSD stores all the metadata describing the structure of the capacity-tier.

This can be done in the Web Client: Cluster > Manage/Configure > vSAN > Disk Management > select the disk-group > Delete > select an option. Use 'Full-evacuation' if there are 4 or more nodes in the cluster and enough space, or multiple disk-groups per host; use 'Ensure Accessibility' if there are only 3 nodes/a single disk-group per host.

docs.vmware.com/en/VMware-vSphere/6.0/com.vmware.vsphere.virtualsan.doc/GUID-16EBFE47-28BE-48DA-8B62-C99B2A7DC5C0.html

http://cormachogan.com/2014/02/21/vsan-part-17-removing-a-disk-group-from-a-host/
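
The choice between the two removal options described above can be written down as a small decision function. This is only a rough sketch in Python, not anything vSAN itself provides: the `ClusterState` type, its field names and `choose_removal_option` are hypothetical, and it assumes that spare capacity is needed for a full evacuation in either case (4 or more nodes, or multiple disk-groups per host).

```python
from dataclasses import dataclass


@dataclass
class ClusterState:
    """Hypothetical summary of the cluster, filled in by hand."""
    node_count: int
    disk_groups_per_host: int
    has_spare_capacity: bool


def choose_removal_option(state: ClusterState) -> str:
    """Return the disk-group removal option suggested in the post above."""
    # Data can only be fully evacuated if there is somewhere else to put it:
    # a fourth node, or another disk-group on the same host.
    has_other_target = state.node_count >= 4 or state.disk_groups_per_host > 1
    # Assumption: free space is required in both cases.
    if has_other_target and state.has_spare_capacity:
        return "Full-evacuation"
    return "Ensure Accessibility"


if __name__ == "__main__":
    three_node = ClusterState(node_count=3, disk_groups_per_host=1,
                              has_spare_capacity=True)
    print(choose_removal_option(three_node))  # Ensure Accessibility
```

The function only captures the rule of thumb; the actual removal is still done through Disk Management as described above.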

Bob

GreatWhiteTec
VMware Employee

That is correct. You will need to evacuate the disk group (DG) and create a new one with the new cache device. A DG can have only one cache device, so once that device is unavailable, whether through failure or replacement, the DG goes offline. Hence the need to delete and re-create the DG.

Replacing the cache device is pretty straightforward as long as you follow that procedure. You cannot add a second cache device to a DG.

andreaspa
Hot Shot

Thanks for the answers!

Imagine if this were built into future versions, so that cache drives could be replaced before they break.

Add the new drive to the system, edit the disk group, choose "Change Cache Device", pick the new device, and everything gets migrated smoothly.

Anyway, it looks like I have my work cut out for me. :)

TheBobkin
Champion

Hello Andreas,

You raise a good point, though it is quite likely that this was considered at some point and found not to be feasible, possible or reliable for some reason. I am only really scratching the surface here, as it gets a lot more complex at the deeper levels of LSOM.

I don't think it would be wise to try replicating data from a dying disk - what if it fails mid-way? How could we be sure or verify that the data it is providing is sane and complete?

The current method for replacement/recovery works well. I have had customers who had a disk-group die and the data recover via resync, then another fault occur and recover, then another occur, and only then notice (as stuff became inaccessible and there were not enough healthy fault domains to recover). This was back in the old days of earlier 5.5 with no Health check in the GUI; it would be much more apparent if this occurred in modern versions.

I would also mention: if removing a disk-group with the 'Ensure Accessibility' option, make sure you have good backups first. You are running off a single copy of the data until it has been resynced to the new disk-group, so another component failure while this is ongoing can lead to real issues.
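
To put rough numbers on that risk, here is a back-of-the-envelope sketch in Python. It assumes a mirroring (RAID-1) storage policy, where failures to tolerate (FTT) of n keeps n + 1 copies of the data, and it ignores witness components and quorum. The function name `healthy_copies` is made up for illustration.

```python
def healthy_copies(ftt: int, copies_offline: int) -> int:
    """Number of data copies still available under a mirroring policy.

    Assumption: FTT = n means n + 1 full copies; witnesses are ignored.
    """
    total_copies = ftt + 1
    return max(total_copies - copies_offline, 0)


if __name__ == "__main__":
    # FTT=1, one copy sat on the disk-group removed with 'Ensure Accessibility'.
    print(healthy_copies(ftt=1, copies_offline=1))  # 1 -> single copy left
    # A second failure before the resync completes.
    print(healthy_copies(ftt=1, copies_offline=2))  # 0 -> data unavailable
```

With FTT=1 there is no margin left during the resync, which is why the backups matter.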

Bob
