That is correct. You will need to evacuate the DG and create a new one with the new cache device. You can only have 1 cache device per DG, so once that device is unavailable, whether through failure or replacement, the DG goes offline. Hence the need for the DG delete/re-create.
Replacing the cache device is pretty straightforward as long as you follow that procedure. You cannot add a second cache device to a DG...
Yes, evacuation (if possible) and removal of the entire disk-group is necessary to replace the cache-tier SSD.
The reason for this is that the cache-tier SSD stores all the metadata describing the structure of the capacity-tier.
This can be done via the Web Client: Cluster > Manage/Configure > vSAN > Disk Management > select the disk-group > Delete, then pick the evacuation option. Use 'Full data migration' if there are 4 or more nodes in the cluster (or multiple disk-groups per host) and enough free space; use 'Ensure accessibility' if you only have 3 nodes or a single disk-group per host.
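For anyone who prefers doing this from the host CLI, the same delete/re-create sequence can be sketched with esxcli. This is a sketch, not a verified runbook: the `naa.*` device identifiers below are placeholders you would substitute from your own `esxcli vsan storage list` output, and you should confirm the exact options against your ESXi version before running anything.

```shell
# Identify the cache-tier SSD and capacity disks of the disk-group
esxcli vsan storage list

# Remove the disk-group by removing its cache-tier SSD.
# The evacuation mode mirrors the Web Client options:
#   evacuateAllData           = 'Full data migration'
#   ensureObjectAccessibility = 'Ensure accessibility'
esxcli vsan storage remove -s naa.CACHE_SSD_ID -m evacuateAllData

# After physically replacing the drive, re-create the disk-group
# with the new cache SSD and the existing capacity disks
esxcli vsan storage add -s naa.NEW_CACHE_SSD_ID \
    -d naa.CAPACITY_DISK_1 -d naa.CAPACITY_DISK_2
```

Removing the cache-tier SSD takes the whole disk-group with it, which is exactly the delete step the Web Client performs.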
Thanks for the answers!
Imagine if this were built in for future versions: replacing cache drives before they break.
Add the new drive to the system, edit the disk group, choose "Change Cache Device", select the new device, and everything could be migrated smoothly.
Anyways, looks like I have my work cut out for me.
You raise a good point, though it is quite likely this was considered at some point and found not feasible/possible/reliable for some reason. We are only scratching the surface here; it gets a lot more complex at the deeper levels of LSOM.
I don't think it would be wise to try replicating data from a dying disk. What if it fails mid-way? How could we verify that the data it provides is sane and complete?
The current method for replacement/recovery works well. I have had customers whose disk-groups died and whose data recovered via resync, then another fault occurred and recovered, then another occurred, and only then did they notice, because objects became inaccessible and there were no longer enough healthy fault domains to recover from. That was back in the old days of early 5.5 with no Health check in the GUI; it would be much more apparent if this occurred in modern versions.
I would also mention: if removing a disk-group with the 'Ensure accessibility' option, make sure you have good back-ups first, as vSAN is running off a single copy of the data until it has been resynced to the new disk-group, so another component failure while this is ongoing can lead to real issues.
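Since that single-copy window lasts exactly as long as the resync, it is worth watching the resync progress before doing anything else to the cluster. A rough sketch using the esxcli debug namespace (available in vSAN 6.6 and later; check your version, as older releases need RVC instead):

```shell
# High-level view: bytes left to resync and number of objects affected
esxcli vsan debug resync summary get

# Per-object detail of what is still resyncing
esxcli vsan debug resync list

# Overall object health; wait until everything reports healthy
# before any further maintenance on the cluster
esxcli vsan debug object health summary get
```

Once the resync counters reach zero and objects report healthy, the data is back at full redundancy.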