VMware Cloud Community
m13316
Contributor

Dual vSAN cache tier failure + dead vCenter

Hello all, I'm looking for your expert advice on the best way to proceed with a really ugly failure scenario. We recently physically moved our lab to a new facility and, as we were bringing things back online, found that we had lost a flash cache disk on two of our five servers. To complicate matters further, we found that vCenter was sitting on a local disk outside of vSAN, and that disk appears to have suffered some corruption, because vCenter fails to start with an "Unable to enumerate all disks" error. Researching that further, five of the twelve vmdks report an I/O error when we try to read them, and the corresponding flat files are nowhere to be found. I thought we might be able to rebuild the vmdks, but it seems that without the flat files we are out of luck (right?). Unfortunately there is no backup.
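To illustrate why a missing flat file matters: a vmdk descriptor is only a small text file that points at the extent file holding the data, so the descriptor can be rewritten but the data cannot. The following is a minimal sketch, assuming the common VMFS descriptor extent line shape (`RW <sectors> VMFS "<name>-flat.vmdk"`); the function name `missing_extents` is hypothetical, and the sketch only reports which referenced extent files are absent next to a descriptor.

```python
import re
from pathlib import Path

# Assumed extent line shape in a VMFS descriptor, for example:
#   RW 41943040 VMFS "myvm-flat.vmdk"
EXTENT_RE = re.compile(r'^(RW|RDONLY|NOACCESS)\s+(\d+)\s+(\S+)\s+"([^"]+)"')


def missing_extents(descriptor_path):
    """Return extent file names referenced by a descriptor that do not exist beside it."""
    descriptor = Path(descriptor_path)
    missing = []
    for line in descriptor.read_text(errors="replace").splitlines():
        match = EXTENT_RE.match(line.strip())
        if match is None:
            continue  # comments, version fields, ddb.* entries, and so on
        extent_name = match.group(4)
        if not (descriptor.parent / extent_name).exists():
            missing.append(extent_name)
    return missing
```

A non-empty result means the descriptor points at data that is no longer on the datastore, which is the case where recreating the descriptor alone does not help.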

With regard to vSAN, my understanding is that when the cache disk fails the entire disk group is removed from service. That appears to be the case here. The failed flash disks have been replaced on both servers, and each server also has another flash disk that can be used. However, the disk groups still appear to be out of service, because the local ESXi console reports all of the disks as not operational (see attached picture). Many of our VMs are also showing up in the local console as Invalid. I think that is probably because the remaining three servers do not have enough storage to accommodate all of the VMs. What is the best way to recover from this multiple-failure scenario while preserving our data? I am thinking of creating a new vCenter, putting all hosts in maintenance mode and then adding them to the new vCenter. Then I would replace the failed cache disk on each server using the new vCenter. Would that work, or is there a better/safer strategy? Also, what is the procedure to replace the failed cache disks in vSAN and bring the disk groups back into service without losing data?

All hosts and vCenter are running version 6.5. 

Thanks,

Matt

1 Reply
TheBobkin
Champion

Hello Matt,

Welcome to Communities, and sorry that your first post here comes out of a bad situation.

Is deduplication and compression enabled on this cluster? If so, the Disk-Group may have failed because a Capacity-tier device failed (with it enabled, the failure of any single device takes the whole Disk-Group offline).

What exact cause was determined for the failure of the Cache-tier devices? For example, did you see Medium errors (0x3 0x11) for these devices in vmkernel.log, or some other cause of Disk-Group failure when they first initialised at boot?
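As a rough aid for that check, here is a minimal sketch that counts Medium errors (sense key 0x3, ASC 0x11) per device in vmkernel.log lines. The line shape matched by `SENSE_RE` is an assumption based on how SCSI failures are commonly logged and may differ between builds; the `SAMPLE` lines and device names are made up for illustration and are not from the poster's hosts.

```python
import re
from collections import Counter

# Assumed log shape: ... to dev "<device>" failed ... Valid sense data: 0x3 0x11 0x0.
SENSE_RE = re.compile(
    r'to dev "(?P<dev>[^"]+)" failed .*?Valid sense data: '
    r'(?P<key>0x[0-9a-fA-F]+) (?P<asc>0x[0-9a-fA-F]+) (?P<ascq>0x[0-9a-fA-F]+)'
)


def count_medium_errors(lines):
    """Count lines per device whose sense key is 0x3 and ASC is 0x11."""
    counts = Counter()
    for line in lines:
        match = SENSE_RE.search(line)
        if match is None:
            continue
        key = int(match.group("key"), 16)
        asc = int(match.group("asc"), 16)
        if key == 0x3 and asc == 0x11:
            counts[match.group("dev")] += 1
    return counts


# Made-up sample lines (hypothetical device names), only to exercise the parser.
SAMPLE = [
    'cpu1:2097200)ScsiDeviceIO: 3015: Cmd(0x439500a0) 0x28, CmdSN 0x1 from world 0 '
    'to dev "naa.EXAMPLE0001" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x3 0x11 0x0.',
    'cpu2:2097201)ScsiDeviceIO: 3015: Cmd(0x439500b0) 0x28, CmdSN 0x2 from world 0 '
    'to dev "naa.EXAMPLE0001" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x3 0x11 0x0.',
    'cpu3:2097202)ScsiDeviceIO: 3015: Cmd(0x439500c0) 0x1a, CmdSN 0x3 from world 0 '
    'to dev "naa.EXAMPLE0002" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.',
]

if __name__ == "__main__":
    for device, hits in count_medium_errors(SAMPLE).items():
        print(device, hits)
```

On a real host the input would be the lines of the vmkernel.log file rather than `SAMPLE`; a device that repeatedly reports 0x3 0x11 is a candidate for a genuine media fault rather than a cabling or seating problem from the move.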

Replacing a failed Cache-tier device will not make the data on a Disk-Group accessible again; it merely allows creation of a new (blank) Disk-Group with the original Capacity-tier devices. If you haven't yet determined the cause of the failure (and whether anything can be done to remediate it) AND you haven't deleted the old Disk-Groups, then I would advise putting the failed Cache-tier devices back in their respective servers and trying to figure out the problem(s).

The VMs are most likely invalid because their namespaces are inaccessible due to the double failure. It is technically possible that some vmdk Objects from these VMs still exist on storage and are accessible but have no accessible descriptors (because the descriptors were in the now-inaccessible namespace). There are means of recreating these descriptors, but it can be time-consuming (these stray vmdks should now also appear as Unassociated Objects in RVC vsan.obj_status_report -t). If neither of the Disk-Groups can be recovered, then unfortunately you will be looking at restoring from backup or rebuilding VMs, though if anything extremely important to the business is stored here and inaccessible, consider a data recovery specialist.
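To make the "double failure" reasoning concrete, here is a minimal sketch of the accessibility rule, assuming a mirrored policy that tolerates one failure: two replicas plus one witness, one vote each, on three different hosts (the thread does not state the actual policy). An object stays accessible only if a full replica survives and more than half of the votes survive. The host names are hypothetical, and `failed_hosts` stands for hosts whose Disk-Group holding the component is offline.

```python
def is_accessible(components, failed_hosts):
    """components: list of (host, kind, votes), kind being 'replica' or 'witness'."""
    total_votes = sum(votes for _, _, votes in components)
    alive = [(h, k, v) for h, k, v in components if h not in failed_hosts]
    alive_votes = sum(votes for _, _, votes in alive)
    has_full_replica = any(kind == "replica" for _, kind, _ in alive)
    # Needs a surviving data copy AND strictly more than 50% of the votes.
    return has_full_replica and alive_votes * 2 > total_votes


# Hypothetical placement of two namespace objects across a five-host cluster.
ns_unlucky = [("esx1", "replica", 1), ("esx2", "replica", 1), ("esx3", "witness", 1)]
ns_lucky = [("esx3", "replica", 1), ("esx4", "replica", 1), ("esx5", "witness", 1)]

failed = {"esx1", "esx2"}  # the two hosts that lost their cache device
print(is_accessible(ns_unlucky, failed))  # False: both replicas are gone
print(is_accessible(ns_lucky, failed))    # True: no component was on a failed host
print(is_accessible(ns_unlucky, {"esx1"}))  # True: one failure is tolerated
```

Under these assumptions, which VMs end up Invalid depends on where their components were placed, not on how much free capacity the surviving hosts have.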

As for the local VMFS vCenter issues: if the vCenter wasn't heavily customised, didn't plug in to a dozen different products, and didn't manage a huge, detailed inventory or a large number of hosts, then recreate it, even as a temporary measure.

Conversely, if the vCenter holds data and configurations that the business cannot afford to lose, then consider engaging GSS and/or a VMFS specialist, e.g. continuum.

Bob
