We have an issue with a VSAN cluster after removing some HDDs from a disk group and adding them to a new disk group within the same server.
Maintenance mode was "Ensure Accessibility".
Some of the VM objects are now degraded; unfortunately, two of their three components (the witness and one RAID1 component) are unavailable. See screenshot.
Is there any way to manually tell VSAN to resync from the one RAID1 component that is left?
Or is the vdisk data lost?
Normally it should start syncing instantly when degraded... did you check through RVC if it is syncing?
There has been no activity in vsan.resync_dashboard for the past 24 hours.
Moreover, vsan.check_state shows missing objects...
It seems one VM was lost completely.
I would be interested to know whether there is a way of resyncing from only the one object that is left in our three-server cluster.
We are now waiting for a VMware support webex session...
How this happened is still a mystery; we have to investigate further.
Does the '-r' option (refresh_state) to check_state help?
A few things to check...
1. Is your network between the hosts ok? Check the Network status in the VSAN Summary tab, and make sure there is no misconfiguration detected message.
2. Are all of your disks healthy? Go into the disk group view and make sure none are in an unhealthy state. You can probably use RVC to double check the disks too.
I have checked that before, also via RVC commands:
Network status is "Normal"
Disks are all "healthy"
Could it be that the disks we removed from the disk group still had data on them when we added them to the new disk group?
I would assume that VSAN disks are initialized when they are removed or added?
Nope - as long as you removed them through the UI or esxcli before reusing them, the partition information should be cleared.
Yes we did that.
We are still in contact with VMware support; no solution yet.
I now see two possibilities:
- Delete the affected vdisks (which includes the reconfiguration of many VMs)
- Force VSAN to sync from a RAID1 component without having the witness and the other RAID1 component available
I would prefer the second option: is there any "secret" command to resync from an object without having the majority?
Update: This issue is now under investigation at the VSAN escalation team.
Update: we now have the reply from the VSAN team. They told us that the vdisk data is lost if the majority is gone.
Nevertheless, we tried a workaround which finally worked:
We just copied the degraded vdisk data and all other VM files directly on the file level to a new folder - and voila, the vdisk was replicated again.
Somehow, I was not that surprised about this workaround... when there is still one RAID1 component available, the data must be there.
So there should also be a way to rebuild the data by ignoring the usual "resync only when there is a majority" procedure.
We are now in contact with the VSAN team about this workaround and whether it could be a lifesaver in such a worst-case scenario.
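For anyone reading along, the "majority" rule from the VSAN team's reply can be sketched in a few lines of Python. This is only a simplified model, not VSAN's actual implementation: it assumes one vote per component (two RAID1 replicas plus a witness), and the names Component, has_quorum and is_accessible are made up for illustration. It shows why the object in this thread stayed inaccessible even though one intact replica was still on disk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str        # hypothetical label, for illustration only
    votes: int       # assumption: one vote per component
    is_data: bool    # True for a RAID1 replica, False for a witness
    available: bool


def has_quorum(components):
    """True if strictly more than half of all votes are available."""
    total = sum(c.votes for c in components)
    live = sum(c.votes for c in components if c.available)
    return 2 * live > total


def is_accessible(components):
    """Simplified rule: quorum AND at least one available data replica."""
    has_replica = any(c.is_data and c.available for c in components)
    return has_quorum(components) and has_replica


# The situation described in this thread: witness and one replica gone.
thread_case = [
    Component("replica-a", 1, True, True),
    Component("replica-b", 1, True, False),
    Component("witness", 1, False, False),
]

# A normal single failure: only one replica is missing.
single_failure = [
    Component("replica-a", 1, True, True),
    Component("replica-b", 1, True, False),
    Component("witness", 1, False, True),
]

print(is_accessible(thread_case))     # False: 1 of 3 votes, no majority
print(is_accessible(single_failure))  # True: 2 of 3 votes plus a replica
```

In this model the surviving replica holds a full copy of the data, but with only one of three votes the object cannot reach a majority, which matches what we saw: no resync, yet the data was still readable at the file level.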
Hi,
Could you explain in a bit more detail what you did to recover your machine?
Thanks
Kevin