VMware Cloud Community
jkoebrunner
Enthusiast

VSAN degraded: how to solve?

We have an issue with a VSAN cluster after removing some HDDs from a disk group and adding them to a new disk group within the same server.

Maintenance mode was "Ensure Accessibility".

Some of the VM objects are now degraded; unfortunately, two of their three components are affected (the witness and one RAID1 component). See screenshot.

Is there any way to manually tell VSAN to resync from the one RAID1 component that is left?

Or is the vdisk data lost?

Johannes Köbrunner IT Solutions Architect Virtualization, Network and Storage Systems Frequentis AG VTSP, VCP, VCAP-DCD
11 Replies
depping
Leadership

Normally it should start syncing instantly when degraded... did you check through RVC if it is syncing?

jkoebrunner
Enthusiast

No activity in vsan.resync_dashboard for the last 24 hours :(

Moreover, vsan.check_state shows missing objects...

One VM seems to be completely lost.

I would be interested to know whether there is a way to resync from the single remaining component in our three-server cluster.

We are now waiting for a VMware support webex session...

How this happened is still a mystery; we have to investigate further.

CHogan
VMware Employee

Does the '-r' option (refresh_state) to check_state help?

http://cormachogan.com
jkoebrunner
Enthusiast

Unfortunately not; see attachment.

CHogan
VMware Employee

A few things to check...

1. Is your network between the hosts OK? Check the Network status in the VSAN Summary tab, and make sure there is no "misconfiguration detected" message.

2. Are all of your disks healthy? Go into the disk group view and make sure none are in an unhealthy state. You can probably use RVC to double check the disks too.

jkoebrunner
Enthusiast

I have checked that before, also via RVC commands:

Network status is "Normal"

Disks are all "healthy"

Could it be that the disks we removed from the disk group still had data on them when we added them to the new disk group?

I would assume that VSAN disks are initialized when they are removed or added?

CHogan
VMware Employee

Nope - as long as you removed them through the UI or esxcli before reusing them, the partition information should be cleared.

jkoebrunner
Enthusiast

Yes we did that.

We are still in contact with VMware support; no solution yet.

I now see two possibilities:

- Delete the affected vdisks (which includes the reconfiguration of many VMs)

- Force VSAN to sync from a RAID1 component without having the witness and the other RAID1 component available

I would prefer the second option: is there any "secret" command to resync from an object without having the majority?

jkoebrunner
Enthusiast

Update: This issue is now under investigation by the VSAN escalation team.

jkoebrunner
Enthusiast

Update: we now have the reply from the VSAN team. They told us that the vdisk data is lost if the majority is gone.
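
To make the "majority" rule concrete, here is a minimal Python sketch of how I understand it. It is a simplified model, not VSAN's actual implementation: the names `Component` and `has_quorum` are made up for illustration, and it assumes each component (two RAID1 replicas plus the witness) carries exactly one vote.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Component:
    """One component of an object (hypothetical model, not a VSAN API)."""
    name: str
    available: bool
    votes: int = 1  # assumption: every component has exactly one vote


def has_quorum(components: List[Component]) -> bool:
    """Return True if strictly more than half of all votes are available."""
    total = sum(c.votes for c in components)
    live = sum(c.votes for c in components if c.available)
    return 2 * live > total


# The situation in this thread: only one RAID1 replica is left.
degraded_object = [
    Component("raid1-replica-a", available=True),
    Component("raid1-replica-b", available=False),
    Component("witness", available=False),
]

# For comparison: a single failure, which a RAID1 object is meant to survive.
single_failure = [
    Component("raid1-replica-a", available=True),
    Component("raid1-replica-b", available=False),
    Component("witness", available=True),
]
```

In this model the object in our cluster has 1 of 3 votes, so `has_quorum(degraded_object)` is false even though a full copy of the data still exists on the surviving replica.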

Nevertheless, we tried a workaround which finally worked:

We just copied the degraded vdisk data and all other VM files directly on the file level to a new folder, and voila, the vdisk was replicated again. :)
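
As a rough illustration of what "copy on the file level to a new folder" means, here is a small Python sketch. It only models the idea of copying every file of a VM folder into a freshly created folder. The function name `copy_vm_folder` and any paths passed to it are hypothetical, and whether a generic file copy like this is appropriate on a VSAN datastore (as opposed to the datastore browser or another VMware tool) is an assumption, not something established in this thread.

```python
import shutil
from pathlib import Path
from typing import List


def copy_vm_folder(src: str, dst: str) -> List[str]:
    """Copy every regular file in src into a new folder dst.

    Returns the names of the copied files in sorted order.
    Raises FileExistsError if dst already exists, so nothing is overwritten.
    """
    src_dir = Path(src)
    dst_dir = Path(dst)
    dst_dir.mkdir(parents=True, exist_ok=False)

    copied = []
    for item in sorted(src_dir.iterdir()):
        if item.is_file():
            # copy2 copies the file contents and tries to preserve timestamps
            shutil.copy2(item, dst_dir / item.name)
            copied.append(item.name)
    return copied
```

The source folder is left untouched, so the original (degraded) files stay available until the copy has been verified.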

Somehow, I was not that surprised by this workaround: when there is still one RAID1 component available, the data must be there.

So there should also be a way to rebuild the data that bypasses the usual "resync only when there is a majority" procedure.

We are now in contact with the VSAN team about this workaround and whether it could be a lifesaver in such a worst-case scenario.

Bishop1337
Contributor

Hi,

Could you explain in a bit more detail what you did to recover your machine?

Thanks

Kevin
