"Remove the diskgroup from vSAN and add it back"

BHagenSPI · ‎04-07-2021

I've begun the process of upgrading my vSAN on-disk format from v7 to v13. Almost immediately, I got messages about being close to running out of disk space in my vSAN datastore, and also got this error on one of my cache drives:

Operational Health

Observed excessive log congestion, data evacuation is complete
Remove the diskgroup from vSAN and add it back

I have 6 hosts, each with 1 ssd and 6 hdds, and a disk group per host. The above error is pointing to the ssd on one of the hosts.

Does this mean:

1. That my disk group is now not adding capacity to vsan, and I can simply remove it and add it back with no impact?

2. That if I remove/re-add this group I'll lose this significant amount of storage, which *will* cause me to run out of disk space?

3. Something else?

I've searched everywhere, but I can't find an answer. I've created a ticket, but had to call it sev 2 since I'm not technically "down".

Support Request Confirmation Number:

21211390204

paudieo · ‎04-08-2021

I believe this is the effect of DDH.
Dying Disk Handling is a method that vSAN uses to check the health of disks/diskgroups in order to detect an impending disk/diskgroup failure or a poorly performing diskgroup due to congestion.

https://kb.vmware.com/s/article/2148358

I suspect the DG was unmounted automatically when vSAN detected congestion on the SSD cache tier, and so it is no longer contributing to capacity. More reading here:
https://core.vmware.com/resource/degraded-device-handling

There may have been movement of VMs or objects after the upgrade, and this may have led to congestion on a particular disk group.
Another possible cause of data movement is that large VMDKs may have been reformatted during the object conversion, which may have triggered a resync or movement of data.
see https://cormachogan.com/2021/02/09/vsan-7-0u1-object-format-health-warning-after-disk-format-v13-upg...

I suspect that putting the affected host in maintenance mode, deleting the affected DG and re-creating it, and letting the objects resync back will probably help.
vSAN also has a protection mechanism in case a resync fills up a diskgroup, and will suspend resyncs if it runs low on space.
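
As a rough way to reason about the capacity question before deleting the disk group, the arithmetic can be sketched in Python. This is only a minimal sketch assuming equally sized disk groups; the function name and all of the numbers are hypothetical (the thread does not give disk sizes), and it ignores vSAN overheads, slack space, and storage policy requirements, so treat it as a sanity check rather than a sizing tool.

```python
def capacity_after_removal(num_groups, group_capacity_tb, used_tb, removed=1):
    """Return (remaining_capacity_tb, utilization) once `removed` disk groups
    stop contributing capacity.

    Assumes every disk group has the same capacity. Ignores vSAN overheads,
    slack space, and storage policy requirements.
    """
    if not 0 <= removed < num_groups:
        raise ValueError("removed must be between 0 and num_groups - 1")
    remaining_tb = (num_groups - removed) * group_capacity_tb
    return remaining_tb, used_tb / remaining_tb


# Hypothetical example: 6 disk groups of 10 TB each, 42 TB of data in use.
remaining_tb, utilization = capacity_after_removal(6, 10.0, 42.0)
print(f"{remaining_tb:.1f} TB remaining, {utilization:.0%} utilized")
# prints "50.0 TB remaining, 84% utilized"
```

If the disk group has already been evacuated and unmounted, the datastore is effectively already running at the "removed" figure, so removing and re-adding the group should not shrink usable capacity any further.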

It's probably best to leave it to support to advise you, though, as they will probably want a closer look.
