Just out of curiosity, were you within a terabyte or so of free space on your VSAN datastore before experiencing issues?

Last night I deliberately ballooned the storage on my VSAN by attaching a 2TB VMDK to a VM and filling it up. The point was to simulate something a client might do, and to freshen up some automation tools in the process of fixing it. However, the exercise produced a very similar outcome to yours, so I thought I should share it; maybe some of the events I describe match what you experienced before/after your problem.

At the time, VSAN had 4TB of reported free capacity. The point of failure came when my VSAN hit around 1.2TB free space, with each disk per group at 800-900GB used (1TB each). The final VMDK to see mass data writes became completely unstable. After forcing a VM power down, that virtual disk, and in turn the VM, got marked "inaccessible" (I/O errors). Further inspection revealed a laundry list of absent/failed disk components.

To locate the UUID of the problematic object, I opened RVC and ran "vsan.obj_status_report ~<host> --print-table --filter-table 18/32". This printed a table showing only the objects that were inaccessible. "vsan.object_info ~0 <UUID>" reported 1.4TB of addressed space; since I was using the default storage policy for that disk, that meant 1.4TB logical and 2.8TB physical. VSAN resync was frozen attempting to sync one 82GB component, although that didn't prevent it from carrying out other sync/policy-related events.

Wanting to salvage the file server, I removed the VM from inventory and then added it back in. I also removed the affected disk from the VM, but did not delete it from disk (doing so would obviously result in an I/O error and the VM being marked "inaccessible" again). However, I was still left with 90% usage on my VSAN datastore, and the health check plugin was unable to repair the related object. So I jumped onto one of the hosts and ran "/usr/lib/vmware/osfs/bin/objtool delete -f -u <UUID>".
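In case anyone wants to walk the same cleanup, here's a rough sketch of the sequence. The cluster path and UUID are placeholders (fill in your own from the obj_status_report output); the vsan.* commands run inside an RVC session, while objtool runs directly on an ESXi host shell:

```shell
# --- Inside an RVC session ---
# List only the inaccessible objects for the cluster
# (the 18/32 filter shows just the broken-state rows):
vsan.obj_status_report ~cluster --print-table --filter-table 18/32

# Inspect a specific object by UUID to confirm its size,
# policy, and component state before touching it:
vsan.object_info ~cluster <UUID>

# --- On an ESXi host shell, as a last resort ---
# Force-delete the orphaned object. This is destructive and
# unrecoverable, so triple-check the UUID first:
/usr/lib/vmware/osfs/bin/objtool delete -f -u <UUID>
```

Only reach for objtool once the health check plugin has given up on repairing the object; deleting a UUID that still backs a live disk will trash it.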
There was also the side effect of some other objects on other disks losing some of their redundancy, probably the result of a scramble to re-arrange objects due to the ballooning usage. The health plugin quickly remedied that problem. I conducted the test with a max component size of 255GB, and plan on reducing it to 180GB for the next run.

Best,
-Retting
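For anyone reproducing the next run: as I understand it, the max component size is controlled by the VSAN.ClomMaxComponentSizeGB advanced setting, applied per host. A sketch of checking and lowering it (the 180 value is just my planned test value, not a recommendation):

```shell
# Show the current max component size (default is 255GB):
esxcli system settings advanced list -o /VSAN/ClomMaxComponentSizeGB

# Lower it to 180GB for the next test run.
# Repeat on every host in the cluster so placement stays consistent:
esxcli system settings advanced set -o /VSAN/ClomMaxComponentSizeGB -i 180
```

The change only affects components created after it is set; existing objects keep their current component layout until they are rewritten or rebalanced.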