Re: vSAN - Remediate Inaccessible Objects

JeremeyWise · ‎01-30-2023

I started this in the thread "Force Delete Partition" (VMware Technology Network, VMTN), but I figure it really should be an independent thread.

Goal: Clean up orphaned objects so that all vSAN errors are cleared and I can upgrade the disks to the latest version of vSAN.

I have a list of disk objects that are orphaned, but I need better direction on the root cause: how these objects come to exist, and when the issue is critical versus just something to clear out.

I will use one orphaned object on one host as the example for this thread, and build out workflow logic to remediate it.

If there is already a document or workflow for this, please direct me to it.

Example:

So object: d48fd262-e480-2ab0-5434-a0423f35e8ee

Starting with the host that has the absent object and disk: I checked that disk and it is listed as "healthy", so I figured I would see whether I could map the object to "is it being used", or whether it is just orphaned garbage that I can delete.

I tried to review all options of the command "/usr/lib/vmware/osfs/bin/objtool" that are not destroy / create:

####

getAttr    Get attributes of an object
    -W/--cid            Container UUID
    -u/--uuid           UUID of the object
    -c/--isComponent    Specified uuid is a component uuid
    -x/--diskUuid       Disk uuid of the component
    --bypassDom         Bypass DOM and read from LSOM
    -I/--snapId         Snapshot ID (optional)
    --format            Pretty print the output
                        (Currently, only JSON)

getExtents    Get extents information for a range of an object
    -W/--cid <value>    Optional container UUID.
    -u/--uuid           UUID of the object
    -o/--offset         Offset in bytes (or KB, MB, GB or TB)
    -e/--length         Length in bytes (or KB, MB, GB or TB)
    -b/--strict         Request exact extent information
    -I/--snapshotId     Optional snapshot ID.
    -P/--physical       Request the physical extents information
                        (Only for VSAN/ZDOM object)

getSnapshotDiff    Get snapshot diff for a range of an object
    -W/--cid <value>    Optional container UUID.
    -u/--uuid           UUID of the object
    -o/--offset         Offset in bytes (or KB, MB, GB or TB)
    -e/--length         Length in bytes (or KB, MB, GB or TB)
    -B/--firstSnapId    Optional base snapshot ID.
    -I/--snapId         snapshot ID.

###

[root@odin:~] /usr/lib/vmware/osfs/bin/objtool getAttr -u 523c50be-adf7-575e-e5b6-84dba98e3365
Failed to get object attributes : No such file or directory 131076.
object getAttr error: Failure
[root@odin:~]

So, not a great start for building out a workflow.

Questions:

1) Within vSAN, is an object a "chunklet" of block storage that is tagged with a UUID and tied to a specific VM, with replication done against that object ID? Or is each object just a "block" / "chunklet" of the total logical vSAN volume that is not associated with any VM, since vSAN would be ignorant of the VM data placed within it? If objects are based on VM vdisks, can we query which VM an object was underpinning, so we at least know which VM would be negatively affected by a change?

2) Is there any means to tag an object as "garbage" and then have vSAN run a health check, so that the object can be removed with some comfort? That is: if the object was underpinning something important, that would be repaired first, and the object could then be deleted once the data has passed the cleanup check.

PS:

As this is a home lab and I have backups, I just figured what the heck, let's try to force delete one: roll the dice and see how bad things get.

Ex: d48fd262-6556-dfd7-a658-a0423f35e8ee, also on the same host "odin"

Before:

[root@odin:~] /usr/lib/vmware/osfs/bin/objtool delete -u d48fd262-6556-dfd7-a658-a0423f35e8ee -f -v10
[root@odin:~]

After:

So.. I guess we sit back and see who dies 🙂

Nerd needing coffee

JeremeyWise · ‎01-31-2023

Due to the timeline for getting the cluster repaired and moving forward to finish the upgrade, I just pulled the trigger and deleted all of the orphaned objects.
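For anyone repeating this across a longer list of orphaned objects, here is a minimal sketch of a dry-run helper. It only builds the command lines and never runs them, so they can be reviewed before being pasted into the ESXi shell. The function name `build_delete_commands` is made up for illustration; the objtool path and the `delete -u <uuid> -f -v10` form are copied from the command used earlier in this thread, and nothing beyond that is assumed about objtool.

```python
import re

# Path and flags exactly as used earlier in this thread.
OBJTOOL = "/usr/lib/vmware/osfs/bin/objtool"

# Object UUIDs in this thread look like d48fd262-6556-dfd7-a658-a0423f35e8ee.
UUID_RE = re.compile(r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$")


def build_delete_commands(uuids):
    """Return one 'objtool delete' command line per unique, well-formed UUID.

    Hypothetical helper: it only formats strings and never executes anything.
    """
    commands = []
    seen = set()
    for raw in uuids:
        uuid = raw.strip()
        if not uuid or uuid in seen:
            continue  # skip blank lines and duplicates
        if not UUID_RE.match(uuid):
            raise ValueError(f"does not look like an object UUID: {raw!r}")
        seen.add(uuid)
        commands.append(f"{OBJTOOL} delete -u {uuid} -f -v10")
    return commands


if __name__ == "__main__":
    for line in build_delete_commands(["d48fd262-6556-dfd7-a658-a0423f35e8ee"]):
        print(line)
```

Raising on a malformed entry instead of skipping it is deliberate: a typo in a UUID list is better caught before anything is force deleted.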

I did test a few servers (starting with the one listed as having the orphaned object) by putting them into maintenance mode with full data migration, to see whether this triggered an error or fault. Nothing happened, so I A$$ume that the data is OK.

After this object cleanup, the disk upgrade to vSAN version 17 completed without error.

Nerd needing coffee

peetz · ‎02-03-2023

Hello Jeremy,

okay, you did the right thing. Just some thoughts here, because I have been in the same situation a few times:

If a VM is shown as available (in the inventory) and the components of its home and virtual disk objects are compliant with the assigned storage policy, then it won't be affected by cleaning up any inaccessible objects. If an inaccessible object is part of an existing VM, then this VM would also be shown as inaccessible or orphaned.

So whenever you see inaccessible vSAN objects, you should just check whether all your VMs are accessible and compliant with the assigned storage policy. If yes, then you can go ahead and clean up the inaccessible objects. A good how-to reference is this blog post: https://www.thinkcharles.net/blog/2018/2/16/removing-inaccessible-objects-in-vsan.
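The rule of thumb above can be written down as a small pre-check. This is only a sketch: `VmStatus`, `blocking_vms` and `safe_to_clean` are hypothetical names, and how the two flags are collected (vSphere Client, an inventory export, and so on) is left open.

```python
from dataclasses import dataclass


@dataclass
class VmStatus:
    """One VM as seen in the inventory (hypothetical record, filled in by hand or by script)."""
    name: str
    accessible: bool        # VM is shown as available, not inaccessible or orphaned
    policy_compliant: bool  # home and virtual disk objects comply with the storage policy


def blocking_vms(vms):
    """Return the names of VMs that are inaccessible or out of compliance."""
    return [vm.name for vm in vms if not (vm.accessible and vm.policy_compliant)]


def safe_to_clean(vms):
    """True only if every VM is both accessible and policy compliant."""
    return not blocking_vms(vms)


if __name__ == "__main__":
    # Made-up inventory, purely for illustration.
    inventory = [
        VmStatus("vm-a", accessible=True, policy_compliant=True),
        VmStatus("vm-b", accessible=True, policy_compliant=False),
    ]
    print(safe_to_clean(inventory), blocking_vms(inventory))
```

If `blocking_vms` returns anything, those VMs should be looked at first, before any inaccessible object is removed.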

So what causes objects to become inaccessible unexpectedly? For me this always happened after a host in a vSAN cluster had been down for a longer time (e.g. because of a hardware issue). After 60 minutes of downtime (the default repair delay), vSAN will repair all objects that are affected by the downtime (i.e. that have copies stored on the failed host) and restore their compliance with the assigned storage policy by recreating the missing copies on the remaining hosts. If the failed host then comes back online, it still has the old (now redundant) object copies stored on its disks, and under some weird circumstances (which I cannot reliably reproduce) these artifacts are incorrectly identified as parts of unknown objects that are now incomplete and unrecoverable, hence inaccessible. In this scenario you will notice that all the inaccessible objects are owned by the host that failed previously.
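A quick way to test whether a list of inaccessible objects fits this pattern is to tally them by owner host. The sketch below assumes you already have (object UUID, owner host) pairs from whatever tool you use to list the objects. The function names are hypothetical, and the sample pairing of the two UUIDs from this thread with host "odin" is illustrative only.

```python
from collections import Counter


def owners_of_inaccessible(objects):
    """Count inaccessible objects per owner host.

    objects: iterable of (object_uuid, owner_host) pairs.
    """
    return Counter(owner for _uuid, owner in objects)


def single_owner(objects):
    """Return the owner host if all objects share one owner, else None."""
    counts = owners_of_inaccessible(objects)
    return next(iter(counts)) if len(counts) == 1 else None


if __name__ == "__main__":
    # Illustrative input only: UUIDs from this thread, owner assumed to be "odin".
    sample = [
        ("d48fd262-e480-2ab0-5434-a0423f35e8ee", "odin"),
        ("d48fd262-6556-dfd7-a658-a0423f35e8ee", "odin"),
    ]
    print(owners_of_inaccessible(sample))
    print(single_owner(sample))
```

If `single_owner` returns a host that was recently down for longer than the repair delay, that matches the scenario described above. If the objects are spread over several owners, something else is probably going on.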

Maybe this also happened to you.

Andreas

Twitter: @VFrontDe, @ESXiPatches | https://esxi-patches.v-front.de | https://vibsdepot.v-front.de
