VMware Cloud Community
BHagenSPI
Enthusiast

7.0.1c VSAN resync stuck with 1 object in the queue

Hi; I just upgraded our VCSA and our 6 ESXi hosts from 6.7 to 7.0.1 (build 17325551). We use vSAN in the cluster (I believe it was 6.2). I followed the prompts to upgrade the disk version. It took about 6 days, and as of yesterday it was saying there was 1 object left to resync, with 4 hours left.

This morning, I still have 1 object left to resync, with 568 days left, and climbing. ??

 

How can I fix this?

Under Cluster > Configure > vSAN > Disk Management, I see this message:
"All 42 disks on version 13.0 but with older vSAN objects."

Under Cluster > Monitor > Resyncing Objects > Object List (filter set to Queues), I see this message:
"1 object in the queue (unable to provide a list of queued object)"

 

12 Replies
TheBobkin
Champion

@BHagenSPI, So I have some knowledge of this issue and had intended to assist with documentation (and did inform some relevant parties), but I have not had the time to sit down and write any KBs on this yet.


For larger Objects (e.g. those that are auto-striped because they exceed the 255 GB maximum component size) there is a fairly significant change in component layout when updating to Object format v13. This layout change is what enables the now significantly lowered guidance on slack-space requirements for Storage Policy reconfigurations, as it changes how Objects are rebuilt during any Storage Policy change that requires a 'deep-reconfig' (i.e. a whole new layout of the Object before the previous components are removed, e.g. changing from stripe-width=1 to stripe-width=2 or from RAID-1 to RAID-5).

 

But, in order to facilitate these changes, all Objects need to undergo one last reconfiguration that requires the same space as if they were doing a deep-reconfig. For example, a 1 TB vmdk (2 TB physically used on disk with FTT=1, FTM=RAID1, assuming full/thick-provisioned for clarity's sake) will require 2 TB of free space across viable Fault Domains to perform this reconfiguration (e.g. in a 2-node cluster, 3 TB free on one node and 500 GB free on the other will not suffice).
This process can be problematic in certain scenarios, e.g. one I recently came across where an iSCSI target was consuming more than half of the available storage of a 2-node cluster. Performing a deep-reconfig of such an Object without at least as much free space as it consumes (and relative to the datastore size) is like asking a semi-truck to turn around in a normal-sized house's driveway.

 

"This morning, I still have 1 object left to resync, with 568 days left, and climbing. ??"
From the engineering communication I am aware of, if there is not sufficient space (and in applicable Fault Domains) this is intended to time out rather than behave like this; perhaps this requires further tooling.

 

"How can I fix this?"
The Object in question can be fairly easily identified as it will be the only one on a lower Object format:
# esxcli vsan debug object list -all > /tmp/objout
And then just run the following to see what version the recalcitrant Object is on (it should be the only one not on v13, and we are not going to guess which version that is):
# grep Version /tmp/objout | sort | uniq
Then either less the /tmp/objout file and search it for the Object in question (e.g. 'ESC' '/' Version: XX, using whatever version from the above that is not Version: 13), or grep against it (e.g. grep "Version: XX" -B1 -A200) to determine the identity and size (and used size) of the Object.
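As a rough combined sketch of the above (the "Version: 10" below is purely illustrative, substitute whatever non-13 version the sort/uniq actually shows, and the exact field labels can vary a little between builds):
# grep "Version: 10" -B1 -A200 /tmp/objout | grep -E "Object UUID|Version|Size|Used|Path"
That should print just the UUID, size/used figures and path of the lagging Object without paging through the whole dump.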
Once it has been identified (assuming it is just a space issue), it should be clearer how much space is required to reconfigure this Object (bearing in mind that we don't resync anything that would push a disk above 95% used). Freeing that space may be a case of deleting unneeded test VMs/detritus, consolidating snapshots, temporarily moving something off this cluster, or, if possible (e.g. you have backups), temporarily setting this Object or another large Object to FTT=0 (NOTE, THIS IS ONLY APPLICABLE IF IT IS CURRENTLY FTM=RAID1) so that there is enough space to perform this one last true deep-reconfig.

 

BHagenSPI
Enthusiast

Thanks a ton, @TheBobkin ! Your commands were spot-on (except in the first one I needed to type --all rather than -all!) and I was able to identify the object. Of course, the offending object is part of a replica of our largest VM (27TB), so no wonder it's having issues. (We have 70TB free on that datastore.)

Interestingly, this investigation has led me to see that replication of this VM had been failing (even though I'd been getting "success" messages), so this has been a double-win for us.

Is there a way to "stop" the reconfig of this component until I can get that vm replicated correctly, and then re-start the reconfig?

TheBobkin
Champion

@BHagenSPI, More than happy to be able to provide some insight (and sorry about the typo; typos in commands are taboo in my opinion, and I would have spotted it had I been running the command rather than writing it from memory in Notepad++ 😉).
Good to hear this also helped spot another issue, as that is not something you want to discover after the fact if you ever needed to fail this over.

What is the Storage Policy applied to the Object in question, how many nodes are in the cluster, in what configuration (e.g. stretched 3+3+1), and how much free space is there on each node? (Via RVC, vsan.disks_stats <pathToCluster> is the clearest option; you can PM me this output if you do not want to post it here.)
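For reference, a minimal sketch of pulling that from the VCSA shell (the datacenter/cluster names below are placeholders; tab completion in RVC fills in the real paths):
# rvc administrator@vsphere.local@localhost
> vsan.disks_stats /localhost/<Datacenter>/computers/<Cluster>
That prints per-disk capacity and usage for every host in the cluster, which makes it easy to spot whether one node or device is the placement bottleneck.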
I am not aware of any means of cancelling this; it is intended to time out after some period (15 hours if I recall correctly, or something in that region) if there is no progress made. However, that might be part of the problem here if it thinks it *is* making progress (though with a non-helpful estimated completion date!).
In the vSphere Client under Cluster > Monitor > vSAN > Resyncing components, do you see any activity that might indicate it is trying to move some data around in order to accommodate the deep-reconfig of this VM? (e.g. 'Intent' showing rebalance)

Is there any particular reason you wouldn't/can't get the replication issue corrected first, so that you could then consider potentially less desirable options such as temporarily setting that Object to FTT=0? (And yes, I am thoroughly aware of the implications of this and am not suggesting it as Plan A.)

BHagenSPI
Enthusiast

@TheBobkin , no worries about the typo; it held me up for approximately 1.2 minutes, so...no biggie. 🙂

I actually have gotten the replication issue corrected; I deleted all the snapshots of that specific VM and am re-running the replication job. As you can imagine, it's going to take a while! (The job's been running for 5hr thus far, and it's 21% complete...2.8TB transferred.)

I'm guessing I won't start seeing movement on the resync until that job is complete, so I'll put this post on "pause" in my mind, and report back with the results. At that point if I still have questions I'll pm you our setup.

Thanks again!

PS - to answer your question, Cluster > Monitor > vSAN > Resyncing components still shows pretty much exactly the same thing as the screenshot I posted yesterday. Just one object "queued", but not able to tell what that object is. Again, no worries right now; we'll wait for the job to finish and see what happens then.

BHagenSPI
Enthusiast

Update: the replication job finished a few hours ago and must have released a logjam, because now I have 27 resyncing objects, 4.8 TB worth, that'll take another 13 hrs.

I went ahead and set that VM to FTT=0 temporarily (from FTT=1 + 12 stripes).

Guess I'll report back tomorrow!

Good thing I did this on our DR cluster first, where 99% of the vms are powered off. I don't think it's going to be doable on our production cluster; seems like it would take a year to complete. 😞

TheBobkin
Champion

"27 resysncing objects"
@BHagenSPI, Do you mean replication sync or vSAN resync and what is the stated intent? (e.g. rebalance)
I wonder if it was previously unable to show the identity of the Object in question due to it being seen as unassociated (e.g. if it was not directly attached to a VM registered in inventory due to replication or the VM not being registered).

"FTT1 + 12 stripes"
Stripe-width=12 here could be implicated in it having issues with placing components but this depends on how many capacity-tier disks per node, how many nodes, free space per device and whether stretched or not.
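As a rough back-of-the-envelope with made-up numbers: a thick 10 TB vmdk at FTT=1/RAID-1 means two ~10 TB mirrors, and with the 255 GB component cap each mirror is already split into roughly 40 components (10 TB / 255 GB ≈ 40); layering stripe-width=12 on top additionally requires each stripe set to sit on 12 distinct capacity devices, so on a modest cluster the valid placement combinations (and the contiguous free space per device) shrink fast.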

"Good thing I did this on our DR cluster first, where 99% of the vms are powered off. I don't think it's going to be doable on our production cluster; seems like it would take a year to complete."
This is another maybe not so apparent benefit of having a DR site: these are often basically a less beefy replica of the production cluster they replicate from, and thus their storage and data composition are likely as close as one can get to a copy of that cluster. Any unforeseen issues with updates/upgrades/migrations can therefore be ironed out there and better prepared for in the primary cluster.

In this case, before considering doing the same on the primary cluster, I think you should take a good thorough look at what is on both clusters: which Storage Policies are used and what they are used on, how much free space there is and where (and anything that might pose a problem in utilising it as one might expect).
Next step would be to do some housecleaning:
- Consolidate any large snapshots (you would be astounded how often I hear 'nothing is running on snapshots' or 'we don't have any snapshots' and then find dozens to hundreds of them, using multiple TB in total in extreme cases); a simple find / | grep 0000 should find them, or check against the debug object list (see the sketch below).
- Delete or trim down (e.g. remove just the unneeded large disks on) any VMs no longer needed, and don't assume there are none based on just what is registered in inventory; there could be unregistered ones.
- Validate that there are not any unexpected unassociated Objects (e.g. anything incompletely deleted, or unregistered and abandoned).
- Configure and run TRIM/UNMAP if you think there could be significant savings (and the data is suited to this, e.g. RAID-1 not RAID-5).
- If there are any awkward Objects (e.g. stripe-width=12), then maybe find a way (at least for the time being) to make them less awkward, or if feasible temporarily move them off the vsandatastore and then back (which basically will create them as new Objects in the new format) to take them out of the equation.
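As a rough sketch of that snapshot hunt from an ESXi host's shell (the datastore name is a placeholder and this assumes the usual -000001.vmdk delta naming; the descriptor files it finds are tiny, so cross-check the real space used against the debug object list output):
# find /vmfs/volumes/<vsanDatastoreName>/ -name "*-00000*.vmdk"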

BHagenSPI
Enthusiast

Once again, excellent advice @TheBobkin ; I'd already started putting together a list of things to check for the production cluster, and had forgotten about snapshots...taking care of those will make a huge difference.

As for the resync: I still have one object that's "stuck" re-syncing. This time, it's a 450 GB object, intent is "compliance", status is "queued".

I ran the commands you gave me earlier and I have found the object UUID of the only "Version: 7" object. It's FTT=1 + stripe 12, but for the life of me I can't figure out what VM that object is attached to (or where it's orphaned). Is there a way to look up a UUID and translate it to a file that I can see in the datastore?

TheBobkin
Champion

@BHagenSPI, Hopefully this was not another case of 'there were certainly not TBs of snapshots', but if it was, happy to help you free up some space.


If you look in the 'esxcli vsan debug object list --all' output you can find the Object via its 'Object UUID', and the listing should state the Object's path (unless it is severely inaccessible or otherwise impaired, in which case running /usr/lib/vmware/osfs/bin/objtool getAttr --bypassDom -u ObjectUUIDHere from a node that has an active component should be able to identify it (provided it has an active component, which is a different story)). In the debug output, does it perchance say '(Missing)' after the path?
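For example, a quick sketch of pulling just that Object out of the earlier dump (the UUID is a placeholder and the -A value is just a guess at how many lines to show):
# grep -A40 "Object UUID: ObjectUUIDHere" /tmp/objout
The 'Path:' and 'Directory Name:' lines in that block are usually the quickest way to map a UUID back to something recognisable in the datastore.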

BHagenSPI
Enthusiast

I'm seeing that *all* the replica VMs have one snapshot...I'd guess because I'm keeping 2 replicas per VM (I'm using Veeam to do the replication from backup copy jobs, and have set the replica retention to 2). Changing the 27 TB replica to FTT=0 temporarily seems to have fixed that "stuck" object; the currently stuck object isn't associated with that huge VM...it seems to be in the .vsan.stats directory?

Type: vmnamespace
Path: /vmfs/volumes/vsan:uuid-of-stuck-object/.vsan.stats (Exists)
Group UUID: UUID
Directory Name: .vsan.stats

The object is FTT=1 + stripe 12, so there are lots of entries under "components" (raid_0). I don't see an obvious way to set that folder to FTT=0 to get them unstuck...thoughts?

TheBobkin
Champion

@BHagenSPI, That likely isn't 'stuck' in the conventional sense; it is just a bit more complicated than the average Object. .vsan.stats is where the historical vSAN Performance data that backs the pretty graphs in the UI is stored (and hence why it can go back quite a long time). Why this is a mite complicated is that this Object is delegated to be written to by a single node (the stats master), which pulls the local data from all the other nodes; due to its function it can basically be 'held' by a number of services, and thus it is relatively common for these to prevent it being updated/changed.


I would advise first seeing if you can change the Storage Policy for the stats Object (Cluster > Configure > vSAN > Performance > edit the Storage Policy assigned). If this doesn't help, and provided you don't have any critical need for the historical performance data (e.g. not tracking metrics or working on a performance issue), then you can consider removing it from the equation by disabling the performance service (which deletes the Object) and re-enabling it to create a new one.

rwmastel
Contributor

I was upgrading disks from vSAN 4 to 10 and had this same problem; it was also a stats object. Turned the performance service off and back on, problem solved. Thanks!

BHagenSPI
Enthusiast

Well, I've been out sick the last week, so I'm just getting back to this. I had licensing issues when I moved to ESXi 7.0.1; the hosts got the eval license, which I didn't realize, and then I couldn't assign our "standard" license (note to self: update licenses *before* upgrading ESXi!!). So I had to disconnect each host from vCenter Server, upgrade the license on the host itself, and then re-connect it to vCenter Server.

Obviously that cleared the performance counters, and now I have no stuck files any more. 🙂

So, I guess this "case" is closed. But I've learned a lot about the upgrade process, and the things I need to do to prep our production environment for the upgrade. 

Thank you very much for your excellent help @TheBobkin!!
