Solved: Reduced availability with no rebuild

lspin · 02-26-2023

Hi,

I'm looking for assistance resolving an unhealthy object in our vSAN. We are in the process of upgrading our environment from 6.7 to vSphere 7.0 and will need to correct this issue before upgrading. Unfortunately, we cannot open a VMware support request until we complete the upgrade to 7.

Skyline health displays the following for the object:

When I check the physical placement of the object it shows the following:

I haven't been able to find many vSAN troubleshooting KBs to help resolve this, so I'm reaching out here. If anyone can help, it would be greatly appreciated. Thanks.

TheBobkin · 02-27-2023

@lspin "Unfortunately, we cannot open a VMware support request until we complete the upgrade to 7."
Actually, this isn't true. vSAN/ESXi 6.x is past the end of its General Support period but is still in the Technical Guidance phase, so you can still open a Support Request. It just means that you cannot open a P1, we cannot engage engineering for any issues, and VMware GS are not bound to join live support calls (i.e. the primary means of providing assistance and analysis will be log analysis and email, but if an engineer decides they want to look at the issue live, they can).

So, we can infer lots of things from the RVC output there, but I will keep to just a few summary points:

1. What the object is/was (from the 'path' field) can likely be determined by running the following on node xxx.xxx.xxx.135:
# /usr/lib/vmware/osfs/bin/objtool getAttr -u 21cba663-06db-837b-6275-20677ceb34c8 --bypassDom

2. Should that not be possible (for whatever reason), it can at the very least be determined which VM this belonged to from the Group UUID of the object ("groupUuid"=>"4ab55063-9481-3be5-e26c-b883034df502") via the same means. --bypassDom should not be necessary here, assuming the VM still exists and is healthy; if the VM is no longer present, nothing will be returned:
# /usr/lib/vmware/osfs/bin/objtool getAttr -u 4ab55063-9481-3be5-e26c-b883034df502

3. Finally, an interesting point regarding the state of the object: it has the flag "deleteState"=>1. This means that a delete of this object was triggered but clearly did not complete. I would advise running the following on node xxx.xxx.xxx.135, which is the DOM Owner of this object. This will try to refresh the ownership and state of the object, which may result in the object deletion completing, OR the object may become healthy or otherwise change state (generally to the point that, if the previous identification with getAttr failed, it will then work and the object can be deleted after identification):
# vsish -e set /vmkModules/vsan/dom/ownerAbdicate 21cba663-06db-837b-6275-20677ceb34c8

Edit: there was a typo in the last command.
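
The three steps above can be strung together as a dry run. What follows is a minimal sketch rather than a tested procedure: it only checks that the UUIDs have the 8-4-4-4-12 hexadecimal shape seen in this thread and prints the commands from this post for review, without executing anything. The function names is_uuid and print_plan are made up for illustration and are not part of any VMware tool.

```shell
#!/bin/sh
# Dry-run helper: prints the commands from the post so they can be reviewed
# before being pasted on the DOM owner node. Nothing is executed here.
# is_uuid and print_plan are hypothetical names used only in this sketch.

OBJTOOL=/usr/lib/vmware/osfs/bin/objtool

# Succeeds if $1 has the 8-4-4-4-12 lowercase hex shape of a vSAN UUID.
is_uuid() {
    printf '%s\n' "$1" | grep -Eq '^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$'
}

# print_plan OBJECT_UUID GROUP_UUID
print_plan() {
    for u in "$1" "$2"; do
        if ! is_uuid "$u"; then
            echo "not a valid UUID: $u" >&2
            return 1
        fi
    done
    # Step 1: identify the object itself, bypassing DOM.
    echo "$OBJTOOL getAttr -u $1 --bypassDom"
    # Step 2: identify the owning VM via the group UUID.
    echo "$OBJTOOL getAttr -u $2"
    # Step 3: ask the DOM owner to abdicate so ownership and state refresh.
    echo "vsish -e set /vmkModules/vsan/dom/ownerAbdicate $1"
}

# Usage with the UUIDs from this thread:
print_plan 21cba663-06db-837b-6275-20677ceb34c8 4ab55063-9481-3be5-e26c-b883034df502
```

Printing instead of executing is deliberate: the ownerAbdicate step changes object state, so it seems safer to read the exact command back before running it by hand.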


lspin · 02-27-2023

I've checked disk management for any disks that may be failed on the host, but none show as "Evacuated" or failed. I also checked the iLO's storage to see if there were any disks in predictive failure or degraded state and all show as healthy.

I then ran the RVC commands to locate the object which reports as "Unknown disks" and generated the output in the attached "RFC_outputs.txt" file.

I'm not sure where to go from here. It seems like it could be left over from a VM that has since been deleted, but I want to be sure before deleting the object.
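
For anyone working through a saved RVC output file like this one, individual fields can be pulled out with a small filter. The sketch below assumes the fields appear as Ruby-style "key"=>value pairs, as in the "groupUuid" and "deleteState" values quoted in this thread. rvc_field is a hypothetical helper name, and the sample line is shortened for illustration rather than copied from the real output.

```shell
#!/bin/sh
# rvc_field KEY [FILE]
# Prints every value of "KEY"=>value found in RVC object output, one per line.
# Reads FILE if given, otherwise standard input. Hypothetical helper.
rvc_field() {
    key=$1
    shift
    # Match "key"=>"string" or "key"=>number, then strip down to the bare value.
    grep -Eo "\"$key\"=>(\"[^\"]*\"|[0-9]+)" "$@" | sed -e 's/^.*=>//' -e 's/"//g'
}

# Illustrative input only; against a real file: rvc_field deleteState RFC_outputs.txt
sample='"groupUuid"=>"4ab55063-9481-3be5-e26c-b883034df502", "deleteState"=>1'
printf '%s\n' "$sample" | rvc_field groupUuid
printf '%s\n' "$sample" | rvc_field deleteState
```

A deleteState value other than 0 is the detail worth looking for before deciding whether the object is a leftover.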

lspin · 02-27-2023

Thanks @TheBobkin

I ran the commands as suggested, output is attached.

Looks like it's a vmdk for one of our placeholder VMs. This is a DR cluster that holds all the replica VMs from the primary site using vSphere Replication and SRM. This VM is not powered on but syncs periodically.

Ran the "vsish -e set" command against the object's UUID. vSAN object health now lists the object with the "Reduced availability with no rebuild - delay timer" status.

Resync objects shows this is scheduled to be resynced in the next 60 minutes.

TheBobkin · 02-27-2023

You are most welcome @lspin - happy to help.

If it is an active VM (e.g. in-use/in-inventory) then it is possible this hbrdisk object is part of active replication data. However, this wouldn't explain why it had the "deleteState"=>1 flag on it. That would suggest that something possibly went wrong with creating the replication object, or validation of it failed, and it tried to cancel it (e.g. how some backup solutions will mark a backup-created snapshot object as failed and just create a new one).

Either way, you can try to resync that object back to a healthy state (Cluster > Monitor > vSAN > Skyline Health > Data > 'Repair Objects Immediately' button) or wait for the repair delay timer (default 60 minutes) to expire and start this. Whether it will repair back to a healthy state can depend on other factors (e.g. maybe it didn't repair previously due to medium errors on disk or something of that ilk).
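
To make the timing concrete, the arithmetic behind the delay timer can be sketched as below. minutes_until_repair is a hypothetical helper, and the 60-minute default is the one mentioned in this post. On a host, the configured value can typically be read with `esxcfg-advcfg -g /VSAN/ClomRepairDelay`, though on newer releases the timer is set at the cluster level, so treat that command as an assumption to verify for your build.

```shell
#!/bin/sh
# minutes_until_repair ABSENT_MINUTES [DELAY_MINUTES]
# Prints how many minutes remain before the delayed repair would start.
# DELAY_MINUTES defaults to 60, the default mentioned in the post.
# Hypothetical helper: it does plain arithmetic and queries nothing on the host.
minutes_until_repair() {
    absent=$1
    delay=${2:-60}
    remaining=$((delay - absent))
    # A timer that has already expired has nothing left to wait for.
    if [ "$remaining" -lt 0 ]; then
        remaining=0
    fi
    echo "$remaining"
}

minutes_until_repair 25      # component absent for 25 minutes: prints 35
minutes_until_repair 90 60   # timer already expired: prints 0
```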

If you wanted to be super-cautious, you could analyse further whether the object in question is an active part of replication data, and/or stop and then restart replication for this VM to ensure all replicated data is reliable, intact and as it should be.
