Solved: Re: Health Error - Inaccessible Objects

alainrussell · ‎04-27-2018

Hi,

Running vSAN 6.6, vSphere 6.5d.

We upgraded firmware on some of our Dell R730xd machines this morning and hit a vSAN error during the upgrade. On the third host we upgraded, one of the disks was marked as permanently failed by vSAN and its data was rebuilt to other hosts. While the data was rebuilding, the failed disk returned to healthy status (and has since flipped back to permanently failed, and again to healthy). I’ve since powered this host down and restarted it to see if it was a transient error (iDRAC shows the disk as healthy).

Everything appears to have redistributed through the vSAN cluster: all VMs are showing as healthy and their storage policy is showing as compliant. Unfortunately, the vSAN health check is showing an object error for 2 objects that appear to be orphaned on the disk that was marked as failed. These show as “other” objects and are not linked to current VMs as far as I can tell.

I’ve checked the state of VSAN and these show as below.

> vsan.check_state vcsa.domain.com/DataCenter/computers/Cluster

2018-04-28 04:09:56 +0000: Step 1: Check for inaccessible vSAN objects

Detected 2 objects to be inaccessible

Detected 7bc0e35a-4c43-381c-3580-ecf4bbe027b8 on esx03.domain.com to be inaccessible

Detected b1c0e35a-5d82-c0bd-b943-ecf4bbe027b8 on esx03.domain.com to be inaccessible

Trying to purge them does not appear to work.

> vsan.purge_inaccessible_vswp_objects vcsa.domain.com/DataCenter/computers/Cluster

2018-04-28 04:39:44 +0000: Collecting all inaccessible vSAN objects...

2018-04-28 04:39:44 +0000: Found 2 inaccessbile objects.

2018-04-28 04:39:44 +0000: Selecting vswp objects from inaccessible objects by checking their extended attributes...

2018-04-28 04:39:45 +0000: Found 0 inaccessible vswp objects.

I’ve seen some blog posts online about using objtool to delete the objects, but they say to check with support before running it to make sure you are running it correctly. If I try to get details of the failed object, I currently get the error below (this is run from the host the objects are on).

[root@esx03:~] /usr/lib/vmware/osfs/bin/objtool getAttr --bypassDom -u b1c0e35a-5d82-c0bd-b943-ecf4bbe027b8 -c

Failed to find lsom object: Not found

Failed to get disk uuid

object getAttr error: Failure

Any idea on how to handle these orphaned objects?

(I've opened a support case with Dell; they provide our VMware support, but they can be very slow for anything vSAN related.)

Thanks.

TheBobkin · ‎04-28-2018

Hello Alain,

Welcome to Communities.

We (GSS support) advise opening an SR not just so that Objects are deleted correctly but so that it is established what is being removed and how it got into that state.

When trying to identify inaccessible components you need to specify the active (state 5) component and run this on the host that holds this component (thus the bypassDom flag) - in your example you are specifying the Object UUID.

You can find the remaining active componentUUID using:

# cmmds-tool find -t DOM_OBJECT -f json -u 7bc0e35a-4c43-381c-3580-ecf4bbe027b8

Then use this to query the active component (on the host that is the DOM_Owner) as you were doing.
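As a rough illustration of picking out the state-5 component, the sketch below walks the JSON from the cmmds-tool command above and groups component UUIDs by state. The key names it relies on (entries, content, child-N, attributes, componentState, componentUuid) are an assumption about the shape of that output rather than something confirmed in this thread, and the sample data is made up, so check it against real output before relying on it.

```python
import json


def iter_components(node):
    """Yield (componentUuid, componentState) for every component under node.

    Assumes components are dicts carrying a "componentUuid" key and an
    "attributes" dict with "componentState"; this layout is an assumption.
    """
    if isinstance(node, dict):
        if "componentUuid" in node:
            state = node.get("attributes", {}).get("componentState")
            yield node["componentUuid"], state
        for value in node.values():
            yield from iter_components(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_components(item)
    elif isinstance(node, str) and node.lstrip().startswith("{"):
        # Some fields may hold JSON encoded as a string; try to decode them.
        try:
            decoded = json.loads(node)
        except ValueError:
            return
        yield from iter_components(decoded)


def components_by_state(raw_json):
    """Group component UUIDs by their reported state number."""
    grouped = {}
    for uuid, state in iter_components(json.loads(raw_json)):
        grouped.setdefault(state, []).append(uuid)
    return grouped


# Made-up sample with placeholder UUIDs, shaped like the assumed output.
SAMPLE = json.dumps({
    "entries": [{
        "uuid": "object-uuid-placeholder",
        "type": "DOM_OBJECT",
        "content": {
            "type": "Configuration",
            "child-1": {
                "type": "RAID_1",
                "child-1": {"type": "Component",
                            "attributes": {"componentState": 5},
                            "componentUuid": "component-a"},
                "child-2": {"type": "Component",
                            "attributes": {"componentState": 6},
                            "componentUuid": "component-b"},
            },
        },
    }]
})

if __name__ == "__main__":
    print(components_by_state(SAMPLE))
```

Any UUID listed under state 5 would be the one to pass to objtool getAttr --bypassDom -c on the host that holds it; an empty state-5 list corresponds to the "all components are state 6" situation.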

Bob

alainrussell · ‎04-28-2018

Thanks Bob

The command you sent outputs as attached (for the 2 objects, nothing in state 5?), I'm happy to wait for GSS to be engaged through Dell (I'm currently dumping hardware logs for them which I'm not sure are going to help .. so might be a bit of a waiting game).

For now, all the VMs are reporting healthy - I think I'd struggle to put the host in MM based on the objects reporting inaccessible but should not need to do that in the short-term.

[EDIT] Change attachments, TXT files would not download.

TheBobkin · ‎04-28-2018

Hello Alain,

Yes, all components are state 6 there. So it probably was either marked for deletion or is a leftover component from something being removed with a host out of the loop at the time. Owner Abdication should result in either being removed (if it has delete flag) or a state 5 component which can be queried, run this from the host that is current owner:

# vsish -e set /vmkModules/vsan/dom/ownerAbdicate <ObjectUUID>

Is this a VxRail deployment?

Bob

alainrussell · ‎04-28-2018

Thanks again Bob,

No - not a VxRail deployment, these are 5 Dell R730xd's (hybrid).

I've run the commands, most are still state 6.

Alain

alainrussell · ‎04-28-2018

One object that can be queried outputs as below (the VM in question is showing as healthy)

[root@esx03:~] /usr/lib/vmware/osfs/bin/objtool getAttr --bypassDom -u b1c0e35a-0212-8bbf-e276-ecf4bbe027b8 -c

Object Attributes --

UUID:b1c0e35a-0212-8bbf-e276-ecf4bbe027b8

Object type:vsan

Object size:107374182400

User friendly name:(null)

HA metadata:(null)

Allocation type:Thin

Policy:((\"stripeWidth\" i3) (\"cacheReservation\" i0) (\"proportionalCapacity\" (i0 i100)) (\"hostFailuresToTolerate\" i2) (\"forceProvisioning\" i0) (\"spbmProfileId\" \"36187868-4722-41c8-a55b-e6a95208f450\") (\"spbmProfileGenerationNumber\" l+1) (\"objectVersion\" i5) (\"CSN\" l559) (\"SCSN\" l584))

Object class: vdisk

Object capabilities: STRICT_GWE

Object path: /vmfs/volumes/vsan:52a7e821daafd45b-f96f0b58ae58a180/71001856-a9f8-1ff3-781f-ecf4bbd92408/vm-name_1-000001.vmdk

Group uuid: 71001856-a9f8-1ff3-781f-ecf4bbd92408

TheBobkin · ‎04-28-2018

Hello Alain,

"I've run the commands, most are still state 6"

This is expected: if it had enough healthy components it would be accessible. Inaccessible (usually) means only a single component or stripe remains.

Is this snapshot part of an active chain?

# cat /vmfs/volumes/vsan:52a7e821daafd45b-f96f0b58ae58a180/71001856-a9f8-1ff3-781f-ecf4bbd92408/vm-name.vmx | grep vmdk

If the VM's disks are pointing to snapshots, then query the chain and see whether 000001.vmdk is part of it and, if it is, find out whether it is the same 000001.vmdk referenced here or a newer one.

If this VM is pointing to all base disks then it should be safe to manually delete this inaccessible snapshot Object (using objtool). Though of course ensure the VM is functional, current and has a backup (as all should :)).
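The check described above could be scripted roughly as follows: a minimal Python sketch that reads the text of a .vmx file and reports any disk whose file name ends in a six-digit snapshot suffix such as -000001.vmdk. The function names are illustrative, and it assumes the usual delta-disk naming; it only inspects the .vmx, so it does not replace checking the chain itself.

```python
import re

# Assumption: snapshot (delta) disks follow the <base>-00000N.vmdk naming.
SNAPSHOT_NAME = re.compile(r"-\d{6}\.vmdk$", re.IGNORECASE)

# Matches .vmx lines such as: scsi0:0.fileName = "vm-name1.vmdk"
FILENAME_LINE = re.compile(
    r'^\s*(\S+)\.fileName\s*=\s*"([^"]+\.vmdk)"\s*$', re.IGNORECASE
)


def disks_in_vmx(vmx_text):
    """Map each virtual device (e.g. 'scsi0:0') to the .vmdk it points to."""
    disks = {}
    for line in vmx_text.splitlines():
        match = FILENAME_LINE.match(line)
        if match:
            disks[match.group(1)] = match.group(2)
    return disks


def snapshot_backed_disks(vmx_text):
    """Return only the devices whose .vmdk name looks like a snapshot delta."""
    return {
        device: name
        for device, name in disks_in_vmx(vmx_text).items()
        if SNAPSHOT_NAME.search(name)
    }


if __name__ == "__main__":
    example = (
        'scsi0:0.fileName = "vm-name.vmdk"\n'
        'scsi1:0.fileName = "vm-name_1-000001.vmdk"\n'
    )
    # An empty result would mean every disk points at a base disk.
    print(snapshot_backed_disks(example))
```

Note that a name like vm-name_1.vmdk is a second base disk, not a snapshot, which is why the pattern requires the six-digit suffix.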

Is the other inaccessible Object also a snapshot?

Bob

alainrussell · ‎04-28-2018

The other object appears to be a snapshot as well (output below), neither of the 2 machines have any snapshots showing in the web interface.

Running your command results in:

cat: can't open '/vmfs/volumes/vsan:52a7e821daafd45b-f96f0b58ae58a180/71001856-a9f8-1ff3-781f-ecf4bbd92408/vm-name.vmx': Device or resource busy

Object Attributes --

UUID:7bc0e35a-37ce-d21d-19a3-ecf4bbe027b8

Object type:vsan

Object size:1099511627776

User friendly name:(null)

HA metadata:(null)

Allocation type:Thin

Policy:((\"stripeWidth\" i3) (\"cacheReservation\" i0) (\"proportionalCapacity\" (i0 i100)) (\"hostFailuresToTolerate\" i2) (\"forceProvisioning\" i0) (\"spbmProfileId\" \"36187868-4722-41c8-a55b-e6a95208f450\") (\"spbmProfileGenerationNumber\" l+1) (\"objectVersion\" i5) (\"CSN\" l462) (\"SCSN\" l492))

Object class: vdisk

Object capabilities: STRICT_GWE

Object path: /vmfs/volumes/vsan:52a7e821daafd45b-f96f0b58ae58a180/6f001856-72c1-f6fe-45e1-ecf4bbd92408/vm-name2_1-000001.vmdk

Group uuid: 6f001856-72c1-f6fe-45e1-ecf4bbd92408

Would the delete for these be as below (after a backup of each VM in question)?

/usr/lib/vmware/osfs/bin/objtool delete -u b1c0e35a-0212-8bbf-e276-ecf4bbe027b8 -f -v 10

alainrussell · ‎04-28-2018

Worth noting that Veeam backups would have been running when we had issues with the disk reporting as failed - this is what will have been making the snapshots.

TheBobkin · ‎04-28-2018

Hello Alain,

"cat: can't open '/vmfs/volumes/vsan:52a7e821daafd45b-f96f0b58ae58a180/71001856-a9f8-1ff3-781f-ecf4bbd92408/vm-name.vmx': Device or resource busy"

This needs to be run on the host that the VM is registered on.

"neither of the 2 machines have any snapshots showing in the web interface."

To be sure for sure for sure, check the disks that the VMs are pointing to either in the .vmx as above or Click VM > Edit Settings > Hard Disk > Disk File

The reason I say this is that it is technically feasible to have a VM using snapshots that are not present in the .vmsd and thus Snapshot Manager.

" /usr/lib/vmware/osfs/bin/objtool delete -u b1c0e35a-0212-8bbf-e276-ecf4bbe027b8 -f -v 10"

Correct.

"Worth noting that Veeam backups would have been running when we had issues with the disk reporting as failed - this is what will have been making the snapshots."

This is the type of scenario I had imagined was the cause but didn't want to assume.

Bob

alainrussell · ‎04-28-2018

Thanks (again)

I verified both disks are not running from snapshots:

[root@esx05:~] cat /vmfs/volumes/vsan:52a7e821daafd45b-f96f0b58ae58a180/71001856-a9f8-1ff3-781f-ecf4bbd92408/vm-name1.vmx | grep vmdk

scsi0:0.fileName = "vm-name1.vmdk"

scsi1:0.fileName = "vm-name1_1.vmdk"

[root@esx05:~] cat /vmfs/volumes/vsan:52a7e821daafd45b-f96f0b58ae58a180/6f001856-72c1-f6fe-45e1-ecf4bbd92408/vm-name2.vmx | grep vmdk

scsi0:0.fileName = "vm-name2.vmdk"

scsi1:0.fileName = "vm-name2_1.vmdk"

Trying a delete on the first (after VM backup) returned:

[root@esx03:~] /usr/lib/vmware/osfs/bin/objtool delete -u b1c0e35a-0212-8bbf-e276-ecf4bbe027b8 -f -v 10

Deleting object b1c0e35a-0212-8bbf-e276-ecf4bbe027b8 with force mode

Opening vsan namespace control node

Marshaling delete arguments

Issuing delete ioctl

object deletion ioctl failed: No such file or directory

object delete error: Failure

TheBobkin · ‎04-28-2018

Hello Alain,

I am not positive if it should matter in this case but are you deleting that from the node that is DOM-owner and/or that the healthy component(s) reside on? There are other feasible methods of cleansing these but do try from other hosts first.

Bob

alainrussell · ‎04-28-2018

Yes, originally from the owner host (I've only tried this object, not the others yet)

Tried from another couple - same error unfortunately.

[root@esx04:~] /usr/lib/vmware/osfs/bin/objtool delete -u b1c0e35a-0212-8bbf-e276-ecf4bbe027b8 -f -v 10

Deleting object b1c0e35a-0212-8bbf-e276-ecf4bbe027b8 with force mode

Opening vsan namespace control node

Marshaling delete arguments

Issuing delete ioctl

object deletion ioctl failed: No such file or directory

object delete error: Failure

alainrussell · ‎04-28-2018

FYI - Same error on the other 3 objects related to the other VM.

TheBobkin · ‎04-28-2018

Hello Alain,

You are specifying the LSOM-Component UUID - not the DOM-Object UUID (which is what you are trying to remove here).

The only time that you should be using the LSOM-Component UUID is when using --bypassDom -c for identification.

Try deletion with the Object UUID.
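To guard against this mix-up, one could check the UUID against the vsan.check_state output before building the delete command. The sketch below is a hypothetical Python helper (the function names are made up for illustration) that only returns the command string seen earlier in this thread and never runs it.

```python
import re

UUID = r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"

# Matches lines such as:
#   Detected <object-uuid> on esx03.domain.com to be inaccessible
DETECTED = re.compile(rf"Detected ({UUID}) on (\S+) to be inaccessible")


def inaccessible_objects(check_state_output):
    """Map each inaccessible Object UUID to the host it was reported on."""
    return {m.group(1): m.group(2) for m in DETECTED.finditer(check_state_output)}


def delete_command(uuid, check_state_output):
    """Build the objtool delete command, refusing UUIDs not reported as objects."""
    if uuid not in inaccessible_objects(check_state_output):
        raise ValueError(
            f"{uuid} is not a reported inaccessible Object UUID "
            "(it may be a component UUID)"
        )
    return f"/usr/lib/vmware/osfs/bin/objtool delete -u {uuid} -f -v 10"


if __name__ == "__main__":
    output = (
        "Detected 2 objects to be inaccessible\n"
        "Detected 7bc0e35a-4c43-381c-3580-ecf4bbe027b8 on esx03 to be inaccessible\n"
        "Detected b1c0e35a-5d82-c0bd-b943-ecf4bbe027b8 on esx03 to be inaccessible\n"
    )
    print(delete_command("b1c0e35a-5d82-c0bd-b943-ecf4bbe027b8", output))
```

Passing the component UUID used in the failed attempts (b1c0e35a-0212-8bbf-e276-ecf4bbe027b8) raises ValueError instead of producing a command.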

Bob

alainrussell · ‎04-28-2018

Ok, thanks - so the delete should be specified using the Object UUID from this initial check?

2018-04-28 23:07:43 +0000: Step 1: Check for inaccessible vSAN objects

Detected 2 objects to be inaccessible

Detected 7bc0e35a-4c43-381c-3580-ecf4bbe027b8 on esx03 to be inaccessible

Detected b1c0e35a-5d82-c0bd-b943-ecf4bbe027b8 on esx03 to be inaccessible

eg. /usr/lib/vmware/osfs/bin/objtool delete -u b1c0e35a-5d82-c0bd-b943-ecf4bbe027b8 -f -v 10

Apologies for the confusion!

TheBobkin · ‎04-28-2018

Hello Alain,

Yes, the Object UUID.

"Apologies for the confusion!"

My bad, I should have checked what you were using before - I only checked the syntax of the command not the UUID.

Bob

alainrussell · ‎04-28-2018

Thanks for all your help. The 2 objects are now deleted and health is reporting all green.