VMware Cloud Community
alainrussell
Enthusiast

Health Error - Inaccessible Objects

Hi,

Running vSAN 6.6, vSphere 6.5d.

We upgraded firmware on some of our Dell R730xd machines this morning and hit a vSAN error during the upgrade. On the third host we upgraded, one of the disks was marked as permanently failed by vSAN and its data was rebuilt to other hosts. While the data was rebuilding, the failed disk returned to healthy status (and has since swapped back to permanently failed, and again to healthy). I’ve since powered down this host and restarted it to see if it was a transient error (iDRAC shows the disk as healthy).

Everything appears to have redistributed through the vSAN cluster, all VMs are showing as healthy, and their storage policy is showing as compliant. Unfortunately, the vSAN health check is showing an object error for 2 objects that appear to be orphaned on the disk that was marked as failed. These show as “other” objects and are not linked to current VMs as far as I can tell.

I’ve checked the state of VSAN and these show as below.

> vsan.check_state vcsa.domain.com/DataCenter/computers/Cluster

2018-04-28 04:09:56 +0000: Step 1: Check for inaccessible vSAN objects

Detected 2 objects to be inaccessible

Detected 7bc0e35a-4c43-381c-3580-ecf4bbe027b8 on esx03.domain.com to be inaccessible

Detected b1c0e35a-5d82-c0bd-b943-ecf4bbe027b8 on esx03.domain.com to be inaccessible

Trying to purge them does not appear to work.

> vsan.purge_inaccessible_vswp_objects vcsa.domain.com/DataCenter/computers/Cluster

2018-04-28 04:39:44 +0000: Collecting all inaccessible vSAN objects...

2018-04-28 04:39:44 +0000: Found 2 inaccessbile objects.

2018-04-28 04:39:44 +0000: Selecting vswp objects from inaccessible objects by checking their extended attributes...

2018-04-28 04:39:45 +0000: Found 0 inaccessible vswp objects.

I’ve seen some blog posts online about using objtool to delete the objects, but they say to check with support before running it to make sure you are using it correctly. If I try to get details of the failed object, I currently get the error below (this is run from the host the objects are on).

[root@esx03:~] /usr/lib/vmware/osfs/bin/objtool getAttr --bypassDom -u b1c0e35a-5d82-c0bd-b943-ecf4bbe027b8 -c

Failed to find lsom object: Not found

Failed to get disk uuid

object getAttr error: Failure

Any idea on how to handle these orphaned objects?

(I've opened a support case with Dell; they provide our VMware support, but they can be very slow for anything vSAN related.)

Thanks.

inacessible.png

TheBobkin
Champion

Hello Alain,

Welcome to Communities.

We (GSS support) advise opening an SR not just so that Objects are deleted correctly, but also so that it is established what is being removed and how it got into that state.

When trying to identify inaccessible components you need to specify the active (state 5) component and run this on the host that holds this component (thus the bypassDom flag) - in your example you are specifying the Object UUID.

You can find the remaining active componentUUID using:

# cmmds-tool find -t DOM_OBJECT -f json -u 7bc0e35a-4c43-381c-3580-ecf4bbe027b8

Then use this to query the active component (on the host that is the DOM_Owner) as you were doing.
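If the JSON output is long, a small script can pick out the component states. The following is only a sketch: it assumes the output parses as JSON and that each component appears as a nested dictionary with a `componentUuid` key and an `attributes.componentState` value. Those key names and the sample data are assumptions for illustration, so check them against your own output before relying on it.

```python
import json

# Hypothetical sample only - the structure, key names and UUIDs are assumed
# for illustration and are NOT taken from real cmmds-tool output.
SAMPLE = json.dumps({
    "entries": [{
        "uuid": "object-uuid-placeholder",
        "content": {
            "type": "Configuration",
            "child-1": {"type": "Component",
                        "attributes": {"componentState": 5},
                        "componentUuid": "comp-a"},
            "child-2": {"type": "Component",
                        "attributes": {"componentState": 6},
                        "componentUuid": "comp-b"},
        },
    }]
})


def find_components(node, found=None):
    """Walk parsed JSON and collect (componentUuid, componentState) pairs."""
    if found is None:
        found = []
    if isinstance(node, dict):
        if "componentUuid" in node:
            attrs = node.get("attributes")
            state = attrs.get("componentState") if isinstance(attrs, dict) else None
            found.append((node["componentUuid"], state))
        for value in node.values():
            find_components(value, found)
    elif isinstance(node, list):
        for item in node:
            find_components(item, found)
    return found


def active_component_uuids(json_text):
    """Return the UUIDs of components whose state is 5 (active)."""
    pairs = find_components(json.loads(json_text))
    # Compare as strings so both 5 and "5" are accepted.
    return [uuid for uuid, state in pairs if str(state) == "5"]


if __name__ == "__main__":
    print(active_component_uuids(SAMPLE))
```

An empty result would mean no state 5 component was found in the parsed output (or that the assumed key names do not match).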

Bob

alainrussell
Enthusiast

Thanks Bob

The output of the command you sent is attached (for the 2 objects; nothing in state 5?). I'm happy to wait for GSS to be engaged through Dell (I'm currently dumping hardware logs for them, which I'm not sure are going to help, so it might be a bit of a waiting game).

For now, all the VMs are reporting healthy. I think I'd struggle to put the host in maintenance mode with the objects reporting inaccessible, but I should not need to do that in the short term.

[EDIT] Changed attachments; the TXT files would not download.

TheBobkin
Champion

Hello Alain,

Yes, all components are state 6 there, so the object was probably either marked for deletion or is a leftover from something being removed while a host was out of the loop. Owner abdication should result in the object either being removed (if it has the delete flag) or having a state 5 component which can be queried. Run this from the host that is the current owner:

# vsish -e set /vmkModules/vsan/dom/ownerAbdicate <ObjectUUID>

Is this a VxRail deployment?

Bob

alainrussell
Enthusiast

Thanks again Bob,

No - not a VxRail deployment, these are 5 Dell R730xd's (hybrid)

I've run the commands, most are still state 6

Alain

alainrussell
Enthusiast

One object that can be queried outputs as below (the VM in question is showing as healthy)

[root@esx03:~] /usr/lib/vmware/osfs/bin/objtool getAttr --bypassDom -u b1c0e35a-0212-8bbf-e276-ecf4bbe027b8 -c

Object Attributes --

UUID:b1c0e35a-0212-8bbf-e276-ecf4bbe027b8

Object type:vsan

Object size:107374182400

User friendly name:(null)

HA metadata:(null)

Allocation type:Thin

Policy:((\"stripeWidth\" i3) (\"cacheReservation\" i0) (\"proportionalCapacity\" (i0 i100)) (\"hostFailuresToTolerate\" i2) (\"forceProvisioning\" i0) (\"spbmProfileId\" \"36187868-4722-41c8-a55b-e6a95208f450\") (\"spbmProfileGenerationNumber\" l+1) (\"objectVersion\" i5) (\"CSN\" l559) (\"SCSN\" l584))

Object class: vdisk

Object capabilities: STRICT_GWE

Object path: /vmfs/volumes/vsan:52a7e821daafd45b-f96f0b58ae58a180/71001856-a9f8-1ff3-781f-ecf4bbd92408/vm-name_1-000001.vmdk

Group uuid: 71001856-a9f8-1ff3-781f-ecf4bbd92408

TheBobkin
Champion

Hello Alain,

"I've run the commands, most are still state 6"

This is expected - if the object had enough healthy components it would be accessible. Inaccessible (usually) means only a single component or stripe remains.

Is this snapshot part of an active chain?

# cat /vmfs/volumes/vsan:52a7e821daafd45b-f96f0b58ae58a180/71001856-a9f8-1ff3-781f-ecf4bbd92408/vm-name.vmx | grep vmdk

If the VM's disks are pointing to snapshots, then query the chain and see if 000001.vmdk is part of it. If it is, find out whether it is the same 000001.vmdk referenced here or a newer one.

If this VM is pointing to all base-disks then it should be safe to manually delete this inaccessible snapshot Object (using objtool). Though of course ensure the VM is functional, current and has a backup (as all should).
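The .vmx check above can also be scripted. This is a minimal sketch, assuming disk entries in the .vmx look like `scsi0:0.fileName = "name.vmdk"` and that snapshot delta disks follow the usual `-NNNNNN.vmdk` suffix (as in `vm-name_1-000001.vmdk`). The function names are made up for illustration. It only reads the .vmx text you give it, so it does not replace checking the chain itself.

```python
import re

# Matches .vmx disk entries such as: scsi1:0.fileName = "vm-name1_1.vmdk"
DISK_LINE = re.compile(r'^\s*(\S+)\.fileName\s*=\s*"([^"]+\.vmdk)"\s*$', re.IGNORECASE)

# Assumed naming convention for snapshot delta disks: -NNNNNN.vmdk (six digits).
SNAPSHOT_NAME = re.compile(r'-\d{6}\.vmdk$', re.IGNORECASE)


def disks_in_vmx(vmx_text):
    """Return {device: vmdk file name} for every disk entry in the .vmx text."""
    disks = {}
    for line in vmx_text.splitlines():
        match = DISK_LINE.match(line)
        if match:
            disks[match.group(1)] = match.group(2)
    return disks


def snapshot_disks(vmx_text):
    """Return only the disks whose file name looks like a snapshot delta."""
    return {device: name
            for device, name in disks_in_vmx(vmx_text).items()
            if SNAPSHOT_NAME.search(name)}


if __name__ == "__main__":
    example = 'scsi0:0.fileName = "vm-name.vmdk"\nscsi1:0.fileName = "vm-name_1-000001.vmdk"\n'
    print(snapshot_disks(example))
```

An empty result suggests the VM is pointing at base disks only. A base disk that happens to be named with a `-NNNNNN` suffix would be flagged as well, so treat a hit as a prompt to look closer rather than as proof.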

Is the other inaccessible Object also a snapshot?

Bob

alainrussell
Enthusiast

The other object appears to be a snapshot as well (output below), neither of the 2 machines have any snapshots showing in the web interface.

Running your command results in:

cat: can't open '/vmfs/volumes/vsan:52a7e821daafd45b-f96f0b58ae58a180/71001856-a9f8-1ff3-781f-ecf4bbd92408/vm-name.vmx': Device or resource busy

Object Attributes --

UUID:7bc0e35a-37ce-d21d-19a3-ecf4bbe027b8

Object type:vsan

Object size:1099511627776

User friendly name:(null)

HA metadata:(null)

Allocation type:Thin

Policy:((\"stripeWidth\" i3) (\"cacheReservation\" i0) (\"proportionalCapacity\" (i0 i100)) (\"hostFailuresToTolerate\" i2) (\"forceProvisioning\" i0) (\"spbmProfileId\" \"36187868-4722-41c8-a55b-e6a95208f450\") (\"spbmProfileGenerationNumber\" l+1) (\"objectVersion\" i5) (\"CSN\" l462) (\"SCSN\" l492))

Object class: vdisk

Object capabilities: STRICT_GWE

Object path: /vmfs/volumes/vsan:52a7e821daafd45b-f96f0b58ae58a180/6f001856-72c1-f6fe-45e1-ecf4bbd92408/vm-name2_1-000001.vmdk

Group uuid: 6f001856-72c1-f6fe-45e1-ecf4bbd92408

Would the delete for these be as below (after a backup of each VM in question)?

/usr/lib/vmware/osfs/bin/objtool delete -u b1c0e35a-0212-8bbf-e276-ecf4bbe027b8 -f -v 10

alainrussell
Enthusiast

Worth noting that Veeam backups would have been running when we had issues with the disk reporting as failed - this is what will have been making the snapshots.

TheBobkin
Champion

Hello Alain,

"cat: can't open '/vmfs/volumes/vsan:52a7e821daafd45b-f96f0b58ae58a180/71001856-a9f8-1ff3-781f-ecf4bbd92408/vm-name.vmx': Device or resource busy"

This needs to be run on the host that the VM is registered on.

"neither of the 2 machines have any snapshots showing in the web interface."

To be sure for sure for sure, check the disks that the VMs are pointing to either in the .vmx as above or Click VM > Edit Settings > Hard Disk > Disk File

The reason I say this is that it is technically feasible to have a VM using snapshots that are not present in the .vmsd and thus Snapshot Manager.

" /usr/lib/vmware/osfs/bin/objtool delete -u b1c0e35a-0212-8bbf-e276-ecf4bbe027b8 -f -v 10"

Correct.

"Worth noting that Veeam backups would have been running when we had issues with the disk reporting as failed - this is what will have been making the snapshots."

This is the type of scenario I had imagined was the cause but didn't want to assume.

Bob

alainrussell
Enthusiast

Thanks (again)

I verified both disks are not running from snapshots:

[root@esx05:~] cat /vmfs/volumes/vsan:52a7e821daafd45b-f96f0b58ae58a180/71001856-a9f8-1ff3-781f-ecf4bbd92408/vm-name1.vmx | grep vmdk

scsi0:0.fileName = "vm-name1.vmdk"

scsi1:0.fileName = "vm-name1_1.vmdk"

[root@esx05:~] cat /vmfs/volumes/vsan:52a7e821daafd45b-f96f0b58ae58a180/6f001856-72c1-f6fe-45e1-ecf4bbd92408/vm-name2.vmx | grep vmdk

scsi0:0.fileName = "vm-name2.vmdk"

scsi1:0.fileName = "vm-name2_1.vmdk"

Trying a delete on the first (after VM backup) returned:

[root@esx03:~] /usr/lib/vmware/osfs/bin/objtool delete -u b1c0e35a-0212-8bbf-e276-ecf4bbe027b8 -f -v 10

Deleting object b1c0e35a-0212-8bbf-e276-ecf4bbe027b8 with force mode

Opening vsan namespace control node

Marshaling delete arguments

Issuing delete ioctl

object deletion ioctl failed: No such file or directory

object delete error: Failure

TheBobkin
Champion

Hello Alain,

I am not positive if it should matter in this case but are you deleting that from the node that is DOM-owner and/or that the healthy component(s) reside on? There are other feasible methods of cleansing these but do try from other hosts first.

Bob

alainrussell
Enthusiast

Yes, originally from the owner host (I've only tried this object, not the others yet)

Tried from another couple - same error unfortunately.

[root@esx04:~] /usr/lib/vmware/osfs/bin/objtool delete -u b1c0e35a-0212-8bbf-e276-ecf4bbe027b8 -f -v 10

Deleting object b1c0e35a-0212-8bbf-e276-ecf4bbe027b8 with force mode

Opening vsan namespace control node

Marshaling delete arguments

Issuing delete ioctl

object deletion ioctl failed: No such file or directory

object delete error: Failure

alainrussell
Enthusiast

FYI - Same error on the other 3 objects related to the other VM.

TheBobkin
Champion

Hello Alain,

You are specifying the LSOM-Component UUID - not the DOM-Object UUID (which is what you are trying to remove here).

The only time that you should be using the LSOM-Component UUID is when using --bypassDom -c for identification.

Try deletion with the Object UUID.

Bob

alainrussell
Enthusiast

Ok, thanks - so the delete should be specified using the Object UUID from this initial check?

2018-04-28 23:07:43 +0000: Step 1: Check for inaccessible vSAN objects

Detected 2 objects to be inaccessible

Detected 7bc0e35a-4c43-381c-3580-ecf4bbe027b8 on esx03 to be inaccessible

Detected b1c0e35a-5d82-c0bd-b943-ecf4bbe027b8 on esx03 to be inaccessible

e.g. /usr/lib/vmware/osfs/bin/objtool delete -u b1c0e35a-5d82-c0bd-b943-ecf4bbe027b8 -f -v 10
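To avoid mixing up UUIDs when copying by hand, the Object UUIDs can be pulled straight out of saved vsan.check_state output. This is a rough sketch that assumes the "Detected <uuid> on <host> to be inaccessible" line format shown above. The function name is made up for illustration, and it only extracts text - it does not run or validate anything against the cluster.

```python
import re

# Matches lines like:
#   Detected 7bc0e35a-4c43-381c-3580-ecf4bbe027b8 on esx03 to be inaccessible
# The summary line "Detected 2 objects to be inaccessible" does not match,
# because "2" is not in UUID form.
DETECTED = re.compile(
    r"Detected ([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})"
    r" on (\S+) to be inaccessible"
)


def inaccessible_objects(check_state_output):
    """Return a list of (object_uuid, host) tuples found in the output text."""
    return DETECTED.findall(check_state_output)


if __name__ == "__main__":
    output = (
        "Detected 2 objects to be inaccessible\n"
        "Detected 7bc0e35a-4c43-381c-3580-ecf4bbe027b8 on esx03 to be inaccessible\n"
        "Detected b1c0e35a-5d82-c0bd-b943-ecf4bbe027b8 on esx03 to be inaccessible\n"
    )
    for object_uuid, host in inaccessible_objects(output):
        print(object_uuid, host)
```

The UUIDs this returns are the Object UUIDs reported by the check, as opposed to the component UUIDs used with --bypassDom.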

Apologies for the confusion!

TheBobkin
Champion

Hello Alain,


Yes, the Object UUID.

"Apologies for the confusion!"

My bad, I should have checked what you were using before - I only checked the syntax of the command not the UUID.

Bob

alainrussell
Enthusiast

Thanks for all your help, the 2 objects are now deleted and health is reporting all green.
