VMware Cloud Community
pzin2
Contributor

VSAN failed test. Objects inaccessible. All object components marked STALE.

I configured a 3-node all-flash VSAN 6.2 (v3) cluster in my lab.

Before going into production, I wanted to see how the cluster would tolerate a datacenter power loss. I decided to stagger the power cut-offs, assuming that the battery backups on the servers would have slightly different capacities.

The cluster had 8 VMs, all running a Crystal Mark benchmark at the time of the power disconnect.

Let’s name the cluster nodes A, B, C.

First I disconnected node C. Then, after 5 minutes, I disconnected nodes A and B with a 2-second interval between them.

I waited a couple of minutes and then started powering them back on, but in reverse order.

I deliberately started node C first. I assumed this would be the worst-case scenario, because the data on C would be outdated, and I wanted to see how well the situation would be handled. Such an ordering can still easily happen in a production environment due to BIOS boot delays.

After 2 minutes I turned on the two remaining hosts.

During the boot process ESXi spends several minutes initializing the VSAN disks. Even though node C started first, there was a period of time when all three hosts were going through the VSAN initialization process at the same time.
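(Side note: as far as I know, you can check from each host's shell when this initialization has finished; the command names below are from memory, so please correct me if they differ on 6.2:

esxcli vsan cluster get      # shows whether the node has rejoined the vSAN cluster
esxcli vsan storage list     # shows whether the disk group devices are mounted again)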

I thought this would be enough for the system to resynchronize, but I was wrong!

After all 3 hosts were back online, 3 out of the 8 VMs were in the Inaccessible state. 3 other VMs were accessible but out of sync. 2 were healthy.

The cluster is stuck rebuilding one object. The object is only 1 GB, but after 10 hours it is still in the same state:

/localhost/DC74/computers/CLSTR01> vsan.resync_dashboard ~cluster
2016-04-09 05:41:16 -0500: Querying all VMs on VSAN ...
2016-04-09 05:41:16 -0500: Querying all objects in the system from b1200. ...
2016-04-09 05:41:17 -0500: Got all the info, computing table ...
+-----------------------------------------------------------------------+-----------------+---------------+
| VM/Object                                                             | Syncing objects | Bytes to sync |
+-----------------------------------------------------------------------+-----------------+---------------+
| A_temp_moving                                                         | 1               |               |
|   [vsan_ssd1] 3574e73e-d79b-d092-8bfb-00266cf2880c/A_temp_moving.vmx  |                 | 1.00 GB       |
+-----------------------------------------------------------------------+-----------------+---------------+
| Total                                                                 | 1               | 1.00 GB       |
+-----------------------------------------------------------------------+-----------------+---------------+

3 VSAN objects corresponding to the 3 inaccessible VMs are also marked inaccessible:

/localhost/DC74/computers/CLSTR01> vsan.check_state ~cluster
2016-04-09 05:45:42 -0500: Step 1: Check for inaccessible VSAN objects
Detected 3 objects to be inaccessible
Detected cef8e73e-955e-f306-1078-00266cf2880c on b1200. to be inaccessible
Detected 31f8e73e-ab51-c751-5bb5-00266cf2880c on b1200. to be inaccessible
Detected e8f8e73e-7056-0d95-1f51-00266cf2880c on b9100. to be inaccessible
2016-04-09 05:45:42 -0500: Step 2: Check for invalid/inaccessible VMs
Detected VM 'A_temp_moving8' as being 'inaccessible'
Detected VM 'A_temp_moving6' as being 'inaccessible'
2016-04-09 05:45:42 -0500: Step 3: Check for VMs for which VC/hostd/vmx are out of sync
Found VMs for which VC/hostd/vmx are out of sync:
A_temp_moving9
A_temp_moving7
A_temp_moving4
A_temp_movin3

Examination of the inaccessible objects further showed that some of them have all components ACTIVE, yet marked as STALE! Others have 2 components ACTIVE and one missing, but all marked STALE. Here is an example:

/localhost/DC74/computers/CLSTR01> vsan.object_info ~cluster 31f8e73e-ab51-c751-5bb5-00266cf2880c
DOM Object: 31f8e73e-ab51-c751-5bb5-00266cf2880c (v3, owner: b1200., policy: No POLICY entry found in CMMDS)
  RAID_1
    Component: 31f8e73e-738f-3952-44b2-00266cf2880c (state: ACTIVE (5), csn: STALE (owner stale), host: b4300., md: 5288d2ea-7203-770d-5875-a5a721d925bc, ssd: 52f5b565-c20e-17d9-6b1f-6ebd6c50ae23, votes: 1, usage: 0.4 GB)
    Component: 31f8e73e-f1f7-3b52-08a5-00266cf2880c (state: ACTIVE (5), csn: STALE (owner stale), host: b1200., md: 527993ca-0b3f-d92f-a78b-a94ceccec98d, ssd: 52a32e68-80fc-285d-23f1-1758e80d63a5, votes: 1, usage: 0.4 GB)
  Witness: 31f8e73e-53f0-3d52-c1cc-00266cf2880c (state: ACTIVE (5), host: b9100., md: 52d67be3-f1a1-c6df-c4fa-60100694133c, ssd: 528b8cac-c51d-f0fb-f5fe-7c4d1fd1220d, votes: 1, usage: 0.0 GB)
  Extended attributes:
    Address space: 273804165120B (255.00 GB)
    Object class: vmnamespace
    Object path: /vmfs/volumes/vsan:52f5f66efc19ecc0-f72aa19c783c8172/
    Object capabilities: NONE

I tried to fix it with the "vsan.check_state -r -e ~cluster" command, but it didn't change anything. I also went to the Virtual SAN tab in the vSphere client and tried to repair it with the "Repair object immediately" button, but it was grayed out.
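For what it's worth, the only other RVC checks I can think of are an overall object status summary and a CMMDS lookup for the policy entry that object_info says is missing. The syntax below is from my memory of the RVC help, so it may need adjusting:

/localhost/DC74/computers/CLSTR01> vsan.obj_status_report -t ~cluster
/localhost/DC74/computers/CLSTR01> vsan.cmmds_find -u 31f8e73e-ab51-c751-5bb5-00266cf2880c ~cluster

Neither of these changes anything by itself, but the output might show where the stale CSNs and the missing policy entry come from.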

Does anybody have a solution for this problem?

I have probably read everything there is on the Internet about VSAN at the moment, and I could not find any mention of a situation like this.

Honestly, this should not happen in an enterprise-class solution, unless I have missed something.

2 Replies
zdickinson
Expert

Good morning, you tried to break vSAN and succeeded!  I did the same test when we went to production on v5.5 and it recovered correctly.  At this point the next step would be support or your backup/DR solution.  Thank you, Zach.
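P.S. If you do open an SR, I believe RVC can also gather the vSAN-specific diagnostics for support, in addition to the usual vm-support bundles from each host. Going from memory, so double-check the command name first, it should be roughly:

vsan.support_information ~cluster

(using the same ~cluster mark as in your outputs above).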

UBUTS5
Contributor

Good Day,

I have the same issue here: an inaccessible object that needs to be deleted, but it will not go away.

I have tried all the commands and solutions I could find on websites (roughly the sort of thing shown below), but they all failed to delete the object.
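The kind of command those posts describe is the low-level objtool delete run on one of the ESXi hosts, roughly like the line below. The flags are copied from the blog/KB examples as I remember them, and the UUID is the one from my output further down, so please treat it as a sketch and only run it with support's guidance, since it permanently destroys the object:

/usr/lib/vmware/osfs/bin/objtool delete -u 87ead358-bcbe-fc82-08f6-40a8f021a3d4 -f -v 10

Even after that the object is still reported as inaccessible.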

Does anyone have a solution?

Collecting all inaccessible Virtual SAN objects...
Found 1 inaccessbile objects.
Selecting vswp objects from inaccessible objects by checking their extended attributes...
Found 0 inaccessible vswp objects.
2017-11-15 12:14:29 +0200: Step 1: Check for inaccessible VSAN objects
Detected 87ead358-bcbe-fc82-08f6-40a8f021a3d4 to be inaccessible, refreshing state
2017-11-15 12:14:35 +0200: Step 1b: Check for inaccessible VSAN objects, again
Detected 87ead358-bcbe-fc82-08f6-40a8f021a3d4 is still inaccessible
2017-11-15 12:14:36 +0200: Step 2: Check for invalid/inaccessible VMs
2017-11-15 12:14:36 +0200: Step 2b: Check for invalid/inaccessible VMs again
2017-11-15 12:14:36 +0200: Step 3: Check for VMs for which VC/hostd/vmx are out of sync
Did not find VMs for which VC/hostd/vmx are out of sync
