2 Replies Latest reply on Nov 15, 2017 2:31 AM by UBUTS5

    VSAN failed test. Objects inaccessible. All object components marked STALE.

    pzin2 Lurker

      I configured a 3 node All flash VSAN 6.2 (v3) cluster in my lab.

      Before going into production, I wanted to see how the cluster would tolerate a datacenter power loss. I decided to stagger the power cut-offs, assuming that the battery backups on the servers would have slightly different capacities.

      The cluster had 8 VMs all running a Crystal Mark benchmark at the time of power disconnect.

       

      Let’s name the cluster nodes A, B, C.

      First I disconnected node C. Five minutes later I disconnected nodes A and B, 2 seconds apart.

      I waited a couple of minutes and then started powering them back on, but in reverse order.

      I deliberately started node C first. I assumed this would be the worst-case scenario because the data on C would be outdated, and I wanted to see how well the situation would be handled. Still, it can easily happen in a production environment due to BIOS boot delays.

      After 2 minutes I turned on the two remaining hosts.
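The reasoning behind "data on C will be outdated" can be sketched with a toy timeline. This is only an illustration of the test sequence described above, not a model of VSAN internals: the function names are hypothetical, and the rule that writes are accepted only while a majority of nodes is up is an assumption made for the sketch.

```python
# Toy model of the staggered power cut; times are seconds after node C lost power.
POWER_OFF = {"C": 0, "A": 300, "B": 302}


def last_write_seen(power_off):
    """Latest time at which each node could still have recorded a write.

    Assumption (illustrative only): writes are accepted only while a
    majority of the nodes is powered on.
    """
    times = sorted(power_off.values())
    majority = len(times) // 2 + 1
    # The majority is lost when the (n - majority + 1)-th node goes down.
    quorum_lost_at = times[len(times) - majority]
    return {node: min(t, quorum_lost_at) for node, t in power_off.items()}


def staleness(power_off):
    """Seconds by which each node lags behind the most recent write."""
    seen = last_write_seen(power_off)
    newest = max(seen.values())
    return {node: newest - t for node, t in seen.items()}


if __name__ == "__main__":
    print(staleness(POWER_OFF))
```

Under these assumptions node C lags the other two by the full 5 minutes, while A and B hold the same data, which is why booting C first was expected to be the hardest case.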

       

      During the boot process, ESXi spends several minutes initializing the VSAN disks. Even though C started first, there was a period when all three hosts were in the VSAN initialization process at the same time.

      I thought that would be enough for the system to resynchronize, but I was wrong!

      After all 3 hosts were online, 3 of the 8 VMs were in the Inaccessible state, 3 other VMs were accessible but out of sync, and 2 were healthy.

       

      The cluster is stuck rebuilding one object. The object is only 1 GB, but after 10 hours it is still in the same state:

       

      /localhost/DC74/computers/CLSTR01> vsan.resync_dashboard ~cluster

      2016-04-09 05:41:16 -0500: Querying all VMs on VSAN ...

      2016-04-09 05:41:16 -0500: Querying all objects in the system from b1200. ...

      2016-04-09 05:41:17 -0500: Got all the info, computing table ...

      +-----------------------------------------------------------------------+-----------------+---------------+
      | VM/Object                                                             | Syncing objects | Bytes to sync |
      +-----------------------------------------------------------------------+-----------------+---------------+
      | A_temp_moving                                                         | 1               |               |
      |    [vsan_ssd1] 3574e73e-d79b-d092-8bfb-00266cf2880c/A_temp_moving.vmx |                 | 1.00 GB       |
      +-----------------------------------------------------------------------+-----------------+---------------+
      | Total                                                                 | 1               | 1.00 GB       |
      +-----------------------------------------------------------------------+-----------------+---------------+

       

       

      3 VSAN objects corresponding to the 3 inaccessible VMs are also marked inaccessible:

       

      /localhost/DC74/computers/CLSTR01> vsan.check_state ~cluster

      2016-04-09 05:45:42 -0500: Step 1: Check for inaccessible VSAN objects

      Detected 3 objects to be inaccessible

      Detected cef8e73e-955e-f306-1078-00266cf2880c on b1200. to be inaccessible

      Detected 31f8e73e-ab51-c751-5bb5-00266cf2880c on b1200. to be inaccessible

      Detected e8f8e73e-7056-0d95-1f51-00266cf2880c on b9100. to be inaccessible

       

      2016-04-09 05:45:42 -0500: Step 2: Check for invalid/inaccessible VMs

      Detected VM 'A_temp_moving8' as being 'inaccessible'

      Detected VM 'A_temp_moving6' as being 'inaccessible'

       

      2016-04-09 05:45:42 -0500: Step 3: Check for VMs for which VC/hostd/vmx are out of sync

      Found VMs for which VC/hostd/vmx are out of sync:

      A_temp_moving9

      A_temp_moving7

      A_temp_moving4

      A_temp_movin3
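When comparing runs of `vsan.check_state` before and after a repair attempt, it can help to pull the inaccessible object UUIDs and VM names out of the text automatically. The following is a minimal sketch that assumes the output has been saved as plain text and matches the line format shown in this thread; `summarize_check_state` is a hypothetical helper, not part of RVC.

```python
import re

# Lines copied from the vsan.check_state output in this thread.
SAMPLE = """
Detected 3 objects to be inaccessible
Detected cef8e73e-955e-f306-1078-00266cf2880c on b1200. to be inaccessible
Detected 31f8e73e-ab51-c751-5bb5-00266cf2880c on b1200. to be inaccessible
Detected e8f8e73e-7056-0d95-1f51-00266cf2880c on b9100. to be inaccessible
Detected VM 'A_temp_moving8' as being 'inaccessible'
Detected VM 'A_temp_moving6' as being 'inaccessible'
"""

# A 36-character UUID, optionally followed by "on <host>".
OBJECT_RE = re.compile(r"Detected ([0-9a-f-]{36}) (?:on \S+ )?to be inaccessible")
VM_RE = re.compile(r"Detected VM '([^']+)' as being 'inaccessible'")


def summarize_check_state(text):
    """Return the inaccessible object UUIDs and VM names found in the text."""
    return {
        "inaccessible_objects": OBJECT_RE.findall(text),
        "inaccessible_vms": VM_RE.findall(text),
    }


if __name__ == "__main__":
    summary = summarize_check_state(SAMPLE)
    print(len(summary["inaccessible_objects"]), "objects,",
          len(summary["inaccessible_vms"]), "VMs")
```

The summary line "Detected 3 objects to be inaccessible" is skipped on purpose, because "3" is not a 36-character UUID.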

       

       

      Further examination of the inaccessible objects showed that some of them have all components ACTIVE but marked as STALE! Others have 2 components ACTIVE and one missing, but all are marked STALE. Here is an example:

       

      <LSTR01> vsan.object_info ~cluster 31f8e73e-ab51-c751-5bb5-00266cf2880c

      DOM Object: 31f8e73e-ab51-c751-5bb5-00266cf2880c (v3, owner: b1200., policy: No POLICY entry found in CMMDS)

      RAID_1

      Component: 31f8e73e-738f-3952-44b2-00266cf2880c (state: ACTIVE (5), csn: STALE (owner stale), host: b4300., md: 5288d2ea-7203-770d-5875-a5a721d925bc, ssd: 52f5b565-c20e-17d9-6b1f-6ebd6c50ae23,

      votes: 1, usage: 0.4 GB)

      Component: 31f8e73e-f1f7-3b52-08a5-00266cf2880c (state: ACTIVE (5), csn: STALE (owner stale), host: b1200., md: 527993ca-0b3f-d92f-a78b-a94ceccec98d, ssd: 52a32e68-80fc-285d-23f1-1758e80d63a5,

      votes: 1, usage: 0.4 GB)

      Witness: 31f8e73e-53f0-3d52-c1cc-00266cf2880c (state: ACTIVE (5), host: b9100., md: 52d67be3-f1a1-c6df-c4fa-60100694133c, ssd: 528b8cac-c51d-f0fb-f5fe-7c4d1fd1220d,

      votes: 1, usage: 0.0 GB)

      Extended attributes:

      Address space: 273804165120B (255.00 GB)

      Object class: vmnamespace

      Object path: /vmfs/volumes/vsan:52f5f66efc19ecc0-f72aa19c783c8172/

      Object capabilities: NONE
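One way to read the output above: every component is ACTIVE, yet both data replicas carry a STALE sequence number, so no replica can be trusted as current. The sketch below is a deliberately simplified model of the commonly described rule that an object needs more than half of its votes plus one complete, up-to-date replica. The `Component` class and `is_accessible` function are hypothetical, and not counting stale components toward the vote is an assumption made for the sketch, not confirmed VSAN behavior.

```python
from dataclasses import dataclass


@dataclass
class Component:
    kind: str       # "replica" or "witness"
    active: bool    # state: ACTIVE in vsan.object_info
    stale: bool     # csn: STALE in vsan.object_info
    votes: int = 1


def is_accessible(components):
    """Simplified rule: majority of votes usable AND one current replica."""
    total_votes = sum(c.votes for c in components)
    usable = [c for c in components if c.active and not c.stale]
    usable_votes = sum(c.votes for c in usable)
    has_current_replica = any(c.kind == "replica" for c in usable)
    return usable_votes * 2 > total_votes and has_current_replica


# The RAID_1 object from the output above: two stale replicas and a witness.
posted_object = [
    Component("replica", active=True, stale=True),   # host b4300.
    Component("replica", active=True, stale=True),   # host b1200.
    Component("witness", active=True, stale=False),  # host b9100.
]

if __name__ == "__main__":
    print(is_accessible(posted_object))  # False under this simplified model
```

In this model the object stays inaccessible until at least one replica is no longer marked stale, which matches the symptom reported here even though all three components are online.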

       

      I tried to fix it with the "vsan.check_state -r -e ~cluster" command, but it didn't change anything. I also tried going to the Virtual SAN tab in the vSphere client and repairing it with the "Repair object immediately" button, but the button was grayed out.

       

      Does anybody have a solution for the problem?

      I have probably read everything available on the Internet about VSAN at this point and could not find any mention of a situation like this.

      Honestly, this should not happen in an enterprise-class solution, unless I missed something.

        • 1. Re: VSAN failed test. Objects inaccessible. All object components marked STALE.
          zdickinson Expert

          Good morning, you tried to break vSAN and succeeded!  I did the same test when we went to production on v5.5 and it recovered correctly.  At this point the next step would be support or your backup/DR solution.  Thank you, Zach.

          • 2. Re: VSAN failed test. Objects inaccessible. All object components marked STALE.
            UBUTS5 Lurker

            Good Day,

             

            I have the same issue here: the inaccessible object needs to be deleted, but it cannot be removed.

            I have tried all the commands and solutions I found on websites, but none of them deleted the object.

             

            Does anyone have a solution?

             

            Collecting all inaccessible Virtual SAN objects...

            Found 1 inaccessbile objects.

            Selecting vswp objects from inaccessible objects by checking their extended attributes...

            Found 0 inaccessible vswp objects.

             

             

            2017-11-15 12:14:29 +0200: Step 1: Check for inaccessible VSAN objects

            Detected 87ead358-bcbe-fc82-08f6-40a8f021a3d4 to be inaccessible, refreshing state

             

            2017-11-15 12:14:35 +0200: Step 1b: Check for inaccessible VSAN objects, again

            Detected 87ead358-bcbe-fc82-08f6-40a8f021a3d4 is still inaccessible

             

            2017-11-15 12:14:36 +0200: Step 2: Check for invalid/inaccessible VMs

             

            2017-11-15 12:14:36 +0200: Step 2b: Check for invalid/inaccessible VMs again

             

            2017-11-15 12:14:36 +0200: Step 3: Check for VMs for which VC/hostd/vmx are out of sync

            Did not find VMs for which VC/hostd/vmx are out of sync