  • 1.  Disk stuck in reduced-availability-with-no-rebuild

    Posted Jan 21, 2016 12:41 AM

    Since a VSAN crash on Monday, I have been having issues getting one virtual disk back into compliance.

    I have an open support case, but support has not been responsive, so hopefully someone here has experienced the same thing and can offer some info.

    The VSAN health tool shows the disk in the state "Reduced availability with no rebuild".

    Best I can tell, all the problems have been resolved, but the components will not resync, even when I click the "Repair Objects Immediately" button. There are no errors or warnings in the cluster other than this disk.

    The help text has this to say about the state:

    Reduced availability with no rebuild: The object has suffered a failure, but VSAN was able to tolerate it. For example, I/O is flowing and the object is accessible. However, VSAN is not working on re-protecting the object. This is not due to the delay timer (reduced availability - no rebuild - delay timer) but due to other reasons: there may not be enough resources in the cluster, there may not have been enough resources in the past, or there was a failure to re-protect in the past and VSAN has yet to retry. Refer to the limits health check for a first assessment of whether any resources may be exhausted. You have to resolve the failure or add resources as quickly as possible in order to get back to being fully protected against a subsequent failure.

    Below is the output from the command "vsan.vm_object_info 7".

    The VM has several disks, but this is the only one that has this problem.

    It would be nice if VSAN just deleted the STALE ones and created new ones.

    Cormac's blog post explains it a bit at the end, except my system is not doing a resync.

    VSAN Part 31: Object compliance and operational status - CormacHogan.com
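
    (For anyone unfamiliar with it, vsan.vm_object_info is a Ruby vSphere Console (RVC) command. I ran it roughly as sketched below; the vCenter address and inventory paths here are placeholders, and the "7" refers to a numbered entry from the preceding ls.)

      rvc administrator@vsphere.local@vcenter.example.com
      cd /vcenter.example.com/Datacenter/vms
      ls
      vsan.vm_object_info 7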

      Disk backing: [vsanDatastore] b2207355-5edb-36ed-0a6c-a0369f613ca4/SERVER_OFFICE01_1.vmdk
        DOM Object: c9207355-2693-b93a-3f57-a0369f613ca4 (v2, owner: esx-h02, policy: forceProvisioning = 0, hostFailuresToTolerate = 1, spbmProfileId = aa6d5a82-1c88-45da-85d3-3d74b91a5bad, proportionalCapacity = 0, spbmProfileGenerationNumber = 1, cacheReservation = 0, stripeWidth = 1)
          RAID_1
            RAID_0
              Component: 15505b56-a04d-3d3e-f47e-a0369f613924 (state: ACTIVE (5), host: esx-h01, md: naa.5000cca06d19a418, ssd: naa.55cd2e404b7c6bb2, votes: 1, usage: 200.9 GB)
              Component: 15505b56-78d6-3f3e-acee-a0369f613924 (state: ACTIVE (5), host: esx-h01, md: naa.5000cca06d19c404, ssd: naa.55cd2e404b7995d6, votes: 1, usage: 200.9 GB)
              Component: 15505b56-0e0f-413e-e188-a0369f613924 (state: ACTIVE (5), host: esx-h01, md: naa.5000cca06d19c404, ssd: naa.55cd2e404b7995d6, votes: 1, usage: 200.9 GB)
              Component: 15505b56-ae10-423e-9188-a0369f613924 (state: ACTIVE (5), host: esx-h01, md: naa.5000cca06d16c1b8, ssd: naa.55cd2e404b7c6bb2, votes: 1, usage: 200.9 GB)
              Component: 15505b56-4221-433e-d6f8-a0369f613924 (state: ACTIVE (5), host: esx-h01, md: naa.5000cca06d19c404, ssd: naa.55cd2e404b7995d6, votes: 1, usage: 200.9 GB)
              Component: 15505b56-651b-443e-1fe5-a0369f613924 (state: ACTIVE (5), host: esx-h01, md: naa.5000cca06d16c1b8, ssd: naa.55cd2e404b7c6bb2, votes: 1, usage: 200.9 GB)
              Component: 15505b56-4c1b-453e-b197-a0369f613924 (state: ACTIVE (5), host: esx-h01, md: naa.5000cca06d16c1b8, ssd: naa.55cd2e404b7c6bb2, votes: 1, usage: 200.9 GB)
              Component: 15505b56-8622-463e-49f7-a0369f613924 (state: ACTIVE (5), host: esx-h01, md: naa.5000cca06d19a418, ssd: naa.55cd2e404b7c6bb2, votes: 1, usage: 200.9 GB)
              Component: 15505b56-765b-473e-3ce1-a0369f613924 (state: ACTIVE (5), host: esx-h01, md: naa.5000cca06d19c404, ssd: naa.55cd2e404b7995d6, votes: 1, usage: 200.9 GB)
            RAID_0
              Component: 15505b56-239e-483e-354e-a0369f613924 (state: DEGRADED (9), csn: STALE (211!=220), host: esx-h02, md: naa.5000cca06d19ca38, ssd: naa.55cd2e404b7995a5, votes: 1, usage: 200.9 GB)
              Component: 15505b56-ef1b-4a3e-96b0-a0369f613924 (state: DEGRADED (9), csn: STALE (211!=220), host: esx-h02, md: naa.5000cca06d18e820, ssd: naa.55cd2e404b7995a5, votes: 1, usage: 200.9 GB)
              Component: 15505b56-4615-4b3e-3e4e-a0369f613924 (state: DEGRADED (9), csn: STALE (211!=220), host: esx-h02, md: naa.5000cca06d139cf0, ssd: naa.55cd2e404b7c70c8, votes: 1, usage: 200.9 GB)
              Component: 15505b56-7524-4c3e-c8c9-a0369f613924 (state: DEGRADED (9), csn: STALE (211!=220), host: esx-h02, md: naa.5000cca06d139cf0, ssd: naa.55cd2e404b7c70c8, votes: 1, usage: 200.9 GB)
              Component: 15505b56-701f-4d3e-3af4-a0369f613924 (state: DEGRADED (9), csn: STALE (211!=220), host: esx-h02, md: naa.5000cca06d19c960, ssd: naa.55cd2e404b7c70c8, votes: 1, usage: 200.9 GB)
              Component: 15505b56-2c23-4e3e-e642-a0369f613924 (state: DEGRADED (9), csn: STALE (211!=220), host: esx-h02, md: naa.5000cca06d19ca38, ssd: naa.55cd2e404b7995a5, votes: 1, usage: 200.9 GB)
              Component: 15505b56-f924-4f3e-a11a-a0369f613924 (state: DEGRADED (9), csn: STALE (211!=220), host: esx-h02, md: naa.5000cca06d19c960, ssd: naa.55cd2e404b7c70c8, votes: 1, usage: 200.9 GB)
              Component: 15505b56-9e1b-503e-5741-a0369f613924 (state: DEGRADED (9), csn: STALE (211!=220), host: esx-h02, md: naa.5000cca06d139cf0, ssd: naa.55cd2e404b7c70c8, votes: 1, usage: 200.9 GB)
              Component: 15505b56-0c04-513e-ffd8-a0369f613924 (state: DEGRADED (9), csn: STALE (211!=220), host: esx-h02, md: naa.5000cca06d19c960, ssd: naa.55cd2e404b7c70c8, votes: 1, usage: 200.9 GB)
          Witness: 5eda5b56-8b1e-fb8c-dc99-a0369f613924 (state: ACTIVE (5), host: esx-h03, md: naa.5000cca06d1a0b5c, ssd: naa.55cd2e404b7995ca, votes: 9, usage: 0.0 GB)



  • 2.  RE: Disk stuck in reduced-availability-with-no-rebuild
    Best Answer

    Posted Jan 21, 2016 01:06 PM

    Good morning. You mentioned you tried "Repair Objects Immediately"; have you tried re-applying the storage policy? Or perhaps making a new one with slightly different settings (stripe width = 2) and then applying that? My other thought would be to Storage vMotion it to other storage and then Storage vMotion it back to vSAN.
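
    If you would rather push a per-object policy change from the command line, here is a rough RVC sketch. I'm assuming the vsan.object_reconfigure command and reusing the DOM object UUID from your output; "." stands for the current cluster path, and you should check the command's -h before running it:

      vsan.object_reconfigure . c9207355-2693-b93a-3f57-a0369f613ca4 --policy '(("hostFailuresToTolerate" i1) ("stripeWidth" i2))'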

    Side note: you mention support not being responsive. This is something I have noticed as well; when I call in for support, I often get "The guy who knows vSAN is not available, he'll call you back." Really?! "The" guy? Anecdotally, it appears that support resources for vSAN are lacking.

    Thank you, Zach.



  • 3.  RE: Disk stuck in reduced-availability-with-no-rebuild

    Posted Jan 21, 2016 06:45 PM

    Zach, thanks for the input.

    I've thought about creating a new policy with FT of 0 and SW of 1 in hopes that it will delete the degraded objects, then putting it back to the policy it is on now (FT=1, SW=1).

    Going to a different stripe width is a last resort, since it would consume a few TB of disk space to make the change and put a huge load on VSAN, effectively bringing all the VMs to a crawl.

    The "Re-Apply Policy" option only applies to disks that are in an "Out of Date" state; when I run the re-apply, it states there are no disks "Out of Date" and does nothing.
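
    Next I may poke at it from RVC to see whether anything is flagged at the cluster level; something like the two commands below, with the cluster as the current path (a sketch, I have not run these against this cluster yet). vsan.check_state reports inaccessible or invalid objects, and vsan.resync_dashboard shows whether any resync traffic is actually in flight:

      vsan.check_state .
      vsan.resync_dashboard .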

    I can assure you there are two guys on the VSAN support team; I've worked with them both ;-)

    -dan



  • 4.  RE: Disk stuck in reduced-availability-with-no-rebuild

    Posted Jan 21, 2016 08:25 PM

    That's very funny!



  • 5.  RE: Disk stuck in reduced-availability-with-no-rebuild

    Posted Jan 22, 2016 09:09 PM

    Working with support, it was decided that removing the storage policy and re-applying it would likely resolve the issue.

    I created a new policy with the same stripe width (1) but failures to tolerate set to 0.

    Applying it immediately cleaned up the disk.

    I then applied the previously assigned policy, and it immediately started a resync.

    Easy fix, but having never gone through this before, I decided to take the cautious route.
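
    For anyone who hits this later and prefers the command line, I believe the same two-step fix could be driven from RVC roughly as below (a sketch assuming the vsan.object_reconfigure command; the UUID is the DOM object from my first post, and you would wait for the stale replica to be discarded before re-applying the original FTT=1 policy):

      vsan.object_reconfigure . c9207355-2693-b93a-3f57-a0369f613ca4 --policy '(("hostFailuresToTolerate" i0) ("stripeWidth" i1))'
      vsan.object_reconfigure . c9207355-2693-b93a-3f57-a0369f613ca4 --policy '(("hostFailuresToTolerate" i1) ("stripeWidth" i1))'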



  • 6.  RE: Disk stuck in reduced-availability-with-no-rebuild

    Broadcom Employee
    Posted Jan 24, 2016 01:33 AM

    Hi foodandbikes. Sorry the resync didn't kick in automatically. There are several reasons why this may be the case, and we would need more info about the state of the cluster to tell for sure. One likely explanation is that there are not enough resources in the cluster to do a rebuild. Components marked DEGRADED (e.g. due to a disk I/O failure or bad SMART status) are not resynced automatically even if the disk becomes healthy again. They should be replaced with entirely new components, but to do that we require some temporary extra space in the cluster, and it must lie on a different fault domain from the three already in use (here hosts 01, 02, and 03). Are there only three hosts in the cluster with space available? If so, adding another host or freeing up space on another host may solve your problem.
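
    If you have RVC handy, two quick checks on the resource side (with the cluster as the current path): vsan.disks_stats shows per-disk capacity and usage, so you can spot full or unhealthy disks, and vsan.whatif_host_failures estimates whether enough capacity exists to rebuild after a host failure.

      vsan.disks_stats .
      vsan.whatif_host_failures .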

    Just wondering -- is there an SR open for this case, and are vmsupport bundles available to VMware support staff?  Thanks.



  • 7.  RE: Disk stuck in reduced-availability-with-no-rebuild

    Posted Jan 24, 2016 02:07 AM

    I am from the vSAN team and I'd like to add a few things to the previous post. The object initially started with hft=1 (2 replicas), and we suffered a failure that degraded one of the replicas. At that time vSAN immediately tried to fix this, and it probably failed due to resource constraints. When this happens, vSAN will try to fix the object the next time it gets to it, which should be within the next 24-hour period. The current status of the object is that it is waiting for vSAN to get to it. When the same policy is re-applied, the object is fixed immediately. We would like to look at the SR if you have opened one already. Thanks!



  • 8.  RE: Disk stuck in reduced-availability-with-no-rebuild

    Posted Jan 25, 2016 10:16 PM

    Thanks for the info.

    VSAN sat in the degraded state for 48 hours or longer without an automatic rebuild being kicked off.

    I looked but could not find any logs indicating that it was trying to rebuild, or why it couldn't.

    While I would love to add a 4th host, there is no chance of it.

    SR# 16864316901

    The SR was for a different issue; the disks not rebuilding were a result of an outage we had, and the outage was the reason for opening the case.

    See issue #3 in this thread for a description of what happened:

    My VSAN nightmare

    I have not been able to upload support bundles; my browser always crashes when trying to download the bundles. Still trying to figure that one out.



  • 9.  RE: Disk stuck in reduced-availability-with-no-rebuild

    Posted Nov 11, 2017 07:46 PM

    How do you immediately re-apply the existing storage policy? It is grayed out for me.