VMware Cloud Community
foodandbikes
Enthusiast

Disk stuck in reduced-availability-with-no-rebuild

Since a VSAN crash on Monday I have been having issues getting one virtual disk back into compliance.

I have an open support case but support is not being responsive, so hopefully someone here has experienced the same thing and can offer up some info.

The VSAN health tool shows the disk in the state "Reduced availability with no rebuild".

As best I can tell, all the problems have been resolved, but the components will not resync, even when I click the "Repair Objects Immediately" button. There are no errors or warnings in the cluster other than this disk.

The help has this to say about the state.

Reduced availability with no rebuild: The object has suffered a failure, but VSAN was able to tolerate it. For example: I/O is flowing and the object is accessible. However, VSAN is not working on re-protecting the object. This is not due to the delay timer (reduced availability - no rebuild - delay timer) but due to other reasons. This could be because there are not enough resources in the cluster, or this could be because there was not enough resources in the past, or there was a failure to re-protect in the past and VSAN has yet to retry. Refer to the limits health check for a first assessment if any resources may be exhausted. You have to resolve the failure or add resources as quickly as possible in order to get back to being fully protected against a subsequent failure.

Below is the output from the command "vsan.vm_object_info 7"

The VM has several disks, but this is the only one that has this problem.

It would be nice if VSAN just deleted the STALE ones and created new ones.

Cormac's blog post explains it a bit at the end, except my system is not doing a resync.

VSAN Part 31: Object compliance and operational status - CormacHogan.com

  Disk backing: [vsanDatastore] b2207355-5edb-36ed-0a6c-a0369f613ca4/SERVER_OFFICE01_1.vmdk

    DOM Object: c9207355-2693-b93a-3f57-a0369f613ca4 (v2, owner: esx-h02, policy: forceProvisioning = 0, hostFailuresToTolerate = 1, spbmProfileId = aa6d5a82-1c88-45da-85d3-3d74b91a5bad, proportionalCapacity = 0, spbmProfileGenerationNumber = 1, cacheReservation = 0, stripeWidth = 1)

      RAID_1

        RAID_0

          Component: 15505b56-a04d-3d3e-f47e-a0369f613924 (state: ACTIVE (5), host: esx-h01, md: naa.5000cca06d19a418, ssd: naa.55cd2e404b7c6bb2,

                                                           votes: 1, usage: 200.9 GB)

          Component: 15505b56-78d6-3f3e-acee-a0369f613924 (state: ACTIVE (5), host: esx-h01, md: naa.5000cca06d19c404, ssd: naa.55cd2e404b7995d6,

                                                           votes: 1, usage: 200.9 GB)

          Component: 15505b56-0e0f-413e-e188-a0369f613924 (state: ACTIVE (5), host: esx-h01, md: naa.5000cca06d19c404, ssd: naa.55cd2e404b7995d6,

                                                           votes: 1, usage: 200.9 GB)

          Component: 15505b56-ae10-423e-9188-a0369f613924 (state: ACTIVE (5), host: esx-h01, md: naa.5000cca06d16c1b8, ssd: naa.55cd2e404b7c6bb2,

                                                           votes: 1, usage: 200.9 GB)

          Component: 15505b56-4221-433e-d6f8-a0369f613924 (state: ACTIVE (5), host: esx-h01, md: naa.5000cca06d19c404, ssd: naa.55cd2e404b7995d6,

                                                           votes: 1, usage: 200.9 GB)

          Component: 15505b56-651b-443e-1fe5-a0369f613924 (state: ACTIVE (5), host: esx-h01, md: naa.5000cca06d16c1b8, ssd: naa.55cd2e404b7c6bb2,

                                                           votes: 1, usage: 200.9 GB)

          Component: 15505b56-4c1b-453e-b197-a0369f613924 (state: ACTIVE (5), host: esx-h01, md: naa.5000cca06d16c1b8, ssd: naa.55cd2e404b7c6bb2,

                                                           votes: 1, usage: 200.9 GB)

          Component: 15505b56-8622-463e-49f7-a0369f613924 (state: ACTIVE (5), host: esx-h01, md: naa.5000cca06d19a418, ssd: naa.55cd2e404b7c6bb2,

                                                           votes: 1, usage: 200.9 GB)

          Component: 15505b56-765b-473e-3ce1-a0369f613924 (state: ACTIVE (5), host: esx-h01, md: naa.5000cca06d19c404, ssd: naa.55cd2e404b7995d6,

                                                           votes: 1, usage: 200.9 GB)

        RAID_0

          Component: 15505b56-239e-483e-354e-a0369f613924 (state: DEGRADED (9), csn: STALE (211!=220), host: esx-h02, md: naa.5000cca06d19ca38, ssd: naa.55cd2e404b7995a5,

                                                           votes: 1, usage: 200.9 GB)

          Component: 15505b56-ef1b-4a3e-96b0-a0369f613924 (state: DEGRADED (9), csn: STALE (211!=220), host: esx-h02, md: naa.5000cca06d18e820, ssd: naa.55cd2e404b7995a5,

                                                           votes: 1, usage: 200.9 GB)

          Component: 15505b56-4615-4b3e-3e4e-a0369f613924 (state: DEGRADED (9), csn: STALE (211!=220), host: esx-h02, md: naa.5000cca06d139cf0, ssd: naa.55cd2e404b7c70c8,

                                                           votes: 1, usage: 200.9 GB)

          Component: 15505b56-7524-4c3e-c8c9-a0369f613924 (state: DEGRADED (9), csn: STALE (211!=220), host: esx-h02, md: naa.5000cca06d139cf0, ssd: naa.55cd2e404b7c70c8,

                                                           votes: 1, usage: 200.9 GB)

          Component: 15505b56-701f-4d3e-3af4-a0369f613924 (state: DEGRADED (9), csn: STALE (211!=220), host: esx-h02, md: naa.5000cca06d19c960, ssd: naa.55cd2e404b7c70c8,

                                                           votes: 1, usage: 200.9 GB)

          Component: 15505b56-2c23-4e3e-e642-a0369f613924 (state: DEGRADED (9), csn: STALE (211!=220), host: esx-h02, md: naa.5000cca06d19ca38, ssd: naa.55cd2e404b7995a5,

                                                           votes: 1, usage: 200.9 GB)

          Component: 15505b56-f924-4f3e-a11a-a0369f613924 (state: DEGRADED (9), csn: STALE (211!=220), host: esx-h02, md: naa.5000cca06d19c960, ssd: naa.55cd2e404b7c70c8,

                                                           votes: 1, usage: 200.9 GB)

          Component: 15505b56-9e1b-503e-5741-a0369f613924 (state: DEGRADED (9), csn: STALE (211!=220), host: esx-h02, md: naa.5000cca06d139cf0, ssd: naa.55cd2e404b7c70c8,

                                                           votes: 1, usage: 200.9 GB)

          Component: 15505b56-0c04-513e-ffd8-a0369f613924 (state: DEGRADED (9), csn: STALE (211!=220), host: esx-h02, md: naa.5000cca06d19c960, ssd: naa.55cd2e404b7c70c8,

                                                           votes: 1, usage: 200.9 GB)

      Witness: 5eda5b56-8b1e-fb8c-dc99-a0369f613924 (state: ACTIVE (5), host: esx-h03, md: naa.5000cca06d1a0b5c, ssd: naa.55cd2e404b7995ca,

                                                     votes: 9, usage: 0.0 GB)
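For anyone else staring at output like this, a quick way to triage it is to count component states per host and pull out the stale copies. A rough Python sketch (the regex fields match the RVC output format above; this is just a triage aid, not an official tool):

```python
import re

# Parse "Component:" lines from vsan.vm_object_info output.
# The optional csn group only matches STALE components, whose component
# sequence number (e.g. 211) lags the object's (e.g. 220).
COMP_RE = re.compile(
    r"Component: (?P<uuid>[0-9a-f-]+) \(state: (?P<state>\w+) \(\d+\)"
    r"(?:, csn: STALE \((?P<csn>\d+)!=(?P<osn>\d+)\))?, host: (?P<host>[\w.-]+)"
)

def tally_components(text):
    """Return ({(host, state): count}, [stale component UUIDs])."""
    counts, stale = {}, []
    for m in COMP_RE.finditer(text):
        key = (m.group("host"), m.group("state"))
        counts[key] = counts.get(key, 0) + 1
        if m.group("csn"):  # csn mismatch -> this copy is a stale replica
            stale.append(m.group("uuid"))
    return counts, stale
```

Run against the paste above, it shows one healthy nine-component leg on esx-h01 and nine DEGRADED/STALE components on esx-h02, which is exactly the "reduced availability" picture.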

8 Replies
zdickinson
Expert

Good morning.  You mentioned you tried "Repair Objects Immediately"; have you tried re-applying the storage policy?  Or perhaps make a new one with slightly different settings (stripe width = 2) and then apply that?  My other thought would be to Storage vMotion it to other storage and then back to vSAN.

Side note: you mention support not being responsive.  This is also something I have noticed; when I call in for support, I often get "The guy who knows vSAN is not available, he'll call you back."  Really?!  "The guy"?  Anecdotally, it appears that support resources for vSAN are lacking.

Thank you, Zach.

foodandbikes
Enthusiast

Zach, thanks for the input.

I've thought about creating a new policy with FT of 0 and SW of 1 in hopes that it will delete the degraded objects, then putting it back on the policy it's on now (FT=1, SW=1).

Going to a different stripe width is a last resort, since it will consume a few TB of disk space to make the change and put a huge load on VSAN, effectively bringing all the VMs to a crawl.
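The space math backs that up. Each mirror leg in the output above is nine components of about 200.9 GB, so a restripe rewrites the whole object; a quick back-of-the-envelope check:

```python
# Rough estimate of the data a full restripe would rewrite, using the
# component sizes from the object info above (approximate figures).
components_per_leg = 9
gb_per_component = 200.9
mirror_legs = 2  # hostFailuresToTolerate = 1 -> RAID_1 of two legs

per_leg_gb = components_per_leg * gb_per_component  # ~1808 GB per leg
total_gb = per_leg_gb * mirror_legs                 # ~3616 GB of object data
print(f"~{total_gb / 1024:.1f} TB rewritten by a full restripe")  # ~3.5 TB
```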

The "Re-Apply" option only applies to disks in an "Out of Date" state; when I run the re-apply, it reports there are no disks "Out of Date" and doesn't do anything.

I can assure you there are 2 guys on the VSAN support team; I've worked with them both. ;)

-dan

zdickinson
Expert

That's very funny!

foodandbikes
Enthusiast

Working with support, we decided that removing the storage policy and re-applying it would likely resolve the issue.

I created a new policy with the same stripe width of 1 but a fault tolerance of 0.

Applying it immediately cleaned up the disk.

Then I applied the previously assigned policy and the resync started immediately.

An easy fix, but having never gone through this before, I took the cautious route.

eknauft
VMware Employee

Hi foodandbikes.  Sorry the resync didn't kick in automatically.  There are several reasons why this may be the case, and we would need more info about the state of the cluster to tell for sure.  One likely explanation is that there aren't enough resources in the cluster to do a rebuild.  For components marked DEGRADED (e.g. due to a disk I/O failure or bad SMART status), we do not resync them automatically even if the disk becomes healthy again.  They should be replaced with entirely new components, but to do that we need some temporary extra space in the cluster, and it must lie on a different fault domain than the three already in use (here hosts 01, 02, and 03).  Are there only three hosts in the cluster with space available?  If so, adding another host or freeing up space on another host may solve your problem.
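That placement constraint (a rebuild target must be a fault domain holding no part of the object, with enough free space) can be sketched as a simple check. The host names and free-space figures below are made up for illustration:

```python
def can_rebuild(used_hosts, free_gb_by_host, needed_gb):
    """Return hosts eligible to receive a rebuilt mirror leg: they must
    hold no part of the object yet and have enough free capacity."""
    return {h: gb for h, gb in free_gb_by_host.items()
            if h not in used_hosts and gb >= needed_gb}

# In this thread: data on esx-h01/esx-h02, witness on esx-h03, no 4th host.
free = {"esx-h01": 500.0, "esx-h02": 500.0, "esx-h03": 2000.0}  # example numbers
print(can_rebuild({"esx-h01", "esx-h02", "esx-h03"}, free, 1808.1))  # -> {}
```

With all three fault domains already used by the object, the eligible set is empty, which matches why vSAN had nowhere to rebuild the degraded leg until the policy was removed and re-applied.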

Just wondering -- is there an SR open for this case, and are vmsupport bundles available to VMware support staff?  Thanks.

gsuryanarayana
VMware Employee

I am from the vSAN team and I'd like to add a few things to the previous post. The object started with hft=1 (2 replicas), and a failure degraded one of the replicas. At that point vSAN immediately tried to fix this, and it probably failed due to resource constraints. When this happens, vSAN will try to fix the object the next time it gets to it, which should be within the next 24-hour period. The current status of the object is that it is waiting for vSAN to get to it. When the same policy is reapplied, the object is fixed immediately. We would like to look at the SR if you have opened one already. Thanks!

foodandbikes
Enthusiast

Thanks for the info.

VSAN sat in the degraded state for about 48 hours, possibly longer, without an automatic rebuild being kicked off.

I looked but could not find any logs anywhere that indicated it was trying to do a rebuild, or why it couldn't do a rebuild.

While I would love to add a 4th host there is no chance of it.

SR# 16864316901

The SR was for a different issue, the disks not rebuilding was a result of an outage we had; the outage was the reason for opening the case.

See issue #3 in this thread for a description of what happened.

My VSAN nightmare

I have not been able to upload support bundles. My browser always crashes when trying to download the bundles; still trying to figure that one out.

wreedMH
Hot Shot

How do you immediately re-apply the existing storage policy? It is grayed out for me.
