VMware Cloud Community
davemuench
Contributor

vCenter 7.0u3 shutdown vSAN cluster results in broken cluster

Hi,

I used the new vSAN cluster shutdown wizard yesterday for the first time when I had to shut a lab down for a power outage. The cluster consists of three Dell R730xd nodes, with vCenter residing on a separate non-vSAN node. The shutdown was clean, and the hosts were shut down properly before the power was lost. On bootup, all the vSAN VMs are listed as Inaccessible, and they aren't visible when browsing the datastore (via GUI or command line).

The button to restart the cluster was not present, so I followed the instructions to manually restart the cluster from the command line. The recover script, however, times out:

[root@esx01:/tmp] python /usr/lib/vmware/vsan/bin/reboot_helper.py recover
Begin to recover the cluster...
Time among connected hosts are synchronized.
Scheduled vSAN cluster restore task.
Waiting for the scheduled task...(18s left)
Checking network status...
Recovery is not ready, retry after 10s...
Recovery is not ready, retry after 10s...
Recovery is not ready, retry after 10s...
Timeout, please try again later

 

I have been digging since then with no success. The cluster looks to have reformed properly:

[root@esx01:/tmp] esxcli vsan cluster get
Cluster Information
   Enabled: true
   Current Local Time: 2021-10-26T18:22:51Z
   Local Node UUID: 5ec45e01-2a6f-de52-2b8d-a0369f59e56c
   Local Node Type: NORMAL
   Local Node State: MASTER
   Local Node Health State: HEALTHY
   Sub-Cluster Master UUID: 5ec45e01-2a6f-de52-2b8d-a0369f59e56c
   Sub-Cluster Backup UUID: 5ed1c2d6-fd4f-e6ea-1e6c-001b21d41ea0
   Sub-Cluster UUID: 528157e4-4935-2809-ab88-5d161aec89a5
   Sub-Cluster Membership Entry Revision: 4
   Sub-Cluster Member Count: 3
   Sub-Cluster Member UUIDs: 5ec45e01-2a6f-de52-2b8d-a0369f59e56c, 5ed1c2d6-fd4f-e6ea-1e6c-001b21d41ea0, 614b991e-f0f6-8762-c918-801844e56f42
   Sub-Cluster Member HostNames: esx01, esx02, esx03
   Sub-Cluster Membership UUID: fd197861-36c6-b896-868a-a0369f59e56c
   Unicast Mode Enabled: true
   Maintenance Mode State: OFF
   Config Generation: a01749d0-4f70-4911-aeb0-919cfdc176bb 31 2021-10-26T18:12:08.28
   Mode: REGULAR

 

esx01 is the master, esx02 shows as backup, and esx03 as agent. The unicast list looks correct, and the vSAN vmks were re-tagged properly.
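For reference, checking that from the shell on each host looks roughly like this (commands only, no output shown):

# confirm roles and member count match across the three nodes
esxcli vsan cluster get
# confirm each host has unicast agent entries for the other two nodes
esxcli vsan cluster unicastagent list
# confirm the vmk is still tagged for vsan traffic
esxcli vsan network list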

In RVC, this is typical of what I see for each VM:

/localhost/Lab/vms/wan-test> vsan.vm_object_info .
VM wan-test:
  Disk backing:
    [vsanDatastore] 65d06061-65f9-6456-0383-a0369f59e56c/wan-test.vmdk

  [vsanDatastore] 65d06061-65f9-6456-0383-a0369f59e56c/wan-test.vmx
    DOM Object: 65d06061-65f9-6456-0383-a0369f59e56c (v15, owner: esx01, proxy owner: None, policy: No POLICY entry found in CMMDS)
      RAID_1
        Component: 65d06061-6ce4-8057-b167-a0369f59e56c (state: ACTIVE (5), host: esx01, capacity: naa.5002538e30820eb4, cache: naa.55cd2e404b795ca0,
                                                         votes: 1, usage: 0.2 GB, proxy component: false)
        Component: 65d06061-7415-8457-02b2-a0369f59e56c (state: ABSENT (6), csn: STALE (109!=160), host: esx03, capacity: naa.5002538ec110af17, cache: naa.55cd2e404b796059,
                                                         dataToSync: 0.21 GB, votes: 1, usage: 0.2 GB, proxy component: false)
      Witness: 65d06061-92d5-8757-6ca8-a0369f59e56c (state: ACTIVE (5), host: esx02, capacity: naa.5002538e4066db91, cache: naa.55cd2e404b78c07a,
                                                     votes: 1, usage: 0.0 GB, proxy component: false)

  [vsanDatastore] 65d06061-65f9-6456-0383-a0369f59e56c/wan-test.vmdk
    DOM Object: 97d06061-ede2-06ca-1bb3-001b21d41ea0 (v15, owner: esx01, proxy owner: None, policy: No POLICY entry found in CMMDS)
      RAID_1
        Component: 97d06061-d7a3-c9ca-c0c3-001b21d41ea0 (state: ABSENT (6), csn: STALE (67!=202), host: esx03, capacity: naa.5002538ec110b853, cache: naa.55cd2e404b796059,
                                                         dataToSync: 1.56 GB, votes: 1, usage: 1.6 GB, proxy component: false)
        Component: 97d06061-832b-cbca-1280-001b21d41ea0 (state: ACTIVE (5), host: esx02, capacity: naa.5002538e30820eb0, cache: naa.55cd2e404b78c07a,
                                                         votes: 1, usage: 1.6 GB, proxy component: false)
      Witness: 97d06061-7f37-ccca-803a-001b21d41ea0 (state: ACTIVE (5), host: esx01, capacity: naa.5002538e4102a7d6, cache: naa.55cd2e404b795ca0,
                                                     votes: 1, usage: 0.0 GB, proxy component: false)

 

The two things that stand out to me are the 'No POLICY entry found in CMMDS' message, and that in every VM's case the piece residing on esx03 is the absent one, whether it is a data component or a witness.

I have tried various ways of manipulating the storage policy and reducing FTT to 0, but none of them take, as the new policy can't be applied due to the invalid state of the VM.
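For concreteness, forcing FTT=0 on a single object would normally be done with something like this in RVC (the object UUID is the wan-test vmdk from above; <cluster> is a placeholder for your own cluster path), but in this state the reconfigure is refused:

# attempt to drop a single object to FTT=0 (did not take in my case)
vsan.object_reconfigure <cluster> 97d06061-ede2-06ca-1bb3-001b21d41ea0 --policy '(("hostFailuresToTolerate" i0))'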

Any help would be greatly appreciated. I'd love to open a support ticket, but we don't have support on this small lab environment, and I'd rather not rebuild from backups if I can avoid it. I'm also trying to understand why this happened in the first place and whether the cluster shutdown functionality can be relied upon.

davemuench
Contributor

Thanks to a suggestion on Reddit, I am making some progress.

vsish -e set /vmkModules/vsan/dom/ownerAbdicate <uuid>

is getting the components back from ABSENT to ACTIVE. Still working out my next step, though.
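A rough way to run that across every object rather than one UUID at a time would be a loop like the one below (the grep/awk parsing of the esxcli debug output is approximate and may need adjusting for your build):

# abdicate DOM ownership for every vSAN object visible to this host
for uuid in $(esxcli vsan debug object list --all | grep "Object UUID" | awk '{print $3}'); do
  vsish -e set /vmkModules/vsan/dom/ownerAbdicate $uuid
done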

jhunter1
VMware Employee

Please open a support request (SR) if you are able to. 1. We would like to help troubleshoot, and 2. it would be great to gather logs and such to determine what went wrong and fix the issue in the code in an upcoming release.

davemuench
Contributor

I very much wish I could, but despite this being on enterprise hardware it's a VMUG homelab.

Beyond getting the components all into an active state by abdicating ownership (which does not survive a node reboot or cluster reboot), I appear to be at a dead end; the cluster looks to be a total loss. All the objects and VMs stored on vSAN are inaccessible, and newly created VMs also go inaccessible a short time after creation.

More than anything, I'd like to know what happened here, as my main concern is whether it could happen again, in this environment or in more critical ones.

davemuench
Contributor
(Accepted Solution)

Here's the post-mortem: I reinstalled the cluster with 7.0u3a. The same behavior started again almost immediately after configuring vSAN: objects/VMs going inaccessible, I/O errors when listing the datastore, etc. I even zeroed out the vSAN disks ahead of time to make sure no old metadata was picked up by the new install.

I then reinstalled again with ESXi 7.0u2d (but the same vCenter 7.0u3a). It works great now, no problems at all. The non-wizard cluster shutdown and startup also work great.

7.0u3/3a has some serious issues with vSAN, at least on my R730xds. I wish I could open a support case to help diagnose the problem, but it's a homelab.

llb1
Contributor

Hello, I have the same problem. Is your final installation running 7.0U2d?

Were the VMs recovered?

TheBobkin
Champion

@llb1, I would advise being wary of making assertions such as "I have the same problem" when it is very unclear what the OP's problem was here.


From looking at the data, the OP clearly had issues on node esx03. For example, you can see that it was missing hundreds of data updates to components, e.g. 'STALE (67!=202)' means the current data is on revision 202 while this component on node esx03 is on data revision 67 (i.e. out of sync and way behind):
[vsanDatastore] 65d06061-65f9-6456-0383-a0369f59e56c/wan-test.vmdk
  DOM Object: 97d06061-ede2-06ca-1bb3-001b21d41ea0 (v15, owner: esx01, proxy owner: None, policy: No POLICY entry found in CMMDS)
    RAID_1
      Component: 97d06061-d7a3-c9ca-c0c3-001b21d41ea0 (state: ABSENT (6), csn: STALE (67!=202), host: esx03, capacity: naa.5002538ec110b853, cache: naa.55cd2e404b796059,
                 dataToSync: 1.56 GB, votes: 1, usage: 1.6 GB, proxy component: false)
      Component: 97d06061-832b-cbca-1280-001b21d41ea0 (state: ACTIVE (5), host: esx02, capacity: naa.5002538e30820eb0, cache: naa.55cd2e404b78c07a,
                 votes: 1, usage: 1.6 GB, proxy component: false)
    Witness: 97d06061-7f37-ccca-803a-001b21d41ea0 (state: ACTIVE (5), host: esx01, capacity: naa.5002538e4102a7d6, cache: naa.55cd2e404b795ca0,
             votes: 1, usage: 0.0 GB, proxy component: false)
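If you want to see how far behind a node is and how much data still has to catch up, RVC shows this per object, roughly as follows (the cluster path is a placeholder):

vsan.resync_dashboard <cluster>       # bytes left to resync, per object
vsan.obj_status_report -t <cluster>   # table of object health states across the cluster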



@llb1, please provide more information, e.g. what did you do, what happened, and what are the data state and cluster health?

davemuench
Contributor

Hi @llb1, my setup has been stable (and has been through several cluster shutdowns without issue) since I reloaded and rebuilt on the same hardware with 7.0u2d. I had to restore all data from backups.

If you do have the same problem I wish you the best of luck. For me the obvious variable was 7.0u3 and I will not be upgrading to it in the future.

TheBobkin
Champion

@davemuench, bit of an oddly specific question, but do you by any chance have these nodes 'daisy-chained', i.e. directly connected and not using a switch?

davemuench
Contributor

@TheBobkin No, they are attached to 10Gb switches.

llb1
Contributor

It is not clear what the cause of the problem is, but the same problem has occurred. All virtual machines appear orphaned, and nothing is visible when browsing the datastore. What I did: shut down all virtual machines by pressing the Shutdown Cluster button.

TheBobkin
Champion

@llb1, was it after using the cluster shutdown feature and then attempting to revert it, as @davemuench indicated occurred in their case?

 

Asking because the cluster shutdown (either using the vSphere function or the older ESXi reboot_helper script) is expected to make all VM data in the cluster inaccessible, as it isolates all the nodes from one another by untagging vSAN traffic on the vmk configured for it.

 

Have you attempted the recover/restore part of this workflow? If that is not possible (e.g. @davemuench mentioned there was no button visible for reverting the cluster shutdown), have you checked whether all the vSAN vmks (and the witness vmk, if it is a stretched cluster) are untagged for these traffic types, and tried re-tagging them?
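From the ESXi shell, checking and re-tagging looks roughly like this (vmk1 is just a placeholder for whichever vmk carries vSAN traffic on your hosts):

esxcli vsan network list            # shows which vmk, if any, is currently tagged for vsan traffic
esxcli vsan network ip add -i vmk1  # re-tags vmk1 for vsan traffic if the shutdown left it untagged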

 

Is this a production cluster with S&S or a homelab? If it is the former and what I suggested above doesn't help, then please open a P1 Support Request so that my colleagues can check this properly (feel free to PM me the SR number for awareness, but I am on PTO for most of the next few weeks, so I am unlikely to be actively looking into it myself).

heky777
Contributor

We have had the same strange behaviour, absolutely identical. We opened an SR with VMware and they came up with a solution, which is to change the status of the cluster to clusterPoweredOff. And that's it. The GUI button to power on vSAN appeared and we were able to start it up without a problem. Just like that. Here are the steps:

If you log into the vSAN MOB, you can run through the following to change the status via the VsanClusterPowerSystem:
vSAN MOB -> VsanClusterPowerSystem -> UpdateClusterPowerStatus -> apply, changing "clusterPoweredOn" to "clusterPoweredOff"

 

Hope it helps somebody.

 

heky
TheBobkin
Champion

@heky777, if this was in EMEA then I may have been the engineer-behind-the-engineer determining this solution. I also found an alternative solution outside of vCenter that just requires 2 esxcfg-advcfg settings to be reverted. I have written a KB article covering both the vCenter and ESXi workarounds but am awaiting approval from engineering to make it publicly available (and found that a colleague did the same yesterday!). I will link the KB here once it is available.

heky777
Contributor

Yes @TheBobkin, it was EMEA, and thank you for your effort. This kind of problem is a real showstopper (really glad to have a test environment and not to be under the pressure of a stuck production cluster when dealing with such a dead-end problem), and frankly we were left breathless 🙂 with no way out of it. I was aware of this thread here (and a couple of similar stories on Reddit) and of your suggestion to open an SR, and voilà, it all turned into a solution not only for us but for everyone. Thank you and your colleagues for helping us all; we really appreciate your work!

heky
0 Kudos
TheBobkin
Champion
Champion
Jump to solution

@heky777 Happy to just understand how this issue occurs, to be honest. This had been nagging at the back of my head since @davemuench mentioned it, and it irked me because if the cluster is formed then the data should be okay; but that is basically not the case if updates are still paused.

 

KB documenting this issue and various ways of resolving it is now public and accessible here:

https://kb.vmware.com/s/article/87350

davemuench
Contributor

Thank you very much for your diligence on this; it gives me some confidence to move forward to newer vCenter updates in the future.

senwebtek
Contributor

I've been experiencing the exact same issue running my homelab on VMware Workstation 16.2.3. Here's the situation: I've set up a vSphere/vCenter 7.0 Update 3 (7.0u3c, I think) environment on Workstation 16.2.3 using vSAN, HA, and DRS with a 3-node cluster. Each ESXi host has 4 CPUs and 24 GB of memory. Everything seems to be working fine: vMotion works, running VMs off vSAN works fine, etc.

My problem is that each time I shut down the cluster (using the 'Shutdown Cluster' function) and bring it back up, the vSAN is hosed. It either has no capacity or only the capacity of the disks from one host, and the contents are gone. The vCLS VMs (inaccessible, and I can't remove them) won't start because the datastore they were on (vSAN) is now empty. I've recreated my environment at least 10 times now with the same results.

My host system is a Ryzen 9 5900X with 128 GB of memory, so I'm not running short on resources. I've tried using Quickstart and manually creating the vSAN, with the same results. I have 2 virtual NICs in each ESXi host and have vMotion and vSAN running on their own DSwitch on the 2nd NIC (the 1st NIC is actually connected to the physical NIC in my host PC). I'm thinking of adding another ESXi host to make a 4-node cluster to see if that helps. Any ideas would be greatly appreciated. Also, I've run Skyline Health Diagnostics and it didn't really find anything except NICs losing connectivity once.
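When it is in that empty state, a quick way to see whether the disk groups came back and whether each host rejoined the cluster is something along these lines on every nested host (just a health check, not a fix):

esxcli vsan storage list   # per-disk view; each capacity/cache disk should show In CMMDS: true
esxcli vsan cluster get    # whether this host rejoined the sub-cluster after the restart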

TheBobkin
Champion

@senwebtek, when it is in that state, are you using the restore function (the reverse of shutdown)?

 

Can you check if this returns 1 (enabled) on any host:

# esxcfg-advcfg -g /VSAN/DOMPauseAllCCPs

 

If so, then this is the same issue that I linked the KB for above.
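If it does return 1, reverting it on each host is along these lines (the KB above lists the full pair of settings and the complete procedure):

esxcfg-advcfg -s 0 /VSAN/DOMPauseAllCCPs   # resume the DOM updates that the shutdown workflow paused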

 

senwebtek
Contributor

The cluster 'restart' command only appeared for me once, but I can't remember the specific circumstances. I'm in the process of recreating my cluster (again) and I'll run that command and post back what I get. At this point I was thinking about rolling back my ESXi hosts to 7.0u2(?) as mentioned earlier in this thread, but I would rather be running the current version if I can get it to work properly.
