davemuench
Contributor

vCenter 7.0u3 shutdown vSAN cluster results in broken cluster

Hi,

I used the new vSAN cluster shutdown wizard yesterday for the first time when I had to shut a lab down for a power outage. The cluster consists of three Dell R730xd nodes, with vCenter residing on a different, non-vSAN node. The shutdown was clean, and the hosts were shut down properly before the power was lost. On bootup, all the vSAN VMs are listed as Inaccessible and aren't visible when browsing the datastore (via GUI or command line).

The button to restart the cluster was not present, so I followed the instructions to manually restart the cluster via the command line. The recover script, however, times out:

[root@esx01:/tmp] python /usr/lib/vmware/vsan/bin/reboot_helper.py recover
Begin to recover the cluster...
Time among connected hosts are synchronized.
Scheduled vSAN cluster restore task.
Waiting for the scheduled task...(18s left)
Checking network status...
Recovery is not ready, retry after 10s...
Recovery is not ready, retry after 10s...
Recovery is not ready, retry after 10s...
Timeout, please try again later
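
For anyone comparing notes: my understanding (an assumption on my part, based on the manual shutdown/restart procedure) is that the shutdown wizard pauses vSAN object activity via the /VSAN/DOMPauseAllCCPs advanced option, so it may be worth checking whether that flag was left set on each host after a failed recovery. Something like:

[root@esx01:/tmp] esxcfg-advcfg -g /VSAN/DOMPauseAllCCPs    # 1 means object activity is still paused
[root@esx01:/tmp] esxcfg-advcfg -s 0 /VSAN/DOMPauseAllCCPs  # clear the pause flag (run on every host)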


I have been digging since then with no success. The cluster looks to have reformed properly:

[root@esx01:/tmp] esxcli vsan cluster get
Cluster Information
   Enabled: true
   Current Local Time: 2021-10-26T18:22:51Z
   Local Node UUID: 5ec45e01-2a6f-de52-2b8d-a0369f59e56c
   Local Node Type: NORMAL
   Local Node State: MASTER
   Local Node Health State: HEALTHY
   Sub-Cluster Master UUID: 5ec45e01-2a6f-de52-2b8d-a0369f59e56c
   Sub-Cluster Backup UUID: 5ed1c2d6-fd4f-e6ea-1e6c-001b21d41ea0
   Sub-Cluster UUID: 528157e4-4935-2809-ab88-5d161aec89a5
   Sub-Cluster Membership Entry Revision: 4
   Sub-Cluster Member Count: 3
   Sub-Cluster Member UUIDs: 5ec45e01-2a6f-de52-2b8d-a0369f59e56c, 5ed1c2d6-fd4f-e6ea-1e6c-001b21d41ea0, 614b991e-f0f6-8762-c918-801844e56f42
   Sub-Cluster Member HostNames: esx01, esx02, esx03
   Sub-Cluster Membership UUID: fd197861-36c6-b896-868a-a0369f59e56c
   Unicast Mode Enabled: true
   Maintenance Mode State: OFF
   Config Generation: a01749d0-4f70-4911-aeb0-919cfdc176bb 31 2021-10-26T18:12:08.28
   Mode: REGULAR


esx01 is the master, esx02 shows as the backup, and esx03 as the agent. The unicast list looks correct, and the vSAN vmknics were re-tagged properly.
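
For anyone wanting to reproduce those checks, they amount to something like this on each host:

[root@esx01:/tmp] esxcli vsan cluster unicastagent list   # should list the other two cluster members
[root@esx01:/tmp] esxcli vsan network list                # shows which vmknic is tagged for vSAN traffic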

In RVC, this is typical of what I see for each VM:

/localhost/Lab/vms/wan-test> vsan.vm_object_info .
VM wan-test:
  Disk backing:
    [vsanDatastore] 65d06061-65f9-6456-0383-a0369f59e56c/wan-test.vmdk

  [vsanDatastore] 65d06061-65f9-6456-0383-a0369f59e56c/wan-test.vmx
    DOM Object: 65d06061-65f9-6456-0383-a0369f59e56c (v15, owner: esx01, proxy owner: None, policy: No POLICY entry found in CMMDS)
      RAID_1
        Component: 65d06061-6ce4-8057-b167-a0369f59e56c (state: ACTIVE (5), host: esx01, capacity: naa.5002538e30820eb4, cache: naa.55cd2e404b795ca0,
                                                         votes: 1, usage: 0.2 GB, proxy component: false)
        Component: 65d06061-7415-8457-02b2-a0369f59e56c (state: ABSENT (6), csn: STALE (109!=160), host: esx03, capacity: naa.5002538ec110af17, cache: naa.55cd2e404b796059,
                                                         dataToSync: 0.21 GB, votes: 1, usage: 0.2 GB, proxy component: false)
      Witness: 65d06061-92d5-8757-6ca8-a0369f59e56c (state: ACTIVE (5), host: esx02, capacity: naa.5002538e4066db91, cache: naa.55cd2e404b78c07a,
                                                     votes: 1, usage: 0.0 GB, proxy component: false)

  [vsanDatastore] 65d06061-65f9-6456-0383-a0369f59e56c/wan-test.vmdk
    DOM Object: 97d06061-ede2-06ca-1bb3-001b21d41ea0 (v15, owner: esx01, proxy owner: None, policy: No POLICY entry found in CMMDS)
      RAID_1
        Component: 97d06061-d7a3-c9ca-c0c3-001b21d41ea0 (state: ABSENT (6), csn: STALE (67!=202), host: esx03, capacity: naa.5002538ec110b853, cache: naa.55cd2e404b796059,
                                                         dataToSync: 1.56 GB, votes: 1, usage: 1.6 GB, proxy component: false)
        Component: 97d06061-832b-cbca-1280-001b21d41ea0 (state: ACTIVE (5), host: esx02, capacity: naa.5002538e30820eb0, cache: naa.55cd2e404b78c07a,
                                                         votes: 1, usage: 1.6 GB, proxy component: false)
      Witness: 97d06061-7f37-ccca-803a-001b21d41ea0 (state: ACTIVE (5), host: esx01, capacity: naa.5002538e4102a7d6, cache: naa.55cd2e404b795ca0,
                                                     votes: 1, usage: 0.0 GB, proxy component: false)


The two things that stand out to me are the "No POLICY entry found in CMMDS" message, and that in every VM's case the piece residing on esx03 is the absent one - whether it's a data component or a witness.
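
To get the same picture cluster-wide rather than per VM, I believe the usual RVC health commands apply (a sketch - the cluster path below is a placeholder for the actual cluster object):

/localhost/Lab/computers/<cluster>> vsan.check_state .        # lists inaccessible or orphaned objects
/localhost/Lab/computers/<cluster>> vsan.obj_status_report .  # per-state summary of all vSAN objects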

I have tried various ways of manipulating the storage policy and reducing the FTT to 0, but none of them take, as the new policy can't be applied due to the invalid state of the VM.
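
For context, the kind of command I mean is something like this (an illustration of the syntax rather than exactly what I ran, and as far as I understand it only changes the default policy for newly created objects):

[root@esx01:/tmp] esxcli vsan policy setdefault -c vdisk -p "((\"hostFailuresToTolerate\" i0) (\"forceProvisioning\" i1))"
[root@esx01:/tmp] esxcli vsan policy setdefault -c vmnamespace -p "((\"hostFailuresToTolerate\" i0) (\"forceProvisioning\" i1))"

Either way, the existing objects refuse any policy change while they're in this state.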

Any help would be greatly appreciated. I'd love to open a support ticket, but we don't have support on this small lab environment, and I'd rather not rebuild from backups if I can avoid it. I'm also trying to learn why this happened in the first place and whether the cluster shutdown functionality can be relied upon.

21 Replies
senwebtek
Contributor

I replied in the wrong place. Hopefully this moves the reply to the right place.

senwebtek
Contributor

Trying to reply to @TheBobkin at 03-15-2022 02:29 AM. OK, I've rebuilt my cluster as before. Before shutting down the cluster, the value returned by the command is 0 (which I guess is expected at this point). Next I executed the cluster shutdown command. At this point, the 'Restart Cluster' option is present. I powered up all 3 hosts and 'Restart Cluster' is still present. Next, I ran the command you gave me, and the value is '1' at this point. After that, I ran 'Restart Cluster' now that the ESXi hosts were back up. After the 'Restart Cluster' command finished, the vSAN storage was only showing capacity equal to one of the ESXi hosts, and it was empty. Running the command you gave returned '1' again. So I'm assuming that my solution would be for scenario 3 in the knowledge base article you referenced. I'm very new to this, so I'll try to get through step 2 of the solution for scenario 2. Thanks so much for your help.
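
In case it helps anyone checking the same thing, I think the way to confirm whether all three hosts actually rejoined and are contributing capacity after 'Restart Cluster' is roughly:

esxcli vsan cluster get                      # Sub-Cluster Member Count should be 3
esxcli vsan storage list | grep "In CMMDS"   # each claimed disk should show "In CMMDS: true"

(Please correct me if those aren't the right checks.)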
