VMware Cloud Community
janatlud
Contributor

Removing a corrupted ESXi host from a vSAN cluster after a full vSAN maintenance shutdown

Hello,

I would like to ask how to remove a "failed" ESXi host from a vSAN cluster after a maintenance window.

That is, a host that fails exactly while the whole vSAN cluster is powered off. The documentation at "https://docs.vmware.com/en/VMware-vSphere/7.0/com.vmware.vsphere.vsan-monitoring.doc/GUID-89211E74-6..." says nothing about restarting vSAN when a physical host crashes during boot. In production, this could mean an outage of several hours or even a few days.

At the moment I am only testing procedures. If we use the "shutdown vSAN cluster" button and one host then fails, the cluster restart also fails, even if I drag and drop the unreachable host out of the cluster in vCenter and remove the unicast entry for that host. The restart fails at the step where the down host should exit maintenance mode (MM). Manually running "reboot_helper.py recover" also failed.

For a manual shutdown, the KB contains this comment: "If any hosts fail to restart, you must manually recover the hosts or move the bad hosts out of the vSAN cluster."
What exactly does this mean? How can I exclude the host when booting the vSAN environment and avoid an outage of multiple hours or days? This also applies to the scenario where the ESXi OS is corrupted and no ESXi backup file was created. Is there some way to do this, or must all ESXi hosts be powered on after a vSAN shutdown?

Thank you in advance

TheBobkin
Champion

@janatlud "If any hosts fail to restart, you must manually recover the hosts or move the bad hosts out of the vSAN cluster."

It means what it says - if, for instance, a host cannot boot (e.g. failed boot device or other issue), then either fix it so that it does boot OR remove it from the cluster. This can be done via 'remove from inventory' in the vSphere client. Even if that node isn't booting or responsive, this will cause vCenter to send an update to all nodes to remove that node's entry from their unicastagent lists, and thus this node will not be a member of this cluster until it is added back.

"Also in scenario, that the OS of the ESXi is corrupted and ESXi backup file is not created. Is there some way, or after vSAN break just must be all ESXi powered on?"

I would just reinstall it, either on the same media (assuming that is not the cause of the corruption) or on new media if the original is impaired, and leave the vSAN disk partitions in place. Then remove the old reference to this node/host from the vSphere inventory and add it back to the cluster like you would a new node. After the reinstall it will have a new node UUID etc. and will be seen as a new node, with its disk data in place and reusable (if the data has not already been resynced without it).

janatlud
Contributor

hi,

thanks, this view was very helpful. I had tested many times without success, and this description pinpoints the root cause. Removing the host from inventory is not enough. Before or after that, you need to run "esxcfg-advcfg -s 0 /VSAN/IgnoreClusterMemberListUpdates" on all ESXi hosts to re-enable updates from vCenter, which are disabled by the automatic shutdown process and also during the manual steps.

And because these updates are normally re-enabled only near the end of the startup process, the recovery process always fails.
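
For anyone repeating this, here is a minimal sketch of the per-host step, to be run in an SSH session on each surviving ESXi host. The "esxcfg-advcfg -s 0" line is the command from this post, and "-g" is the matching read option of the same tool. The `run` helper, the `DRY_RUN` switch and the function name are hypothetical and exist only so the commands can be previewed before they are executed.

```shell
# DRY_RUN=1 (the default here) only prints the commands; set DRY_RUN=0 to execute.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

enable_member_list_updates() {
    # Show the current value (1 means the host ignores member list updates).
    run esxcfg-advcfg -g /VSAN/IgnoreClusterMemberListUpdates
    # Set it back to 0 so that vCenter can push unicastagent changes again.
    run esxcfg-advcfg -s 0 /VSAN/IgnoreClusterMemberListUpdates
}
```

Calling `enable_member_list_updates` with the default setting prints the two commands; with `DRY_RUN=0` it runs them on the host.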

TheBobkin
Champion

@janatlud Yes, very good point. If you remove a node from the cluster while the nodes are set to ignore unicastagent updates, it will indeed still be listed there and, from their point of view, still be considered a cluster member, so it is worth confirming whether that is set as well.
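
A rough way to confirm this on a surviving host is to check whether the removed host's vSAN IP still shows up in that host's unicastagent list. The sketch below assumes the failed host's vSAN address is known. The function name `still_listed` and the address 192.0.2.13 are made up for illustration, and the check is a plain text match on whatever is piped in, not a parser for any particular output format.

```shell
# Succeeds (exit status 0) if the given IP appears as a whole token in the
# text on stdin. -F matches the dots literally and -w avoids matching
# 192.0.2.13 inside 192.0.2.130.
still_listed() {
    grep -qwF -- "$1"
}

# Example usage on an ESXi host (192.0.2.13 is a placeholder address):
#   if esxcli vsan cluster unicastagent list | still_listed 192.0.2.13; then
#       echo "stale unicastagent entry still present"
#   fi
```

If the entry is still present after the host has been removed from inventory, the IgnoreClusterMemberListUpdates setting mentioned above is the first thing to check.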

Happy to have inadvertently helped you figure it out 😆
