VMware Cloud Community
HockeyFan04
Contributor

vSAN Health Checks

I am a bit new to using vSAN and have a couple questions on the health checks.

First, I am getting a warning on the "vSAN Disk Balance" check. On 2 of my 5 servers (it looks like 3 disks on one and 1 on the other), the status is "Proactive rebalance is needed". If I use "Configure Automatic Rebalance", will this affect any of the current data or cause any issues for the vSAN?

Another health check that is in an error state is "vCenter state is authoritative", with all 5 hosts out of sync. What exactly will happen if I click "Update ESXi Configuration"? Is there any risk of losing data?

Thank you!

6 Replies
TheBobkin
Champion

@HockeyFan04, Welcome to Communities and vSAN.

The vSAN Health alert relating to Disk Balance is basically informational: it fires when there is >30% variance in usage between the most-used and least-used capacity-tier disks in the cluster. vSAN doesn't automatically move data onto newly added/blank disks (for multiple reasons), which is why Proactive Rebalance exists. If all nodes in the cluster are on 6.7 U3 or higher, you can configure this to rebalance proactively and automatically; on lower versions you can start the task with the button in the Health UI. This is intentionally very low-priority I/O, so it should have no impact on performance, and it has zero impact on data state: it only copies data, and removes the original copy once the move is complete.

 

"vCenter state is authoritative" is a little more complex. It triggers if vCenter has not pushed, or is not in sync with, the unicast agent lists on the nodes (basically the list each node keeps of who is in the cluster). Did you add this cluster to a new vCenter, or to one restored from backup, and/or were there any manual CLI changes to the unicast agent entries?
First steps should be to validate that no nodes are set to ignore vCenter membership updates:

# esxcfg-advcfg -g /VSAN/IgnoreClusterMemberListUpdates
If this returns 1, the node is set to ignore vCenter updates; by default it should be 0. To set it back to the default:
# esxcfg-advcfg -s 0 /VSAN/IgnoreClusterMemberListUpdates
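To check all five hosts in one pass, a small loop over SSH can help. This is only a sketch: the host names esx01..esx05 are placeholders for your actual hosts, and the leading 'echo' makes it a dry run.

```shell
# Hypothetical host names; replace with your actual ESXi hosts.
for h in esx01 esx02 esx03 esx04 esx05; do
  printf '== %s ==\n' "$h"
  # Dry run: remove the leading 'echo' to actually query each host over SSH.
  echo ssh root@"$h" esxcfg-advcfg -g /VSAN/IgnoreClusterMemberListUpdates
done
```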
Next, validate that the vSphere cluster members match the vSAN cluster members: compare the nodes shown in the vSphere UI against the membership reported by 'esxcli vsan cluster get'. If they match, and the above checks and remediation (where necessary) have been done, then you can go ahead and click "Update ESXi Configuration". Either way there is no risk to the data, but the above checks need to be performed to avoid any possible cluster partition.
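One way to make that membership comparison mechanical is to put both lists in files and diff them. The host names below are stand-ins for the real lists you would paste from the vSphere UI and from the 'esxcli vsan cluster get' output:

```shell
# Stand-in member lists; in practice, paste the host names from the vSphere UI
# into one file and the members reported by 'esxcli vsan cluster get' into the other.
printf '%s\n' esx01 esx02 esx03 esx04 esx05 | sort > vsphere-members.txt
printf '%s\n' esx01 esx02 esx03 esx04 esx05 | sort > vsan-members.txt

# comm -3 suppresses lines common to both files, so no output means the
# two membership lists match exactly.
comm -3 vsphere-members.txt vsan-members.txt
```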

HockeyFan04
Contributor

For the "vCenter state is authoritative" check: we had to add the hosts to a new vCenter, but we disassociated the hosts from the old vCenter server before adding them to the new one.

If I use the "Update ESXi Configuration" option in the Skyline Health section is there any risk of losing data?

I will run the commands above against the ESXi servers to see.

TheBobkin
Champion

@HockeyFan04 Actually, if you add an ESXi host to a new vCenter it should automatically get disconnected from the old vCenter (or at least it did last I checked, likely around vSphere 6.5).

"If I use the "Update ESXi Configuration" option in the Skyline Health section is there any risk of losing data?"
No. The worst-case scenario is that the cluster becomes partitioned, which shouldn't occur provided you perform the necessary checks, and which is easily remediated by repopulating the unicast lists on the nodes. One other thing to validate is that you created the cluster with the same configuration as the previous vSphere cluster (e.g. if Deduplication or Encryption were enabled there, they should be enabled here too; if it is a Stretched Cluster, the Fault Domains should be configured, etc.).

HockeyFan04
Contributor

So @TheBobkin, I ran the command "esxcfg-advcfg -g /VSAN/IgnoreClusterMemberListUpdates" against all of the ESXi hosts, and only one host out of the 5 was set to 1. If I set this option to 0, will there be any effect on the data?

Also, I apologize for constantly asking about data. We currently have no backup plan in place for these servers (still working on it), so I am trying to keep everything up and running. I am also new to all the vSAN workings.

I also noticed that one of the hosts has a different host name. Do I have to disconnect the host, put it into maintenance mode, and rename it (https://kb.vmware.com/s/article/1010821), or can I just change the hostname?

TheBobkin
Champion

"I ran the command "esxcfg-advcfg -g /VSAN/IgnoreClusterMemberListUpdates" against all of the ESXi hosts, and only one host out of the 5 was set to 1"
@HockeyFan04, so this can explain all nodes reporting that remediation is required due to vCenter not being authoritative: one node is set to ignore vCenter membership updates, and thus vCenter can't validate and/or remediate that node's settings/configuration (even if they don't differ from vCenter's view of the cluster).

"If I set this option to 0, will there be any effect on the data? Also, I apologize for constantly asking about data."
Absolutely no need for apologies whatsoever. My colleague (Hi Kam!) has a statement (maybe more of a mantra 😑) that I completely believe in when dealing with any situation where there is any question about the outcome and/or approach:
'Our first priority here is the data. Our second priority here is the data. Our third priority here is the data. All other priorities fall after these.'

Setting this back to 0 (the default) on that node won't do anything by itself. However, it will allow vCenter (when the remediate button is pushed) to check the settings/configuration on this node and push any necessary changes to it and the other nodes. Typically this is limited to unicastagent updates, e.g. if what the nodes have recorded as the vSAN IP of another node differs from what vCenter currently has. The worst possible outcome of such a change would be that node getting isolated from the cluster (which doesn't permanently impair data and is easily remediated), but this shouldn't happen provided the information on the nodes and in vCenter (e.g. check in the UI) is matching and correct.

This can be easily validated via checking the following:
Check the unicastagent list on each node. It should contain the correct UUID and vSAN-IP entry for every node in the cluster except the node it is run on (e.g. a node in a 6-node cluster with a single vSAN-enabled vmk will have 5 entries):
# esxcli vsan cluster unicastagent list
Check that the IPs in the above output match each host's vSAN-enabled vmk information in the UI.
Host-to-UUID information can be determined via:
# cmmds-tool whoami
(run on each node; this only returns the node itself)
or, from any node:
# cmmds-tool find -t HOSTNAME | grep -iE 'uuid|health|content'
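As a quick sanity check on the entry count, you can count the data rows of the unicastagent output. This sketch assumes the output is a table with two header lines (column names plus a separator) followed by one row per entry; the sample below is fabricated purely to illustrate, and stands in for real 'esxcli vsan cluster unicastagent list' output:

```shell
# Fabricated sample standing in for 'esxcli vsan cluster unicastagent list'
# on one node of a 5-node cluster (so 4 entries are expected).
cat > unicastagent.txt <<'EOF'
NodeUuid    IsWitness  Supports Unicast  IP Address  Port   Iface Name
----------  ---------  ----------------  ----------  -----  ----------
5e0a-0001   0          true              10.0.0.12   12321
5e0a-0002   0          true              10.0.0.13   12321
5e0a-0003   0          true              10.0.0.14   12321
5e0a-0004   0          true              10.0.0.15   12321
EOF

# Skip the two header lines and count the remaining non-empty rows.
# In an N-node cluster, each node should report N-1 entries (here, 4).
awk 'NR > 2 && NF' unicastagent.txt | wc -l
```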

"Do I have to disconnect the host, put it into maintenance mode and rename it (https://kb.vmware.com/s/article/1010821) or can I just change the hostname?"
Yes, a host name change requires the steps in that KB: put the host in Maintenance Mode (with Ensure Accessibility, to keep all VMs accessible), remove it from the cluster (this does NOT remove the vSAN Disk-Groups, it just takes it out of the cluster), disconnect it, remove it from inventory, make the changes, reconnect, move it back into the cluster, and exit MM.
While there are slightly 'hacky' ways of keeping a node in a vSAN cluster while it is removed from vCenter inventory, these are not advisable, and thus I won't be advising on them.
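For reference, the rename step itself (once the host is in Maintenance Mode and out of inventory, per the KB) can be done from the host's CLI with 'esxcli system hostname set'. The names below are examples, and the leading 'echo' keeps this a dry run:

```shell
# Dry run: remove the leading 'echo' to actually rename the host.
# 'esxcli system hostname set' accepts --host, --domain, and --fqdn.
echo esxcli system hostname set --host=esx05-renamed
echo esxcli system hostname set --fqdn=esx05-renamed.example.com
```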

HockeyFan04
Contributor

@TheBobkin All of this worked and our vSAN is back to normal health. I apologize for the time between responses, but thank you greatly for the help!
