bulabog
Enthusiast

2 node direct VSAN maintenance

Hi All,

Just wondering if someone can provide some helpful advice. We have a 2-node direct-connect vSAN cluster. We want to rebuild (reinstall) one of the ESXi hosts, so we are trying to put that host into maintenance mode and selecting "Full Evacuation" so we can remove it from the vSAN cluster and rebuild it. I understand that during this time we will lose redundancy, since only one host plus the witness remains, but that is fine. However, we are getting the error:

"Failed to enter maintenance mode in the current VSAN data migration mode due to insufficient nodes or disks in the cluster. Retry operation in another mode or after adding more resources to the cluster"

Are we doing this wrong?

Many thanks

6 Replies
daphnissov
Immortal

The Full evacuation option is the problem, as you can't evacuate data from one data node in a 2-node cluster. You'll have to use "Ensure accessibility" instead, which won't attempt to move data to the other node.
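
If you are scripting this, a rough PowerCLI equivalent (host name is an example) would be:

# EnsureAccessibility keeps the VMs running off the mirror on the other node and does not try to evacuate
Get-VMHost -Name "esx01.lab.local" | Set-VMHost -State Maintenance -VsanDataMigrationMode EnsureAccessibility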

bulabog
Enthusiast

Thanks... does this mean we would need to add a (temporary) 3rd host before we can remove this host?

perthorn
Enthusiast

Remember that a 3-node cluster is the absolute minimum to be able to use the storage policy FTT=1 (a 2-node cluster + witness is effectively a 3-node cluster). To be able to completely evacuate one of the hosts while keeping the vSAN objects at FTT=1, you would have to add another host. The alternative is to change the storage policy to FTT=0 for all VMs during the maintenance. This means you would lose availability protection for all objects, but that is what happens during "Ensure accessibility" anyway. Just remember to change the policy back to FTT=1 when you are done.
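
If you go the policy route, a rough PowerCLI sketch would be (the FTT=0 policy name is just an example and would need to be created first):

# Assumes an existing Connect-VIServer session and a pre-created FTT=0 vSAN policy
$ftt0 = Get-SpbmStoragePolicy -Name "FTT0-Maintenance"
$ftt1 = Get-SpbmStoragePolicy -Name "vSAN Default Storage Policy"
# Point the VM homes and their disks at FTT=0 for the maintenance window
Get-VM | Get-SpbmEntityConfiguration | Set-SpbmEntityConfiguration -StoragePolicy $ftt0
Get-VM | Get-HardDisk | Get-SpbmEntityConfiguration | Set-SpbmEntityConfiguration -StoragePolicy $ftt0
# ...do the maintenance, then switch everything back to the FTT=1 policy
Get-VM | Get-SpbmEntityConfiguration | Set-SpbmEntityConfiguration -StoragePolicy $ftt1
Get-VM | Get-HardDisk | Get-SpbmEntityConfiguration | Set-SpbmEntityConfiguration -StoragePolicy $ftt1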

Cheers

Per

TheBobkin
Champion

Hello bulabog,

Check that there is no resync ongoing:

Cluster > Monitor > vSAN > Resyncing components
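
If you have the vSAN PowerCLI cmdlets available, a rough equivalent check (cluster name is an example) is:

# An empty result means no resync is currently in flight
Get-VsanResyncingComponent -Cluster (Get-Cluster -Name "vSAN-Cluster")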

Check that all vSAN Objects (e.g. vmdks) are compliant with their Storage Policy(SP):

Home > Policies and Profiles > VM Storage Policies > Select the SPs in use and check the VMs and disks all show as 'Compliant'
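
A rough PowerCLI equivalent of that compliance check (assuming the SPBM cmdlets are loaded) is:

# Show each VM's assigned policy and whether it is currently compliant
Get-VM | Get-SpbmEntityConfiguration | Select-Object Entity, StoragePolicy, ComplianceStatus
# Check the individual virtual disks too
Get-VM | Get-HardDisk | Get-SpbmEntityConfiguration | Select-Object Entity, StoragePolicy, ComplianceStatus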

In a 2+1 configuration such as yours, compliance for VMs with the Default SP means that there is a data-mirror of each Object residing on each node + Witness components for tie-breaker residing on the Witness - this requires 3 Fault Domains for component placement (node+node+witness).

VM Objects only require access to a single data-mirror (plus the witness component/majority) for the VM to remain accessible, so when you place a data-node in Maintenance Mode with 'Ensure Accessibility' (EA) you are essentially telling the cluster not to use the data on the host in MM and to use the other data-mirror instead. While in this state, VM data is not protected from any further failure (e.g. a capacity disk dying), as there is no redundancy; the Objects are essentially FTT=0 until the other data-node is available to the cluster again and the data has been resynced from the copy that remained active on the surviving node - so make sure you take and verify back-ups before doing this.

I wouldn't advise changing all VMs' SP to FTT=0, as this will likely drop the now-extraneous data-components roughly half from each node - not just from the node you are doing maintenance on - and any data whose only copy then remains on that node will have to be evacuated off it to put the host in MM with EA, which will take a lot longer.

Similarly, if you have any data with an FTT=0 SP applied that is located on this host by choice, it will have to be evacuated off to put the host in MM with EA.

If a host is taking a long time to enter MM with EA, note what % it is at, e.g. 2% is pre-check, ~19% is vMotion of VMs, and after that is data evacuation - you can get more visibility into what exactly it is doing from vmkernel.log and clomd.log.

Do note that, as you are doing a full re-install, you will have to reconfigure the vSAN networking and join the host back to the cluster afterwards:

VMware Knowledge Base
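
A rough PowerCLI sketch of re-joining the rebuilt host (all names, credentials, IPs and device names below are placeholders) would be:

# Add the freshly installed host back into the cluster
Add-VMHost -Name "esx01.lab.local" -Location (Get-Cluster "vSAN-Cluster") -User root -Password 'VMware1!' -Force
$esx = Get-VMHost -Name "esx01.lab.local"
# Re-create the vSAN-tagged VMkernel adapter on the direct-connect link
New-VMHostNetworkAdapter -VMHost $esx -PortGroup "vSAN-Direct" -VirtualSwitch "vSwitch1" -IP "192.168.100.11" -SubnetMask "255.255.255.0" -VsanTrafficEnabled $true
# Re-create the disk group from the local cache and capacity devices
New-VsanDiskGroup -VMHost $esx -SsdCanonicalName "naa.cache_device_id" -DataDiskCanonicalName "naa.capacity_device_id"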

Bob

bulabog
Enthusiast

Hi All,

Thank you for your inputs. Much appreciated. I've managed to get this to work by doing the following (a rough PowerCLI equivalent is sketched after the steps):

1. Put the host into maintenance mode, selecting "Ensure data accessibility..." instead of "Full Evacuate Data".

2. Once in maintenance mode, I then go to the VSAN disk groups and delete the Disk Group for the host I am working on.

3. After waiting for vSAN to do its bit, the cluster is now running with the VMs non-compliant with their storage policy (they are still running and working, just non-compliant).

4. I then exit maintenance mode on the host, then put it in maintenance mode again this time selecting "Full Evacuate Data"

5. Move the host out of the cluster and remove from inventory.
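
For reference, a rough PowerCLI equivalent of that sequence (host, cluster and datacenter names are placeholders) is:

$esx = Get-VMHost -Name "esx01.lab.local"
# Step 1: maintenance mode with Ensure Accessibility
Set-VMHost -VMHost $esx -State Maintenance -VsanDataMigrationMode EnsureAccessibility
# Step 2: delete the host's disk group (nothing is migrated off it)
Get-VsanDiskGroup -VMHost $esx | Remove-VsanDiskGroup -DataMigrationMode NoDataMigration
# Steps 4-5: with no disk group left, move the host out of the cluster and drop it from inventory
Move-VMHost -VMHost $esx -Destination (Get-Datacenter -Name "DC01")
Remove-VMHost -VMHost $esx -Confirm:$false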

I was able to rebuild the host from scratch and add it back to the vSAN cluster. During this time we knew the risk of the vSAN cluster running from one node (+ witness), but that was fine in our case.

I don't know why it couldn't do all of the above automatically when selecting "Full Evacuate Data" the first time, though...

Thank you.

TheBobkin
Champion

Hello bulabog

Sorry but I don't think you are getting how this works.

"2. Once in maintenance mode, I then go to the VSAN disk groups and delete the Disk Group for the host I am working on."

There is no need to remove the disk-groups when re-installing a host.

"4. I then exit maintenance mode on the host, then put it in maintenance mode again this time selecting "Full Evacuate Data" "

The host was already in MM so this changes nothing and is unnecessary.

There was no data on the node to evacuate as there were no disk-groups...

Bob
