The Full evacuation option is the problem, as you can't evacuate data from one data node in a 2-node cluster. You'll have to use "Ensure accessibility" instead, which won't attempt to move data to the other node.
Thanks... does this mean we would need to add a (temporary) 3rd host before we can remove this host?
Remember that a 3 node cluster is the absolute minimum to be able to use the storage policy FTT=1 (a 2 node cluster + witness is actually a 3 node cluster). To be able to completely evacuate one of the hosts while keeping the vSAN objects FTT=1 you would have to add another host. The alternative is to change the storage policy to FTT=0 for all VMs during the maintenance. This means you would lose availability for all objects, but this is what happens during "ensure accessibility" anyway. Just remember to change the policy back to FTT=1 when you are done.
Check that there is no resync ongoing:
Cluster > Monitor > vSAN > Resyncing components
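If you prefer to check from the command line, similar resync information is available via esxcli from an ESXi shell (these debug namespaces exist on ESXi 6.7 and later; availability may vary by version):

```shell
# Summary of any components currently resyncing in the cluster
esxcli vsan debug resync summary get

# Per-object resync detail, if you need to drill down further
esxcli vsan debug resync list
```

A clean "no resyncing components" result here corresponds to an empty Resyncing components view in the GUI.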
Check that all vSAN Objects (e.g. vmdks) are compliant with their Storage Policy (SP):
Home > Policies and Profiles > VM Storage Policies > Select the SPs in use and check the VMs and disks all show as 'Compliant'
In a 2+1 configuration such as yours, compliance for VMs with the Default SP means that there is a data-mirror of each Object residing on each node + Witness components for tie-breaker residing on the Witness - this requires 3 Fault Domains for component placement (node+node+witness).
VM Objects only require access to a single data-mirror (and witness component/majority) for the VM to remain accessible. So when you place a data-node in Maintenance Mode with 'Ensure Accessibility' (EA), you are essentially telling the cluster to stop using the data on the host in MM and use the other data-mirror instead. While in this state, VM data is not protected from any failure (e.g. a capacity disk dying) as there is no redundancy - the objects are essentially FTT=0 until the other data-node is available to the cluster again and the data has been resynced from the copy that stayed active on the remaining node. So make sure you take and verify back-ups before doing this.
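For reference, the same maintenance mode options can be driven from the ESXi shell; the vsanmode values below are the CLI equivalents of the GUI choices (a sketch - verify the flags against your ESXi version's esxcli help):

```shell
# Enter maintenance mode with the "Ensure accessibility" behaviour.
# vsanmode options: ensureObjectAccessibility | evacuateAllData | noAction
esxcli system maintenanceMode set --enable true --vsanmode ensureObjectAccessibility

# Confirm the host's maintenance mode state afterwards
esxcli system maintenanceMode get
```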
I wouldn't advise changing all VMs' SP to FTT=0, as this will likely drop half the data-components from both nodes - not just from the node you are doing maintenance on - and all the data remaining on the node will then have to be evacuated off to put the host in MM with EA, which will take a lot longer.
Similarly, if you have any data with an FTT=0 SP applied located on this host by choice, it will have to be evacuated off to put the host in MM with EA.
If a host is taking a long time to enter MM with EA, note what % it is working at: e.g. 2% is the pre-check, ~19% is vMotion of VMs, and after that is data-evacuation. You can get more visibility of what exactly it is doing from the vmkernel.log and clomd.log.
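For example, you can follow both logs from an SSH session on the host while it enters maintenance mode (these are the standard log locations on ESXi):

```shell
# vSAN-related kernel messages as the evacuation progresses
tail -f /var/log/vmkernel.log | grep -i vsan

# CLOM (cluster-level object manager) decisions, including component placement
tail -f /var/log/clomd.log
```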
Do note that as you are doing a full re-install, you will have to reconfigure the vSAN networking and join the host back to the cluster afterwards.
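If you do that part from the CLI rather than the GUI, the rejoin steps look roughly like this (vmk1 and the UUID placeholder are assumptions - use your actual vSAN VMkernel interface and the sub-cluster UUID read from a remaining member):

```shell
# Tag a VMkernel interface for vSAN traffic on the rebuilt host
# (vmk1 here is an example; pick the vmk that carries your vSAN network)
esxcli vsan network ip add -i vmk1

# Join the host back to the existing vSAN cluster by its sub-cluster UUID
# (obtain the UUID from any remaining member with: esxcli vsan cluster get)
esxcli vsan cluster join -u <sub-cluster-uuid>
```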
Thank you for your inputs. Much appreciated. I've managed to get this to work by doing the following:
1. Put the host into maintenance mode, selecting "Ensure Data accessibility..." instead of "Full Evacuate Data".
2. Once in maintenance mode, I then go to the vSAN disk groups and delete the Disk Group for the host I am working on.
3. After waiting for vSAN to do its bit, the cluster is now running in non-compliant mode (my VMs are still running and working, but non-compliant).
4. I then exit maintenance mode on the host, then put it in maintenance mode again this time selecting "Full Evacuate Data"
5. Move the host out of the cluster and remove from inventory.
I was able to rebuild the host from scratch and add it back to the vSAN cluster. During this time, we knew the risk of the vSAN cluster running from one node (+ witness), but that is fine in our case.
I don't know why it couldn't do all of the above automatically when selecting "Full Evacuate Data" the first time though...
Sorry but I don't think you are getting how this works.
"2. Once in maintenance mode, I then go to the VSAN disk groups and delete the Disk Group for the host I am working on."
There is no need to remove the disk-groups when re-installing a host.
"4. I then exit maintenance mode on the host, then put it in maintenance mode again this time selecting "Full Evacuate Data" "
The host was already in MM so this changes nothing and is unnecessary.
There was no data on the node to evacuate as there were no disk-groups...