It's not how much free space you have but where you have it (and what version of vSAN, as we have made a lot of changes to resync over the years). For example, on a 3-node cluster using FTM=RAID1, even with only 10% vsanDatastore usage you wouldn't be able to do MM with FDM, as you would have no available Fault Domain to place the components on while still being compliant with the applied Storage Policy.
Another example would be a 2+2+1 Stretched cluster: even with just standard FTM=RAID1 across sites you would need a lot of free space to do MM with FDM on one node (less than 50% usage, and less again in 6.7, where we have guardrails). Other factors that could complicate this further are Objects with only single-site protection in Stretched clusters, relatively large vmdk Objects that require more than a single DG/FD for single data-set placement, and relatively small capacity-tier devices.
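The fault-domain constraint described above can be sketched as a quick feasibility check. This is purely illustrative (the function name and the simplification down to node-count-only are mine, and it ignores the free-space side entirely): RAID-1 with FTT=n needs 2n+1 fault domains, so a 3-node cluster with one host entering MM is left with only 2.

```python
def can_evacuate_with_fdm(total_nodes, ftt=1):
    # RAID-1 with FTT=n needs 2n+1 fault domains:
    # n+1 replicas plus n witnesses, each on a separate node.
    required_fds = 2 * ftt + 1
    # The host entering Maintenance Mode is unavailable for placement.
    available_fds = total_nodes - 1
    return available_fds >= required_fds

print(can_evacuate_with_fdm(3))  # False: 2 FDs left, 3 needed (regardless of free space)
print(can_evacuate_with_fdm(4))  # True: 3 FDs left, space permitting
```

Even when this node-count check passes (as it does on a 6-node cluster), the per-node free space still has to accommodate the evacuated components, which is the second half of the problem.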
You state that it *does* start evacuating but then fails - what does the error message on failure state? What does the evacuation precheck say about space usage before and after (or what does the Health UI check for usage after one additional host failure show)? What build version is in use? Share RVC output of vsan.disks_stats <pathToCluster> if you can.
Congestion can be expected during large resyncs - think about what is occurring here: you already have the normal workload of the cluster, you are adding a potentially massive amount of read and write workload on top of it, and you are taking one node out of use for component placement at the same time.
Feedback inline below - if possible let me know what you think the cause is and what steps to try to rectify it:
What does the error message on failure state? - "Failed to enter maintenance mode in the current VSAN data migration mode due to insufficient nodes or disks in the cluster" (apologies, I should have included this last bit). It's a 6-node cluster and not a stretched cluster.
What does the evacuation precheck state with regard to space usage before and after (or the Health UI check for usage after 1 additional host failure inform)? - From the GUI Health Check, disk space utilization is 80% assuming one host failure (note the data resync is still not fully complete, so I think this percentage will drop further).
What build version is in use? - ESXi 6.0 U2 (upgrade to 6.5 U2 in progress, hence why MM and FDM are needed).
Share RVC output of vsan.disks_stats <pathToCluster> - Will do in a follow-up post if still needed.
Congestion can be expected during large resyncs - yes thought so, thanks for confirming
Additional info: there is a VM taking up around 33% of the total vSAN storage, if this helps/matters.
"It's not how much free space you have but where you have it" - This is interesting... would a disk rebalance help/increase the chances of MM with FDM succeeding, or am I not understanding what you mean by this?
"Disk space utilization is 80% assuming one host failure (Note data re-sync is still not fully completed so think this percentage will drop further)"
Those numbers don't quite add up if you had ~55% used as you initially said. You shouldn't be trying to place nodes into Maintenance Mode while there is an ongoing resync - what is the cause and goal of the resync? (e.g. have Storage Policies been changed, nodes or storage been added/removed, hosts been rebooted, or a rebalance job run?)
"there is a VM taking around 33% of the full vSAN storage if this helps/matters"
If this is all stored as one large vmdk (e.g. a single 10TB vmdk, not ten 1TB vmdks) then it is very likely implicated, as its data components would likely be distributed across more than just 2 nodes - and if it was initially placed when the cluster was relatively full, this is more likely again.
Is there a reason you are doing MM with FDM instead of EA for the upgrade, or are you just more interested in why it is failing to enter with FDM?
Are you 100% positive there isn't a node already in or entering MM? (Note that there are ways a node can be in vSAN MM but not ESXi MM.)
This can easily be confirmed using:
# cmmds-tool find -t NODE_DECOM_STATE -f json
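To spot a node stuck in vSAN MM in that output, you are looking for a decom state of 4 (decommissioning) or 6 (decommissioned). A rough filter can be sketched as below - note the sample JSON shape and field names here are my assumption of what the command emits, and may differ between builds:

```python
import json

# Hypothetical sample shaped like `cmmds-tool find -t NODE_DECOM_STATE -f json`
# output; real field names and nesting may vary by build.
sample = '''{"entries": [
  {"uuid": "host-uuid-1", "content": {"decomState": 0}},
  {"uuid": "host-uuid-2", "content": {"decomState": 6}}
]}'''

def nodes_in_vsan_mm(raw):
    # decomState 4 = decommissioning in progress, 6 = decommissioned;
    # either means the node is in vSAN MM even if ESXi shows it as active.
    data = json.loads(raw)
    return [e["uuid"] for e in data["entries"]
            if e.get("content", {}).get("decomState") in (4, 6)]

print(nodes_in_vsan_mm(sample))  # ['host-uuid-2']
```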
Let me try to explain better:
No data resync was in progress and the vSAN datastore had around 55% free capacity.
I attempted to place a host in MM with FDM.
The attempt progressed for a couple of minutes, after which I got the error about lack of available nodes/disks (this still triggered the data resync process).
My last post was written before that same data resync completed, hence the reading was still at 80% used / 20% free in the case of an additional host failure. Used space went down to 67% (assuming one host failure) once the resync completed.
Regarding the big VM, it has one big 5TB disk split into 21 components across 3 servers.
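That component count lines up with how vSAN chunks large objects. A quick back-of-the-envelope check, assuming the usual 255 GB maximum component size (the constant and the simplified arithmetic are mine, and witness components are ignored):

```python
import math

MAX_COMPONENT_GB = 255            # vSAN splits objects into components of at most ~255 GB
vmdk_gb = 5 * 1024                # the 5 TB vmdk

components_per_replica = math.ceil(vmdk_gb / MAX_COMPONENT_GB)
print(components_per_replica)     # 21 - matches the observed component count

raw_gb = vmdk_gb * 2              # FTT=1 RAID-1: two full mirrors (witnesses are tiny)
print(raw_gb)                     # 10240 GB of raw capacity if fully written/thick
```

So a single replica of this vmdk already has to be spread across multiple disk groups, which is why it can constrain where an evacuating host's components can land.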
I'm pretty sure all hosts are running vSAN (none in vSAN MM), as I think the vSAN datastore would shrink in size if that were the case (stand to be corrected here).
# cmmds-tool find -t NODE_DECOM_STATE -f json
The above does not show a state of 4 or 6 for any of the hosts in the cluster.
So to summarize: placing a host in MM with FDM triggers a data resync, but the host fails to enter MM due to lack of space. The data resync continues even though the host fails to enter MM.
The only warnings I have in vSAN Health relate to disk format version (as some hosts are already upgraded) and vSAN disk balance - not sure if running the latter would help with the issue I'm having?
I am interested to know why this is not working and, equally importantly, what steps I can take, as MM with EA is not a risk I would like to take.
Any suggestions/action plan based on my feedback that addresses my last sentence?
I'm not sure I can make any better recommendation than just using backups + MM with the EA option, like probably 90% of vSAN customers use for upgrades, but more information might help here:
Precisely what build of ESXi 6.0 U2? (e.g. 3620759 or 6921384)
Are all disks on the same virsto on-disk format?
Are all capacity-tier drives the same size?
Are all Disk-Groups the same size?
Do all hosts have the same number of Disk-Groups?
Is this cluster All-Flash or Hybrid? If All-Flash is Deduplication&Compression enabled?
Do you have any Fault-Domains (e.g. pseudo rack-awareness) configured? (e.g. 2+2+2, or something eccentric like 1+1+2+2)
What Storage Policy/Policies are applied to the data?
Have you tried just the same host each time or tried different hosts?
Have you tried a host that has none of the components of the relatively large vmdk?
Is there anything else potentially wrong with the cluster e.g. disks not in CMMDS, becoming network-partitioned (even briefly), any data unhealthy, abnormal congestion (e.g. ssd/log/mem congestion at 200+)?
Please confirm the gross size and current % used of the vsanDatastore, and the used size of that large vmdk. You mentioned it was consuming 33% of the vsanDatastore and was 5TB - is that 5TB allocated before FTT, 5TB used for each mirror, or 5TB on disk (which varies depending on whether it is thin-provisioned)?
If there is no strict application requirement for that vmdk to be 1x5TB as opposed to 2x2.5TB vmdks (and it is taking up 33% of the cluster's storage), then I would advise splitting it.
Thank you for your help. I am raising this with VMware, as I think it requires more in-depth investigation. Really appreciate your efforts to help.