Hi, we have been dealing with some issues in our vSAN Stretched cluster when some VMs disappeared from inventory and from vSAN Datastore. When we took a closer look, the used space in the datastore hadn't changed, which is weird.
After some troubleshooting we managed to see that the cluster's Preferred Fault Domain was set to Null. We resolved the issue by disconnecting the vSAN VMkernel of the cluster Master and rebooting all the hosts one by one.
My question is: Why would this happen? What are possible causes for that to happen?
"some issues in our vSAN Stretched cluster when some VMs disappeared from inventory and from vSAN Datastore."
So my first question would be: what exactly do you mean by "disappeared"? Do you mean unavailable/inaccessible, or permanently gone? The only legitimate times I have seen such things (i.e. excluding something or someone deleting stuff) are when people have data Objects stored as FTT=0 (almost exclusively unknowingly) and lose a Disk/Disk-Group. Aside from this, it should be clear what happened to the data.
I would strongly advise opening a Support Request with vSAN GSS if this is not well understood already.
"Why would this happen? What are possible causes for that to happen?"
I would start by checking whether you have leftover stale CMMDS entries from replaced Witness(es). From the Master/Backup node (the output *should* be the same on both) this can be checked with:
# cmmds-tool find -t PREFERRED_FAULT_DOMAIN
# cmmds-tool find -t HOSTNAME
Other than that, potentially there was some other issue with CMMDS. I can only really think of one (which is actually a long knock-on effect from issues causing /scratch to be unavailable), as issues in this area are exceedingly rare (which they SHOULD be, as this service plays a critical role).
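As a rough sketch of the CMMDS check above, the two cmmds-tool queries could be run against several hosts in one go so the results can be compared side by side. Note the assumptions: the host names are placeholders, SSH with root login is assumed to be enabled on the hosts, and the HOSTS/DRY_RUN variables and the run_checks function are made up for this example. By default it only prints what it would run.

```shell
#!/bin/sh
# Sketch only: run the two cmmds-tool queries from the post on several hosts
# so the output can be compared (Master and Backup *should* agree).
# Assumptions: SSH is enabled on the hosts, root login is allowed, and the
# host names below are placeholders, not real systems.

HOSTS="${HOSTS:-esx01 esx02 esx03}"   # hypothetical host names
DRY_RUN="${DRY_RUN:-1}"               # 1 = only print the commands (default)

run_checks() {
    for host in $HOSTS; do
        for type in PREFERRED_FAULT_DOMAIN HOSTNAME; do
            cmd="cmmds-tool find -t $type"
            if [ "$DRY_RUN" = "1" ]; then
                # Show what would be executed, without touching the host.
                echo "ssh root@$host $cmd"
            else
                echo "== $host: $cmd =="
                ssh "root@$host" "$cmd"
            fi
        done
    done
}

run_checks
```

Setting DRY_RUN=0 and HOSTS to the real host names would execute the queries for real. The script makes no attempt to interpret the output; it only collects it so that differing or empty PREFERRED_FAULT_DOMAIN entries are easier to spot by eye.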
Thanks Bob. What I mean by "disappeared" is that the files were no longer mapped in the vSAN file system until the master was rebooted.
I agree with you, it is a really strange scenario. VMware is still looking for a root cause.
Hi, Duncan. Really enjoyed vSAN 6.7 deep dive.
This is a cluster deployed with VCF 3.8.1, now upgraded to 3.9.1.
vSAN version: 6.7
ESXi builds: 15160138
vCenter + external PSC build: 15976728
Are you running any scripts against vSAN APIs? I have seen issues before where people were running scripts against our APIs that populated fields which did not need to be populated. Other than that, this doesn't ring a bell unfortunately.
Did you file an SR?
Duncan, we opened an SR and it was solved by disabling vSAN traffic on the master node's VMK and rebooting the hosts one by one. I'm not aware of any scripts running, but I will definitely take a deeper look into that.