I work for a cloud service, and we host many customer private clusters across the country on one 6.5 U2g VCSA. One customer cluster, with 6 hosts and 60 VMs, keeps having VMs come up with the error "this virtual machine failed to become vsphere ha protected...". It's easy to correct, but only manually, by turning HA off and then back on. It has happened 4 times in the last 3 days. It seems to break every time a VM migrates to a new host, and no, it is not always the same source or destination host. At this point I can't tell whether HA is really working, and I don't want a host failure to be how we discover that it isn't, leaving a high-paying customer with 8-10 VMs down until they're manually restarted. The hosts are all physically identical, with the same firmware levels and the same ESXi version, 6.5 U3, with Enterprise+ licensing.
Is there a more permanent fix for this issue? I haven't been able to find anything in VMware's knowledge base other than the fix I'm already doing, which seems to last maybe until the next VM migration, or maybe not at all.
Can you share fdm.log so we can check whether it shows an issue?
Also, can you tell us a little more about the HA configuration? How is admission control configured? Are you using any reservations? Which datastores do you use for heartbeating? Are they selected automatically?
The first thing I see that isn't right is that your ESXi hosts are at a higher version than vCenter. Always ensure that your vCenter Server version is equal to or higher than that of your ESXi hosts.
Can you also provide the fdm.log file (/var/log/fdm.log), along with the time stamp and the name of an affected VM?
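If the log is large, it may help to pull out only the lines for the affected VM before posting. Here is a minimal Python sketch for doing that on a copy of fdm.log. The function names are made up for illustration, and it assumes each log line starts with its time stamp (for example an ISO-style `2019-09-12T14:...` prefix), which you should confirm against your own file.

```python
def filter_fdm_lines(lines, vm_name, time_prefix=""):
    """Return the lines that mention vm_name (case-insensitive).

    time_prefix is optional; when given, only lines whose text starts
    with it are kept. This assumes the time stamp is the first thing
    on each line, which is an assumption about the log layout.
    """
    needle = vm_name.lower()
    return [
        line.rstrip("\n")
        for line in lines
        if needle in line.lower() and line.startswith(time_prefix)
    ]


def filter_fdm_file(path, vm_name, time_prefix=""):
    """Same filter, applied to a log file copied off the host."""
    # errors="replace" keeps going if the log has odd bytes in it.
    with open(path, errors="replace") as fh:
        return filter_fdm_lines(fh, vm_name, time_prefix)
```

For example, `filter_fdm_file("fdm.log", "my-vm-name", "2019-09-12T14")` would return only the lines for that VM in that hour, where the VM name and the time prefix are placeholders for your own values.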
6.5 U2g VCSA
6.5 U3 ESXi
The VCSA version should be the same as or higher than the ESXi version. You have 6.5 U2g for VCSA and 6.5 U3 for ESXi, so your VCSA is at a lower version than your hosts.
Please also get the fdm.log as mentioned previously.
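As a rough illustration of that rule, here is a small Python sketch that compares release labels such as "6.5 U2g" and "6.5 U3". The helper names and the label format it parses are assumptions for this example only. It compares labels, not build numbers, so VMware's interoperability matrix remains the real authority.

```python
import re

# Matches labels like "6.5", "6.5 U3", "6.5U3" or "6.5 U2g" (assumed format).
_VERSION_RE = re.compile(
    r"^\s*(\d+)\.(\d+)(?:\s*U(\d+)([a-z]?))?\s*$", re.IGNORECASE
)


def parse_version(label):
    """Turn a release label into a comparable tuple, e.g. (6, 5, 2, 'g')."""
    m = _VERSION_RE.match(label)
    if m is None:
        raise ValueError(f"unrecognised version label: {label!r}")
    major, minor, update, patch = m.groups()
    return (int(major), int(minor), int(update or 0), (patch or "").lower())


def vcenter_covers_hosts(vcenter_label, host_labels):
    """True if vCenter is at the same or a higher release than every host."""
    vc = parse_version(vcenter_label)
    return all(parse_version(h) <= vc for h in host_labels)


# The combination described in this thread:
print(vcenter_covers_hosts("6.5 U2g", ["6.5 U3"]))  # False
```

Tuples compare element by element, so update 2 sorts below update 3 regardless of the trailing letter, and the letter only matters when the update numbers are equal.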