Please check the hostd, vpxa and vmkernel log files to understand why the host is going into a not-responding state.
Otherwise, please upload the files with timestamps for review.
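For a first pass, those logs live under /var/log on the ESXi host itself. A rough sketch of how to skim them over SSH (the grep pattern is just a starting point, not an exhaustive list of relevant messages):

```shell
# On the ESXi host, via SSH. Typical log locations:
#   /var/log/hostd.log     - host management agent
#   /var/log/vpxa.log      - vCenter agent running on the host
#   /var/log/vmkernel.log  - kernel / storage / network events
# First pass: pull recent suspicious lines from each log.
for f in /var/log/hostd.log /var/log/vpxa.log /var/log/vmkernel.log; do
  echo "== $f =="
  grep -iE 'error|warning|disconnect|not responding' "$f" | tail -n 20
done
```

Then correlate the timestamps of those lines with the moment vCenter shows the host as not responding.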
I still haven't found any clues, but it reminds me of problems I have read about here, though I can't find the posts again.
Everything works fine until I reboot the server. Of course it gets disconnected, and it doesn't reconnect automatically (should it?) when it is back online. The reboot takes a long time, as the server has around 700 GB of RAM.
If I connect it manually, it will disconnect soon after.
But if I remove it from the inventory and add it again, it connects and stays connected (I left it like that for a couple of days).
Of course, on the next reboot, the same mess... and it's only this one machine.
I might end up reinstalling it, but I would first like to understand what's happening.
I would advise checking vpxa.log to see whether the host is losing its heartbeat to vCenter.
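A quick way to check that from the host shell. The exact wording of the heartbeat messages varies between ESXi releases, so the pattern below is an assumption to widen or narrow as needed:

```shell
# On the ESXi host: look for heartbeat / connection trouble in the vCenter
# agent log. Message text differs by release; adjust the pattern to match.
grep -iE 'heartbeat|connection.*(lost|reset|refused)' /var/log/vpxa.log | tail -n 40
```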
Secondly, removing the host from the vCenter inventory and adding it back creates a fresh entry in the vCenter database, where the host is registered and given a unique ID. The issue might be a stale or duplicate ID pointing to that ESXi host, which would require validation or cleanup in the vCenter DB.
PS: a log bundle from both the ESXi host and vCenter will be needed to understand exactly what happens when the host is pushed out of, or disconnects from, vCenter.
I would still request the log bundle even if you end up reinstalling the host.
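For reference, the ESXi-side bundle can be generated from the host shell with `vm-support`; the vCenter bundle is normally exported from the vSphere Client. A minimal sketch (the bundle file name shown is the typical pattern, not guaranteed):

```shell
# On the ESXi host (SSH enabled): generate a full support bundle.
# With no options it writes a compressed tarball, typically named
# esx-<hostname>-<date>.tgz, and prints the path when finished.
vm-support
# Back on a workstation, the bundle is a plain tarball; peek inside:
tar tzf esx-*.tgz | head
```

See `vm-support -h` on the host for options such as choosing the output directory.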
For now, I give up.
I don't see anything helpful in the logs, mostly because I don't really know what the error messages mean (some seem harmless even though they are marked as errors).
I have extracted the logs from the export bundle and anonymized them, but the content is genuine.
Just for science's sake, I decided to restart the vCenter Server appliance, and guess what: everything was green when I was able to log in again. But after rebooting one ESXi host, back to reality!
var.zip 1.2 MB