This is a new HP ProLiant DL380p Gen8 server with ESXi 5.5 installed. About once a month I've had to restart the management services because I keep losing the connection to the ESXi host through vSphere. All VMs keep working correctly; it's a textbook case of VMware's KB1003490.
I have noticed in other similar cases that the usual culprit is the hostd service hanging. Checking the hostd.log file, I found this, usually around the day that the ESXi host gets disconnected:
2015-12-15T22:35:02.197Z [3A981B70 info 'Vimsvc.ha-eventmgr'] Event 289 : Issue detected on ESX01 in ha-datacenter: hostd detected to be non-responsive
--> (2015-12-15T22:35:02.196Z cpu39:7373211)
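To see how often and when this event shows up, the log can be scanned for the message above. The following is only a minimal sketch in Python, assuming every event line starts with a timestamp in the same format as the excerpt; the function name `find_hang_events` is made up for illustration, and the sample lines are the two log lines quoted above.

```python
import re
from datetime import datetime, timezone

# Timestamp at the start of a hostd.log line, e.g. 2015-12-15T22:35:02.197Z
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3})Z")
# Message text taken from the log excerpt above
MARKER = "hostd detected to be non-responsive"


def find_hang_events(lines):
    """Return the UTC datetimes of lines reporting a non-responsive hostd."""
    events = []
    for line in lines:
        if MARKER not in line:
            continue  # continuation lines ("-->") and unrelated entries are skipped
        match = TIMESTAMP.match(line)
        if match:
            ts = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S.%f")
            events.append(ts.replace(tzinfo=timezone.utc))
    return events


sample = [
    "2015-12-15T22:35:02.197Z [3A981B70 info 'Vimsvc.ha-eventmgr'] Event 289 : "
    "Issue detected on ESX01 in ha-datacenter: hostd detected to be non-responsive",
    "--> (2015-12-15T22:35:02.196Z cpu39:7373211)",
]

for event in find_hang_events(sample):
    print(event.isoformat())
```

In practice the lines would come from a copy of hostd.log opened with `open(...)` instead of the `sample` list. Listing the timestamps side by side makes it easier to spot a pattern (same time of day, same interval) than reading the raw log.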
I would like help finding out what might be freezing the hostd service, because constantly restarting the management services doesn't seem like a long-term solution. Any suggestions on how to pinpoint the root cause?
Thanks for your help.
There are several things that can cause this, for example HP's Insight Management Agents and/or outdated drivers.
Take a look at this KB to narrow it down.
We have seen exactly the same issue on some of our hosts. The hosts have nothing in common (different hardware vendors, different vCenters, different locations and different shared storage), but the problem sometimes starts and ends at exactly the same time on different hosts.
Have you found the root cause in the meantime? At the moment we have no idea what triggers the problem. We also have no hardware agents installed on our hosts.
Thank you for your help!
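To check whether the events on different hosts really line up, the timestamps collected from each host's hostd.log can be compared. This is only a rough sketch in Python: the function name `coinciding_events`, the 60-second window, the host names and the timestamps are all made up for illustration and are not taken from any real host.

```python
from datetime import datetime, timedelta


def coinciding_events(events_by_host, window=timedelta(seconds=60)):
    """Return (host_a, time_a, host_b, time_b) tuples for events on
    different hosts that happened within `window` of each other."""
    # Flatten to one list of (timestamp, host) pairs sorted by time
    flat = sorted(
        (ts, host) for host, stamps in events_by_host.items() for ts in stamps
    )
    matches = []
    for i, (ts_a, host_a) in enumerate(flat):
        for ts_b, host_b in flat[i + 1:]:
            if ts_b - ts_a > window:
                break  # sorted list: every later entry is even further away
            if host_a != host_b:
                matches.append((host_a, ts_a, host_b, ts_b))
    return matches


# Hypothetical host names and timestamps, for illustration only
events = {
    "esx-a": [datetime(2016, 1, 10, 3, 0, 5), datetime(2016, 2, 1, 12, 0, 0)],
    "esx-b": [datetime(2016, 1, 10, 3, 0, 20)],
}

for match in coinciding_events(events):
    print(match)
```

If unrelated hosts keep showing up as pairs here, that points towards something they share after all (network, a scheduled task, a management tool polling them) rather than the hardware of a single host.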
I'm actually still troubleshooting this issue.
I updated the HP Insight agents and the drivers, but 2 weeks later the issue recurred on the same ESXi host. Last night I upgraded the hosts from 5.5.0 Patch 4 (build 2403361) to 5.5.0 Update 3b (build 3248547), along with the vCenter from 5.5 Update 3a (build 3142196) to 5.5 Update 3c (build 3660016).
During this upgrade I noticed a known issue that sounds similar to this host disconnecting issue. I'll keep you posted on whether this fixes my situation, and maybe it could help in yours. Check the builds of your affected ESX/ESXi hosts and compare them against the table found here.