I have two ESX 4 hypervisors in my cluster of 6 that spontaneously dropped connectivity overnight (still not sure what happened). When I try to view their respective Summary tabs in the vCenter 4 client I get the message "HA agent disabled on <host> in cluster <cluster> in <datacenter>. Cannot synchronize host <host>. Operation timed out." I can't interact with those ESX servers at all: I tried to put one into maintenance mode and then reboot it, but the options are greyed out, and there aren't even any Alarms listed in that tab in vCenter! The odd thing is that the two VMs running on one of the ESX servers are live and pingable (although I can't see them through the console). The hosts are listed as Disconnected in the Hosts and Clusters view of the Inventory, and Edit Settings is greyed out as well. Does anyone have a clue what happened or how I can get my ESX servers back online? Thanks in advance.
Hosts need to be added into vCenter as root. Are you using the root credentials?
Yes, the same root credentials for all 6 hypervisors in this cluster, and the same credentials I've been using since they were set up three weeks ago. They'd been working fine until sometime between last night and this morning!
Issue the command below on the ESX host in question, then try to add it back into vCenter:
service vmware-vpxa restart
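If restarting vpxa alone doesn't do it, restarting both management agents sometimes does. A guarded sketch, assuming the standard init-script locations on the ESX 4 service console (the guard just makes it safe to run on a non-ESX box):

```shell
# restart_agent: restart a management agent if its init script is present.
# mgmt-vmware is the host agent (hostd); vmware-vpxa is the vCenter agent.
restart_agent() {
  if [ -x "/etc/init.d/$1" ]; then
    service "$1" restart
  else
    echo "skipped $1 (init script not found)"
  fi
}

restart_agent vmware-vpxa    # vCenter agent
restart_agent mgmt-vmware    # host agent
```

Note that restarting mgmt-vmware briefly interrupts management of the host, so do it from the console or an SSH session, not through vCenter.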
No dice. I get the Add Host Wizard popup again, but when I enter the proper root credentials, the error response is simply "An error occurred while communicating with remote host".
I think you have a name resolution issue in your environment, probably just between that host and vCenter. Please confirm that name resolution is set up properly and that you can resolve the names of the ESX host(s) and vCenter by both FQDN and short name.
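A quick way to sanity-check that from any of the boxes; this is just a sketch using getent (available on the ESX service console and most Linux machines; nslookup works too), with hypothetical host names in the comment:

```shell
# check_name: succeed only if the given name resolves to an address.
check_name() {
  if getent hosts "$1" >/dev/null 2>&1; then
    echo "$1: resolves"
  else
    echo "$1: DOES NOT resolve"
    return 1
  fi
}

# Run this for each ESX host and for vCenter, by short name and by FQDN,
# e.g.: check_name esx01 ; check_name esx01.example.com  (hypothetical names)
check_name localhost
```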
I think I might have to reboot the ESX servers, but I have no option in vCenter to put them in maintenance mode or reboot them; any such options are either not listed or greyed out. Should I reboot the servers with shutdown -r now from the CLI, or is there a better option? (I imagine vSphere won't like that one too much!)
Yes, the vCenter server and both ESX hosts can be pinged by FQDN and short name from both the AD DC and my workstation (which are on two different subnets).
here's the output from my latest attempt to connect to one of the ESX servers that's acting up:
"A general system error occurred: internal error: vmodl.fault.HostCommunication"
any idea what subsystem that pertains to?
I still say something with DNS/name resolution isn't right, but I could be wrong
See this: http://kb.vmware.com/kb/1012154 or this: http://kb.vmware.com/kb/1008707
Also, what version of vCenter are you running?
You also need to try pinging and resolving FQDNs from the ESX hosts to vCenter and to each other. Check that the case (upper/lower) matches.
It's vCenter 4, and I'll check out those articles, thanks!
I can successfully ping between both affected ESX hosts, and from the hosts to the vCenter server, by both FQDN and short name.
Check the time on both the hosts and vCenter.
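To put a number on the drift, a small sketch: ref below is a stand-in for the epoch seconds you'd read off the other machine with `date -u +%s` (significant clock skew between the hosts and vCenter can break the agent handshake, so keep everything NTP-synced):

```shell
# Compare this machine's clock against a reference timestamp (epoch seconds).
now=$(date -u +%s)
ref=$now             # stand-in: paste the other machine's `date -u +%s` here
drift=$(( now - ref ))
abs=${drift#-}       # absolute value of the drift
if [ "$abs" -le 300 ]; then
  echo "clock drift OK (${drift}s)"
else
  echo "clock drift too large (${drift}s)"
fi
```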
I am having a similar issue, and while everyone's advice is necessary to rule out basic problems, I think there is something more complicated going on here.
If you are running vCenter 4.1, there may be a bug in the VC agent 4.1. I have ESX 3.5 to 4.1 hosts that are intermittently having this issue; many have run for years with no problems like this. What I have found is that you first need to reestablish communication with the host by running service mgmt-vmware restart from an SSH login.
Once you reestablish a connection with VC, you can Disconnect the host by right-clicking it. Then Remove the host the same way.
Reconnect the host to the cluster in question. This reinstalls the VC agent and its daemons. On the hosts I have done this to, the problem goes away.
Just restarting the VC agents will fix the problem for a short amount of time, but it seems to come back.
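For reference, the sequence above in one place; only step 1 is a shell command (run on the host itself), so the vSphere Client steps are comments. Paths assume the standard ESX service-console init scripts:

```shell
recover_host() {
  # 1. Restart the host agent so vCenter can reach the host again.
  if [ -x /etc/init.d/mgmt-vmware ]; then
    service mgmt-vmware restart
  else
    echo "not an ESX host: skipping agent restart"
    return 1
  fi
  # 2. In the vSphere Client: right-click the host -> Disconnect.
  # 3. Right-click the host again -> Remove.
  # 4. Re-add the host to the cluster; this pushes a fresh VC agent (vpxa).
}
recover_host || true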
I have opened a case with VMware on this exact issue, and they are running all my logs through the Engineering group.
There seem to be some flaky issues with the daemons that are installed when you upgrade your VC to 4.1.
Sorry for the delay in resolving this issue! I contacted VMware support, and the issue ended up being related to APD (All Paths Down) processing maxing out local resources to the exclusion of all other ESX services. I had incorrectly unpresented a SAN LUN from vSphere before removing the associated datastore, so each of the hypervisors was still trying to locate the rogue resource, even though it didn't actually exist anymore! The rate of APD resource utilization seemed to increase exponentially until two of the hypervisors simply failed and the other 4 were on the verge. Support gave me the correct process for unpresenting a LUN from vSphere; alternatively, I was told I could simply reboot each hypervisor after making a change of that nature... go figure!
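For anyone hitting the same thing: the symptom showed up in the vmkernel log on our hosts. A quick scan sketch, assuming the ESX 4 default log location (the search patterns are just the dead-path phrasings I'd look for, not an exhaustive list):

```shell
# scan_apd: count dead-path / APD-looking lines in a vmkernel log.
scan_apd() {
  log=${1:-/var/log/vmkernel}    # ESX 4 default; pass another path if redirected
  if [ -r "$log" ]; then
    grep -icE "apd|all paths down|path.*(dead|down)" "$log"
  else
    echo "log not found: $log"
  fi
}
scan_apd
```

A steadily climbing count after a LUN disappears out from under a mounted datastore is the kind of thing that was eating our hosts alive.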