xlcor
Contributor

HA agent disabled on ESX 4 host

I have two ESX 4 hypervisors in my cluster of 6 that spontaneously dropped connectivity overnight (still not sure what happened). When I try to view their respective Summary tabs in the vCenter 4 client, I get the message "HA agent disabled on <host> in cluster <cluster> in <datacenter>. Cannot synchronize host <host>. Operation timed out." I can't interact with those ESX servers at all: I tried to put one in maintenance mode and then reboot it, but the options are greyed out, and there aren't even any alarms listed in the Alarms tab in vCenter!

The odd thing is that the two VMs running on one of the ESX servers are pingable and live (although I can't see them through the console). The hosts are listed as Disconnected in the Hosts and Clusters view of the Inventory, and Edit Settings is greyed out there too. Does anyone have a clue what happened, or how I can get my ESX servers back online? Thanks in advance.

34 Replies
Troy_Clavell
Immortal

Hosts need to be added into vCenter as root. Are you using the root credentials?

xlcor
Contributor

Yes, the same root credentials for all 6 hypervisors in this cluster, and the same credentials I've been using since they were set up three weeks ago. They were working fine until sometime between last night and this morning!

Troy_Clavell
Immortal

Issue the command below on the ESX host in question, then try to add it back into vCenter:

service vmware-vpxa restart

xlcor
Contributor

No dice. I get the Add Host wizard popup again, but when I enter the proper root credentials, the error response is simply "An Error Occurred While Communicating with Remote Host".

Troy_Clavell
Immortal

I think you have a name resolution issue in your environment, probably just between that host and vCenter. Please confirm that name resolution is set up properly and that you can resolve the names of the ESX host(s) and vCenter by both FQDN and shortname.
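A quick way to run that check from a shell is a loop like the one below. The host and vCenter names are placeholders, not the names from this thread; run it from the vCenter server and repeat from each ESX host:

```shell
# Verify forward resolution for each name involved; every FQDN and
# shortname should resolve from vCenter and from each ESX host.
# The names below are placeholder examples for your environment.
for name in esx01 esx01.example.com vcenter vcenter.example.com; do
  if nslookup "$name" >/dev/null 2>&1; then
    echo "OK: $name resolves"
  else
    echo "FAIL: $name does not resolve"
  fi
done
```

Any FAIL line points at the entry to fix in DNS (or in /etc/hosts on the service console).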

xlcor
Contributor

I think I might have to reboot the ESX servers, but I have no option in vCenter to put them in maintenance mode or reboot them; any such options are either not listed or greyed out. Should I reboot the servers with shutdown -r now from the CLI, or is there a better option? (I imagine vSphere won't like that one too much!)

xlcor
Contributor

Yes, the vCenter server and both ESX hosts can be pinged by FQDN and shortname from both the AD DC and my workstation (which are on two different subnets).

xlcor
Contributor

Here's the output from my latest attempt to connect to one of the ESX servers that's acting up:

"A general system error occurred: internal error: vmodl.fault.HostCommunication"

Any idea what subsystem that pertains to?

Troy_Clavell
Immortal

I still say something with DNS/name resolution isn't right, but I could be wrong

See this, http://kb.vmware.com/kb/1012154 , or this , http://kb.vmware.com/kb/1008707

Also, what version of vCenter are you running?

DSTAVERT
Immortal

You also need to try the ping and FQDN resolution from the ESX hosts to vCenter and to each other. Check that the case (upper/lower) matches.
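For the case comparison, a strict shell string match is enough, since [ = ] is case-sensitive. Both names below are made-up examples, not values from this thread:

```shell
# Compare the name vCenter has on record with what the host itself reports.
# A case-sensitive string comparison makes any case drift show up immediately.
name_in_vc="ESX01.example.com"    # as displayed in vCenter (assumed example)
name_on_host="esx01.example.com"  # e.g. output of `hostname -f` on the host (assumed example)
if [ "$name_in_vc" = "$name_on_host" ]; then
  echo "names match exactly"
else
  echo "mismatch (check case): $name_in_vc vs $name_on_host"
fi
```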

-- David -- VMware Communities Moderator
xlcor
Contributor

It's vCenter 4. I'll check out those articles, thanks!

xlcor
Contributor

I can successfully ping between both affected ESX hosts, and from the hosts to the vCenter server, both by FQDN and shortname.

jb12345
Enthusiast

Check the time on both the host and vCenter.
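One rough way to do that is to compare the two clocks in epoch seconds. The 60-second threshold below is an illustrative choice, not a documented vCenter limit; both samples here are taken locally, so substitute real readings from the host and the vCenter server:

```shell
# Collect `date +%s` on the ESX host and on the vCenter server, then compare.
# Sampled locally here for illustration, so the skew comes out near zero.
host_time=$(date +%s)   # run on the ESX host
vc_time=$(date +%s)     # run on the vCenter server
if [ "$host_time" -ge "$vc_time" ]; then
  skew=$((host_time - vc_time))
else
  skew=$((vc_time - host_time))
fi
if [ "$skew" -gt 60 ]; then
  echo "skew ${skew}s: fix NTP before reconnecting the host"
else
  echo "clocks agree within ${skew}s"
fi
```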

adamy
Enthusiast

I am having a similar issue, and while everyone's advice is necessary to rule out basic problems, I think there is something more complicated going on here.

If you are running vCenter 4.1, there may be a bug in the 4.1 VC agent. I have hosts from ESX 3.5 to 4.1 that are intermittently hitting this issue; many have run for years with no problems like this. What I have found is that you first need to reestablish communication with the host by running service mgmt-vmware restart from an SSH login.

Once you reestablish a connection with VC, you can Disconnect the host by right-clicking it, then Remove the host the same way.

Reconnect the host to the cluster in question. This reinstalls the VC agent and its daemons; on the hosts I have done this to, the problem goes away.

Just restarting the VC agents will fix the problem for a short amount of time, but it seems to come back.

I have opened a case with VMware on this exact issue, and they are running all my logs through the engineering group.

There seem to be some flaky issues with the daemons that are installed when you upgrade your VC to 4.1.

xlcor
Contributor

Sorry for the delay in resolving this issue! I contacted VMware support, and the problem turned out to be related to an all-paths-down (APD) condition maxing out local resources to the exclusion of all other ESX services. I had incorrectly unpresented a SAN LUN before removing the associated datastore, so each of the hypervisors kept trying to locate the missing device even though it no longer existed! The resource utilization from the APD condition seemed to increase until two of the hypervisors simply failed, and the other 4 were on the verge. Support gave me the correct process for unpresenting a LUN from vSphere, or, alternatively, told me to simply reboot each hypervisor after making a change of that nature... go figure!
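For anyone hitting the same thing: it is worth checking for lingering dead paths from the service console before and after unpresenting a LUN. This is a hedged sketch using the ESX 4 classic esxcfg-* tools; the adapter name is an assumption for your environment, and the guard just keeps the sketch harmless if run off-host:

```shell
# Look for storage paths flagged dead after a LUN is removed; lingering
# dead paths are what drive the APD retry behavior described above.
if command -v esxcfg-mpath >/dev/null 2>&1; then
  esxcfg-mpath -l | grep -i dead || echo "no dead paths found"
  # Rescan after the LUN is properly unpresented (vmhba1 is an assumed adapter name)
  esxcfg-rescan vmhba1
else
  echo "run this from the ESX service console"
fi
```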
