Hi guys,
Having a problem with one of our 3 hosts. The host says it's "disconnected". The virtual servers which are running on it say they are disconnected, however they are still physically working as servers.
I have now put the server into maintenance mode. After rebooting and bringing it back into the cluster, I get the following message when trying to enable HA: "An error occurred during configuration of the HA Agent on the host".
The cluster is set up for HA. However, if I select the properties of the other two hosts, it does give me the option to "configure for HA", which is strange. Any ideas?
After reading some other posts I checked DNS which seems fine. I can ping the host server from the others, not using the FQDN. I can ping the gateway no problem.
I should have said the server has worked in HA, but now
G
First question: are there any VMs currently running on the host that you rebooted?
No, it was in maintenance mode with all servers moved off it.
From the shell on that server, run this command:
service mgmt-vmware restart
See if this will let you connect again.
Thanks for the replies. Problem seems to have been down to an old WINS entry hanging around for the Virtual Centre.
G
My problem was "an error occurred during configuration of the HA Agent on the host" with additional error "Vmap_foobarvm02 process failed to stop".
This started occurring after I added a third ESX server (foobarvm03), moved the VMs to the new server and uninstalled the two old servers. Next I replaced the old servers with new hardware and installed ESX on them using the same names the old servers had. Everything went fine until I tried to add the two new servers to the HA setup, and I just got those errors. Switch configs, DNS, etc. were all fine, so the solutions that helped others didn't help me. After some poking around on the SC I figured out what Legato AAM is trying to do. Some tips below.
Check /opt/LGTOaam512/log for errors, especially aam_config_util_addnode.log. If it's filled with "Error [10022]: Process Not Found" and the last distinct error before them is "Error [10001]: Instance Already Exists", the problem is likely the same one I had.
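To make that log check quicker, here is a rough sketch that counts the two error codes in the log. The helper name summarise_aam_log is made up for this post, and it assumes the log lines contain the literal text "Error [10001]" and "Error [10022]" as quoted above; adjust the patterns if your log looks different.

```shell
# Hypothetical helper: count the two AAM error codes mentioned above.
# The default path is the log named in this post; pass another path to override it.
summarise_aam_log() {
    log="${1:-/opt/LGTOaam512/log/aam_config_util_addnode.log}"
    if [ ! -r "$log" ]; then
        echo "cannot read $log" >&2
        return 1
    fi
    # grep -c prints the number of matching lines; -F matches a fixed string,
    # so the square brackets are not treated as a regex character class.
    exists=$(grep -cF 'Error [10001]' "$log")
    notfound=$(grep -cF 'Error [10022]' "$log")
    echo "10001=$exists 10022=$notfound"
}
```

Run it as `summarise_aam_log` on the service console. A small 10001 count followed by a large 10022 count would match the pattern described here.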
It seems that when you install a new ESX server using the same name as an old one, it's going to fail. It could be that I removed the old servers incorrectly too. VMware tools did remove the accounts for the old nodes, but there was some other junk left on the Legato AAM side. Here's how you can re-add those logins; only then does VMware's HA configuration know what to do. It still takes longer than usual for the HA config to go through initially, but it does work after this.
Enter these on a working node (foobarvm03):
perl /opt/LGTOaam512/vmware/aam_config_util.pl -cmd=listnodes
- should say problematic nodes are down
FT_DIR=/opt/LGTOaam512 /opt/LGTOaam512/bin/ftcli -d vmware
- You get an 'AAM>' prompt. Type:
deleteUser root foobarvm01
deleteUser root foobarvm02
createUser root foobarvm01 Node PERM_ALL
createUser root foobarvm02 Node PERM_ALL
Check the user list. It should have an admin account for all nodes now.
listUsers
Attempt logon from broken node (foobarvm02):
FT_DIR=/opt/LGTOaam512 /opt/LGTOaam512/bin/ftcli -d vmware
No errors? Does it work?
listUsers
Good. Now just retry HA setup using VirtualCenter.
WINS? You think so? HA problems are usually DNS resolution on the ESX nodes... so definitely not WINS.
Anyway, glad it's working.
Dave
They may well be. However, what if the ESX hosts are pointing at Windows DNS, which is using WINS as a last resort for resolving names?
http://www.microsoft.com/technet/archive/winntas/deploy/integrat.mspx?mfr=true
Make sure that your hosts file has the correct hostname and IP address for the server on which you are getting the error.
We had a server that had been built with the wrong IP. This was changed, but the local hosts file was not updated.
We changed this and took the host out of maintenance mode, and the HA Agent started with no issues.
I had this same problem.
Make sure that you configure the hosts file on every host and include all your host names in the format IP address, FQDN, SHORTNAME. Do not rely on DNS to resolve your shortnames - this was recommended to me by VMware support. So... edit the following file:
/etc/hosts
Make sure you enter the IP address (space) FQDN (space) shortname
i.e.
192.168.10.1 esx1.suffix.com esx1
192.168.10.2 esx2.suffix.com esx2
and so on, for every host, in every host's hosts file.
The shortname was the key for me...
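If you have a lot of hosts, something like this sketch can confirm each one has a line in that exact IP / FQDN / shortname layout. check_hosts_entry is just a name I made up, and it assumes the shortname is everything before the first dot of the FQDN.

```shell
# Hypothetical helper: succeed only if FILE has a non-comment line of the form
#   IP FQDN SHORTNAME
# for the given FQDN. The shortname is assumed to be the FQDN up to the first dot.
check_hosts_entry() {
    file="$1"
    fqdn="$2"
    short="${fqdn%%.*}"
    awk -v f="$fqdn" -v s="$short" '
        $1 !~ /^#/ && $2 == f && $3 == s { found = 1 }
        END { exit (found ? 0 : 1) }
    ' "$file"
}

# Example use, with the host names from this post:
#   for h in esx1.suffix.com esx2.suffix.com; do
#       check_hosts_entry /etc/hosts "$h" || echo "missing or incomplete: $h"
#   done
```

It only checks the layout, not whether the IP address is the right one, so you still need to eyeball the addresses.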
I had the same problem; it ended up being a case sensitivity problem with VMware. In the VMware DNS configuration in VC, the server name had caps in it, while our DNS records obviously did not. Changing it to all lower case solved the problem. That is another thing to check for those of us who despise hosts files.
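For anyone wanting to check names in bulk, a small sketch along these lines flags any name containing capitals and prints the lower-case form. Both function names are invented for this post, and whether mixed case actually breaks HA in your version is something to verify yourself.

```shell
# Hypothetical helpers for spotting mixed-case host names.
# has_uppercase NAME : succeeds if NAME contains at least one upper-case letter.
has_uppercase() {
    case "$1" in
        *[[:upper:]]*) return 0 ;;
        *) return 1 ;;
    esac
}

# to_lower NAME : prints NAME converted to lower case.
to_lower() {
    printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]'
}

# Example use:
#   has_uppercase "ESX1.suffix.com" && echo "use $(to_lower ESX1.suffix.com) instead"
```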
Hi Johu,
Thanks for your troubleshooting procedure. It helped me solve my HA error problem.
abraju
I too just had this problem.
I changed the IP address from the console using esxcfg-vswif, and /etc/hosts still had the old IP.
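A quick way to spot this kind of stale entry is to print the address that the hosts file records for a name and compare it by eye with the address actually configured on the service console interface. This is only a sketch; hosts_ip_for is a made-up helper name.

```shell
# Hypothetical helper: print the IP address that a hosts file records for NAME.
# Prints nothing if NAME is not listed. FILE defaults to /etc/hosts.
hosts_ip_for() {
    name="$1"
    file="${2:-/etc/hosts}"
    awk -v n="$name" '
        $1 ~ /^#/ { next }                      # skip whole-line comments
        {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^#/) break            # stop at a trailing comment
                if ($i == n) { print $1; exit } # first match wins
            }
        }
    ' "$file"
}

# Example use: compare this with the address you set via esxcfg-vswif.
#   hosts_ip_for "$(hostname -s)"
```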
Hi,
I've had the same error for a few days and I have tried all of these things, but nothing seems to work.
Must I restart the VMware daemons each time I make a DNS change? Or must I reboot the machine?
Can anyone tell me how to restart the VMware daemons after I make a DNS modification?
Thanks
I just got bitten by this, and I have set up many HA/DRS clusters.
DNS was OK, but HA wouldn't stay configured. Pinging the IP, short name and FQDN from each ESX host's console gave all the expected results.
It was one /etc/hosts file. An ESX host had been renamed, but its /etc/hosts file still had the old entry.
I made the change and all was OK. If DNS (even Microsoft) has the correct A record and you have a PTR record, I have never had any issues.
Error: An error occurred during configuration of the HA Agent on the host
Solution: Try these 4 steps one by one
1) Check that DNS is working correctly and that, using your DNS server, you can resolve each ESX host by name.
2) If there is no DNS server in place, edit the /etc/hosts file and add an entry for each ESX host. NOTE: Make sure you add the FQDN and also the shortname. You can obtain the shortname by typing hostname -s.
3) Disconnect the ESX hosts from VirtualCenter and connect them back.
4) Disable the VMware HA feature on the cluster and enable it again.
You will find that one of these steps will solve your problem.
Raj
I have a question about the DNS portion. Does this still apply to 3i 3.5? Is there still a hosts file on this version that we can look at?
Hosts file:
/etc/hosts
Also look at /etc/resolv.conf for DNS.
Thanks for the reply. I thought Linux was not on the new version, which is why we use the Remote CLI. How does one get to the ESX host's local hosts file then?
I think I found a way to see the hosts file and the rest of the files as well on a 3i 3.5 host. In the VI client, open the Administration menu and choose Export Diagnostic Data. Select the hosts whose logs you want and the destination directory. Then use something like 7Zip to extract the tgz file that is created. The folder structure of the host is then accessible, along with all of the files like the hosts file, resolv.conf, etc.
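If your workstation has a Unix-style shell instead of 7Zip, roughly the same thing can be done with tar and find. This is a sketch only: find_in_bundle is a made-up helper, and I am assuming the exported tgz unpacks directly into the host's folder tree (a real bundle may nest further archives inside it).

```shell
# Hypothetical helper: unpack a diagnostic .tgz into a fresh temporary directory
# and print the path of every regular file with the given name.
find_in_bundle() {
    bundle="$1"
    name="$2"
    dest=$(mktemp -d) || return 1
    tar -xzf "$bundle" -C "$dest" || return 1
    find "$dest" -type f -name "$name"
}

# Example use (the bundle file name here is a placeholder):
#   find_in_bundle exported-bundle.tgz hosts
#   find_in_bundle exported-bundle.tgz resolv.conf
```

The extracted copy is left in the temporary directory so you can open the files it lists; delete that directory when you are done.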