I'm testing upgrade paths to vSphere from ESX 3.5 Update 4 on an IBM BladeCenter with two HS21 XM blade servers. I'm running into several problems, and the latest is the one mentioned in the subject of this thread.
In this scenario I upgraded vCenter successfully, then moved all VMs onto a single ESX 3.5 host, removed the other host from the cluster and then from vCenter, did a fresh install of vSphere on it, and reconnected it to the cluster; then I repeated the whole procedure with the second node. At the end I have two hosts with vSphere installed, but I had to disable HA in my cluster, since I always get this error when I try to configure the HA agents on the hosts. I have to say that DRS works fine, though.
In the release notes of vSphere, in the known issues section, I can read:
"Upgrading from an ESX/ESXi 3.x host to an ESX/ESXi 4.0 host results in a successful upgrade, but VMware HA reconfiguration might fail
When you use vCenter Update Manager 4.0 to upgrade an ESX/ESXi 3.x host to ESX/ESXi 4.0, if the host is part of an HA or DRS cluster, the upgrade succeeds and the host is reconnected to vCenter Server, but HA reconfiguration might fail. The following error message displays on the host Summary tab: HA agent has an error : cmd addnode failed for primary node: Internal AAM Error - agent could not start. : Unknown HA error .
Workaround: Manually reconfigure HA by right-clicking the host and selecting Reconfigure for VMware HA."
The problem is that this workaround doesn't work for me, so I was wondering if someone is once again able to help me with this issue.
Thanks in advance for your support.
Hi Caddo,
I had the same issue with one of my hosts. It was resolved by disabling HA on the cluster and enabling it again.
Mirko
I am having this exact problem with three R900s running ESX 4 with an FC SAN and vCenter. I have everything upgraded to 4.0, including the Tools on the VMs themselves.
After a collective eight hours on the phone with Dell/VMware support, they managed to come up with the following (it did not help for us):
A. From within VC:
1. Remove the servers
2. Remove the cluster
3. Create a new cluster
4. Add the hosts
If that does not work, try:
B. Run the following command as root in the console of each of the hosts:
/opt/vmware/aam/bin/VMware-aam-ha-uninstall.sh
Then try to remove the host and add it back into the cluster.
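For reference, the console step above can be wrapped in a small guard so it fails loudly if the script path differs on your build. This is only a sketch: the script path is the one quoted by support, and the log file name is my own invention, not anything VMware documents.

```shell
# Sketch of the console cleanup suggested by support: remove the legacy
# AAM (HA) agent, logging the outcome for the support case. The script
# only exists on an ESX service console, so the call is guarded.
# The log path /tmp/aam_cleanup.log is arbitrary (my own choice).
AAM_UNINSTALL=/opt/vmware/aam/bin/VMware-aam-ha-uninstall.sh
LOG=/tmp/aam_cleanup.log

if [ -x "$AAM_UNINSTALL" ]; then
    "$AAM_UNINSTALL" && echo "AAM agent removed" >> "$LOG"
else
    echo "AAM uninstall script not present (already removed, or not an ESX console)" >> "$LOG"
fi
cat "$LOG"
```

After the script runs, remove the host from the cluster and add it back as described above.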
I had already tried everything you suggested, with no luck; eventually I reinstalled both ESX hosts and HA is now working just fine.
Thanks anyway; I hope this will help someone else.
This is what my /etc/hosts file looked like after a disk installation of ESX4:
127.0.0.1 localhost
::1 localhost
172.114.20.133 HOST3.ORG.DS.COM
Notice that the short host name is not appended to the end of the line. It should look like this:
172.114.20.133 HOST3.ORG.DS.COM HOST3
172.114.20.35 HOSTVMVIC1.ORG.DS.COM HOSTVMVIC1
172.114.20.131 ESXHOST1.ORG.DS.COM ESXHOST1
172.114.20.132 ESXHOST2.ORG.DS.COM ESXHOST2
This is a hosts file from a machine that I upgraded to ESX4 using vCenter Update Manager:
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1 localhost.localdomain localhost
172.114.20.131 ESXHOST1.ORG.DS.COM
172.114.20.132 ESXHOST2.ORG.DS.COM ESXHOST2
172.114.20.133 ESXHOST3.ORG.DS.COM ESXHOST3
172.114.20.35 HOSTVMVIC1.ORG.DS.COM HOSTVMVIC1
Again, you will notice that the short host name is not appended to the first line. It should look like this:
172.114.20.131 ESXHOST1.ORG.DS.COM ESXHOST1
Once I fixed those files, all the systems came up in HA.
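If it helps anyone, the check and fix described above can be scripted. This is a sketch in plain sh/awk that works on a copy of the file in /tmp; the sample entries are the ones from this thread, so substitute your own names and IPs, and review the result before touching the real /etc/hosts.

```shell
# Sketch: append the missing short name to any /etc/hosts entry that
# carries only the FQDN. Works on a copy in /tmp; review the output
# before replacing the real /etc/hosts. Sample entries are from this thread.
cat > /tmp/hosts.orig <<'EOF'
127.0.0.1 localhost.localdomain localhost
172.114.20.131 ESXHOST1.ORG.DS.COM
172.114.20.132 ESXHOST2.ORG.DS.COM ESXHOST2
EOF

# If the last field on a line is an FQDN and its short form is not
# already present, append the part before the first dot as the short name.
awk 'NF == 0 || $1 ~ /^#/ { print; next }
{
    n = split($NF, p, ".")
    if (n > 1 && $0 !~ ("[ \t]" p[1] "([ \t]|$)"))
        $0 = $0 " " p[1]
    print
}' /tmp/hosts.orig > /tmp/hosts.fixed

cat /tmp/hosts.fixed
# The ESXHOST1 line now ends with "ESXHOST1"; the other lines are unchanged.
```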
Disable HA and DRS, then enable ONLY HA by itself; it should go through without a problem. Then enable DRS on its own after HA is on.
Neither of those options worked.
It all came back to the host files of the servers.
I would try disabling/enabling HA on the cluster first. That worked for me, too.
The disadvantage of disabling DRS is that you lose your resource pool configuration. Therefore I would only do that if disabling/enabling HA on the cluster alone is not enough.
Way to go, Mvarre!
You made my day.
I had totally run out of options, but your solution of enabling only HA first and then DRS solved the problem.
Best regards,
Adeel Akram
Hello,
The host entries worked for us; we added the short name at the end like this:
esx.domain.com esx
We added it on both servers and it worked.
If you find this or any other answer useful, please consider awarding points by marking the answer correct or helpful.
Well, in my case this initially solved my problem. But this time I had all the entries in my hosts file correct and still could not resolve the issue, so enabling HA and DRS one at a time solved it for me.
Weird things happen with HA in ESX.
Oddly, I encountered this problem not after upgrading from ESX 3.5 to 4.0, but after applying the first round of patches that took me up to build 175625. The first host I patched refused to enable HA... after fixing the hosts file all was good. I proactively fixed the hosts file on the upgraded but not patched host, and after patching everything was still good there.
My cluster is running a mix of HS20 and HS21 XM blades -- can't VMotion between them, sadly... but when the next HS21 comes in, I'll be able to decommission the HS20s entirely.
Had this very same issue. After moving the hosts to a new vCenter, HA couldn't start. Thanks to Remnarc for pointing me in the right direction.
Originally I had our hosts on a 192.168.1.x IP subnet and later moved them to the 10.10.0.x subnet. The /etc/hosts file still had the old IPs listed.
After changing them and reconfiguring for HA it worked.
Modifying the HOSTS file worked for me too! Thanks.
Hi,
I had exactly the same problem in a lab environment. After reading caddo's post I checked the hosts files, and sure enough they had only the FQDN but not the NetBIOS name. Instead of modifying the hosts file, I decided to experiment a little. Because this was a lab environment I didn't have a running DNS server on the subnet, so I set one up on the VC. I created A records and corresponding PTR records for both hosts. Then I changed the DNS and routing settings on both hosts, removed them from the cluster, and re-added them using their FQDNs. After that I was able to set up HA and DRS with no problem. Hope this helps.
I. Mitkov
Had the same problem with ESXi hosts.
Resolved it by creating A records on the DNS server and static records in WINS for both hosts.
Sergiy
I also had this problem. I also hadn't applied all patches to all my servers. Here is my scenario:
3 ESX Servers
The cluster consists of server1 with 4.0.0 Build 244038 and server2 with 4.0.0 Build 208167.
The server I tried to add (server3) had 4.0.0 Build 244038, the same as the highest build in the cluster (this one failed with the Internal AAM Error).
This is what I did to resolve the issue:
1. Disconnected server2 (the one with the lowest build)
2. Ran Reconfigure for VMware HA on server3 (the server was now added successfully).
3. Reconnected server2 (it was also added to the HA cluster successfully; I made sure to upgrade it to the latest build afterwards).
I also checked my hosts files, and they are not updated with FQDN names, but I have added my servers to the Windows DNS servers, so the ESX servers resolve names successfully anyway.
Hi All,
For my case, I did the following steps and it worked out:
1) Enter the host under “Maintenance Mode”
2) Uncheck “Turn On VMware HA” box at the cluster level
3) Take the host out of “Maintenance Mode”, and
4) Finally enable HA at cluster level
The host then became part of HA without any issues.
Hello,
I had a situation with six hosts in one cluster. I created another cluster and moved a host into that HA cluster. Before taking it out of maintenance mode I used Remediate and applied 30 patches to it. After I removed it from maintenance mode inside the second HA cluster, I received the same error message, even after all the steps found in community forums and technical documents.
Then I created yet another HA cluster, moved each host out of the second cluster, removed them from maintenance mode, and then dragged them into the third cluster. Only then did I get no errors, and I completed the move of the other hosts to this new cluster.
B.