chjones
Enthusiast

HA failures after vCenter 4.1 upgrade - My resolution

Hi all,

Last night I upgraded my vCenter Server from 4.0 to 4.1, and I thought I'd share the HA issue I encountered afterwards and how I resolved it, in case someone else has the same problem.

The vCenter Server upgrade itself was easy enough; it always is. I had to remove Orchestrator before vCenter would upgrade, but that was nothing major; the installer warns you about it if you haven't removed it.

After the server upgrade and updating the VI Client, all my DRS & HA clusters were showing HA errors stating that a primary HA server could not be found. Every single cluster node had this error (we have 5 HA clusters in 5 sites across Australia). If HA has errors, vMotion fails, so I had to disable HA on all of the clusters until I could fix it.

I tried everything I read in forums, technotes, knowledge bases, etc.: renaming the clusters, removing and re-adding hosts, restarting management agents, restarting hosts, creating new clusters, and so on. Nothing worked.

I uploaded the ESXi 4.1 upgrade package into Update Manager and upgraded the hosts in one of the clusters. Same issue: HA was reporting a "General Error".

I ended up removing one of the hosts from vCenter and rebuilding it from the ESXi 4.1 ISO, then reconfiguring it and adding it back to vCenter. This fixed the problem, but it is a massive time waster.

I tried a few other things and found a solution that has been working 100% for me across 20 hosts so far:

1. Disable HA on the Cluster

2. Place the host into Maintenance Mode

3. Remove the host from the cluster

4. Remove the host from vCenter (make sure you remove and not just disconnect, as remove uninstalls the vCenter Agent)

5. Connect to the console of the host (I use HP iLO for this) and reset the system configuration to defaults and let the server restart

6. Reconfigure the Root Password and Management Network

7. Add the host to vCenter

8. Configure the host settings (networking for the cluster, ntp, security profile, etc)

9. Add the host back to the cluster

10. Voila! The HA agent is downloaded to the host and installed, and HA now configures correctly

This process works for me on hosts that are either 4.0 or 4.1. I don't have any 3.5 hosts anymore, so I haven't tested whether it works for them.

I hope this will help anyone else that has a similar issue. This process takes me about 10-15 mins per host, which is quicker for me than rebuilding the host. I guess it's just a matter of whichever works best for you, but this works well for me.

Cheers,

Chris Jones

Canberra, Australia

19 Replies
schepp
Leadership

Hi Chris,

Thanks for sharing. I'm planning to do some upgrades today, including some HA clusters. Let's see how it goes ;)

Regards,

Tim

schepp
Leadership

Upgraded one cluster now. Had the same error message, but was able to fix it just by disabling HA in the cluster settings, waiting for the hosts to uninstall it automatically, and then enabling it again.

Problem solved for me, yay :)

chjones
Enthusiast

Good to hear, Tim. Hopefully your solution works for other people, as it's definitely easier than the one I had to use. I wish your fix had worked for me :)

Thanks for the update.

Chris

AllBlack
Expert

We upgraded to vCenter 4.1 last week and all was fine HA-wise. Today I added a host, and that host would not configure for HA.

After running the usual checks such as DNS etc., I decided to disable HA, rename the cluster, and enable HA again. The new host was still not configured, so I selected to reconfigure HA for just that host. Oh boy, it looked like everything was turning to custard. Every host started to show failures, and I mean major failures.

Luckily it recovered, and I removed the host from the cluster. I will open an SR with VMware because this was bad.

The new host was ESXi 4.1 while the other hosts are ESX 4.0, but that should not be a problem; mixing is allowed.

So now I am not sure whether it is a vCenter or an ESXi issue.

Please consider marking my answer as "helpful" or "correct"

arturka
Expert

Thanks for the nice procedure.

But this problem and its solution have been known since version 3.5.

Visit my blog:

http://www.vmwaremine.com

If my answer is useful, click some stars :)

VCDX77 My blog - http://vmwaremine.com
roconnor
Enthusiast

This is the fix I used for ESX 4.0.1 hosts on vCenter 4.1; there was no need to put hosts in maintenance mode or reboot:

1. Disable HA on the Cluster

2. Disconnect then remove each host from vCenter

3. Remove the Cluster

4. Create a new cluster, enable HA, DRS

5. Remove the HA packages from each host*

a. Log in as root to the ESX service console.
b. Run the command:
rpm -qa | grep -i aam
This returns two packages named similar to:
VMware-aam-haa-#.#.#-#
VMware-aam-vcint-#.#.#-#
c. Run these commands to remove the packages returned by step 5b.
Note: Be sure to remove the VMware-aam-vcint-#.#.#-# package first.
rpm -e VMware-aam-vcint-#.#.#-#
rpm -e VMware-aam-haa-#.#.#-#
d. Run the command:
rpm -qa | grep -i vpxa
A package named VMware-vpxa-#.#.#-##### is returned.
e. Run this command to remove the package returned by step 5d:
rpm -e VMware-vpxa-#.#.#-#####

6. Add the host to the newly created cluster and test whether this has resolved the issue.

I think there may be a simpler way, but just disabling and re-enabling HA didn't work for me.

*http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=10037...
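If you have several hosts to clean up, the package removal in step 5 can be semi-automated. This is a sketch of my own, not from the KB article; it assumes classic ESX where the agents ship as rpm packages named like the VMware-aam-*/VMware-vpxa-* patterns above. The helper only computes the safe removal order (vcint before haa, then vpxa) from an rpm -qa listing, so you can review the list before actually running rpm -e:

```shell
# Reads an `rpm -qa` listing on stdin and prints the agent packages
# in the safe removal order: vcint first, then haa, then vpxa.
removal_order() {
    pkgs=$(cat)
    echo "$pkgs" | grep -i aam | grep -i vcint
    echo "$pkgs" | grep -i aam | grep -i haa
    echo "$pkgs" | grep -i vpxa
}

# On the ESX service console you would then run (review the output first!):
#   rpm -qa | removal_order
#   rpm -qa | removal_order | xargs -n 1 rpm -e
```

This just saves you from accidentally removing haa before vcint; the actual rpm -e step is still up to you.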

AllBlack
Expert

I have VMware looking at my issue and they are as dumbfounded as I am. Every time we think we have found the culprit, it turns out to be something else, so we are still looking for a common denominator.

First we thought it was caused by adding an ESXi 4.1 host, but after that one was removed we had issues with a classic 4.0 U2 host. It appears that every time you do something HA-related, all hosts in the cluster start throwing errors.

IMO this points to vCenter as the common denominator, but in saying that, we only have an issue in one of our clusters, so it could just be related to that one cluster.

We have pretty much tried everything that roconnor did other than creating a new cluster, so we will attempt that today. I have rebuilt the ESXi 4.1 host with ESXi 4.0 U2; it will be interesting to see how HA reacts. This is done to rule out a possible ESXi 4.1 issue.

Please consider marking my answer as "helpful" or "correct"

AllBlack
Expert

The other thing I am curious about is whether anyone in this scenario is using IPv6, or just IPv4?

Please consider marking my answer as "helpful" or "correct"
AllBlack
Expert

Just spent another three hours on the phone with VMware. Creating a new cluster and adding some hosts causes the same problem. Each time you touch HA there are issues. It all seems to point to vCenter 4.1 now.

There appeared to be no issue with adding an ESXi 4.0 U2 host. I am also getting DRS issues: "Unable to apply DRS resources, general fault occurred". Probably caused by the same thing, as DRS is now more tightly integrated with HA.

We are slowly peeling away all the layers of the onion...

I seem to find some errors in aam_config_util_addnode.log, but so far VMware support hasn't even looked at them.

Please consider marking my answer as "helpful" or "correct"

roconnor
Enthusiast

Hi Allblack

Moved a couple of hosts from maintenance mode back to operational and HA failed on both. Removing them from the cluster/vCenter and manually removing the HA packages didn't work. Disabling both HA AND DRS, waiting 5 minutes, then re-enabling HA and DRS fixed everything.

Repeated the test and this time HA installed correctly..?

We have two clusters in this VC and only the PROD has been affected.

I want the functionality of DRS VM to Host Group affinity that is only available in vSphere 4.1, but I don't feel comfortable with this HA issue.

...Additionally I have multiple 'The storage service is not initialized' errors, which I didn't have in 4.0.1

PS. This guide for HA troubleshooting was useful

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=100159...

AllBlack
Expert

No solution yet, but a pattern is starting to emerge. I think it may have something to do with secondary service consoles for iSCSI. They are not required anymore these days, but we still have them. Right now that seems to be the common denominator.

I am moving all hosts with a secondary service console to a different cluster. My HA error appears to have gone, but I am getting a DRS error on two hosts. Funnily enough, these are the two hosts with an active iSCSI connection. We have one VM that is dependent on it, and unfortunately we are not in a position to disconnect it right now.

What I find weird is that when I disabled DRS and HA, these two hosts were still complaining about being unable to apply DRS! I waited 15 minutes and re-enabled HA/DRS on the cluster. HA enables fine again, but the DRS error remains.

The saga continues......

Please consider marking my answer as "helpful" or "correct"

mdgunther
Contributor

Any resolution from VMware on this? I'm about to call them myself.

I just upgraded yesterday and am having the same issue. Specific error messages below.

Cluster:

Insufficient resources to satisfy HA failover level on cluster (CLUSTER)

Unable to contact a primary HA agent in cluster (CLUSTER)

Host:

HA agent on (HOST) in cluster (CLUSTER) has an error: Cannot complete the HA configuration

Tasks:

Cannot complete the configuration of the HA agent on the host. Misconfiguration in the host network setup.

aam_config_util_addnode.log on the host (paraphrased):

Failure location: function main::myexit called from line 982, function main::validate_iso_address called from line 1216, function main::add_aam_note called from line 210, VMwareresult=failure

I've tried creating a different cluster and moving two hosts there, and it is still broken. Reinstalling or reconfiguring isn't something I want to do right now, figuring there is probably some config file that needs to be fixed. I was planning on eventually moving to ESXi 4.1 at some point, but I was just focused on vCenter since I had the 32-bit to 64-bit migration to do.

s1xth
VMware Employee

I had this same issue when I upgraded my vCenter to 4.1. I didn't have the problem until I upgraded a host to 4.1, though. I recommend reading through the following KB article from VMware: http://goo.gl/cjyV and trying what it recommends by reinstalling the HA agents. This should definitely clear the problem up. If not, turn HA AND DRS completely off on the cluster and then turn them back on.

I would also make sure the BIOS/firmware on your hosts is up to date. ESXi/ESX 4.1 is very particular about BIOS this time around, and outdated firmware can cause odd problems.

Hope that is some help.

Blog: www.virtualizationbuster.com

Twitter: s1xth

http://www.virtualizationimpact.com http://www.handsonvirtualization.com Twitter: @jfranconi
mdgunther
Contributor

Thanks for the advice, but it didn't work. Since I have ESX, I went the rpm -e route to remove the agents.

I have five ESX 4.0 U1 hosts and three of them were sleeping. The two that were on were happy with HA. Then I turned off DPM and it immediately powered on the three sleepers. It was at that time it updated the agents (due to run after the vCenter upgrade) on all the hosts. (Yes, it did not update the agents on the two running hosts until all the rest of the hosts were on.) After the update it reconfigured HA for the hosts and everything went crazy.

So the agent update upset HA. I ran updates to U2 on the three previously sleeping hosts, and that doesn't seem to have changed anything. The two hosts I pulled into a new cluster were running U2, so having U1 and U2 mixed is likely not the problem. However, I did not try the suggested procedure on the test cluster with two hosts.

roconnor
Enthusiast

If you try the turn-off-HA-and-DRS route, be sure to leave plenty of time for the agents to be removed; this seems to have been an issue for me. I think I was turning HA back on too quickly, and the remove-HA-agent task had not finished on the ESX host.

Try turning HA and DRS off and going for lunch. Then, if you want to be very thorough, check that the agent is removed; once you are sure the cluster is clean, verify your DNS is good and re-enable.

You might want to check out this guide for HA troubleshooting (http://xtravirt.com//xd10005), although if you have a support contract the quickest route may be to open a ticket with VMware and read the guide while you are waiting for them to call you back.
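For the "check if the agent is removed" step, a quick way on a classic ESX service console is to look for the aam packages in the rpm database. This is just a sketch of my own (not an official check, and it won't apply to ESXi, which has no rpm database); wrapping the grep in a small function lets you feed it any package listing:

```shell
# Reports whether HA (aam) agent packages still appear in a package
# listing. Feed it the output of `rpm -qa` on stdin.
ha_agent_state() {
    if grep -qi aam; then
        echo "still installed"
    else
        echo "removed"
    fi
}

# On the host: rpm -qa | ha_agent_state
# Only re-enable HA on the cluster once this reports "removed".
```

Note the match is a simple case-insensitive "aam", so any unrelated package containing that string would also trigger it; on the hosts here that shouldn't happen.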

JimKnopf99
Commander

Hi,

I upgraded my 8 hosts today and had no problems.

I disabled HA for the whole cluster, installed the upgrade via Update Manager, and then re-enabled HA on the cluster.

Frank

If you find this information useful, please award points for "correct" or "helpful".

mdgunther
Contributor

I have found the solution to my problem. First the answer:

Set das.usedefaultisolationaddress to False under the HA Advanced Options.

http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&docType=kc&externalId=1002117&sl...

Why this works for me:

I was using two isolation addresses because HA is on a private subnet with no default route, so I had to pick at least one alternate address to use (I chose two to be safe). This used to work fine in vCenter 2.5 and 4.0. With the agent update in 4.1, my guess is that the failure of the default route is no longer tolerated. I knew from the logs that it was still trying the route, but the two IPs I specified were working. As stated in the article, a subset of failures used to allow configuration to complete (and turn the cluster yellow), but with the 4.1 agents that has apparently changed.

roconnor
Enthusiast

Yep, I agree it is something to do with das.isolationaddress.

I did the following in two separate sites, both with 4.1, and we no longer have any issues. However, on the first site we have not set das.usedefaultisolationaddress to false, only increased the timeout.

Our second site is more like mdgunther's.

Where we have a single console on vSwitch0 with two adapters, we added:

das.failuredetectiontime = 60000

Where we have two consoles on separate vSwitches connected to different NICs:

das.failuredetectiontime = 20000

das.isolationaddress1 = <IP service console 0>

das.isolationaddress2 = <IP service console 1>

das.usedefaultisolationaddress = false

See the HA Best Practices on pages 28-29 of http://www.vmware.com/pdf/vsphere4/r41/vsp_41_availability.pdf

Josh26
Virtuoso

For ESXi users, the equivalent process for removing HA is:

Disable HA on the cluster

Access Troubleshooting Mode (the console)

Run this command to uninstall the HA agent: /opt/vmware/aam/VMware-aam-ha-uninstall.sh

Reconfigure HA on the cluster
