Solved: Re: HA fails after host exit standby

GavinConway · ‎11-27-2010

Hi All,

We have an HA, DRS and DPM cluster running ESX 4.1 build 320092. Occasionally, when we power up a host, HA gives an error and knocks itself over.

We get the following error in vSphere, and it takes down HA on every node. The HA configuration progress climbs slowly, seems to stall at 87% and then fails, taking 10-20 minutes to time out:

HA agent has an error : cmd addnode failed for primary node: Internal AAM Error - agent could not start. : Unknown HA error

We've got proper forward and reverse DNS in place, the hosts all pass a host-profile check and we've not got any network interface issues.

Can anyone shed any light on how best to investigate this further?

Thanks

Gavin

vickey0rana · ‎11-28-2010

Try this one:

Turn off HA for the cluster.

Then, from the ESXi 4.1 SSH console (you can enable SSH from Configuration > Security Profile > Properties):

run the uninstall script

./opt/vmware/aam/VMware-aam-ha-uninstall.sh

services.sh stop

services.sh start

Re-enable HA for the cluster and click "Reconfigure for VMware HA" if it doesn't do it automatically.

Refer to: kb.vmware.com/kb/1007234

---------------------------------------------------------------- If you found this or any other answer helpful, please consider to award points. (use Correct or Helpful buttons) BR, Ravinder S Rana
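The sequence in the post above can be sketched as a small script. This is only an illustration, not a tested procedure: it assumes HA has already been turned off for the cluster in the vSphere Client, it uses the uninstall script path from the post written as an absolute path (the leading `./` in the post only resolves when the command is run from `/`), and the `DRY_RUN` switch and the function names are hypothetical helpers added here. With the default `DRY_RUN=1` nothing is executed; the commands are only printed.

```shell
#!/bin/sh
# Sketch of the HA agent reinstall sequence from the post above.
# Assumption: HA is already disabled for the cluster in vSphere Client.
# Assumption: the path below is the one quoted in the post; verify that it
# exists on your build before running with DRY_RUN=0.
UNINSTALL_SCRIPT="${UNINSTALL_SCRIPT:-/opt/vmware/aam/VMware-aam-ha-uninstall.sh}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

reinstall_ha_agent() {
    run "$UNINSTALL_SCRIPT" || return 1   # remove the AAM (HA) agent
    run services.sh stop || return 1      # stop the host services
    run services.sh start                 # start them again
}

reinstall_ha_agent
```

After the services are back up, re-enable HA for the cluster from the vSphere Client as described in the post.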


idle-jam · ‎11-27-2010

Hi,

Can you click "Reconfigure for HA" on each of the hosts? Once done, try to re-enable HA in the cluster.

iDLE-jAM | VCP 2, VCP 3 & VCP 4

If you found this or any other answer useful please consider the use of the Helpful or correct buttons to award points

AndreTheGiant · ‎11-27-2010

Have a look at this KB: http://kb.vmware.com/kb/1001596

Andre

Andrew | http://about.me/amauro | http://vinfrastructure.it/ | @Andrea_Mauro

vickey0rana · ‎11-28-2010

Rename the aam directory from the console (/opt/vmware/aam/) to /opt/vmware/aam_old, then try to add the server to the cluster.


GavinConway · ‎11-28-2010

idle-jam - That's great, but if the host exits standby at 2am and destabilises the cluster, I'd rather not have to log on to click reconfigure. No points awarded.

vickey0rana - Apologies, typo on my part: we use ESXi rather than ESX, so I don't think this applies.

AndreTheGiant · ‎11-28-2010

You can also enable console access on ESXi, and you will find similar files there.

Andre


GavinConway · ‎11-28-2010

Thanks Andre.

In our scenario we use DPM to save on power when the systems aren't needed. That being the case, the cluster will power hosts on and off as it sees fit. What we're seeing is that, more often than not, this power-up causes HA to fail.

The scenario is:

DRS sees VM requirements go up
A WOL packet gets sent to power on a host
The host powers up and, when it reaches 'configuring HA', progress gets to 81% and sits there. After a while it goes up to 87%, but by that point we see an error on the primary host in the cluster. The primary gives a 'healthcheck' error, and this fault then cascades through the primary servers until HA has failed across the board with either the above error or a healthcheck fault.

Does this help? We have tried enabling/disabling HA, we've rebuilt all the servers, and we have upgraded to the latest version of vSphere and ESXi 4.1.

Thanks
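For the "how best to investigate" part, one starting point is to pull the error lines out of the HA agent logs on the host that was powering up and on the primary that reported the healthcheck error. The sketch below is a hedged illustration only: the default directory `/var/log/vmware/aam` is an assumption about where the AAM agent writes its logs on this build, and `scan_ha_logs` and its search pattern are hypothetical names chosen for this example. Pass a different directory as the first argument if your logs live elsewhere.

```shell
#!/bin/sh
# Sketch: list the most recent error-like lines from the HA agent logs.
# Assumption: AAM logs are under /var/log/vmware/aam; override with $1.
# Usage: scan_ha_logs [LOG_DIR] [PATTERN]

scan_ha_logs() {
    log_dir="${1:-/var/log/vmware/aam}"
    pattern="${2:-error|fail|timeout}"
    if [ ! -d "$log_dir" ]; then
        echo "log directory not found: $log_dir" >&2
        return 1
    fi
    # -r: recurse, -i: ignore case, -E: extended regex; keep the last 50 hits
    grep -riE "$pattern" "$log_dir" | tail -n 50
}
```

Comparing the timestamps of the hits on the joining host with those on the primary may show which side fails first.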


depping · ‎11-28-2010

I would also recommend filing a support case to log this issue.

Duncan

VMware Communities User Moderator | VCDX

Soon to be released: vSphere 4.1 HA and DRS deepdive (end of November through Amazon.com)

GavinConway · ‎12-06-2010

Hi vickey,

For those that see this problem, I'd recommend checking iSCSI MTUs where you have them, as well as the KB article for HA issues, but in our case this seems to have solved it:

From ESXi 4.1 SSH console: (You can enable SSH from Configuration > Security Profiles > Properties)

run the uninstall script

./opt/vmware/aam/VMware-aam-ha-uninstall.sh

services.sh stop

services.sh start

Thanks for the assist!

Gavin
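The iSCSI MTU check mentioned above can be sketched as follows. Treat it as an assumption-laden example rather than a verified procedure: the target address `192.0.2.10` and the MTU of 9000 are placeholders, `check_iscsi_mtu` and `icmp_payload_for_mtu` are hypothetical helper names, and the `vmkping` flags (`-d` for "don't fragment", `-s` for payload size) are given as I recall them, so confirm them against the command's help output on your build. With the default `DRY_RUN=1` the commands are only printed.

```shell
#!/bin/sh
# Sketch: end-to-end MTU check for an iSCSI VMkernel path.
# Assumptions: ISCSI_TARGET and MTU below are placeholders for your values.
DRY_RUN="${DRY_RUN:-1}"

# Largest ICMP payload that fits in one unfragmented IPv4 packet:
# MTU minus 20 bytes of IP header and 8 bytes of ICMP header.
icmp_payload_for_mtu() {
    echo $(( $1 - 28 ))
}

check_iscsi_mtu() {
    target="$1"
    mtu="${2:-9000}"
    size=$(icmp_payload_for_mtu "$mtu")
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: esxcfg-vmknic -l"
        echo "would run: vmkping -d -s $size $target"
    else
        esxcfg-vmknic -l                  # lists VMkernel NICs with their MTU
        vmkping -d -s "$size" "$target"   # fails if any hop has a smaller MTU
    fi
}

check_iscsi_mtu "${ISCSI_TARGET:-192.0.2.10}" "${MTU:-9000}"
```

If the large ping fails while a default-size ping succeeds, some device on the path (vSwitch, physical switch or the array port) is probably configured with a smaller MTU than the VMkernel port.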

ahrqnag · ‎01-31-2011

I see you solved your issue with HA when using DPM. Is DPM worth using? We have a blade environment and at least one cluster that is way underutilized, BUT it is NOT something we can resolve (internal issues, I'll leave it at that). That said, does DPM work well for systems like HP blades using iLO?

As for the resolution, I also had to run that console script to remove HA and then stop/restart the services to get HA "reinstalled", as it were. In my opinion that should not be necessary, and HA should not be as easily corrupted as one of VMware's support persons suggested it is. While it's an easy fix, if you are using HA and this happens frequently, it seems more trouble than simply leaving all hosts on continually.

Just curious as to your thoughts on this DPM feature...seems like a good idea, BUT if it causes problems with the cluster features I would just pass.

Thanks for any comments,

Michael

GavinConway · ‎01-31-2011

Hi ahrqnag,

In our case we changed DPM to manual mode, as we couldn't guarantee that the cluster wouldn't power on an additional blade in the middle of the night and affect HA.

We've had HA failures since I posted this, so the fix I mentioned wasn't a complete one. It's really hit and miss whether we'll see an HA failure on power-on. That being the case, we've changed DPM to manual and we power on blades as and when our cluster capacity dictates (we tend to go through a rolling cycle: buy or power on a new blade, upgrade memory in the cluster, power off spare blades, and repeat).

Hope this helps!

Cheers,
Gavin

http://www.uksolutions.co.uk/

ahrqnag · ‎01-31-2011

Thanks Gavin,

Appreciate the input. I decided that DPM is just not beneficial for us as our environment is not that large and this power off/power on really would not give us much return. I guess if we ran hundreds or had large clusters or something we could consider it…for these small clusters I think it just adds another layer to manage.

Good concept on paper, I guess, but in reality its benefit is a bit questionable…maybe in vSphere 5.

Appreciate the input,

Michael Audet

(that ahrqnag name is just our group email in case you were wondering so that we can all see replies)

depping · ‎02-02-2011

I would say that even clusters of up to 5 hosts can benefit. Generally speaking, I've seen customers placing 3 out of 5 hosts in sleep mode during non-production hours. Although that might not seem substantial, it did impact their electricity bills, and especially for SMB customers cutting costs will make a difference at the end of the month/year when the accountant starts looking at the numbers. Now the main issue here, of course, is how do you as an admin show your manager that you were responsible for this decrease in costs, and how do you get credit for it 🙂

These HA issues should not happen, and I would like to ask you to submit a support request so this can be fixed.

Duncan (VCDX)

Available now on Amazon: vSphere 4.1 HA and DRS technical deepdive
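To put a number on the "show your manager" question, a back-of-the-envelope estimate is hosts in standby × watts per host × standby hours per day × 365 ÷ 1000 × price per kWh. The sketch below only illustrates that arithmetic: the host count of 3 comes from the post above, while the 300 W draw, the 12 hours per day and the 0.15 price per kWh are hypothetical placeholders, and `dpm_saving` is a made-up helper name. It also ignores the small amount of power a host still draws in standby.

```shell
#!/bin/sh
# Sketch: rough yearly saving from hosts that DPM keeps in standby.
# All inputs except the host count are hypothetical placeholders;
# substitute measured wattage and your own tariff.

# usage: dpm_saving HOSTS WATTS_PER_HOST HOURS_PER_DAY PRICE_PER_KWH
dpm_saving() {
    awk -v n="$1" -v w="$2" -v h="$3" -v p="$4" 'BEGIN {
        kwh = n * w * h * 365 / 1000    # energy not drawn per year, in kWh
        printf "%.0f kWh/year, %.2f per year\n", kwh, kwh * p
    }'
}

# 3 hosts (from the post), 300 W each, 12 h/day in standby, 0.15 per kWh
dpm_saving 3 300 12 0.15
```

With these placeholder inputs the script prints `3942 kWh/year, 591.30 per year`; measured per-blade wattage from the chassis management interface would give a more defensible figure.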

ahrqnag · ‎02-02-2011

I hear you, and the feature looks good on paper, but our environment is simply not large enough to see any benefits. Powering down a blade or two in these c-Class chassis, which are already power efficient, won’t result in huge savings in my opinion. They simply don’t draw that much power.

As you said, customers with larger clusters or many more hosts could, in fact, see substantial savings when all powered-down systems are ‘added up’, as it were. For us, it’s just not worth the trouble of experiencing HA issues in a hit-or-miss scenario, as I’ve seen in other posts.

Also, as you mentioned, HOW do we quantify this into dollars for management so that we can prove this sub-feature is beneficial? We bought energy-efficient HP blade systems to begin with, so powering down blades is NOT a priority for us. We’ve already removed over 100 physical rack-mounted servers for this VM environment, plus consolidated a lot in our data center: shrinking the room physically, removing 7+ racks of equipment, tightening up all the cabling above the racks to improve air flow, installing meters on our HVAC systems to capture kW stats, and a host of other “green” implementations, so I’m not worried about saving a few more watts with a couple of blades.

I have not yet implemented this at all so no support request is needed. I was trying to get some input on the usefulness of this feature and, right now, it seems 50/50 and truly depends on the customer environment as to whether any real savings will be seen.

Thanks for the input…much appreciated.

Michael