VMware Cloud Community
MarcLaflamme1
Enthusiast

HA agent in cluster has an error

ESX 3.5 - 2 node cluster managed by VCenter 2.5

Working fine up until last week when I saw the error appear on the second node "HA agent on xxx in cluster yyy in zzz has an error".

I clicked "Reconfigure for VMware HA" and that worked for about 30 seconds then errored again.

The detailed events for that host in VCenter report sufficient resources while HA is being enabled, then change to "Insufficient resources to satisfy HA failover level on cluster".

We haven't added any new machines on either host nor has the configuration changed. Resource Distribution is 0-10% for CPU and 20-30% RAM on one host and 30-40% RAM on second host.

I checked the vmware_hostname.log file on the problematic host, and the only thing that seems wrong is:

Error FT Mon Nov  5 14:22:14 2012
By: FullTime/Process Monitor on Node: hostname
MESSAGE: Invalid Failure Detection IP Address 10.99.10.152, please fix.

followed by

Warning SEC Mon Nov  5 14:22:14 2012
By: FT/Agent on Node: msvottsanhost1
MESSAGE: Rejected Message. msgid 98 from (1/3:24716.0)

Then it continues with "Node is running" and both hosts are receiving heartbeats from each other.
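
For anyone who wants to pull the same kind of entries out of their own log, here is a minimal sketch in Python. It assumes every entry has the three-line shape shown above (a header line with severity, component and timestamp, then a "By:" line, then a "MESSAGE:" line). The function name and field names are my own, and I haven't checked it against other entry types in that log.

```python
import re

# Header line as seen in the excerpt above, e.g.
# "Error FT Mon Nov  5 14:22:14 2012" (severity, component, timestamp).
# Assumption: all entries follow this layout.
HEADER = re.compile(
    r"^(\w+)\s+(\S+)\s+"
    r"(\w{3}\s+\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}\s+\d{4})$"
)

SAMPLE = """\
Error FT Mon Nov  5 14:22:14 2012
By: FullTime/Process Monitor on Node: hostname
MESSAGE: Invalid Failure Detection IP Address 10.99.10.152, please fix.

Warning SEC Mon Nov  5 14:22:14 2012
By: FT/Agent on Node: msvottsanhost1
MESSAGE: Rejected Message. msgid 98 from (1/3:24716.0)
"""


def parse_entries(text):
    """Group header / By: / MESSAGE: lines into one dict per log entry."""
    entries = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        match = HEADER.match(line)
        if match:
            current = {
                "severity": match.group(1),
                "component": match.group(2),
                # Collapse the padded day-of-month ("Nov  5" -> "Nov 5").
                "time": " ".join(match.group(3).split()),
                "by": None,
                "message": None,
            }
            entries.append(current)
        elif current is not None and line.startswith("By:"):
            current["by"] = line[len("By:"):].strip()
        elif current is not None and line.startswith("MESSAGE:"):
            current["message"] = line[len("MESSAGE:"):].strip()
    return entries


if __name__ == "__main__":
    for entry in parse_entries(SAMPLE):
        if entry["severity"] in ("Error", "Warning"):
            print(entry["time"], entry["severity"], entry["message"])
```

To run it against a real file, read the file into a string and pass that to parse_entries instead of SAMPLE.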

We've tried disabling HA and re-enabling but that didn't work.

Under DNS and Routing, both hosts match domain, preferred/alternate DNS, search domains, default gateways for service console and VMKernel (all lowercase too).

Running out of things to check/try! Any direction would be appreciated.


Accepted Solutions
depping
Leadership

First thing you should do is check the following:

1) Is Admission Control enabled?

If the answer is yes:

2) Is there a reservation set on any of the VMs?

My guess is that the answer to both is yes. Remove the reservation or change the type of admission control you are using. If you want to know more about admission control, you can find details here: http://www.yellow-bricks.com/vmware-high-availability-deepdiv/#HA-admission
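
To illustrate why a single reservation can matter, here is a rough sketch in Python of slot-based admission control in a simplified form. Everything in it is an assumption for illustration: the 256 MHz / 256 MB defaults are used as a floor, the host sizes and VM counts are made up, and memory overhead is ignored. The exact rules differ between versions, so treat it as a toy model, not as what HA actually computes.

```python
from dataclasses import dataclass


@dataclass
class VM:
    name: str
    cpu_reservation_mhz: int = 0
    mem_reservation_mb: int = 0
    powered_on: bool = True


def slot_size(vms, default_cpu_mhz=256, default_mem_mb=256):
    """Slot = largest CPU and memory reservation among powered-on VMs.

    Assumption: the defaults act as a minimum when reservations are small
    or not set at all.
    """
    running = [vm for vm in vms if vm.powered_on]
    cpu = max([vm.cpu_reservation_mhz for vm in running] + [default_cpu_mhz])
    mem = max([vm.mem_reservation_mb for vm in running] + [default_mem_mb])
    return cpu, mem


def slots_per_host(host_cpu_mhz, host_mem_mb, slot):
    """A host holds as many slots as its scarcer resource allows."""
    cpu, mem = slot
    return min(host_cpu_mhz // cpu, host_mem_mb // mem)


def can_tolerate_one_failure(hosts, vms):
    """True if the running VMs still fit after losing the largest host."""
    slot = slot_size(vms)
    per_host = [slots_per_host(cpu, mem, slot) for cpu, mem in hosts]
    remaining = sum(per_host) - max(per_host)
    needed = sum(1 for vm in vms if vm.powered_on)
    return needed <= remaining


if __name__ == "__main__":
    # Hypothetical two-node cluster: 16 GHz CPU and 32 GB RAM per host.
    hosts = [(16000, 32768), (16000, 32768)]
    vms = [VM(f"vm{i}") for i in range(10)]
    print(can_tolerate_one_failure(hosts, vms))

    # One VM with a large CPU reservation inflates the slot for everyone.
    vms.append(VM("big", cpu_reservation_mhz=8000))
    print(can_tolerate_one_failure(hosts, vms))
```

In this toy model one VM with an 8000 MHz reservation shrinks each host from 62 slots to 2, which is why lightly loaded hosts can still be reported as short on failover capacity.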

7 Replies
vmroyale
Immortal

Note: Discussion successfully moved from VMware vCenter™ Server to Availability: HA & FT

Brian Atkinson | vExpert | VMTN Moderator | Author of "VCP5-DCV VMware Certified Professional-Data Center Virtualization on vSphere 5.5 Study Guide: VCP-550" | @vmroyale | http://vmroyale.com

MarcLaflamme1
Enthusiast

Thanks depping,

Admission Control was set to "Allow VMs to be powered on even if they violate availability constraints", so that wasn't the issue. When I initially checked reservations at the machine level I didn't see anything set (and we never set these anyway). However, when I checked the Resource Allocation tab at the cluster level, I saw 2 machines (one powered on and one powered off) with the CPU reservation set to the max (it was easy to miss the Unlimited check box!). Upon further investigation, this is the only machine that neither I nor the other admin created (a third user with occasional access did). I disabled this (set to unlimited), repaired HA, and it hasn't freaked out yet!

What bothered me, though, was that the error being logged made this look like a networking issue:

Error FT Mon Nov  5 14:22:14 2012
By: FullTime/Process Monitor on Node: hostname
MESSAGE: Invalid Failure Detection IP Address 10.99.10.152, please fix.

What was that referring to? Oh well, issue resolved (so far; we'll see how it fares throughout the day).

depping
Leadership

I have no clue why you received that error, and you shouldn't have received the "resource constraint" error either when your admission control is configured like this. Anyway, as you said, it is solved :)

MarcLaflamme1
Enthusiast

Well, I checked this morning and was greeted with a nice red flag: "Insufficient resources to satisfy HA failover level on cluster".

Looks like it errored at 7:13pm last night and then again at 5:33am this morning.

Now I'm really stumped!

depping
Leadership

Do you have support? I would suggest calling them. More importantly, I would recommend upgrading to a later version (we are at 5.1 at the moment).

MarcLaflamme1
Enthusiast

We're in the process of renewing our support (it just recently lapsed). We are also in the process of upgrading our entire VMware infrastructure to ESXi 4.1. I just wanted to have a stable environment before and while doing the upgrade.

I'll figure something out eventually (I hope!)

Thanks for the help.
