VMware Cloud Community
Joris85
Enthusiast

vSphere High Availability "Election" fails at 99% "operation timed out" at 1 of the 2 hosts

Hello,

We had a system with 1 ESXi 5.1 host with local disks.

We are now adding redundancy by introducing an ESXi 5.5 U2 host and a vCenter 5.5 appliance.

After installing and adding everything to vCenter, we upgraded the ESXi 5.1 host to ESXi 5.5 U2. The SAN is operating correctly (vMotion is working on a separate NIC).

Now, when I try to enable High Availability, both servers install the HA agent and start the "Election" phase.

All four datastores on the SAN are selected for HA datastore heartbeating, and the isolation response is the default "Leave powered on".

One server always completes this process, while the other keeps "electing" until it reaches 100% and the election fails with "operation timed out".

I have seen this problem on both servers, so I think the elected "master" is not affected, only the "slave".

I have checked the following articles and carried out their steps, but none worked:

VMware KB:     Reconfiguring HA (FDM) on a cluster fails with the error: Operation timed out  

- The services were running

VMware KB:     Configuring HA in VMware vCenter Server 5.x fails with the error: Operation Timed out...

- All MTUs were set to 1500 (the verification commands I used are shown further down)

VMware KB:     Configuring VMware High Availability fails with the error: Cannot complete the config...

- The default gateway was not the same on both hosts, but I corrected this. There are no static routes. The HA isolation response is "Leave powered on". After correcting the gateway and disabling/re-enabling HA, the problem was still the same (the gateway check commands are shown further down).

VMware KB:     Verifying and reinstalling the correct version of the VMware vCenter Server agents  

- I followed "Reinstalling the ESX host management agents and HA agents on ESXi" for the HA agent, and verified that it was uninstalled and reinstalled when re-enabling HA.

# Copy the FDM (HA agent) uninstaller to /tmp, make it executable, and run it
cp /opt/vmware/uninstallers/VMware-fdm-uninstall.sh /tmp
chmod +x /tmp/VMware-fdm-uninstall.sh
/tmp/VMware-fdm-uninstall.sh

I did this on both hosts. It actually fixed the election problem, and I was even able to run an HA test successfully. However, when I then powered down the second server (to test HA in the other direction), HA did not fail over to the first host and everything remained down. After clicking "Reconfigure HA", the election problem appeared again on one of the hosts.
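
For reference, these are the kinds of commands I used for the MTU and default gateway checks mentioned above (run from the ESXi shell on each host; the IP addresses are just illustrative):

# List the VMkernel interfaces with their IP addresses and MTU
esxcfg-vmknic -l

# Test end-to-end MTU with the don't-fragment bit set
# (1472 = 1500 minus 28 bytes of IP/ICMP overhead)
vmkping -d -s 1472 192.27.224.139

# Show the VMkernel routing table, including the default gateway,
# and set a new default gateway if it needs correcting
esxcfg-route -l
esxcfg-route 192.27.224.1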

These are some extracts from the logs:

-The vSphere HA availability state of this host has changed to Election  info  11/29/2014 10:03:00 PM     192.27.224.138

-vSphere HA agent is healthy  info  11/29/2014 10:02:56 PM     192.27.224.138

-The vSphere HA availability state of this host has changed to Master  info  11/29/2014 10:02:56 PM     192.27.224.138

-The vSphere HA availability state of this host has changed to Election  info  11/29/2014 10:01:26 PM     192.27.224.138

-vSphere HA agent is healthy  info  11/29/2014 10:01:22 PM     192.27.224.138

-The vSphere HA availability state of this host has changed to Master  info  11/29/2014 10:01:22 PM   192.27.224.138

-The vSphere HA availability state of this host has changed to Election  info  11/29/2014 10:03:02 PM     192.27.224.139

-Alarm 'vSphere HA host status' on 192.27.224.139 changed from Green to Red  info  11/29/2014 10:02:58 PM     192.27.224.139

-vSphere HA agent for this host has an error: vSphere HA agent cannot be correctly installed or configured  warning  11/29/2014 10:02:58 PM     192.27.224.139

-The vSphere HA availability state of this host has changed to Initialization Error  info  11/29/2014 10:02:58 PM     192.27.224.139

-The vSphere HA availability state of this host has changed to Election  info  11/29/2014 10:00:52 PM     192.27.224.139

-Datastore DSMD3400DG2VD2 is selected for storage heartbeating monitored by the vSphere HA agent on this host  info  11/29/2014 10:00:49 PM     192.27.224.139

-Datastore DSMD3400DG2VD1 is selected for storage heartbeating monitored by the vSphere HA agent on this host  info  11/29/2014 10:00:49 PM     192.27.224.139

-Firewall configuration has changed. Operation 'enable' for rule set fdm succeeded.  info  11/29/2014 10:00:45 PM     192.27.224.139

-The vSphere HA availability state of this host has changed to Uninitialized  info  11/29/2014 10:00:40 PM  Reconfigure vSphere HA host  192.27.224.139  root

-vSphere HA agent on this host is disabled  info  11/29/2014 10:00:40 PM  Reconfigure vSphere HA host  192.27.224.139  root

-Reconfigure vSphere HA host  192.27.224.139  Operation timed out.     root  HOSTSERVER01  11/29/2014 10:00:31 PM  11/29/2014 10:00:31 PM  11/29/2014 10:02:51 PM

-Configuring vSphere HA  192.27.224.139  Operation timed out.     System  HOSTSERVER01  11/29/2014 9:56:42 PM  11/29/2014 9:56:42 PM  11/29/2014 9:58:55 PM

Can someone please provide me with some help here?

Or suggest extra things I can check or provide?

I am running out of options currently.

Best Regards,

Joris

P.S. I had problems with cold migration when implementing the SAN. After setting everything up (vMotion, upgrading ESXi), those problems were gone.

When searching for this error, I came to this article: VMware KB: VMware vCenter Server displays the error: Failed to connect to host

That cause could make sense, since the vCenter server was replaced and the IP addressing was changed during the implementation.

However, in the vpxa.cfg files, the <hostip> and <serverip> entries are correct (checked using https://<hostip>/host).
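
For anyone who wants to check the same thing, this is roughly how I looked at it from the ESXi shell on each host (the path is the standard vpxa config location on ESXi 5.x):

# Show the host and vCenter IPs that the vpxa agent is configured with
grep -iE "hostip|serverip" /etc/vmware/vpxa/vpxa.cfg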

I tried this again today; no problem at all.

P.P.S. I have configured more of these systems from scratch in the past with no problem (though this is an 'upgrade').

6 Replies
vbrowncoat
Expert

Have you verified that the management interfaces on both hosts can reach each other (using vmkping, etc)?
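
For example, from the ESXi shell on each host (vmk0 as the management interface and the address are just placeholders):

# Ping the other host's management IP through a specific VMkernel port
vmkping -I vmk0 192.27.224.139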

Joris85
Enthusiast

Yes, they can ping each other and vCenter, in all directions.

However, in the vCenter ARP table there is an entry for a nonexistent IP address (the previous address of the vCenter server itself; we changed it during setup).

Could this be causing the issue?
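
In case it matters, this is a sketch of how such a stale entry could be checked and cleared from the vCenter appliance's shell (assuming the appliance's standard Linux tools; the address below stands in for the old vCenter IP):

# Show the current ARP cache
arp -n

# Delete the stale entry for the old vCenter IP
arp -d 192.27.224.140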

schepp
Leadership

Please check whether this wrong IP is still configured as the managed IP address in vCenter:

VMware KB: Verifying the VMware vCenter Server Managed IP Address

Tim

Joris85
Enthusiast

In that field, no IP is configured at all.

[Screenshot: vCenter Runtime Settings showing the Managed IP address field empty]

We have other sites where HA is working perfectly; I checked them, and they also have this field empty.

EDIT: I entered the IP address, but HA still times out.

Joris85
Enthusiast

OK, the issue is fixed.

I contacted Dell Pro Support (the OEM that supplied the license); they checked the logs (fdm.log) and found that the default gateway IP was not reachable.

The default gateway is the default host isolation address used by HA.

Because this is an isolated production system, the supplied gateway address turned out to be reserved for future use only.

I changed the default gateway to a pingable management address on the switch that connects both hosts.

This solved everything.
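
For anyone hitting the same thing: HA pings the default gateway as its default isolation address, so it must answer ping from each host. A quick check from the ESXi shell (the gateway address is illustrative):

# Confirm the configured default gateway and that it answers ping
esxcfg-route -l
vmkping 192.27.224.1

# If the real gateway cannot be made pingable, HA can be pointed at another
# pingable address instead, via the cluster's HA advanced options:
#   das.usedefaultisolationaddress = false
#   das.isolationaddress0 = <pingable management IP>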

capcom700
Contributor

Hello there

I saw the same issue in one environment; this KB from VMware fixed the problem:

VMware KB: VMware High Availability fails to configure when a Denial Of Service feature is enabled o...

The problem was the physical switch. I hope this works for you.

Regards
