VMware Cloud Community
scerazy
Enthusiast

vSphere HA Waiting for cluster election to complete Operation timed out

I have had enough of reinstalling ESXi 5 (so far this has been the quickest way to get it working in a number of separate incidents; as a side note, vSphere 4 just worked fine in my environment, without THAT MUCH fluffing about).

So now I have a host which almost works (latest upgrade to 515841), but will NOT get configured for HA.

It starts fine and gets to this point:

The vSphere HA availability state of this host has changed to Election

and then gives "nice" Timed out

This host can be dis/connected and pinged, has VMs running on it, and accesses all datastores, yet it will not do HA.

I went through the whole KB article:

http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&docType=kc&docTypeID=DT_KB_1_1&e...

with no resolution

Does anybody have any more (useful) ideas?

Thanks

Seb

36 Replies
admin
Immortal

Yes, those configuration issues are meant to bring to your attention that you don't have an optimal configuration.  You have introduced single points of failure.  So if your NIC or network fails, there is no other path available for HA communication.  Same with the single datastore that HA uses as a backup communication channel.  But HA can still function properly without them, it is just not as robust.

admin
Immortal

How many hosts are in the cluster? Do any of them configure for HA OK, or do they all fail? You can look at the fdm (HA agent) logs on the host (/var/run/log/fdm*) to check for errors.

Elisha

kfinken
Contributor

There are currently just 2 hosts in the cluster. Depending on the order in which I bring them out of maintenance mode, the first one will configure as the Master and the second one times out. It doesn't matter which host is the Master; the second one will always time out. I will look at the logs to see if they offer any insight.

admin
Immortal

Did you make sure the hosts are reachable from each other over the management network (can they ping each other)?

Elisha

kfinken
Contributor

Yes, the hosts can ping each other, the gateway and the DNS servers. When I have HA turned on I cannot migrate VMs from host to host because of an HA error, but when HA is turned off I have no trouble migrating.

admin
Immortal

Migrating VMs uses a different vmkernel port group than HA uses. HA uses the port group with the "management traffic" checkbox checked. vMotion uses the port group with the vMotion checkbox checked.

kfinken
Contributor

When I log in to the direct console on each host and view the support information, should the SSL thumbprints from the 2 hosts match? Right now they don't.

admin
Immortal

No, they should not match.

kfinken
Contributor

If I "Remove" a host does it delete all of the Networking and Storage config?  I would like to try removing it then adding it to the Datacenter but I don't want to configure it again.

admin
Immortal

>>If I "Remove" a host does it delete all of the Networking and Storage  config?  I would like to try removing it then adding it to the  Datacenter but I don't want to configure it again.

No, configuring a host comprises changing settings on the host, not in vCenter. So if you remove the host from vCenter's inventory then add it back, vCenter will query the host's configuration and present those settings as they are.

scerazy
Enthusiast

That is really mad, just added another host to my cluster yesterday & had no problems at all

Seb

Esigolo
Contributor

I was facing the same problem and, after a lot of struggling, I managed to solve it by changing the MTU from 9000 to 1500.
It is not a permanent solution, but I can use it for now.

admin
Immortal

> MTU from 9000 to 1500

When we've seen this problem fixed by adjusting the MTU, the reason is usually that the entire network stack needs a consistent MTU to avoid truncation somewhere in the path. HA does support changing the MTU as long as it is done correctly from the vnic through the vswitch.
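
One way to check for a mismatch along the path is to ping between the management interfaces with the don't-fragment bit set and a payload that exactly fills the MTU. A minimal sketch of the arithmetic, assuming IPv4 with no IP options (20-byte IP header plus 8-byte ICMP header); the helper names and the target address are made up for illustration:

```python
# Sketch: size an ICMP payload so one unfragmented IPv4 packet exactly fills the MTU.
IPV4_HEADER_BYTES = 20  # IPv4 header without options (assumption)
ICMP_HEADER_BYTES = 8   # ICMP echo header


def max_ping_payload(mtu: int) -> int:
    """Largest ICMP echo payload that fits in a single IPv4 packet of size `mtu`."""
    payload = mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES
    if payload < 0:
        raise ValueError(f"MTU {mtu} is too small for an IPv4 ICMP echo")
    return payload


def vmkping_command(target: str, mtu: int) -> str:
    """Build a vmkping command line with don't-fragment (-d) and payload size (-s)."""
    return f"vmkping -d -s {max_ping_payload(mtu)} {target}"
```

For example, `vmkping_command("192.0.2.10", 9000)` gives `vmkping -d -s 8972 192.0.2.10`. If that ping fails from the ESXi shell while the 1500-byte equivalent (payload 1472) succeeds, something between the two hosts is probably not passing jumbo frames.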

depping
Leadership

My article was updated to reflect this, by the way. As Marc said, these problems are usually caused by a lack of "end to end" jumbo frame configuration.

http://www.yellow-bricks.com/2012/01/20/no-jumbo-frames-on-your-management-network/

txolsonint
Contributor

The “fix” did not work for me, but disabling HA at a cluster level, then re-enabling did work.

Locoride
Contributor

So I spent a lot of time this evening working through similar issues. I had successfully configured a cluster with 4 virtual hosts on a new vSphere 5 U2 vCenter. I finally obtained some signed certificates and replaced all of the self-signed certs. I confirmed that the ESXi hosts saw the new thumbprints using the support information from the DCUI. Immediately after doing this I began noticing that HA was acting up. I couldn't vMotion anything because HA errors were being detected on all hosts. The vCenter log showed "The vSphere HA availability state of this host has changed to Unreachable". On some hosts it was just hung at the election screen. However, the HA master was showing as green, so it appeared to be affecting only the slaves.

Knowing that this was related to SSL I began researching and came across this KB: 2006210.  Using this information I ran the following SQL query against the vCenter database:

SELECT id,EXPECTED_SSL_THUMBPRINT,HOST_SSL_THUMBPRINT FROM dbo.VPX_HOST

This returned the thumbprint of the host and what was expected from vCenter. What was really odd was that they matched in the database. However, when I compared the thumbprints to those of the certificates, they were different. It appeared that vCenter didn't pick up the new thumbprints when I removed and re-added the hosts to vCenter. I tried multiple things, like disconnecting and reconnecting and removing the hosts altogether. I believe there are scripts to correct this, but I took the manual approach for the small number of virtual hosts that were impacted. I used the following query to modify the thumbprints in SQL:

UPDATE dbo.VPX_HOST SET EXPECTED_SSL_THUMBPRINT = 'thumbprint' WHERE id = 'hostid';
UPDATE dbo.VPX_HOST SET HOST_SSL_THUMBPRINT = 'thumbprint' WHERE id = 'hostid';

Make sure you replace thumbprint and hostid with the values from your own certificates and host IDs.
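
To get the thumbprint value from a certificate file rather than reading it off the DCUI, something like the following can help. This is a sketch under the assumption that the thumbprint is the SHA-1 digest of the DER-encoded certificate, written as colon-separated uppercase hex; check one value against what the DCUI shows before relying on it. The function names are made up for illustration:

```python
import hashlib
import ssl


def thumbprint_from_der(der_bytes: bytes) -> str:
    """SHA-1 digest of DER certificate bytes as colon-separated uppercase hex."""
    digest = hashlib.sha1(der_bytes).hexdigest().upper()
    return ":".join(digest[i:i + 2] for i in range(0, len(digest), 2))


def thumbprint_from_pem(pem_text: str) -> str:
    """Same digest, starting from a PEM-encoded certificate string."""
    return thumbprint_from_der(ssl.PEM_cert_to_DER_cert(pem_text))


def same_thumbprint(a: str, b: str) -> bool:
    """Compare two thumbprints, ignoring case, colons and surrounding spaces."""
    def normalize(value: str) -> str:
        return value.strip().replace(":", "").upper()
    return normalize(a) == normalize(b)
```

The PEM text can come from a copy of the host's certificate file, or from `ssl.get_server_certificate((hostname, 443))` run against the host, where `hostname` is whatever your ESXi host is called. `same_thumbprint` is then handy for checking the result against the values returned by the SELECT above.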

One challenge I ran into was figuring out which host ID belonged to which virtual host. If the host has VMs, you can use this query to work out the host IDs.

SELECT vpxv_vms.vmid, vpxv_vms.name, vpxv_vms.hostid, vpxv_hosts.name FROM vpxv_vms JOIN vpxv_hosts ON vpxv_vms.hostid = vpxv_hosts.hostid

I'm convinced this is a bug, and it may only affect people who replace certificates after the hosts have been added to vCenter. I hope to do more testing in my lab to see if I can reproduce this problem. Hope this saves others from the same headache I had.

likeahoss
Enthusiast

This continues to be a problem in vSphere 6.0 u1a when replacing self-signed certificates with 3rd party CA certificates.  Thank you for posting this and pointing me in the right direction!

To get all the details of the hosts, I was able to run this query against the vCenter DB:

SELECT * FROM dbo.VPX_HOST

If you want to display the DNS name and IP address to help match the thumbprints to host info, you can query:

SELECT ID,DNS_NAME,IP_ADDRESS,EXPECTED_SSL_THUMBPRINT,HOST_SSL_THUMBPRINT FROM dbo.VPX_HOST


Alternatively, following these steps updated the host's SSL certificate thumbprint in the vCenter database without stopping vCenter services:


  1. Log in to vCenter Server.
  2. Place the host into Maintenance Mode.
  3. Right-click on the host and click Disconnect.
  4. Remove the disconnected host from the cluster.
  5. Right-click on the disconnected host and select Connect.
  6. Add the host back to the cluster.
  7. Exit the host from Maintenance Mode.
  8. Query the SQL database before and after these steps and you'll see the thumbprint update.
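
If you save the query output before and after these steps, a small comparison makes it obvious which hosts actually changed. A rough sketch, assuming you have loaded each result into a dict mapping host ID to its EXPECTED_SSL_THUMBPRINT value and that the host keeps the same ID across the disconnect/reconnect; the function name and the sample values are made up:

```python
def changed_thumbprints(before: dict, after: dict) -> dict:
    """Return {host_id: (old, new)} for hosts whose thumbprint differs between snapshots.

    Hosts missing from `after` are skipped rather than reported as changed.
    """
    changes = {}
    for host_id, old in before.items():
        new = after.get(host_id)
        if new is not None and new != old:
            changes[host_id] = (old, new)
    return changes


# Made-up sample data standing in for two runs of the SELECT above
before = {1: "AA:BB", 2: "CC:DD"}
after = {1: "AA:BB", 2: "EE:FF"}
print(changed_thumbprints(before, after))
```

An empty result after running the steps would suggest that vCenter did not pick up the new certificate for any host.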