VMware Cloud Community
tman24
Enthusiast

HA won't initialize on cluster host

We have a 4-node vSphere 5.0 cluster, and I've just been finishing off replacing two of the older nodes with brand new (identical) servers. The first host came across fine and joined the cluster without a problem. The second one will join, but HA will not initialize correctly. After a period of time, I just get 'vSphere HA Agent Unreachable' next to the new host. I've dug around and found quite a few references to this problem, but nothing seems to fix it. I'm reasonably sure that the very first time I added the new host, everything looked OK in vCenter. I then proceeded to raise the EVC level (part of the plan), and although I can't say it happened exactly when I did this, on or around the same time I started getting the error.

What I've tried:

1. Full reboot of the new host (no change)

2. Checking all management IPs on the hosts (all OK; a scripted version of checks 2-4 follows this list)

3. ICMP check between all cluster hosts and vCenter (all OK)

4. Checked DNS (all OK)

5. Checked 'vCenter requires verified SSL host certificates' (all OK)

6. Reverting the EVC level back to the original setting (did not fix it)
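For anyone who wants to repeat checks 2-4 in one go, here is a minimal sketch using VMware's pyVmomi Python SDK; the vCenter address, credentials, cluster name and host names are placeholders, not values from this environment.

# List every host's vmkernel interfaces, flag which ones carry management
# traffic, and confirm forward/reverse DNS for each host name.
import socket
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()   # lab shortcut; validate the certificate in production
si = SmartConnect(host='vcenter.example.com', user='administrator',
                  pwd='password', sslContext=ctx)
content = si.RetrieveContent()
view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.ClusterComputeResource], True)
cluster = next(c for c in view.view if c.name == 'MyCluster')

for host in cluster.host:
    # Which vmknics are tagged for management traffic on this host?
    net_cfg = host.configManager.virtualNicManager.QueryNetConfig('management')
    mgmt_keys = set(net_cfg.selectedVnic or [])
    for vnic in host.config.network.vnic:
        tagged = any(vnic.device in key for key in mgmt_keys)
        print(host.name, vnic.device, vnic.spec.ip.ipAddress,
              'management' if tagged else 'non-management')
    # Forward and reverse DNS for the host name itself.
    try:
        addr = socket.gethostbyname(host.name)
        print(host.name, '->', addr, '->', socket.gethostbyaddr(addr)[0])
    except socket.error as err:
        print(host.name, 'DNS problem:', err)

Disconnect(si)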

I've had a look at vpxa.log, and there don't seem to be any problems I can see there, but fdm.log is reporting the following:

2014-11-14T14:51:08.306Z [FFD64B90 verbose 'Cluster' opID=SWI-e6ab007a] [ClusterManagerImpl::IsBadIP] 192.168.xxx.122 is bad ip

2014-11-14T14:51:08.306Z [FFD64B90 verbose 'Cluster' opID=SWI-e6ab007a] [ClusterManagerImpl::IsBadIP] 192.168.xxx.113 is bad ip

2014-11-14T14:51:08.359Z [FFD64B90 verbose 'Cluster' opID=SWI-e6ab007a] [ClusterManagerImpl::IsBadIP] 192.168.xxx.116 is bad ip

In this case, xxx.113 is the management IP of another host in the cluster, xxx.116 is the management IP of the cluster master, and xxx.122 is another IP address bound to one of the other hosts, used only for mounting NFS volumes (not management traffic). fdm.log just repeats these three addresses over and over as 'is bad ip'. There's nothing wrong with these addresses. They're not duplicates, and they're all in DNS correctly.

So, while I can pretty much see the errors that are causing this, I don't know how to fix it. I could really do with some advice.
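In case it's useful for anyone comparing notes, here is a minimal pyVmomi sketch for checking what state the HA (FDM) agent reports for each host in the cluster; the vCenter address, credentials and cluster name are placeholders.

# Print the HA/FDM state of every host, e.g. 'master', 'connectedToMaster'
# or 'fdmUnreachable' (shown as 'vSphere HA Agent Unreachable' in the client).
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()
si = SmartConnect(host='vcenter.example.com', user='administrator',
                  pwd='password', sslContext=ctx)
content = si.RetrieveContent()
view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.ClusterComputeResource], True)
cluster = next(c for c in view.view if c.name == 'MyCluster')

for host in cluster.host:
    das = host.runtime.dasHostState   # None if HA has not been configured on the host yet
    print(host.name, das.state if das else 'no HA state reported')

Disconnect(si)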

7 Replies
bspagna89
Hot Shot

Hi,

A few questions...

1. Are you using standard MTU or jumbo frames? This is important because HA can fail when jumbo frames aren't enabled consistently throughout your network.

2. For the secondary IP that you're using for NFS, can you make sure that the VMkernel interface is not enabled for management traffic? (Both can be checked with the sketch below.)
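A quick way to check both from the vCenter side, as a minimal pyVmomi sketch; the vCenter address, credentials and host name are placeholders.

# Print the MTU of every vmkernel interface on the host and whether it is
# tagged for management traffic.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()
si = SmartConnect(host='vcenter.example.com', user='administrator',
                  pwd='password', sslContext=ctx)
content = si.RetrieveContent()
view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.HostSystem], True)
host = next(h for h in view.view if h.name == 'esxi-new.example.com')

net_cfg = host.configManager.virtualNicManager.QueryNetConfig('management')
mgmt_keys = set(net_cfg.selectedVnic or [])
for vnic in host.config.network.vnic:
    tagged = any(vnic.device in key for key in mgmt_keys)
    print(vnic.device, 'MTU', vnic.spec.mtu,
          'management' if tagged else 'not management')

Disconnect(si)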

Let me know.

New blog - https://virtualizeme.org/
RyanH84
Expert

Hi there,

I had a similar problem in our 5.5 environment not so long ago. After I had rebuilt the vCenter that was controlling our environment, several clusters would not fully enable HA. A master would be elected, but the other hosts would not become slaves; they would time out and eventually give the same error.

I tried a lot of things, including everything you have listed plus:

- Disabling HA on the cluster / re-enabling

- Disconnecting the host from vCenter / reconnecting.

- Removing the host entirely from vCenter and re-adding.

- Manually removing the HA agent from the host and re-installing (similar to this article; the API equivalent of a host HA reconfigure is sketched below)
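For reference, the 'Reconfigure for vSphere HA' action that redeploys the FDM agent on a host can also be driven from the API. A minimal pyVmomi sketch, with the host name and credentials as placeholders; note this is only the reconfigure action, not the full manual agent removal that article describes.

# Trigger 'Reconfigure for vSphere HA' on a single host, which reinstalls and
# reconfigures the HA (FDM) agent on it.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

ctx = ssl._create_unverified_context()
si = SmartConnect(host='vcenter.example.com', user='administrator',
                  pwd='password', sslContext=ctx)
content = si.RetrieveContent()
view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.HostSystem], True)
host = next(h for h in view.view if h.name == 'esxi-new.example.com')

WaitForTask(host.ReconfigureHostForDAS_Task())
Disconnect(si)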

I spoke to a support guy from our vendor (not directly with VMware support) and he wasn't much help. I then noticed that the same hosts would be elected as cluster master every time and the others would fail/time out. So my fix was (the whole sequence is also sketched after these steps):

- Disable HA.

- Put the recurring master host into maintenance mode.

- Enable HA

- At this point, the failing host would become a master.

- Bring the other host out of maintenance mode and it would become a slave.

- Repeat as necessary for any hosts that wouldn't join the HA cluster.

- Disable and re-enable HA to watch the cluster election complete gracefully with no issues.
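Here is a minimal pyVmomi sketch of that sequence, under the assumption that 'MyCluster', 'master-host.example.com' and the credentials are placeholders for your own environment.

# Script the fix above: disable HA, put the recurring master into maintenance
# mode, re-enable HA so the failing host can take over as master, bring the
# old master back, then toggle HA once more for a clean election.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

ctx = ssl._create_unverified_context()
si = SmartConnect(host='vcenter.example.com', user='administrator',
                  pwd='password', sslContext=ctx)
content = si.RetrieveContent()
view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.ClusterComputeResource], True)
cluster = next(c for c in view.view if c.name == 'MyCluster')
master = next(h for h in cluster.host if h.name == 'master-host.example.com')

def set_ha(enabled):
    # Reconfigure only the HA (das) part of the cluster configuration.
    spec = vim.cluster.ConfigSpecEx(
        dasConfig=vim.cluster.DasConfigInfo(enabled=enabled))
    WaitForTask(cluster.ReconfigureComputeResource_Task(spec, modify=True))

set_ha(False)                                              # disable HA on the cluster
# Running VMs must be migrated off (or powered off) before this task completes.
WaitForTask(master.EnterMaintenanceMode_Task(timeout=0))   # maintenance mode on the old master
set_ha(True)                                               # re-enable HA; the failing host can become master
WaitForTask(master.ExitMaintenanceMode_Task(timeout=0))    # old master comes back as a slave
set_ha(False)                                              # final disable/re-enable for a clean election
set_ha(True)

Disconnect(si)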

It was a very odd fix to the issue, and like you I had searched a great deal for an answer. The last resort would have been to go directly to VMware support and potentially eject the hosts from the cluster and rebuild them and/or the cluster itself. (Luckily it wasn't fully in use, so we didn't have to go down that road.)

I hope this helps, and good luck if not!


Regards,

Ryan

Regards, Ryan | vExpert, VCP5, VCAP5-DCA, MCITP, VCE-CIAE, NPP4 | @vRyanH | http://vRyan.co.uk
tman24
Enthusiast

Thanks for the reply.

MTU is definitely set to 1500 on all management interfaces. We've never used jumbo frames on our vSphere hosts.

I can also confirm that the second IP used for NFS definitely ISN'T configured for management traffic.

I can also confirm that:

Remove/re-add of the ESXi host doesn't fix it.

Disconnect/reconnect doesn't fix it.

tman24
Enthusiast

That's a very strange 'fix'. I was going to try turning off HA on the cluster, then re-enabling it to see if that helped. You suggest it won't, but I think I'll have to give it a try. I'll report back, but your 'solution' is one I might have to try if nothing else helps.

RyanH84
Expert

Sure thing, it's worth a try and can't hurt. If HA isn't working properly on the hosts, disabling it and re-enabling it won't affect much.

My solution is fairly non-intrusive, and it's even easier if there are plenty of resources within the cluster to satisfy hosts going into maintenance mode.

Regards, Ryan | vExpert, VCP5, VCAP5-DCA, MCITP, VCE-CIAE, NPP4 | @vRyanH | http://vRyan.co.uk
tman24
Enthusiast

Well, I think I've cracked it. In my case, it looks like disabling and re-enabling HA on the cluster sorted it. This forced a full cluster re-election, and the failed host was then let back in. Surprisingly, the same host that was master before became the master again, but all the 'is bad ip' messages disappeared from the fdm.log file on the failed host.
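If anyone wants that step scripted, the cluster-wide disable/re-enable is just a toggle of the HA config via the API. A minimal pyVmomi sketch, with the cluster name and credentials as placeholders:

# Disable and then re-enable HA on the whole cluster to force a fresh FDM
# master election.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

ctx = ssl._create_unverified_context()
si = SmartConnect(host='vcenter.example.com', user='administrator',
                  pwd='password', sslContext=ctx)
content = si.RetrieveContent()
view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.ClusterComputeResource], True)
cluster = next(c for c in view.view if c.name == 'MyCluster')

for enabled in (False, True):
    spec = vim.cluster.ConfigSpecEx(
        dasConfig=vim.cluster.DasConfigInfo(enabled=enabled))
    WaitForTask(cluster.ReconfigureComputeResource_Task(spec, modify=True))

Disconnect(si)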

I'm going to let it sit like this over the weekend before migrating any VMs back onto the host next week.

Thanks for your help.

dorth
Contributor

This resolved my issue too. 2 of my 8 hosts were in HA, and the rest were not. Fiddling with them one at a time (maintenance mode, reboot, re-enable HA) would not resolve the issue. /var/log/fdm.log was full of these (addresses obscured with x's):

2017-06-30T18:00:06.710Z [FFCxxxxxx verbose 'Cluster' opID=SWI-8601xxxxx] [ClusterManagerImpl::IsBadIP] xx.xx.xx.176 is bad ip

Taking the whole cluster out of HA and then re-enabling it completely resolved the problem. I didn't have to mess around with the hosts individually.
