We have a 4-node vSphere 5.0 cluster, and I've just finished replacing two of the older nodes with brand new (identical) servers. The first host came across fine and joined the cluster without a problem. The second one will join, but HA will not initialize correctly. After a period of time, I just get 'vSphere HA Agent Unreachable' next to the new host. I've dug around and found quite a few references to this problem, but nothing seems to fix it. I'm reasonably sure that the very first time I added the new host, everything looked ok in vCenter. I then proceeded to raise the EVC level (part of the plan), and although I can't say it happened exactly when I did this, the error started on or around the same time.
What I've tried:
1. Full reboot of the new host (no change)
2. Checked all management IPs on the hosts (all ok)
3. ICMP check between all cluster hosts and vCenter (all ok)
4. Checked DNS (all ok)
5. Checked 'vCenter requires verified SSL host certificates' (all ok)
6. Reverted the EVC level back to the original setting (did not fix it)
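For the DNS check, a small script like this makes it easy to re-verify forward resolution for every name involved in HA. This is just a sketch: the host names below are placeholders for your own vCenter and ESXi management names. It uses getent, so run it from a Linux admin box rather than the ESXi busybox shell (which only has nslookup):

```shell
#!/bin/sh
# Forward-resolve each name involved in HA (vCenter plus every host's
# management name) and flag anything that fails to resolve.
# The names below are placeholders -- substitute your own.
for h in localhost vcenter.example.local esx01.example.local; do
    addr=$(getent hosts "$h" | awk '{print $1; exit}')
    if [ -n "$addr" ]; then
        echo "OK:   $h -> $addr"
    else
        echo "FAIL: $h does not resolve"
    fi
done
```

A FAIL for any management name here would be a far more likely HA culprit than the certificate setting.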
I've had a look at vpxa.log, and I can't see any problems there, but fdm.log is reporting the following:
2014-11-14T14:51:08.306Z [FFD64B90 verbose 'Cluster' opID=SWI-e6ab007a] [ClusterManagerImpl::IsBadIP] 192.168.xxx.122 is bad ip
2014-11-14T14:51:08.306Z [FFD64B90 verbose 'Cluster' opID=SWI-e6ab007a] [ClusterManagerImpl::IsBadIP] 192.168.xxx.113 is bad ip
2014-11-14T14:51:08.359Z [FFD64B90 verbose 'Cluster' opID=SWI-e6ab007a] [ClusterManagerImpl::IsBadIP] 192.168.xxx.116 is bad ip
In this case, xxx.113 is the management IP of another host in the cluster, xxx.116 is the management IP of the cluster master, and xxx.122 is another IP address bound to one of the other hosts, used only for mounting NFS volumes (not management traffic). fdm.log just repeats these three addresses over and over as 'is bad ip'. There's nothing wrong with these addresses: they're not duplicates, and they're all in DNS correctly.
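For anyone else wading through a large fdm.log, a one-liner like this (a sketch, assuming the log lines look like the ones above) summarises which addresses FDM is flagging and how often:

```shell
#!/bin/sh
# Count how often each address is flagged as bad in fdm.log.
# On the affected host you would feed it the real log:
#   summarise_bad_ips < /var/log/fdm.log
summarise_bad_ips() {
    grep 'IsBadIP' \
      | sed -e 's/.*IsBadIP] //' -e 's/ is bad ip.*//' \
      | sort | uniq -c | sort -rn
}

# Demo on one of the captured lines from above:
echo "2014-11-14T14:51:08.306Z [FFD64B90 verbose 'Cluster' opID=SWI-e6ab007a] [ClusterManagerImpl::IsBadIP] 192.168.xxx.122 is bad ip" \
  | summarise_bad_ips
```

That quickly shows whether FDM is rejecting every peer or just a subset.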
So, while I can see the errors that are causing this, I don't know how to fix it. I could really do with some advice.
Hi,
A few questions...
1. Are you using standard MTU or jumbo frames? This is important because jumbo frames not being enabled end-to-end across your network can cause HA to fail.
2. For the secondary IP that you're using for NFS, can you make sure its vmkernel port is not enabled for management traffic?
Let me know.
Hi there,
I had a similar problem in our 5.5 environment not so long ago. After I had rebuilt the vCenter that was controlling our environment, several clusters would not fully enable HA. A master would be elected, but the other hosts would not become slaves; they would time out and eventually give the same error.
I tried a lot of things, including everything you have listed plus:
- Disabling HA on the cluster and re-enabling it.
- Disconnecting the host from vCenter and reconnecting it.
- Removing the host entirely from vCenter and re-adding it.
- Manually removing the HA agent from the host and re-installing it (similar to this article).
I spoke to a support guy from our vendor (not VMware support directly) and he wasn't much help. I then noticed that the same host would be elected cluster master every time and the others would fail or time out. So my fix was:
- Disable HA.
- Put the recurring master host into maintenance mode.
- Enable HA.
- At this point, the failing host would become the master.
- Bring the other host out of maintenance mode and it would become a slave.
- Repeat as necessary for any hosts that wouldn't join the HA cluster.
- Disable and re-enable HA to watch the cluster election complete gracefully with no issues.
It was a very odd fix, and like you I had searched a great deal for an answer. The final straw would have been going directly to VMware support and potentially ejecting the hosts from the cluster, possibly rebuilding them or the cluster itself. (Luckily it wasn't fully in use, so we didn't have to go down that road.)
I hope this helps!
Regards,
Ryan
Thanks for the reply.
MTU is definitely set to 1500 on all management interfaces. We've never used jumbo frames on our vSphere hosts.
I can also confirm that the second IP used for NFS definitely ISN'T configured for management traffic.
I can also confirm that:
Remove/re-add of the ESXi host doesn't fix it.
Disconnect/reconnect doesn't fix it.
That's a very strange 'fix'. I was going to try turning off HA on the cluster, then re-enabling it to see if that helped. You suggest it won't, but I think I'll have to give it a try. I'll report back, but your 'solution' is one I might have to try if nothing else helps.
Sure thing, it's worth a try and can't hurt. If HA isn't working properly on the hosts, disabling and re-enabling it won't affect much.
My solution is fairly non-intrusive, and even easier if there are plenty of resources within the cluster to satisfy hosts going into maintenance mode.
Well, I think I've cracked it. In my case, disabling and re-enabling HA on the cluster sorted it. This forced a full cluster re-election, and the failed host was then let back in. Surprisingly, the same host that was master before became the master again, but all the 'is bad ip' messages disappeared from the fdm.log file on the failed host.
I'm going to let it sit like this over the weekend before migrating any VMs back onto the host next week.
Thanks for your help.
This too resolved my issue. Two of my eight hosts were in HA, and the rest were not. Fiddling with them one at a time (maintenance mode - reboot - add HA) would not resolve the issue. /var/log/fdm.log was full of these (masked with x's):
2017-06-30T18:00:06.710Z [FFCxxxxxx verbose 'Cluster' opID=SWI-8601xxxxx] [ClusterManagerImpl::IsBadIP] xx.xx.xx.176 is bad ip
Taking the whole cluster out of HA and then re-enabling completely resolved the problem. I didn't have to mess around with them individually.