VMware Cloud Community
kowhirlwind
Contributor

Two hosts not responding after cluster reboot

Hello, I have recently been having an issue with two hosts in a cluster. Since I powered the entire cluster off and back on, two of the hosts will not stay connected to vCenter for longer than 60 seconds. I realize this is a heartbeat problem, since both hosts work normally from the vSphere Client and over SSH. Both hosts can open connections to vCenter on the standard ports, and when I send UDP packets from netcat on port 902, Wireshark on vCenter shows them arriving just fine. However, according to Wireshark, these two hosts simply aren't sending heartbeats, while the heartbeats from the other ~12 hosts show up as expected.

On one of the hosts I enabled NTP (the other host already had it enabled). That host now reconnects periodically on its own, but it disconnects again after about a minute. The other host never reconnects automatically. Any suggestions on what could be causing this issue?
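For anyone repeating the netcat-style UDP test described above, here is a minimal Python sketch of the receiving side. It counts datagrams per source address for a fixed window and reports which expected hosts stayed silent. It is a reachability check only: it assumes a spare test machine or a spare port, since on vCenter itself port 902 is normally already in use by the vCenter service, and it does not decode the heartbeat payload. The function names `count_datagrams` and `silent_hosts` are made up for this illustration.

```python
import socket
import time
from collections import Counter


def count_datagrams(sock, duration_s):
    """Count UDP datagrams per source IP on an already-bound socket."""
    seen = Counter()
    deadline = time.monotonic() + duration_s
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        # Never block past the end of the observation window.
        sock.settimeout(remaining)
        try:
            _data, (src_ip, _src_port) = sock.recvfrom(65535)
        except socket.timeout:
            break
        seen[src_ip] += 1
    return seen


def silent_hosts(expected_ips, seen):
    """Return the expected hosts that sent nothing during the window."""
    return sorted(ip for ip in expected_ips if seen.get(ip, 0) == 0)
```

To use it, bind a `socket.SOCK_DGRAM` socket to the chosen address and port, call `count_datagrams(sock, 60)`, then pass the list of host management IPs to `silent_hosts`. Binding to a port below 1024 usually requires elevated privileges.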

2 Replies
luderitz
Enthusiast

Hi Kowhirlwind,

Great troubleshooting! On the two affected hosts, run "grep -i server /etc/vmware/vpxa/vpxa.cfg" and make sure the vpxa service is configured with the correct IP address and port (902, as you indicated) of your vCenter. Check the firewall settings in each host's Security Profile: is UDP 902 outbound open? Are there any clues in /var/log/vpxa.log? Lastly, can you safely remove one of the hosts from vCenter? What happens when you re-add it?
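If you want to compare the config across several hosts rather than eyeball the grep output, the same check can be sketched in Python against a copy of the file. This assumes vpxa.cfg is XML and that the address and port live in elements named `serverIp` and `serverPort`; those tag names, the sample config, and the helper names below are assumptions for illustration, so adjust them to what your file actually contains.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample for illustration only; a real vpxa.cfg has many more elements.
SAMPLE_CFG = """
<config>
  <vpxa>
    <serverIp>192.0.2.10</serverIp>
    <serverPort>902</serverPort>
  </vpxa>
</config>
"""


def server_settings(cfg_xml):
    """Return {tag: text} for every leaf element whose tag contains 'server'."""
    root = ET.fromstring(cfg_xml)
    return {
        el.tag: (el.text or "").strip()
        for el in root.iter()
        if "server" in el.tag.lower() and len(el) == 0
    }


def mismatches(settings, expected_ip, expected_port=902):
    """Describe each setting that differs from the expected vCenter address."""
    expected = {"serverIp": expected_ip, "serverPort": str(expected_port)}
    return [
        f"{key} is {settings.get(key)!r}, expected {want!r}"
        for key, want in expected.items()
        if settings.get(key) != want
    ]
```

An empty list from `mismatches(server_settings(text), "<your vCenter IP>")` means the file points at the address and port you expect.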

Hope this helps!

Matt Bradford @vmspot www.vmspot.com
kowhirlwind
Contributor

Thanks for the quick response!

The vCenter address and port are configured correctly.

Turning off the ESXi firewall made no difference.

The only error I see in vpxa.log is this:

2015-02-17T01:39:29.126Z [37690B70 error 'SoapAdapter.HTTPService'] Failed to read request; stream: <io_obj p:0x1f4d5be0, h:-1, <TCP '0.0.0.0:0'>, <TCP '0.0.0.0:0'>>, error: N7Vmacore16TimeoutExceptionE(Operation timed out)
2015-02-17T01:39:29.126Z [37690B70 error 'SoapAdapter.HTTPService'] Failed to read request; stream: <io_obj p:0x1f4d5be0, h:-1, <TCP '0.0.0.0:0'>, <TCP '0.0.0.0:0'>>, error: N7Vmacore16TimeoutExceptionE(Operation timed out)
2015-02-17T01:39:48.258Z [FFEA91A0 verbose 'vpxavpxaInvtHost'] [VpxaInvtHost] Increment master gen. no to (2275): Event:VpxaHalEvent::CheckQueuedEvents

This came up a few times after trying to reconnect the host in vCenter.
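Pasted vpxa.log output can end up with several entries run together on one line, which makes it hard to see how often an error repeats. Below is a minimal Python sketch that splits such text back into entries and pulls out the level and component. The pattern is inferred only from the lines quoted in this thread (timestamp, then thread ID, level, and quoted component in square brackets), so treat it as an assumption about the format, not a full parser; `split_entries` and `parse_entry` are illustrative names.

```python
import re

# An entry starts with an ISO timestamp followed by " [" (zero-width match).
ENTRY_START = re.compile(r"(?=\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}Z \[)")

# Header layout seen in this thread: ts [THREAD level 'component'] message
HEADER = re.compile(
    r"(?P<ts>\S+Z) \[(?P<thread>[0-9A-Fa-f]+) (?P<level>\w+) "
    r"'(?P<component>[^']+)'\] (?P<message>.*)",
    re.DOTALL,
)


def split_entries(text):
    """Split fused log text into one string per entry."""
    parts = (p.strip() for p in ENTRY_START.split(text))
    return [p for p in parts if p]


def parse_entry(entry):
    """Return the header fields as a dict, or None if the entry does not match."""
    m = HEADER.match(entry)
    return m.groupdict() if m else None
```

For example, `[e for e in map(parse_entry, split_entries(text)) if e and e["level"] == "error"]` keeps only the error entries, which can then be counted per component or per timestamp.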

I have also tried removing it from vCenter's inventory, rebooting the host, etc.
