BobAlbion
Contributor

Random ESX Host within a cluster goes "not responding", why?

Hello!

We have a 7-host cluster which has been working fine, but for the third time in 2 months a random host has gone "not responding" and the other hosts have HA Agent errors.

The HA Agent errors are easily resolved with the "Reconfigure for HA" option, but the "not responding" host doesn't come back even after restarting the mgmt-vmware service, killing the relevant processes, or restarting the service console network service. Restarting the host itself is the only fix. The guests are fine, but because the host is disconnected we can't VMotion them; we can only RDP in and shut them down cleanly (not ideal in a healthcare environment).

A couple of questions: the logs don't show anything obvious, so some pointers on what to look for would help.

I haven't found any common factors across the 3 instances: one happened at night, one at a weekend, and one during the working week. No work was being done on any host at the time. The hosts are in no way overworked or at maximum resource usage.

Alternatively, has anyone seen similar symptoms and resolved them or found the cause? Is the HA Agent dropout a coincidence, or is it relevant?

My current thinking revolves around the network for the vmotion/service console/heart beat, but I am not sure if I am on the right track or how to test this.
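One way to test the network theory is to run a small probe from a separate machine that logs whenever a host's service console stops accepting connections, so the timestamps can be compared with the moment vCenter marks the host "not responding". The following is only a rough sketch: the host names are placeholders, and TCP ports 443 and 902 are assumed to be the management ports on the service console. It does not test the UDP heartbeat traffic itself.

```python
import socket
import time
from datetime import datetime


def tcp_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def probe(hosts, ports=(443, 902)):
    """Check every host/port pair once and return {(host, port): bool}."""
    return {(h, p): tcp_reachable(h, p) for h in hosts for p in ports}


def watch(hosts, interval=30, rounds=None, ports=(443, 902)):
    """Probe repeatedly and print a timestamped line for each failure.

    rounds=None loops until interrupted; an integer limits the number of passes.
    """
    done = 0
    while rounds is None or done < rounds:
        for (host, port), ok in probe(hosts, ports).items():
            if not ok:
                stamp = datetime.now().isoformat(timespec="seconds")
                print(f"{stamp} {host}:{port} unreachable")
        done += 1
        if rounds is None or done < rounds:
            time.sleep(interval)
```

Calling something like `watch(["esx01.example.local", "esx02.example.local"], interval=30)` (hypothetical names) and leaving it running would give a log to line up against the next dropout. If the probe keeps succeeding while the host shows "not responding", the problem is more likely in the management agents than in the network.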

The ESX Hosts are at 3.5 Update 4.

Many Thanks in advance.

3 Replies
azn2kew
Champion

What do you see in the vpxd log file? Check for any errors related to the SSL certificate, and post the log so we can see more details.
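If the vpxd log is large, a quick filter can narrow it down before posting. This is just a sketch: the keyword list is a guess at what might be relevant, not a list of known vpxd messages, and the log lines are passed in by the caller.

```python
import re

# Assumed keywords of interest; adjust to whatever the log actually contains.
PATTERN = re.compile(r"ssl|certificate|not responding|heartbeat", re.IGNORECASE)


def suspicious_lines(lines, pattern=PATTERN):
    """Return (line_number, text) for each line that matches the pattern."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, 1)
        if pattern.search(line)
    ]
```

Opening the log with `open(path, errors="replace")` and passing the file object to `suspicious_lines` returns the matching lines with their line numbers, which makes it easier to post only the part around each dropout.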

If you found this information useful, please consider awarding points for "Correct" or "Helpful". Thanks!!!

Regards,

Stefan Nguyen

VMware vExpert 2009

iGeek Systems Inc.

VMware, Citrix, Microsoft Consultant

azn2kew
Champion

Each case seems to be resolved differently, so it's best to work through these troubleshooting guidelines from VMware to resolve your "not responding" host. What is your ESX version, and have you patched to the latest level? If not, check out this patch to see if you are affected.

BobAlbion
Contributor

Looks like others have had this issue: http://communities.vmware.com/thread/181420?tstart=0

Thanks for your efforts azn2kew!
