VMware Cloud Community
GeneNZ
Enthusiast

Intermittent Host Disconnections.

Hi All,

I'm wondering if anyone can help me with this issue. We recently updated our servers to the latest patches (vSphere Update 1a) via VUM. We had applied Update 1 when it was first released and didn't see these issues at all; they only started showing up after we applied Update 1a and 200912101-UG. I am still unsure whether the problem is related to these updates, but they are the only changes made since the problems started.

The issue we're experiencing is intermittent ESX host disconnections from our vCenter (also running Update 1), each lasting about 15 seconds. vCenter just reports "Host is not responding" and fires the appropriate alarms. The first time it occurred, 6 of our 7 ESX servers lost connection to vCenter at the same time. A moment ago, 1 of the 7 servers lost connection temporarily. The fact that 6 of our 7 hosts disconnected together the first time suggests a problem with the vCenter server's networking, but I haven't made any changes to that server, and it doesn't explain the second occurrence, where just one server was disconnected.

Would anyone have any clue about where I could find logs or any information that would point me in the direction to help me determine what is going wrong?

I have checked the /var/log directories on the ESX servers themselves, and nothing of interest comes up at the specified times. The only thing that appears when a host disconnection actually occurs is the following pair of messages in /var/log/vmkernel on the host that disconnected:

Dec 18 17:20:49 esx-alpha vmkernel: 4:02:00:44.874 cpu6:4110)Config: 289: "VMOverheadGrowthLimit" = -1, Old Value: 0, (Status: 0x0)

Dec 18 17:21:29 esx-alpha vmkernel: 4:02:01:25.127 cpu5:4109)Config: 289: "VMOverheadGrowthLimit" = 0, Old Value: -1, (Status: 0x0)
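
For anyone who wants to line these entries up against the vCenter alarm times, here is a minimal Python sketch that pulls the timestamp and the old/new values out of vmkernel "Config" lines like the two above. The regular expression is only fitted to these two samples, and the year (2009) is an assumption, since syslog timestamps do not carry one.

```python
import re
from datetime import datetime

# Fitted to the two sample lines above; other vmkernel messages will not match.
LINE_RE = re.compile(
    r'^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+vmkernel:.*'
    r'"(?P<option>[^"]+)"\s*=\s*(?P<new>-?\d+),\s*Old Value:\s*(?P<old>-?\d+)'
)

def parse_config_changes(lines, year=2009):
    """Return (timestamp, host, option, old, new) for each matching line."""
    changes = []
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue
        # Syslog timestamps have no year, so one is supplied (assumed 2009).
        stamp = datetime.strptime(f"{year} {m['stamp']}", "%Y %b %d %H:%M:%S")
        changes.append(
            (stamp, m["host"], m["option"], int(m["old"]), int(m["new"]))
        )
    return changes

SAMPLE = [
    'Dec 18 17:20:49 esx-alpha vmkernel: 4:02:00:44.874 cpu6:4110)Config: 289: '
    '"VMOverheadGrowthLimit" = -1, Old Value: 0, (Status: 0x0)',
    'Dec 18 17:21:29 esx-alpha vmkernel: 4:02:01:25.127 cpu5:4109)Config: 289: '
    '"VMOverheadGrowthLimit" = 0, Old Value: -1, (Status: 0x0)',
]

changes = parse_config_changes(SAMPLE)
# Seconds between the first and second config change.
gap_seconds = (changes[1][0] - changes[0][0]).total_seconds()
print(gap_seconds)
```

On these two lines the script reports a 40-second interval between the setting being changed and changed back, which can then be compared with the time of the "Host is not responding" alarm.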

In the vpxd log file I also noticed the following entry at the moment of disconnection:

Bad primary esx-alpha: connected=0, dasState=running, vpxaDasState=running

Would that mean anything to anyone?

Could the host timeouts have been shortened? I'm unsure exactly how vCenter maintains its connection to the ESX servers. Presumably via heartbeat? Is there any way to modify the heartbeat settings?

Thanks in advance.

Gene

4 Replies
Seth_A
Contributor

Hi Gene,

Did you ever find a cause of, or solution to, this issue? We are experiencing this exact issue today, and I was planning to open a ticket with VMware, but when I came across your post I figured I would check whether you had made any progress.

Thanks in advance!

Seth

stingray75
Contributor

Hi,

We're getting the same error too. We have a call open with VMware, and they first told us to ensure the service console memory on all hosts is set to 800MB. Most of them were, but a few are not and we cannot reboot them at the moment. However, I'm getting disconnect errors on hosts where it's already set to 800MB (and which are not swapping much), and sometimes all hosts disconnect at the same time. It's driving us nuts.

samuk
Enthusiast

Not sure if this is the same issue, but I have seen host disconnections at a customer site and we found it to be a DNS / IP issue.

I recently saw this issue again; after many days of looking I decided to rebuild, and the issue went away.
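
If you want to rule DNS in or out quickly, a rough Python sketch like the one below checks that a host name resolves to an IP and that the IP resolves back to the same name. It uses only the standard `socket` module; the `check_dns` function and the shape of its result are my own illustration, not anything VMware provides, and the resolver functions are passed as parameters only so the logic can be tried without a live DNS server.

```python
import socket

def check_dns(hostname, forward=socket.gethostbyname, reverse=socket.gethostbyaddr):
    """Resolve hostname -> IP -> name and report whether the round trip agrees."""
    try:
        ip = forward(hostname)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        return {"host": hostname, "ok": False,
                "reason": f"forward lookup failed: {exc}"}
    try:
        name, aliases, _ = reverse(ip)
    except OSError as exc:  # socket.herror is a subclass of OSError
        return {"host": hostname, "ip": ip, "ok": False,
                "reason": f"reverse lookup failed: {exc}"}

    # Accept a match on either the full name or the short (unqualified) name.
    known = {n.lower() for n in [name, *aliases]}
    short = {n.split(".")[0] for n in known}
    wanted = hostname.lower()
    ok = wanted in known or wanted.split(".")[0] in short
    reason = "round trip matches" if ok else f"reverse lookup returned {name}"
    return {"host": hostname, "ip": ip, "ok": ok, "reason": reason}
```

Run it from the vCenter server against each ESX host name (for example `check_dns("esx-alpha")`), and ideally from each host against the vCenter name as well, since a mismatch in either direction could matter.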

jdoll66
Contributor

We have been experiencing this issue since Friday, 10/8/10. We have two data centers, primary and secondary, on vCenter 4.0 (build 208111), with 36 hosts at the primary and 8 at the secondary, all ESX 4.0 U2 (build 261974). We have an SR open with VMware. I've placed a managed IP address in each vCenter and added the vCenter and ESX hosts to the hosts file of each clustered ESX host. The issue is still occurring. I'm now going to run Wireshark on vCenter to see whether the "heartbeat" messages are failing to arrive. Very aggravating.
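
Once the capture is exported, something like the following Python sketch could flag the silences per host. It assumes the packets have already been reduced to (host, arrival time in seconds) pairs; the host names, the sample times and the `max_gap` threshold are all made up for illustration, so set the threshold from whatever interval you actually observe in the trace.

```python
from collections import defaultdict

def find_gaps(packets, max_gap):
    """Return {host: [(last_seen, next_seen), ...]} for silences over max_gap seconds."""
    by_host = defaultdict(list)
    for host, ts in packets:
        by_host[host].append(ts)

    gaps = {}
    for host, stamps in by_host.items():
        stamps.sort()
        # Compare each arrival with the one before it.
        silent = [(a, b) for a, b in zip(stamps, stamps[1:]) if b - a > max_gap]
        if silent:
            gaps[host] = silent
    return gaps

# Hypothetical arrival times: esx-a goes quiet between t=30 and t=75.
packets = [("esx-a", t) for t in (0, 10, 20, 30, 75)]
packets += [("esx-b", t) for t in (0, 10, 20, 30, 40)]

print(find_gaps(packets, max_gap=15))
```

If every host shows a gap over the same window, that points at the vCenter side or the network in between; a gap on a single host points at that host.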
