Host not responding

GlenB · ‎10-17-2009

I have ESX 3.5i running with < 10 guest OS (Win2k3 SP2 Server, or WinXP SP3). The host is a DELL T410 with 8 GB RAM, 2 TB of RAID 6. It's a new installation, but has been basically up and running OK for a few weeks. I manage it from VMcentre VIclient running on a 2nd machine. These 2 machines are 1/2 of the "fleet" in this little development lab. Network connectivity between machines is through a DLink 8 port gigabit switch.

Just yesterday, the VIclient started losing connection to the ESX host. In the Inventory view, the VM host shows as (not responding) and all the guests are (disconnected). I can ping all the guests and I can RDP to the guests so the network is OK and the guests are still running fine.

In the VIclient I can right click on the ESX host and "Disconnect" it, and all the guests also become (disconnected). A moment later I can "Connect" it and everything returns to normal. If I had a Virtual Machine Console open when the (not responding) happened, it had gone black but now it once again is showing me the machine's console as expected.

If I do nothing, I mean NOTHING, in the next couple of minutes the VIclient drops the ESX host again as (not responding). This repeats indefinitely; it has never stayed connected for longer than a few minutes. I can't think of any particular event that seemed to have been the trigger. I have power cycled the VIclient machine and the ESX host machine and that had no effect.

I am sure there are an ample number of logs that I ought to be looking through, but I'd appreciate the advice on just what to look for and what to try.

Regards - Glen

mircmmatgmail · ‎10-20-2009

sorry, I forgot to mention that after that you should remove the host(s) and reconnect them ...

here is the complete KB:

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=100803...

admin · ‎10-20-2009

Glen,

Heartbeats are issued by the host every 10 sec. If VC is not receiving heartbeats you will see the host go not responding in about 30sec - 2min.

The problem with the heartbeats is that they are UDP; everything else you have there is TCP. The VI Client doesn't listen for heartbeats, so the fact that you can connect the VI Client to the ESX host doesn't rule out much.

Is it only one ESX having this issue? what is special with it? different location? Have you a network admin? - can you ask him if 902/udp is open all the way from this ESX to VC?

Next up, wireshark.

DSTAVERT · ‎10-20-2009

I agree. Wireshark will immediately show you if you have heartbeats. A simple thing to check even if it is to eliminate that as an issue.

-- David -- VMware Communities Moderator

GlenB · ‎10-20-2009

Thanks for the link to the KnowledgeBase article. I'm going to have to get better at mining the KnowledgeBase.

After adding the "Managed IP Address" in the runtime configuration (was blank before), I removed the ESX host and reconnected it. Looked good for a moment, but then the host fell back into the (not responding) and all the VMs became (disconnected).

The only downside was that I seem to have lost my templates. The folders are still there when I browse the datastore, but they're not in the Inventory any more. I'm sure there's a way to get them back ....

Regards - Glen

GlenB · ‎10-20-2009

Sure sounds like a loss of heartbeats, eh?

I only have the one Host - it is ESXi not ESX in case that makes a difference. I've been looking through the KB article and all of its help refers to ESX, and some of the shell commands it asks for don't work the same. I'm wondering if there is some basic difference that I need to look into? For example ps does not accept -ef as parameters, and the ps results don't look anything like what the KB article suggests, though I do see lots of instances of hostd and vpxa but no xinetd. I'll poke around and see if I can find an equivalent article that deals with ESXi just in case.

The main reason I got VC was to make life easier for creating templates and rolling out multiple machines quickly. And it was working fine for a week or two before this happened - wonder what changed?

The ESX host, VC and XP Workstation are all on the same subnet with only a Gb switch inter-connecting them. This is just a simple little development setup!!! No networking issues expected in something that simple. If you have ideas though, I'm also the network admin dude. I have done nothing within my subnet (that I can think of) to mask 902/UDP going anywhere.

Regards - Glen

GlenB · ‎10-20-2009

If you can help me locate and run WireShark, I'll test for the heartbeats.

Regards - Glen

DSTAVERT · ‎10-20-2009

http://www.wireshark.org/

Just install on the VC machine. You want to add a filter otherwise there is too much to see. Just add udp.port == 902 to the filter text box.

You should see the heartbeats.

-- David -- VMware Communities Moderator

admin · ‎10-20-2009

The problem here, as you are not familiar with wireshark, is that you will be looking for what's missing. My theory is that you will not see UDP traffic on port 902 from the ESX host to the VC server.

As you have an i host a lot of the things you are used to are missing. What you can confirm is in the file /etc/opt/vmware/vpxa/vpxa.cfg that the <serverIp> is in fact the IP of your VC server. Please make sure that you have no leading 0's either (there is a big difference between 10.10.10.1 and 010.010.010.001)
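The check described above can be sketched in a few lines of Python. This is only an illustration: the sample XML layout around <serverIp> is an assumption (only the element name comes from this thread), and the function names are made up for the example. Leading zeros matter because some address parsers read a zero-prefixed octet as octal, so 010 may be taken as 8 rather than 10.

```python
import re


def extract_server_ip(cfg_text):
    """Return the text inside the first <serverIp> element, or None if absent."""
    match = re.search(r"<serverIp>\s*([^<\s]+)\s*</serverIp>", cfg_text)
    return match.group(1) if match else None


def octets_with_leading_zeros(ip):
    """Return the octets of a dotted-quad string that carry a leading zero."""
    return [octet for octet in ip.split(".") if len(octet) > 1 and octet.startswith("0")]


# Hypothetical config fragment; the real vpxa.cfg contains more elements.
sample = "<config><vpxa><serverIp>010.010.010.001</serverIp></vpxa></config>"
ip = extract_server_ip(sample)
print(ip, octets_with_leading_zeros(ip))
```

Copy the contents of vpxa.cfg off the host and pass them in as a string; an empty list from octets_with_leading_zeros means the address is written cleanly.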

DSTAVERT · ‎10-20-2009

Once you start the capture you don't want to leave it capturing for too long. If you don't get the UDP 902 entries quickly you aren't getting them. Shut it down.

-- David -- VMware Communities Moderator

GlenB · ‎10-20-2009

I captured a pile of stuff then display filtered on (IP.addr=ESXi host or IP.addr=VC server) and UDP.port=902. There's a packet every 10 seconds give or take a few milliseconds. Source = ESXi, Destination = VC, protocol = UDP, Source port increments each time somewhere up around 51800, destination port = ideafarm-door.

I think I'm getting the heartbeats and they're going from the correct ESXi machine to the correct VC machine. Agreed?
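For anyone repeating this test: in Wireshark's own display-filter syntax the filter would look something like (ip.addr == 192.168.1.91 or ip.addr == 192.168.1.90) and udp.port == 902, with your own host and VC addresses substituted. Rather than eyeballing the 10-second spacing, the packet times can be exported and scanned for gaps. The sketch below is a minimal example, assuming you already have the arrival times as a list of seconds; the function name and the 2-second tolerance are arbitrary choices for illustration.

```python
def heartbeat_gaps(timestamps, expected=10.0, tolerance=2.0):
    """Return (earlier, later, gap) for consecutive packets spaced too far apart.

    A pair is reported when the spacing exceeds expected + tolerance seconds.
    """
    gaps = []
    for earlier, later in zip(timestamps, timestamps[1:]):
        delta = later - earlier
        if delta > expected + tolerance:
            gaps.append((earlier, later, delta))
    return gaps


# Made-up arrival times in seconds: two heartbeats are missing after t=20.
times = [0.0, 10.0, 20.0, 50.0, 60.0]
print(heartbeat_gaps(times))
```

An empty result over a capture that spans a (not responding) event would suggest the heartbeats are arriving on the wire and the problem lies elsewhere.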

Regards - Glen

admin · ‎10-20-2009

you're getting the heartbeats.

Time to call VMware support I'd say.

For some reason the VC server is receiving the heartbeats but the VC service doesn't realize it. Unless a reboot of the VC server clears this, it isn't just your garden-variety not-responding scenario.

GlenB · ‎10-20-2009

hostIp=192.168.1.91, hostPort=443, serverIp=192.168.1.90, serverPort=902

Regards - Glen

DSTAVERT · ‎10-20-2009

You've spent enough time now. Just replace the server.

-- David -- VMware Communities Moderator

DSTAVERT · ‎10-20-2009

Just thought I would add this. Just stumbled across it.

http://communities.vmware.com/message/1138787#1138787

-- David -- VMware Communities Moderator

GlenB · ‎10-20-2009

Replace the server?!?!?!? But we've only just started to have some fun! Who wants to figure out an easy problem? (PS - appreciate your help and your humour)

Regards - Glen

GlenB · ‎10-20-2009

Interesting post - think there's a connection? I'm running ESXi 3.5.0 build 153875 - what Update # is that? I notice there was no fix reported, only that one was expected in U5. That was back earlier this year, I think.

Some differences - I'm on ESXi not ESX, and I have no HP hardware or software though I guess those were only triggers. I only have 5 or 6 VMs running on the ESX host, and it only just started happening -- that's not much load!

I tried to use the Unix commands I knew to see what was going on in the box, but ps -ef doesn't produce the output I expected and -? doesn't even talk about -e or -f as possible parameters. Anyone have a link to Unix help for ESX 3.5i V3.5.0?

When I did ps | grep hostd I ended up with about 12 lines, so perhaps I have a lot of orphaned hostd processes and a corresponding memory build-up. But again, with the differences in this Unix and what I am used to I can't figure out how to see the memory usage.

Just rebooted the VC machine - just in case I had fixed something or needed to unplug the ether - but no difference

Regards - Glen

cybulsk · ‎10-20-2009

glancing through this thread one thing I didn't see is a mention of DNS. Keep in mind that vCenter uses the DNS name to talk to the hosts, not the IP address. Name resolution issues can cause hosts to appear disconnected. Might be worth a look.
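A quick way to test name resolution from the VC machine is a small script like the one below. It is a rough sketch, not a VMware tool: the function name is made up, and it only checks forward lookups through the operating system's resolver (hosts file first, then DNS, depending on how the machine is configured).

```python
import socket


def resolve_all(names, resolver=socket.gethostbyname):
    """Map each name to its resolved IPv4 address, or None if lookup fails."""
    results = {}
    for name in names:
        try:
            results[name] = resolver(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            results[name] = None
    return results


# Replace "localhost" with the host's FQDN and its short name.
print(resolve_all(["localhost"]))
```

Run it with both the fully qualified name and the short name of the ESX host; a None for either one points at a resolution problem worth chasing.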


GlenB · ‎10-20-2009

Does VC periodically try to "talk to the hosts"? I know that the host talks to VC every 10 seconds (the heartbeat) and that appears to be done using only IP addresses without DNS. But does VC ever try to reach ESXi even if I don't touch the keys? Why, and what triggers it and what does it ask?

And if the answer to that is Yes, then would DNS still be a potential problem if I told you that every single time I try to Connect to the ESX host (after Disconnect to recover from the "not responding" state) it succeeds? You seem to be hypothesizing an intermittent DNS problem that always lets me find the ESX host to Connect, but exactly 2 minutes after I connect DNS is "in error" and the connection appears to disappear.

Thanks for offering an idea. Not saying you're wrong, wouldn't be the first time I had misconfigured a DNS server, but just help me understand the possible mechanism by which a DNS problem would show up in this circumstance.

Regards - Glen

mircmmatgmail · ‎10-20-2009

Templates are easy, browse the datastore, right click the template's .vmtx file and Add To Inventory.

I am out of ideas regarding your main topic, sorry

gregplou · ‎10-21-2009

We have seen this behavior in the past, and it turned out to be a name resolution issue between the VC and the ESX Hosts. It turned out to be an issue with the hosts file on both the VC and the ESX Hosts. We took the hosts file we had, which included only the IP and the FQDN, and we added the short name as well.

The original hosts file looked like this:

10.10.10.10 hostname.corporate.net

The updated hosts file had entries which included the short name:

10.10.10.10 hostname.corporate.net hostname

We made sure that the hosts file was the same on the Virtual Center server and all of the ESX Hosts. That corrected the disconnects we were encountering on a regular basis early on. Also, ensure that the 127.0.0.1 localhost entry exists on all the ESX Hosts.
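To spot entries like the original one above, a hosts file can be scanned for lines that list an FQDN without its short name. The following is a minimal sketch; the function name is invented for the example, and it treats any name containing a dot as an FQDN, which is a simplification.

```python
def entries_missing_short_name(hosts_text):
    """Return (ip, fqdn) pairs whose line lists an FQDN but not its short name."""
    missing = []
    for line in hosts_text.splitlines():
        # Drop trailing comments, then split into the IP and its names.
        fields = line.split("#", 1)[0].split()
        if len(fields) < 2:
            continue
        ip, names = fields[0], fields[1:]
        for name in names:
            if "." in name and name.split(".", 1)[0] not in names:
                missing.append((ip, name))
    return missing


before = "127.0.0.1 localhost\n10.10.10.10 hostname.corporate.net\n"
after = "127.0.0.1 localhost\n10.10.10.10 hostname.corporate.net hostname\n"
print(entries_missing_short_name(before))  # [('10.10.10.10', 'hostname.corporate.net')]
print(entries_missing_short_name(after))   # []
```

Running the same check against the hosts file from the Virtual Center server and from each ESX host is one way to confirm they all carry the short name.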

Best Regards, Greg
