GlenB
Contributor

Host not responding

I have ESX 3.5i running with fewer than 10 guest OSes (Win2k3 SP2 Server or WinXP SP3). The host is a Dell T410 with 8 GB RAM and 2 TB of RAID 6. It's a new installation, but it has been up and running OK for a few weeks. I manage it from the vCenter VIclient running on a 2nd machine. These 2 machines are half of the "fleet" in this little development lab. Network connectivity between machines is through a D-Link 8-port gigabit switch.

Just yesterday, the VIclient started losing its connection to the ESX host. In the Inventory view, the VM host shows as (not responding) and all the guests are (disconnected). I can ping all the guests and I can RDP to the guests, so the network is OK and the guests are still running fine.

In the VIclient I can right click on the ESX host and "Disconnect" it, and all the guests also become (disconnected). A moment later I can "Connect" it and everything returns to normal. If I had a Virtual Machine Console open when the (not responding) happened, it had gone black but now it once again is showing me the machine's console as expected.

If I do nothing, and I mean NOTHING, within the next couple of minutes the VIclient drops the ESX host again as (not responding). This repeats indefinitely; it has never stayed connected for longer than a few minutes. I can't think of any particular event that might have been the trigger. I have power cycled both the VIclient machine and the ESX host machine, and that had no effect.

I am sure there are an ample number of logs that I ought to be looking through, but I'd appreciate the advice on just what to look for and what to try.

Regards - Glen
55 Replies
Luckybob
Enthusiast

When the client drops the host, can you still ping the host?

Try setting up a continuous ping to the host and watch the response times. Make sure they are within an acceptable range and are not spiking. Monitor what happens when the client drops the connection.
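
If it helps, here is a minimal Python sketch for checking a captured ping run for spikes or drops instead of watching it scroll by. It assumes the English-language Windows "ping -t" output format ("Reply from ... time=2ms" and "Request timed out."). The 50 ms spike threshold and the function name summarize_ping are arbitrary choices for illustration, not anything VMware specifies.

```python
import re

# Matches the round-trip time in lines such as
# "Reply from <address>: bytes=32 time<1ms TTL=64" (English Windows ping).
# A "time<1ms" reply is counted as 1 ms, i.e. its upper bound.
RTT_PATTERN = re.compile(r"time[=<](\d+)\s*ms", re.IGNORECASE)


def summarize_ping(lines, spike_ms=50):
    """Count replies, lost packets and slow replies in captured ping output.

    spike_ms is an assumed threshold; pick whatever counts as "slow" on your LAN.
    """
    replies = 0
    timeouts = 0
    worst_ms = 0
    spikes = []  # (line_index, rtt_ms) for every reply slower than spike_ms

    for index, line in enumerate(lines):
        match = RTT_PATTERN.search(line)
        if match:
            rtt = int(match.group(1))
            replies += 1
            worst_ms = max(worst_ms, rtt)
            if rtt > spike_ms:
                spikes.append((index, rtt))
        elif "timed out" in line.lower() or "unreachable" in line.lower():
            timeouts += 1

    return {
        "replies": replies,
        "timeouts": timeouts,
        "worst_ms": worst_ms,
        "spikes": spikes,
    }
```

To use it, redirect the ping output to a file, then pass that file's lines to summarize_ping, for example open("ping.txt").read().splitlines(), where ping.txt is just a placeholder name. Any timeouts or spikes that line up with the moment the client drops the host would point at the network.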

GlenB
Contributor

Good idea, nice try. I started a ping -t (every second). The host showed as (not responding). Then I disconnected it, paused, connected it, and sat there waiting till it went nonresponsive again. By the time I killed the ping it had been 136 seconds and every single ping returned in < 1 msec.

Regards - Glen

Luckybob
Enthusiast

Are you using the same PC to run the client from every time? If so try a different PC and see if you get the same results.

If you are running ESX, see if you can SSH into the console when the client loses connectivity.

mmathurakani
Enthusiast

Can you please tell me the update version of the ESX 3.5i server that you are using? Presently I can see from the compatibility list that the hardware you are using, namely the Dell PowerEdge T410, is supported only by ESX 3.5i Update 4.

mmathurakani
Enthusiast

One more thing to check: tell me if you experience the same problem when you connect directly to the ESXi server using the VI client.

GlenB
Contributor

Yes, using U4.

One thing to note is that it was working for 10 days before the problem started showing up. It happened a few times over a day or two and then became pervasive.

Regards - Glen

athlon_crazy
Virtuoso

Is there anything weird in your ESXi log? Could be an IP conflict or something.

vcbMC-1.0.6 Beta

vcbMC-1.0.7 Lite

http://www.no-x.org
GlenB
Contributor

LuckyBob and mmathurakani

Your questions led me to some interesting discoveries:

#1 - my regular method of use was to start the VI client on a Win2k3 server where the vCenter database lives, so I'd connect to vCenter. This is the situation in which I noticed all the (not responding) occurrences. In this situation, I was running VI client 2.5.0 build 84767.

#2 - I installed the VI client on a separate Win2000 server by downloading it from the VM host. The host is running ESXi 3.5.0 build 153875. The version that I downloaded from there and installed on the Win2000 box was VI client 2.5.0 build 147633. It appears to not see the VM host becoming unresponsive; at least it has run for >15 minutes so far and seems OK.

#3 - when I ran that same downloaded VI client installation on my original Win2k3 server (see #1), it offered to upgrade my VI client, so I let it. Now it is also running 2.5.0 build 147633. I connected it to the VM host (not vCenter) and it also appears to not see the VM host becoming unresponsive: >15 minutes, looks good so far.

#4 - now I am doing the same as #1 but with the new VI client build, connecting to vCenter instead of the VM host. I have all 3 of these connections running at the same time (Win2000 connected to Host, Win2k3 connected to Host, Win2k3 connected to vCenter). The 2 connected to the Host are fine, but the one connected to vCenter still sees it become unresponsive.

#5 - the vCenter software is 2.5.0 build 84767 (like the VI client was). I wonder if that is the root of all evil? Where do I get the update for VMware Virtual Center V2.5.0?

So, what do we know?

- Using the right builds probably helps.

- It takes almost exactly 2 minutes for the vCenter connected client to see the VM host become unresponsive (what happens on a 2 minute cycle?).

- While the vCenter connected client "seems" OK between drops, it is not: if you highlight the Host in the left panel and select the Virtual Machines tab in the right panel, the columns labelled Host CPU, Host Mem and Guest Mem all show zero. The State, Status and Description are fine.

- the problem seems specific to the vCenter machine, not to the VI client and not to the ESXi host.

Regards - Glen

GlenB
Contributor

An IP conflict would mean the pings and RDP wouldn't have worked, and it wouldn't be intermittent.

If you have a specific thing to look at in the ESXi log, I could give it a try, but it's hard to execute such a generic search request.

Regards - Glen

DSTAVERT
Immortal

You can watch the realtime logs: press ALT + F12 at the ESXi console when the problem begins and see if anything shows up, for example lines containing "error".

-- David -- VMware Communities Moderator
mmathurakani
Enthusiast

Try the following steps in order; if a step does not solve the issue, proceed to the next one:

1. Restart the management agents on the ESXi server. For details on how, see below.

2. Restart the Virtual Center Server service on the Virtual Center server machine.

3. Update Virtual Center to the latest update available. I think the latest available update is Update 5 (U5); you can get it by following the "Downloads" link on vmware.com and logging in with your account details (email ID and password).

Restarting the management agents in ESXi

method one:

To restart the management agents on ESXi:

  1. Connect to the console of your ESXi Server.

  2. Press F2 to customize the system.

  3. Login as root.

  4. Using the Up/Down arrows navigate to Restart Management Agents.

  5. Press Enter.

  6. Press F11 to restart the services.

  7. When the service has been restarted, press Enter.

  8. Press Esc to logout of the system.

If starting or stopping the management agents fails, try restarting them a second time.

Or follow method two:

  1. From the ESXi console summary screen, press ALT-F1.

  2. Type the word "unsupported" (without quotes). You will not see it being typed on the screen.

  3. Enter the root password for your system when prompted.

  4. At the resulting command prompt, type the following command: /sbin/services.sh restart

admin
Immortal

If the state of the ESX host is "not responding" it means that VC is not getting any heartbeats from the ESX.

Heartbeats are sent over UDP on port 902.

If it is only one ESX experiencing this issue at the same location then the host may not be sending heartbeats. From the ESX service console you want to tcp dump to make sure that the heartbeats are being sent from the ESX:

tcpdump -i vswif0 -n udp port 902

If the ESX is not sending any heartbeats go ahead and restart the management agents.

If you have several ESX hosts at the same location showing up as not responding, OR you see in the tcp dump that the heartbeats are being sent, the problem is elsewhere.

Ask yourself what is going on on the network: why are UDP packets being dropped on their way over to VC?

If all your ESX hosts are sending heartbeats and you believe that the network is fine, you have to set up a Wireshark trace on VC and see if VC is getting the packets.
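
Once you have a capture on the VC side, one way to check it is to export the arrival times of the UDP 902 packets from the host and look for pauses. Below is a rough Python sketch, assuming you already have those timestamps as a list of seconds. The names find_gaps and max_gap_s are made up for illustration, and the right threshold depends on the heartbeat interval, which is not stated in this thread.

```python
def find_gaps(timestamps, max_gap_s):
    """Return (start, end, length) for each pause between consecutive
    packets that is longer than max_gap_s seconds.

    timestamps: packet arrival times in seconds, in any order.
    max_gap_s:  assumed threshold; set it from the heartbeat interval
                you actually observe in the capture.
    """
    ordered = sorted(timestamps)
    gaps = []
    # Compare each packet with the one that arrived right after it.
    for earlier, later in zip(ordered, ordered[1:]):
        length = later - earlier
        if length > max_gap_s:
            gaps.append((earlier, later, length))
    return gaps
```

If the gaps reported here line up with the times VC marks the host as (not responding), the packets are being lost before they reach VC. If there are no gaps at all, VC is receiving the heartbeats and the problem is more likely on the VC side.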

DSTAVERT
Immortal

No tcpdump in ESXi.

-- David -- VMware Communities Moderator
admin
Immortal

ooops, should have read the description a bit better.

If your VC is on 2.5 you can grep the /var/log/vmware/vpx/vpxa.log file for heartbeat. If you are running VC4 the log will not show the heartbeats so you will have to set the vswitch to promiscuous and attach a VM and tcpdump from there.

In most cases, though, the ESX is sending the heartbeats; if you are behind a NAT, the ESX may not be sending the heartbeats to the right IP. In this case, where it is intermittent, I'd say something is going on on the network.
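
The grep for heartbeat can also be done with a few lines of Python if you have copied the log to a machine without grep (a Windows box, say). This is only a sketch: it does a case-insensitive substring match and assumes nothing about the log's line format, and grep_heartbeat is a hypothetical helper name.

```python
def grep_heartbeat(path, needle="heartbeat"):
    """Yield (line_number, line) for every line of the file at path
    that contains needle, ignoring case."""
    wanted = needle.lower()
    # errors="replace" keeps the scan going if the log has odd bytes in it.
    with open(path, "r", errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            if wanted in line.lower():
                yield number, line.rstrip("\n")
```

Looping over grep_heartbeat("vpxa.log") and printing each pair gives roughly what grep -i -n would, where "vpxa.log" stands for wherever you saved your copy of the log.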

jb12345
Enthusiast

You want to keep your vCenter, hosts and clients reasonably close in build levels. You can download the update for vCenter from http://downloads.vmware.com. You'll have to browse for the update you want to install.

GlenB
Contributor

mmathurakani -- I had previously power cycled both the vCenter Server and the ESX server with no good effect. But, just to be sure, I did restart the management agents in ESX. Made no difference.

I know I don't understand how this whole thing hangs together, but the VI client running on 2 different machines connected directly to ESX never loses the connection. It is only the vCenter Server that loses the connection. Would not both cases be affected if the ESX server was not issuing heartbeats? It feels more to me like the vCenter server is the one with the problem.

Regards - Glen

GlenB
Contributor

dantastic -- two different machines running VI client connected directly to the ESX server never lose the connection, so I presume that means the networking is passing traffic on 902 and the ESX is issuing the heartbeats, right? If I run the VI client on one of those same 2 machines and connect it to the vCenter, then that connection shows (not responding) 2 minutes after the connection is made or restored. Would the heartbeats happen to be generated every 2 minutes? Why would one machine not see those heartbeats when running VI client connected to vCenter, when it DOES see them running VI client connected directly to ESX?

Regards - Glen

mircmmatgmail
Contributor

Hi Glen !

Did you check the vCenter IP address in: Administration / vCenter Server Settings / Runtime Settings / Managed IP Address?

I had a very similar problem: I had the wrong Managed IP, and changing it to the correct vCenter IP solved it.

Sincerely, Mirc

GlenB
Contributor

Interesting idea. I checked and that field was empty. It says that it is only necessary if you have multiple vCenter servers operating in a common environment. I only have one, so it probably shouldn't be necessary. Also, if this was necessary then my problem would likely have been there since I first started running the vCenter server, instead of being a "recent" occurrence. Anyway, I filled it in, stopped and started the VI client connected to the vCenter server ..... wait for it ..... but there was no change, problem is still there.

Regards - Glen
