VMware Cloud Community
grittyminder
Contributor

VMware Tools stops sending heartbeats, hours later guest OS freezes

Greetings,

I'm wondering if any experts could shed any light on this recent happening...

I have a VM running Ubuntu 8.04 with VMware Tools installed. (I don't know the version; if anybody knows a safe and reliable way to find this information, let me know. The last time I tried to fiddle with /etc/init.d/vmware-tools, the networking froze and I had to reboot the server.) This VM ran great for over two years. However, earlier today Virtual Center reported that the heartbeat to the VM had been lost. In the VI Client, the VM's performance statistics (CPU, memory, networking, and disk utilization) all looked normal, and because there was other, more pressing work to be done, the VM was left in this "heartbeat-less" state (the VM was in use, so rebooting was not a valid option). Approximately 4 hours after the heartbeat was initially lost, networking to the VM became unstable and eventually connectivity was lost altogether. I logged into the server console directly to force a reboot, but the server froze completely after I entered the reboot command. I was forced to run a hard reset on the VM.

Now I am trying to figure out why the server died on me. Here are the locations that I've checked so far:

1) The VM system log files - nothing out of the ordinary here.

2) The VMware tools log file - nothing out of the ordinary here.

3) The VM performance statistics in VI Client - nothing out of the ordinary here.

4) /var/log/hostd.log on the ESX host - around the time networking problems were first reported, a few of these log entries crop up:

Ticket issued for mks connections to user: vpxuser

I did an Internet search on the above log entry but I couldn't find anything meaningful.

I have not yet checked the vmware.log file in the VM directory because there is a lock on it and it can't be accessed. Apparently I will need to vMotion the machine in order to release the lock... but I am hoping that there might be some useful information inside this log file.

My current theory is that the version of VMware Tools installed in the guest OS malfunctioned and caused the VM to become unstable. How could this happen? Any ideas as to the cause of this problem?

7 Replies
AWo
Immortal

Have you already checked the logs from within the guest (/var/log/...)? When the heartbeat was gone and the guest became frozen, there might have been some process in the guest that broke the machine.

Can you keep "top" open on that guest via the console?
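
If keeping "top" open on the console is not practical, a rough alternative is to record a periodic snapshot of load and free memory, so there is something to read back after a freeze. The following is only a sketch under assumptions not stated in this thread: it assumes a Python 3 interpreter is available in the guest and that the standard Linux /proc/loadavg and /proc/meminfo files are readable. The function names are made up for illustration.

```python
import time


def parse_loadavg(text):
    """Return the 1-, 5- and 15-minute load averages from /proc/loadavg text."""
    fields = text.split()
    return tuple(float(value) for value in fields[:3])


def parse_meminfo(text, key="MemFree"):
    """Return the kB value for one key of /proc/meminfo text, or None if absent."""
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        if name.strip() == key and rest.split():
            return int(rest.split()[0])
    return None


def snapshot():
    """Build one timestamped status line from the live /proc files."""
    with open("/proc/loadavg") as handle:
        load = parse_loadavg(handle.read())
    with open("/proc/meminfo") as handle:
        free_kb = parse_meminfo(handle.read())
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    return "{} load={:.2f}/{:.2f}/{:.2f} memfree_kb={}".format(
        stamp, load[0], load[1], load[2], free_kb)


def watch(interval_seconds, count):
    """Print `count` snapshots, sleeping `interval_seconds` between them."""
    for _ in range(count):
        print(snapshot(), flush=True)
        time.sleep(interval_seconds)
```

Calling watch(10, 360), for example, would print one line every 10 seconds for an hour. Redirecting the output to a file keeps a record, although if the guest stops writing to disk entirely, that record would stop as well.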


AWo

VCP 3 & 4

\[:o]===\[o:]

=Would you like to have this posting as a ringtone on your cell phone?=

=Send "Posting" to 911 for only $999999,99!=

vExpert 2009/10/11 [:o]===[o:] [: ]o=o[ :] = Save forests! rent firewood! =
grittyminder
Contributor

Thank you for your reply.

> Have you already checked the logs from within the guest (/var/log/...)?
>
> When the heartbeat was gone and the guest became frozen, there might have been some process in the guest that broke the machine.

When I checked the logs the first time, I didn't see any entries of interest. I just now rechecked the logs and noticed something I hadn't before...

Basically, from around the same time the VM heartbeat stopped, it seems that all logging stopped for all logs (i.e. syslog, messages, etc.). There isn't a single log entry in any log that I can see until the server reboot hours later. This is very strange, because syslog and messages typically show a consistent and steady stream of log entries.
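
A gap like this can be located mechanically by comparing consecutive timestamps in a log. The sketch below assumes the traditional syslog stamp format ("Mmm dd HH:MM:SS" at the start of each line) and English month abbreviations. Because that format carries no year, the caller has to supply one. The function names and the 30-minute default threshold are illustrative choices, not anything taken from the logs in this thread.

```python
from datetime import datetime, timedelta


def parse_syslog_time(line, year):
    """Parse the leading 'Mmm dd HH:MM:SS' stamp of a line; None if it has none."""
    try:
        return datetime.strptime("{} {}".format(year, line[:15]),
                                 "%Y %b %d %H:%M:%S")
    except ValueError:
        return None


def find_gaps(lines, year, threshold=timedelta(minutes=30)):
    """Return (last_entry_before, first_entry_after) pairs for each silent period."""
    gaps = []
    previous = None
    for line in lines:
        stamp = parse_syslog_time(line, year)
        if stamp is None:
            continue  # skip lines without a recognizable timestamp
        if previous is not None and stamp - previous >= threshold:
            gaps.append((previous, stamp))
        previous = stamp
    return gaps
```

Passing an open file such as /var/log/syslog as `lines` would return the start and end of every silent period of at least 30 minutes, which makes it easier to compare the moment logging stopped against the time the heartbeat was reported lost.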

I also checked the VM's vmware.log and there were no entries at all from 12 hours prior to the server reboot.

So I guess the initial problem is probably not related to VMware Tools going strange, but I still can't pinpoint an exact cause.

AWo
Immortal

What are you running on this box? I would look at these applications first. Have there been any recent changes? You may also want to try running without the VMware Tools daemon (but keep the network drivers).


grittyminder
Contributor

BTW, can anyone tell me what this log entry in the VM's vmware.log means? The time here corresponds with when the VM's networking started going funky.

Ticket issued for mks connections to user: vpxuser

AWo
Immortal

Are there any messages showing something like this, as well?

"Current value 169812 exceeds soft limit 122880."


grittyminder
Contributor

> What are you running on this box? I would look at these applications first. Recent changes?

The VM is actually a simple, bare-bones, GUI-less iptables firewall/router. No extraneous services or applications have been installed (e.g. apache, mysql, postfix). As for changes, there is the occasional iptables update, but none have been made recently.

> You may also want to try to run without the VMware Tools daemon (but keep the network drivers).

So what would happen if I were to do this? Would High Availability still work properly for this machine?

> Are there any messages showing something like this, as well?
>
> "Current value 169812 exceeds soft limit 122880."

Yes, entries similar to the following appear throughout the entire log file:

'Memory checker' 30129072 warning] Current value 196612 exceeds soft limit 122880.
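
To see how far over the soft limit these warnings run, and whether the value grows over time, the numbers can be pulled out of the log with a short script. This is a minimal sketch that relies only on the message wording quoted above. The function name is hypothetical, and the unit of the values is not stated in this thread, so the script does not assume one.

```python
import re

# Matches the wording of the warning quoted above.
WARNING = re.compile(r"Current value (\d+) exceeds soft limit (\d+)")


def memory_warnings(lines):
    """Return (current, soft_limit, excess) for every matching log line."""
    results = []
    for line in lines:
        match = WARNING.search(line)
        if match:
            current = int(match.group(1))
            limit = int(match.group(2))
            results.append((current, limit, current - limit))
    return results
```

Feeding it the lines of the log file in question gives one tuple per warning. A steadily rising excess would suggest a leak, while a flat excess would suggest the limit is simply too low for normal operation.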

AWo
Immortal

>> You may also want to try to run without the VMware Tools daemon (but keep the network drivers).
>
> So what would happen if I were to do this? Would High Availability still work properly for this machine?

That is for testing purposes only, of course, to check whether the Tools are really involved or not.

>> Are there any messages showing something like this, as well?
>>
>> "Current value 169812 exceeds soft limit 122880."
>
> Yes, entries similar to the following seem to permeate throughout the entire log file:
>
> 'Memory checker' 30129072 warning] Current value 196612 exceeds soft limit 122880.

Increase the Service Console RAM (if you're using ESX).

