VMware Cloud Community
wila

Windows 2000 VM locked up overnight

One of the VMs on an ESX 3.0.1 host completely locked up last night.

When I came in this morning the VM was not responding to anything, so I decided to hit the "reset" button, as that way the machine would at least be back up after the reboot. This was at 09:05 am.

As additional information, the W2k server VM was rebooted a few hours before the lockup, as the last step needed to complete last month's Windows updates.

While trying to find out what happened, I'm getting more confused.

The VM has absolutely nothing in its eventlog that explains this. The system eventlog doesn't have any errors before the reboot.

Well it has this one: "The previous system shutdown at 1:08:41 AM on 24/07/2007 was unexpected. "

The system did show the screensaver this morning, so I'm not sure it was really shut down. Anyway, all of this would have been chalked up to "Windows update issues" if I hadn't noticed something else.

In the ESX host /root/ folder there's a new subfolder called "old_cores" which has nothing in it. There's no other core file.

Then the /var/log/messages file has the following lines:

Jul 22 01:01:02 ESXHOST syslogd 1.4.1: restart.

Jul 23 18:16:36 ESXHOST vmware-hostd\[1613]: Accepted password for user root from 127.0.0.1

Jul 23 20:45:48 ESXHOST vmware-hostd\[1613]: Accepted password for user root from 192.168.x.x

Jul 24 10:11:23 ESXHOST insmod: /lib/modules/2.4.21-37.0.2.ELvmnix/kernel/drivers/net/tg3.o: init_module: No such device

Jul 24 10:11:23 ESXHOST insmod: Hint: insmod errors can be caused by incorrect module parameters, including invalid IO or IRQ parameters. You may find more information in syslog or the output from dmesg

Jul 24 10:11:23 ESXHOST insmod: /lib/modules/2.4.21-37.0.2.ELvmnix/kernel/drivers/net/tg3.o: insmod eth0 failed

Jul 24 10:11:23 ESXHOST insmod: /lib/modules/2.4.21-37.0.2.ELvmnix/kernel/drivers/net/tg3.o: init_module: No such device

Jul 24 10:11:23 ESXHOST insmod: Hint: insmod errors can be caused by incorrect module parameters, including invalid IO or IRQ parameters. You may find more information in syslog or the output from dmesg

Jul 24 10:11:23 ESXHOST insmod: /lib/modules/2.4.21-37.0.2.ELvmnix/kernel/drivers/net/tg3.o: insmod eth1 failed

Jul 24 10:11:23 ESXHOST modprobe: modprobe: Can't locate module eth2

Jul 24 10:11:23 ESXHOST modprobe: modprobe: Can't locate module eth3

Jul 24 10:11:23 ESXHOST modprobe: modprobe: Can't locate module eth4

Jul 24 10:11:23 ESXHOST modprobe: modprobe: Can't locate module eth5

Jul 24 10:11:23 ESXHOST modprobe: modprobe: Can't locate module eth6

Jul 24 10:11:23 ESXHOST modprobe: modprobe: Can't locate module eth7

Jul 24 10:18:12 ESXHOST vmware-hostd\[1613]: Accepted password for user root from 127.0.0.1

Jul 24 10:31:32 ESXHOST snmpd\[1093]: Got trap from peer on fd 27

Jul 24 10:35:23 ESXHOST sshd\[32025]: Connection from 192.168.x.x port 2297

Jul 24 10:35:39 ESXHOST sshd\[32025]: Accepted password for foobar from 192.168.x.x port 2297 ssh2

Jul 24 10:35:39 ESXHOST sshd(pam_unix)\[32027]: session opened for user foobar by (uid=0)

Jul 24 10:36:01 ESXHOST su(pam_unix)\[32063]: session opened for user root by foobar(uid=588)
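
One way to check whether the odd entries really cluster around a single moment (for example the diagnostic export) is to group the syslog lines by timestamp. This is only a minimal sketch in Python, assuming the usual `Mon DD HH:MM:SS host program[pid]: message` layout seen above; the name `find_bursts` and the `min_count` threshold are made up for illustration and are not part of any VMware tool.

```python
import re
from collections import defaultdict

# Classic syslog prefix: "Jul 24 10:11:23 ESXHOST insmod: message".
# The "[pid]" part after the program name is optional.
SYSLOG_RE = re.compile(
    r"^(?P<stamp>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) (?P<tag>[^:]+?)(?:\[\d+\])?: (?P<msg>.*)$"
)


def find_bursts(lines, min_count=5):
    """Return (timestamp, entry count, programs) for every second
    that has at least min_count log entries, in file order."""
    by_stamp = defaultdict(list)
    for line in lines:
        match = SYSLOG_RE.match(line.strip())
        if match:
            by_stamp[match.group("stamp")].append(match.group("tag"))
    return [
        (stamp, len(tags), sorted(set(tags)))
        for stamp, tags in by_stamp.items()
        if len(tags) >= min_count
    ]
```

Feeding it the lines of /var/log/messages (for instance via `open("/var/log/messages")`) should make a burst like the one at 10:11:23 stand out, which can then be compared against the time the export was started.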

Some other strangeness:

\- the file /var/log/rpmpkgs is dated Jul 24, 2007, 04:00 am ?

\- the file /var/log/vmkernel is dated Jul 24, 2007, 10:11 am ?

\- the folder /root/old_cores is dated Jul 24, 2007, 10:11 am

Note that I was actually logged into the server around 10:11 am, and everything worked fine at that time afaik, as I was researching the lockup.

The hostd log has a lot of lines from yesterday, but I don't know how to interpret them. Here's a snippet:

\[2007-07-24 05:45:26.269 'Vmomi' 90561456 info] Activation \[N5Vmomi10ActivationE:0xb5c31c8] : Invoke done \[retrieveContents] on \[vmodl.query.PropertyCollector:ha-property-collector]

\[2007-07-24 05:45:26.269 'Vmomi' 90561456 info] Throw vmodl.fault.ManagedObjectNotFound

\[2007-07-24 05:45:26.269 'Vmomi' 90561456 info] Result:

(vmodl.fault.ManagedObjectNotFound) {

obj = 'vim.Task:haTask-160-vim.VirtualMachine.reconfigure-8677'

msg = ""

}

\[2007-07-24 05:45:26.279 'Vmomi' 9792432 info] Activation \[N5Vmomi10ActivationE:0xb59ab90] : Invoke done \[retrieveContents] on \[vmodl.query.PropertyCollector:ha-property-collector]

\[2007-07-24 05:45:26.279 'Vmomi' 9792432 info] Throw vmodl.fault.ManagedObjectNotFound

\[2007-07-24 05:45:26.280 'Vmomi' 9792432 info] Result:

(vmodl.fault.ManagedObjectNotFound) {

obj = 'vim.Task:haTask-160-vim.VirtualMachine.powerOn-8680'

msg = ""

}
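
To get an overview of which objects hostd reported as missing, the fault blocks can be pulled out of the log. The following is a rough sketch only, assuming the layout in the snippet above (a `(vmodl.fault.ManagedObjectNotFound) {` line followed by an `obj = '...'` line); `missing_objects` is a hypothetical helper name, not something shipped with ESX.

```python
import re

# Timestamp at the start of a hostd log line: "[2007-07-24 05:45:26.269 ..."
STAMP_RE = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) ")
# Object reference inside a fault block: obj = 'type:id'
OBJ_RE = re.compile(r"obj = '(?P<type>[^:']+):(?P<id>[^']+)'")
FAULT_START = "(vmodl.fault.ManagedObjectNotFound)"


def missing_objects(lines):
    """Return (timestamp, object type, object id) for each
    ManagedObjectNotFound result found in the given log lines."""
    results = []
    last_stamp = None
    in_fault = False
    for raw in lines:
        line = raw.strip()
        stamp = STAMP_RE.match(line)
        if stamp:
            # Remember the most recent timestamped line; the fault
            # block itself carries no timestamp.
            last_stamp = stamp.group(1)
            in_fault = False
        if line.startswith(FAULT_START):
            in_fault = True
        elif in_fault:
            obj = OBJ_RE.search(line)
            if obj:
                results.append((last_stamp, obj.group("type"), obj.group("id")))
                in_fault = False
    return results
```

Counting the results per object type or per hour would show whether these faults are a constant background pattern or only appear around the time of the lockup.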

This is a standalone ESX host, with no HA configured afaik.

The host has now been up for 54 days. None of the other VMs are having issues, nor does the problem VM have issues after it has been rebooted.

I've also exported diagnostic data from the VIC and am still ploughing through the logs... I'm starting to suspect the "export to diagnostic data" to have generated the additional weirdness in /var/log/messages as it is all from around that time.

But does that actually make sense?

The server is an FSC RX300S3 and is on the HCL.

Thanks for any hints.

| Author of Vimalin. The virtual machine Backup app for VMware Fusion, VMware Workstation and Player |
| More info at vimalin.com | Twitter @wilva

Accepted Solutions
esiebert7625

Well, I would suspect an OS issue rather than ESX. If the VM has been up a long time, perhaps a memory leak. Was this VM a conversion from a physical server? What apps are running on the server?

wila

Hi Eric,

Thanks for the reply. Yeah, like I said, I would chalk it up to Windows update, but I don't understand where the ESX host errors in the /var/log/messages file are coming from.

Maybe totally unrelated, but maybe not.

The VM is indeed a conversion from a physical server, but all non-existent devices have been removed. Additionally, irrelevant hardware such as parallel and serial ports has been removed, even in the vBIOS. It's a single-vCPU VM.

The most important application on that VM is IIS, as it's hosting a few web sites and some custom web applications.

Several months ago there were similar lockup issues, and at that time I noticed some "virtual hardware wedged" messages in /var/log/vmkwarning.

After that, the host was updated to the latest patch level and the VM was cleansed of old devices and irrelevant services. The VM has run flawlessly for months since then, with only the MS reboot Wednesday (if required) as a short outage.

The VM is mostly doing nothing, it isn't busy at all, but I'll go and check more logs (IIS in this case).

esiebert7625

Sounds like you did all the recommended clean-up after a conversion. I always take the serial/parallel and floppy out of the BIOS also. I have had IIS5 completely lock up servers before, usually as the result of memory leaks. If one of the patches was for IIS, maybe that caused the problem you saw. Was there anything helpful in the vmware.log?

wila

Nope, the vmware.log completely missed the event; there's a gap until the moment I hit the reset button.

Jul 23 20:00:41.002: vcpu-0| GuestRpc: Channel 3, registration number 1, guest application toolbox-dnd.

Jul 23 20:00:41.002: vcpu-0| DISKUTIL: scsi0:1 : toolsVersion = 7202

Jul 23 20:00:41.002: vcpu-0| DISKUTIL: scsi0:0 : toolsVersion = 7202

Jul 24 09:05:35.995: vmx|

Jul 24 09:05:35.995: vmx|

Jul 24 09:05:35.995: vmx| VMXRequestReset

Jul 24 09:05:35.995: vmx| Stopping VCPU threads...

Jul 24 09:05:35.995: vcpu-0| VMMon_WaitForExit: vcpu-0: worldID=1173

Jul 24 09:05:36.030: mks| Async MKS thread is exiting

Jul 24 09:05:36.030: vmx| DnD rpc already set to 0
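
For a longer vmware.log it can be easier to let a script point out such silent periods. Below is a small sketch, assuming the `Mon DD HH:MM:SS.mmm: thread| message` prefix shown in the excerpt above. The log lines carry no year, so the caller has to supply one; `find_gaps` and its one-hour default threshold are illustrative choices. Note that a gap only shows that nothing was written to the log: an idle VM can also be quiet for hours, so this narrows down a time window rather than proving a hang.

```python
import re
from datetime import datetime, timedelta

# vmware.log prefix as seen above: "Jul 23 20:00:41.002: vcpu-0| ..."
STAMP_RE = re.compile(r"^([A-Z][a-z]{2} \d{2} \d{2}:\d{2}:\d{2}\.\d{3}): ")


def find_gaps(lines, year, threshold=timedelta(hours=1)):
    """Return (last entry before gap, first entry after gap, duration)
    for every pause between consecutive entries of at least threshold."""
    gaps = []
    previous = None
    for line in lines:
        match = STAMP_RE.match(line)
        if not match:
            continue  # skip lines without a timestamp
        current = datetime.strptime(
            f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S.%f"
        )
        if previous is not None and current - previous >= threshold:
            gaps.append((previous, current, current - previous))
        previous = current
    return gaps
```

Applied to the excerpt above with `year=2007`, this reports a single gap between the last DISKUTIL line on Jul 23 and the reset request on Jul 24.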

Maybe we should start thinking about rebuilding the VM from scratch, but then of course using Windows Server 2003 R2. Unfortunately we are still tied to the .asp processor, as the applications are not .NET compatible yet.

There was only one Windows 2000 patch for this machine, KB926122, and it addresses an Active Directory issue, which doesn't seem relevant as this is a standalone VM. One other thing I forgot to mention is that the VM runs F-Prot antivirus and the AV was expiring in 3 days. A previous version of F-Prot had memory leaks when showing the "reminder popup"; maybe that's still not fixed? I updated the AV to make sure this is not biting us.

As for taking out the serial/parallel interfaces, you are to blame, as I read about it in one of your comments. Thanks for that, as it makes a lot of sense. Hmm.. I see now that the vmx still has a parallel port; I'll remove that along with the floppy and reboot. It does not explain the lockup, as it must have been like this for a few months at least, but never say never...
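
Leftover devices like this are easy to overlook in a .vmx file, so a quick scan can help. The sketch below assumes the usual `key = "value"` layout and that a device is enabled by a `<device>.present = "TRUE"` entry; the device name prefixes (`floppy`, `parallel`, `serial`) and the helper names are assumptions for illustration and should be checked against the actual file.

```python
import re

# One setting per line: key = "value"
VMX_LINE_RE = re.compile(r'^\s*([^\s=#]+)\s*=\s*"(.*)"\s*$')


def parse_vmx(text):
    """Parse .vmx text into a dict of settings; unparsable lines are skipped."""
    settings = {}
    for line in text.splitlines():
        match = VMX_LINE_RE.match(line)
        if match:
            settings[match.group(1)] = match.group(2)
    return settings


def present_devices(settings, kinds=("floppy", "parallel", "serial")):
    """Return the sorted names of devices of the given kinds that are
    marked as present."""
    found = []
    for key, value in settings.items():
        name, _, attr = key.partition(".")
        if (
            attr.lower() == "present"
            and value.upper() == "TRUE"
            and name.lower().startswith(kinds)
        ):
            found.append(name)
    return sorted(found)
```

Running `present_devices(parse_vmx(text))` on the contents of the VM's .vmx file lists what would still need to be removed while the VM is powered off.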

thanks!

Wil

wila

It never happened again... (and I did take out the floppy and parallel controller)

wila

I was too quick in my conclusion: it did happen again, twice this week to be precise. When it happens the VM just freezes; it still shows as running in the VIC, but the VMware Tools are no longer listed.

The host was updated to 3.0.2 a couple of weeks ago with all patches. There is absolutely nothing in the logs anywhere, neither on the VM nor on the host.

In the early hours of Monday morning it flaked out. I changed the memory reservation to full and updated VMware Tools to the latest version from 3.0.2, as it was still on the tools from 3.0.1.

Early this morning, same thing again. But I noticed yesterday that the VM was defined in the VMX file as a "Windows 2000 Advanced Server" instead of a "Windows 2000 Server".

So I changed that to plain W2k Server, as it is just a normal Windows 2000 Server.

Could that cause the vm to freeze?
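
Checking this setting across VMs can be scripted. The following is a minimal sketch, assuming the guest type is stored in the .vmx file under the `guestOS` key; the identifiers `win2000serv` and `win2000advserv` used in the example are assumptions about how the two Windows 2000 variants are spelled and should be verified against a real file. `check_guest_os` is a made-up helper name.

```python
import re

# guestOS = "identifier" on a line of its own; key matched case-insensitively
GUEST_OS_RE = re.compile(
    r'^\s*guestOS\s*=\s*"([^"]*)"', re.IGNORECASE | re.MULTILINE
)


def guest_os(vmx_text):
    """Return the configured guest OS identifier, or None if not set."""
    match = GUEST_OS_RE.search(vmx_text)
    return match.group(1) if match else None


def check_guest_os(vmx_text, expected):
    """Compare the configured guest OS with the expected identifier."""
    actual = guest_os(vmx_text)
    if actual is None:
        return "guestOS is not set"
    if actual.lower() != expected.lower():
        return f"guestOS is {actual!r}, expected {expected!r}"
    return "ok"
```

For the situation described above, `check_guest_os(text, "win2000serv")` would flag a file that still says `win2000advserv`.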

esiebert7625

I suppose it's possible, here's what that setting is used for...

This screen asks which operating system you plan to install in the virtual machine. Select both an operating system and a version. The New Virtual Machine Wizard uses this information to:

  1. Select appropriate default values, such as the amount of memory needed

  2. Name files associated with the virtual machine

  3. Adjust settings for optimal performance

  4. Work around special behaviors and bugs within a guest operating system

If the operating system you plan to use is not listed, select Other for both guest operating system and version.

-=-=-=-=-=-=-=-=-=-=-==-=-=-=-=-=-=-=-=-=-=-=-

Thanks, Eric

Visit my website:

-=-=-=-=-=-=-=-=-=-=-==-=-=-=-=-=-=-=-=-=-=-=-

wila

Thanks Eric for your support.

Well, if that's not the issue then I'm really starting to run out of arrows to point at the VM, and am wondering what I could do to resolve it. It is starting to become a little urgent, as end users are getting a little annoyed with the VM dying "all the time".

I understand that mislabeling the VM with the wrong OS could have adverse effects, and you are confirming my suspicion that this might have the strangest side effects. Especially as W2k Advanced Server isn't quite the same under the hood as plain Windows 2000 Server. But locking up the entire VM isn't something I have seen happen much EXCEPT with this one.

FWIW, the VM is in the DMZ, but so far that seems to be unrelated.

Strangely enough, the VM was P2V-ed using VMware Converter, which IIRC sets the OS type for you. When I noticed the mismatch yesterday, I decided to wait for a low-usage time slot to change the setting to the more logical alternative.

Unfortunately, the VM picked its own slot this morning at 4 am. Not a big issue, as I was actually working at the time (ack!) so I could fix that "mislabeled" VM in a jiffy.

If the problem keeps occurring, then the only solution I see now is to rebuild the entire VM from scratch using Windows 2003 R2. Something that, in this case, will also take quite some energy and time.

esiebert7625

I would start with a repair of Windows rather than a re-install. Before that, you might try completely uninstalling VMware Tools and re-installing it. Also, what does your VM's hardware config look like (CPU/memory/disk), and are you using the BusLogic SCSI controller? What type of storage is your VM's disk file located on? (FC/Local/iSCSI/NFS)

wila

This morning it happened again, so I am now preparing a rebuild on Windows 2003 from scratch. I'll also make a few more adjustments to improve the logging facilities, now that I am making changes to the setup anyway.

I really wish staying on the troubleshooting path were still an option; unfortunately it is not. The VM isn't reliable right now, for reasons unknown. I've tweaked it a number of times, scanned log files etcetera, all to no avail. In my lab I have similar setups that work without a hitch. The same goes for a Windows 2003 web server set up on the same host as the problem VM. It just runs as advertised.

In order not to lose the faith of customers, I need it resolved now, instead of continuing with the trial-and-error method I have had to use for tracking the issue so far. Going back to the customers, saying it happened again this morning and that I'm now trying yet another configuration option, is starting to get silly. I have personally lost faith in fixing the VM. I'm sure it is possible, but I don't have the time to keep trying to fix it; I need it addressed so I can get on with my normal day-to-day job.

Thanks again, your help is really appreciated.
