VMware
28 Replies. Last post: Apr 4, 2006 5:43 PM by petr

Disk turns read-only: Workstation 5 crash on RHEL4.1 w/ EXT3 journal error posted: Aug 12, 2005 5:21 PM

JPV (Novice, 24 posts since Jun 16, 2005)
I have VMware Workstation 5 running on a host OS of CentOS-4.1 (RHEL 4.1 rebuild, see www.centos.org). Host OS is a *minimal* install, using Xfce4 and as little else as I could manage. SELinux is disabled.

For the second time now, I've crashed as follows. The first time was a few weeks ago; I ignored it and rebooted, so no info there. I *think* I was running a Windows VM that time too, though. (Damn Windows, can't even get it stable when running it on Linux! :-) )

This time I was running 2 Win2K server VMs (linked clones) with 3 other (various Linux) VMs paused. The box is a plain Dell Optiplex GX-260 P4-500 with 1G RAM and 80G IDE. It is company owned and I haven't messed with it (i.e. no overclocking). The machine is on an old Compaq PS/2 KVM, so no USB BS. The Windows VMs are fully patched (ironically I was testing a patch management tool in them).
VMware tools are installed on the 2 Windows VMs and they are using auto mouse-grab. I am not running full-screen.

A soft reboot of the host machine fixed the problem, for now. Naturally, on reboot an fsck was forced on that partition, and it completed with a couple of minor inode complaints.

I can't try another host as I don't have anything lying around with the horsepower needed. But up until recently that box was a Win2KPro workstation that worked fine except that it had Windows on it.

Possibly related to the following, though I have NOT been able to tie it to any mouse movement (haven't tried very hard either):
http://www.vmware.com/community/thread.jspa?threadID=17535
http://www.vmware.com/community/thread.jspa?threadID=7598
http://www.linuxquestions.org/questions/history/342887
http://www.webservertalk.com/archive235-2004-6-285876.html
http://ist.uwaterloo.ca/~kscully/crow.html

Symptom:
Dialog box: 'Operation on file "foo.vmdk" failed (Read-only file system.) Choose retry to attempt again… [Abort] [Continue] [Retry]'

Dmesg begins " start_transaction: Journal has aborted" then has 224: "EXT3-fs error (device hda4) in start_transaction: Journal has aborted"

Console had: "hostname kernel: journal_get_undo_access: No memory for committed data"

/var/log/messages has nothing relevant.

Running everything under non-root user. VMware was only running for 2 days or so, nothing in the vmware.log that looks relevant, head/tail as follows:

/home/jp/vmware/Windows_2000_Server# head vmware.log
Aug 11 00:41:22: vmx| Log for VMware Workstation pid=31004 version=5.0.0 build=build-13124 option=Release
Aug 11 00:41:22: vmx| Command line: "/usr/lib/vmware/bin/vmware-vmx" "-@" "pipe=/tmp/vmware-jp/vmx0e8fab5e213e2683;vm0e8fab5e213e2683" "/home/jp/vmware/Windows_2000_Server/Windows_2000_Server.vmx"
Aug 11 00:41:22: vmx| UI Connecting to pipe '/tmp/vmware-jp/vmx0e8fab5e213e2683' with user '(null)'
Aug 11 00:41:22: vmx| pcpu #0 CPUID numEntries=2 GenuntelineI
Aug 11 00:41:22: vmx| pcpu #0 CPUID version=0xf27 id1.edx=0xbfebfbff id1.ecx=0x400 id1.ebx=0x20809
Aug 11 00:41:22: vmx| pcpu #0 CPUID id80.eax=80000004 id81.edx=0x0 id81.ecx=0x0
Aug 11 00:41:22: vmx| CPUID id1.edx: 0xbfebfbff id1.ecx: 0x400 id81.edx: 0 id81.ecx: 0
Aug 11 00:41:22: vmx| changing directory to /home/jp/vmware/Windows_2000_Server/.
Aug 11 00:41:22: vmx| Config file: /home/jp/vmware/Windows_2000_Server/Windows_2000_Server.vmx
Aug 11 00:41:22: vmx| VMXVmdbCbVmVmxExecState: Exec state change requested to state poweredOn without reset

/home/jp/vmware/Windows_2000_Server# tail vmware.log
Aug 11 01:54:23: vmx| SCSI0:0: Command WRITE(10) took 6.118 seconds (ok)
Aug 11 01:54:31: vmx| SCSI0:0: Command WRITE(10) took 2.540 seconds (ok)
Aug 11 01:54:31: vmx| SCSI0:0: Command WRITE(10) took 2.606 seconds (ok)
Aug 11 01:56:47: vmx| SCSI0:0: Command WRITE(10) took 1.219 seconds (ok)
Aug 11 01:56:47: vmx| SCSI0:0: Command WRITE(10) took 1.379 seconds (ok)
Aug 11 01:56:47: vmx| SCSI0:0: Command WRITE(10) took 1.501 seconds (ok)
Aug 11 06:59:12: vmx| DISKLIB-LIB :numIOs = 50000 numMergedIOs = 2560 numSplitIOs = 1565
Aug 11 23:18:39: vmx| SCSI0:0: Command WRITE(10) took 1.221 seconds (ok)
Aug 12 00:41:21: vmx| LICENSE using: '/home/jp/.vmware/license.ws.5.0'
Aug 12 01:16:22: vmx| DISKLIB-LIB :numIOs = 100000 numMergedIOs = 4150 numSplitIOs = 3215

petr (Champion, 7,218 posts since Jul 10, 2003)
Post the 'dmesg' output from the host when this happens. It has nothing to do with the guest; VMware just wants a stable I/O subsystem, and your host remounts harddisk read-only when it sees someone really wants to read & write files a lot...
mattlav (Novice, 12 posts since Dec 18, 2004)
Another thing to check would be the SMART status of the drive. The only time that I have had the system remount RO has been when the disk is about to fail. If that drive has about had it then you will see the failures in the SMART output and can grab the data fast.

Matthew
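A minimal sketch of the SMART check suggested above, assuming the smartmontools package is installed on the host and that the disk is the IDE device /dev/hda (taken from the "device hda4" message in the first post). `smart_verdict` is a hypothetical helper name, and the PASSED string it looks for is the overall-health line that `smartctl -H` prints for ATA disks.

```shell
#!/bin/sh
# Sketch only: classify the overall-health line printed by `smartctl -H`.
# smart_verdict is a hypothetical helper, not part of smartmontools.
smart_verdict() {
    # Read smartctl output on stdin and report a one-word result.
    if grep -q 'test result: PASSED'; then
        echo "ok"
    else
        echo "suspect"
    fi
}

# Typical use on the host (needs root):
#   smartctl -H /dev/hda | smart_verdict
#   smartctl -a /dev/hda    # full attribute table and error log
```

A "suspect" result only means the PASSED line was not seen; the full `smartctl -a` output is what actually shows reallocated or pending sectors.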
sacolcor (Lurker, 3 posts since Aug 19, 2005)
I have also had this crash multiple times, running CentOS-4.1 hosting WinXPSP2. It seems to be most common during heavy disk I/O, particularly when installing large service packs. The dmesg is the same as the one JPV noted.

This would seem to be a serious problem; it's causing filesystem corruption on the host (and probably the guest, too). And if it's happening on CentOS4, RHEL4 is probably similarly affected.

petr: You indicated that the host "remounts harddisk read-only when it sees someone really wants to read & write files a lot..." Could you elaborate a bit? I don't see how that could be considered correct behavior.

Thanks!
petr (Champion, 7,218 posts since Jul 10, 2003)
It is not correct behavior, but it is a bug in your *host*; VMware only triggers it because it needs huge I/O bandwidth for the guest OS.

BTW, I have no idea which dmesg output from JPV you are talking about. I see no 'dmesg' output anywhere. /var/log/messages of course won't show anything: once the disk is remounted read-only, there is no way to update /var/log/messages. You must run 'dmesg' while the disk is remounted read-only, and write the text down on paper by hand...
sacolcor (Lurker, 3 posts since Aug 19, 2005)
The line I was referring to was:

Dmesg begins " start_transaction: Journal has aborted" then has 224: "EXT3-fs error (device hda4) in start_transaction: Journal has aborted"

You indicate that this is a known bug for the Linux kernel...can you provide a link to a bug entry or mailing list record where we can get more information?
petr (Champion, 7,218 posts since Jul 10, 2003)
The real error is before this message... You need 'dmesg -s131072' or something like that to find the real cause. The first message is truncated at the beginning because 'EXT3-fs error (device hda4) in' did not fit into the dmesg buffer size you are using.

A Google search for 'in start_transaction: journal has aborted' returns over 600 hits... I have no idea which particular instance you are hitting until you can provide all the error messages. If you cannot find it with 'dmesg -s131072', you may have to use a serial console, rebuild your kernel with a bigger dmesg buffer, or run netconsole to log messages over the network to some other system.
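As a rough sketch of the capture step, the filter below keeps everything up to and including the first "Journal has aborted" line, since the real cause should sit before it. `lines_before_abort` is a hypothetical helper name, and the /dev/shm path in the comments is an assumption about where the host still has a writable tmpfs once the disk has gone read-only.

```shell
#!/bin/sh
# Sketch: print input up to and including the first "Journal has aborted"
# line, then stop. lines_before_abort is a hypothetical helper name.
lines_before_abort() {
    awk '{ print } /Journal has aborted/ { exit }'
}

# On the affected host, with the larger buffer size mentioned above:
#   dmesg -s131072 | lines_before_abort
# With the disk read-only, write to something still writable, e.g. tmpfs
# (the path is an assumption):
#   dmesg -s131072 | lines_before_abort > /dev/shm/dmesg-before-abort.txt
```

If the first line printed is itself a truncated message, the kernel buffer has already wrapped and one of the other options (serial console, netconsole) is needed.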
sacolcor (Lurker, 3 posts since Aug 19, 2005)
Will try that the next time this happens.

I suspect that this console message might be the original error, or close to it:
"<hostname> kernel: journal_get_undo_access: No memory for committed data".
petr (Champion, 7,218 posts since Jul 10, 2003)
Yes, it probably is the original error source. You've run out of physical memory. Add more memory, run smaller VMs, or configure the system for more aggressive swapping and releasing of memory. The first two are obvious; the third can be configured via /proc/sys/vm/swappiness and /proc/sys/vm/min_free_kbytes (and partially via /proc/sys/vm/lowmem_reserve_ratio). The bigger the value you put into 'min_free_kbytes', the better. For a start I would try doubling the current value. 'swappiness' is 60 by default; you may try 70 or maybe an even bigger value (100 is the max).
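The starting values above can be turned into commands with a small dry-run helper. This is only a sketch: `suggest_vm_tuning` is a hypothetical name, it prints the `sysctl -w` lines rather than applying them, and the numbers are just the "double it" and "try 70" starting points from the post, not tuned recommendations.

```shell
#!/bin/sh
# Sketch: print (not apply) sysctl commands for the suggested starting
# values: double min_free_kbytes and raise swappiness from 60 to 70.
# suggest_vm_tuning is a hypothetical helper; it changes nothing itself.
suggest_vm_tuning() {
    current_min_free=$1
    echo "sysctl -w vm.min_free_kbytes=$((current_min_free * 2))"
    echo "sysctl -w vm.swappiness=70"
}

# On the host:
#   suggest_vm_tuning "$(cat /proc/sys/vm/min_free_kbytes)"
# Review the printed lines, then run them as root if they look sane.
```

Values set with `sysctl -w` do not survive a reboot; making them permanent (for example via /etc/sysctl.conf) is a separate step.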
petr (Champion, 7,218 posts since Jul 10, 2003)
Memory is also used by disk caches, and it can happen that ext3 needs free memory when memory is full of dirty pages belonging to the VMs & their vmdk files, so no reasonable amount can be freed at that moment without doing I/O. And the I/O cannot be done, as all this lives on one filesystem :-( Increasing swappiness should force the host to start writing dirty pages out sooner.
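One way to watch for the situation described here is to look at how much memory is free versus dirty while the VMs are doing heavy I/O. The sketch below only filters /proc/meminfo for the MemFree, Dirty and Writeback fields; `meminfo_pressure` is a hypothetical helper name, and what counts as "too much dirty memory" is left to the reader.

```shell
#!/bin/sh
# Sketch: show free memory alongside dirty/writeback page totals from
# /proc/meminfo-style input. meminfo_pressure is a hypothetical helper.
meminfo_pressure() {
    awk '$1 == "MemFree:" || $1 == "Dirty:" || $1 == "Writeback:" { print $1, $2, $3 }'
}

# On the host, while the guests are busy:
#   meminfo_pressure < /proc/meminfo
```

A very small MemFree next to a very large Dirty figure would be consistent with the explanation above, though it does not by itself prove that this is what triggered the journal abort.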

No application should be able to cause the problem you see on a correct kernel: the kernel should have started swapping long ago, and if you run out of both physical memory and swap, it should have killed the offending application long ago, not destroyed/dismounted/damaged the filesystem. But from your messages it seems that you did not run out of virtual memory; the kernel just tried to be smart about writing data to disk, and it was so smart that when it decided to write them out, it was too late...

You might want to determine which device has the problem (say /dev/sda1), run 'tune2fs -l /dev/sda1' to find the journal inode number (usually 8), and then run [code]echo 'stat <8>' | /sbin/debugfs -f - /dev/sda1[/code] to find the journal size (default is 32MB).
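The two lookups above can be chained so the inode number does not have to be copied by hand. This is a sketch under assumptions: `journal_inode` and `journal_size_mb` are hypothetical helper names, and the parsing relies on `tune2fs -l` printing a "Journal inode:" line and on the debugfs `stat` output containing a "Size: <bytes>" field, which may differ between e2fsprogs versions.

```shell
#!/bin/sh
# Sketch: extract the journal inode from `tune2fs -l` output and the
# journal size from debugfs `stat` output. Both helpers are hypothetical.
journal_inode() {
    # Expects a line such as "Journal inode:            8" on stdin.
    awk -F: '/^Journal inode/ { gsub(/[ \t]/, "", $2); print $2 }'
}

journal_size_mb() {
    # Takes the first "Size: <bytes>" field and converts it to whole MB.
    grep -o 'Size: [0-9]*' | head -n 1 | awk '{ print int($2 / 1048576) }'
}

# On the host (as root; /dev/sda1 is the example device from the post):
#   inode=$(tune2fs -l /dev/sda1 | journal_inode)
#   echo "stat <$inode>" | /sbin/debugfs -f - /dev/sda1 | journal_size_mb
```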

If it looks too small, you may want to use tune2fs to remove the journal from ext3 (it becomes ext2... 'tune2fs -O ^has_journal /dev/sda1') and add a bigger one to the filesystem ('tune2fs -J size=<size you want> -j /dev/sda1'). Keep a rescue CD near the computer, as tune2fs -O ^has_journal will probably require a reboot, and it is possible that you have a kernel without ext2 support, so you may have to run the second tune2fs from the installer's rescue session.
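Because the journal swap is easy to get wrong, a dry-run helper that only prints the two tune2fs commands may be useful. This is a sketch: `journal_resize_cmds` is a hypothetical name, the device and size are placeholders supplied by the caller, and the size is in megabytes, which is the unit `tune2fs -J size=` expects.

```shell
#!/bin/sh
# Sketch: print (not run) the two tune2fs commands for replacing the
# journal with a bigger one. journal_resize_cmds is a hypothetical helper.
journal_resize_cmds() {
    dev=$1
    size_mb=$2
    # Reject anything that is not a plain whole number of megabytes.
    case $size_mb in
        ''|*[!0-9]*)
            echo "size must be a whole number of megabytes" >&2
            return 1
            ;;
    esac
    echo "tune2fs -O ^has_journal $dev"
    echo "tune2fs -J size=$size_mb -j $dev"
}

# Example (device and size are placeholders):
#   journal_resize_cmds /dev/sda1 128
```

Printing rather than executing leaves room to boot the rescue CD first and run the commands from there, as cautioned above.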
petr (Champion, 7,218 posts since Jul 10, 2003)
min_free_kbytes = 957 is definitely too low for a 1GB system. You should have something around 2500 (2.5MB) in that file for a 1GB system, more if you have slow disks and/or a fast processor.
