VMware Communities
JohnWiegley
Contributor

Seeing a *lot* of virtual disk corruption

I've been using VMware Fusion on the Mac since it first came out. This time around, I'm using VMware Fusion 3.0 on Snow Leopard 10.6.0.

On all my VMs (I have 13), I'm running CentOS 5.3 as the guest OS.

My problem is that I'm seeing corrupted filesystems very frequently. As in, today I went to boot up my 13 VMs, and 3 of them had severe filesystem damage. I will probably have to throw them away and reconstruct.

This happens to me a LOT. I've had to rebuild certain VMs several times now. I have one that I use only for connecting to Cisco VPN, and it's suffered from corruption 5 separate times now. Fortunately, none of those was so bad fsck couldn't handle it, but twice this week I had situations where fsck couldn't restore the machine to a working state.

I've set the disk I/O mode for all of these VMs to unbuffered. Is there something else I can do to ensure they stay in a valid state? I've already stopped suspending them, since that seems like asking for trouble.

I'm a bit shocked at how unstable my virtualization environment is, given that I depend on it to get real work done. It looks like I'm going to lose this evening rebuilding two of the machines that just failed on startup. And the weird thing is, both of these machines rebooted just fine yesterday.

This started happening on my MacBook Pro, and now it happens on a Mac Pro with a hardware RAID-10 controller.

Is there any way to make VMware stable?

Thanks, John

26 Replies
JohnWiegley
Contributor

Ok, this problem definitely seems to be a result of using Suspend/Resume.

For weeks now I've been halting all 7 of my VMs every night, and running a full fsck on each of them once a week. No corruption anywhere.

Yesterday I suspended all of the machines for the first time, rather than halting them. Today I started them back up, and was using them all night long. Then, just now, I decided to drop them all down to init 1 and fsck them again. Guess what: 3 of the machines had filesystem corruption.

So, it looks like suspend/resume may not be reliable where CentOS is concerned.
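
For anyone who wants to script the weekly check described above, here is a minimal sketch in Python. It assumes the exit-status bitmask documented in the fsck(8) man page, and that the filesystem-specific checker accepts -n for a read-only pass (e2fsck does). The names decode_fsck_status and check_readonly are made up for illustration, and the device path in the usage note is only a placeholder.

```python
import subprocess

# Exit-status bits as documented in the fsck(8) man page.
FSCK_FLAGS = {
    1: "errors corrected",
    2: "system should be rebooted",
    4: "errors left uncorrected",
    8: "operational error",
    16: "usage or syntax error",
    32: "checking canceled by user request",
    128: "shared-library error",
}


def decode_fsck_status(status):
    """Return the list of conditions encoded in an fsck exit status."""
    if status == 0:
        return ["no errors"]
    # The status is a bitmask, so several conditions can be set at once.
    return [text for bit, text in sorted(FSCK_FLAGS.items()) if status & bit]


def check_readonly(device):
    """Run a read-only check on an unmounted device (hypothetical helper).

    Assumes the underlying checker honours -n (open read-only, answer "no").
    """
    status = subprocess.call(["fsck", "-n", device])
    return decode_fsck_status(status)
```

Calling check_readonly("/dev/sda1") from single-user mode, with the real device name substituted, would return a list such as ["errors left uncorrected"] when the filesystem is damaged, which makes it easier to log which VMs came back dirty after a resume.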

John

cdc1
Expert

I know you've found a workaround for your issue, but I'm curious, and I'm wondering if you would take a moment to answer a quick question.

What disk types are you using for your CentOS VMs? (That is: monolithic sparse (the default), monolithic preallocated, split sparse, or split preallocated?)

JohnWiegley
Contributor

The disks that are seeing corruption are all dynamically allocated and split into 2 GB chunks.

JohnWiegley
Contributor

I should note also that these disks are IDE, but I've seen the same issues in the past with SCSI.

Pat_Lee
Virtuoso

John,

Please file a formal support request so this gets tracked by support and we can work with you to figure out what is happening.

http://communities.vmware.com/docs/DOC-9689

We want to understand what is happening and how to get to the bottom of this.

Thanks,

Pat

JohnWiegley
Contributor

Can you perhaps walk me through how to do that? I went to the DOC you linked, which led me into a maze of links where the only thing I could find was "Create customer support request", and the only "categories" it offers relate to issues like problems downloading, getting my license to work, etc. Isn't there a "file a bug" page?
