VMware Communities
koiaka
Contributor

Two VMs running on the same host on a RAID 0 disk produce a kernel panic

Hi,

We have a Mac Pro with 4 GB of RAM and 2 TB of storage in four 500 GB disks. We need to run several VMs at the same time. To gain performance, we installed them on separate disks. We usually run up to four VMs without problems.

Performance with separate disks is good, but we wanted to test whether we could improve it by configuring the disks as RAID 0. We found that one VM running alone is much faster than in a normal disk configuration, but the moment we start a second VM, we either get a kernel panic or the computer hangs completely.

It is not related to the guest OS, since the problem happens with two XP VMs, with one XP and one Linux VM, or with two Linux VMs.

Any experiences with this problem?

Thanks,

14 Replies
Pat_Lee
Virtuoso

How are you creating the RAID set? How long does it take to reproduce the problem?

One thing to try is to go into Preferences and change the performance setting to "Optimize for Mac OS application performance" instead of the default, to see if this helps.

Pat

koiaka
Contributor

The RAID 0 set was created with Disk Utility. I am using two internal 500 GB Samsung SATA disks installed in the Mac Pro.

Normally the problem reproduces immediately. As soon as the second VM is started, the computer hangs completely or there is a kernel panic. I have reproduced the problem with both Fusion 1.0 and 1.1.

But now I have changed the setting as you suggested and it seems to be working. After ten minutes both machines are perfectly stable and responsive. I have started an Oracle database in one of them and it is still running. Do you know what happens at the lower levels of the system? Why would this change solve the problem?

Thanks for your help so far. I will test it for a while and let you know how it goes...

koiaka
Contributor

It was too good to be true. I had been testing with the VMs on different disks: one on the RAID and the other on a separate disk. That works perfectly. But as soon as I started a second VM on the RAID, the computer stopped completely.

Any other ideas? Is there any way to report this problem to VMware?

Pat_Lee
Virtuoso

Does this happen with two VMs that don't do much at all (can the VMs just be powered on and nothing happening in each of them)? Are these newly created VMs with VMware Fusion or VMs imported from other VMware products?

Also, what happens if you put two large QuickTime HD movies (250 MB or more) on the RAID 0 volume and play both at the same time?


bgertzfield
Commander

Apple might have the same deadlock bug in their RAID implementation as the one that affects FileVault encrypted volumes.

Can you try the old workaround that moves the .vmem file off onto a local volume? Assuming your root volume isn't RAIDed, this will work:

1) Control-click on your VM document in ~/Documents/Virtual Machines and select "Show Package Contents"

2) Open the .vmx file inside in TextEdit

3) Add the line:

mainmem.useNamedFile = "FALSE"

4) Save and exit
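
If you would rather do this from Terminal, the steps above can be sketched roughly as follows. This is only a sketch, not an official procedure: add_vmx_setting is a made-up helper name, the path in the commented example is a placeholder, and it assumes the VM is shut down before the file is touched.

```shell
#!/bin/sh
# Hypothetical helper: append a 'key = "value"' line to a .vmx file,
# unless the key is already present. Keeps a .bak copy of the original.
add_vmx_setting() {
    vmx_file="$1"
    key="$2"
    value="$3"

    # Refuse to add a duplicate. The key is used as a pattern, so the dots
    # in it match any character; that is slightly loose but harmless here.
    if grep -qi "^[[:space:]]*${key}[[:space:]]*=" "$vmx_file"; then
        echo "$key is already set in $vmx_file; edit it by hand" >&2
        return 1
    fi

    cp "$vmx_file" "$vmx_file.bak"

    # If the file does not end with a newline, add one first so the new
    # setting does not get glued onto the last existing line.
    if [ -n "$(tail -c 1 "$vmx_file")" ]; then
        echo >> "$vmx_file"
    fi

    printf '%s = "%s"\n' "$key" "$value" >> "$vmx_file"
}

# Example (placeholder path, adjust to your own VM bundle):
# add_vmx_setting "$HOME/Documents/Virtual Machines/MyVM.vmwarevm/MyVM.vmx" mainmem.useNamedFile FALSE
```

The function only appends and never rewrites existing lines, so restoring the .bak copy undoes the change.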

HPReg
VMware Employee

Would you mind sending us your /Library/Logs/panic.log file? There might be a clue there as to what is going on. Thanks!

koiaka
Contributor

Hello guys,

I was able to test your suggestions this morning, but the error still occurs. These were my tests:

1. I took two raw DV video files of more than 6 GB each and played them both at the same time in QuickTime while one of the VMs was open. Everything ran smoothly. I kept them running for 20 minutes without problems.

2. I made the change to the .vmx file in both VMs. It looks as though the error was somehow delayed, but it finally happened. I kept both machines running for a while. One VM was running Windows XP with a McAfee full scan, and the other was running Oracle Enterprise Linux 4 with Oracle 10g and PeopleSoft CRM. It went fine for a while until I restarted the Windows XP machine. Then everything hung, even the Mac: no Finder, the process monitor stopped completely, and all the network connections were dropped as well.

3. I removed the changes from the .vmx files again and everything hung as well, but after only a few seconds. So I think my previous test was just luck and the hang has no relation to the change in the .vmx file.

4. I cannot produce a kernel panic to get a more recent log. The last one was a few weeks ago and was caused by this problem. Here is the log, but it does not tell me much:

Wed Sep 19 14:06:49 2007

panic(cpu 3 caller 0x00141505): zalloc: "kalloc.64" (15013120 elements) retry fail 3

Backtrace, Format - Frame : Return Address (4 potential args on stack)

0x47fcb798 : 0x128d08 (0x3cc0a4 0x47fcb7bc 0x131de5 0x0)

0x47fcb7d8 : 0x141505 (0x3ccdb0 0x3cc334 0xe51500 0x3)

0x47fcb838 : 0x12d97d (0x19b1c10 0x1 0x47fcb888 0x9621000)

0x47fcb868 : 0x12d99c (0x3c 0x1 0x0 0x0)

0x47fcb888 : 0x179061 (0x3c 0x1 0x47fcb8a8 0x19e23a)

0x47fcb8a8 : 0x17ae4e (0x961dc40 0x961dc40 0x961dc38 0x0)

0x47fcb968 : 0x35b362 (0x961dc38 0x0 0x0 0x1000)

0x47fcb9a8 : 0x1ca6af (0x9621000 0x0 0x0 0x1000)

0x47fcbab8 : 0x1cb809 (0xb17 0x0 0xb57 0x0)

0x47fcbc18 : 0x2fe7f4 (0x9621000 0x47fcbe9c 0xb17 0x0)

0x47fcbd28 : 0x1e56b4 (0x47fcbd54 0x297 0x47fcbd88 0x1d1d63)

0x47fcbd88 : 0x1e0620 (0x9621000 0x47fcbe9c 0x3 0x47fcbde8)

0x47fcbe08 : 0x3501ec (0x7ba8300 0x47fcbe9c 0x6442104 0x0)

0x47fcbef8 : 0x350480 (0x6d8c7d0 0x7ba8300 0x8 0x300b20)

0x47fcbf58 : 0x37ad83 (0x6d8c7d0 0x67ee2e0 0x67ee324 0x0)

0x47fcbfc8 : 0x19b28e (0x63a2660 0x0 0x19e0b5 0x63a2660)

Backtrace continues...

Kernel version:

Darwin Kernel Version 8.10.1: Wed May 23 16:33:00 PDT 2007; root:xnu-792.22.5~1/RELEASE_I386

*********

Any more ideas?

Thanks so far...

admin
Immortal

Does the RAID set include all your disks (specifically, does it include /tmp)?

koiaka
Contributor

Yes, it does. The RAID set is the system disk, including /tmp.

One detail that I have not mentioned yet: it is Mac OS X Server 10.4.10. All the VMs that I am using were installed directly on this machine using VMware Fusion.

HPReg
VMware Employee

panic(cpu 3 caller 0x00141505): zalloc: "kalloc.64" (15013120 elements) retry fail 3

This means that the kernel wanted to grow the zone "kalloc.64" and failed, because the zone was already too large (with 15 million items in it, that is no surprise). This strongly indicates a memory leak of "kalloc.64" instances: somebody is allocating them and never freeing them.

Now we need to determine who is guilty :)

It would be awesome to decode this backtrace. On the box where this panic occurred, can you please run 'nm /mach_kernel' and send me the output of the command at hpreg@youknowthecompany.com?

There are a few other interesting things to do:

1) Run this command in a separate terminal, at the same time as your VM test: while true; do zprint | grep 'kalloc\.64'; sleep 1; done

The numbers should not grow without bound. In particular, run one VM for a while and watch how the numbers move, then power off that VM: did the numbers go back to normal? Now try again with two VMs: run them for a while, watch how the numbers move, and power off the VMs before the zone becomes too big. Are the numbers back to normal?

2) Set sched.mem.pshare.enable = FALSE in the .vmx files of both VMs. Does the problem still occur?
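
Since the machine tends to hang outright, it may help to keep the samples from step 1 in a file with timestamps rather than only on screen. A rough sketch of that idea follows. The names sample_zone, watch_zone and ZPRINT_CMD are invented for this example, and the log path in the commented usage line is a placeholder; the only real command involved is zprint itself.

```shell
#!/bin/sh
# Sketch: timestamped samples of one kernel zone taken from zprint output.
# ZPRINT_CMD can be overridden (for example for a dry run); default is zprint.
ZPRINT_CMD="${ZPRINT_CMD:-zprint}"

# Print the zprint line(s) that mention one zone, prefixed with the time.
sample_zone() {
    zone="$1"
    now=$(date '+%Y-%m-%d %H:%M:%S')
    # grep -F matches the zone name literally, so the dot needs no escaping.
    $ZPRINT_CMD | grep -F "$zone" | while IFS= read -r line; do
        printf '%s  %s\n' "$now" "$line"
    done
}

# Append one sample to a log file every <interval> seconds until interrupted.
watch_zone() {
    zone="$1"
    log_file="$2"
    interval="${3:-1}"
    while true; do
        sample_zone "$zone" >> "$log_file"
        sleep "$interval"
    done
}

# Example (placeholder log path):
# watch_zone kalloc.64 "$HOME/kalloc64.log" 1
```

If the whole machine freezes, the last few samples may never reach the disk, so treat the end of the log as approximate.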

Thanks!

koiaka
Contributor

Hello,

thanks for the explanation. I will try to help you to track this problem down.

Where should I be located with the terminal to run 'nm /mach_kernel'? I could not find the command 'mach_kernel' or a directory called 'nm'. Should I run it as root?

admin
Immortal

nm is the command; /mach_kernel is the file it acts on (the kernel binary, not a directory). I was able to run it as a regular user with my home directory as the working directory.
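
For the curious, decoding a backtrace from the nm output mostly means finding, for each return address, the symbol with the highest address that is not above it. Here is a rough sketch of that lookup. The function name symbolize and the file name kernel-symbols.txt are invented for the example, it assumes 32-bit addresses like the ones in the panic log above, and the symbols are only meaningful if they come from the same kernel build that panicked.

```shell
#!/bin/sh
# Sketch: map return addresses to the nearest preceding symbol in nm output.
# Usage: symbolize NM_OUTPUT_FILE ADDRESS...
symbolize() {
    nm_file="$1"
    shift
    for addr in "$@"; do
        awk -v want="$addr" '
            # Convert a hex string (with or without 0x) to a number.
            function hex2num(s,   i, n, c) {
                s = tolower(s)
                sub(/^0x/, "", s)
                n = 0
                for (i = 1; i <= length(s); i++) {
                    c = index("0123456789abcdef", substr(s, i, 1)) - 1
                    n = n * 16 + c
                }
                return n
            }
            BEGIN { target = hex2num(want); best = -1 }
            # Defined symbols have three fields: address, type, name.
            # Undefined ones have no address and are skipped.
            NF == 3 && $1 ~ /^[0-9a-fA-F]+$/ {
                a = hex2num($1)
                if (a <= target && a > best) { best = a; name = $3 }
            }
            END {
                if (best < 0)
                    print want, "(no symbol found)"
                else
                    printf "%s %s+0x%x\n", want, name, target - best
            }
        ' "$nm_file"
    done
}

# Example, using two return addresses from the panic log:
# nm /mach_kernel > kernel-symbols.txt
# symbolize kernel-symbols.txt 0x141505 0x12d97d
```

Each output line has the form "address symbol+offset". A very large offset usually means the address falls outside the symbols that nm listed, for example inside a kernel extension.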

Mac_hatter
Contributor

I was only running Vista x64 and trying to bring up SuSE 10.3 along with Ubuntu when I saw your thread and issue with RAID setups.

Just wanted to chime in that I was running Fusion on a RAID, but I tend to have the boot drive on non-RAID (updates not working, didn't benefit enough) but I have used RAID for data files and for home directory accounts. I've used this kind of setup since early 10.2.

koiaka
Contributor

Hello,

Sorry for the delay in answering.

I've found the root of the problem. One of the disks in the RAID had a lot of bad blocks. In Tiger, when a bad block is hit, the Finder stops entirely until the timeout is reached. This also stops the running application, in this case VMware. That is why I thought the entire OS was stopping. In Leopard, the Finder is much more stable: it automatically unmounts the defective device, and VMware stops with an error message.

So, problem solved. Thanks for your help anyway,
