VMware Workstation 11.1.2 on a Kubuntu Linux 15.04 64-bit host, Intel Q6600 quad-core with 8 GB RAM.
For a long time (years) I have been experiencing a significant performance difference between a Windows XP 32-bit guest and more recent Windows versions (Vista and up, both 32- and 64-bit). This is preventing me from using Workstation with Windows 7 and later.
A single-threaded task consisting of firing successive processes and collecting their output takes 3x longer on the modern Windows guests than on the XP guest. Another type of task (running a compiler's test suite, which executes thousands of short-lived processes, running as many processes in parallel as there are CPUs) also takes 3x longer on modern Windows guests. This is with all host CPUs assigned to the virtual machine (only one virtual machine is powered on at any given moment). If only one CPU is assigned to the virtual machine, individual processes take less time to run, but the overall time required by the parallelizable tasks is worse.
For comparison, the Windows XP guest runs the test suite in 35 seconds, while a Windows 8.1 64-bit guest takes 100 seconds, which is almost the same as it takes on an old Atom N270 netbook.
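To make the workload concrete, here is a minimal sketch of the kind of sequential spawn loop involved. This is a hypothetical reduction in Python, not the actual test suite; it just fires trivial child processes one after another and reports the per-process cost:

```python
import subprocess
import sys
import time

def spawn_benchmark(n=20):
    """Spawn n short-lived child processes sequentially and return
    (elapsed_seconds, completed_count)."""
    cmd = [sys.executable, "-c", "pass"]  # a trivial, short-lived child
    start = time.perf_counter()
    done = 0
    for _ in range(n):
        if subprocess.run(cmd).returncode == 0:
            done += 1
    return time.perf_counter() - start, done

if __name__ == "__main__":
    elapsed, done = spawn_benchmark()
    print(f"{done} processes in {elapsed:.2f}s "
          f"({elapsed / done * 1000:.1f} ms per process)")
```

On a native host this loop is dominated by OS process-creation cost; inside a guest it additionally exercises the MMU-virtualization paths discussed below.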
The Windows XP guest uses binary translation and these special settings:
MemTrimRate = "30"
mainmem.backing = "swap"
sched.mem.pshare.enable = "FALSE"
prefvmx.useRecommendedLockedMemSize = "TRUE"
monitor.idleLoopSpinUS = "100"
I tried the Windows 8.1 machine with the same settings, with and without binary translation; performance improved when only one CPU is used, but for the parallel tasks it remained the same.
Any idea how I can obtain performance similar to that of the XP guest from the Windows 7+ guests?
BTW, memory assigned to the guest is not an issue; there is plenty of free memory at all times.
On a Q6600, the dominant overheads for process creation should come from CR3 changes and page table monitoring. These are much more expensive with hardware-assisted virtualization than with binary translation. (Note that these overheads go away on CPUs with hardware support for MMU virtualization.)
You say that you tried the Windows 8.1 machine with the same settings as the Windows XP guest (presumably including binary translation), but you previously refer to "a Windows 8.1 64 bit guest." Binary translation is not supported for 64-bit guests, so Workstation will use hardware-assisted virtualization regardless of your chosen execution mode.
Just to clarify...Have you tried a 32-bit Windows 8.1 guest with binary translation?
Thank you for your insightful info. Yes, I recently tried a Windows 10 Preview 32-bit guest with binary translation, and it performed almost as badly as its 64-bit variant.
OTOH, I have some Linux 64-bit guests, and they work fine. Running the test suite on a Lubuntu 15.04 64-bit guest takes 24 seconds, vs. 100 on the Windows 8.1 64-bit guest, 35 on the Windows XP 32-bit guest, and 8 on the host. The Lubuntu guest runs with VMware's default settings.
So do you think a modern CPU (Haswell or later) would improve things with respect to the Q6600 for this type of task?
Thanks again.
Can I ask you to run a vprobe script against the (slow) VM while it's running your MT workload?
Download the attached script to your host.
With the VM powered off, add the following lines to /etc/vmware/config on your host:
vprobe.allow = TRUE
vprobe.enable = TRUE
Start the VM and your workload, and then run the following command on the host:
vmware-vprobe </path/to/the/VM/.vmx/file> vt-exit-hist.emt
The script will emit a chunk of statistics every 10 seconds. Please post a representative chunk of the output.
Here are two consecutive chunks of output while the test suite is running:
intKey0    avg     count     min        max    pct%
0xc        503        11     360        801    0.0%
0x10       364       105     297        774    0.0%
0x4d       666       384     459      20754    0.0%
0x1e       405       767     297      21177    0.0%
0xa        352      1068     288        657    0.0%
0x7        501      2172     288      24624    0.0%
0x2b       449      3634     297      20997    0.1%
0x1c       447     42526     297     112968    1.4%
0x1        495     72515     288     133407    2.6%
0xe        349    128801     288      97362    3.3%
0x4e       573   2160312     450   14395977   92.2%
intKey0    avg     count     min        max    pct%
0x10       391       135     297       1242    0.0%
0xc        467       372     315      18504    0.0%
0x4d       593       376     459       1350    0.0%
0x1e       337       833     297        675    0.0%
0xa        352      1098     288        729    0.0%
0x7        663      2356     297     350451    0.1%
0x2b       436      3628     297      21483    0.1%
0x1c       443     45883     288     114741    1.6%
0x1        491     57236     288     143703    2.2%
0xe        346    128380     288     375039    3.6%
0x4e       549   2072801     450    2297097   92.1%
EDIT: Forgot to mention that this was a Windows 8.1 64-bit guest.
The 0xe row is a little surprising. Those are RDPMC VM-exits, which usually do not show up that high. Are you using virtualized performance counters?
The dominant exits are page faults (0x4e), as expected. They are only about 50 times as frequent as CR accesses (0x1c) [likely to be CR3 changes]. If you are spawning about 4250 new processes per second, and they have a working set of about 200K, that would account for these VM-exits. (Actually, divide 4250 by the number of virtual processors.) However, with software MMU virtualization, page faults can be incurred for other reasons as well, including device virtualization.
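For reference, that back-of-the-envelope estimate can be checked directly. The 10-second chunk interval comes from the script; the ~200K working set and 4 KiB page size are assumptions:

```python
# Back-of-the-envelope check of the VM-exit figures quoted above.
pf_exits_per_chunk = 2_160_312      # 0x4e (page fault) row, first chunk
cr_exits_per_chunk = 42_526         # 0x1c (CR access) row, first chunk
chunk_seconds = 10                  # the script reports every 10 seconds
working_set_bytes = 200 * 1024      # assumed ~200K working set per process
page_size = 4096                    # assumed 4 KiB pages

# Page faults vs. CR accesses: should be roughly 50x.
ratio = pf_exits_per_chunk / cr_exits_per_chunk

# A fresh process faults in each page of its working set once,
# so divide the page-fault rate by pages per working set.
pages_per_process = working_set_bytes // page_size       # 50 pages
procs_per_second = pf_exits_per_chunk / chunk_seconds / pages_per_process

print(f"PF/CR ratio: {ratio:.0f}x")
print(f"Implied process spawns: ~{procs_per_second:.0f}/s")
```

This lands at roughly 50x and ~4300 spawns/s, consistent with the figures above.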
Out of curiosity, what virtual hardware version is this VM? If it isn't virtual hardware version 11 (compatible with Workstation 11), I would suggest upgrading it. (Go to VM->Manage->Change Hardware Compatibility).
Are you using virtualized performance counters?
No.
The dominant exits are page faults (0x4e), as expected. They are only about 50 times as frequent as CR accesses (0x1c) [likely to be CR3 changes]. If you are spawning about 4250 new processes per second, and they have a working set of about 200K, that would account for these VM-exits.
Actually, the whole test suite run, which takes 100 seconds on the Windows 8.1 64-bit VM, spawns ~2000 processes. That makes 20 processes per second on average, shared among the 4 processors. On other tasks, such as building C++ projects, which run multiple but long-lived processes, performance is not so bad. So it seems that process spawning is the key factor here, as if it were extraordinarily expensive. Another example: a Git frontend invokes the git executable about 40 times sequentially to show the detailed status of a source tree. It takes 12 seconds on the Windows 8.1 machine with 4 cores, and 6 seconds with one core. An Atom N270 netbook takes less than 2 seconds, and the Windows XP guest is slightly slower than that.
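Dividing those totals by the 40 invocations gives the per-process launch cost (all timings quoted from above; only the division is new):

```python
# Per-invocation cost implied by the 40 sequential git invocations.
runs = 40
timings_s = {
    "Windows 8.1 guest, 4 vCPUs": 12.0,
    "Windows 8.1 guest, 1 vCPU": 6.0,
    "Atom N270 netbook, native (upper bound)": 2.0,
}
for label, total in timings_s.items():
    print(f"{label}: ~{total / runs * 1000:.0f} ms per invocation")
```

So the 4-vCPU guest pays roughly 300 ms per short-lived process, versus at most ~50 ms on a far weaker native machine.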
Out of curiosity, what virtual hardware version is this VM? If it isn't virtual hardware version 11 (compatible with Workstation 11), I would suggest upgrading it
The slow Windows 8.1 64-bit guest uses version 11, the same as the "fast" Lubuntu 64-bit guest. The "fast" Windows XP 32-bit guest uses version 7. Anyway, I have been experiencing this problem since I installed Workstation on this machine, circa 2009. At that time, after much experimentation, I came up with a config for the Windows XP guest that made it acceptable in performance terms (mostly related to using binary translation). Linux guests, 32- and 64-bit, performed well without special tweaks. But so far I have been unable to make post-XP Windows perform acceptably. Now it is a serious problem, because the XP VM should be deprecated and replaced by something more modern, but there is no way that can be achieved with VMs that perform like a five-year-old single-core netbook.
Can you try another vprobes script? This one will produce voluminous output. I'm mainly interested in whether or not there are any high percentage entries at the end. This script may fail with 'vpqueue overflow.' If so, give it a couple of tries to see if you can get some output before it dies.
First attempt (showing the last lines of each chunk that starts with intKey0):
0xfffff68000002 | 50884 3.6% |
0xfffff680003bb | 53319 3.7% |
0xfffff680003bc | 54557 3.8% |
0xfffff6800000f | 58963 4.1% |
0xfffff6e800147 | 60157 4.2% |
0xfffff680003bd | 69510 4.9% |
0xfffff68000321 | 74646 5.2% |
0xfffff6bffd5b9 | 76538 5.4% |
0xfffff680003bb | 27828 3.5% |
0xfffff680003bc | 28312 3.6% |
0xfffff6800000f | 30507 3.9% |
0xfffff680003bd | 35571 4.5% |
0xfffff68000321 | 39198 5.0% |
0xfffff6bffd5b9 | 39518 5.0% |
0xfffff680003bc | 41740 3.1% |
0xfffff68000002 | 46314 3.5% |
0xfffff6800000f | 48723 3.6% |
0xfffff680003bd | 52607 3.9% |
0xfffff6bffd5b9 | 57482 4.3% |
0xfffff68000321 | 62553 4.7% |
0xfffff6800000f | 32661 3.1% |
0xfffff680003bc | 33319 3.2% |
0xfffff680003bd | 42534 4.0% |
0xfffff68000321 | 45414 4.3% |
0xfffff6bffd5b9 | 47297 4.5% |
On the second run something anomalous happened: CPU usage dropped to almost 0 in both the guest and the host, and I finally killed the task in the guest. The percentages at the end of *some* chunks are larger. Here are a few of those:
0xfffff6fb5ffea | 318 0.8% |
0xfffff6bffd5b9 | 446 1.1% |
0xfffff6e800148 | 538 1.4% |
0xfffff6e800168 | 680 1.8% |
0xfffff6bffd517 | 747 1.9% |
0xffffffffffd0f | 6405 17.0% |
0xfffff6e000c25 | 281 1.0% |
0xfffff6bffd5b9 | 356 1.3% |
0xfffff6fb5ffea | 362 1.4% |
0xfffff6bffd517 | 597 2.3% |
0xffffffffffd0f | 6397 25.0% |
0xfffff6fb5ffea | 254 0.9% |
0xffffc00185034 | 271 1.0% |
0xfffff6e800168 | 282 1.0% |
0xfffff6e800147 | 337 1.2% |
0xffffffffffd0f | 6415 23.7% |
0xfffff6bffd5b9 | 147 0.7% |
0xfffff6bffd5b8 | 165 0.8% |
0xfffff6e800162 | 219 1.1% |
0xfffff6fb5ffea | 279 1.4% |
0xffffffffffd0f | 6198 32.3% |
I tried a third run and the final percentages were between 4% and 10%.
Just created a Windows 7 32-bit VM, with binary translation, 4 cores assigned and these settings in the .vmx file:
MemTrimRate = "30"
mainmem.backing = "swap"
sched.mem.pshare.enable = "FALSE"
prefvmx.useRecommendedLockedMemSize = "TRUE"
monitor.idleLoopSpinUS = "100"
This exactly matches the Windows XP 32-bit VM that I normally use, which delivers the minimum acceptable performance.
The Windows 7 VM is 3x slower than the XP VM on tasks that create multiple short-lived processes.
Puzzling.
Can you enable statistics collection for the two VMs (32-bit Windows 7 with BT and Windows XP with BT)? See VM->Settings->Options->Advanced, Gather debugging information.
I'll send you a private message on where to upload the stats, but you'll have to "follow" me on the forums before I can PM you.
I'm following you now. The statistics are ready (I guess the "statistics" directory is what I should upload, as a zip file).
I'll send you a private message on where to upload the stats,
Thank you. Statistics uploaded.
It looks like the overhead due to CR3 changes (indicative of process switching) is 2.5 times higher on Windows 7. Perhaps the scheduling quanta are different for the two guests?
Can you try setting the scheduling quantum higher on Windows 7? See Figure 5-16 in https://www.microsoftpressstore.com/articles/article.aspx?p=2233328&seqNum=7. Change 'Processor Scheduling' to Adjust for best performance of 'Background Services.'
Change 'Processor Scheduling' to Adjust for best performance of 'Background Services.'
That change has no effect at all. (I took the performance measurements after rebooting, just in case.)
There does appear to be a lot more contention related to MMU virtualization in the Windows 7 case.
With the VM powered off, try adding the following options to the .vmx file:
monitor_control.disable_eagervalidate_wide = TRUE
monitor_control.disable_eagervalidate_narrow = TRUE
This may not help, and it may actually make things worse, but it's worth a shot.
This puts the Windows 7 32-bit VM on the same mark as the XP 32-bit VM when running the test suite, so it helps a lot indeed. For the case of a single sequence of short-lived processes (the git frontend I mentioned before, which spawns 40 git processes sequentially), it is still 2x slower.
On the Windows 8.1 64-bit VM it has a significant impact too: the test suite goes from 100 seconds to 52, still far from the XP VM's 35 seconds, but an improvement nonetheless. If you know of anything else I can try on the 64-bit Windows VMs, I would appreciate it.
Are those settings something I can keep permanently in the .vmx? Are there any consequences I should keep in mind?
Thank you very much for being so persistent in troubleshooting this problem.
For 64-bit guests, my recommendation would be to upgrade your CPU to Westmere or later. Hardware MMU virtualization (EPT) will probably eliminate 90% of the virtualization overhead for this workload, and the improvements in VM-exit/VM-enter latency on newer processors will also help substantially.
As for the configuration options, you should be able to keep them permanently, but note that VMware does not officially support VMs with manual configuration file changes. This is not a configuration that we would test.
Okay, I'll think about replacing this machine with a new one. In the meantime, those .vmx settings will alleviate the pain. Thank you.