VMware Workstation 11.1.2 on a Kubuntu Linux 15.04 64-bit host, Intel Q6600 quad-core with 8 GB RAM.
For a long time (years) I have been experiencing a significant performance difference between a Windows XP 32-bit guest and more recent Windows versions (Vista and up, both 32- and 64-bit). This is preventing me from using Workstation with Windows 7 and later.
A single-threaded task consisting of firing successive processes and collecting their output takes 3x longer on the modern Windows guests than on the XP guest. Another type of task (running a compiler's test suite, which executes thousands of short-lived processes, running as many processes in parallel as there are CPUs) also takes 3x longer on modern Windows guests. This is with all host CPUs assigned to the virtual machine (only one virtual machine is powered on at any given moment). If only one CPU is assigned to the virtual machine, individual processes take less time to run, but the overall time required by the parallelizable tasks is worse.
For comparison, the Windows XP guest runs the test suite in 35 seconds, while a Windows 8.1 64-bit guest takes 100 seconds, which is almost the same as it takes on an old Atom N270 netbook.
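To make the workload concrete, here is a minimal sketch of the kind of sequential spawn loop involved. This is a hypothetical reduction in Python, not the actual test suite; it just fires trivial child processes one after another and reports the per-process cost:

```python
import subprocess
import sys
import time

def spawn_benchmark(n=20):
    """Spawn n short-lived child processes sequentially and return
    (elapsed_seconds, completed_count)."""
    cmd = [sys.executable, "-c", "pass"]  # a trivial, short-lived child
    start = time.perf_counter()
    done = 0
    for _ in range(n):
        if subprocess.run(cmd).returncode == 0:
            done += 1
    return time.perf_counter() - start, done

if __name__ == "__main__":
    elapsed, done = spawn_benchmark()
    print(f"{done} processes in {elapsed:.2f}s "
          f"({elapsed / done * 1000:.1f} ms per process)")
```

On a native host this loop is dominated by OS process-creation cost; inside a guest it additionally exercises the MMU-virtualization paths discussed below.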
The Windows XP guest uses binary translation and these special settings:
MemTrimRate = "30"
mainmem.backing = "swap"
sched.mem.pshare.enable = "FALSE"
prefvmx.useRecommendedLockedMemSize = "TRUE"
monitor.idleLoopSpinUS = "100"
I tried the Windows 8.1 machine with the same settings, with and without binary translation; performance improved when only one CPU is used, but for the parallel tasks it remained the same.
Any idea how I can obtain performance similar to that of the XP guest from the Windows 7+ guests?
BTW, memory assigned to the guest is not an issue; there is plenty of free memory at all times.
On a Q6600, the dominant overheads for process creation should come from CR3 changes and page table monitoring. These are much more expensive with hardware-assisted virtualization than with binary translation. (Note that these overheads go away on CPUs with hardware support for MMU virtualization.)
You say that you tried the Windows 8.1 machine with the same settings as the Windows XP guest (presumably including binary translation), but you previously refer to "a Windows 8.1 64 bit guest." Binary translation is not supported for 64-bit guests, so Workstation will use hardware-assisted virtualization regardless of your chosen execution mode.
Just to clarify...Have you tried a 32-bit Windows 8.1 guest with binary translation?
Thank you for your insightful info. Yes, I recently tried a Windows 10 Preview 32-bit guest with binary translation, and it performed almost as badly as its 64-bit variant.
OTOH, I have some Linux 64-bit guests, and they work fine. Running the test suite on a Lubuntu 15.04 64-bit guest takes 24 seconds, vs. 100 on the Windows 8.1 64-bit guest, 35 on the Windows XP 32-bit guest, and 8 on the host. The Lubuntu guest runs with VMware's default settings.
So do you think a modern CPU (Haswell or later) would improve things with respect to the Q6600 for this type of task?
Thanks again.
Can I ask you to run a vprobe script against the (slow) VM while it's running your MT workload?
Download the attached script to your host.
With the VM powered off, add the following lines to /etc/vmware/config on your host:
vprobe.allow = TRUE
vprobe.enable = TRUE
Start the VM and your workload, and then run the following command on the host:
vmware-vprobe </path/to/the/VM/.vmx/file> vt-exit-hist.emt
The script will emit a chunk of statistics every 10 seconds. Please post a representative chunk of the output.
Here are two consecutive chunks of output while the test suite is running:
intKey0    avg     count     min        max    pct%
0xc        503        11     360        801    0.0%
0x10       364       105     297        774    0.0%
0x4d       666       384     459      20754    0.0%
0x1e       405       767     297      21177    0.0%
0xa        352      1068     288        657    0.0%
0x7        501      2172     288      24624    0.0%
0x2b       449      3634     297      20997    0.1%
0x1c       447     42526     297     112968    1.4%
0x1        495     72515     288     133407    2.6%
0xe        349    128801     288      97362    3.3%
0x4e       573   2160312     450   14395977   92.2%
intKey0    avg     count     min        max    pct%
0x10       391       135     297       1242    0.0%
0xc        467       372     315      18504    0.0%
0x4d       593       376     459       1350    0.0%
0x1e       337       833     297        675    0.0%
0xa        352      1098     288        729    0.0%
0x7        663      2356     297     350451    0.1%
0x2b       436      3628     297      21483    0.1%
0x1c       443     45883     288     114741    1.6%
0x1        491     57236     288     143703    2.2%
0xe        346    128380     288     375039    3.6%
0x4e       549   2072801     450    2297097   92.1%
EDIT: Forgot to mention that this was a Windows 8.1 64-bit guest.
The 0xe row is a little surprising. Those are RDPMC VM-exits, which usually do not show up that high. Are you using virtualized performance counters?
The dominant exits are page faults (0x4e), as expected. They are only about 50 times as frequent as CR accesses (0x1c) [likely to be CR3 changes]. If you are spawning about 4250 new processes per second, and they have a working set of about 200K, that would account for these VM-exits. (Actually, divide 4250 by the number of virtual processors.) However, with software MMU virtualization, page faults can be incurred for other reasons as well, including device virtualization.
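For reference, that back-of-the-envelope estimate can be checked directly. The 10-second chunk interval comes from the script; the ~200K working set and 4 KiB page size are assumptions:

```python
# Back-of-the-envelope check of the VM-exit figures quoted above.
pf_exits_per_chunk = 2_160_312      # 0x4e (page fault) row, first chunk
cr_exits_per_chunk = 42_526         # 0x1c (CR access) row, first chunk
chunk_seconds = 10                  # the script reports every 10 seconds
working_set_bytes = 200 * 1024      # assumed ~200K working set per process
page_size = 4096                    # assumed 4 KiB pages

# Page faults vs. CR accesses: should be roughly 50x.
ratio = pf_exits_per_chunk / cr_exits_per_chunk

# A fresh process faults in each page of its working set once,
# so divide the page-fault rate by pages per working set.
pages_per_process = working_set_bytes // page_size       # 50 pages
procs_per_second = pf_exits_per_chunk / chunk_seconds / pages_per_process

print(f"PF/CR ratio: {ratio:.0f}x")
print(f"Implied process spawns: ~{procs_per_second:.0f}/s")
```

This lands at roughly 50x and ~4300 spawns/s, consistent with the figures above.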
Out of curiosity, what virtual hardware version is this VM? If it isn't virtual hardware version 11 (compatible with Workstation 11), I would suggest upgrading it. (Go to VM->Manage->Change Hardware Compatibility).
Are you using virtualized performance counters?
No.
The dominant exits are page faults (0x4e), as expected. They are only about 50 times as frequent as CR accesses (0x1c) [likely to be CR3 changes]. If you are spawning about 4250 new processes per second, and they have a working set of about 200K, that would account for these VM-exits.
Actually, the whole test suite run, which takes 100 seconds on the Windows 8.1 64-bit VM, spawns ~2000 processes. That makes 20 processes per second on average, shared among the 4 processors. On other tasks, such as building C++ projects, which run multiple but long-lived processes, performance is not so bad. So it seems that process spawning is the key factor here, as if it were extraordinarily expensive. Another example: a Git frontend invokes the git executable about 40 times sequentially to show the detailed status of a source tree. It takes 12 seconds on the Windows 8.1 machine with 4 cores, and 6 seconds with one core. An Atom N270 netbook takes less than 2 seconds, and the Windows XP guest is slightly slower than that.
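Dividing those totals by the 40 invocations gives the per-process launch cost (all timings quoted from above; only the division is new):

```python
# Per-invocation cost implied by the 40 sequential git invocations.
runs = 40
timings_s = {
    "Windows 8.1 guest, 4 vCPUs": 12.0,
    "Windows 8.1 guest, 1 vCPU": 6.0,
    "Atom N270 netbook, native (upper bound)": 2.0,
}
for label, total in timings_s.items():
    print(f"{label}: ~{total / runs * 1000:.0f} ms per invocation")
```

So the 4-vCPU guest pays roughly 300 ms per short-lived process, versus at most ~50 ms on a far weaker native machine.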
Out of curiosity, what virtual hardware version is this VM? If it isn't virtual hardware version 11 (compatible with Workstation 11), I would suggest upgrading it
The slow Windows 8.1 64-bit guest uses version 11, the same as the "fast" Lubuntu 64-bit guest. The "fast" Windows XP 32-bit guest uses version 7. Anyway, I have been experiencing this problem since I installed Workstation on this machine, circa 2009. At that time, after much experimentation, I came up with a config for the Windows XP guest that made it acceptable in performance terms (mostly related to using binary translation). Linux guests, 32- and 64-bit, performed well without special tweaks. But so far I have been unable to make post-XP Windows perform acceptably. Now it is a serious problem, because the XP VM should be deprecated and replaced by something more modern, but there is no way that can be achieved with VMs that perform like a five-year-old single-core netbook.
Can you try another vprobes script? This one will produce voluminous output. I'm mainly interested in whether or not there are any high percentage entries at the end. This script may fail with 'vpqueue overflow.' If so, give it a couple of tries to see if you can get some output before it dies.
First attempt (showing the last lines of each chunk that starts with intKey0):
0xfffff68000002 | 50884 3.6% |
0xfffff680003bb | 53319 3.7% |
0xfffff680003bc | 54557 3.8% |
0xfffff6800000f | 58963 4.1% |
0xfffff6e800147 | 60157 4.2% |
0xfffff680003bd | 69510 4.9% |
0xfffff68000321 | 74646 5.2% |
0xfffff6bffd5b9 | 76538 5.4% |
0xfffff680003bb | 27828 3.5% |
0xfffff680003bc | 28312 3.6% |
0xfffff6800000f | 30507 3.9% |
0xfffff680003bd | 35571 4.5% |
0xfffff68000321 | 39198 5.0% |
0xfffff6bffd5b9 | 39518 5.0% |
0xfffff680003bc | 41740 3.1% |
0xfffff68000002 | 46314 3.5% |
0xfffff6800000f | 48723 3.6% |
0xfffff680003bd | 52607 3.9% |
0xfffff6bffd5b9 | 57482 4.3% |
0xfffff68000321 | 62553 4.7% |
0xfffff6800000f | 32661 3.1% |
0xfffff680003bc | 33319 3.2% |
0xfffff680003bd | 42534 4.0% |
0xfffff68000321 | 45414 4.3% |
0xfffff6bffd5b9 | 47297 4.5% |
On the second run something anomalous happened: CPU usage dropped to almost 0 in both the guest and the host, and I finally killed the task in the guest. The percentages at the end of *some* chunks are larger. Here are a few of those:
0xfffff6fb5ffea | 318 0.8% |
0xfffff6bffd5b9 | 446 1.1% |
0xfffff6e800148 | 538 1.4% |
0xfffff6e800168 | 680 1.8% |
0xfffff6bffd517 | 747 1.9% |
0xffffffffffd0f | 6405 17.0% |
0xfffff6e000c25 | 281 1.0% |
0xfffff6bffd5b9 | 356 1.3% |
0xfffff6fb5ffea | 362 1.4% |
0xfffff6bffd517 | 597 2.3% |
0xffffffffffd0f | 6397 25.0% |
0xfffff6fb5ffea | 254 0.9% |
0xffffc00185034 | 271 1.0% |
0xfffff6e800168 | 282 1.0% |
0xfffff6e800147 | 337 1.2% |
0xffffffffffd0f | 6415 23.7% |
0xfffff6bffd5b9 | 147 0.7% |
0xfffff6bffd5b8 | 165 0.8% |
0xfffff6e800162 | 219 1.1% |
0xfffff6fb5ffea | 279 1.4% |
0xffffffffffd0f | 6198 32.3% |
I tried a third run and the final percentages were between 4% and 10%.
Just created a Windows 7 32-bit VM, with binary translation, 4 cores assigned and these settings in the .vmx file:
MemTrimRate = "30"
mainmem.backing = "swap"
sched.mem.pshare.enable = "FALSE"
prefvmx.useRecommendedLockedMemSize = "TRUE"
monitor.idleLoopSpinUS = "100"
This exactly matches the Windows XP 32-bit VM that I normally use, which delivers the minimum acceptable performance.
The Windows 7 VM is 3x slower than the XP VM on tasks that create multiple short-lived processes.
Puzzling.
Can you enable statistics collection for the two VMs (32-bit Windows 7 with BT and Windows XP with BT)? See VM->Settings->Options->Advanced, Gather debugging information.
I'll send you a private message on where to upload the stats, but you'll have to "follow" me on the forums before I can PM you.
I'm following you now. The statistics are ready (I guess the "statistics" directory is what I should upload, as a zip file).
I'll send you a private message on where to upload the stats,
Thank you. Statistics uploaded.
It looks like the overhead due to CR3 changes (indicative of process switching) is 2.5 times higher on Windows 7. Perhaps the scheduling quanta are different for the two guests?
Can you try setting the scheduling quantum higher on Windows 7? See Figure 5-16 in https://www.microsoftpressstore.com/articles/article.aspx?p=2233328&seqNum=7. Change 'Processor Scheduling' to Adjust for best performance of 'Background Services.'
Change 'Processor Scheduling' to Adjust for best performance of 'Background Services.'
That change has no effect at all. (I took the performance measurements after rebooting, just in case.)
There does appear to be a lot more contention related to MMU virtualization in the Windows 7 case.
With the VM powered off, try adding the following options to the .vmx file:
monitor_control.disable_eagervalidate_wide = TRUE
monitor_control.disable_eagervalidate_narrow = TRUE
This may not help, and it may actually make things worse, but it's worth a shot.
This puts the Windows 7 32-bit VM on the same mark as the XP 32-bit VM when running the test suite, so it helps a lot indeed. For the case of a single sequence of short-lived processes (the git frontend I mentioned before, which spawns 40 git processes sequentially), it is still 2x slower.
On the Windows 8.1 64-bit VM it has a significant impact too: the test suite goes from 100 seconds to 52, still far from the XP VM's 35 seconds, but an improvement nonetheless. If you know of anything else I can try on the 64-bit Windows VMs, I would appreciate it.
Are those settings something I can keep permanently in the .vmx? Are there any consequences I should keep in mind?
Thank you very much for being so persistent in troubleshooting this problem.
For 64-bit guests, my recommendation would be to upgrade your CPU to Westmere or later. Hardware MMU virtualization (EPT) will probably eliminate 90% of the virtualization overhead for this workload, and the improvements in VM-exit/VM-enter latency on newer processors will also help substantially.
As for the configuration options, you should be able to keep them permanently, but note that VMware does not officially support VMs with manual configuration file changes. This is not a configuration that we would test.
Okay, I'll think about replacing this machine with a new one. In the meantime, those .vmx settings will alleviate the pain. Thank you.