gf111
Contributor

Slow mutex performance in Windows XP guest on ESX 5 host

I ran some casual producer/consumer benchmarks, written in C with Visual Studio 2005, on a dual-core 2 GHz CPU. With the same OS (Windows XP Professional) and the same CPU type, mutexes, events and the corresponding WaitForSingleObject calls seem to run on ESX at between 1/10 and 1/2 of the speed they reach on the physical machine. On Windows 7, by contrast, performance is about half that of the corresponding physical machine. Has anyone else noticed this? Could it be due to how mutexes, critical sections, events and semaphores are emulated by ESX binary translation? Are there any settings for a 32-bit Windows XP guest under ESX 5 that would bring inter-process/inter-thread synchronisation speed closer to that of a physical machine?

I enclose a small test project for VS 2005 in case anyone is willing to compare the raw speed of synchronisation primitives on physical hardware and on ESX, and in particular to check whether Windows XP really shows a large speed penalty, or whether I did something wrong, such as running the test on an overloaded ESX host.

7 Replies
RebeccaG
VMware Employee

Hi, unfortunately this is not the correct forum, as this is the VMmark-specific forum. Try asking this in the general performance forum: https://communities.vmware.com/community/vmtn/performance

admin
Immortal

Windows XP is notorious for pounding on the TPR, which may be a factor in your performance issues. On AMD hardware, you can try the hotfix described in "Low I/O performance and/or increased CPU utilization in a virtualized Windows XP 32-Bit System". On Intel hardware, you can enable FlexPriority (with the advanced configuration option "monitor_control.disable_flexpriority=FALSE") if you are running a version of ESX with FlexPriority disabled by default, but you should first make sure that you have a BIOS update that corrects the CPU erratum described as "Virtual-APIC Page Accesses With 32-Bit PAE Paging May Cause a System Crash." (This erratum has a myriad of alphanumeric designations, depending on the processor.) In either case, you should change the execution mode to "Intel VT-x/EPT or AMD-V/RVI" to see any improvement.

gf111
Contributor

Thanks for the useful information. My hypothesis is that when a Windows application calls SetEvent in the producer and WaitForSingleObject in the consumer, it is actually calling into the Windows XP kernel, which executes the corresponding KeSetEvent/KeWaitForSingleObject. These adjust the IRQL through the TPR register, which is slower if FlexPriority is not enabled.

Is there any way to wake up another process in win xp without incurring TPR register bashing ?

Suppose for a moment that I wanted to bypass the way Windows XP thrashes the TPR register, and wrote a device driver implementing my own "SetEvent/WaitForSingleObject". Would it run faster on 32-bit Windows XP? The hypothetical driver would control the Windows scheduler to activate the consumer process precisely, bringing its consumer thread into execution myself. Or would the overhead of calling DeviceIoControl from user mode to reach that driver outweigh any gain?

I noticed that Windows 7 synchronisation is much faster on that same ESX host, presumably with FlexPriority not activated. I presume this is due to "lazy TPR", which I believe was implemented in operating systems later than XP. So I guess the most realistic prospect is to go with Windows 7: in the environment I am working in, I don't think they'd be willing to risk life and limb on a potentially unstable BIOS patch for FlexPriority, given that XP is at the end of its supported life anyway. But thanks for giving me a clue about what may be going on.

admin
Immortal

gf111 wrote:

Is there any way to wake up another process in win xp without incurring TPR register bashing ?

I don't know.  It's not clear to me why Windows wants to modify the TPR in the first place.  The TPR is for blocking external interrupts, and I can't see why Windows would modify which external interrupts are blocked based upon which user level process is active.


gf111 wrote:

I noticed that Windows 7 synchronisation is much faster on that same ESX host, presumably with FlexPriority not activated. I presume this is due to "lazy TPR", which I believe was implemented in operating systems later than XP. So I guess the most realistic prospect is to go with Windows 7: in the environment I am working in, I don't think they'd be willing to risk life and limb on a potentially unstable BIOS patch for FlexPriority, given that XP is at the end of its supported life anyway. But thanks for giving me a clue about what may be going on.

64-bit Windows XP is another option.  It still bashes on the TPR, but it uses CR8 rather than MMIO. (The AMD hotfix does the same with an AMD ISA extension that allows 32-bit code to access CR8.)

Also note that FlexPriority is not a panacea.  It can reduce the cost of a TPR update from ~2000 cycles to ~400 cycles, but this is still a far cry from what it costs natively, which is ~2 cycles.

With binary translation, the TPR updates are rather inexpensive, but BT has its own performance issues.

Is your Windows XP guest running with hardware-assisted virtualization or binary translation?  You can check the vmware.log file for 'HV Settings' to find out.

gf111
Contributor

I would guess the TPR gets altered on a call into SetEvent/WaitForSingleObject because other kernel code that needs to mask interrupts runs along the way, or because the synchronisation primitives are being "defensive" in acting on the TPR, but that is beyond me. Right now these are the numbers I am getting on a 2 GHz dual-core Intel CPU:

SetEvent/WaitForSingleObject on physical Windows XP: about 250 k/s

SetEvent/WaitForSingleObject on Windows XP under ESX (I don't know whether it is configured for binary translation or VT): about 60 k/s

Then I bumped into this obscure LPC API:

http://blogs.msdn.com/b/ntdebugging/archive/2007/07/26/lpc-local-procedure-calls-part-1-architecture...

I used NtRequestPort to send a minimal message (with no reply from the consumer), just to wake up the consumer process, which does a blocking NtReplyWaitReceivePort. This seems to reach about 300 k/s on physical hardware and 100 k/s on ESX, so it would seem quite a bit faster, say 50%, on stock, unoptimised ESX.

Using semaphores seems equivalent to using events. I still have to test the LPC method further to understand whether it is as reliable as events/semaphores.

Out of interest, using a TCP socket to wake up a consumer doing blocking reads of, say, 1 byte seems to reach about 50 k/s on both physical hardware and ESX. That is on localhost (127.0.0.1) with 4 MB read/write buffers and the default Windows XP TCP/IP stack, so that seems a no-go.
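To put the rates above in perspective, they can be turned into an average cycle budget per wake. This is only a rough sketch: the helper names cycles_per_wake and slowdown are made up for illustration, and it assumes the nominal 2 GHz clock and treats each wake as if it ran on a single core.

```c
#include <assert.h>

/* Average clock cycles elapsed per wake, given a clock in Hz and a
 * measured rate in wakes per second. Hypothetical helper. */
double cycles_per_wake(double clock_hz, double wakes_per_second)
{
    assert(clock_hz > 0.0 && wakes_per_second > 0.0);
    return clock_hz / wakes_per_second;
}

/* How many times slower the virtualised run is than the physical one. */
double slowdown(double physical_rate, double virtual_rate)
{
    assert(physical_rate > 0.0 && virtual_rate > 0.0);
    return physical_rate / virtual_rate;
}
```

With the figures quoted above, cycles_per_wake(2e9, 250e3) gives 8000 cycles per wake on physical Windows XP, cycles_per_wake(2e9, 60e3) gives roughly 33,000 cycles on ESX, and slowdown(250e3, 60e3) is a little over 4x.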

admin
Immortal

gf111 wrote:

SetEvent/WaitForSingleObject on ESX (don't know how it's configured whether binary or VT) win xp about 60 k/s

It's worth trying both BT and VT.  For this application, I'm not sure that the default selection is going to give the best performance.

gf111
Contributor

I tried this MS condition variables example in a "what the heck" spirit, not expecting much difference:

http://msdn.microsoft.com/en-us/library/ms686903%28v=vs.85%29.aspx

I enclose a VS 2010 project. On 64-bit Windows 7 it seems to run at about 20 million producer/consumer wakes per second on a 2 GHz dual-core CPU, on BOTH physical hardware and ESX 5 (I don't have system-level access to the logs to tell whether the VM uses binary translation or VT).

Note that in this test I am not trying to measure memory bandwidth, i.e. how fast I can transfer data in both directions between processes, but rather how fast I can wake the consumer. In my case the consumer is a real-time database that carries out a field read/write transaction in the db process at each wake. In my tests, CPU utilisation in terms of transactions (i.e. client producer W, db consumer R, consumer W, producer R) is much higher with a controlled wake-up of the consumer db engine than with the consumer haphazardly polling the producer. The OS scheduler misinterprets the consumer's polling as useful "work", whereas it is just wasting memory bandwidth on unnecessary read cycles, polling a data-ready variable set by the client.

If this is true (?) and I haven't made any clumsy mistakes, it seems almost too good to be true, in my opinion: it works out to something like 1 wake per 100 cycles. Possibly the VM is running with VT and FlexPriority, because that is fairly close to the 400 cycles mentioned earlier. I am aware that modern CPUs don't work in terms of cycles per opcode but rather the other way around, i.e. several opcodes per cycle where pipelining allows, so at first glance the result would seem to "blow your socks off". The way to go definitely seems to be Windows 7, which was probably developed to be virtualisation-aware from the word go: it is about 300x faster than 32-bit Windows XP. Unfortunately WakeConditionVariable/SleepConditionVariableCS seem to work only between threads, not between processes. I could create a remote thread in the db process, but it would still have to be signalled from another process. I wonder whether the VMware VMCI socket API could provide low-latency inter-process IPC?

I also looked at WM_COPYDATA, but it is really slow, just like normal sockets: about 50 k signals/s.

There is also UMS, user-mode scheduling (see "User-Mode Scheduling (Windows)"), but on Windows 7 it again works only between threads, whereas I need inter-process scheduling.

I also tried calling into a kernel device driver, using this MS example:

http://code.msdn.microsoft.com/windowshardware/Event-d245ecb4

with these results on 64-bit Windows 7:

event: 180 k/s on both physical and ESX

irp: 180 k/s physical, 110 k/s ESX

But I would need to make 2 blocking DeviceIoControl calls, one each from the producer and the consumer, which halves the rate and makes it no faster than inter-process event/WaitForSingleObject at 100 k/s ... ?

I enclose a Visual Studio 2013 example, HardwareEventSample.rar, in case anyone is interested in benchmarking it. You need to sign event.sys on Windows 7 or you won't be able to load the driver. Use "event 0 0" for the irp method and "event 0 1" for the event method.

Another method is APCs (asynchronous procedure calls) with QueueUserAPC. I enclose a Visual Studio 2005 project that achieves:

on 64-bit Windows 7, 500 k/s on both physical and ESX, which is the highest rate of one-directional consumer wakes I have managed so far on ESX (I have the producer polling for the reply in order to avoid inter-process waiting with APC)

on 32-bit XP, only 80 k wakes/s on ESX, compared with 500 k/s on physical XP. (I also tried setting producer/consumer affinity:

DWORD_PTR res1 = SetThreadAffinityMask( GetCurrentThread(), 1 ); // consumer: core 1

DWORD_PTR res1 = SetThreadAffinityMask( GetCurrentThread(), 2 ); // producer: core 2

but it didn't change the thread switching rate appreciably, possibly because Windows is not really real-time oriented.)
