VMware Cloud Community
sispeo
Contributor

Performance issue using virtualization

When comparing our software installed on a physical host and in a comparable VM (same CPU, memory), we noticed that the product is two times slower when running in the VM (the guest OS is Windows 2003 on an ESXi 4.0 host; the software uses only one CPU).

As we are not in a production environment, we ran the tests with all other VMs powered off. We tested our VM with and without a CPU reservation (best results with), with and without a memory reservation (almost no difference), and with 1, 2 and 4 vCPUs.

After several tests, the problem seems to come from the use of semaphores: when we replace them with critical sections (we cannot replace them all), performance is roughly the same on the physical host and in the VM. All other code executes with similar performance, but when semaphores are used, the VM consumes CPU for longer than the physical host does.

Has anyone already heard of such a problem? Is there a known reason for poor semaphore performance under Windows hosted by ESXi 4.0?

For example, we wrote a simple program to benchmark semaphores under Windows hosted by ESX (in our lab it took 10 seconds on a physical host and 22 seconds in a VM):

#include "stdafx.h"

int _tmain(int argc, _TCHAR* argv[])

{

unsigned __int64 nCount;

DWORD nTickCount = ::GetTickCount();

HANDLE hSemaphoreBridgets = CreateSemaphore(NULL, 1, 1, NULL);

for (nCount = 0; nCount < 10000000; ++nCount)

{

WaitForSingleObject(hSemaphoreBridgets, INFINITE);

ReleaseSemaphore(hSemaphoreBridgets, 1, NULL);

}

printf("Duration %d s\r\n", (::GetTickCount() - nTickCount) / 1000);

CloseHandle(hSemaphoreBridgets);

return 0;

}
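
For comparison, here is a minimal sketch (not our production code) of the same loop using a critical section instead of a semaphore; an uncontended critical section is entered and left entirely in user mode, so no system call is made:

#include "stdafx.h"   // assumed to pull in <windows.h>, <stdio.h>, <tchar.h>

int _tmain(int argc, _TCHAR* argv[])
{
    unsigned __int64 nCount;
    DWORD nTickCount = ::GetTickCount();

    // Critical section instead of a semaphore; uncontended Enter/Leave
    // are handled with interlocked operations in user mode.
    CRITICAL_SECTION cs;
    InitializeCriticalSection(&cs);

    for (nCount = 0; nCount < 10000000; ++nCount)
    {
        EnterCriticalSection(&cs);
        LeaveCriticalSection(&cs);
    }

    printf("Duration %lu s\r\n", (::GetTickCount() - nTickCount) / 1000);

    DeleteCriticalSection(&cs);
    return 0;
}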

22 Replies
admin
Immortal

The default execution mode for Windows 2003 is binary translation. You may be measuring system call overheads, though it is not clear to me why a semaphore implementation would require system calls.

If ESX supports VT-x or AMD-V on your hardware and you have SP2 installed in the guest, I would recommend changing the execution mode to 'VT-x or AMD-V.' Then try the experiment again.
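
For reference, on ESX 4.0 the execution mode can also be requested directly in the VM's .vmx file (a sketch; the GUI option above is equivalent, and this parameter comes up again later in this thread):

monitor.virtual_exec = "hardware"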

sispeo
Contributor

Unfortunately, VT-x mode has already been set...

admin
Immortal

Can you upload your benchmark program?

sispeo
Contributor

Here it comes... it is a 64-bit binary.

admin
Immortal

Ah. So you are running Windows 2003 x64? If so, you can ignore what I said about the default execution mode; I was assuming you were running 32-bit Windows 2003.

sispeo
Contributor

Sorry, I should have mentioned it before.

admin
Immortal

This seems to be a well-behaved benchmark with low virtualization overheads. I can't really explain your 2x slowdown. Can you tell me which CPU you are using and exactly which Windows release you are testing?

sispeo
Contributor

Our ESX hosts are "small" servers, as they are used for tests. The one used for this benchmark has a Xeon 5130 and runs ESXi 4.0. The guest OS is Windows Server 2003 64-bit, Enterprise Edition, Service Pack 2.

We are asking our client, who has the same problem under ESX 3.5, for details of their environment.

sispeo
Contributor

Here is our client's ESX configuration:

VMware installed: yes
Guest OS: Windows Server 2003 EE SP2
OS bits: 32
Server type: HP ProLiant DL585 G5
Processor type: AMD Opteron
Cores (host cores available in parentheses): 4 (of 16)
CPU clock frequency: 2.3 GHz
Main memory: 8 GB (of 64 GB)

admin
Immortal

So, the problem occurs on both AMD and Intel processors, with both 32-bit and 64-bit versions of Windows 2003, on ESX 3.5 and ESX 4? That sounds pretty widespread. I'm surprised that nothing jumped out at me. I'll file a bug report with our performance team.

sispeo
Contributor

Maybe I was drunk...

Both guest OSes are 32-bit Windows Server 2003 Enterprise (one running on ESXi 4.0/Intel, the other on ESX 3.5/AMD).

sispeo
Contributor

Do you have any news on this subject? Is there something we can do?

admin
Immortal

I was unable to replicate your results with the 64-bit benchmark you sent, using Windows 2003 x64. If you package up a 32-bit version of your benchmark, I'll have another look.

sispeo
Contributor

"This is it !"

admin
Immortal

I profiled your benchmark and found that it spends most of its time in these three Windows HAL functions:

39.83% hal!KfLowerIrql

19.82% hal!KeRaiseIrqlToDpcLevel

19.07% hal!KeRaiseIrqlToSynchLevel

The hot spots in each function are TPR accesses (0FFFE0080h is the address of the TPR in the local APIC):

hal!KfLowerIrql:
807168e4 890d8000feff           mov dword ptr ds:[0FFFE0080h],ecx
807168ea a18000feff             mov eax,dword ptr ds:[0FFFE0080h]

hal!KeRaiseIrqlToDpcLevel:
807168a0 8b158000feff           mov edx,dword ptr ds:[0FFFE0080h]
807168a6 c7058000feff41000000   mov dword ptr ds:[0FFFE0080h],41h

hal!KeRaiseIrqlToSynchLevel:
807168bc 8b158000feff           mov edx,dword ptr ds:[0FFFE0080h]
807168c2 c7058000feff41000000   mov dword ptr ds:[0FFFE0080h],41h

Since the local APIC is virtualized, a TPR access typically causes a VM-Exit under hardware virtualization. However, Intel has introduced FlexPriority, which avoids the VM-Exit for all TPR reads and for some TPR writes. Because of this, ESX 4.0 defaults to VT-x for 32-bit Windows 2003 on Intel chips with FlexPriority. Unfortunately, FlexPriority is not a panacea. On native hardware, TPR accesses generally take only a few cycles. With FlexPriority, TPR accesses that do not cause a VM-Exit may still take several hundred cycles. TPR accesses that do cause VM-Exits take several thousand cycles. Fortunately, we still have the option of using binary translation. Under binary translation, TPR accesses generally take tens of cycles.

For this particular workload, you should configure your guest to use binary translation. On my Penryn system, the benchmark runs in 22 seconds using VT-x (with FlexPriority), but it only takes 13 seconds using binary translation. (For completeness, it takes 90 seconds using VT-x without FlexPriority).

Your client's situation is different. AMD has never introduced a technology equivalent to FlexPriority. However, if your client has configured their VM to use hardware MMU support, then the VM will be using AMD-V, which suffers from the same problems as VT-x without FlexPriority. Make sure that they have configured the VM to use software MMU support so that it will execute using binary translation. (The default execution mode for this guest under ESX 3.5 is binary translation.)
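
In .vmx terms, that would be a single line for the client's ESX 3.5 VM (a sketch; the exact parameters are spelled out a couple of replies further down):

monitor.virtual_mmu = "software"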

Scissor
Virtuoso

jmattson,

I just want to say how impressed I am with the level of technical detail you provided in your post. Even if your reply doesn't help the original poster, posts like this are the reason why these forums are such a great resource.

Thank you!

sispeo
Contributor

I am really impressed too! :smileygrin:

I thought I had to select binary translation by setting monitor.virtual_exec to "software", but the "hardware" value made our benchmark run in 10 seconds instead of the initial 22 seconds.

For our client using AMD-based ESX, will we just need to adjust monitor.virtual_exec and monitor.virtual_mmu?

admin
Immortal

Thanks. I hope you found this information helpful.

ESX 3.5 does not respect monitor.virtual_exec. It only supports hardware virtualization on AMD CPUs with RVI, and you get both AMD-V and RVI by requesting RVI:

monitor.virtual_mmu = "hardware"

You can specifically request binary translation on ESX 3.5 by requesting a software MMU:

monitor.virtual_mmu = "software"

Note that this has changed slightly with ESX 4.0. To specifically request binary translation on ESX 4.0, you need to specify:

monitor.virtual_exec = "software"

admin
Immortal

After the kudos, it's embarrassing to admit this, but I did all of this testing with Windows 2003 RTM. Windows 2003 SP2 has addressed this particular issue. See this Microsoft TechNet article.

After installing SP2, my new timings are 16 seconds for binary translation and only 6 seconds for VT-x (with or without FlexPriority).

To summarize all of these findings: if you are running this kind of a workload on Windows 2003 pre-SP2, you should use binary translation, but on Windows 2003 SP2, you should use hardware virtualization.