We have two physical machines in the company. Both have the same hardware configuration and run the same CPU:
Intel(R) Xeon(R) CPU E5420 @ 2.50GHz
One runs Linux natively; on the second we have ESX, version 4.
Inside ESX we run a Linux guest that should be almost identical to the Linux on the first machine.
The kernel version is (a bit old these days, but required by an old project):
Linux x 2.4.21-53.ELhugemem #1 SMP Wed Nov 14 03:46:17 EST 2007 i686 i686 i386 GNU/Linux
The problem is that the virtualized Linux runs slower. I have read that the overhead should be around 8%, which I could live with, but here the performance hit is visible to the naked eye.
I wrote two test programs:
The first just does extensive work in user space (e.g., a giant loop counting numbers). Here the performance hit is around 8-10%, which is fine.
The second program does system calls, close(0) in a loop, and this is where things stop being pretty:
Linux running on real HW:
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.65    0.963257          10    100002     99999 close
  0.15    0.001403          33        43        41 open
  0.14    0.001368          34        40        36 stat64
  0.06    0.000566         566         1           execve
  0.00    0.000027           5         5           old_mmap
  0.00    0.000007           4         2           fstat64
  0.00    0.000006           6         1           read
  0.00    0.000006           6         1           munmap
  0.00    0.000004           4         1           uname
  0.00    0.000003           3         1           brk
------ ----------- ----------- --------- --------- ----------------
100.00    0.966647                100097    100076 total

real 0m4.613s
user 0m0.760s
sys  0m3.730s
Linux running on ESX:
Process 14702 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 77.76   18.206772         182    100002     99999 close
  3.01    0.703602      703602         1           execve
  2.99    0.700382      700382         1           set_thread_area
  2.99    0.700337      700337         1           munmap
  2.99    0.700328      700328         1           uname
  2.99    0.700123      700123         1           read
  2.99    0.700108      700108         1           brk
  2.14    0.500571      100114         5           old_mmap
  1.71    0.400229      200115         2           fstat64
  0.43    0.100360       33453         3         1 open
------ ----------- ----------- --------- --------- ----------------
100.00   23.412812                100018    100000 total

real 0m48.434s
user 0m5.410s
sys  0m40.610s
The machine running on ESX spent roughly ten times as long (about 48 s versus 4.6 s of wall-clock time) doing the same thing.
Any ideas why this is happening? It seems that entering the kernel (the context switch) is very expensive for some reason.
That's a surprisingly large slowdown. Using binary translation, system calls should run about 2000 cycles more than native. See this ASPLOS paper.
For this workload, I would recommend reconfiguring your VM to use hardware-assisted virtualization, which runs system calls at native speed. Your Xeon E5420 should support VT-x.
I'm looking at /proc/cpuinfo and I can't see vmx. Is that a reliable way to check whether VT-x is currently turned on?
I will definitely check the BIOS settings regarding VT-x on Monday.
I was also thinking that, since "sysenter" seems to be missing from /proc/cpuinfo as well, int 0x80 is perhaps the root cause.
This is what I see inside the guest (virtualized) linux:
processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 7
model name : Intel(R) Xeon(R) CPU E5420 @ 2.50GHz
stepping : 10
cpu MHz : 2493.779
cache size : 6144 KB
physical id : 1
siblings : 4
core id : 7
cpu cores : 4
runqueue : 7
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm nx lm
bogomips : 4980.73
The guest will never report vmx, because we do not virtualize VT-x. Sysenter is there; it's called "sep".
Use esxcfg-info on the host to see if VT-x is enabled. Look for HV Support. A value of 3 means that the system supports VT-x and VT-x is enabled in the BIOS.
As a follow-up to the comment that Jim made earlier
That's a surprisingly large slowdown. Using binary translation, system calls
should run about 2000 cycles more than native. See this ASPLOS paper.
it is, in a way, not surprising that system calls run very slowly
when using the hugemem kernel with BT.
This hugemem kernel (and only this kernel) uses separate address
spaces for kernel and user space. As a consequence, every system
call requires two address-space changes: the first on the way
into the kernel and the second on the way back from the kernel to
user space. For this reason, the hugemem kernel is also known as
the 4g/4g kernel, since it provides a full 4 GB address space to
both user space and kernel space.
Other kernels, including Windows and "normal" versions of Linux,
map the kernel into the top 2 GB (or 1 GB, depending on the split)
of the address space. This allows system calls to proceed without
an address-space change, and as a result they are much faster.
The hugemem kernel's approach slows down system calls, even natively.
In a VM, the slowdown can get further amplified because the address
space change (%cr3 assignment) is in itself slower than native (unless
you run with RVI/EPT support).
Enabling VT-x will help some, but it will not fix all the performance
problems. Running on an Intel CPU with EPT, or an AMD CPU with RVI will
help even more. If this is not possible, you should probably change the
kernel.
By the way, VMware does not support guests that run with the hugemem
kernel (because it is too slow, not because we know of any correctness
problems with it).
Hope this helps,
Ole
Ole: Thank you, your reply is very helpful; I completely missed that the kernel is using a 4g/4g split.
From what I could find, EPT support starts with the Nehalem microarchitecture, and the E5420 is Intel Core.
We are running the hugemem kernel because we have more than 4 GB of RAM. So I'm thinking of using the hugemem kernel config as a base, but switching from the 4g/4g to a 3g/1g split.
You are correct that EPT starts with Nehalem; Core (2) has no EPT.
Regarding your comment
We are running hugemem kernel because we have more than 4g of RAM.So,
I'm thinking to use hugemem kernel config as base, but switching from
4g/4g to 3g/1g split.
let me first point out that kernels other than hugemem (e.g.,
bigsmp) can address up to 64 GB of memory, using PAE in 32-bit
mode.
Novell has some verbiage here
http://www.novell.com/coolsolutions/tip/16262.html
that you can use.
I have not personally tried to switch the hugemem to 3/1 (I didn't
even know it could be done), so I can't say if this will help you or
not. But if it doesn't, the bigsmp kernel seems to meet your needs
for memory addressability beyond 4 GB (and it is supported by VMware).
Best of luck,
Ole
By switching away from the bigsmp kernel, the performance of the (virtual build) machine increased by 25-35%. Compared to real hardware, the virtualization overhead is now about 4-15% (depending on how the build is done: clearmake vs. make).
Oh, not bigsmp... I meant hugemem.