Hello,
we have horrible performance problems with both ESXi 3.5 and 4.0. First, here are some of our server setups:
Proliant DL160 G5 with LSI Logic HW RAID 1:
ESXi 3.5
4 x 1.994 GHz
10 GB RAM
8 VMs: one Windows server and 7 Debian machines, all installed and up to date.
They mostly run some small web hosting services, one DNS server and 4 rarely used MySQL database servers.
Proliant DL140 G3 with e200 SmartArray RAID 1:
ESXi 4.0
4 x 1.6 GHz
13 GB RAM
6 VMs: one Windows Server 2008, three Windows Server 2003 and two Debian machines.
These VMs need somewhat more I/O and CPU.
There are 4 databases: one MSSQL, two MySQL, one Adaptive Server Anywhere.
Proliant DL380 G5 with SmartArray P400:
8 GB RAM
4 x 2.5 GHz
Three Debian machines. Two with Apache and two with MySQL (one for replication).
Proliant DL180 G5 with SmartArray E200:
12 GB RAM
4 x 2.5 GHz
9 machines:
two Debian web hosting servers that see decent usage (Apache, MySQL, FTP etc.)
one web hosting server with the same configuration, but not heavily used yet
one idling server, also with Apache and MySQL
one CentOS server with PostgreSQL and a big Java application
three servers with Apache and MySQL that don't get much load
So our problem is that most servers don't show any load; for every VM, ESX reports something like:
RAM usage: 60%
CPU usage: 25%
Sometimes the systems are so slow that we get the following:
$ time curl -H 'Host: <hostname>' http://<hostname>/script.php
hello world!
real 0m20.866s
user 0m0.016s
sys 0m0.000s
20 seconds just for a simple hello-world PHP script.
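One check I would try (my suggestion, not something from the thread) is to rule out the web stack: run a trivial process-spawning loop inside one of the slow guests. Spawning 200 short-lived processes should take well under a second on a healthy VM; if even this takes seconds, the slowdown is below PHP/Apache, at the CPU-scheduling or storage level of the hypervisor.

```shell
#!/bin/sh
# Run inside a slow guest. Forking 200 trivial processes is pure
# CPU/scheduler work; if "real" time here is seconds rather than
# milliseconds, the guest itself is being starved by the host.
time sh -c 'i=0; while [ "$i" -lt 200 ]; do /bin/true; i=$((i+1)); done'
```

Comparing this against timing `script.php` via the PHP CLI on the same guest would further separate script execution time from the HTTP path.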
Where could the problem be? Personally I think the number of database servers may be the problem (even though most of them are rarely used).
Cheers.
You need to use esxtop to see where the contention is occurring. How many vCPUs does each VM have? Ideally use 1 vCPU per VM unless there are very good reasons to use more; using 4 vCPUs will often result in slower performance than 1, because the VM has to wait for 4 pCPUs to become available at once.
esxtop:
10:11:21am up 49 days 21:09, 137 worlds; CPU load average: 0.51, 0.56, 0.45
PCPU(%): 14.23, 10.30, 14.51, 14.13 ; used total: 13.29
ID GID NAME NWLD %USED %RUN %SYS %WAIT %RDY %IDLE %OVRLP %CSTP %MLMTD
532 532 <webhosting1> 5 3.36 3.36 0.02 499.04 0.20 97.01 0.13 0.00 0.00
561 561 <simple_web> 5 0.59 0.59 0.01 500.00 0.07 99.89 0.13 0.00 0.00
1582 1582 <webhosting2> 7 41.30 41.33 0.26 660.95 1.39 158.57 0.48 0.00 0.00
3036652 3036652 <new_webhosting> 6 0.60 0.61 0.00 600.00 0.11 99.98 0.08 0.00 0.00
4639441 4639441 <postgres_and_java> 6 3.69 3.68 0.02 598.30 1.14 94.96 0.41 0.00 0.00
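A quick way to triage a snapshot like the one above is to look at %RDY per VM (column 9). As a common rule of thumb (my numbers, not from the thread), sustained %RDY under ~5% is fine and over ~10% per vCPU usually means real CPU contention. The sketch below runs that check over the figures pasted above:

```shell
#!/bin/sh
# Flag VMs whose %RDY exceeds a ~5% warning threshold (rule of thumb).
# The rows are copied from the esxtop output above, header included.
awk 'NR > 1 { s = ($9 > 5) ? "HIGH" : "ok";
              printf "%-20s %%RDY=%-5s %s\n", $3, $9, s }' <<'EOF'
ID      GID     NAME               NWLD %USED %RUN  %SYS %WAIT  %RDY %IDLE
532     532     webhosting1        5    3.36  3.36  0.02 499.04 0.20 97.01
561     561     simple_web         5    0.59  0.59  0.01 500.00 0.07 99.89
1582    1582    webhosting2        7    41.30 41.33 0.26 660.95 1.39 158.57
3036652 3036652 new_webhosting     6    0.60  0.61  0.00 600.00 0.11 99.98
4639441 4639441 postgres_and_java  6    3.69  3.68  0.02 598.30 1.14 94.96
EOF
```

On this particular snapshot every VM comes out under the threshold, so by that rule of thumb the figures alone don't prove CPU contention; whether %RDY is still a concern here is debated further down the thread.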
Only one server has two vCPUs; I added the second one hoping for better performance.
Hmmm, your %WAIT looks very high. It may be waiting for disk I/O to complete. You can use esxtop to view your CPU, memory, disk I/O and network I/O, so you will have to troubleshoot this. Here is a guide on how to do that, with some info on diagnosing the results:
http://communities.vmware.com/docs/DOC-9279
Good Luck!
Yes, %RDY looks problematic, but looking at idle, that's very low, so it's not as if the CPUs themselves are idling in the guests, which they would be if they were blocked. You don't have any maximums set in the VMs, do you? You have plenty of free CPU at a macro level, by the looks of it.
No, there aren't any limits.
First, I'd agree with the above: change ALL VMs to 1 vCPU.
Next look at the RAM committed vs available.
Next look at the disk controllers. Do any of these have BBWC? If not, adding it is absolutely essential.
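On these Proliants the controller and battery state can be checked from the host with HP's array CLI (something like `hpacucli ctrl all show status`). The sketch below greps a sample status report for cache/battery problems; the sample text is illustrative output in the hpacucli style, not taken from this thread, and the exact wording may differ by firmware:

```shell
#!/bin/sh
# Pull the cache- and battery-related lines out of a controller status
# report. In real use, pipe "hpacucli ctrl all show status" into this
# grep instead of the sample heredoc below.
grep -Ei 'cache status|battery' <<'EOF'
Smart Array E200 in Slot 1
   Controller Status: OK
   Cache Status: Temporarily Disabled
   Battery/Capacitor Status: Failed (Replace Batteries)
EOF
```

A failed battery silently disables the write cache, which turns every database fsync into a full physical write and can easily explain multi-second stalls like the one reported above.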
HTH
A couple of thoughts: does your test load exhibit the same poor performance on all the servers?
If you have vMotion, I would try iteratively moving workloads away from your test load until you get reasonable performance back. This may show which load is causing the problems; obviously, if you get down to just your test load, you almost certainly have a serious config or fault condition. If you don't have vMotion, you can achieve a similar effect by tuning down the guests using CPU shares, if you can't take the VMs down at all.
If you mean BBWC (battery-backed write cache), this is a very good point: make sure the batteries are good, otherwise the controller will fall back to non-cached mode and poor performance will ensue.
You can look into this via the KAVG and DAVG values in esxtop; they will give you a good idea of what I/O latencies you are facing from the I/O subsystem. DAVG should be on the order of your average disk seek time (5-10 ms or so).
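To make the latency advice above concrete, here is a small triage sketch. The sample numbers are made up for illustration: DAVG is the device (array/controller) latency and KAVG the time spent in the VMkernel. As common rules of thumb (mine, not from the thread), sustained DAVG above ~20 ms points at the storage hardware, while KAVG above ~2 ms points at the hypervisor storage stack:

```shell
#!/bin/sh
# Classify per-adapter latencies: column 1 is the adapter, column 2
# DAVG in ms, column 3 KAVG in ms. Sample values are illustrative.
awk '{ d = ($2 > 20) ? "device SLOW" : "device ok";
       k = ($3 > 2)  ? "kernel SLOW" : "kernel ok";
       printf "%s: DAVG=%s (%s) KAVG=%s (%s)\n", $1, $2, d, $3, k }' <<'EOF'
vmhba1 35.2 0.4
vmhba2 6.1 0.1
EOF
```

A high DAVG with a low KAVG, as in the first sample row, would fit the disabled-BBWC theory discussed earlier in the thread.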