Hello,
we have horrible performance problems with both ESXi 3.5 and 4.0. First, here are some of our server setups:
Proliant DL160 G5 with LSI Logic HW RAID 1:
ESXi 3.5
4 x 1.994 GHz
10 GB RAM
8 VMs: one Windows server and 7 Debian machines, all installed and up to date.
They mostly run some small web hosting services, one DNS server and 4 rarely used MySQL database servers.
Proliant DL140 G3 with e200 SmartArray RAID 1:
ESXi 4.0
4 x 1.6 GHz
13 GB RAM
6 VMs: one Windows Server 2008, three Windows Server 2003 and two Debian machines.
These VMs need somewhat more I/O and CPU.
There are 4 databases: one MSSQL, two MySQL, one Adaptive Server Anywhere.
Proliant DL380 G5 with SmartArray P400:
8 GB RAM
4 x 2.5 GHz
Three Debian machines. Two with Apache and two with MySQL (one for replication).
Proliant DL180 G5 with SmartArray E200:
12 GB RAM
4 x 2.5 GHz
9 machines:
two Debian web hosting servers that see decent usage (Apache, MySQL, FTP etc.)
one web hosting server with the same configuration, but not heavily used yet
one idling server, also with Apache and MySQL
one CentOS server with PostgreSQL and a big Java application
three servers with Apache and MySQL that don't get much load
So our problem is that most servers don't show any load; for every VM, ESX reports something like:
RAM usage: 60%
CPU usage: 25%
Sometimes the systems are so slow that we get the following:
$ time curl -H 'Host: <hostname>' http://<hostname>/script.php
hello world!
real 0m20.866s
user 0m0.016s
sys 0m0.000s
20 seconds just for a simple hello-world PHP script.
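One check I would try (my suggestion, not something from the thread) is to rule out the web stack: run a trivial process-spawning loop inside one of the slow guests. Spawning 200 short-lived processes should take well under a second on a healthy VM; if even this takes seconds, the slowdown is below PHP/Apache, at the CPU-scheduling or storage level of the hypervisor.

```shell
#!/bin/sh
# Run inside a slow guest. Forking 200 trivial processes is pure
# CPU/scheduler work; if "real" time here is seconds rather than
# milliseconds, the guest itself is being starved by the host.
time sh -c 'i=0; while [ "$i" -lt 200 ]; do /bin/true; i=$((i+1)); done'
```

Comparing this against timing `script.php` via the PHP CLI on the same guest would further separate script execution time from the HTTP path.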
Where could the problem be? Personally I think the number of database servers may be the problem (even though most of them are rarely used).
Cheers.
You need to use esxtop to see where the contention is occurring. How many vCPUs does each VM have? Ideally use 1 vCPU per VM unless there are very good reasons to use more; using 4 vCPUs will often result in slower performance than 1, because the VM has to wait for 4 pCPUs to become available at once.
esxtop:
10:11:21am up 49 days 21:09, 137 worlds; CPU load average: 0.51, 0.56, 0.45
PCPU(%): 14.23, 10.30, 14.51, 14.13 ; used total: 13.29
ID GID NAME NWLD %USED %RUN %SYS %WAIT %RDY %IDLE %OVRLP %CSTP %MLMTD
532 532 <webhosting1> 5 3.36 3.36 0.02 499.04 0.20 97.01 0.13 0.00 0.00
561 561 <simple_web> 5 0.59 0.59 0.01 500.00 0.07 99.89 0.13 0.00 0.00
1582 1582 <webhosting2> 7 41.30 41.33 0.26 660.95 1.39 158.57 0.48 0.00 0.00
3036652 3036652 <new_webhosting> 6 0.60 0.61 0.00 600.00 0.11 99.98 0.08 0.00 0.00
4639441 4639441 <postgres_and_java> 6 3.69 3.68 0.02 598.30 1.14 94.96 0.41 0.00 0.00
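A quick way to triage a snapshot like the one above is to look at %RDY per VM (column 9). As a common rule of thumb (my numbers, not from the thread), sustained %RDY under ~5% is fine and over ~10% per vCPU usually means real CPU contention. The sketch below runs that check over the figures pasted above:

```shell
#!/bin/sh
# Flag VMs whose %RDY exceeds a ~5% warning threshold (rule of thumb).
# The rows are copied from the esxtop output above, header included.
awk 'NR > 1 { s = ($9 > 5) ? "HIGH" : "ok";
              printf "%-20s %%RDY=%-5s %s\n", $3, $9, s }' <<'EOF'
ID      GID     NAME               NWLD %USED %RUN  %SYS %WAIT  %RDY %IDLE
532     532     webhosting1        5    3.36  3.36  0.02 499.04 0.20 97.01
561     561     simple_web         5    0.59  0.59  0.01 500.00 0.07 99.89
1582    1582    webhosting2        7    41.30 41.33 0.26 660.95 1.39 158.57
3036652 3036652 new_webhosting     6    0.60  0.61  0.00 600.00 0.11 99.98
4639441 4639441 postgres_and_java  6    3.69  3.68  0.02 598.30 1.14 94.96
EOF
```

On this particular snapshot every VM comes out under the threshold, so by that rule of thumb the figures alone don't prove CPU contention; whether %RDY is still a concern here is debated further down the thread.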
Only one server has two vCPUs; I added the second one hoping for better performance.
Hmmm, your %WAIT looks very high. It may be waiting for disk I/O to complete. You can use esxtop to view your CPU, memory, disk I/O and network I/O, so you will have to troubleshoot this. Here is a guide on how to do that, with some info on diagnosing the results:
http://communities.vmware.com/docs/DOC-9279
Good Luck!
Yes, %RDY looks problematic, but looking at idle, that's very low, so it's not as if the CPUs themselves are idling in the guests, which they would be if they were blocked. You don't have any maximums set in the VMs, do you? You have plenty of free CPU at a macro level, by the looks of it.
No, there aren't any limits.
First, I'd agree with the above: change ALL VMs to 1 vCPU.
Next look at the RAM committed vs available.
Next look at the disk controllers. Do any of these have BBWC? If not, adding it is absolutely essential.
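On these Proliants the controller and battery state can be checked from the host with HP's array CLI (something like `hpacucli ctrl all show status`). The sketch below greps a sample status report for cache/battery problems; the sample text is illustrative output in the hpacucli style, not taken from this thread, and the exact wording may differ by firmware:

```shell
#!/bin/sh
# Pull the cache- and battery-related lines out of a controller status
# report. In real use, pipe "hpacucli ctrl all show status" into this
# grep instead of the sample heredoc below.
grep -Ei 'cache status|battery' <<'EOF'
Smart Array E200 in Slot 1
   Controller Status: OK
   Cache Status: Temporarily Disabled
   Battery/Capacitor Status: Failed (Replace Batteries)
EOF
```

A failed battery silently disables the write cache, which turns every database fsync into a full physical write and can easily explain multi-second stalls like the one reported above.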
HTH
A couple of thoughts: does your test load exhibit the same poor performance on all the servers?
If you have vMotion, I would try iteratively moving workloads away from your test load until you get reasonable performance back. This may show which load is causing the problems; obviously, if you get down to just your test load, you almost certainly have a serious config or fault condition. If you don't have vMotion, you can achieve a similar effect by tuning down the guests using CPU shares, if you can't take the VMs down at all.
If you mean BBWC (battery-backed write cache), this is a very good point: make sure the batteries are good, otherwise the controller will fall back to non-cached mode and poor performance will ensue.
You can look into this via the KAVG and DAVG values in esxtop; they will give you a good idea of what I/O latencies you are facing from the I/O subsystem. DAVG should be on the order of your average disk seek time (5-10 ms or so).
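To make the latency advice above concrete, here is a small triage sketch. The sample numbers are made up for illustration: DAVG is the device (array/controller) latency and KAVG the time spent in the VMkernel. As common rules of thumb (mine, not from the thread), sustained DAVG above ~20 ms points at the storage hardware, while KAVG above ~2 ms points at the hypervisor storage stack:

```shell
#!/bin/sh
# Classify per-adapter latencies: column 1 is the adapter, column 2
# DAVG in ms, column 3 KAVG in ms. Sample values are illustrative.
awk '{ d = ($2 > 20) ? "device SLOW" : "device ok";
       k = ($3 > 2)  ? "kernel SLOW" : "kernel ok";
       printf "%s: DAVG=%s (%s) KAVG=%s (%s)\n", $1, $2, d, $3, k }' <<'EOF'
vmhba1 35.2 0.4
vmhba2 6.1 0.1
EOF
```

A high DAVG with a low KAVG, as in the first sample row, would fit the disabled-BBWC theory discussed earlier in the thread.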