I have a single SLES10 SP2 VM, hosted on our ESX 3.5u2 cluster, that's giving excruciatingly bad performance.
The virtual hardware is a 40GB SCSI disk, mounted as:
/dev/sda3 on / type xfs (rw,noatime,nodiratime,logbufs=8) 39.9GB
/dev/sda2 on /boot type ext3 (rw,acl,user_xattr) 200MB
Linux testnewac 188.8.131.52-0.34-vmi SMP
The VM has the VMI enabled kernel installed and running, and Paravirtualisation is enabled in the VM settings (as per VMWare KB article http://kb.vmware.com/selfservice/documentLink.do?externalID=1005701)
Similar SLES10 SP2 VMs with similar setups do not exhibit this crippled performance.
The performance issue seems to be disk related: top shows kblockd with a large share of the CPU time, and anything disk intensive makes the problem worse. Running the YaST 'Software management' module can take upwards of five minutes to get going, and during that time everything else is crippled as well.
Here's output from top - I rebooted three hours back, and as you can see, kblockd has had a large slice of the time.
top - 14:31:03 up 2:59, 7 users, load average: 0.18, 0.30, 0.99
Tasks: 124 total, 2 running, 118 sleeping, 0 stopped, 4 zombie
Cpu(s): 0.0%us, 0.0%sy, 0.0%ni, 98.7%id, 0.0%wa, 1.3%hi, 0.0%si, 0.0%st
Mem: 3896200k total, 3628428k used, 267772k free, 348k buffers
Swap: 1052216k total, 0k used, 1052216k free, 3375752k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1 root 16 0 732 280 244 S 0 0.0 0:16.35 init
2 root RT 0 0 0 0 S 0 0.0 0:00.21 migration/0
3 root 34 19 0 0 0 S 0 0.0 0:00.21 ksoftirqd/0
4 root RT 0 0 0 0 S 0 0.0 0:00.26 migration/1
5 root 34 19 0 0 0 S 0 0.0 0:00.61 ksoftirqd/1
6 root 10 -5 0 0 0 S 0 0.0 0:04.60 events/0
7 root 10 -5 0 0 0 S 0 0.0 0:08.82 events/1
8 root 10 -5 0 0 0 S 0 0.0 0:00.28 khelper
9 root 10 -5 0 0 0 S 0 0.0 0:00.00 kthread
13 root 10 -5 0 0 0 S 0 0.0 18:01.72 kblockd/0
14 root 10 -5 0 0 0 S 0 0.0 17:31.37 kblockd/1
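To put the kblockd figures in perspective, the TIME+ column can be compared against the uptime. This is only a rough sketch in Python: it assumes TIME+ is in minutes:seconds.hundredths as top prints it above, and that the VM has two virtual CPUs (inferred from the /0 and /1 thread suffixes). The function names are made up for illustration.

```python
def cpu_time_seconds(time_plus):
    """Convert a top TIME+ value such as '18:01.72' (min:sec.hundredths) to seconds."""
    minutes, seconds = time_plus.split(":")
    return int(minutes) * 60 + float(seconds)


def uptime_seconds(uptime):
    """Convert an 'H:MM' uptime string such as '2:59' to seconds."""
    hours, minutes = uptime.split(":")
    return (int(hours) * 60 + int(minutes)) * 60


# Values copied from the top output above.
kblockd = ["18:01.72", "17:31.37"]
total = sum(cpu_time_seconds(t) for t in kblockd)
up = uptime_seconds("2:59")
ncpus = 2  # assumption: one kblockd thread per virtual CPU

print(f"kblockd CPU time: {total:.0f}s of {up}s uptime")
print(f"share of total CPU capacity: {100 * total / (up * ncpus):.1f}%")
```

On those numbers kblockd has used roughly 10% of all CPU time available since boot, which is a lot for a kernel thread on a machine that is otherwise 98.7% idle.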
I'm not sure how to go about troubleshooting this - I've got a few ideas, but don't know if they're valid or not:
- the system clock is running at a variable rate. Time is kept in sync by NTP to an external NTP server, but I have turned off the setting in VMWare Toolbox to synchronise the clock with the ESX host, and I've also removed the clock= option from Grub's boot config, as I thought these were unnecessary with the VMI kernel
- something (looking at kblockd) is chewing up CPU while disk access is happening.
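One way to test the disk theory in isolation, without scp or the network involved, would be to time a plain sequential write on this VM and on one of the healthy ones. The following is a minimal sketch in Python 3, not a proper benchmark: the function name, file size and block size are arbitrary choices of mine, and because the guest clock itself is suspect, the figures should be treated as rough.

```python
import os
import tempfile
import time


def write_throughput_mb_s(directory, total_mb=256, block_kb=1024):
    """Write total_mb of zeros to a temp file in directory, fsync it, return MB/s."""
    block = b"\0" * (block_kb * 1024)
    blocks = (total_mb * 1024) // block_kb
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            for _ in range(blocks):
                f.write(block)
            f.flush()
            os.fsync(f.fileno())  # force the data out of the page cache to disk
        elapsed = time.perf_counter() - start
    finally:
        os.remove(path)  # always clean up the test file
    return total_mb / elapsed if elapsed > 0 else float("inf")
```

Calling write_throughput_mb_s("/") as root on both VMs and comparing the results should show whether the xfs root filesystem is the slow part. The fsync matters: without it the write mostly lands in the page cache and looks much faster than the disk really is.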
The most noticeable effect was scp'ing a 1.4GB file to the VM. What should have taken a couple of minutes (from a machine on the same gigabit switch) currently has an ETA of 35 minutes, at a data rate that fluctuates between 300KB/s and 700KB/s (it started at 17MB/sec and dropped like a stone).
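As a sanity check on those numbers, the expected transfer times can be worked out directly. A small Python sketch, assuming binary units (1GB = 1024 x 1024 KB); the function name is just for illustration:

```python
def transfer_minutes(size_gb, rate_kb_s):
    """Minutes needed to move size_gb gigabytes at rate_kb_s kilobytes per second."""
    size_kb = size_gb * 1024 * 1024
    return size_kb / rate_kb_s / 60


size_gb = 1.4
for label, rate_kb_s in [("17MB/s", 17 * 1024), ("700KB/s", 700), ("300KB/s", 300)]:
    print(f"{label:>8}: {transfer_minutes(size_gb, rate_kb_s):.1f} min")
```

At the initial 17MB/s the copy would finish in about a minute and a half, while 700KB/s works out to about 35 minutes, which matches the ETA scp reported.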
I'm very puzzled by this. Any clues?
I had rebooted before I could try that.
However, rebuilding the VMware Tools has fixed the problem. I don't really know why, since I had already rebuilt them after switching to the VMI-enabled kernel, and the version number hasn't changed since then. Still, re-doing the install and rebuilding the modules has fixed the performance issue. The same scp now takes a more reasonable couple of minutes, and the same tar operation takes five minutes as opposed to four hours.