Poor Linux I/O performance / long machine pauses on recent kernels?

flyerguybham · 03-09-2008

I have been trying to migrate some services from one VM to another, where the old VM is SLES 9 (2.6.5-7.276-smp) and the new VM is CentOS 5 (2.6.18-53.1.14.el5). The VMs both have 768MB of memory and 2 CPUs. VMware environment is ESX 3.0.2.

Whenever even light to moderate I/O is taking place, the machine becomes almost unresponsive. Some examples:

- Making a new ext3 filesystem on a partition. It starts writing the inode tables out, and the first few hundred go fine, but then it starts -crawling-, like 1 inode table per 20 seconds. A "vmstat 3" or a "sar -B 3 0" takes about 20-30 seconds for each new line of output. The output itself will show the system at like 20-30% iowait. No major page faults reported by sar.

- Rsyncs that I have seen take 20 minutes on older kernels will take several hours on this system. And during the rsync, attempting to do anything else on the system (ssh to it, run 'ls', etc...) pauses for many seconds before returning.

- Any attempt to run a yum update/install/search results in major pauses and hangs on the order of minutes. It often is 1-2 minutes before yum produces any output at all.

- I have noticed very similar behavior out of a Fedora Core 7 VM on the same ESX host.

During these episodes, system load numbers stay low. The machine remains pingable. Just the rest of the system becomes unresponsive. In the case of making ext3 filesystems, the slowness lasted about 5 minutes, then it picked up speed again. In general, it lasts anywhere from seconds to minutes.
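One way to put numbers on pauses like these, rather than judging by how long vmstat takes between lines, is to run a small loop that sleeps for a fixed interval and records how late each wakeup is. The following is only a minimal Python sketch and not something anyone in this thread ran; the name `measure_stalls`, the 1-second interval and the 0.5-second threshold are illustrative assumptions. Guest clocks can themselves misbehave during a stall, so treat the figures as indicative.

```python
import time


def measure_stalls(interval=1.0, samples=60, threshold=0.5,
                   sleep=time.sleep, clock=time.monotonic):
    """Sleep `interval` seconds `samples` times and report late wakeups.

    Returns a list of (sample_index, overrun_seconds) for every wakeup that
    came back more than `threshold` seconds later than requested. The
    `sleep` and `clock` parameters exist only so the loop can be exercised
    without waiting in real time.
    """
    stalls = []
    for i in range(samples):
        start = clock()
        sleep(interval)
        # How much longer than requested did the sleep actually take?
        overrun = (clock() - start) - interval
        if overrun > threshold:
            stalls.append((i, overrun))
    return stalls
```

Running `measure_stalls()` in a second shell while the mkfs or rsync is in progress, and printing the returned list, would show when the guest stopped scheduling and for how long.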

Things I have tried already:

- Updated all of the packages to latest from CentOS, including kernel.

- Went down to 1 CPU for the VM just to see if that helped (it did not).

- Ran vmstat, sar and atop during episodes to look for suspect numbers. Nothing really obvious yet; everything seems to be within "normal" parameters, but I am not an expert with these tools. The only thing that seems to increase is iowait time.
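Since iowait is the one figure that moves, it can help to compute it directly from two `/proc/stat` snapshots instead of waiting on vmstat or sar while the machine is stalling. This is a hedged sketch: the function names are made up for illustration, and it assumes the usual field order on the aggregate `cpu` line (user, nice, system, idle, iowait, irq, softirq, steal).

```python
def parse_cpu_line(stat_text):
    """Return the aggregate 'cpu' counters from /proc/stat text as ints."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(v) for v in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def iowait_percent(before_text, after_text):
    """Percentage of CPU time spent in iowait between two snapshots."""
    before = parse_cpu_line(before_text)
    after = parse_cpu_line(after_text)
    deltas = [a - b for a, b in zip(after, before)]
    # Only the first eight counters are summed (user .. steal); newer
    # kernels append guest counters that are already included in 'user'.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    # Assumed field order: user, nice, system, idle, iowait, ... (index 4).
    return 100.0 * deltas[4] / total
```

A typical use would be to read `/proc/stat`, sleep a few seconds, read it again, and pass both strings to `iowait_percent`.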

I have read similar-sounding threads online that seem to point a finger at the kernel tick system or the tickless kernels. I have seen a lot of people trying various kernel flags to alter the behaviour of the ticks but they tend to report back that it had no effect.

Questions:

- Does this sound familiar to anyone? Is this a known issue and if so, is there a fix?

- Is there a way for me to debug this from the ESX host side, to see if it is seeing lots of io requests from the VMs? Virtual Center performance graphs show very little activity.

Happy to provide more debugging data, just didn't know what would be useful to include here yet.

Thanks!

flyerguybham · 03-10-2008

More info...

1. vmware-tools is up to date in the CentOS5 guest.

2. Kernel flags I am currently using are: noapic nolapic acpi=off clocksource=acpi_pm elevator=noop. The ntpd service is off in the guest.

I am also going to upgrade to CentOS 5.1 and see if adding divider=10 to reduce the clock interrupt rate to 100Hz helps any.
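Before comparing flag combinations it is worth confirming what the running kernel was actually booted with, since parameters the kernel does not recognise are generally not rejected at boot. The sketch below parses a command line such as the contents of `/proc/cmdline` and works out the tick rate implied by `divider=`. It assumes a base rate of 1000 Hz, which is what "divider=10 gives 100Hz" implies; the function names are illustrative only.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into a {name: value} dict.

    Bare flags such as 'noapic' map to True. Quoted values that contain
    spaces are not handled by this sketch.
    """
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else True
    return params


def effective_tick_hz(params, base_hz=1000):
    """Timer interrupt rate implied by a 'divider=' parameter.

    base_hz=1000 is an assumption taken from the 10 -> 100 Hz figure above.
    """
    divider = int(params.get("divider", 1))
    return base_hz / divider
```

For example, `parse_cmdline(open("/proc/cmdline").read())` returns the flags in effect on the guest, and passing that dict to `effective_tick_hz` shows whether the divider was picked up.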

davemiller41 · 03-28-2008

We're seeing very similar things, though with Fedora 8 on ESX 3.5. yum update will sometimes slow - once in a while to the point that it's stopped for minutes (at least). CPU usage on the virtual host (if we can open top) will go through the roof, as will the CPU usage on the ESX server.

Figures it happens twice then goes through fine when I fire up sar. It appears to happen in spurts, but otherwise randomly. Has happened to several machines on the ESX 3.5 box. Haven't seen this on our 2 ESX 3.0.2 boxes. No external storage - plenty of RAM and CPU to go around, so that's not a problem.

Stock Fedora 8 kernel. i386. VMware-tools not installed yet.

Any suggestions would be helpful.

Thanks.

Dave

jadamt · 04-02-2008

I haven't witnessed this but it's definitely concerning.

devzero · 05-07-2008

i think this is related: http://communities.vmware.com/message/938157 , Linux guests stutter under load

please post information about controller type and disk subsystem!

Jlutgen · 05-07-2008

This is the same problem my company is having right now..

We are connected to a NetApp 3070, and if we try to run anything on the virtual disk the VM becomes brutally unstable and unresponsive.

Our application loads all of the files into RAM, processes them there, and pushes the results back to a text document.

We are processing records of text and de-duplicating them... and it's unreal on CentOS 3.9, CentOS 4.6 and CentOS 5.1.

Texiwill · 05-07-2008

Hello,

Be sure hotplug is not installed. That often causes issues for me.

Best regards,

Edward L. Haletky

VMware Communities User Moderator

====

Author of the book 'VMWare ESX Server in the Enterprise: Planning and Securing Virtualization Servers', Copyright 2008 Pearson Education. As well as the Virtualization Wiki at http://www.astroarch.com/wiki/index.php/Virtualization

--
Edward L. Haletky
vExpert XIV: 2009-2023,
VMTN Community Moderator
vSphere Upgrade Saga: https://www.astroarch.com/blogs
GitHub Repo: https://github.com/Texiwill

Jlutgen · 05-07-2008

Hotplug is running on the servers that we are having problems with. How can that cause an issue?

Obviously the CentOS 3.9 boxes will be running hotplug rather than udev...

but 4.6 and higher have both.

I tried a yum remove hotplug.

Yeah, not a good plan: it removes everything.

3 questions:

1. How would I remove just hotplug?

2. Is there any way to disable it completely?

3. What issues could it cause in the VM world?

Thanks,

devzero · 05-08-2008

i think this isn't related

Texiwill · 05-08-2008

Hello,

Hotplug is used to detect new hardware being added while the system is running. Since VMs tend to be static and hardware changes require a reboot, hotplug is not required.

I remove hotplug during all my installs and I have never had an issue.

I use the following to remove hotplug and anything related to pcmcia.... I tend to remove hardware items that do not exist in the virtual world.

rpm -e --nodeps hotplug kernel-pcmcia-cs

This effectively gets rid of anything that waits on hardware being added to the VM while the VM is running. For Workstation, or anything on which I use USB, I would not do this, but for a server I always do. In effect, disable anything that tries to use hardware not connected to the VM.

I also disable the following services as they are not useful for a VM with a single vCPU. Note: if you use md devices, keep mdmonitor.

sgi_fam apmd hpoj isdn autofs cups sendmail kudzu gpm pcmcia xfs bluetooth hidd hplip avahi-daemon cpuspeed firstboot mdmonitor pcscd rpcgssd rpcidmapd nscd fuse irqbalance

I would keep irqbalance if using more than one vCPU however.

I also disable the following, if they exist, as I do not use NFS mounts on or from my server VMs.

nfslock portmap netfs
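As a rough illustration of how the lists above could be applied consistently across several guests, here is a hedged Python sketch that builds the `chkconfig`/`service` commands and prints them by default rather than running them. The helper names and the `vcpus`, `uses_md` and `uses_nfs` switches are assumptions made for the example; the service names are copied from the lists above, and many of them will not exist on any given install.

```python
import subprocess

# Service names copied from the lists above.
HARDWARE_AND_DESKTOP = (
    "sgi_fam apmd hpoj isdn autofs cups sendmail kudzu gpm pcmcia xfs "
    "bluetooth hidd hplip avahi-daemon cpuspeed firstboot mdmonitor pcscd "
    "rpcgssd rpcidmapd nscd fuse irqbalance"
).split()
NFS_RELATED = "nfslock portmap netfs".split()


def services_to_disable(vcpus=1, uses_md=False, uses_nfs=False):
    """Return the service names to turn off for this kind of guest."""
    keep = set()
    if vcpus > 1:
        keep.add("irqbalance")   # kept when there is more than one vCPU
    if uses_md:
        keep.add("mdmonitor")    # kept when md devices are in use
    names = list(HARDWARE_AND_DESKTOP)
    if not uses_nfs:
        names += NFS_RELATED
    return [n for n in names if n not in keep]


def build_commands(names):
    """Pair each service with a 'chkconfig off' and a 'service stop'."""
    commands = []
    for name in names:
        commands.append(["chkconfig", name, "off"])
        commands.append(["service", name, "stop"])
    return commands


def run_commands(commands, dry_run=True):
    """Print the commands, or execute them when dry_run is False."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # Return code is ignored: a service that is not installed
            # is an expected case here.
            subprocess.call(cmd)
```

Calling `run_commands(build_commands(services_to_disable(vcpus=2)))` prints the commands for a two-vCPU guest without changing anything, which makes it easy to review the list before running it for real.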

Best regards,

Edward L. Haletky

devzero · 05-10-2008

thanks for the hints - very useful indeed!

anyway, i'm still not convinced that this is related, since that doesn't explain why my problem exists with SATA RAID1 but not with SAS RAID1

Texiwill · 05-10-2008

Hello,

If the problem is within the virtual machine then those would help, however, if you consider it to be in the SATA vs SAS arena then you are talking about the entire Host and not just a single VM. Looking at SATA vs SAS, what are the disk I/O times for a seek? Are they comparable? SATA is not faster than SAS so that could be an issue as well. I would start to investigate the /var/log/vmkernel and /var/log/vmware/hostd.log files for specific issues related to performance of your Datastores. I would still do the things I mentioned as it will keep a lot of processes from running that are just not necessary.

Best regards,

Edward L. Haletky

