EthanSommer
Contributor

Weird timing voodoo. Linux top command very fast

On a few of my VMs running Linux (but not most of them) I get this weird timing issue.

The easiest way to see whether it is happening is to run "top", which normally updates every second; when the problem is occurring, top updates repeatedly as fast as possible. I believe other things are affected by it too, such as things timing out prematurely (the screen saver turns on VERY quickly, etc.).
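A rough way to check the same symptom without watching top is to time a one-second sleep from inside the guest. This is only a sketch: the function names `measure_sleep` and `returned_early` and the 50% tolerance are made up for illustration, and it assumes the guest's monotonic clock is still trustworthy while timers fire early, which may not hold on an affected VM.

```python
import time


def measure_sleep(requested=1.0, sleep_fn=time.sleep, clock=time.monotonic):
    """Sleep for `requested` seconds and return the elapsed time as measured by `clock`."""
    start = clock()
    sleep_fn(requested)
    return clock() - start


def returned_early(elapsed, requested, tolerance=0.5):
    """True if the sleep came back in less than `tolerance` * `requested` seconds.

    The 0.5 default is an arbitrary threshold chosen for this sketch.
    """
    return elapsed < requested * tolerance


if __name__ == "__main__":
    # Three one-second sleeps; on a healthy guest each should measure about 1.0s.
    for _ in range(3):
        elapsed = measure_sleep(1.0)
        status = "EARLY" if returned_early(elapsed, 1.0) else "ok"
        print(f"requested 1.000s, measured {elapsed:.3f}s [{status}]")
```

If the measured values look normal while top is still racing, the clock used for the measurement may be affected as well, so comparing against an external reference (a stopwatch, or another machine) would be the safer check.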

I have three hosts in an EVC (Enhanced VMotion Compatibility) cluster. Two of the machines are older and one is newer. If I migrate the VM to the new host, it fixes the issue.

The older machines' CPU info:

[root@megatron ~]# cat /proc/cpuinfo 
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 23
model name	: Intel(R) Xeon(R) CPU           E5420  @ 2.50GHz
stepping	: 6
cpu MHz		: 2493.796
cache size	: 6144 KB
fpu		: yes
fpu_exception	: yes
cpuid level	: 10
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
bogomips	: 4990.33
clflush size	: 64
cache_alignment	: 64
address sizes	: 38 bits physical, 48 bits virtual
power management:

The newer machine's CPU info:

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 26
model name	: Intel(R) Xeon(R) CPU           E5530  @ 2.40GHz
stepping	: 5
cpu MHz		: 2394.057
cache size	: 8192 KB
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc ida nonstop_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr popcnt lahf_lm
bogomips	: 4790.88
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management: [8]
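For anyone comparing the two dumps by eye, the flags lines can be diffed with a few lines of Python. This is just a sketch using the two flags strings pasted above; `flag_diff` is a made-up helper name, and a difference in flags does not by itself show what causes the timing problem.

```python
# Flags copied from the two /proc/cpuinfo dumps above.
OLDER_FLAGS = (
    "fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 "
    "clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc "
    "pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm"
)
NEWER_FLAGS = (
    "fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 "
    "clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm "
    "constant_tsc ida nonstop_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr "
    "popcnt lahf_lm"
)


def flag_diff(first, second):
    """Return (flags only in `first`, flags only in `second`), each sorted."""
    first_set, second_set = set(first.split()), set(second.split())
    return sorted(first_set - second_set), sorted(second_set - first_set)


only_older, only_newer = flag_diff(OLDER_FLAGS, NEWER_FLAGS)
print("only on the older hosts:", only_older)
print("only on the newer host: ", only_newer)
```

With the strings above, the older E5420 has no flags the newer E5530 lacks, while the E5530 additionally reports ida, nonstop_tsc, popcnt and rdtscp.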

In the short term, I can just run the VMs on the new machine, but I wanted to have some of those VMs spread out to provide redundancy.

Any ideas on how to solve the issue would be very helpful.

7 Replies
AANDCPITSolutio
Contributor

I have just encountered the same problem on a cluster of 8 nodes... all exactly the same hardware and same revision.

A reboot of the virtual machine guest didn't fix the problem, but we shut down the guest and checked the VM version. The machine with the issue was running as a version 4 machine.

While the guest was shut down, we upgraded it to version 7 and restarted. It is now working OK.

Are your Linux boxes version 4 or version 7?

ceri
Contributor

I'm seeing this too, on hosts that are exactly the same running ESX 4.0 Update 1, build 244038.

Have seen this with Linux guests on virtual hardware versions 4 and 7, but we can "fix" the problem by doing a Power Off and Power On cycle. A Reset or a reboot from the guest OS doesn't fix the issue; I think an operation that destroys and recreates the relevant world is required. We've seen the issue persist after a Suspend/Power On and after a VMotion to a different host.

Looking at esxtop while the problem is present shows the vcpu-0:vmname world of each affected VM is pegging CPU.

I raised a ticket with VMware but they tried to blame our storage, so I didn't get too far with that.

I noticed that, as well as top, the screen blanking on the console seemed to be kicking in very quickly, although services don't seem to see any ill effect.

Would love to hear that we have something in common and can get this fixed between us with VMware's help. Has everyone raised an SR? Perhaps we can reference each other's.

Our cpuinfo from the ESX hosts:

# cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 26
model name : Intel(R) Xeon(R) CPU E5540 @ 2.53GHz
stepping : 5
cpu MHz : 2527.096
cache size : 8192 KB
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc ida nonstop_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr popcnt lahf_lm
bogomips : 5057.05
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:

DSTAVERT
Immortal

Have a look at this KB article. http://kb.vmware.com/kb/1006113

-- David -- VMware Communities Moderator
ceri
Contributor

Hmm, I don't know. The clock is fine; I think that article is saying that it wouldn't be on an affected system. We have kernel version 2.6.18-164.el5 on the affected VMs.

Message was edited by: ceri to add kernel rev

dmadden
VMware Employee

ESX 4 Update 2 contains a fix for the Red Hat timing issue. It typically occurs after a long period of uptime for the ESX host (1 month) and after a VMotion. Top refreshing quickly is one of the symptoms. Please apply Update 2 when you have an opportunity.

ceri
Contributor

That's wonderful news, thank you!

ceri
Contributor

Just to confirm that 4.0 Update 2 did fix this for us; VMotioning an affected VM on to a host running Update 2 fixed the problem without requiring a reboot of any kind.
