On a few of my Linux VMs (but not most of them) I get a strange timing issue.
The easiest way to tell whether it is happening is to run "top", which normally updates once a second; when the problem is occurring, top refreshes repeatedly, as fast as possible. I believe other things are affected by it too, such as timeouts expiring prematurely. (The screen saver turns on VERY quickly, etc.)
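A rough guest-side sanity check for this kind of symptom (a sketch, not an official diagnostic): time a short sleep against the wall clock. On an affected VM, where timer interrupts are being delivered too fast, sleep tends to return well before the expected number of real seconds has elapsed.

```shell
# Time a 5-second sleep against the wall clock.
# On a healthy guest, elapsed should be ~5; on an affected guest,
# premature timer expiry can make it come back noticeably early.
start=$(date +%s)
sleep 5
end=$(date +%s)
elapsed=$((end - start))
echo "sleep 5 took ${elapsed}s of wall-clock time"
```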
I have three hosts in an EVC (Enhanced VMotion Compatibility) cluster. Two of the machines are older and one is newer. If I migrate the VM to the new host, the issue goes away.
The older machines' CPU info:
[root@megatron ~]# cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Xeon(R) CPU E5420 @ 2.50GHz
stepping : 6
cpu MHz : 2493.796
cache size : 6144 KB
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
bogomips : 4990.33
clflush size : 64
cache_alignment : 64
address sizes : 38 bits physical, 48 bits virtual
power management:
The newer machine's CPU info:
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 26
model name : Intel(R) Xeon(R) CPU E5530 @ 2.40GHz
stepping : 5
cpu MHz : 2394.057
cache size : 8192 KB
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc ida nonstop_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr popcnt lahf_lm
bogomips : 4790.88
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:
In the short term, I can just run the VMs on the new machine, but I wanted to have some of those VMs spread out across hosts to provide redundancy.
Any ideas on how to solve the issue would be very helpful.
I have just encountered the same problem on a cluster of 8 nodes... all exactly the same hardware and same revision.
A reboot of the virtual machine guest didn't fix the problem, so we shut down the guest and checked the VM version: the machine with the issue was running as a version 4 machine.
While the guest was shut down, we upgraded it to version 7 and restarted it. It is now working OK.
Are your Linux boxes version 4 or version 7?
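One way to check this from the host side: the virtual hardware version is recorded in the guest's .vmx file as `virtualHW.version`. A minimal sketch follows; the file path and contents here are made up for illustration (on a real ESX host the .vmx lives under /vmfs/volumes/<datastore>/<vm>/).

```shell
# Create a hypothetical example .vmx fragment so the grep below
# has something to work on; on a real host you would grep the
# guest's actual .vmx file instead.
cat > /tmp/example.vmx <<'EOF'
config.version = "8"
virtualHW.version = "4"
EOF

# A version 4 guest shows "4" here; a version 7 guest shows "7".
grep -i 'virtualHW.version' /tmp/example.vmx
```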
I'm seeing this too, on hosts that are exactly the same running ESX 4.0 Update 1, build 244038.
Have seen this with Linux guests on virtual hardware versions 4 and 7, but we can "fix" the problem by doing a Power Off and Power On cycle. A Reset or a reboot from the guest OS doesn't fix the issue - I think an operation that destroys and recreates the relevant world is required. We've seen the issue persist after a Suspend/Power On and after a VMotion to a different host.
Looking at esxtop while the problem is present shows the vcpu-0:vmname world of each affected VM is pegging CPU.
I raised a ticket with VMware but they tried to blame our storage, so I didn't get too far with that.
I noticed that, as well as top, the screen blanking on the console seemed to be kicking in very quickly, although services don't seem to see any ill effect.
Would love to hear that we have something in common and can get this fixed between us with VMware's help. Has everyone raised an SR? Perhaps we can reference each other's.
Our cpuinfo from the ESX hosts:
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 26
model name : Intel(R) Xeon(R) CPU E5540 @ 2.53GHz
stepping : 5
cpu MHz : 2527.096
cache size : 8192 KB
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc ida nonstop_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr popcnt lahf_lm
bogomips : 5057.05
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
Have a look at this KB article: http://kb.vmware.com/kb/1006113
Hmm, I don't know. The clock is fine; I think that article is saying that it wouldn't be on an affected system. We have kernel version 2.6.18-164.el5 on the affected VMs.
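Since that article revolves around guest timekeeping, one quick thing worth checking is which clocksource the kernel is using. A hedged sketch; the sysfs interface below exists on many 2.6.18-era RHEL 5 kernels but is not guaranteed on every build, so the script degrades gracefully if it's absent:

```shell
# Show the clocksource the kernel is currently using, if the
# kernel exposes the clocksource sysfs interface at all.
cs=/sys/devices/system/clocksource/clocksource0/current_clocksource
if [ -r "$cs" ]; then
    cat "$cs"
else
    echo "clocksource sysfs interface not available on this kernel"
fi
```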
ESX 4 Update 2 contains a fix for the Red Hat timing issue. It would typically occur after a long period of ESX host uptime (about a month) and after a VMotion. Top refreshing quickly is one of the symptoms. Please apply Update 2 when you have an opportunity.
That's wonderful news, thank you!
Just to confirm that 4.0 Update 2 did fix this for us; VMotioning an affected VM on to a host running Update 2 fixed the problem without requiring a reboot of any kind.