We've got ourselves a curious problem. We migrated our cluster of 4 hosts from ESX 4.0 to ESXi 4.1 and moved the hosts into a perimeter network (fresh install from HP media). Our Exchange 2003 SP2 server suddenly peaks at 100% CPU after a day's use. Only a live vMotion settles the VM down. There's no ballooning, no resource pool restriction and no contention, and the other VMs on the host have no problem whatsoever when Exchange peaks. I'm having a hard time troubleshooting the machine.
Within the VM there's no process peaking, but the Task Manager performance graph flatlines at 100%. KB1001133 is not applicable here since the store is not peaking at all. There's 1 vCPU on a uniprocessor HAL. No CPU reservations. No memory problems. Only a live migration fixes it.
esxtop while busy:
    ID      GID     NAME      NWLD  %USED   %RUN    %SYS  %WAIT   %RDY
    642719  642719  exchange  4     100.34  101.84  0.12  297.95  0.28
esxtop after migration:
    ID      GID     NAME      NWLD  %USED   %RUN    %SYS  %WAIT   %RDY
    642719  642719  exchange  4     6.44    16.33   0.20  383.21  0.47
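For anyone comparing their own numbers: below is a minimal Python sketch that parses the two esxtop rows above and flags the pattern seen here, high %USED with low %RDY. Low %RDY means the VM is not waiting for a physical CPU, so the load comes from the VM itself rather than from host contention. The function names and both thresholds are my own assumptions for illustration, not VMware guidance.

```python
# Column names follow the esxtop header quoted in the post.
FIELDS = ["ID", "GID", "NAME", "NWLD", "%USED", "%RUN", "%SYS", "%WAIT", "%RDY"]


def parse_row(line):
    """Split one whitespace-separated esxtop row into a dict keyed by column."""
    row = dict(zip(FIELDS, line.split()))
    # Everything from NWLD onwards is numeric.
    for key in FIELDS[3:]:
        row[key] = float(row[key])
    return row


def busy_without_contention(row, used_threshold=90.0, ready_threshold=5.0):
    """True when the VM burns CPU but is not queuing for it.

    Both thresholds are assumed values chosen for this example.
    """
    return row["%USED"] >= used_threshold and row["%RDY"] < ready_threshold


busy = parse_row("642719 642719 exchange 4 100.34 101.84 0.12 297.95 0.28")
after = parse_row("642719 642719 exchange 4 6.44 16.33 0.20 383.21 0.47")

for label, row in (("busy", busy), ("after vMotion", after)):
    print(label, row["%USED"], row["%RDY"], busy_without_contention(row))
```

With the figures from this thread, only the "busy" sample is flagged, which fits the observation that the other VMs on the host are unaffected.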
I temporarily removed the VMware Tools, stopped the store and IIS, and disabled and uninstalled Data Protector and the antivirus. A reboot or shutdown does not settle the VM down... only a live migration does. This happens once a day.
>>Within the VM there's no process peaking but the taskmanager performance flatlines at 100%.
This means something within the VM is hogging the CPU.
Can you check whether the kernel is peaking? Task Manager -> View -> Show Kernel Times
You could also get performance diagnostic data using
There's no specific process responsible for the CPU hog. It just totals 100%.
Check the differences between the hog and normal use (after vMotion): hog.gif vs. normal.gif respectively.
Kernel times during the CPU hog are normal? Check kernel.gif.
I've found something to replicate the CPU hog. When I run "C:\WINDOWS\system32\winmsd.exe /report C:\SysInfo.Txt" the vm hogs... like a pig. Even after the winmsd job is finished. It does so until it's vmotioned as stated in the subject. Since the winmsd job invokes "wmiprvse.exe" could it be a corrupt WMI repository? If so, I'll try "rundll32 wbemupgd, RepairWMISetup" after production hours.
Also you might want to check if there is a registry corruption
made a support call:
Had the /3GB switch in the boot.ini because of but ESX4.1 has some trouble with this so I used . (from the 14th of september!!) Fixed with "Use Intel VT-X/AMD-V for instruction set virtualization and Intel EPT/AMD RVI for MMU virtualization"
The permanent fix for this issue has been designed and is included in ESX 4.1 Update 1
/edit: the /3GB CPU load problem we had is fixed in http://kb.vmware.com/kb/1027021. It's in ESXi410-201010401-SG.
If you start a Microsoft Windows Server 2003 32-bit virtual machine with /3GB switch defined in the boot.ini file on VMware ESXi 4.1, you might see the following symptoms:
Read or Write memory errors occur in the guest operating system.
A Remote Procedure Call (RPC) error is reported and the virtual machine is forced to reboot often.
A stop code of type 0x000000F4 occurs.
Microsoft .NET or Java applications might fail with memory errors.
The Microsoft Windows Event log might contain error messages similar to the following:
Event Type: Error
Event Source: .NET Runtime
Event Category: None
Description:.NET Runtime version 2.0.50727.3615 - Fatal Execution Engine Error (7A0979AE) (80131506)
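To find out which Windows 2003 guests are exposed, you can check each boot.ini for the /3GB switch. Here is a minimal Python sketch of that check. It only inspects text you hand it; the function name and the SAMPLE boot.ini are hypothetical and made up for illustration, not taken from the KB article.

```python
import re


def entries_with_3gb(boot_ini_text):
    """Return the [operating systems] lines that carry the /3GB switch."""
    hits = []
    in_os_section = False
    for raw in boot_ini_text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # Track whether we are inside the [operating systems] section.
            in_os_section = line.lower() == "[operating systems]"
            continue
        if in_os_section and re.search(r"/3GB\b", line, re.IGNORECASE):
            hits.append(line)
    return hits


# Hypothetical boot.ini content, for illustration only.
SAMPLE = """[boot loader]
timeout=30
default=multi(0)disk(0)rdisk(0)partition(1)\\WINDOWS
[operating systems]
multi(0)disk(0)rdisk(0)partition(1)\\WINDOWS="Windows Server 2003, Standard" /fastdetect /3GB
"""

for entry in entries_with_3gb(SAMPLE):
    print("uses /3GB:", entry)
```

An empty result means no boot entry in that file uses /3GB, so the problem described in this KB should not apply to that guest.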
We are having a very similar problem with our VMware View cluster. We have Windows XP Pro SP3 32-bit guest VMs that randomly get stuck consuming around 35-40% CPU each and max out the host's CPU power, but when you vMotion the guests around, the issue immediately cures itself. You can even vMotion to another host and then right back to the same host and the CPU usage will normalize. We are running ESXi 4.1.0 #320137.
I am very open to suggestions!
Thanks so much,
Shane
Has anyone run across the problem I mentioned above? I have Windows XP SP3 32-bit guests that get stuck with high CPU to the point where they consume all of the ESX host's available CPU, but if I vMotion the guest VMs the issue clears up. It happens randomly, and thus far I have not found what is causing the issue.
Shane
I have seen some problems like this with View and other servers. Here is what I found that fixed my problem:
Edit the settings on the virtual machine.
Click Resources.
With CPU highlighted, make sure the Unlimited check box is checked.
Then click and highlight Memory and make sure Unlimited is checked there as well.
The problem I was having was that the memory was being swapped out, ballooned, compressed, etc. Once I checked those boxes my problems went away.
Stephen