VMware Cloud Community
wayneoakley99
Enthusiast

Extreme CPU Wait on VMs in Cluster

I assist a customer with an ESX 3.5u3 cluster that is in dire straits, apparently due to extreme CPU Wait times on all the VMs I have looked at so far, in the range of 16,000 - 20,000 milliseconds.

Platforms: 2 x HP DL580G4 4P dual processors with 64GB memory each, and 4 x HP DL380G4 with 2P Xeon processors and 16GB each. The backend is an MSA1000 FC SAN partitioned into two main disks: 1.36TB for production servers and 1.25TB for VDI workstations.

The CPU Wait time is the same no matter which ESX host the VM runs on, no matter how loaded the host is (1 VM or many), and no matter where the VM is stored, whether on the SAN or on a datastore local to the host.

Right out of the gate at guest boot it jumps into the thousands, then climbs quickly to the 16-20k range and pretty much stays there whether the VM is in use or totally idle.

I have been looking all over the place at various counters and indicators, but at this point I am rather stumped about where to go next.

Does anybody have suggestions on what might be a course of investigation?

thanks

5 Replies
Neth66
Enthusiast

Are the VMs single-CPU or multi-CPU? If the hosts have 2 CPUs and the VMs are 2-CPU VMs, it could be a scheduling issue, as the host can't get two free CPUs at the same time to schedule the VM. If you can, try downloading some trial versions of software that monitors a VI.

wayneoakley99
Enthusiast

The VMs are a mix of 2 vCPU and 1 vCPU. I have been testing and monitoring mostly with a 1 vCPU VM, but it does not appear to matter.

I am beginning to wonder if this is a normal reading for a machine that is not busy, because when I monitor a VM that is genuinely CPU-busy, the graph is the inverse of the CPU utilization: when the CPU usage goes up, the CPU Wait time goes down.

The information I have been able to find on the meaning of CPU Wait indicates that it is due to I/O blocking, or waiting for some I/O activity to complete, but the readings seem to indicate otherwise.

I guess I will have to bother the Support folks with this and get them involved in checking the performance of everything.

thanks

Ken_Cline
Champion

Opening a support case is a good idea. While you're doing that, run esxtop on the physical host where the VM is running and see what the %READY time is. %READY is the time that a VM is ready to execute on a CPU but no CPU is available. Also while you're in esxtop, you can monitor the disk I/O activity to see if it's what you were expecting.

How many hosts and VMs do you have hitting your MSA? You have to recognize that it is an entry level SAN. Also, you said you've got it carved in two pieces...how many VMFS volumes have you created and how many VMs are on each? Let's walk through the simple "best practice" stuff to see where you stand...

Ken Cline
Technical Director, Virtualization
Wells Landers
TVAR Solutions, A Wells Landers Group Company
VMware vExpert 2009, VMware Communities User Moderator
Blogging at: http://KensVirtualReality.wordpress.com/
wayneoakley99
Enthusiast

Thanks for the interest Ken.

The equipment is certainly dated. The MSA1000 was not intended for this environment, and the customer is cost-justifying a c3000 with 4 full-height 4P quad-core blades and a new MSA2000 (2300??).

I have always had a concern about the MSA1000 throughput, but in this case it does not appear to be overloaded. Moving a VM onto the local disk of a server with no other VMs present is as light a server and I/O load as is possible (not that the local disk is very fast). The guest XP workstation has booted and done nothing else, so it is idle, using no more than about 1-3% CPU and doing no I/O once the boot is complete. Yet the CPU Wait time still averages around 18,000 milliseconds.

I tried esxtop, and %READY is low to none because the guest is not doing anything. On the other ESX hosts, with guests actually doing work, %READY is typically quite low and consistent with the CPU Ready time reported in the VC performance graphs.

So in general I am thinking the CPU Wait figure can be a bit of a red herring: on an idle system it is very high, and on an active system it drops as the actual CPU load goes up. I do not know what that perf counter was intended to measure, but at this time it does not appear to be reporting I/O blocking, at least in this case.
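One way to put the 16,000 - 20,000 ms readings in perspective is to express them as a share of the sampling interval. The sketch below assumes the VC real-time charts sample every 20 seconds, so one vCPU can accumulate at most 20,000 ms per sample, and that the VM-level value is summed across vCPUs. Neither assumption is stated in this thread, and the function name `summation_to_percent` is made up for illustration. If CPU Wait counts idle time as well as I/O blocking, as VMware's counter descriptions indicate, then a reading near the top of that range on an idle guest would be expected rather than alarming.

```python
# Assumption: real-time charts sample over a 20-second interval, so a
# summation counter can reach at most 20,000 ms per vCPU per sample.
SAMPLE_INTERVAL_MS = 20_000


def summation_to_percent(value_ms, num_vcpus=1, interval_ms=SAMPLE_INTERVAL_MS):
    """Express a per-sample summation counter (in ms) as a percentage
    of the total time available to the VM's vCPUs in that sample."""
    if num_vcpus < 1 or interval_ms <= 0:
        raise ValueError("num_vcpus must be >= 1 and interval_ms must be > 0")
    return 100.0 * value_ms / (interval_ms * num_vcpus)


if __name__ == "__main__":
    # Readings in the range reported for the idle 1 vCPU guest.
    for wait_ms in (16_000, 18_000, 20_000):
        pct = summation_to_percent(wait_ms)
        print(f"{wait_ms} ms of wait = {pct:.0f}% of the sample interval")
```

Under these assumptions, 18,000 ms works out to 90% of the interval spent waiting, which fits a guest that is idle most of the time. The same conversion can be applied to the CPU Ready figure from the VC graphs to compare it with %READY in esxtop.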

I intend to get support involved but would first like to determine a likely area of investigation. At this time nothing in particular jumps up and waves a red flag, yet everything performs slowly. One thing in particular is the Windows Task Manager indicating 100% CPU utilization while ESX logs what appears to be rather normal CPU utilization, certainly nowhere near the 100% Windows thinks it is using.

thanks

Ken_Cline
Champion

"...in particular the part about the Windows Task Manager indicating 100% CPU utilization, yet ESX logs what appears to be rather normal CPU utilization, certainly nowhere near the 100% Windows thinks it is using."

That's the part I would worry least about. The CPU utilization reported by Windows is skewed because the guest OS does not have full control of the system. Windows calculates CPU utilization by incrementing a counter during its idle loop and then subtracting the accumulated idle time from the wall-clock time. In the physical world, this works just fine because Windows has complete control of the system. In a VM, things are different. When Windows enters its idle loop, the hypervisor recognizes this and essentially "puts Windows to sleep" to conserve system resources for other VMs that are actually doing something. This means that Windows is no longer incrementing its "idle time counter", but it thinks it is. So when it comes time to report CPU utilization, it still subtracts the idle loop counter from the wall-clock time, and guess what? The idle counter is very low, so Windows reports very high utilization.
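The accounting described above can be illustrated with a toy model. This is only a sketch of the explanation as given in this post, not of how Windows or ESX are actually implemented; the function names, the tick units, and the `descheduled_fraction` parameter are all made up for illustration.

```python
def reported_utilization(wall_ticks, idle_ticks_counted):
    """Utilization as the guest computes it: wall-clock time minus
    the idle time it managed to count, as a percentage."""
    return 100.0 * (wall_ticks - idle_ticks_counted) / wall_ticks


def simulate(wall_ticks, busy_ticks, descheduled_fraction):
    """Toy model: the guest is really busy for busy_ticks out of wall_ticks.

    descheduled_fraction is the share of the guest's idle time during which
    the hypervisor has put it to sleep, so the idle counter does not advance.
    0.0 models a physical machine; 1.0 models a guest that is descheduled
    whenever it goes idle.
    """
    idle_ticks = wall_ticks - busy_ticks
    counted = round(idle_ticks * (1.0 - descheduled_fraction))
    return reported_utilization(wall_ticks, counted)


if __name__ == "__main__":
    # A guest that does real work for 2% of the interval.
    for fraction in (0.0, 0.5, 1.0):
        pct = simulate(1000, 20, fraction)
        print(f"idle time descheduled {fraction:.0%}: guest reports {pct:.0f}% CPU")
```

In this model the guest's real workload never changes; only the amount of idle time it fails to count does, which is enough to push the reported figure from a few percent toward 100%.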

For more detailed information, check out http://kb.vmware.com/kb/2032 or, if you really want to get a thorough understanding of how time flies within a VM, read "Timekeeping in VMware Virtual Machines".

For some good information on performance in general, check out the documents in the Performance forum.

Hope this helps...
