All,
I am troubleshooting an environment where we are seeing very high CPU Ready and CPU Wait values. I am looking for advice on which metrics to look at and suggestions on what the underlying issue may be. Also, if my assumptions so far seem incorrect, please let me know.
The environment is as follows:
ESXi 3.5 update 5
8 Hosts in the cluster
All Hosts are HP ProLiant BL495c G5 Servers in a c7000 blade enclosure.
Each host has 2 quad core processors and 64 gigs of RAM.
Virtual Connect for Network and SAN connectivity, with storage hosted on an EMC SAN.
Scratch Config location is currently set to SAN storage. This will be moved to local SSD on the ESX Hosts. Could this be an issue? We do see that the Scratch Config location is heavily using the SAN.
This cluster is used for the VDI environment. There are currently 224 VDIs in the cluster and they are distributed across the Hosts. Each VDI is configured in exactly the same way:
1 CPU
1 gig of RAM
Windows XP
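For reference, the per-host load implied by these figures can be worked out as in the rough sketch below. It assumes the 224 desktops are spread evenly over the 8 hosts, which the post suggests but does not state exactly; the function name is just for illustration.

```python
# Figures taken from the environment description above.
hosts = 8
cores_per_host = 2 * 4   # two quad-core processors per host
host_ram_gb = 64
desktops = 224           # each desktop has 1 vCPU and 1 GB of RAM


def per_host_load(desktops, hosts, cores_per_host, host_ram_gb,
                  vcpus_per_vm=1, ram_gb_per_vm=1):
    """Return (VMs per host, vCPUs per core, fraction of host RAM configured).

    Assumes the desktops are spread evenly across the hosts.
    """
    vms_per_host = desktops / hosts
    vcpus_per_core = vms_per_host * vcpus_per_vm / cores_per_host
    ram_fraction = vms_per_host * ram_gb_per_vm / host_ram_gb
    return vms_per_host, vcpus_per_core, ram_fraction


vms, ratio, ram = per_host_load(desktops, hosts, cores_per_host, host_ram_gb)
print(f"{vms:.0f} VMs per host, {ratio:.1f} vCPUs per core, "
      f"{ram:.0%} of host RAM configured")
```

With an even spread this comes to 28 desktops per host, 3.5 vCPUs per physical core, and a little under half of each host's RAM configured.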
The issue that we are seeing is that performance is very poor. On investigation it can be seen that CPU Ready and Wait times are very high.
I have attached a spreadsheet of the CPU Ready and Wait times.
Physical CPU utilization on the hosts does not go above roughly 80%.
So, from these readings it is clear that the VM Guests are suffering from poor performance because of the high CPU Ready values. This suggests to me that there is a lack of CPU resources available to the cluster.
However, I also believe that the high CPU Wait values will be contributing to the high CPU Ready values, as the VM Guest will have locked the CPU cycles during the CPU Wait duration, preventing those resources from being available to the rest of the cluster.
I don't believe that we have a memory issue on our VM Guests. We have 64 gigs of RAM on each Host and each VDI has 1 gig of RAM. I don't see any ballooning on the VDIs. This therefore leaves Disk I/O and network I/O.
I am now looking at Disk I/O. What should I be looking at for Disk I/O? What values would show me that there is an issue?
Can you also advise what an acceptable value should be for CPU Wait? It is easy to find general rules about CPU Ready, but not so much for CPU Wait.
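On the CPU Ready side, the general rules are usually quoted as a percentage, whereas the vCenter performance charts report a summation in milliseconds. A minimal sketch of the conversion follows, assuming the real-time chart with its 20-second sample interval. The sample value is made up and is not taken from the attached spreadsheet.

```python
def ready_percent(ready_ms, interval_s=20):
    """Convert a CPU Ready summation (ms) to a percentage of the sample interval.

    interval_s=20 assumes the real-time performance chart; other chart
    intervals need their own interval length in seconds.
    """
    return ready_ms * 100 / (interval_s * 1000)


# Hypothetical sample: 2000 ms of ready time within one 20 s sample.
print(ready_percent(2000))
```

The same function works for historical charts as long as interval_s matches the interval of the chart the value came from.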
Many thanks,
Ben
Hi,
I have also found that the CPU Wait value decreases dramatically when CPU-intensive activities occur on the Guest. For instance, when Word, Excel, PowerPoint etc. are all opened at the same time, CPU Ready increases, which is understandable, but CPU Wait decreases dramatically. So, in an idle state, the VM Guest has a very high CPU Wait figure, but when the CPU is required to perform intensive activities the Wait value decreases. I don't really understand why this is happening. CPU Wait occurs when the CPU is waiting for memory, disk I/O or network I/O. If there were bottlenecks on these resources, I would have expected CPU Wait to increase when the CPU is asked to perform a task, not to decrease.
Any help would be most appreciated.
Thanks,
Ben
VMware Tools are installed?
AWo
Hi,
Yes. VMware Tools are installed on all Guests.
Have you gone through this: http://communities.vmware.com/docs/DOC-9279
What does %IDLE show when the wait time increases? %WAIT includes %IDLE.
Is Hyperthreading available/enabled?
AWo
Hi, thanks for your reply.
I've been reading through that document.
When Wait times increase so do Idle times. Here is an example on a particular VM:
Wait Idle
440 46
446 52
450 59
455 61
458 64
463 67
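Since %WAIT includes %IDLE, subtracting the Idle column from the Wait column gives the part of Wait that is not idle time. A quick sketch of that subtraction, assuming both columns are in the same unit and come from the same samples:

```python
# (Wait, Idle) pairs copied from the table above.
samples = [(440, 46), (446, 52), (450, 59), (455, 61), (458, 64), (463, 67)]


def non_idle_wait(wait, idle):
    """Wait with the idle share removed; assumes both values use the same unit."""
    return wait - idle


remainders = [non_idle_wait(wait, idle) for wait, idle in samples]

for (wait, idle), rest in zip(samples, remainders):
    print(f"wait={wait} idle={idle} wait-idle={rest}")
```

For these six samples the remainder stays between 391 and 396 while Wait itself rises from 440 to 463.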
Hyperthreading is not enabled on these servers.
Would you agree that Wait times are extremely high here?
Wait times, when evaluated alone, are meaningless. Just like any other raw performance metric.
When nothing is happening on a system, your wait times are going to be pegged at their highest and that is basically a "good" thing.
High value WAIT times are a problem when disk usage or network usage are 'high'. WAIT times are a measure of IO bottlenecks. The CPU is WAITing on another resource.
So in other words, as long as WAIT times decrease proportionally when activity goes up, all is well.
Disk latency and network latency (or throughput) need to be measured additionally to determine what IO resource could be causing a high WAIT value.
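As a rough sketch of what checking disk latency could look like, the snippet below takes the per-command latencies that esxtop reports on its disk screens (DAVG for the device side, KAVG for time spent in the VMkernel) and flags values above a limit. The limits and the sample values are assumptions for illustration only, commonly quoted rules of thumb rather than anything measured in this thread, so adjust them for your own storage.

```python
# Rule-of-thumb limits in milliseconds. These are assumptions for
# illustration, not values from this thread; tune them for your storage.
DAVG_LIMIT_MS = 25.0  # latency attributed to the device side (HBA, fabric, array)
KAVG_LIMIT_MS = 2.0   # latency added inside the VMkernel (mostly queuing)


def check_disk_latency(davg_ms, kavg_ms):
    """Return a list of warnings for one disk latency sample.

    The guest-observed latency (GAVG) is approximately DAVG + KAVG.
    """
    warnings = []
    if davg_ms > DAVG_LIMIT_MS:
        warnings.append(f"DAVG {davg_ms:.1f} ms is above {DAVG_LIMIT_MS} ms")
    if kavg_ms > KAVG_LIMIT_MS:
        warnings.append(f"KAVG {kavg_ms:.1f} ms is above {KAVG_LIMIT_MS} ms")
    return warnings


# Hypothetical sample values, not measurements from this environment.
for davg, kavg in [(5.0, 0.1), (40.0, 3.0)]:
    gavg = davg + kavg
    print(f"GAVG ~{gavg:.1f} ms:", check_disk_latency(davg, kavg) or "ok")
```

A sample that trips the DAVG limit points towards the SAN side, while one that trips only the KAVG limit points towards queuing on the host.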
CPU Ready Time values are entirely a factor of CPU resource scheduling. Trust me, get *off* ESX 3.5. The CPU scheduler performance improvements even in 4.0 (and 4.1 especially) are enormous, and if v5 has improved even more (since it supports 32 vCPUs) I can't imagine the boost it gives.
A lot of HP blades ship with power management enabled which can have an impact on performance.
If you haven't checked it, make sure that power management isn't set to Dynamic in the BIOS; it should be set to maximum performance.
I'm sure you've already done this but it can be a really quick fix if you've missed it.