You are only seeing part of the picture; that figure includes all worlds linked to the VM group. If you expand one of the groups, look at the actual vCPU worlds, and then subtract %IDLE, you get the time the vCPU is actually sitting in the queue waiting for I/O to finish.
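To make that concrete, here is a minimal sketch of that subtraction (the world names and percentages are made-up sample values, not taken from the screenshot):

```python
# Sketch: %WAIT in esxtop includes %IDLE, so for a vCPU world the time
# genuinely blocked in the VMkernel (e.g. waiting for I/O) is roughly
# %WAIT - %IDLE. Sample numbers below are invented for illustration.

vcpu_worlds = [
    # (world name, %WAIT, %IDLE) as shown after expanding the VM group
    ("vmx-vcpu-0", 95.0, 70.0),
    ("vmx-vcpu-1", 99.0, 98.5),
]

for name, pct_wait, pct_idle in vcpu_worlds:
    blocked = pct_wait - pct_idle  # time truly waiting, not just idle
    print(f"{name}: %WAIT={pct_wait:.1f} %IDLE={pct_idle:.1f} "
          f"-> blocked on VMkernel ~{blocked:.1f}%")
```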
Check this out:
Do you actually encounter a performance problem here? The %RDY is quite low, which doesn't indicate much of a performance problem.
CPU would max out at 100%. When it does, %USED is at 100%+. Why would %USED be over 100%?
Yes, the server is running slow during that time. While the CPU is not pegged at 100%, it does fluctuate between 80% and 100%.
Should I try adding 2 more vCPUs to it?
In my opinion the vSphere host's PCPUs are not heavily utilized; they average around 20%. You can try adding more vCPUs, but be aware that it may or may not make the situation better, as CPU co-scheduling introduces its own lag.
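Before adding vCPUs it may be worth checking %CSTP (co-stop) for the VM first. A quick sketch of that sanity check; note the ~3% threshold is a common rule of thumb I am assuming here, not an official limit:

```python
# Sketch: sanity check before adding vCPUs. %CSTP is the co-stop time
# esxtop reports for the VM group; a sustained high value means the
# scheduler is already delaying vCPUs to keep them in sync.
# The 3% threshold is an assumed rule of thumb, not an official limit.

def more_vcpus_advisable(pct_cstp: float, avg_pcpu_util: float) -> bool:
    """True if adding vCPUs looks safe: the host has headroom
    (PCPUs ~20% in this thread) and co-scheduling is not already a problem."""
    return pct_cstp < 3.0 and avg_pcpu_util < 80.0

print(more_vcpus_advisable(pct_cstp=0.5, avg_pcpu_util=20.0))  # True
print(more_vcpus_advisable(pct_cstp=6.2, avg_pcpu_util=20.0))  # False
```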
Duncan will have a better suggestion for you.
If a VM does heavy I/O, %USED can be greater than 100%, because system time spent on the VM's behalf is charged to it on top of its run time.
Take a look at Interpreting esxtop 4.1 Statistics.
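As a rough illustration of the accounting (my reading of the esxtop doc above; treat the formula as approximate), %USED is close to %RUN + %SYS - %OVRLP, so heavy system time for I/O processing can push it past 100% of one core:

```python
# Approximate accounting from "Interpreting esxtop Statistics":
# %USED ~= %RUN + %SYS - %OVRLP. With heavy I/O, %SYS (system services
# working on the VM's behalf) grows, so %USED can exceed 100%.
pct_run = 85.0    # vCPU actually executing
pct_sys = 25.0    # VMkernel work (e.g. I/O processing) charged to the VM
pct_ovrlp = 3.0   # time stolen from %RUN to service other worlds

pct_used = pct_run + pct_sys - pct_ovrlp
print(f"%USED ~ {pct_used:.1f}%")  # ~107% even though %RUN < 100%
```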
Michael.
As mentioned by vastro:
http://kb.vmware.com/kb/1017926
Looking at your screenshot, the %WAIT for all VMs is too high.
Wait (%WAIT):
This value represents the percentage of time the virtual machine was waiting for some VMkernel activity to complete (such as I/O) before it can continue.
If the virtual machine is unresponsive and the %WAIT value is proportionally higher than %RUN, %RDY, and %CSTP, it could indicate that the world is waiting for a VMkernel operation to complete. You may also observe that %SYS is proportionally higher than %RUN; %SYS represents the percentage of time spent by system services on behalf of the virtual machine.
A high %WAIT value can be a result of a poorly performing storage device where the virtual machine is residing. If you are experiencing storage latency and timeouts, it may trigger these types of symptoms across multiple virtual machines residing on the same LUN, volume, or array, depending on the scale of the storage performance issue.
A high %WAIT value can also be triggered by latency to any device in the virtual machine configuration. This can include, but is not limited to, serial pass-through, parallel pass-through, and USB devices. If the device suddenly stops functioning or responding, it could result in these symptoms. A common cause of a high %WAIT value is ISO files that were accidentally left mounted in the virtual machine and have since been deleted or moved to an alternate location. For more information, see Deleting a datastore from the Datastore inventory results in the error: device or resource busy (101....
If there does not appear to be any backing storage or networking infrastructure issue, it may be pertinent to crash the virtual machine to collect additional diagnostic information.
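If it helps, the KB's diagnosis rule can be written down as a small check (a sketch only; the dict layout and the "proportionally higher" factor of 2x are my assumptions, not from the KB):

```python
# Sketch of the KB's rule: an unresponsive VM whose %WAIT dwarfs
# %RUN, %RDY, and %CSTP is probably blocked on a VMkernel operation
# (storage, pass-through device, stale ISO mount, ...).
# The 2x "proportionally higher" factor is an assumption.

def likely_vmkernel_wait(stats: dict, factor: float = 2.0) -> bool:
    others = stats["%RUN"] + stats["%RDY"] + stats["%CSTP"]
    return stats["%WAIT"] > factor * others

vm = {"%WAIT": 92.0, "%RUN": 4.0, "%RDY": 1.5, "%CSTP": 0.5}
print(likely_vmkernel_wait(vm))  # True -> look at storage and devices
```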
Also check the storage performance...
From vCenter you can get latency reports, IOPS reports, write rate, etc. Refer to the section below, which I took from the vSphere Datacenter Administration Guide for vSphere 4.1, page 120.
Disk I/O Performance
Use the vSphere Client disk performance charts to monitor disk I/O usage for clusters, hosts, and virtual machines. Use the guidelines below to identify and correct problems with disk I/O performance.
The virtual machine disk usage (%) and I/O data counters provide information about average disk usage on a virtual machine. Use these counters to monitor trends in disk usage.
The best way to determine whether your vSphere environment is experiencing disk problems is to monitor the disk latency data counters. Use the Advanced performance charts to view these statistics.
- The kernelLatency data counter measures the average amount of time, in milliseconds, that the VMkernel spends processing each SCSI command. For best performance, the value should be 0-1 milliseconds. If the value is greater than 4 ms, the virtual machines on the ESX/ESXi host are trying to send more throughput to the storage system than the configuration supports. Check the CPU usage, and increase the queue depth.
- The deviceLatency data counter measures the average amount of time, in milliseconds, to complete a SCSI command from the physical device. Depending on your hardware, a number greater than 15 ms indicates there are probably problems with the storage array. Move the active VMDK to a volume with more spindles or add disks to the LUN.
- The queueLatency data counter measures the average amount of time taken per SCSI command in the VMkernel queue. This value must always be zero. If not, the workload is too high and the array cannot process the data fast enough.
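Those three thresholds are easy to turn into a quick check against the Advanced chart values (a sketch; the function and its parameters are mine, the millisecond thresholds are the ones quoted from the guide above):

```python
# Sketch: apply the guide's thresholds to latency counters (ms) read
# from the vSphere Client Advanced performance charts.

def diagnose_disk_latency(kernel_ms: float, device_ms: float,
                          queue_ms: float) -> list[str]:
    findings = []
    if kernel_ms > 4:
        findings.append("kernelLatency > 4 ms: host pushing more I/O than "
                        "the config supports; check CPU, raise queue depth")
    if device_ms > 15:
        findings.append("deviceLatency > 15 ms: likely array problem; move "
                        "the VMDK to more spindles or add disks to the LUN")
    if queue_ms > 0:
        findings.append("queueLatency > 0 ms: workload too high for the array")
    return findings or ["latencies look within the guide's limits"]

print(diagnose_disk_latency(kernel_ms=6.2, device_ms=18.0, queue_ms=1.1))
```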
A high %WAIT value can be a result of a poorly performing storage device where the virtual machine is residing. If you are experiencing storage latency and timeouts, it may trigger these types of symptoms across multiple virtual machines residing on the same LUN, volume, or array, depending on the scale of the storage performance issue.
I do not understand why it says a high %WAIT can be a result of poor storage performance. I am running esxtop on different hosts connected to EMC and now NetApp storage with almost nothing on the aggregate, and the %WAIT times on the EMC and the NetApp are the same.