VMware Cloud Community
tdubb123
Expert

high %WAIT times

Any idea why there would be such a high %WAIT time on all the VMs here?

9 Replies
depping
Leadership

You are only seeing part of the picture. The VM-level %WAIT includes all worlds linked to the VM. If you expand one of the VMs, look at the actual vCPU worlds, and subtract %IDLE from %WAIT, you get the time the vCPU is actually sitting in the queue waiting for I/O to finish.
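As a rough sketch of that subtraction, assuming you have copied the %WAIT and %IDLE values of each vCPU world from the expanded esxtop view by hand (the world names and numbers below are made up for illustration, not taken from the screenshot):

```python
def blocked_wait(wait_pct, idle_pct):
    """Return %WAIT minus %IDLE for one vCPU world, floored at zero."""
    return max(wait_pct - idle_pct, 0.0)


# Hypothetical samples: (world name, %WAIT, %IDLE)
samples = [
    ("vcpu-0 (example VM)", 62.0, 60.5),
    ("vcpu-1 (example VM)", 95.0, 70.0),
]

for name, wait, idle in samples:
    # What is left after removing idle time is time spent blocked,
    # for example on I/O.
    print(f"{name}: blocked {blocked_wait(wait, idle):.1f}%")
```

A vCPU that is mostly idle shows a high %WAIT but a small remainder, which is why the VM-level number alone looks alarming.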

vastro
VMware Employee

tdubb123
Expert

Is there a bottleneck here? The 2 vCPUs are frequently maxed out.

kooltechies
Expert

Do you actually encounter a performance problem here? The %RDY is quite low, which doesn't indicate much of a performance problem.

Blog : http://thinkingloudoncloud.com || Twitter : @kooltechies || P.S : If you think that the answer is correct/helpful please consider rewarding points.
tdubb123
Expert

The CPU would max out at 100%. When it does, %USED is at 100%+. Why would %USED be over 100%?

Yes, the server is running slow during that time. While the CPU is not pegged at 100%, it does fluctuate between 80-100%.

Should I try adding 2 more CPUs to it?

kooltechies
Expert

In my opinion the vSphere host's PCPUs are not heavily utilized; they average around 20%. You can try adding more CPUs, but be aware that it may or may not make the situation better, as CPU co-scheduling has its own lag.

Duncan will have a better suggestion for you.

mal_michael
Commander

If a VM does heavy I/O, %USED can be greater than 100%.

Take a look at Interpreting esxtop 4.1 Statistics.

Michael.
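To illustrate why %USED can exceed 100%: as far as I recall, the esxtop statistics guide describes %USED for a world roughly as %RUN + %SYS - %OVRLP, where %SYS is VMkernel time charged to the world (I/O handling, for example) and may be spent on a different physical CPU. The sketch below just applies that arithmetic; the function name and the sample numbers are hypothetical.

```python
def used_pct(run_pct, sys_pct, ovrlp_pct):
    """Approximate %USED for one world as %RUN + %SYS - %OVRLP."""
    return run_pct + sys_pct - ovrlp_pct


# Hypothetical vCPU world: nearly always running, plus VMkernel time
# charged to it for I/O it triggered.
value = used_pct(run_pct=98.0, sys_pct=9.0, ovrlp_pct=1.5)
print(f"%USED is about {value:.1f}")
```

With these made-up inputs the result lands above 100 even though %RUN alone stays below it. Note also that the VM-level row adds up all of its worlds, so a 2-vCPU VM can show well over 100% there.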

Gkeerthy
Expert

As mentioned by vastro:

http://kb.vmware.com/kb/1017926

Judging by your screenshot, the %WAIT for all VMs is too high.


Wait, %WAIT:

  • This value represents the percentage of time the virtual machine was waiting for some VMkernel activity to complete (such as I/O) before it can continue.

  • If the virtual machine is unresponsive and the %WAIT value is proportionally higher than %RUN, %RDY, and %CSTP, then it could indicate that the world is waiting for a VMkernel operation to complete.

  • You may observe that the %SYS is proportionally higher than %RUN. %SYS represents the percentage of time spent by system services on behalf of the virtual machine.

  • A high %WAIT value can be a result of a poorly performing storage device where the virtual machine is residing. If you are experiencing storage latency and timeouts, it may trigger these types of symptoms across multiple virtual machines residing in the same LUN, volume, or array depending on the scale of the storage performance issue.

  • A high %WAIT value can also be triggered by latency to any device in the virtual machine configuration. This can include but is not limited to serial pass-through devices, parallel pass-through devices, and USB devices. If the device suddenly stops functioning or responding, it could result in these symptoms. A common cause for a high %WAIT value is ISO files that have been left mounted in the virtual machine accidentally that have been deleted or moved to an alternate location. For more information, see Deleting a datastore from the Datastore inventory results in the error: device or resource busy (101....

  • If there does not appear to be any backing storage or networking infrastructure issue, it may be pertinent to crash the virtual machine to collect additional diagnostic information.

Also check the storage performance.

From vCenter you can get the latency reports, IOPS reports, write rate, etc. Refer to the section below, which I took from the vSphere Datacenter Administration Guide for vSphere 4.1, page 120:

Disk I/O Performance

Use the vSphere Client disk performance charts to monitor disk I/O usage for clusters, hosts, and virtual machines. Use the guidelines below to identify and correct problems with disk I/O performance.

The virtual machine disk usage (%) and I/O data counters provide information about average disk usage on a virtual machine. Use these counters to monitor trends in disk usage.

The best way to determine if your vSphere environment is experiencing disk problems is to monitor the disk latency data counters. You use the Advanced performance charts to view these statistics.

  • The kernelLatency data counter measures the average amount of time, in milliseconds, that the VMkernel spends processing each SCSI command. For best performance, the value should be 0-1 milliseconds. If the value is greater than 4ms, the virtual machines on the ESX/ESXi host are trying to send more throughput to the storage system than the configuration supports. Check the CPU usage, and increase the queue depth.

  • The deviceLatency data counter measures the average amount of time, in milliseconds, to complete a SCSI command from the physical device. Depending on your hardware, a number greater than 15ms indicates there are probably problems with the storage array. Move the active VMDK to a volume with more spindles or add disks to the LUN.

  • The queueLatency data counter measures the average amount of time taken per SCSI command in the VMkernel queue. This value must always be zero. If not, the workload is too high and the array cannot process the data fast enough.
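The three guidelines quoted above can be turned into a quick check. This is only a sketch: it assumes you read the average kernelLatency, deviceLatency, and queueLatency values (in milliseconds) off the Advanced performance charts yourself, and the function name and sample values are made up.

```python
def check_latency(kernel_ms, device_ms, queue_ms):
    """Compare latency counters (ms) against the thresholds quoted above."""
    findings = []
    if kernel_ms > 4:
        findings.append("kernelLatency > 4 ms: check CPU usage, increase queue depth")
    if device_ms > 15:
        findings.append("deviceLatency > 15 ms: probable storage array problem")
    if queue_ms > 0:
        findings.append("queueLatency > 0 ms: workload too high for the array")
    return findings


# Hypothetical readings for one host
for finding in check_latency(kernel_ms=0.5, device_ms=22.0, queue_ms=0.0):
    print(finding)
```

If all three counters stay within these limits while %WAIT is high, storage is unlikely to be the cause and the idle time included in %WAIT is the more likely explanation.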

Please don't forget to award point for 'Correct' or 'Helpful', if you found the comment useful. (vExpert, VCP-Cloud. VCAP5-DCD, VCP4, VCP5, MCSE, MCITP)
tdubb123
Expert

"A high %WAIT value can be a result of a poorly performing storage device where the virtual machine is residing. If you are experiencing storage latency and timeouts, it may trigger these types of symptoms across multiple virtual machines residing in the same LUN, volume, or array depending on the scale of the storage performance issue."

I do not understand why it says a high %WAIT time can be a result of poor storage performance. I am running esxtop on different hosts connected to EMC and now NetApp storage, with almost nothing on the aggregate. The wait times on the EMC and the NetApp are the same.
