Things other than CPU contention are included in RDY time. Poor storage performance can cause this.
If you aren't on at least 5.0 Update 1, there was a bug with AMD processors (I'm assuming you have AMD procs since you said no HT) that did not properly balance VMs and could introduce ready time even though you have more pCPUs than vCPUs. Also, keep in mind that in esxtop, unless you expand the VM, the 40% RDY value you are looking at is cumulative across its 4 vCPUs, so it's really probably about 10% per vCPU.
If you aren't on at least 5.0 Update 1, there was a bug with AMD processors (I'm assuming you have AMD procs since you said no HT) that did not properly balance VMs
I have AMD, I have HT. Windows will see logical processors as "hyper-threaded" regardless of manufacturer, so I am curious why you would assume AMD? AMD has the same capability as Intel. Not a valid assumption.
Maybe they simply "disabled" HT in the BIOS.
As mentioned by Matt, a high ready-time value is not only due to CPU contention; it will also appear if there is high disk latency.

So check the STORAGE section in esxtop, and refer to http://communities.vmware.com/docs/DOC-9279 for more details.

Disk latency can cause all kinds of problems in vSphere.
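If you want to check this over time rather than eyeballing the live view, you can capture a batch run (`esxtop -b -d 10 -n 60 > capture.csv`) and scan it for high latency. A minimal sketch, assuming the perfmon-style CSV that esxtop batch mode emits and that the latency counters are named "Average Driver MilliSec/Command" (DAVG) and "Average Kernel MilliSec/Command" (KAVG); exact counter names may vary by build:

```python
import csv

# Flag esxtop batch-mode samples where device or kernel disk latency is high.
# Assumes perfmon-style CSV from `esxtop -b`; counter names may vary by build.
DAVG = "Average Driver MilliSec/Command"   # device latency (DAVG)
KAVG = "Average Kernel MilliSec/Command"   # VMkernel latency (KAVG)
THRESHOLD_MS = 20.0                        # rule of thumb; tune for your SAN

with open("capture.csv", newline="") as f:
    reader = csv.reader(f)
    header = next(reader)
    # Columns look like \\host\Physical Disk(vmhba1)\Average Driver MilliSec/Command
    cols = [i for i, name in enumerate(header) if DAVG in name or KAVG in name]
    for row in reader:                     # first column is the sample timestamp
        for i in cols:
            try:
                value = float(row[i])
            except (ValueError, IndexError):
                continue
            if value > THRESHOLD_MS:
                print(f"{row[0]}  {header[i]} = {value:.1f} ms")
```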
The percentage of time the world spent in wait state.
This %WAIT is the total wait time, i.e., the world is waiting for some VMkernel resource. This wait time includes I/O wait time, idle time, and waits on other resources. Idle time is reported separately as %IDLE.
Q: How do I know the VCPU world is waiting for I/O events?

A: %WAIT - %IDLE can give you an estimate of how much CPU time is spent waiting for I/O events. This is an estimate only, because the world may be waiting for resources other than I/O. Note that we should only do this for VMM worlds, not other kinds of worlds, because VMM worlds best represent guest behavior. For disk I/O, another alternative is to read the disk latency stats, which we will explain in the disk section.

Q: How do I know whether the VM group is waiting for I/O events?

A: For a VM, there are other worlds besides the VCPUs, such as an mks world and a VMX world. Most of the time, those other worlds are waiting for events, so you will see ~100% %WAIT for them. If you want to know whether the guest is waiting for I/O events, you should expand the group and analyze the VCPU worlds as described above.

Since %IDLE makes no sense for worlds other than VCPUs, we can use the group stats to estimate the guest I/O wait as %WAIT - %IDLE - 100% * (NWLD - NVCPU), where NWLD is the number of worlds in the group and NVCPU is the number of VCPUs. This is a very rough estimate, for two reasons: (1) the world may be waiting for resources other than I/O, and (2) we assume the other assisting worlds are not active, which may not be true.
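Putting those two formulas into code makes the arithmetic explicit. A minimal sketch; the example numbers are made up for illustration:

```python
def vcpu_io_wait_estimate(pct_wait: float, pct_idle: float) -> float:
    """Rough guest I/O wait for a single VMM/VCPU world: %WAIT - %IDLE."""
    return pct_wait - pct_idle

def group_io_wait_estimate(pct_wait: float, pct_idle: float,
                           nwld: int, nvcpu: int) -> float:
    """Rough guest I/O wait for a whole VM group:
    %WAIT - %IDLE - 100% * (NWLD - NVCPU).
    Subtracts the ~100% %WAIT contributed by each non-VCPU world
    (mks, VMX, ...), which are assumed to be mostly inactive."""
    return pct_wait - pct_idle - 100.0 * (nwld - nvcpu)

# Hypothetical VM group: 4 VCPUs plus 3 helper worlds (7 worlds total),
# group %WAIT = 520, group %IDLE = 180.
print(group_io_wait_estimate(520.0, 180.0, nwld=7, nvcpu=4))  # -> 40.0
```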
Well, I can say we don't have the best SAN in the world, but the good news is we are currently migrating to one that performs much better. I have never read anything that said CPU RDY was anything but a VM waiting to be scheduled on a CPU, though.

I am on 5.0 U1 on AMD processors.

My guess, based on what you all have said, is that it has to do with disk latency. We do experience high disk latency at certain times throughout the day. I learned something new today about RDY time, so thanks for that.
Any ideas as to why I am seeing such a high ready time even though I am not oversubscribing CPUs?
Could you take screenshots of the CPU and Memory views in ESXTOP while you have the problem and post them here?
You can actually get a little more granular and see what makes up the scheduling delay.

If you look at the %CSTP value, that's the clearest picture of co-scheduling penalties; it is reported alongside %RDY and %WAIT (together with %RUN, those states account for a world's time).
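As a quick sanity check on any one world's numbers (a minimal sketch with made-up values; for a single world the four scheduler states should sum to roughly 100%):

```python
def scheduling_breakdown(pct_run: float, pct_rdy: float,
                         pct_cstp: float, pct_wait: float) -> None:
    """Print where a single world's time went across the four
    scheduler states, and flag which penalty dominates."""
    total = pct_run + pct_rdy + pct_cstp + pct_wait
    print(f"run={pct_run}%  ready={pct_rdy}%  co-stop={pct_cstp}%  "
          f"wait={pct_wait}%  (total {total:.0f}%)")
    if pct_cstp > pct_rdy:
        print("co-stop dominates: likely co-scheduling of too many vCPUs")
    elif pct_rdy > 5.0:
        print("ready dominates: likely CPU contention or scheduling limits")

# Made-up values for one vCPU world:
scheduling_breakdown(pct_run=60.0, pct_rdy=8.0, pct_cstp=2.0, pct_wait=30.0)
```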
Actually, my bad: it's the September 2012 patch "ESXi500-201209001" that fixed that issue.
Are these EPCCTX0X systems 8-vCPU systems? They look like it. Above you said they are 4 vCPU.

Either way, these RDY times aren't bad. You have to remember that %RDY is the aggregate across all of the VM's vCPUs. So if you see 20% RDY on a 4-vCPU system, each vCPU is only at 5%, which is a reasonably healthy number. If these are 8-vCPU, you are at 2.5%, which is perfectly healthy.
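A trivial sketch of that arithmetic:

```python
def per_vcpu_ready(group_ready_pct: float, num_vcpus: int) -> float:
    """esxtop shows %RDY summed across a VM's vCPUs unless you expand
    the group, so divide by the vCPU count for the per-vCPU figure."""
    return group_ready_pct / num_vcpus

print(per_vcpu_ready(20.0, 4))  # 5.0% per vCPU -- reasonably healthy
print(per_vcpu_ready(20.0, 8))  # 2.5% per vCPU -- healthy
```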
No, there are 8 VMs, each with 4 vCPUs. The physical server has 32 cores and no HT. I understand that 5% per vCPU isn't terrible, but I would expect it not to be that high, since I am not oversubscribing CPUs at all. There are 32 vCPUs total and 32 physical cores total. Again, I realize the hypervisor itself is using processor 0, but not heavily, so I still wouldn't expect to see RDY values this high.

I'm starting to wonder if lakey81 is right. I do in fact have AMD processors, and I probably don't have the patch that he mentioned. I'll try to get the hosts patched next week and see if the problem goes away.
One way you can check in esxtop is to switch to the memory view and enable the NUMA stats. I believe the field is NHN, which is the NUMA home node, and it will tell you which NUMA node/socket the VM is running on. Normally VMs should be spread fairly evenly, based on load, over all your physical processors, but with this bug the scheduler would not move VMs around and would favor one or two nodes. In my case, with 4-CPU blades, it would load everything onto nodes 0 and 1 and rarely use nodes 2 and 3, which caused major issues with ready time.
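If you paste that esxtop memory view (with the NUMA fields enabled) into a text file, a quick tally shows whether VMs are piling onto one or two home nodes. A minimal sketch, assuming a whitespace-aligned dump with an NHN column header (VM names containing spaces would throw the columns off):

```python
from collections import Counter

def nhn_distribution(esxtop_dump: str) -> Counter:
    """Count VMs per NUMA home node from a pasted esxtop memory view.
    Assumes a header line containing an NHN column and one row per VM;
    NHN can be a comma-separated list for wide VMs."""
    lines = esxtop_dump.strip().splitlines()
    header = next(l for l in lines if "NHN" in l).split()
    col = header.index("NHN")
    counts = Counter()
    for line in lines:
        fields = line.split()
        if "NHN" in line or len(fields) <= col:
            continue  # skip the header and short rows
        for node in fields[col].split(","):
            if node.isdigit():
                counts[int(node)] += 1
    return counts

# e.g. Counter({0: 5, 1: 3}) would show nodes 2 and 3 going unused
```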
So I applied the patch that lakey81 mentioned, but it didn't fix the problem. After I applied the patch, I had 8 VMs running on a host and RDY was fine. I vMotioned 4 VMs off of the host and then back onto it, and RDY time is now terrible for 3 of the VMs. I've attached the RDY and NUMA stats.
Any other ideas? Maybe I should apply all available ESXi patches?
Attachment: esxtop.txt (2.5 KB)