Recently I have been going through our environment and resolving many SMP VMs that have been showing high ready times, improving performance quite dramatically. A couple of coworkers recently attended the VMware administration training course. When they reached the resource management section, they learned some of the basic concepts involved: utilization, reservations, and so on. Afterward they came back and asked me how, in our environment, we can have hosts with relatively LOW CPU utilization but VMs that show high ready times. For the life of me I can't think of how to explain to them that what their instructor told them in class isn't necessarily wrong, just not the whole truth.
How would you go about explaining to a new ESX administrator that VMs on a host can have high ready times while overall host CPU utilization is low? What I mean is: if you connect to a host using the VI Client, the summary tab shows a blue CPU usage bar. That bar may read 40% utilized, yet VMs on that host are triggering warning or alert notifications on the ready-time alarm we set up with the recommended thresholds of 1200 ms or 2000 ms (based on the polling intervals).
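For what it's worth, here is how I ended up explaining the alarm thresholds themselves. The vCenter "CPU Ready" counter is a summation in milliseconds over the sampling interval (20 seconds for real-time stats), so converting it to a percentage is just ready time divided by wall-clock time. A quick Python sketch; the 20-second interval is the real-time default, and dividing by vCPU count for a per-vCPU figure is my reading of the counter, so treat both as assumptions to verify against your stats level:

```python
def ready_pct(ready_summation_ms, interval_s=20, vcpus=1):
    """Convert a vCenter 'CPU Ready' summation value (ms) to a percentage.

    The summation counter accumulates ready time over one sampling
    interval (20 s for real-time stats), so the percentage is simply
    ready time divided by wall-clock time. For multi-vCPU VMs the
    counter sums across vCPUs, so divide by the vCPU count to get a
    per-vCPU figure.
    """
    return ready_summation_ms / (interval_s * 1000 * vcpus) * 100

print(ready_pct(1200))  # 6.0  -> the 1200 ms warning threshold is ~6% ready
print(ready_pct(2000))  # 10.0 -> the 2000 ms alert threshold is ~10% ready
```

So the 1200 ms / 2000 ms recommendations line up with the rule of thumb that ready time above roughly 5-10% per vCPU is worth investigating.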
I very much dislike how VMware tells people to think of processors as stacks of resources (in MHz) when there is a physical limit on the number of VMs a host's scheduler can schedule at any one time.
Any ideas on how to explain it?
I am in the same boat as you. Our consolidation ratio in a virtualized desktop environment (View with XP desktops) is about 50-60 VMs per host, and host CPU utilization is not even 20% on average, yet some VMs have ready values peaking up to 15% and averaging around 10%. I have gone through numerous articles and have worked with VMware support on this issue as well.
Their typical answer is: "CPU utilization has nothing to do with the ready value. It is the number of vCPUs the scheduler has to work with. You are overprovisioned and need to add a host to alleviate this issue." But what they are not telling us is to forget those 4U monster servers with 1 TB of RAM: if the number of vCPUs to schedule crosses a certain threshold, the scheduler just chokes. That is certainly the case for the HP DL585s we are working with, and they are not even maxed out on memory from a server-configuration standpoint.
I am hoping someone with more experience with the CPU scheduler can chime in on this.
Sounds like you are experiencing exactly the same issue. From my understanding of how processors work, you can't simply give one process 200 MHz and another process 800 MHz of a single 2000 MHz processor. When a process is scheduled on the processor, the processor works on it until the process completes or is descheduled. Given that limitation alone, I don't see how the scheduler in ESX/ESXi (or any virtualization software package) could ever get around having a limit on how many processes can be runnable at once, even when they don't need all the CPU power available. And such a limit would also cap the number of VMs that can be created per host. Processors complete tasks faster with each generation, but you can still oversaturate them with the sheer number of processes requesting processor time. I too would like a better explanation and understanding of how the scheduler works. We had VMware come in and do a "health check," and their recommendation was to add additional processor cores to the environment.
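To make the low-utilization/high-ready combination concrete, here is a toy Python model (emphatically not the real ESX scheduler, just queueing arithmetic): 16 single-vCPU VMs on 4 pCPUs, all waking at the same instant for a short burst of work. All the numbers (16 VMs, 4 pCPUs, 1 busy tick out of every 10) are made up for illustration:

```python
# Toy model: 16 single-vCPU VMs on 4 pCPUs. Every VM wakes at the same
# tick, needs 1 tick of CPU, then idles for the rest of a 10-tick period.
# Synchronized, bursty demand keeps average host utilization low while
# VMs still queue behind each other and accumulate ready time.

PCPUS = 4
VMS = 16
PERIOD = 10    # ticks between synchronized wake-ups
PERIODS = 100

busy_ticks = 0   # pCPU-ticks actually doing work
ready_ticks = 0  # VM-ticks spent runnable but waiting for a pCPU

for _ in range(PERIODS):
    runnable = list(range(VMS))       # everyone wakes at once
    for tick in range(PERIOD):
        scheduled = runnable[:PCPUS]  # first 4 get a pCPU this tick
        waiting = runnable[PCPUS:]    # the rest accrue ready time
        busy_ticks += len(scheduled)
        ready_ticks += len(waiting)
        runnable = waiting            # each VM only needs 1 tick of work

host_util = busy_ticks / (PCPUS * PERIOD * PERIODS) * 100
avg_rdy = ready_ticks / (VMS * PERIOD * PERIODS) * 100

print(f"host CPU utilization: {host_util:.0f}%")  # 40%
print(f"average per-VM %RDY:  {avg_rdy:.0f}%")    # 15%
```

With these made-up numbers you get a host sitting at 40% utilization while the VMs average 15% ready time, which is eerily close to the figures we have both been describing: the host has spare capacity on average, but never enough at the instants everyone wants it.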
One thing I would be interested in understanding better: when you run into this issue, are the VMs with %RDY times in acceptable ranges receiving the correct amount of CPU time, allowing them to reach the speeds they need? Or are they limited in the processor power they can get because the scheduler deschedules them when other VMs request processor time? And when you do add resources and bring average %RDY times down across the board, will the VMs that were in acceptable ranges before the addition see a performance increase? I would think that if you have more runnable processes than the processors can handle, the CPUs would run at nearly 100% while each process is scheduled, completing the task at hand as quickly as possible so the next process waiting for processor time can run. Otherwise you are effectively governing how quickly a process can complete, which I suspect the scheduler is actually doing, since it needs to keep multiple threads in sync when working with SMP VMs.
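On the SMP sync point specifically: under strict (gang) co-scheduling, which early ESX releases used for SMP VMs, a multi-vCPU VM can only be placed when enough pCPUs are free at the same instant. This deliberately extreme Python toy (not the real scheduler, and modern ESX uses relaxed co-scheduling, which softens but does not eliminate the effect) shows how that alone can starve an SMP VM on a mostly idle host:

```python
# Extreme toy of strict (gang) co-scheduling: a 4-vCPU VM can only run
# when all 4 pCPUs are free at the same instant. Three staggered 1-vCPU
# VMs keep exactly one pCPU busy every tick, so the SMP VM co-stops
# forever even though the host is 75% idle. Numbers are illustrative.

PCPUS = 4
TICKS = 300
busy = 0       # pCPU-ticks doing work
smp_ready = 0  # ticks the SMP VM spent runnable but co-stopped

for tick in range(TICKS):
    used = 1                   # one staggered uni-VM runs each tick
    busy += used
    if PCPUS - used >= 4:      # strict co-scheduling: need all 4 free
        busy += 4              # SMP VM runs all 4 vCPUs this tick
    else:
        smp_ready += 1         # SMP VM waits: ready time accrues

print(f"host utilization: {busy / (PCPUS * TICKS) * 100:.0f}%")  # 25%
print(f"SMP VM %RDY:      {smp_ready / TICKS * 100:.0f}%")       # 100%
```

The toy is rigged, of course, but it shows why descheduling to keep an SMP VM's vCPUs in sync can cost ready time that has nothing to do with raw MHz being exhausted.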
It just bugs me that VMware (and I am sure others) teach it the way they do, and when confronted about it have no real answers. I'm not certain tech support even fully understands the scheduler in detail (and I wouldn't expect any front-line support to).
%RDY (or %READY) is the time a VM spends waiting in the queue to be scheduled on a physical CPU so that it can actually use it. The higher this value, the lower the utilization will be, since the VM is waiting rather than running.
What we are trying to get to the root of is: if there are physical CPU resources available, why are the VM-level ready values so high? If the CPU were under contention, high ready values would be justified.
To me it just shows there isn't an efficient algorithm in place that can utilize all the available resources. Yes, the scheduling algorithm has improved over previous generations, but there is still a huge shortcoming for VDI environments, where people want to see high consolidation ratios. At the very least, best practices should include guidelines against over-consolidation, or an optimum/maximum number of vCPUs for best CPU scheduling.
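In the absence of official guidance, the rough rule I have seen people plan around is a vCPU:pCPU overcommit ratio rather than MHz. A trivial sketch; the 4:1 ratio and the 16-core host below are purely illustrative assumptions, not VMware recommendations:

```python
# Back-of-the-envelope capacity planning by vCPU:pCPU ratio.
# All numbers here are illustrative assumptions, not vendor guidance.

def max_vms(pcpu_cores, vcpus_per_vm, target_ratio):
    """How many identical VMs fit under a given vCPU:pCPU overcommit ratio."""
    return int(pcpu_cores * target_ratio // vcpus_per_vm)

# e.g. a 16-core host held to a hypothetical 4:1 overcommit ratio
print(max_vms(pcpu_cores=16, vcpus_per_vm=1, target_ratio=4))  # 64 desktops
print(max_vms(pcpu_cores=16, vcpus_per_vm=2, target_ratio=4))  # 32 SMP VMs
```

Note how the SMP VMs halve the count at the same ratio, which is exactly the kind of guideline I wish the best-practices documents spelled out.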
Hoping someone from engineering, or with similar "inside" high-level knowledge, will chime in.
After reviewing our hardware configuration and getting a better understanding of the options available in the BIOS, we found that the power regulator on our blades was set to "HP Dynamic Power Savings Mode". In that configuration ESX was not allowed to manage the hardware's power management features. After changing the setting to "OS Control Mode" and rebooting the hosts, we can now configure the power management options on each host (leaving them at the default, maximum performance), and since the change we are seeing far fewer ready-time issues. We have also seen the host CPU usage counters move up and down, whereas before the configuration change the bars sat at a fairly steady 30%.
The ready-time issues have not completely gone away (as I expected, since we are overprovisioning our hardware), but at least now I am a bit happier to see the %CPU bars move around.