I've been using VMware virtual infrastructure software for about 7 years, in very large enterprise environments - so I know a bit, but what I'm seeing below is weird, and I'm sure it is not my imagination...
OK, when I look at some vCenter advanced performance charts, such as the Realtime CPU chart with CPU Ready times in particular (although the same behaviour shows in other charts and counters too), I see CPU Usage% at about 10% on average and CPU Ready at (say) 50 milliseconds on average. When I look at the next larger time period (i.e. past day), the CPU Usage% is similar, but the CPU Ready is far higher, such as over 1,000ms on average, including for the part of the chart that covers the last hour. On the next period up (i.e. past week), the CPU Ready is around 10,000ms or more, including the most recent part of the chart, which covers the last day... and it just gets ridiculous on the Monthly charts, with impossible CPU Ready times of many 100,000ms...
Please refer to the attached Word document which shows some screenshots of the CPU charts as an example.
I have not seen this behaviour anywhere else - using vCenter v4.1, ESXi v4.1 and vSphere client v4.1.
I've raised this with VMware Support and they initially said this is not normal, then they got back to me later and said it was normal... Then they sent me a link (here: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=200218...) to calculate the CPU Ready times from the CPU Ready %. Really, do they expect me to get a calculator whenever I need to do some performance monitoring on a VM and look at a chart beyond the "real time" charts?! That's insane... and it has not been this way in the past - the CPU Ready values used to show correct (or similar) figures in the rollups on the longer timeframe charts at other companies. Furthermore, the CPU Ready summation on the charts is marked as being in milliseconds, so I shouldn't have to be calculating anything!
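For what it's worth, the conversion between a CPU Ready summation in milliseconds and a CPU Ready % can be sketched roughly as below. This is a minimal sketch, not VMware's code: it assumes the commonly documented chart sample intervals (20 seconds for realtime, 5 minutes for past day, 30 minutes for past week, 2 hours for past month), and the names `CHART_INTERVAL_SECONDS`, `ready_percent` and `ready_ms_for_percent` are made up for illustration.

```python
# Assumed vCenter chart sample intervals, in seconds (illustrative values;
# check the statistics settings on your own vCenter).
CHART_INTERVAL_SECONDS = {
    "realtime": 20,
    "past_day": 300,
    "past_week": 1800,
    "past_month": 7200,
}


def ready_percent(ready_ms, chart):
    """Convert a CPU Ready summation (ms) into a % of the chart's sample interval."""
    interval_ms = CHART_INTERVAL_SECONDS[chart] * 1000
    return ready_ms * 100 / interval_ms


def ready_ms_for_percent(percent, chart):
    """Inverse: the summation (ms) that corresponds to a given CPU Ready %."""
    interval_ms = CHART_INTERVAL_SECONDS[chart] * 1000
    return percent * interval_ms / 100


if __name__ == "__main__":
    # Figures of the same size as the ones quoted in this post.
    samples = [
        (50, "realtime"),
        (1000, "past_day"),
        (10000, "past_week"),
        (100000, "past_month"),
    ]
    for ready_ms, chart in samples:
        print(f"{chart}: {ready_ms} ms -> {ready_percent(ready_ms, chart):.2f}%")
```

If those interval lengths apply, the 1000ms figure that matches 5% on the realtime chart would match a much smaller percentage on the longer charts, because each data point there covers a longer sample interval.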
I did some more checking and I managed to see (just once) on one CPU chart, the CPU Ready figures changed automatically after the chart was displayed, from very high numbers in the 10,000-100,000 level back to the 100-1000 level... So I'm figuring that my vCenter server VM is just displaying the actual raw CPU Ready% numbers from the SQL database and is struggling (because of contention with other VMs on the host/cluster) to automatically calculate the actual CPU Ready times in milliseconds? (I know I need to reserve more resources for the VC VM, but that's a political discussion at the moment, and more host server resources are on the way...)
This same behaviour with the CPU Ready times on the vCenter charts happens with many other counters too, but not all of them.
So is it just me that's seeing this behaviour in this one company, as I've seen the charts working elsewhere?
Also another question on CPU Ready times in the vCenter charts - a tech from our capacity monitoring team is wondering whether the CPU Ready counter that adds up the individual CPU Ready times from the individual vCPUs in a VM is actually a valid counter to monitor. This tech (who is not VMware informed/trained) is saying that the individual vCPU Ready times are the ones to monitor, not the summed total. Is this a valid statement, or not? (i.e. if the summed total of the vCPU Ready times goes over the 5% (or 1000ms) threshold which VMware recommend keeping under, but the individual vCPUs are approx 500ms or 2.5% on average, this tech is saying this is not a problem for the VM, and hence not a CPU contention issue). Based on this tech's argument, if an 8-vCPU VM was showing 500ms (i.e. 2.5%) CPU Ready times on average (with higher peaks) on each vCPU, this wouldn't be a problem, even if the total CPU Ready time for the VM would be 4000ms (i.e. 8x 500ms = 4 seconds)? (I beg to differ...)
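To make the two views in that argument concrete, here is a small sketch of the arithmetic. It assumes the realtime chart's 20-second sample interval (which is what makes 1000ms equal 5%), and the function name `ready_views` and its output keys are hypothetical, not anything from vCenter.

```python
# Assumed realtime sample interval: 20 seconds, expressed in milliseconds.
REALTIME_INTERVAL_MS = 20_000


def ready_views(per_vcpu_ready_ms, interval_ms=REALTIME_INTERVAL_MS):
    """Summarise per-vCPU CPU Ready values (ms) as a VM total and per-vCPU figures."""
    total_ms = sum(per_vcpu_ready_ms)
    vcpu_count = len(per_vcpu_ready_ms)
    return {
        # Summed counter, as shown for the VM object on the chart.
        "total_ms": total_ms,
        # The summed value taken as a % of one sample interval.
        "total_pct_of_interval": total_ms * 100 / interval_ms,
        # The same total normalised by the number of vCPUs.
        "avg_per_vcpu_pct": total_ms * 100 / (interval_ms * vcpu_count),
        # The single worst vCPU, which an average can hide.
        "worst_vcpu_pct": max(per_vcpu_ready_ms) * 100 / interval_ms,
    }


if __name__ == "__main__":
    # The 8-vCPU example from this post: 500 ms of ready time on each vCPU.
    views = ready_views([500] * 8)
    for name, value in views.items():
        print(f"{name}: {value}")
```

For the 8-vCPU example, the summed view gives 4000ms while the per-vCPU view stays at 2.5%, which is exactly the gap the two of us are arguing about. The sketch only shows the arithmetic; it does not settle which figure is the right one to alert on.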
But from my experience, a VM that is SMP (4-8 vCPUs) and heavily used (ie 40%+ CPU usage on average with higher peaks) can suffer from performance issues in a host/cluster with high over-commitment with a combined CPU Ready total of even 500ms (or 2.5%), not 5% on the individual vCPU counters. So IMHO, the 5% CPU Ready threshold is a guide only, and the real indicator of a VM with performance issues, is the user perception or user experience, combined with higher CPU usage %.
Many thanks in advance for any comments!