VMware Cloud Community
booradley201110
Contributor

Sudden %ready issue -- can resource pools have an effect?

All -- I'm seeing something very odd.

I have seven 4.1 VMware ESXi hosts. These are dual-socket, quad-core servers configured with Hyperthreading 'on' so I have 32vcpu/host. I'm running a test enterprise workload of Java-based application servers, reverse proxies, etc., sized from 2vcpu to 8vcpu.

(A while ago, we did some benchmarks and determined we really did get a benefit running at 8vcpu....)

All was good: we didn't overcommit -- for example I might put 3 x 8vcpu servers, and a 4vcpu server on a single host for 28vcpu total, and even running at 60-80% total busy on the *host*, we saw very little evidence of ready time.

Now -- I have a new environment (same hosts and license). We're testing a deployment automation that uses 'resource pools,' which is new for us. All resource pools have the default settings: "normal" is set for CPU and memory allocation, and 'expandable reservation' and 'unlimited' are both checked. BUT, ever since we've been deploying in this manner, the environment is incredibly sensitive to vCPU allocation, and I'm seeing high %ready times even on hosts that have only 3 VMs -- 2 x 8vcpu and 1 x 2vcpu -- and are only running at 30% busy total, on the host!

This is very puzzling. Since the only 'big' change to our environment was these resource pools, I'm suspicious they're affecting the CPU scheduling, but I don't see how, if I've left them to default values.   Thx!

3 Replies
MKguy
Virtuoso

These are dual-socket, quad-core servers configured with Hyperthreading 'on' so I have 32vcpu/host.

Um, then your physical servers only have 16 threads available and not 32.

How large exactly is your %RDY, and does it have a notable impact? The general rule of thumb is that less than 5% per vCPU is OK. So on your 8-vCPU VMs, it shouldn't be much of an issue as long as you stay below 40% total.
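To make that rule of thumb concrete, here's a small sketch (my own illustration, not a VMware tool) of how the per-vCPU limit scales into the aggregate %RDY figure that esxtop reports, which is summed across all of a VM's vCPUs:

```python
def rdy_threshold(num_vcpus, per_vcpu_limit=5.0):
    """Aggregate %RDY above which the 5%-per-vCPU rule of thumb is exceeded.

    esxtop reports %RDY summed across all vCPUs, so the acceptable
    aggregate figure grows with the vCPU count.
    """
    return num_vcpus * per_vcpu_limit

print(rdy_threshold(8))  # 40.0 -> the 40% figure for an 8-vCPU VM
print(rdy_threshold(2))  # 10.0
```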

What about other metrics such as %CSTP or %MLMTD?

If you suspect the resource pools to be the cause, why not just remove them and test? Or temporarily disable DRS on the cluster (DRS is required for using resource pools on a cluster, so disabling it removes them).

You could also try playing with the parameter mentioned in this KB article (might as well upgrade to 5.0 where this is fixed too):

http://kb.vmware.com/kb/1020233

I also highly recommend the famous HA and DRS book by Duncan Epping and Frank Denneman or at least the DRS deepdive here:

http://www.yellow-bricks.com/drs-deepdive/

-- http://alpacapowered.wordpress.com
jhanekom
Virtuoso

I'd agree with the "ramblings of some guy from Germany": "high %READY" is a relative term; a good starting point for baseline purposes could also be Duncan's ESXTOP threshold page:  http://www.yellow-bricks.com/esxtop/

Resource pools do have some gotchas, but one of the key things to understand is that share-based prioritisation (like you describe) - whether with resource pools or not - kicks in only if there is contention for resources.  So either this mechanism is kicking in and you somehow have resource contention, or there is some other cause somewhere else (or there's really no problem and the figures you're seeing are quite normal, if unexpected.)
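As a toy model of that point (this is an illustration, not VMware's actual scheduler, and it ignores reservations and limits): shares only matter once total demand exceeds capacity; below that, every VM simply gets what it asks for.

```python
def entitlements(demands_mhz, shares, capacity_mhz):
    """Toy share-based CPU allocation.

    With spare capacity, shares are irrelevant and each VM gets its
    demand. Under contention, capacity is divided in proportion to
    shares (reservations/limits omitted for simplicity).
    """
    if sum(demands_mhz) <= capacity_mhz:
        return list(demands_mhz)  # no contention: shares never kick in
    total_shares = sum(shares)
    return [capacity_mhz * s / total_shares for s in shares]

# No contention: the 2000-share VM gets no advantage.
print(entitlements([1000, 1000], [2000, 1000], 4000))  # [1000, 1000]
# Contention: capacity split 2:1 by shares.
print(entitlements([3000, 3000], [2000, 1000], 4500))  # [3000.0, 1500.0]
```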

As MKguy suggests, if you have reason to suspect something is really wrong, go ahead and move all the VMs to the root of the resource pool tree (or delete the pools entirely), which would eliminate the pools as a possibility.  This can be done without downtime.  If it resolves the "problem", you can troubleshoot from there.

Other things I can think of that might be worth mentioning:

  • Double-check that you don't have any limits set on resource pools *or* individual VMs.  This could definitely result in higher %READY times.  See Duncan's page above for counters that can show whether limits of some sort are kicking in (%MLMTD.)
  • You mention substantial differences in %READY times before and now.  Did you measure in the same way?  For example, some of the ready counters in vCenter are cumulative milliseconds for the measurement period, whereas the values in ESXTOP are averaged percentages.  That could be confusing if you don't expect it - in the realtime view in vCenter (20-second samples), a CPU Ready value of 2,000ms is equivalent to 10% on a 1vCPU VM.
  • ~20-30 workloads over 7 machines:  you might have a different mix of workloads on each server, which is showing different results, though you do mention lower CPU utilisation percentages.
  • on multi-CPU VMs, %READY can be "high" but still normal - a %READY time of 10% on a 1vCPU VM could be equivalent to a %READY time of 80% on an 8vCPU VM
  • Increased %READY times are quite normal for busy VMs
  • Multi-tier resource pools complicate things tremendously.  Simplicity is key here - try to stick to one level of resource pools if at all possible.  If you're just wanting to organise your VMs, use folders, not resource pools.
  • Read the resource management guide; it's unfortunately quite big, but it explains the technical concepts and best practices very clearly: http://pubs.vmware.com/vsphere-50/topic/com.vmware.ICbase/PDF/vsphere-esxi-vcenter-server-50-resourc...
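On the cumulative-vs-average point above, here's a small sketch of the conversion (my own helper, assuming the 20-second sample interval of vCenter's realtime charts) from the milliseconds vCenter reports to the percentages ESXTOP shows:

```python
REALTIME_INTERVAL_MS = 20_000  # vCenter realtime chart samples are 20 seconds

def ready_ms_to_pct(ready_ms, interval_ms=REALTIME_INTERVAL_MS):
    """Convert a cumulative CPU Ready value (ms) to a percentage of the interval."""
    return ready_ms / interval_ms * 100

def ready_ms_to_pct_per_vcpu(ready_ms, num_vcpus, interval_ms=REALTIME_INTERVAL_MS):
    """Per-vCPU average, comparable against the 5%-per-vCPU rule of thumb."""
    return ready_ms / interval_ms * 100 / num_vcpus

print(ready_ms_to_pct(2_000))               # 10.0 -> 10% on a 1vCPU VM
print(ready_ms_to_pct_per_vcpu(16_000, 8))  # 10.0 -> 80% aggregate across 8 vCPUs
```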
jrmunday
Commander

In addition to the great advice already posted, see this interesting discussion that I started two months ago:

http://communities.vmware.com/thread/391284?tstart=0

I first noticed this on 4.1 and saw the same behavior on 5.0. Could it be that you are seeing the same issues as me, but only noticed it after implementing resource pools?

vExpert 2014 - 2022 | VCP6-DCV | http://www.jonmunday.net | @JonMunday77