simonadams
Contributor

Report on Over-allocated vCPU

We have what seems to be a fairly common problem: more vCPUs allocated to VMs than they actually need, resulting in contention in an environment that actually has plenty of capacity.

Has anyone got any ideas on how to easily identify VMs that have too many vCPUs allocated? Ideally this would allow short bursts to go over the reported 'recommended' allocation.

So a 4 vCPU VM that averages 20% total CPU utilisation over the past week, but has two peaks over 50% of less than 10 mins each time, should definitely be a valid candidate to be reduced to 2 vCPU (unless people have a huge amount of spare cash to support a very low consolidation ratio).

But a 2 vCPU VM that averages 10% total CPU utilisation over the past week, but has one 'peak' of 3 hours at 80% CPU utilisation, should probably not be reduced to 1 vCPU.

Surely there is an optimal level of consolidation, and a script could pick out the worst cases of over-allocation fairly reliably?
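
Roughly, the logic I have in mind looks like the sketch below. The 25% threshold and the 30-minute tolerance are just guesses to be tuned, and $weeklyAvg / $longestPeakMins / $vm are placeholders for whatever a stats-gathering step would actually produce:

# Pseudo-logic only - $weeklyAvg and $longestPeakMins are hypothetical
# values I'd expect a stats-gathering step to supply, and the numbers
# are guesses to be tuned per environment.
if ($weeklyAvg -lt 25 -and $longestPeakMins -lt 30 -and $vm.NumCpu -gt 1) {
    "{0}: candidate to halve from {1} vCPU" -f $vm.Name, $vm.NumCpu
}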

This seems such a key factor in getting the most out of vSphere that I'm surprised I can't find anything like it around!  The performance impact of having many 4 vCPU VMs in a cluster means that a VM's performance can be terrible even on an ESX host that is only using 30% of its CPU...

LucD
Leadership

Getting the data out is not that difficult.

I would use the Get-Stat cmdlet with the cpu.usage.average metric to get the values.
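
Something along these lines would get you started. The 7-day window is an assumption, and -Instance "" restricts the output to the aggregate value across all vCPUs:

# Minimal sketch: weekly average CPU usage per powered-on VM.
$start = (Get-Date).AddDays(-7)
foreach ($vm in Get-VM | Where-Object {$_.PowerState -eq "PoweredOn"}) {
    $stats = Get-Stat -Entity $vm -Stat cpu.usage.average -Start $start -Instance ""
    $avg = ($stats | Measure-Object -Property Value -Average).Average
    "{0}`t{1} vCPU`t{2:N1}% average" -f $vm.Name, $vm.NumCpu, $avg
}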

The more difficult part will be the application of the rules.

Coding each such rule is easy, but making an intelligent script that interprets the data for all possible over-allocation situations is more difficult.
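
To show what I mean, your first rule from above could look something like this, reusing $stats and $avg from the sketch earlier. The thresholds are yours, but the tolerance of 2 high samples is my assumption:

# Sketch of a single rule: flag a VM whose weekly average stays under 25%
# and which shows at most 2 samples above 50% (all numbers to be tuned).
$peaks = @($stats | Where-Object {$_.Value -gt 50}).Count
if ($avg -lt 25 -and $peaks -le 2 -and $vm.NumCpu -gt 1) {
    $newCount = [Math]::Max(1, [Math]::Floor($vm.NumCpu / 2))
    "{0}: candidate to go from {1} to {2} vCPU" -f $vm.Name, $vm.NumCpu, $newCount
}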

And it will depend on your environment. It could well be that the 10-minute peaks of 80% CPU are very important in my environment and negligible in yours.


Blog: lucd.info  Twitter: @LucD22  Co-author PowerCLI Reference

simonadams
Contributor

If I use that average metric, what is the largest/longest peak I could miss?

For example, say we had a generic script that used this average (which is an hourly average?): if the average for a 4 vCPU VM never goes above 50% it recommends a change to 2 vCPU, and if it never goes above 25% it recommends a change to 1 vCPU. I guess there could still be quite large/long peaks that it would ignore?

LucD
Leadership

The length of the interval depends on the period for which you retrieve statistical data.

See my PowerCLI & vSphere statistics – Part 1 – The basics post for further details on the intervals.

Depending on the statistical levels you have defined for each historical interval, there are also cpu.usage.maximum and cpu.usage.minimum, which give you the highest/lowest value over the interval but not the duration I'm afraid.
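
For example, assuming your statistics level makes the metric available and $vm holds the VM, this would show the highest recorded intervals (but again, not how long the VM stayed there):

# Sketch: top-5 interval maxima for one VM over the past week.
Get-Stat -Entity $vm -Stat cpu.usage.maximum -Start (Get-Date).AddDays(-7) -Instance "" |
    Sort-Object -Property Value -Descending |
    Select-Object -First 5 -Property Timestamp, Value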

The best way to tackle this, imho, is to use Realtime statistics (20 second samples) or to use the Get-EsxTop cmdlet (5 second intervals).

The latter is quite a bit more difficult to use I'm afraid.

But with Realtime statistics you can only go back for about 1 hour.

So you will have to record the values for a longer period in an external file.
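
A collector along these lines, scheduled to run every hour or so, would do. The output folder is an assumption:

# Sketch of a collector: dump the last hour of 20-second realtime samples
# to a timestamped CSV (run it e.g. hourly from a scheduled task).
$file = "C:\Reports\cpu-{0:yyyyMMddHHmm}.csv" -f (Get-Date)
Get-VM | Where-Object {$_.PowerState -eq "PoweredOn"} |
    ForEach-Object { Get-Stat -Entity $_ -Stat cpu.usage.average -Realtime -Instance "" } |
    Select-Object @{N="VM";E={$_.Entity.Name}}, Timestamp, Value |
    Export-Csv -Path $file -NoTypeInformation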

Once you have the data, a 2nd script can read the file and apply the rules to report on over-allocated guests.
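
Something like this, reading back the CSVs from the collector sketch above. The 25%/50% thresholds and the 30-minute tolerance (90 samples of 20 seconds) are assumptions to adjust for your environment:

# Sketch of the 2nd script: merge the collected samples and apply the rules.
$samples = Get-ChildItem "C:\Reports\cpu-*.csv" | ForEach-Object { Import-Csv -Path $_.FullName }
$samples | Group-Object -Property VM | ForEach-Object {
    $values = $_.Group | ForEach-Object { [double]$_.Value }
    $avg    = ($values | Measure-Object -Average).Average
    # 90 samples of 20 seconds above 50% = roughly 30 minutes spent over
    # the threshold (cumulative, not necessarily consecutive)
    $high   = @($values | Where-Object { $_ -gt 50 }).Count
    if ($avg -lt 25 -and $high -lt 90) { "{0}: looks over-allocated" -f $_.Name }
}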


Blog: lucd.info  Twitter: @LucD22  Co-author PowerCLI Reference

zsoltesz
Enthusiast

It is not PowerCLI, but as far as I know you can try VMware CapacityIQ for 60 days. After running it for a few days or weeks, you can get useful reports from it, for example about oversized machines.
