simonadams
Contributor

Report on Over-allocated vCPU

We have what seems to be a fairly common problem: more vCPUs allocated to VMs than they actually need, resulting in contention in an environment that actually has plenty of capacity.

Has anyone got any ideas on how to easily identify VMs that have too many vCPUs allocated? Ideally this would allow short bursts to go over the reported 'recommended' allocation.

So a 4 vCPU VM that averages 20% total CPU utilisation over the past week, but has two peaks over 50% of less than 10 mins each time, should definitely be a valid candidate to be reduced to 2 vCPU (unless people have a huge amount of spare cash to support a very low consolidation ratio).

But a 2 vCPU VM that averages 10% total CPU utilisation over the past week, but has one 'peak' of 3 hours at 80% CPU utilisation, should probably not be reduced to 1 vCPU.

Surely there is an optimal level of consolidation, and a script could pick out the worst cases of over-allocation fairly reliably?
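
Roughly, the logic I have in mind looks like the sketch below. The 25% threshold and the 30-minute tolerance are just guesses to be tuned, and $weeklyAvg / $longestPeakMins / $vm are placeholders for whatever a stats-gathering step would actually produce:

# Pseudo-logic only - $weeklyAvg and $longestPeakMins are hypothetical
# values I'd expect a stats-gathering step to supply, and the numbers
# are guesses to be tuned per environment.
if ($weeklyAvg -lt 25 -and $longestPeakMins -lt 30 -and $vm.NumCpu -gt 1) {
    "{0}: candidate to halve from {1} vCPU" -f $vm.Name, $vm.NumCpu
}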

This seems such a key factor in getting the most out of vSphere that I'm surprised I can't find anything like it around!  The performance impact of having many 4 vCPU VMs in a cluster means that a VM's performance can be terrible even on an ESX host that is only using 30% of its CPU...

LucD
Leadership

Getting the data out is not that difficult.

I would use the Get-Stat cmdlet with the cpu.usage.average metric to get the values.
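
Something along these lines would get you started. The 7-day window is an assumption, and -Instance "" restricts the output to the aggregate value across all vCPUs:

# Minimal sketch: weekly average CPU usage per powered-on VM.
$start = (Get-Date).AddDays(-7)
foreach ($vm in Get-VM | Where-Object {$_.PowerState -eq "PoweredOn"}) {
    $stats = Get-Stat -Entity $vm -Stat cpu.usage.average -Start $start -Instance ""
    $avg = ($stats | Measure-Object -Property Value -Average).Average
    "{0}`t{1} vCPU`t{2:N1}% average" -f $vm.Name, $vm.NumCpu, $avg
}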

The more difficult part will be the application of the rules.

Coding each such rule is easy, but making an intelligent script that interprets the data for all possible over-allocation situations is more difficult.
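
To show what I mean, your first rule from above could look something like this, reusing $stats and $avg from the sketch earlier. The thresholds are yours, but the tolerance of 2 high samples is my assumption:

# Sketch of a single rule: flag a VM whose weekly average stays under 25%
# and which shows at most 2 samples above 50% (all numbers to be tuned).
$peaks = @($stats | Where-Object {$_.Value -gt 50}).Count
if ($avg -lt 25 -and $peaks -le 2 -and $vm.NumCpu -gt 1) {
    $newCount = [Math]::Max(1, [Math]::Floor($vm.NumCpu / 2))
    "{0}: candidate to go from {1} to {2} vCPU" -f $vm.Name, $vm.NumCpu, $newCount
}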

And it will depend on your environment. It could well be that the 10-minute peaks of 80% CPU are very important in my environment and negligible in yours.


Blog: lucd.info  Twitter: @LucD22  Co-author PowerCLI Reference

simonadams
Contributor

If I use that average metric, what is the largest/longest peak I could miss?

For example, say we had a generic script that used this average (which is an hourly average?): if the average for a 4 vCPU VM never goes above 50% it recommends a change to 2 vCPU, and if it never goes above 25% it recommends a change to 1 vCPU. I guess there could still be quite large/long peaks that it would ignore?

LucD
Leadership

The length of the interval depends on the period for which you retrieve statistical data.

See my PowerCLI & vSphere statistics – Part 1 – The basics post for further details on the intervals.

Depending on the statistical levels you have defined for each historical interval, there are also cpu.usage.maximum and cpu.usage.minimum, which give you the highest/lowest value over the interval but not the duration I'm afraid.
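
For example, assuming your statistics level makes the metric available and $vm holds the VM, this would show the highest recorded intervals (but again, not how long the VM stayed there):

# Sketch: top-5 interval maxima for one VM over the past week.
Get-Stat -Entity $vm -Stat cpu.usage.maximum -Start (Get-Date).AddDays(-7) -Instance "" |
    Sort-Object -Property Value -Descending |
    Select-Object -First 5 -Property Timestamp, Value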

The best way to tackle this, imho, is to use Realtime statistics (20 second samples) or to use the Get-EsxTop cmdlet (5 second intervals).

The latter is quite a bit more difficult to use I'm afraid.

But with Realtime statistics you can only go back for about 1 hour.

So you will have to record the values for a longer period in an external file.
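
A collector along these lines, scheduled to run every hour or so, would do. The output folder is an assumption:

# Sketch of a collector: dump the last hour of 20-second realtime samples
# to a timestamped CSV (run it e.g. hourly from a scheduled task).
$file = "C:\Reports\cpu-{0:yyyyMMddHHmm}.csv" -f (Get-Date)
Get-VM | Where-Object {$_.PowerState -eq "PoweredOn"} |
    ForEach-Object { Get-Stat -Entity $_ -Stat cpu.usage.average -Realtime -Instance "" } |
    Select-Object @{N="VM";E={$_.Entity.Name}}, Timestamp, Value |
    Export-Csv -Path $file -NoTypeInformation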

Once you have the data, a 2nd script can read the file and apply the rules to report on over-allocated guests.
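
Something like this, reading back the CSVs from the collector sketch above. The 25%/50% thresholds and the 30-minute tolerance (90 samples of 20 seconds) are assumptions to adjust for your environment:

# Sketch of the 2nd script: merge the collected samples and apply the rules.
$samples = Get-ChildItem "C:\Reports\cpu-*.csv" | ForEach-Object { Import-Csv -Path $_.FullName }
$samples | Group-Object -Property VM | ForEach-Object {
    $values = $_.Group | ForEach-Object { [double]$_.Value }
    $avg    = ($values | Measure-Object -Average).Average
    # 90 samples of 20 seconds above 50% = roughly 30 minutes spent over
    # the threshold (cumulative, not necessarily consecutive)
    $high   = @($values | Where-Object { $_ -gt 50 }).Count
    if ($avg -lt 25 -and $high -lt 90) { "{0}: looks over-allocated" -f $_.Name }
}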


Blog: lucd.info  Twitter: @LucD22  Co-author PowerCLI Reference

zsoltesz
Enthusiast

It is not PowerCLI, but as far as I know you can try VMware CapacityIQ for 60 days. After running it for a few days or weeks, you can get useful reports from it, for example about oversized machines.
