VMware Cloud Community
steeple82
Contributor

CPU over commitment performance benchmarking

Hello all, I'm at a loss and could really use some help.

I am an end user and do not have access to vSphere, as this is all handled by a third-party vendor.

Known current configuration

vSphere 6.7

Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz

2 CPUs totaling 44 physical cores

10 VMs = 44 total cores used (one VM has 8 cores, the other nine have 4 allocated each)

No Resource pools

Now, from the above, there is no CPU overcommitment and everything runs fine. However, adding another VM with 4 cores causes a 5-10% decrease in performance. That is not a big deal, but the vendor's policy is to double the CPU count, and doing this causes a 400% decrease in performance, even if VMs are powered off.
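To make the ratios concrete, here is a rough sketch of the arithmetic in Python. The VM sizes and core count are the ones listed above. Treating "double the CPU count" as doubling every VM's allocation is my assumption, and so is the logical CPU count, which only applies if hyperthreading is enabled on the host.

```python
# Figures from the post: 2 sockets with 44 physical cores in total,
# one 8-vCPU VM and nine 4-vCPU VMs.
PHYSICAL_CORES = 44
LOGICAL_CPUS = PHYSICAL_CORES * 2  # assumption: hyperthreading is enabled


def overcommit_ratio(vcpus_per_vm, cores):
    """Return allocated vCPUs divided by the number of cores."""
    return sum(vcpus_per_vm) / cores


current = [8] + [4] * 9              # 44 vCPUs, the configuration that runs fine
plus_one = current + [4]             # 48 vCPUs, one extra 4-vCPU VM
doubled = [v * 2 for v in current]   # 88 vCPUs, assumed meaning of "double"

for label, vms in [("current", current), ("plus one VM", plus_one), ("doubled", doubled)]:
    per_core = overcommit_ratio(vms, PHYSICAL_CORES)
    per_thread = overcommit_ratio(vms, LOGICAL_CPUS)
    print(f"{label}: {sum(vms)} vCPUs, {per_core:.2f}:1 per core, {per_thread:.2f}:1 per logical CPU")
```

Even the doubled case is only 2:1 against physical cores, which is why the size of the slowdown looks out of proportion.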

Background:

We use eDiscovery processing software where data is divvied out to several VMs to be worked on. When I first came on with the company, it would take 2.5 - 3 hours to process a 16GB dataset. Having several years of experience at other companies, I found this extremely slow and unacceptable. After several talks with the software vendor and our infrastructure vendor, no one could determine the cause. When I spoke to a close friend of mine, his first reaction was CPU overcommitment and no resource pools.

After many, many emails back and forth and a lot of testing, we moved VMs off the host to put us at a 1:1 CPU ratio, and we now get a 25-40 minute benchmark on my data.

The vendor refuses to use resource pools. They say that we never cap out CPU usage in our current or our previous configuration. The software we run is not inherently CPU intensive, so I would never expect it to cap out the CPU in the first place. This issue is preventing us from scaling our environment and from making full use of our VMs and cores. I feel we should be able to run at a 2:1 or 5:1 ratio, but doing so causes a huge decrease in performance.

My question is: what could be causing this, and what would you recommend?

Is there a way I can benchmark CPU Ready time without having access to vSphere? What clear indications can I convey to the vendor to make them understand? I'm trying to find the smoking gun, as the vendor is not interested in finding the problem because their monitoring says the CPU isn't being utilized. Having another piece of software to benchmark with would be fantastic, so as to rule out a problem with our current software, since that is where everyone is trying to shift the blame, even with my results.
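One vendor-independent way to gather evidence from inside a guest is a simple stall probe: spin on a monotonic clock and record every gap that is longer than a tight loop should ever see. This is only a sketch, not a measurement of CPU Ready itself. Gaps can also come from the guest OS scheduling other work, so it is best run on an otherwise idle VM and compared between the 1:1 configuration and the overcommitted one. The function names, the 5 ms threshold and the 2 second duration are my own arbitrary choices.

```python
import time


def measure_stalls(duration_s=2.0, threshold_ms=5.0):
    """Spin on a monotonic clock and return every gap (in ms) above threshold_ms."""
    threshold = threshold_ms / 1000.0
    stalls = []
    start = last = time.perf_counter()
    while last - start < duration_s:
        now = time.perf_counter()
        gap = now - last
        if gap > threshold:
            # The loop did not run for `gap` seconds: something took the CPU away.
            stalls.append(gap * 1000.0)
        last = now
    return stalls


def summarize(stalls, duration_s):
    """Condense the recorded gaps into a few numbers that are easy to compare."""
    total_ms = sum(stalls)
    return {
        "stall_count": len(stalls),
        "worst_ms": max(stalls, default=0.0),
        "stalled_pct": 100.0 * total_ms / (duration_s * 1000.0),
    }


if __name__ == "__main__":
    duration = 2.0
    print(summarize(measure_stalls(duration), duration))
```

If the stalled percentage jumps when the extra vCPUs are added while the guest itself is idle, that is at least a hint that the vCPU is waiting on the host rather than on the processing software.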

I am stuck between a rock and hard place and anything you all can recommend would be helpful.

Thank you community people!

6 Replies
scott28tt
VMware Employee

Difficult to do a deep investigation without having access to the host or cluster metrics...


-------------------------------------------------------------------------------------------------------------------------------------------------------------

Although I am a VMware employee I contribute to VMware Communities voluntarily (ie. not in any official capacity)
VMware Training & Certification blog
nachogonzalez
Commander

Hey, hope you are doing fine.

As Scott said, it is kind of difficult to do a benchmark without access to the vSphere host.

Let me ask you a few questions:

Do you have a host dedicated for you?
Do all your VMs reside on a single host?

Are the applications single threaded or multi threaded?
Do you know if the host has hyperthreading enabled?

Two things I would do if I had access to the platform:

1. Run an oversized VMs capacity report in vROps (example: Reclaiming Resources from Oversized VMs).
2. Open an SSH connection to the host, run esxtop, and check %RDY and %CSTP (the latter might increase if you create a new VM).
The second check will indicate whether your resources are overcommitted, in which case you should get more resources or optimize your usage.
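If the vendor can only share vCenter performance chart data rather than esxtop output, the CPU Ready counter there is a summation in milliseconds per sample interval, not a percentage. A small sketch of the usual conversion follows. The interval lengths are the standard chart intervals, the function name is made up for illustration, and dividing by the vCPU count to get a per-vCPU average is a common convention rather than something stated in this thread.

```python
# Sample interval in seconds for each vSphere performance chart view.
INTERVAL_S = {
    "realtime": 20,
    "past_day": 300,
    "past_week": 1800,
    "past_month": 7200,
    "past_year": 86400,
}


def ready_percent(summation_ms, interval="realtime", vcpus=1):
    """Convert a CPU Ready summation (ms) into a percentage of the interval.

    With vcpus > 1 the result is the average per vCPU, on the assumption
    that the VM-level summation is the sum over all of its vCPUs.
    """
    interval_ms = INTERVAL_S[interval] * 1000
    return summation_ms * 100 / interval_ms / vcpus


# Example: 2000 ms of ready time in one 20 s real-time sample on a 4-vCPU VM.
print(ready_percent(2000))            # whole VM
print(ready_percent(2000, vcpus=4))   # average per vCPU
```

A frequently quoted rule of thumb is that sustained ready time above roughly 5% per vCPU deserves a closer look, but treat that threshold as a starting point for the conversation with the vendor, not a hard limit.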


hope that works

steeple82
Contributor

Thank you (and Scott) for the reply. I definitely agree: it makes this difficult, and dare I say impossible, without access. I'm praying for miracles though, anything to guide these people to the solution, because they are fully convinced that they are right, that nothing is wrong, and that it's the processing software's fault.

let me ask you a few questions:

Do you have a host dedicated for you? For this particular software, we have a dedicated host and intentionally keep it undersubscribed.
Do all your VMs reside on a single host? We have several hosts with a total of 456 CPUs with 912 vCPUs allocated

Are the applications single threaded or multi threaded? The application is multi threaded and makes really good use of the CPU when processing data.

Do you know if the host has hyperthreading enabled? From conversations, they have indicated yes.

Thank you for your recommendation, I'm gonna find a way to sneak that in and have them do that.

ZibiM
Enthusiast

Hi

A few questions:

1. Are you sure that the CPU is the bottleneck?

2. Did you check for storage latency or memory overcommitment?

3. Could you ask your infrastructure provider to ensure that turbo is enabled on the ESXi host?

4. Is your environment, by any chance, running through vCloud Director?

Regards

steeple82
Contributor

Hi ZibiM.

Few questions

1. Are you sure that the CPU is the bottleneck? Am I sure? Heck no, but it's the closest thing I can get to with the multitude of tests I could run and with having them change the CPU count on the VMs. Adding machines or adding more CPUs to existing machines causes a decrease in performance.

2. Did you check for storage latency or memory overcommitment? Memory allocation actually seems reasonable and not at all overcommitted.

3. Could you ask your infrastructure provider to ensure that turbo is enabled on the ESXi host? I will definitely ask this.

4. Is your environment, by any chance, running through vCloud Director? I will have to ask, but I find it doubtful.

Thanks for the response and suggestion!

ZibiM
Enthusiast

If you observe serious performance degradation when you add just a few vCPUs, then the CPU really is a serious culprit.

Still, it shouldn't behave like that. CPU contention alone should not cause such severe performance degradation.

Your friend is right: this really looks like hitting a resource pool CPU limit.

It would be great if you could ask for the CPU performance graph of the ESXi host.

The default real-time CPU Usage % chart, captured while you are processing data, could really provide some insight into what's going on.

Anyway, please check storage latency and memory usage.

What storage latency do you observe on your VMs?

Do your VMs actively use the swap / page file?
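For the storage latency question, a quick in-guest check is possible without any vSphere access. The sketch below times small synchronous writes to a temporary file. It is an illustration only, with made-up function names and arbitrary block size and sample count, and it measures latency as seen by the guest (file system and virtual disk included), not what the datastore reports.

```python
import os
import statistics
import tempfile
import time


def write_latencies_ms(directory=None, block_size=4096, samples=200):
    """Time `samples` small writes, each forced to disk with fsync."""
    block = os.urandom(block_size)
    latencies = []
    # The temporary file is deleted automatically when the block exits.
    with tempfile.NamedTemporaryFile(dir=directory) as f:
        for _ in range(samples):
            t0 = time.perf_counter()
            f.write(block)
            f.flush()
            os.fsync(f.fileno())  # wait until the OS reports the data as written
            latencies.append((time.perf_counter() - t0) * 1000.0)
    return latencies


def report(latencies):
    """Return median, 95th percentile and worst latency in milliseconds."""
    ordered = sorted(latencies)
    return {
        "median_ms": statistics.median(ordered),
        "p95_ms": ordered[int(0.95 * (len(ordered) - 1))],
        "max_ms": ordered[-1],
    }


if __name__ == "__main__":
    print(report(write_latencies_ms()))
```

Pass `directory` to point the test at the volume the processing data lives on, and run it both at the 1:1 ratio and after the extra vCPUs are added. If the numbers stay flat while processing time balloons, storage is a less likely suspect.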
