I am running Red Hat VMs on ESXi 6.0 hosts.
My hosts have 2 physical sockets of 8 cores each, at 2600 MHz.
I am defining 2 VMs per host: one has 8 vCPUs (2x4), the other has 6 vCPUs (2x3).
I was hoping that leaving 2 cores "unused" would prevent any CPU overcommitment problems, but apparently this is not the case.
What I observe, when I start using the VM for some performance measurements, is that the co-stop metric of my VMs quickly climbs to several seconds, even up to 10 seconds!
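For reference, here is how I normalize the raw counter into a per-vCPU percentage. I am assuming vCenter's co-stop counter is milliseconds summed over a 20-second sampling interval across all of the VM's vCPUs; if your vCenter version rolls it up differently, adjust accordingly:

```python
def costop_percent(costop_ms, vcpus, interval_s=20):
    """Rough per-vCPU co-stop percentage from the vCenter counter.

    Assumes the counter is milliseconds of co-stop summed across all
    vCPUs over one sampling interval (20 s by default in vCenter).
    """
    return costop_ms / (interval_s * 1000 * vcpus) * 100

# 10 s of summed co-stop on an 8-vCPU VM in one 20 s sample:
print(f"{costop_percent(10_000, 8):.2f}%")  # 6.25%
```

Even normalized this way, anything in the whole-percent range is far above the near-zero co-stop a healthy VM should show.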
Then I tried to use CPU reservation, setting a value corresponding to the number of cores times the frequency.
For example, for the 8-core VM, I set a reservation of 8 x 2600 MHz, and for the 6-core VM, I set the reservation to 6 x 2600 MHz.
Doing that simply prevents the second VM from even starting: no error message, nothing, hitting "Power on" is just ignored.
Can someone explain what I am doing wrong?
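For reference, my reservation arithmetic looks like this. Note that the nominal capacity below is just cores times frequency; the actual unreserved capacity that admission control checks against is lower, since the hypervisor's own system resource pool and per-VM overhead also hold reservations (how much lower is host-specific, so I have not put a number on it):

```python
# Reservation arithmetic for my setup. All figures are nominal;
# ESXi admission control compares against the host's *unreserved*
# capacity, which is below cores x frequency because the hypervisor
# itself reserves resources.

CORE_MHZ = 2600
HOST_CORES = 16

host_nominal = HOST_CORES * CORE_MHZ      # 41600 MHz nominal capacity
vm1 = 8 * CORE_MHZ                        # 20800 MHz reserved
vm2 = 6 * CORE_MHZ                        # 15600 MHz reserved

total_reserved = vm1 + vm2                # 36400 MHz
headroom = host_nominal - total_reserved  # 5200 MHz nominal headroom

print(total_reserved, headroom)
```

On paper there are 5200 MHz to spare, so the silent power-on failure suggests the host's real unreserved capacity sits below the nominal figure; the host's resource allocation view should show how much capacity is actually left to reserve.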
First of all, you should configure your VMs to be "wide" by only configuring 1 core but multiple sockets. So rather than 2 x 4, configure 1 x 8. Do the same for the other. This allows ESXi to place the VMs into the appropriate NUMA node. Re-test and you should see slightly different results.
Ok, will try that.
BTW, I saw we can set VM affinity to CPUs, would it be a good idea to try it?
My setup is fairly stable: 2 VMs per host, all VMs running real-time applications that are very sensitive to latency.
I set it up as you recommended, 8 sockets of 1 core on the 1st VM and 6 sockets of 1 core on the other, no CPU affinity, but I do not see any improvement for the moment.
As soon as the application is under medium load (host at 40% utilization), I have regular spikes of co-stop values of several seconds for the entire VM.
Now the question is: how do I identify the missing resource that is causing this co-stop?
Are we sure this is a lack of CPU?
One cause of high co-stop (%CSTP in esxtop) is over-use of vSMP: the host has to wait for enough physical CPUs to become available in parallel. You should try to decrease the number of vCPUs on this system, as allocating vCPUs to a VM that aren't being utilized will also cause performance degradation (just like having too few).
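As a toy illustration of why VM width matters (this is a deliberately simplified model, not how the ESXi co-scheduler actually works): if a VM can only make progress when roughly all of its vCPUs can land on free physical cores at once, then a wider VM finds such a slot less often under the same background load.

```python
import random

random.seed(42)

def schedulable_fraction(vcpus, pcpus=16, busy_prob=0.4, trials=100_000):
    """Fraction of scheduling attempts where >= vcpus pCPUs are free.

    Toy model: each of the host's pCPUs is independently busy with
    probability busy_prob at any instant (roughly "40% host load").
    """
    hits = 0
    for _ in range(trials):
        free = sum(1 for _ in range(pcpus) if random.random() > busy_prob)
        if free >= vcpus:
            hits += 1
    return hits / trials

wide = schedulable_fraction(8)    # the 8-vCPU VM
narrow = schedulable_fraction(6)  # the 6-vCPU VM
print(f"8 vCPUs schedulable: {wide:.1%}, 6 vCPUs: {narrow:.1%}")
```

The narrower VM gets a co-schedulable slot noticeably more often, and the time the wider VM spends waiting for that slot is what shows up as co-stop. The real scheduler uses relaxed co-scheduling and skew tracking, so it is more forgiving than this model, but the direction of the effect is the same.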
Well, I confirm that after giving the VM 6 vCPUs instead of 8, co-stop no longer goes above a few milliseconds.
But the fact is that my application, when heavily used, will require more vCPUs.
Coming back to the meaning of this metric, I thought that high values were a sign that my VM was requesting more resources than the host could give.
But here, that was not the case, since the host has more resources than the 2 VMs running on the host were normally asking for.
I suspect there is some misunderstanding on my side about the resources a VM can ask for, the reason being that there is not a one-to-one relationship between physical cores and vCPUs.
Any good explanation would be welcome.
This is a subject of protracted discussion, and no three- or four-sentence response will do. What I'd recommend is that you dive into how the CPU scheduler works and how resource consumption and scheduling work in a VMware virtual environment. Many books and articles cover this, but the most recent one that goes into all the details you'd possibly want to know is the excellent resource by Niels Hagoort and Frank Denneman entitled VMware vSphere 6.5 Host Resources Deep Dive. It's being offered free from Rubrik, and I'd highly recommend you give it a read, specifically the chapter on the CPU Scheduler.