VMware Cloud Community
Rugbot
Contributor

Troubleshooting high VM % CPU Contention in vROps

Hope someone can help. I've been through all the good blogs out there, and clearly there is still more to understand!

We have a few 2-vCPU software firewalls (VMs) in our environment with CPU Contention > 20% (Demand ~70% and Usage ~50%), but I can't find any VM CPU counters that explain the cause.

Observation 1: each of the following VM CPU counters in vROps is well under 1%: [% CPU Idle | % CPU Ready | % CPU Co-stop | % CPU IO Wait | % CPU System | % CPU Swap Wait]

                        So, no help there. What other counter could show where that 20% CPU contention is coming from?
                        Or am I somehow misunderstanding what % CPU Contention indicates? (I expected one of the above counters to give me the root cause of the contention.)


Observation 2: when I create a stacked chart of cpu|used (ms) + cpu|idle (ms) + cpu|ready (ms) + cpu|co-stop (ms) + cpu|iowait (ms) + cpu|swap-wait (ms), vROps clearly shows used (ms) =~50% and idle (ms) =~50% of the 40000 ms total CPU time.
                         So why does the time-based cpu|idle (ms) counter indicate ~50% idle, whereas the % CPU Idle counter (same VM, same time period) says less than 1%?
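
For reference, the conversion behind those ~50% figures can be sketched as below. This is a minimal sketch, assuming a 20-second sampling interval (which is what a 40000 ms total for 2 vCPUs implies); the function name is illustrative and not a vROps API.

```python
def counter_ms_to_percent(value_ms, interval_s=20, num_vcpus=2):
    """Convert a time-based CPU counter (ms) to a percentage of total vCPU time.

    Assumption: the counter is summed over all vCPUs for one sampling
    interval, so the denominator is interval * vCPU count.
    """
    total_ms = interval_s * 1000 * num_vcpus  # 20 s * 2 vCPUs = 40000 ms
    return 100.0 * value_ms / total_ms


# Values from the stacked chart: roughly 20000 ms used and 20000 ms idle.
print(counter_ms_to_percent(20000))  # 50.0
```

The same conversion shows that a counter would have to stay under 400 ms per interval to read below 1% on this VM.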

                       

Thanks

6 Replies
mark_j
Virtuoso

Usage is 50%, but it could be 70% if it weren't constrained by 20%. The question is where the contention is coming from. Check all the other VMs on the host to make sure the allocated vCPUs don't exceed the pCPUs. What do the other VMs show for contention? Are any limits in place? The alert in vR Ops is actually a nice starting point, then the Analysis tab, then the Troubleshooting tab.
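
The arithmetic in that reading can be sketched as follows. It is a simplification of the explanation above (contention as the gap between what the VM demanded and what it actually got), not the documented vROps formula; the function name is hypothetical.

```python
def estimated_contention_pct(demand_pct, usage_pct):
    """Approximate contention as unmet demand.

    Assumption: contention is roughly demand minus usage, floored at zero.
    vROps may compute its % Contention metric differently.
    """
    return max(0.0, float(demand_pct) - float(usage_pct))


# Figures from the original post: Demand ~70%, Usage ~50%.
print(estimated_contention_pct(70, 50))  # 20.0
```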

If you find this or any other answer useful please mark the answer as correct or helpful.
MacVay
Enthusiast

One thing to consider is that CPU Ready and CPU Usage are not mutually exclusive. I have seen it many times where CPU Ready is high and usage is low. In my experience, the cause is typically one of two things. The first is many oversized VMs on the cluster creating scheduling problems (Co-Stop can hunt this down). The second is that the BIOS on the hosts is set to some form of CPU power management: during periods of lower CPU utilization the BIOS scales back the CPU, and VMware is unaware that the resource is not fully available.

Cheers,

Rugbot
Contributor

Thanks for the responses, mark_j and MacVay.

I still haven't found the root cause, but here is an update.

  • The customer (the firewall guys) is not seeing any service issues from the application's perspective.

    Anyway, I'm keen to drill down and learn why this may be, for future reference.

  • I learnt that the main (and only) applications on these VMs are single-threaded. I can see 1 of the 2 cores maxing out its CPU usage, and the
    other core never reaching more than 10% used (presumably system processes). Could this somehow be reflected in the vROps contention measure?

  • Looking at the ESX hosts these VMs reside on:

[a] Allocated vCPUs are > pCPUs, but none of the VMs on them show significant contention (ready < 2% | co-stop ~0%).

[b] The power management of these hosts is in "balanced" mode, so that may well be a cause. Unfortunately I am unable to test this on our production
systems by changing it to "performance" mode and watching the impact.
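
As a rough check on the single-threaded observation above, averaging the two per-core figures lands close to the ~50% VM-level usage reported earlier. This is only a sketch: it assumes VM-level usage is a plain average across vCPUs, and the function name is illustrative.

```python
def vm_usage_pct(per_vcpu_pct):
    """Average per-vCPU utilization into a single VM-level figure.

    Assumption: VM-level usage is the unweighted mean across vCPUs.
    """
    return sum(per_vcpu_pct) / len(per_vcpu_pct)


# One core maxed out by the single-threaded application, the other at
# its observed ceiling of 10%.
print(vm_usage_pct([100.0, 10.0]))  # 55.0
```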

Thanks again. I will update if I learn anything new.

liverson20
Contributor

I know this is a slightly old post, but we just ran into this and I thought I'd share our findings. You are on to something with the power management. We have a few hosts in our environment that report "balanced"; most of ours report "high performance". You more than likely need to change this in the BIOS of the host, and VMware will then automatically set it to high performance. Check with your hardware vendor on how to enable high performance for the CPUs.

I was able to test this by vMotioning a VM that had >20% contention to a host that had high performance configured. We saw an instant drop in contention. I know it wasn't because there was less load on the host, because I actually moved it to a host with slightly higher CPU demand from its VMs.

Do you have enough hosts in your prod cluster that you could put one in maintenance mode? Make the change in the BIOS, verify in vCenter that it is running high performance, take it out of maintenance mode, and move on to the next one.

What I haven't been able to figure out is why we don't see any wait or ready time with this. If the VM is waiting for the power management to "enable" some more GHz, it seems we ought to be able to see the VM waiting for that. Hope this helps, and sorry it was a bit lengthy.

jengl
Enthusiast

Yeah, I also agree with the previous posters: it is most likely the power management in the BIOS.

I had the same issue with different VMs, and in the end it was the CPU Latency metric that was driving the high CPU contention. The ESXi hosts were configured for maximum performance, but the power management in the Dell BIOS needed to be changed to OS DPM. After that the CPU latency dropped and the contention vanished.

Greetings,

jengl

Jemimus
Contributor

We are having exactly the same issue.

We ran into a problem where the ESX host was putting CPU cores into power-saving C-states (even though it wasn't supposed to; BIOS v1.0 on a Dell PowerEdge R630).

This caused CPU contention that was not easily visible. %RDY, %CSTP, and %VMWAIT were all low, but vROps was showing a % Contention value of anywhere between 5% and 65%.


It turns out that this was visible in the %LAT_C metric, but nowhere else.


---------------------------------------
%LAT_C Percentage of time the resource pool or world was ready to run but was not scheduled to run because of CPU resource contention.

-------------------------------------------------

This metric can be made visible in esxtop by turning on the extra columns option "i" (SUMMARY STATS = CPU Summary Stats).

The Latency metric can also be seen in the vSphere client if you have the statistics level turned up high enough.

In any case, I know now that %LAT_C is also used for the vROPS %Contention counter.

I suspect that there are also several disk and network i/o metrics that go into it.
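
The pattern described above (high %LAT_C while the usual counters stay low) could be flagged with a sketch like the one below. The sample values, thresholds, and names are hypothetical; real figures would have to be read from esxtop or the vSphere client by hand or by a separate collector.

```python
from dataclasses import dataclass


@dataclass
class CpuSample:
    """One VM's esxtop-style CPU percentages (values here are made up)."""
    name: str
    rdy: float     # %RDY
    cstp: float    # %CSTP
    vmwait: float  # %VMWAIT
    lat_c: float   # %LAT_C


def hidden_contention(samples, lat_c_min=5.0, visible_max=2.0):
    """Return names of VMs whose %LAT_C is high while the usual counters are low.

    Assumption: the thresholds are illustrative, not VMware guidance.
    """
    return [
        s.name
        for s in samples
        if s.lat_c >= lat_c_min and max(s.rdy, s.cstp, s.vmwait) < visible_max
    ]


samples = [
    CpuSample("fw-01", rdy=0.5, cstp=0.0, vmwait=0.3, lat_c=22.0),
    CpuSample("app-01", rdy=8.0, cstp=1.0, vmwait=0.2, lat_c=9.0),
    CpuSample("idle-01", rdy=0.1, cstp=0.0, vmwait=0.1, lat_c=0.4),
]
print(hidden_contention(samples))  # ['fw-01']
```

Only the first VM is flagged: the second has contention that %RDY already explains, and the third has none.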
