CCTeck
Contributor

High CPU on Core 0 - ESXi 5.5 and Cisco M2 hardware

Hello there,

We have been experiencing an issue since an update to 5.5U2. We are seeing high CPU on a single core (core 0) while all other cores appear to be performing as expected. Interestingly, we only seem to see this issue on our Cisco M2 hardware; M3 or later and HP ProLiant hardware do not seem to be affected. I have seen that other people have had similar issues, but it appears Update 2 has addressed the issue for some.

https://communities.vmware.com/message/2428266

https://communities.vmware.com/thread/458133

VMware support have initially said that we should not be concerned: the host has minimal load, and the CPU scheduler has decided that the best optimization will occur if all VMs are kept on a single CPU.

I don't pretend to be an expert on CPU scheduling, but this behaviour does not seem right, given that it only seems to occur on a specific combination of hardware and ESXi versions.

Does anyone see similar issues when using Cisco B series hardware? 

We have provided the following information, and it points to the VMK ACPI Interrupt.

From esxtop, this is the instantaneous interrupt activity per second:

12:31:18am up 23 days  5:24, 961 worlds, 10 VMs, 76 vCPUs; CPU load average: 0.12, 0.12, 0.13

VECTOR  COUNT/s TIME/int COUNT_0  COUNT_1  COUNT_2  COUNT_3  COUNT_4  COUNT_5  COUNT_6  COUNT_7  COUNT_8  COUNT_9  COUNT_10 COUNT_11 COUNT_12 COUNT_13 COUNT_14 COUNT_15 COUNT_16 COUNT_17 COUNT_18 COUNT_19 COUNT_20 COUNT_21 COUNT_22 COUNT_23 COUNT_24 COUNT_25 COUNT_26 COUNT_27 COUNT_28 COUNT_29 COUNT_30 COUNT_31 COUNT_32 COUNT_33 COUNT_34 COUNT_35 COUNT_3

0x20        0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.

0x21    61450.2     14.0  61450.2      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.0      0.
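As a rough sanity check on those figures: if TIME/int is the average handling time per interrupt in microseconds (an assumption about the esxtop column, not something this output states), then vector 0x21 alone would account for most of one core. A minimal Python sketch of the arithmetic; the function name is made up for illustration:

```python
# Figures copied from the esxtop interrupt panel above (vector 0x21).
interrupts_per_sec = 61450.2
# Assumption: TIME/int is the average time per interrupt, in microseconds.
usec_per_interrupt = 14.0


def core_busy_fraction(rate_per_sec, usec_each):
    """Fraction of each second one CPU spends handling these interrupts."""
    return rate_per_sec * usec_each / 1_000_000


busy = core_busy_fraction(interrupts_per_sec, usec_per_interrupt)
print(f"{busy:.1%} of one core spent handling vector 0x21")
```

Under that assumption the result is roughly 86% of a core, which would line up with core 0 looking busy while the host's overall load average stays low.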

Here’s what it belongs to:

/var/log # vmkvsitools irqinfo

0x21:  114633018426          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0 VMK ACPI Interrupt
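To check the distribution without counting columns by eye, a row like the one above can be split into per-CPU counters. This is only a sketch: it assumes the row format is exactly as pasted here (vector, colon, one counter per CPU, then the interrupt name) and that the Nth counter belongs to CPU N. The helper names are hypothetical, and the sample row is shortened to three zero columns for brevity.

```python
def parse_irqinfo_line(line):
    """Split one irqinfo-style row into (vector, per-CPU counts, name)."""
    vector, rest = line.split(":", 1)
    tokens = rest.split()
    counts = []
    # Leading all-digit tokens are the per-CPU counters; the rest is the name.
    while tokens and tokens[0].isdigit():
        counts.append(int(tokens.pop(0)))
    return vector.strip(), counts, " ".join(tokens)


def busiest_cpu_share(counts):
    """Return (cpu_index, share of all interrupts) for the busiest CPU."""
    total = sum(counts)
    if total == 0:
        return None, 0.0
    cpu = max(range(len(counts)), key=counts.__getitem__)
    return cpu, counts[cpu] / total


# Shortened sample: the real row has one column per logical CPU.
sample = "0x21:  114633018426  0  0  0 VMK ACPI Interrupt"
vector, counts, name = parse_irqinfo_line(sample)
cpu, share = busiest_cpu_share(counts)
print(f"{vector} ({name}): CPU {cpu} handled {share:.0%} of interrupts")
```

A share close to 100% on a single CPU, as in the row above, means the interrupt is not being spread across cores at all.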

Attached is a screenshot of what we are seeing.

Capture.JPG

3 Replies
kgouldsk
Contributor

A little extra detail.

The ACPI driver is not balancing interrupts across CPUs. This is very similar to the pre-4.0 days, when drivers loaded into the console OS were bound to cpu0, as the console OS itself was. When interrupts were shared with devices such as network or storage that required access by the kernel, the resulting conflict created a bottleneck because everything was handled on cpu0.

In this case, in addition to being limited to cpu0, the rate at which interrupts are being generated looks spurious.

Alistar
Expert

Since the VMkernel's processes are always bound to core 0, I'd suggest upgrading the drivers and firmware of your hardware to the newest available versions and, if that doesn't help, opening a ticket with the vendor. This is definitely neither normal nor healthy behavior for your hypervisor. Good luck!

Stop by my blog if you'd like 🙂 I dabble in vSphere troubleshooting, PowerCLI scripting and NetApp storage - and I share my journeys at http://vmxp.wordpress.com/
hostasaurus
Enthusiast

Just adding to this ancient thread: if you call Cisco about this, they may not find the issue, because they believe it only occurs on specific B220 M2 hardware with Linux as the OS. It also occurs on other hardware, such as the B440 with the E7 CPUs, and obviously with vSphere as the OS. The fix involves a change to BMC settings that end users are not allowed to access, as it requires Cisco's symmetric key auth, which they'll do via a WebEx session. So every affected blade will need to be rebooted and have the fix applied, one at a time, via WebEx with them. It does resolve the issue.
