Hello there,
We are experiencing an issue after an update to 5.5U2. We are seeing high CPU on a single core (core 0) while all other cores look to be performing as expected. Interestingly, we only seem to see this issue on our Cisco M2 hardware; M3 or later and HP ProLiant hardware do not seem to be affected. I have seen that other people have had similar issues, but it appears Update 2 has addressed the issue for some.
https://communities.vmware.com/message/2428266
https://communities.vmware.com/thread/458133
VMware support have initially said that we should not be concerned: the host has minimal load, and the CPU scheduler has decided that the best optimization will occur if all VMs are kept on a single CPU.
I don't pretend to be an expert on CPU scheduling, but this behaviour does not seem right, especially given that it only seems to occur on a specific combination of hardware and ESXi versions.
Does anyone see similar issues when using Cisco B series hardware?
We have provided the following information to support, and it points to the VMK ACPI Interrupt.
From esxtop – this is instantaneous interrupt activity/s
12:31:18am up 23 days 5:24, 961 worlds, 10 VMs, 76 vCPUs; CPU load average: 0.12, 0.12, 0.13
VECTOR COUNT/s TIME/int COUNT_0 COUNT_1 COUNT_2 COUNT_3 COUNT_4 COUNT_5 COUNT_6 COUNT_7 COUNT_8 COUNT_9 COUNT_10 COUNT_11 COUNT_12 COUNT_13 COUNT_14 COUNT_15 COUNT_16 COUNT_17 COUNT_18 COUNT_19 COUNT_20 COUNT_21 COUNT_22 COUNT_23 COUNT_24 COUNT_25 COUNT_26 COUNT_27 COUNT_28 COUNT_29 COUNT_30 COUNT_31 COUNT_32 COUNT_33 COUNT_34 COUNT_35 COUNT_3
0x20 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.
0x21 61450.2 14.0 61450.2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.
Here’s what it belongs to:
/var/log # vmkvsitools irqinfo
0x21: 114633018426 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 VMK ACPI Interrupt
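To put a number on the esxtop row for vector 0x21, here is a rough Python sketch that works out how much of one core the interrupt is likely consuming. It assumes the columns are vector, COUNT/s, TIME/int, then one count per CPU, and that TIME/int is in microseconds; the function name `summarize_interrupt_row` is made up for illustration, and the sample row is shortened to the first few CPU columns.

```python
def summarize_interrupt_row(line):
    """Summarize one esxtop interrupt row.

    Assumed layout: VECTOR COUNT/s TIME/int COUNT_0 COUNT_1 ...
    Assumed unit:   TIME/int is microseconds per interrupt.
    """
    fields = line.split()
    vector = fields[0]
    total_rate = float(fields[1])        # interrupts per second, all CPUs
    time_per_int_us = float(fields[2])   # assumed microseconds per interrupt
    per_cpu = [float(x) for x in fields[3:]]

    # Find the CPU handling the most interrupts for this vector.
    busiest_cpu = max(range(len(per_cpu)), key=lambda i: per_cpu[i])
    busiest_rate = per_cpu[busiest_cpu]

    # Fraction of this vector's interrupts landing on the busiest CPU.
    share = busiest_rate / total_rate if total_rate > 0 else 0.0

    # Microseconds spent per second on that CPU, as a percentage of one core.
    core_pct = busiest_rate * time_per_int_us / 1_000_000 * 100

    return {
        "vector": vector,
        "busiest_cpu": busiest_cpu,
        "share": share,
        "core_pct": core_pct,
    }


if __name__ == "__main__":
    # Row for vector 0x21 from the output above, shortened to four CPU columns.
    row = "0x21 61450.2 14.0 61450.2 0.0 0.0 0.0"
    print(summarize_interrupt_row(row))
```

Under those assumptions, 61450.2 interrupts/s at 14.0 microseconds each works out to roughly 86% of core 0 spent servicing this one interrupt, with 100% of the vector's interrupts landing on that core.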
Attached is a screenshot of what we are seeing.
A little extra detail.
The ACPI driver is not balancing interrupts across CPUs. This is very similar to how, in pre-4.0 days, drivers loaded into the console OS would be bound to cpu0 as the console OS was; when interrupts were shared with devices such as network or storage that required access by the kernel, the resulting conflict would bottleneck because it was only handled on cpu0.
In this case, in addition to being limited to cpu0, the rate of interrupt generation itself appears to be spurious.
Since the VMkernel's processes are always bound to core 0, I'd suggest upgrading the drivers and firmware of your hardware to the latest available versions and, if that doesn't help, opening a ticket with the vendor. This is definitely not normal or healthy behavior for your hypervisor. Good luck!
Just adding to this ancient thread: if you call Cisco about this, they may not find this issue, because they believe it only occurs on specific B220 M2 hardware with Linux as the OS. It also occurs on other hardware, such as the B440 with the E7 CPUs, and obviously with vSphere as the OS. The fix involves a change to BMC settings that end users are not allowed to access, as it requires Cisco's symmetric key auth, which they'll do via a WebEx session. So each blade will need to be rebooted and have the fix applied, one at a time, via WebEx with them. It does resolve the issue.