VMware Cloud Community
rapid4cloud
Contributor

ESXi, 6.0.0,4192238, Got error: @BlueScreen: PCPU 13 locked up. Failed to ack TLB invalidate (total of 1 locked up, PCPU(s): 13).

Hi,


We need some help with the PSOD error.


Our system (ESXi 6.0.0, build 4192238), running on a SuperMicro X10DAX, hung a couple of weeks ago. It had been running just fine for a month since it was built.


We have looked through the KB but couldn't find an article that matches our case, and the system is still patched up to date (i.e. Build Number 4192238).

Also, we found an old thread in which some people recommended disabling "Collaborative Power Control" in the BIOS: https://communities.vmware.com/thread/498681?start=0&tstart=0

So we have tried updating the BIOS to the latest version and disabling power management in the BIOS configuration (changing Power Technology from "Energy Efficient" to "Disabled").

But I have no idea whether this will fix it, as the issue seems to occur randomly.


The errors from the dump log are given below.

2016-09-20T23:26:44.725Z cpu18:40932)WARNING: Heartbeat: 796: PCPU 12 didn't have a heartbeat for 8 seconds; *may* be locked up.
2016-09-20T23:26:44.725Z cpu18:40932)WARNING: Heartbeat: 796: PCPU 13 didn't have a heartbeat for 8 seconds; *may* be locked up.

2016-09-20T23:26:44.965Z cpu40:33272) VMware ESXi 6.0.0 [Releasebuild-4192238 x86_64]
PCPU 13 locked up. Failed to ack TLB invalidate (total of 1 locked up, PCPU(s): 13).

2016-09-20T23:26:44.966Z cpu40:33272)@BlueScreen: PCPU 13 locked up. Failed to ack TLB invalidate (total of 1 locked up, PCPU(s): 13).
2016-09-20T23:26:44.966Z cpu40:33272)Code start: 0x418001400000 VMK uptime: 2:15:13:23.275
2016-09-20T23:26:44.966Z cpu40:33272)0x43914fc1bbd0:[0x418001477bea]PanicvPanicInt@vmkernel#nover+0x37e stack: 0x43914fc1bc68
2016-09-20T23:26:44.966Z cpu40:33272)0x43914fc1bc60:[0x418001477eb5]Panic_NoSave@vmkernel#nover+0x4d stack: 0x43914fc1bcc0
2016-09-20T23:26:44.966Z cpu40:33272)0x43914fc1bcc0:[0x41800148bf05]TLBGetLockedCPUBacktraces@vmkernel#nover+0x25d stack: 0x9
2016-09-20T23:26:44.966Z cpu40:33272)0x43914fc1be80:[0x41800148c1f6]TLBDoInvalidate@vmkernel#nover+0x21a stack: 0x4391cc927000
2016-09-20T23:26:44.966Z cpu40:33272)0x43914fc1bed0:[0x4180019d1fb0]UserMem_CartelFlush@<None>#<None>+0xc0 stack: 0x0
2016-09-20T23:26:44.966Z cpu40:33272)0x43914fc1bf50:[0x418001a42cd6]UserMemTouchedEstimationLoop@<None>#<None>+0x1d2 stack: 0x26cff5dd57
2016-09-20T23:26:44.966Z cpu40:33272)0x43914fc1bfd0:[0x418001614c1e]CpuSched_StartWorld@vmkernel#nover+0xa2 stack: 0x0
2016-09-20T23:26:44.968Z cpu40:33272)base fs=0x0 gs=0x41804a000000 Kgs=0x0

2016-09-20T23:27:00.087Z cpu18:40948)World: 9740: PRDA 0x418044800000 ss 0x0 ds 0x10b es 0x10b fs 0x10b gs 0x0
2016-09-20T23:27:00.087Z cpu18:40948)World: 9742: TR 0x4020 GDT 0x43923fa21000 (0x402f) IDT 0x4180014c9000 (0xfff)
2016-09-20T23:27:00.087Z cpu18:40948)World: 9743: CR0 0x80010031 CR3 0x4318c4000 CR4 0x42768
2016-09-20T23:27:00.087Z cpu18:40948)Panic: 634: Panic from another CPU (cpu 18, world 40948): ip=0x4180014780a0 randomOff=0x1400000:
Machine Check Exception: Fatal (unrecoverable) MCE on PCPU18 in world 40948:vmx-vcpu-0:I
System has encountered a Hardware Error - Please contact the hardware vendor
2016-09-20T23:27:00.087Z cpu18:40948)Backtrace for current CPU #18, worldID=40948, rbp=0x43127f433e20
2016-09-20T23:27:00.087Z cpu18:40948)0x43923fa1be80:[0x418001413e68]Interrupts_SetFlags@vmkernel#nover+0x4 stack: 0x0, 0x43923fa1bf38, 0
2016-09-20T23:27:00.087Z cpu18:40948)0x43923fa1be88:[0x4180014777ec]PanicFreezeForPanicInt@vmkernel#nover+0x8c stack: 0x43923fa1bf38, 0x
2016-09-20T23:27:00.087Z cpu18:40948)0x43923fa1bea8:[0x4180019c132b]UserKernelExit@<None>#<None>+0x53 stack: 0x80, 0x4180019c1400, 0x439
2016-09-20T23:27:00.087Z cpu18:40948)0x43923fa1bec8:[0x4180019c1400]UserGenericSyscallExit@<None>#<None>+0x6c stack: 0x1, 0x80, 0x7e, 0x
2016-09-20T23:27:00.087Z cpu18:40948)0x43923fa1bef8:[0x4180019c1776]User_LinuxSyscallHandler@<None>#<None>+0x18a stack: 0x3ffeb2f6960, 0
2016-09-20T23:27:00.087Z cpu18:40948)0x43923fa1bf28:[0x41800148e5d1]User_LinuxSyscallHandler@vmkernel#nover+0x1d stack: 0x10b, 0x0, 0x0,
2016-09-20T23:27:00.087Z cpu18:40948)0x43923fa1bf38:[0x4180014c7044]gate_entry_@vmkernel#nover+0x0 stack: 0x0, 0xffffffffffffffe0, 0x8fe
2016-09-20T23:27:00.087Z cpu18:40948)Panic: 769: Halting PCPU 18.


System Specification.

CPU: 2 x Intel® Xeon® Processor E5-2687W v4 (30M Cache, 3.00 GHz)

MAINBOARD: 1 x Supermicro X10DAX Workstation Motherboard

RAM: 16 x Samsung DDR4 2133 MHz CL15 32GB (PC4 2133) Internal Memory M386A4G40DM0-CPB

RAID CARD: 2 x Adaptec RAID 81605ZQ with maxCache Components 2281600-R

OS: VMware vSphere Essentials Kit

6 Replies
zXi_Gamer
Virtuoso

Unfortunately, I have to put this one down to the hardware as the suspect. However, to make sure the system itself is healthy, you can run a system diagnostic before concluding.

Also, I am not familiar with SuperMicro server management, but most server vendors provide remote access such as iLO or DRAC, which would definitely capture hardware events or failures.

Dee006
Hot Shot

Hi,

I'm not very familiar with Supermicro, but a PSOD is mostly due to a faulty driver module loaded on the server. Have you raised a case with Supermicro to check firmware and driver compatibility? Can you try downgrading the ESXi build and see whether the PSOD recurs?

For the power mode, I would suggest changing to the high-performance power mode to avoid performance degradation.

rapid4cloud
Contributor

I will check on the server management.

Thank you for your suggestion.

rapid4cloud
Contributor

Thank you for your suggestion, Dee006.

I have raised a case with Supermicro, and they provided me with the latest BIOS version to fix the problem in VMware.

A similar issue with another version of ESXi is addressed in https://www.supermicro.com.tw/support/faqs/faq.cfm?faq=23729

We will try to update the BIOS soon.

But our concern now is how we can be sure the issue is really gone, as we don't have a way to reproduce it.

Dee006
Hot Shot

Finding the faulty component is a bit difficult. Keep the server under test load for observation.
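One way to make the observation period less manual is to scan the vmkernel log for the same warning signs quoted in the first post. What follows is only a rough Python sketch, not a VMware tool. The names `scan` and `scan_file` are made up for illustration, the patterns are copied from the log lines in this thread and may not match other builds, and the path you pass in is assumed to be whatever copy of the vmkernel log you have.

```python
import re
from collections import Counter

# Patterns copied from the log lines quoted in this thread (assumption:
# other ESXi builds may word these messages differently).
PATTERNS = {
    "heartbeat": re.compile(r"PCPU (\d+) didn't have a heartbeat for \d+ seconds"),
    "lockup": re.compile(r"@BlueScreen: PCPU (\d+) locked up"),
    "mce": re.compile(r"Machine Check Exception: .* on PCPU(\d+)"),
}


def scan(lines):
    """Count each kind of warning sign per PCPU in an iterable of log lines."""
    hits = {name: Counter() for name in PATTERNS}
    for line in lines:
        for name, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                hits[name][int(match.group(1))] += 1
    return hits


def scan_file(path):
    """Run scan() over a log file; `path` is a hypothetical copy of the log."""
    with open(path, errors="replace") as handle:
        return scan(handle)
```

For example, `scan_file("vmkernel.log")` (the file name is a placeholder) returns per-PCPU counts for each category. Empty counts over a long stretch under load are encouraging, but they do not prove the fix.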

Close the discussion if your issue is resolved.

-DK

Arthos
Enthusiast

Intel Broadwell CPUs have an erratum that causes ESXi to PSOD; refer to the KB below. Here, cpu 40 issues a TLB invalidate that PCPU 13 never acknowledges (it is locked up; where, no idea!), and then cpu 40 panics and dumps its trace. I suggest upgrading the BIOS to the latest version as the fix.

ESXi host fails with PSOD when using Intel Xeon Processor E5 v4, E7 v4, and D-1500 families (2146388...
