I just upgraded one of our ESXi hosts from 6.0 P5 to 6.5 U1 and ran about 15 VMs on it for about a week, and everything was fine. I then migrated 3 more hosts with vFRC enabled, and about 8 hours later I got the PSOD below. Has anyone ever come across anything like this? I have a service request open with support, but so far they have not been able to give me a concrete answer.
What is the make and model of this server, and what is its CPU configuration?
Can you give me the SR number?
SR 17549907708
Thanks Manfred
This came in from support today:
#0 Panic_WithBacktrace (sbt=sbt@entry=0x4300c8081e28, fmt=fmt@entry=0x41802ca63538 "PCPU %d: no heartbeat (%u/%u IPIs received)") at bora/vmkernel/main/panic.c:135
135 Panic_SaveRegs();
(gdb) bt
#0 Panic_WithBacktrace (sbt=sbt@entry=0x4300c8081e28, fmt=fmt@entry=0x41802ca63538 "PCPU %d: no heartbeat (%u/%u IPIs received)") at bora/vmkernel/main/panic.c:135
#1 0x000041802c8b3736 in HeartbeatHandleLockup (lockedUpInMS=49000, i=51) at bora/vmkernel/reliability/heartbeat.c:818
#2 HeartbeatCheckPCPU (timestampInMS=<optimized out>, i=51) at bora/vmkernel/reliability/heartbeat.c:716
#3 Heartbeat_DetectCPULockups (data=<optimized out>, timestamp=<optimized out>) at bora/vmkernel/reliability/heartbeat.c:517
#4 0x000041802c6fd40c in TimerBHHandlerLoop (list=0x43910afb6f50, curTC=994112000013552, t=0x43910afb6000) at bora/vmkernel/main/timer.c:2618
#5 Timer_BHHandler (unused=unused@entry=0x0) at bora/vmkernel/main/timer.c:2727
#6 0x000041802c6b176b in BHCheckBegin (canReschedule=1 '\001') at bora/vmkernel/main/bh.c:996
#7 BH_DrainAndDisableInterrupts (canReschedule=1 '\001') at bora/vmkernel/main/bh.c:1094
#8 0x000041802c6d3372 in IntrCookie_VmkernelInterrupt (vector=239, vectorData=vectorData@entry=0, fullFrame=fullFrame@entry=0x439153e9bc50) at bora/vmkernel/main/intrCookie.c:3958
#9 0x000041802c72e93d in IDTHandleInterrupt (fullFrame=0x439153e9bc50) at bora/vmkernel/main/idt.c:1288
#10 IDT_IntrHandler (fullFrame=0x439153e9bc50) at bora/vmkernel/main/idt.c:1311
#11 0x000041802c73d044 in gate_entry ()
#12 0x000041802c68b9c2 in CPU_StiMwaitInstr (hints=0, extensions=0) at bora/vmkernel/hardware/x86/cpu_int_arch.h:136
#13 Power_ArchSetCState (state=<optimized out>, c1type=<optimized out>) at bora/vmkernel/hardware/x86/power_arch.c:379
#14 0x000041802c67796c in Power_HaltPCPU (now=<optimized out>, c1type=<optimized out>) at bora/vmkernel/hardware/power.c:961
#15 0x000041802c8c49d3 in CpuSchedIdleHaltStart () at bora/vmkernel/sched/cpusched.c:12546
#16 CpuSchedIdleLoopInt () at bora/vmkernel/sched/cpusched.c:12746
#17 0x000041802c8c728a in CpuSchedBusyWait (mySchedPcpu=<optimized out>) at bora/vmkernel/sched/cpusched.c:12835
#18 CpuSchedTryBusyWait (prevIRQL=0 '\000', idleVcpu=0x439140da7100, nowStart=994111992008397, schedPcpu=0x418046c00080) at bora/vmkernel/sched/cpusched.c:7751
#19 CpuSchedChooseAndSwitch (prevIRQL=0 '\000', nhccNow=16392214628518, now=<optimized out>, schedPcpu=0x418046c00080, prev=0x439140da7100) at bora/vmkernel/sched/cpusched.c:7936
#20 CpuSchedDispatch (prevIRQL=prevIRQL@entry=0 '\000', prevState=prevState@entry=2147483648) at bora/vmkernel/sched/cpusched.c:8097
#21 0x000041802c8c8502 in CpuSchedWait (event=..., waitType=CPUSCHED_WAIT_NET, actionWakeupSet=0x0, queue=0x0, cookie=<optimized out>) at bora/vmkernel/sched/cpusched.c:9694
#22 0x000041802c8c85d5 in CpuSched_NoEvqWait (waitType=waitType@entry=CPUSCHED_WAIT_NET) at bora/vmkernel/sched/cpusched.c:9764
#23 0x000041802c8324a2 in NetPollWorldCallback (data=0x4300d10fd980) at bora/vmkernel/net/vmkapi_net_poll.c:605
#24 0x000041802c8c91b5 in CpuSched_StartWorld (destWorld=<optimized out>, previous=<optimized out>) at bora/vmkernel/sched/cpusched.c:10780
#25 0x0000000000000000 in ?? () from /build/storage61/release/bora-5969303/build/linux64/bora/build/esx/release/vmkmod-vmkernel64/chardevs
(gdb)
Analysis:
These events resemble an ongoing PR with the engineering team, in which the issue appears to be a low-memory allocation failure in lsi_mr3. I see that the server has lsi_mr3 driver version 6.910.18.00-1vmw installed. If this PSOD keeps happening at frequent intervals, we would ask you to update to the latest lsi_mr3 driver, which is available from the below URL, along with its corresponding compatible firmware version (25.5.2.0001):
Dell R920
4 x Intel(R) Xeon(R) CPU E7-4820 v2 @ 2.00GHz
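For anyone reading a trace like the one from support above: the frame arguments already show which PCPU stopped responding and for how long. Below is a minimal Python sketch that pulls this out of pasted gdb `bt` output. It assumes, purely from the argument names in frame #1, that `i` is the PCPU index and `lockedUpInMS` is the stall duration in milliseconds. This is not a VMware tool, and `parse_frames` and `lockup_info` are hypothetical helper names.

```python
import re

# Matches gdb frame lines such as:
#   #1 0x... in HeartbeatHandleLockup (lockedUpInMS=49000, i=51) at bora/.../heartbeat.c:818
# The address and the "at file:line" suffix are both optional.
FRAME_RE = re.compile(
    r"^#(?P<num>\d+)\s+"
    r"(?:0x[0-9a-fA-F]+\s+in\s+)?"
    r"(?P<func>[^\s(]+)\s*"
    r"\((?P<args>.*)\)"
    r"(?:\s+at\s+(?P<file>\S+):(?P<line>\d+))?"
)

def parse_frames(text):
    """Return one dict per backtrace frame found in the pasted gdb output."""
    frames = []
    for raw in text.splitlines():
        m = FRAME_RE.match(raw.strip())
        if m:
            frames.append({
                "num": int(m["num"]),
                "func": m["func"],
                "args": m["args"],
                "file": m["file"],
                "line": int(m["line"]) if m["line"] else None,
            })
    return frames

def lockup_info(frames):
    """Return (pcpu, locked_up_ms) from the HeartbeatHandleLockup frame, or None.

    Assumption: 'i' is the PCPU index and 'lockedUpInMS' the stall duration.
    """
    for f in frames:
        if f["func"] == "HeartbeatHandleLockup":
            kv = dict(p.strip().split("=", 1) for p in f["args"].split(","))
            return int(kv["i"]), int(kv["lockedUpInMS"])
    return None

# A few frames copied from the trace in this thread.
SAMPLE = """\
#0 Panic_WithBacktrace (sbt=sbt@entry=0x4300c8081e28, fmt=fmt@entry=0x41802ca63538 "PCPU %d: no heartbeat (%u/%u IPIs received)") at bora/vmkernel/main/panic.c:135
#1 0x000041802c8b3736 in HeartbeatHandleLockup (lockedUpInMS=49000, i=51) at bora/vmkernel/reliability/heartbeat.c:818
#11 0x000041802c73d044 in gate_entry ()
#14 0x000041802c67796c in Power_HaltPCPU (now=<optimized out>, c1type=<optimized out>) at bora/vmkernel/hardware/power.c:961
"""

frames = parse_frames(SAMPLE)
print(lockup_info(frames))  # (51, 49000)
```

Read that way, the trace reports PCPU 51 as unresponsive for about 49 seconds. Note that the frames below `gate_entry` (the idle and halt path) appear to be the stack of the CPU that ran the heartbeat check, not of PCPU 51 itself, so the backtrace alone does not show what PCPU 51 was doing.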