Solved: Re: ESXI Crash to nothing. PCPU didn't have a heartbeat. NMI RIPOFF

evoicefire · ‎10-05-2020

Hi,

Running into an unusual issue here. Every 2 days or so my system hangs (crashes?): no PSOD or anything, just a blank screen until I force a reboot.

This is an ESXi whitebox with a Ryzen 2600 CPU, an ASRock Rack X470D4U and a 16 GB Kingston DIMM (on the QVL for the mobo).

I am running ESXi-7.0.0-16324942-standard which is the latest build that I can find.

memtest is returning clean and I have made sure that RDRAND is returning sensible values. I'm at a bit of a loss. Below is the vmkernel.log output when the issue occurs. Any hints would be great.

2020-10-05T23:45:46.161Z cpu6:264546)WARNING: Heartbeat: 767: PCPU 4 didn't have a heartbeat for 7 seconds; *may* be locked up.

2020-10-05T23:45:46.161Z cpu4:262285)ALERT: NMI: 694: NMI IPI: RIPOFF(base):RBP:CS [0x368bd2(0x420035c00000):0x451a0469b9b0:0xf48] (Src 0x1, CPU4)

2020-10-05T23:45:46.161Z cpu4:262285)0x451a0469b918:[0x420035f68bd1]CpuSched_PcpuLoadGet@vmkernel#nover+0x26 stack: 0x43007f0025a8

2020-10-05T23:45:46.161Z cpu4:262285)0x451a0469b920:[0x420035f69844]CpuSchedMigrateGoodness@vmkernel#nover+0x5e1 stack: 0xff

2020-10-05T23:45:46.161Z cpu4:262285)0x451a0469b9c0:[0x420035f6b30c]CpuSched_VcpuMigrateBestPcpu@vmkernel#nover+0x4f5 stack: 0x27a043797c562

2020-10-05T23:45:46.161Z cpu4:262285)0x451a0469bcf0:[0x420035f6b7bd]CpuSched_VcpuWakeupMigrateUnified@vmkernel#nover+0x5e stack: 0x27a043797c562

2020-10-05T23:45:46.161Z cpu4:262285)0x451a0469bd20:[0x420035f55b0e]CpuSchedVcpuMakeReady@vmkernel#nover+0xdf stack: 0x451a151a1900

2020-10-05T23:45:46.161Z cpu4:262285)0x451a0469bd40:[0x420035f55bbc]CpuSchedWorldWakeup@vmkernel#nover+0x8d stack: 0x27a043797c562

2020-10-05T23:45:46.161Z cpu4:262285)0x451a0469bd70:[0x420035f55e31]CpuSchedForceWakeupInt@vmkernel#nover+0x82 stack: 0x3c

2020-10-05T23:45:46.161Z cpu4:262285)0x451a0469bd90:[0x420035d0b750]Timer_BHHandler@vmkernel#nover+0x1f9 stack: 0x4519c0200560

2020-10-05T23:45:46.161Z cpu4:262285)0x451a0469be20:[0x420035cbb531]BH_Check@vmkernel#nover+0x6e stack: 0x0

2020-10-05T23:45:46.161Z cpu4:262285)0x451a0469bea0:[0x420035f58079]CpuSched_SafePreemptionPoint@vmkernel#nover+0x16 stack: 0x1f

2020-10-05T23:52:10.098Z cpu1:262848)DVFilter: 6344: Checking disconnected filters for timeouts

2020-10-06T00:02:09.154Z cpu7:262848)DVFilter: 6344: Checking disconnected filters for timeouts

2020-10-06T00:10:46.301Z cpu2:262793)WARNING: Heartbeat: 767: PCPU 1 didn't have a heartbeat for 7 seconds; *may* be locked up.

2020-10-06T00:10:46.301Z cpu1:262285)ALERT: NMI: 694: NMI IPI: RIPOFF(base):RBP:CS [0x33d79(0x420035c00000):0x451a0469bdf8:0xf48] (Src 0x1, CPU1)

2020-10-06T00:10:46.301Z cpu1:262285)0x451a0469bdd8:[0x420035c33d78]NRandomHwrngRdrand@vmkernel#nover+0x9 stack: 0x0

2020-10-06T00:10:46.301Z cpu1:262285)0x451a0469bde0:[0x420035c21366]extract_buf@vmkernel#nover+0x33 stack: 0x8

2020-10-06T00:10:46.301Z cpu1:262285)0x451a0469bea0:[0x420035c21b77]extract_entropy_user@vmkernel#nover+0x5c stack: 0x451a0469bef0

2020-10-06T00:10:46.301Z cpu1:262285)0x451a0469bf00:[0x420035dab142]VmMemCow_PShareUpdateCache@vmkernel#nover+0xab stack: 0x100000

2020-10-06T00:10:46.301Z cpu1:262285)0x451a0469bf70:[0x420035f7bcd0]MemSchedEst_PShareLoop@vmkernel#nover+0x161 stack: 0x0

2020-10-06T00:10:46.301Z cpu1:262285)0x451a0469bfe0:[0x420035f5e2f9]CpuSched_StartWorld@vmkernel#nover+0x82 stack: 0x0

2020-10-06T00:10:46.301Z cpu1:262285)0x451a0469c000:[0x420035cc44c3]Debug_IsInitialized@vmkernel#nover+0xc stack: 0x0

2020-10-06T00:11:16.465Z cpu1:264535)WARNING: Heartbeat: 767: PCPU 5 didn't have a heartbeat for 7 seconds; *may* be locked up.

2020-10-06T00:11:16.465Z cpu5:262285)ALERT: NMI: 694: NMI IPI: RIPOFF(base):RBP:CS [0xab974(0x420035c00000):0x451a0469bdf8:0xf48] (Src 0x1, CPU5)

2020-10-06T00:11:16.465Z cpu5:262285)0x451a0469bd40:[0x420035cab973]SHA1Transform@vmkernel#nover+0x16c stack: 0xb580e14d071beb39

2020-10-06T00:11:16.465Z cpu5:262285)0x451a0469bde0:[0x420035c213b6]extract_buf@vmkernel#nover+0x83 stack: 0x8

2020-10-06T00:11:16.465Z cpu5:262285)0x451a0469bea0:[0x420035c21b77]extract_entropy_user@vmkernel#nover+0x5c stack: 0x1f

2020-10-06T00:11:16.465Z cpu5:262285)0x451a0469bf00:[0x420035dab142]VmMemCow_PShareUpdateCache@vmkernel#nover+0xab stack: 0x200000

2020-10-06T00:11:16.465Z cpu5:262285)0x451a0469bf70:[0x420035f7bcd0]MemSchedEst_PShareLoop@vmkernel#nover+0x161 stack: 0x0

2020-10-06T00:11:16.465Z cpu5:262285)0x451a0469bfe0:[0x420035f5e2f9]CpuSched_StartWorld@vmkernel#nover+0x82 stack: 0x0

2020-10-06T00:11:16.465Z cpu5:262285)0x451a0469c000:[0x420035cc44c3]Debug_IsInitialized@vmkernel#nover+0xc stack: 0x0
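For anyone triaging a similar log, here is a minimal Python sketch that groups each heartbeat warning with the backtrace frames logged by the same PCPU. The regular expressions are derived only from the lines quoted above, so other ESXi builds may need different patterns. The names `summarize_lockups` and `SAMPLE` are made up for illustration and are not part of any VMware tool.

```python
import re

# Patterns derived only from the log lines quoted in this thread (assumption:
# other ESXi builds may format these messages differently).
HEARTBEAT_RE = re.compile(
    r"^(\S+) cpu\d+:\d+\)WARNING: Heartbeat: \d+: "
    r"PCPU (\d+) didn't have a heartbeat for (\d+) seconds"
)
FRAME_RE = re.compile(
    r"^\S+ cpu(\d+):\d+\)0x[0-9a-f]+:\[0x[0-9a-f]+\](\w+)@vmkernel"
)


def summarize_lockups(lines):
    """Return one dict per heartbeat warning, with the frames that follow it."""
    events = []
    current = None
    for raw in lines:
        line = raw.strip()
        hb = HEARTBEAT_RE.match(line)
        if hb:
            current = {
                "time": hb.group(1),
                "pcpu": int(hb.group(2)),
                "seconds": int(hb.group(3)),
                "frames": [],
            }
            events.append(current)
            continue
        frame = FRAME_RE.match(line)
        # Only keep frames logged by the PCPU named in the latest warning.
        if frame and current is not None and int(frame.group(1)) == current["pcpu"]:
            current["frames"].append(frame.group(2))
    return events


# A few lines copied from the log above, used as sample input.
SAMPLE = """\
2020-10-05T23:45:46.161Z cpu6:264546)WARNING: Heartbeat: 767: PCPU 4 didn't have a heartbeat for 7 seconds; *may* be locked up.
2020-10-05T23:45:46.161Z cpu4:262285)ALERT: NMI: 694: NMI IPI: RIPOFF(base):RBP:CS [0x368bd2(0x420035c00000):0x451a0469b9b0:0xf48] (Src 0x1, CPU4)
2020-10-05T23:45:46.161Z cpu4:262285)0x451a0469b918:[0x420035f68bd1]CpuSched_PcpuLoadGet@vmkernel#nover+0x26 stack: 0x43007f0025a8
2020-10-05T23:45:46.161Z cpu4:262285)0x451a0469b920:[0x420035f69844]CpuSchedMigrateGoodness@vmkernel#nover+0x5e1 stack: 0xff
2020-10-05T23:52:10.098Z cpu1:262848)DVFilter: 6344: Checking disconnected filters for timeouts
2020-10-06T00:10:46.301Z cpu2:262793)WARNING: Heartbeat: 767: PCPU 1 didn't have a heartbeat for 7 seconds; *may* be locked up.
2020-10-06T00:10:46.301Z cpu1:262285)0x451a0469bdd8:[0x420035c33d78]NRandomHwrngRdrand@vmkernel#nover+0x9 stack: 0x0
2020-10-06T00:10:46.301Z cpu1:262285)0x451a0469bde0:[0x420035c21366]extract_buf@vmkernel#nover+0x33 stack: 0x8
"""

if __name__ == "__main__":
    for event in summarize_lockups(SAMPLE.splitlines()):
        print(event["time"], "PCPU", event["pcpu"], "top frame:", event["frames"][0])
```

Listing the top frame per event makes it easier to see whether the stalls always land in the same code path. In the log above they do not: one is in the CPU scheduler, the other two in the entropy path under page sharing.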

evoicefire · ‎12-23-2020

bump

deganl · ‎04-13-2021

I have exactly the same PSOD here.
Did you find a resolution?

It always happens when I try to delete one special VM. So I think this might be a storage firmware problem?

VMware ESXi, 7.0.1, 17168206

continuum · ‎04-13-2021

Check your "one special VM": if hexdump -C <file> fails on any of its files and answers with "bad file descriptor", then that is your problem.
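As a rough illustration of the same idea, the Python sketch below tries to read the first few kilobytes of every file in a VM directory and records any I/O error the OS reports. It assumes a Python interpreter is available wherever the datastore is mounted. The function name `check_readable` and the `probe_bytes` default are made up for this example. It is not a substitute for hexdump: it only probes the start of each file, so it will not catch a bad region deep inside a large flat.vmdk.

```python
import os


def check_readable(directory, probe_bytes=4096):
    """Probe the start of each regular file; map file name -> error text or None."""
    results = {}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue  # skip subdirectories and anything that is not a regular file
        try:
            with open(path, "rb") as handle:
                handle.read(probe_bytes)
            results[name] = None  # the probe read succeeded
        except OSError as exc:
            # e.g. "Bad file descriptor" or "Invalid argument"
            results[name] = exc.strerror or str(exc)
    return results


if __name__ == "__main__":
    import sys

    # Usage: python check_vm_files.py <path to the VM directory>
    for file_name, error in check_readable(sys.argv[1]).items():
        print(file_name, "OK" if error is None else "ERROR: " + error)
```

Any file reported with an error is the first candidate to inspect more closely.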

Ulli

________________________________________________
Do you need support with a VMFS recovery problem ? - send a message via skype "sanbarrow"
I do not support Workstation 16 at this time ...

deganl · ‎04-14-2021

I identified the flat.vmdk as the faulty one.

Hexdump ran for a few hours. It now gives me the following error:
hexdump: Srv-Orbis2-flat.vmdk: Invalid argument.

continuum · ‎04-14-2021

There is no point in running hexdump -C against the flat.vmdk for more than just a few seconds.

With the VM powered off, now run:

vmkfstools -p 0 name-flat.vmdk > /tmp/mapping.txt

If that runs without any error messages and populates the mapping.txt file, your vmdk should be in a good enough state to clone it to a different datastore and then rebuild the current datastore from scratch.

If that does not work, this turns into a recovery project.
In most cases I could probably do that via a remote session. Call me via Skype if you need assistance.

Ulli

_steez · ‎07-28-2021

Hi,

We have the same issue. It happens when a VM is under heavy storage activity (e.g. an ongoing backup with file copy). It does not result in a PSOD, but after a while the datastores time out and the server needs to be restarted to regain access.

When inspecting the logs, there are messages about storage issues alongside these PCPU NMI RIPOFF messages. I'm not sure how to fix this; we have already changed the controller, drivers and firmware (using local storage, VMFS6). In our case the issue is also caused by one special VM.

evoicefire · ‎07-28-2021

Hey,

I ended up resolving this issue. For me it was caused by a bad drive. I was running an M.2 PCIe SSD as my main local storage and boot drive. I reinstalled ESXi on a new drive (SATA this time) and it hasn't crashed in the last 3 months.

It would also make sense to check the VM integrity as described above (although all mine were fine when I tested).

Cheers
