Hi,
I'm running into an unusual issue here. Every 2 days or so my system hangs (crashes?): no PSOD or anything, just a blank screen until I force a reboot.
This is an ESXi whitebox with a Ryzen 2600 CPU, an ASRock Rack X470D4U and a 16 GB Kingston DIMM (on the QVL for the motherboard).
I am running ESXi-7.0.0-16324942-standard, which is the latest build that I can find.
memtest comes back clean and I have made sure that RDRAND is returning sensible values, so I'm at a bit of a loss. Below is the vmkernel.log output from when the issue occurs. Any hints would be great.
2020-10-05T23:45:46.161Z cpu6:264546)WARNING: Heartbeat: 767: PCPU 4 didn't have a heartbeat for 7 seconds; *may* be locked up.
2020-10-05T23:45:46.161Z cpu4:262285)ALERT: NMI: 694: NMI IPI: RIPOFF(base):RBP:CS [0x368bd2(0x420035c00000):0x451a0469b9b0:0xf48] (Src 0x1, CPU4)
2020-10-05T23:45:46.161Z cpu4:262285)0x451a0469b918:[0x420035f68bd1]CpuSched_PcpuLoadGet@vmkernel#nover+0x26 stack: 0x43007f0025a8
2020-10-05T23:45:46.161Z cpu4:262285)0x451a0469b920:[0x420035f69844]CpuSchedMigrateGoodness@vmkernel#nover+0x5e1 stack: 0xff
2020-10-05T23:45:46.161Z cpu4:262285)0x451a0469b9c0:[0x420035f6b30c]CpuSched_VcpuMigrateBestPcpu@vmkernel#nover+0x4f5 stack: 0x27a043797c562
2020-10-05T23:45:46.161Z cpu4:262285)0x451a0469bcf0:[0x420035f6b7bd]CpuSched_VcpuWakeupMigrateUnified@vmkernel#nover+0x5e stack: 0x27a043797c562
2020-10-05T23:45:46.161Z cpu4:262285)0x451a0469bd20:[0x420035f55b0e]CpuSchedVcpuMakeReady@vmkernel#nover+0xdf stack: 0x451a151a1900
2020-10-05T23:45:46.161Z cpu4:262285)0x451a0469bd40:[0x420035f55bbc]CpuSchedWorldWakeup@vmkernel#nover+0x8d stack: 0x27a043797c562
2020-10-05T23:45:46.161Z cpu4:262285)0x451a0469bd70:[0x420035f55e31]CpuSchedForceWakeupInt@vmkernel#nover+0x82 stack: 0x3c
2020-10-05T23:45:46.161Z cpu4:262285)0x451a0469bd90:[0x420035d0b750]Timer_BHHandler@vmkernel#nover+0x1f9 stack: 0x4519c0200560
2020-10-05T23:45:46.161Z cpu4:262285)0x451a0469be20:[0x420035cbb531]BH_Check@vmkernel#nover+0x6e stack: 0x0
2020-10-05T23:45:46.161Z cpu4:262285)0x451a0469bea0:[0x420035f58079]CpuSched_SafePreemptionPoint@vmkernel#nover+0x16 stack: 0x1f
2020-10-05T23:52:10.098Z cpu1:262848)DVFilter: 6344: Checking disconnected filters for timeouts
2020-10-06T00:02:09.154Z cpu7:262848)DVFilter: 6344: Checking disconnected filters for timeouts
2020-10-06T00:10:46.301Z cpu2:262793)WARNING: Heartbeat: 767: PCPU 1 didn't have a heartbeat for 7 seconds; *may* be locked up.
2020-10-06T00:10:46.301Z cpu1:262285)ALERT: NMI: 694: NMI IPI: RIPOFF(base):RBP:CS [0x33d79(0x420035c00000):0x451a0469bdf8:0xf48] (Src 0x1, CPU1)
2020-10-06T00:10:46.301Z cpu1:262285)0x451a0469bdd8:[0x420035c33d78]NRandomHwrngRdrand@vmkernel#nover+0x9 stack: 0x0
2020-10-06T00:10:46.301Z cpu1:262285)0x451a0469bde0:[0x420035c21366]extract_buf@vmkernel#nover+0x33 stack: 0x8
2020-10-06T00:10:46.301Z cpu1:262285)0x451a0469bea0:[0x420035c21b77]extract_entropy_user@vmkernel#nover+0x5c stack: 0x451a0469bef0
2020-10-06T00:10:46.301Z cpu1:262285)0x451a0469bf00:[0x420035dab142]VmMemCow_PShareUpdateCache@vmkernel#nover+0xab stack: 0x100000
2020-10-06T00:10:46.301Z cpu1:262285)0x451a0469bf70:[0x420035f7bcd0]MemSchedEst_PShareLoop@vmkernel#nover+0x161 stack: 0x0
2020-10-06T00:10:46.301Z cpu1:262285)0x451a0469bfe0:[0x420035f5e2f9]CpuSched_StartWorld@vmkernel#nover+0x82 stack: 0x0
2020-10-06T00:10:46.301Z cpu1:262285)0x451a0469c000:[0x420035cc44c3]Debug_IsInitialized@vmkernel#nover+0xc stack: 0x0
2020-10-06T00:11:16.465Z cpu1:264535)WARNING: Heartbeat: 767: PCPU 5 didn't have a heartbeat for 7 seconds; *may* be locked up.
2020-10-06T00:11:16.465Z cpu5:262285)ALERT: NMI: 694: NMI IPI: RIPOFF(base):RBP:CS [0xab974(0x420035c00000):0x451a0469bdf8:0xf48] (Src 0x1, CPU5)
2020-10-06T00:11:16.465Z cpu5:262285)0x451a0469bd40:[0x420035cab973]SHA1Transform@vmkernel#nover+0x16c stack: 0xb580e14d071beb39
2020-10-06T00:11:16.465Z cpu5:262285)0x451a0469bde0:[0x420035c213b6]extract_buf@vmkernel#nover+0x83 stack: 0x8
2020-10-06T00:11:16.465Z cpu5:262285)0x451a0469bea0:[0x420035c21b77]extract_entropy_user@vmkernel#nover+0x5c stack: 0x1f
2020-10-06T00:11:16.465Z cpu5:262285)0x451a0469bf00:[0x420035dab142]VmMemCow_PShareUpdateCache@vmkernel#nover+0xab stack: 0x200000
2020-10-06T00:11:16.465Z cpu5:262285)0x451a0469bf70:[0x420035f7bcd0]MemSchedEst_PShareLoop@vmkernel#nover+0x161 stack: 0x0
2020-10-06T00:11:16.465Z cpu5:262285)0x451a0469bfe0:[0x420035f5e2f9]CpuSched_StartWorld@vmkernel#nover+0x82 stack: 0x0
2020-10-06T00:11:16.465Z cpu5:262285)0x451a0469c000:[0x420035cc44c3]Debug_IsInitialized@vmkernel#nover+0xc stack: 0x0
bump
I have exactly the same PSOD here.
Did you find a resolution?
It always happens when I try to delete one special VM, so I think this might be a storage firmware problem?
VMware ESXi, 7.0.1, 17168206
Check your "one special VM" - if hexdump -C <file> fails on any of its files and answers with "bad file descriptor", then that is your problem.
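To run that kind of check over every file in the VM's directory, a loop along these lines could be used. This is only a minimal sketch: the function name check_vm_files is made up for illustration, and it swaps hexdump -C for a dd read of the first 1 MiB of each file, so that it finishes quickly and prints nothing but a status per file. It assumes that a file the filesystem refuses to read fails on that first read.

```shell
#!/bin/sh
# Sketch: try to read the start of every file in a VM directory and
# report the ones that cannot be read.
# check_vm_files is a hypothetical helper name, not an ESXi command.
check_vm_files() {
    dir="$1"
    bad=0
    for f in "$dir"/*; do
        [ -f "$f" ] || continue      # skip subdirectories and empty globs
        # Read at most the first 1 MiB; dd's statistics go to stderr and
        # are only shown when the read fails.
        if err=$(dd if="$f" of=/dev/null bs=1024 count=1024 2>&1); then
            echo "OK   $f"
        else
            echo "FAIL $f: $err"
            bad=$((bad + 1))
        fi
    done
    echo "$bad unreadable file(s)"
    [ "$bad" -eq 0 ]                 # non-zero exit status if anything failed
}

# Example call (the path is a placeholder):
# check_vm_files /vmfs/volumes/datastore1/myvm
```

Any file reported as FAIL is a candidate for the "bad file descriptor" case.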
Ulli
I identified the flat.vmdk as the faulty one.
Hexdump ran for a few hours. It now gives me the following error:
hexdump: Srv-Orbis2-flat.vmdk: Invalid argument.
There is no point in running hexdump -C against the flat.vmdk for more than a few seconds.
With the VM powered off, now run
vmkfstools -p 0 name-flat.vmdk > /tmp/mapping.txt
If that runs without any error messages and populates the mapping.txt file, your vmdk should be in a good enough state to clone it to a different datastore. After that you can rebuild the current datastore from scratch.
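The "map first, clone only if the mapping worked" order can be sketched as a small script. Treat it as an illustration rather than a tested procedure: map_then_clone and the VMKFSTOOLS override are hypothetical names, all paths are placeholders, and it assumes vmkfstools reports failure through a non-zero exit status, so any error text on the console should still be read by eye. The clone uses vmkfstools -i, which takes the descriptor file (name.vmdk), not the -flat file.

```shell
#!/bin/sh
# Sketch of "map first, clone only if the mapping succeeds".
# map_then_clone is a hypothetical helper; all paths are placeholders.
# VMKFSTOOLS can be overridden for a dry run on a machine without ESXi.
VMKFSTOOLS="${VMKFSTOOLS:-vmkfstools}"

map_then_clone() {
    flat="$1"          # e.g. name-flat.vmdk (the data extent)
    descriptor="$2"    # e.g. name.vmdk (the small descriptor file)
    target="$3"        # descriptor path on the other datastore
    mapping="${4:-/tmp/mapping.txt}"

    # Step 1: dump the mapping, as in the command above.
    if ! "$VMKFSTOOLS" -p 0 "$flat" > "$mapping"; then
        echo "mapping failed, not cloning" >&2
        return 1
    fi
    # Step 2: the mapping file must actually contain something.
    if [ ! -s "$mapping" ]; then
        echo "mapping file is empty, not cloning" >&2
        return 1
    fi
    # Step 3: clone the disk to the other datastore as a thin disk.
    "$VMKFSTOOLS" -i "$descriptor" "$target" -d thin
}

# Example call (placeholder paths):
# map_then_clone name-flat.vmdk name.vmdk /vmfs/volumes/other-ds/name/name.vmdk
```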
If that does not work, then this turns into a recovery project.
In most cases I could probably do that via a remote session - call me via Skype if you need assistance.
Ulli
Hi,
We have the same issue - it happens when a VM is under heavy storage activity (e.g. an ongoing backup with file copy). It does not result in a PSOD, but after a while the datastores time out and the server needs to be restarted to regain access.
The logs show messages about storage issues alongside these PCPU NMI RIPOFF messages. We are not sure how to fix this and have already changed the controller, drivers and firmware (using local storage, VMFS6). The issue is also triggered by one special VM.
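For comparing logs between hosts, the heartbeat and NMI entries can be condensed with standard text tools. The following is a rough sketch based only on the log format quoted at the top of this thread: summarize_heartbeats is a made-up helper name, and the patterns may need adjusting for other builds. It counts the heartbeat warnings per PCPU and prints the first backtrace frame after each NMI alert, which is the function the CPU was in when it was interrupted.

```shell
#!/bin/sh
# Sketch: summarise heartbeat warnings and NMI backtraces in a vmkernel.log.
# summarize_heartbeats is a hypothetical helper name.
summarize_heartbeats() {
    log="$1"
    # Count "didn't have a heartbeat" warnings per physical CPU.
    grep "didn't have a heartbeat" "$log" \
        | sed 's/.*PCPU \([0-9][0-9]*\) .*/PCPU \1/' \
        | sort | uniq -c
    # Print the first backtrace frame that follows each NMI alert.
    awk '
        /ALERT: NMI:/ { want = 1; next }
        want && /@vmkernel/ {
            sub(/.*\]/, "")    # drop timestamp, world ID and addresses
            sub(/@.*/, "")     # drop module name, offset and stack value
            print "top frame: " $0
            want = 0
        }
    ' "$log"
}

# Example call:
# summarize_heartbeats /var/log/vmkernel.log
```

If the top frames differ from one lockup to the next, as they do in the first post, that hints at the CPUs being stalled by something external rather than at one faulty code path, although this alone does not prove a storage fault.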
Hey,
I ended up resolving this issue. For me it was caused by a bad drive. I was running an M.2 PCIe SSD as my main local storage and boot drive. I reinstalled ESXi on a new drive (SATA this time) and it hasn't crashed in the last 3 months.
It would also make sense to check the VM integrity as described above (although all mine were fine when I tested).
Cheers