Solved: Re: Help diagnosing purple screen

davidcrowder · ‎04-07-2019

I could use some assistance diagnosing a purple screen. This seems to indicate an issue with the Physical CPU #30... beyond that, I'm not seeing why...

Any help is greatly appreciated.

Thanks in advance.

daphnissov · ‎04-07-2019

It won't say why exactly. MCEs almost always indicate physical hardware failure, and most often I see that something on the mainboard has failed. The best way to know for sure is to run hardware diagnostics from your vendor to pinpoint the issue.

------------------
How to Ask for Help on Tech Forums
https://neonmirrors.net

asajm · ‎04-07-2019

Hi davidcrowder

Can you check

VMware Knowledge Base

If you think your queries have been answered,
please mark this response as "Solution" or give a "Kudo".
ASAJM

TheBobkin · ‎04-07-2019

Hello David,

System has encountered a Hardware Error - Please contact the hardware vendor

If you can reproduce the issue readily and/or under load, you may be able to narrow down which component is broken or failing by checking whether you get a consistent backtrace, or whether specific cores are always indicated (e.g. always cores from one CPU but not the other on a dual-socket system). Then again, this could also indicate a slot or other board failure, as Chip said above, or potentially even the memory bank local to that socket.

Either way, switching components around would likely be the only way to narrow it down further, e.g. checking whether the fault always follows a CPU when it is moved to the other socket. It is probably a good idea to check your out-of-band management and call your hardware vendor before doing any of the above, of course.
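
If it helps, here is a rough sketch of the "are the same cores always indicated" check. It assumes you have noted the physical CPU number from each purple screen by hand, and that PCPUs are numbered contiguously per socket with the same count on each socket; both are assumptions you should verify for your host. The function name, the sample numbers, and the `pcpus_per_socket` value are made up for illustration.

```python
from collections import Counter

def tally_by_socket(pcpus, pcpus_per_socket):
    """Count crash reports per socket (0-based socket index).

    Assumes contiguous PCPU numbering per socket, which is an
    assumption about the host, not something read from ESXi.
    """
    return Counter(pcpu // pcpus_per_socket for pcpu in pcpus)

# Hypothetical PCPU numbers written down from several purple screens.
observed = [30, 22, 27, 30, 19]

# Hypothetical layout: 16 PCPUs per socket on a dual-socket host.
counts = tally_by_socket(observed, pcpus_per_socket=16)
print(counts.most_common())
```

If nearly all the hits land on one socket index, that points at the CPU, its slot, or the memory local to it, which is where swapping components comes in.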

Bob

adgate · ‎04-08-2019

What is the hardware build? Have you checked whether it is supported in the VMware Compatibility Guide?

serveradminist2 · ‎04-08-2019

Which hardware are you using?

Arthos · ‎04-09-2019

MCEs point to hardware errors. Besides running diagnostics, I suspect it is related to CPU power states, as mentioned in the trace. To provide more details, I need the hardware/server information.

Please upvote.

Thanks.

davidcrowder · ‎04-10-2019

Apologies for the delayed response. The ESXi crashdumps/logs pointed to various processes & cores, but were consistent in pointing to physical CPU-socket #2. The system management gave the hardware a clean bill of health... however, further digging in the system management logs showed assert errors in a memory module on CPU #2. Why that didn't trigger alerts in system management, and subsequently monitoring software... ugh. At any rate, we had the offending memory module replaced and everything is back up and running. Thank you all for pointing me in the correct direction.

Edit: It did not take this long to fix it -- it was fixed the same day. It took this long to update because, for whatever reason, the VMware forum/community website would not allow me to update the post. Thanks again, all!

serveradminist2 · ‎04-10-2019

This is a memory issue; please replace the module ASAP.
