I could use some assistance diagnosing a purple screen. This seems to indicate an issue with Physical CPU #30, but beyond that I'm not seeing why...
Any help is greatly appreciated.
Thanks in advance.
It won't say why exactly. MCEs almost always indicate physical hardware failure, and most often I see that something on the mainboard has failed. The best way to know for sure is to run hardware diagnostics from your vendor to pinpoint the issue.
Hi davidcrowder
Can you check
Hello David,
System has encountered a Hardware Error - Please contact the hardware vendor
If you can reproduce the issue readily and/or under load, you may be able to narrow down which component is failing by checking whether you get a consistent backtrace or whether specific cores are always indicated (e.g. always cores from one CPU but not the other in a dual-socket system). Then again, that pattern could also indicate a slot or other board failure, as Chip said above, or potentially even the memory bank local to that socket.
Either way, switching components around would likely be the only way to narrow it down further, e.g. checking whether the fault always follows a CPU when it is swapped to the other socket. It is probably a good idea to check your out-of-band management and call your hardware vendor before doing any of the above, of course.
Bob
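The "are the same cores always indicated" check suggested above can be sketched roughly as follows. This is a minimal illustration, not a VMware tool: the function names, the `pcpus_per_socket` value, and the sample PCPU numbers are all hypothetical, and it assumes physical CPU numbers are assigned contiguously per socket, which you would need to confirm for your own hardware. You would collect the PCPU numbers by hand from each purple screen or crashdump.

```python
from collections import Counter
from typing import Iterable, Optional


def socket_of(pcpu: int, pcpus_per_socket: int) -> int:
    """Map a physical CPU number to a zero-based socket index.

    ASSUMPTION: PCPUs are numbered contiguously per socket
    (0..N-1 on socket 0, N..2N-1 on socket 1, and so on).
    """
    if pcpu < 0 or pcpus_per_socket <= 0:
        raise ValueError("pcpu must be >= 0 and pcpus_per_socket must be > 0")
    return pcpu // pcpus_per_socket


def dominant_socket(
    pcpus: Iterable[int], pcpus_per_socket: int, threshold: float = 0.8
) -> Optional[int]:
    """Return the socket implicated in at least `threshold` of the crashes,
    or None if no single socket dominates (or no crashes were supplied)."""
    counts = Counter(socket_of(p, pcpus_per_socket) for p in pcpus)
    if not counts:
        return None
    socket, hits = counts.most_common(1)[0]
    return socket if hits / sum(counts.values()) >= threshold else None


if __name__ == "__main__":
    # Hypothetical PCPU numbers noted from five separate crashes,
    # on a hypothetical host with 24 PCPUs per socket.
    crashes = [30, 27, 41, 30, 35]
    print(dominant_socket(crashes, pcpus_per_socket=24))
```

A result pointing at one socket only tells you where to look. As noted above, the CPU, its slot, or the memory local to that socket could each be the actual culprit.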
What is the hardware build? Have you checked whether it is supported in the VMware Compatibility Guide?
Which hardware are you using?
MCEs point to hardware errors. Besides running diagnostics, I suspect this is related to the CPU power states mentioned in the trace. To provide more details, I need the hardware/server information.
Please upvote.
Thanks.
Apologies for the delayed response. The ESXi crashdumps/logs pointed to various processes & cores, but were consistent in pointing to physical CPU-socket #2. The system management gave the hardware a clean bill of health... however, further digging in the system management logs showed assert errors in a memory module on CPU #2. Why that didn't trigger alerts in system management, and subsequently monitoring software... ugh. At any rate, we had the offending memory module replaced and everything is back up and running. Thank you all for pointing me in the correct direction.
Edit: It did not take this long to fix; it was fixed the same day. It took this long to update because, for whatever reason, the VMware forum/community website would not allow me to update the post. Thanks again, all!
This is a memory issue; please replace the module ASAP.