VMware Cloud Community
DavoudTeimouri
Virtuoso

HP DL585 G7 - Unexpected reboot caused by an Uncorrectable Machine Check Exception

Hi,

I have an HP DL585 G7 running ESXi 5.1.

The server rebooted because of an MCE, which I found in the iLO IML. The error reads: Uncorrectable Machine Check Exception (Board 0, Processor 2, APIC ID 0x00000040, Bank 0x00000004, Status 0xF6000000'00070F0F, Address 0x00000019'9C5BFA00, Misc 0x00000000'00000000)

I also found this in vmkernel.log: MCE: 635: Fixed 12 MCE bank/CPU-package ownership settings

The server BIOS is up to date and the power policy is set to "HP Static High Performance Mode".

Can anyone help me find the root cause?

-------------------------------------------------------------------------------------
Davoud Teimouri - https://www.teimouri.net - Twitter: @davoud_teimouri Facebook: https://www.facebook.com/teimouri.net/
1 Solution

Accepted Solutions
DavoudTeimouri
Virtuoso

The answer is published on this page: http://h20565.www2.hp.com/hpsc/doc/public/display?sp4ts.oid=4161627&docId=mmr_kc-0117152&docLocale=e...

3 Replies
NagangoudaPatil
Enthusiast

Check with HP. This is a hardware-related problem, either a motherboard or a CPU issue.

aaronwsmith
Enthusiast

Are you sure the server rebooted because of this MCE? If it did, ESXi would have displayed a purple diagnostic screen (PSOD) reporting exception 18.

VMware KB: Interpreting an ESX/ESXi host purple diagnostic screen

Did you get a screenshot of the PSOD? If not, I assume ASR rebooted your machine. I recommend disabling ASR so that you can capture the PSOD information, give the host enough time to generate its core dump, and make sure a potentially unhealthy ESXi host does not rejoin the cluster automatically.

VMware KB: HP Automatic Server Recovery (ASR) in an ESX environment

Here is a KB to help decode an MCE after a PSOD:

VMware KB: Decoding Machine Check Exception (MCE) output after a purple screen error
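To give a rough idea of what that decoding involves, here is a minimal Python sketch that splits the Status value from the original post into the top-level flag bits of the x86 machine-check status register (MCi_STATUS). The function names `parse_hex` and `decode_mc_status` are my own and purely illustrative. The sketch only covers the architectural bits 63 to 57 and the two 16-bit error-code fields; what those error codes mean for a given bank is vendor- and model-specific, so take that part from the processor vendor's documentation or the KB above.

```python
# Illustrative decoder for the architectural part of an x86 MCi_STATUS value.
# Helper names are hypothetical; bank-specific meaning is NOT decoded here.

# Bit positions of the top-level MCi_STATUS flags.
STATUS_FLAGS = {
    "VAL": 63,    # register contains a valid error
    "OVER": 62,   # overflow: an earlier error was overwritten
    "UC": 61,     # error was uncorrected
    "EN": 60,     # error reporting was enabled for this error
    "MISCV": 59,  # the Misc register holds valid data
    "ADDRV": 58,  # the Address register holds valid data
    "PCC": 57,    # processor context may be corrupt
}


def parse_hex(text: str) -> int:
    """Parse a hex value written in the iLO style, e.g. 0xF6000000'00070F0F."""
    return int(text.replace("'", ""), 16)


def decode_mc_status(status: int) -> dict:
    """Split a 64-bit status value into flags and the two error-code fields."""
    decoded = {name: bool((status >> bit) & 1) for name, bit in STATUS_FLAGS.items()}
    decoded["mca_error_code"] = status & 0xFFFF               # bits 15:0
    decoded["model_specific_code"] = (status >> 16) & 0xFFFF  # bits 31:16
    return decoded


if __name__ == "__main__":
    status = parse_hex("0xF6000000'00070F0F")
    for field, value in decode_mc_status(status).items():
        if isinstance(value, bool):
            print(f"{field:20s} {value}")
        else:
            print(f"{field:20s} 0x{value:04X}")
```

For the value in the original post this reports VAL, OVER, UC, EN, ADDRV and PCC as set and MISCV as clear, with an MCA error code of 0x0F0F and a model-specific code of 0x0007. That matches the IML entry: the error was uncorrected, the Address field is valid, and the Misc field is not.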

Some MCEs on HP servers are benign or correctable, though that does not appear to be the case here. As an example of the benign kind: depending on which CPU you have installed, the BIOS may have a setting under Power Management -> Advanced Power Management Options -> SMI Link Power Management. Per HP, this setting: "Allows the user to disable power management on the Intel Scalable Memory Interconnect (SMI) link. Disabling this functionality will increase the server’s idle power usage. While corrected events are considered normal and are expected on the SMI Link and do not affect operation of the platform, the occurrence of these corrected events can be reduced significantly by disabling SMI Link Power Management. These events are logged as correctable Machine Check Bank 8 and 9 errors in the operating system logs for certain operating systems. While these events can be ignored, SMI Link Power Management can be disabled to reduce or prevent their occurrence if desired."

Hope this information is helpful.
