Hi,
I have an HP DL585 G7 with ESXi 5.1 installed.
The server rebooted because of an MCE, which I found in the iLO IML. The error is: Uncorrectable Machine Check Exception (Board 0, Processor 2, APIC ID 0x00000040, Bank 0x00000004, Status 0xF6000000'00070F0F, Address 0x00000019'9C5BFA00, Misc 0x00000000'00000000)
I also found this in vmkernel.log: MCE: 635: Fixed 12 MCE bank/CPU-package ownership settings
My server BIOS is up to date and the power policy is "HP Static High Performance Mode".
Can anyone help me find the root cause?
The answer is published on this page: http://h20565.www2.hp.com/hpsc/doc/public/display?sp4ts.oid=4161627&docId=mmr_kc-0117152&docLocale=e...
Check with HP. This is a hardware-related problem: a motherboard or CPU issue.
Are you sure the server rebooted because of this MCE? That would imply ESXi displayed a purple diagnostic screen showing exception 18.
VMware KB: Interpreting an ESX/ESXi host purple diagnostic screen
Did you get a screenshot of the PSOD? If not, I assume ASR rebooted your machine. I recommend ASR be disabled so you can capture the PSOD information, allow the host sufficient time to generate its core dump, and ensure a potentially unhealthy ESXi host does not rejoin the cluster automatically.
VMware KB: HP Automatic Server Recovery (ASR) in an ESX environment
Here is a KB to help decode an MCE after a PSOD:
VMware KB: Decoding Machine Check Exception (MCE) output after a purple screen error
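As a rough first pass before working through the KB, the Status value from the IML entry can be split into its flag bits. Below is a minimal Python sketch, assuming the standard x86 MCi_STATUS layout for the top flag bits. The function name decode_mci_status and the field names are my own, and the meaning of the two code fields is vendor-specific, so check them against the processor documentation rather than taking this as a diagnosis.

```python
# Flag bits of an x86 MCi_STATUS register (assumed standard layout).
FLAGS = {
    "VAL": 63,    # register contains a valid error
    "OVER": 62,   # overflow: an earlier error was overwritten
    "UC": 61,     # uncorrected error
    "EN": 60,     # error reporting was enabled
    "MISCV": 59,  # MISC register holds valid data
    "ADDRV": 58,  # ADDR register holds a valid address
    "PCC": 57,    # processor context may be corrupt
}


def decode_mci_status(text):
    """Decode a status string such as 0xF6000000'00070F0F (hypothetical helper)."""
    # The IML prints the 64-bit value with an apostrophe between the halves.
    value = int(text.replace("'", ""), 16)
    set_flags = [name for name, bit in FLAGS.items() if (value >> bit) & 1]
    return {
        "flags": set_flags,
        "error_code": value & 0xFFFF,                    # bits 15:0
        "model_specific_code": (value >> 16) & 0xFFFF,   # bits 31:16, vendor-defined
    }


if __name__ == "__main__":
    result = decode_mci_status("0xF6000000'00070F0F")
    print("flags set:", ", ".join(result["flags"]))
    print("error code:", hex(result["error_code"]))
    print("model-specific code:", hex(result["model_specific_code"]))
```

For the value in your IML entry this reports VAL, OVER, UC, EN, ADDRV and PCC as set, with MISCV clear. That matches what the log shows: an uncorrected error with a valid Address and an empty Misc field.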
Some MCEs on HP servers are benign or correctable, though in your case that doesn't appear to be so. As an example of the benign kind: in the BIOS, depending on which CPU you have installed, there may be a setting under Power Management -> Advanced Power Management Options -> SMI Link Power Management. Per HP for this setting: "Allows the user to disable power management on the Intel Scalable Memory Interconnect (SMI) link. Disabling this functionality will increase the server’s idle power usage. While corrected events are considered normal and are expected on the SMI Link and do not affect operation of the platform, the occurrence of these corrected events can be reduced significantly by disabling SMI Link Power Management. These events are logged as correctable Machine Check Bank 8 and 9 errors in the operating system logs for certain operating systems. While these events can be ignored, SMI Link Power Management can be disabled to reduce or prevent their occurrence if desired."
Hope this information is helpful.