VMware Cloud Community
jpoling
Enthusiast

Machine Check Error?

I am seeing the following in the vmkwarning log on one of our ESX 3.5 U1 servers:

Jun 2 03:00:31 esx01 vmkernel: 31:03:34:47.573 cpu7:2967)WARNING: MCE: 196: Machine Check Error: Bank 0, Status cc00000120040189

Jun 2 03:00:31 esx01 vmkernel: 31:03:34:47.573 cpu7:2967)WARNING: MCE: 209: Machine Check Error: Bank 0, Misc 000140002b000aa0

Jun 2 03:00:31 esx01 vmkernel: 31:03:34:47.573 cpu7:2967)WARNING: MCE: 230: Machine Check Error: Bank 0, Addr 000000027676f680, Valid TRUE

Jun 2 03:00:31 esx01 vmkernel: 31:03:34:47.573 cpu7:2967)WARNING: MCE: 196: Machine Check Error: Bank 2, Status 9000000000000153

As far as I can tell the host has continued to run. There are no hardware errors being reported. . .

What do these warnings mean?

Thanks,

Jeff

8 Replies
dominic7
Virtuoso

Almost always MCE = Hardware failure. It looks like you might have a problem with some of your RAM.
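
For anyone who wants to read more out of the Status values than "something happened", here is a minimal sketch that splits one into the architectural flag bits of an x86 machine-check bank status register. It assumes the vmkernel prints the raw 64-bit register value, which the thread does not confirm. The function name `decode_mce_status` is made up for illustration. The flag positions (bits 63 down to 57) are the standard ones; what the two low 16-bit code fields mean depends on the CPU vendor and model, so the sketch only extracts them.

```python
# Architectural flag bits of an x86 machine-check bank status register.
# Assumption: the "Status" value in the vmkernel log is the raw register.
FLAGS = {
    "VAL": 63,    # register holds a valid error
    "OVER": 62,   # an earlier error in this bank was overwritten
    "UC": 61,     # error was NOT corrected by hardware
    "EN": 60,     # reporting for this error was enabled
    "MISCV": 59,  # the bank's Misc register holds valid data
    "ADDRV": 58,  # the bank's Addr register holds valid data
    "PCC": 57,    # processor context may be corrupt
}


def decode_mce_status(status_hex):
    """Split a hex status string into flags and the two low code fields."""
    status = int(status_hex, 16)
    flags = {name: bool((status >> bit) & 1) for name, bit in FLAGS.items()}
    return {
        "flags": flags,
        "mca_error_code": status & 0xFFFF,              # bits 15:0
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
    }


if __name__ == "__main__":
    result = decode_mce_status("cc00000120040189")
    set_flags = [name for name, on in result["flags"].items() if on]
    print("flags set:", set_flags)
    print("MCA error code: 0x%04x" % result["mca_error_code"])
```

For the Bank 0 value above (cc00000120040189) this gives VAL, OVER, MISCV and ADDRV set, with UC and PCC clear. That matches the log, which prints Misc and Addr lines for Bank 0, and it suggests a corrected error rather than an uncorrected one, which would fit the host continuing to run.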

jpoling
Enthusiast

I'll pursue that. . .it is intriguing because the IBM RSA log and other hardware details do not show a problem. . .

We've had RAM issues in the past and they caused the system to reboot (and HA triggered, etc.)

mcowger
Immortal

They can also be caused by needing a BIOS update.

--Matt VCDX #52 blog.cowger.us
Col_Flashman
Contributor

I'm getting this on a Dell M600 blade:

Jul 24 23:28:07 VMSERVER vmkernel: 0:13:12:32.359 cpu6:1030)WARNING: MCE: 196: Machine Check Error: Bank 3, Status 942000560001010a

Jul 24 23:28:07 VMSERVER vmkernel: 0:13:12:32.359 cpu6:1030)WARNING: MCE: 230: Machine Check Error: Bank 3, Addr 0000000030056180, Valid TRUE

Jul 26 19:33:00 VMSERVER vmkernel: 2:09:17:17.358 cpu6:1166)WARNING: MCE: 196: Machine Check Error: Bank 3, Status 8020008000000000

I've got five of these blades and it is only happening on this one. OpenManage is clean as a whistle. Kind of sucky. I'm not risking it, so I have put the server in maintenance mode, as the last thing I need is OS corruption. I know what VMware support is going to say: "Blame Dell." Dell is going to be difficult as nothing is appearing in OpenManage. I'll push them to replace the RAM once I have confirmed all the firmware is up to date.
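
When making the case to a hardware vendor, it can help to show whether the warnings keep coming from the same CPU and bank. Below is a minimal sketch that counts the Status lines per (cpu, bank) pair. The regular expression is written against the line format pasted in this thread only, and the names `MCE_LINE` and `summarize_mce` are hypothetical; other ESX builds may log a different format.

```python
import re
from collections import Counter

# Pattern based only on the vmkernel lines quoted in this thread.
MCE_LINE = re.compile(
    r"cpu(?P<cpu>\d+):\d+\)WARNING: MCE: \d+: Machine Check Error: "
    r"Bank (?P<bank>\d+), (?P<field>Status|Misc|Addr) (?P<value>[0-9a-fA-F]+)"
)


def summarize_mce(lines):
    """Count machine-check events per (cpu, bank).

    Only 'Status' lines are counted, so the Misc and Addr lines that
    accompany the same event do not inflate the totals.
    """
    counts = Counter()
    for line in lines:
        match = MCE_LINE.search(line)
        if match and match.group("field") == "Status":
            counts[(int(match.group("cpu")), int(match.group("bank")))] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python mce_summary.py < vmkwarning
    for (cpu, bank), n in sorted(summarize_mce(sys.stdin).items()):
        print("cpu%d bank %d: %d event(s)" % (cpu, bank, n))
```

A cluster of events on a single (cpu, bank) pair points at one component, while events scattered across CPUs and banks look more like a platform-wide cause such as a BIOS or power-management setting.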

If anyone can shed another light on this, I would be appreciative.

jpoling
Enthusiast

I've seen the errors at one point in time on one IBM server. . .it has not happened again. VMware support saw the errors while investigating another issue and emphatically told me I had a hardware issue. . .but IBM's tools show nothing. . .so I've let the system run, and it has run without issue.

Just my experience. . .

jeff

balacs
Hot Shot

You can also try running the memory diagnostics. Boot to the 'utility partition'; the memory diags should be there. If you have deleted the utility partition, you can download the tools from support.dell.com. Here is a link I found.

http://support.dell.com/support/downloads/download.aspx?c=us&l=en&s=gen&releaseid=R176884&SystemID=P...

Bala

Dell Inc

Col_Flashman
Contributor

Thanks

I forgot about those. I'm running the memory diags as we speak.

Cheers

Goliath222
Contributor

Hi,

I had the same error on my IBM LS41 (7972) blades. I changed a setting in the advanced options of the BIOS called "AMD CPU Power control" to "Disable". Since then, no more errors have occurred. Don't forget to give out some of your points for this call.

Regards

Oliver.
