VMware Cloud Community
mrbiggles
Contributor

Machine Check Exception 578

Specs

Intel 5000PSL mainboard, 16GB of FB-DIMM RAM, Adaptec 5085 SAS card

15.4K Cheetah HDDs

1x RAID 10 (the VM is on this)

1x RAID 1

2x Xeon 54xx, 12MB cache

I've just put a new server on site. It had been running flawlessly for about a week.

Now it comes up with a pink screen at odd times of the day, usually after 6pm and before 9:30am. It has not crashed during the day, which is when it is under its heaviest load, and that seems odd to me.

The pink screen (PSOD, as I have just found out) shows the following main errors (among about half a screen of hex/code):

Release build - 82663

Machine check exception: unable to continue

CPU6: 1076) MCE: 578 Machine check exception

CPU7: 1083) MCE: 578 Machine check exception

0: 1024/console

1:10255/idle1

2:1085/vmware-vm

3:1074/vmware-vm

4: 1084:vmm1:vint

5:1081/worker#0

6:1076/vmm1:lrs0

*7:1083/vmm0:virt

I am guessing that the numbers mean a specific cpu core and what they were doing at the time of the crash.
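To check that guess, here is a minimal Python sketch that lines up the two MCE lines with the per-CPU list. It assumes the number after each CPU index is the ID of whatever was scheduled on that core, and it makes no assumption about what the leading asterisk means beyond recording it. The names `parse_psod` and `report` are made up for illustration, and the sample text is simply the lines transcribed above (with their inconsistent spacing and separators).

```python
import re

# Lines as transcribed from the pink screen above; spacing and separators vary.
PSOD_TEXT = """\
CPU6: 1076) MCE: 578 Machine check exception
CPU7: 1083) MCE: 578 Machine check exception
0: 1024/console
1:10255/idle1
2:1085/vmware-vm
3:1074/vmware-vm
4: 1084:vmm1:vint
5:1081/worker#0
6:1076/vmm1:lrs0
*7:1083/vmm0:virt
"""

# "CPU6: 1076) MCE: 578 ..." -> cpu index, id, MCE code
MCE_RE = re.compile(r"^CPU(\d+):\s*(\d+)\)\s*MCE:\s*(\d+)")
# "6:1076/vmm1:lrs0" or "4: 1084:vmm1:vint" -> optional star, cpu index, id, name
WORLD_RE = re.compile(r"^(\*?)(\d+):\s*(\d+)[/:](.+)$")


def parse_psod(text):
    """Return ({cpu: id} for MCE lines, {cpu: (id, name, starred)} for the list)."""
    mce, worlds = {}, {}
    for line in text.splitlines():
        line = line.strip()
        m = MCE_RE.match(line)
        if m:
            mce[int(m.group(1))] = int(m.group(2))
            continue
        m = WORLD_RE.match(line)
        if m:
            worlds[int(m.group(2))] = (
                int(m.group(3)),
                m.group(4).strip(),
                m.group(1) == "*",
            )
    return mce, worlds


def report(text):
    """For each CPU with an MCE, look up what the list says was running there."""
    mce, worlds = parse_psod(text)
    rows = []
    for cpu, ident in sorted(mce.items()):
        listed = worlds.get(cpu)
        # Only trust the name if the IDs on both lines agree.
        name = listed[1] if listed and listed[0] == ident else "unknown"
        rows.append((cpu, ident, name))
    return rows


if __name__ == "__main__":
    for cpu, ident, name in report(PSOD_TEXT):
        print(f"CPU{cpu}: id {ident} -> {name}")
```

On the lines above, both IDs in the MCE lines match the entries listed for the same cores, which is consistent with the reading that each row is one core and what it was running at the time of the crash.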

The problem seems to start at CPU6, which is running my only VM. While I know it all points to a hardware problem, is there a chance that software could be causing the failure?

From reading the forums, someone suggested moving the RAID controller to a different slot, which I will try this evening, and downgrading to a 53xx-series CPU, which I will also try.

Could it be something as simple as the CPU getting too hot? That would be weird, because it only seems to crash out of hours and at night (maybe they turn their AC off!).

Something that has thrown me a bit is that it only started playing up after I installed backup software on the ESX server. First I installed Acronis on the guest; the following day I got a PSOD, so I automatically associated the problem with installing the backup software.

I then uninstalled Acronis and expected the problem to disappear too.

I then installed ESX express and ran the backup successfully, but the following day the client walked into another pink screen.

It is now a regular event, happening once or twice in an evening, while the server stays rock solid during working hours (after 9:30am).

Any advice would be welcomed.

1 Reply
Emil1
Contributor

Hello MrBiggles: I saw that your question has gone unanswered since Aug 7th. I wanted to let you know that today I experienced exactly the same error on a totally different server.

We installed and tested ESXi on a VMware-supported ASUS server. In fact, this is the only ASUS server on the VMware supported list. There is currently no RAID running on the server; that is coming later. For now it is a 'plain Jane' dual quad-core Intel server with 24GB RAM and one 80GB Quantum SATA II drive. We received exactly the same errors as you reported. VMware support says it may need a BIOS upgrade from the hardware manufacturer and that it is not an OS error. We are currently running ESXi 3.5.0 (110271). Our system specs:

ASUS RS160-E5 (on VMware supported list)

Dual Intel Xeon 5405 (Harpertown)

24GB RAM (MemoryStore)

80GB Quantum SATA II HD

My personal view is that this is not temperature related. The room our server is in is properly chilled, and the new ASUS server has a wonderful arrangement of fans and airflow that should rule that out.

At the moment, ASUS support is struggling with my request for assistance. I know that VMware says it is not OS related, but I have a feeling that they need to address this situation, since it seems to come up from time to time in the community.

I will report back if I get anywhere. I look forward to you doing the same if possible.
