VMware Cloud Community
EborComputing
Contributor

ESXi crashing with Hardware (machine) error: Unknown Encoding

Hi community,

I've got a lovely random problem with one of my ESXi hosts. Every now and then it will crash with the following (attached) error. I've got two identical systems; one has been running for months without drama, but this one has failed at least three times now. Once was a spontaneous reboot (I didn't notice until I checked the guests and they asked why the system had restarted unexpectedly), and the other two times it crashed with this error.

I'm currently running a memtest on the host to try to eliminate one potential cause, but I'm not certain whether it's the CPU, the mainboard, or a failure in ESXi. Anyone able to enlighten me?

6 Replies
BruceMcMillan
Hot Shot

What hardware are you using? Is it on the HCL?

jsteffen
Contributor

I've got the same problem with one of mine (first time, I believe). This is an IBM LS41 with IBM SVC storage, and it's one of about 20 we have; the rest have been fine.

On the 2nd/3rd line it says "in world ###:vmm0:", right? Could it have something to do with that particular VM? That particular VM happens to be VM hardware version 4, and I'm running ESXi 4 build 244038.

Googling does not yield very much at all on this error.

Thanks in advance

EborComputing
Contributor

I'm using this chassis as a dual host. Super Micro has assured me that it contains the same motherboard as an ESXi-certified server, and ESXi recognises it as a SuperMicro H8DMT+, so I assume it's all fine. One of the two systems has been working fine for months.

EborComputing
Contributor

Hi jsteffen,

I tried upgrading from ESXi v4.0 to Update 1 and finally to Update 2, but the problems persisted.

Problems for me ranged from spontaneous reboots to hangs with nothing on the screen, where nothing on the keyboard would wake it up (Num Lock didn't even toggle, so it's unlikely to be software related).

Lastly, I got some approved downtime and swapped the CPU between the failing host and the one that was working fine. The failures stayed with the same system regardless of which CPU was in it, indicating it wasn't CPU related. I then shifted the VMs off to the working system and ran memtest v4.0 to check the RAM; three-and-a-bit passes later it reported everything was fine. There's not a lot left in the system after that, so I sent the motherboard back under warranty.

While that board was away under RMA, I put the RAM and CPU from the failing system into another board to give it a boost. Within three days the beefier system was failing too, repeatedly. Knowing the CPU was fine, introducing the RAM had evidently introduced faults into a system that was otherwise fine. Three passes of memtest showed no errors, and it took three days for a fault to appear, at which point the system failed three times in the same day. Talk about intermittent.

With the suspect RAM taken out, the system hasn't shown any problems in 30+ hours. Still waiting.

I don't know if this will help you, but it appears that the PSOD (purple screen of death) is related to a hardware problem. Not bad for registered ECC server-class RAM. No reported ECC faults that I can see... maybe I'm just not looking in the right place. And if they are reported somewhere, maybe there needs to be a "configure reporting" tab in vSphere...
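For what it's worth, on server boards like these, corrected ECC events are usually logged by the BMC in the IPMI System Event Log rather than surfaced in vSphere. A sketch of how one might filter such a log for memory events; the SEL lines below are hypothetical samples (real formatting varies by BMC firmware), and in practice the dump would come from something like `ipmitool -I lanplus -H <bmc-address> -U <user> sel list` run against the host's BMC:

```shell
# Hypothetical IPMI SEL dump; on a real host this would come from
# `ipmitool ... sel list` (not run here, since it needs BMC access).
sel_dump='   1 | 05/12/2010 | 03:14:07 | Memory #0x02 | Correctable ECC | Asserted
   2 | 05/12/2010 | 03:14:09 | Power Supply #0x01 | Presence detected | Asserted
   3 | 05/13/2010 | 11:02:44 | Memory #0x02 | Uncorrectable ECC | Asserted'

# Keep only the memory/ECC events; repeated "Correctable ECC" entries
# against the same sensor can point at the failing module.
printf '%s\n' "$sel_dump" | grep -i 'ecc'
```

If ECC events do show up there but memtest passes, that still points at marginal RAM, since memtest only catches faults that reproduce while it happens to be running.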

WudthiP
Contributor

Hi,

It's probably a hardware failure; try checking with the management software that comes with the system to verify the hardware.

Wp.

EborComputing
Contributor

It appears to have been RAM related. I sent the server and RAM back to the supplier, and they concurred that the RAM was most likely at fault. The board came back as-is, but all 8 sticks of RAM were replaced, so even they couldn't diagnose which stick was faulty.

The system has been running fine again for a week now.
