VMware Cloud Community
IvanMoscow
Contributor

esxi4 host reproducible crash

Hello,

I'm running ESXi 4 (build 181792, will try to update later today) on a Sun Fire X4140 (dual-CPU quad-core Opteron) with several VMs: some running CentOS and Red Hat (both amd64), some FreeBSD i386 VMs, and now 2 FreeBSD amd64 VMs.

When I create high load in both FreeBSD amd64 VMs (compiling software), the host crashes quickly (after a few minutes) and reboots automatically.

If load is created on only one of those VMs, it takes longer (about 20-30 minutes), but the host still crashes and reboots automatically.

The VMware setup is nothing special: local disk, just one datastore, and more memory available than is allocated to all VMs. One of the FreeBSD amd64 VMs has 4 CPUs (cores) assigned; most other VMs have 1 or 2.

Any ideas where to look for crash reports and what might cause the problem?

7 Replies
Rumple
Virtuoso

I would suspect bad memory in the host. If it only crashes after compiles have been running for a while, you could eventually be hitting a bad memory stick.

I would download memtest86+ and run it against the host for 48 hours as a first test (although memory errors usually show up long before that).

golddiggie
Champion

Could be a funky processor too...

Definitely agree that it's most likely a hardware issue that's causing the crashes...

VCP4

J1mbo
Virtuoso

The host presumably has a remote access board of some description, or IPMI logs at least? Check the logs in the BIOS or use the DRAC/iLO equivalent to look at the hardware event history. As said, it could be the CPU, RAM, system board, or even the PSU, but the logs should help you.

IvanMoscow
Contributor

Sorry, I did not point out that this ESXi 4 host has been running in production for more than 6 months now, including several 64-bit Linux VMs. Only when running the 2 FreeBSD 64-bit VMs does it crash, and very quickly.

I've checked ILOM and it shows a few messages regarding the crash, but they do not make me believe in a hardware malfunction. Again, several other 64-bit VMs have been running fine for several months; only these new FreeBSD 64-bit VMs make the host crash.

I also tried downgrading these two VMs to 32-bit and repeated the same compile, and it worked without any problems. Creating new FreeBSD 64-bit VMs and repeating the compile job crashes the host again.

The ILOM message I found regarding the crash is:

ID = 257 : 03/28/2010 : 17:36:39 : System Boot Initiated : BIOS : Initiated by warm reset

Afterwards, typical boot messages appear in the ILOM log file.
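For anyone sifting through a longer ILOM event log for similar entries, here is a minimal Python sketch that picks out reset events. The field layout is an assumption inferred only from the single line quoted above (ID, date as MM/DD/YYYY, time, event, source, detail, separated by " : "); the function names `parse_ilom_line` and `find_resets` are hypothetical helpers, not part of any Sun or VMware tool.

```python
import re
from datetime import datetime

# Assumed layout, inferred from the one entry quoted in this thread:
# ID = <n> : MM/DD/YYYY : HH:MM:SS : <event> : <source> : <detail>
LINE_RE = re.compile(
    r"ID\s*=\s*(?P<id>\d+)\s*:\s*"
    r"(?P<date>\d{2}/\d{2}/\d{4})\s*:\s*"
    r"(?P<time>\d{2}:\d{2}:\d{2})\s*:\s*"
    r"(?P<event>.*?)\s*:\s*(?P<source>.*?)\s*:\s*(?P<detail>.*)$"
)


def parse_ilom_line(line):
    """Return a dict for one event log line, or None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    when = datetime.strptime(m["date"] + " " + m["time"], "%m/%d/%Y %H:%M:%S")
    return {
        "id": int(m["id"]),
        "when": when,
        "event": m["event"],
        "source": m["source"],
        "detail": m["detail"],
    }


def find_resets(lines):
    """Keep only parsed entries whose detail field mentions a reset."""
    entries = (parse_ilom_line(line) for line in lines)
    return [e for e in entries if e and "reset" in e["detail"].lower()]


if __name__ == "__main__":
    sample = [
        "ID = 257 : 03/28/2010 : 17:36:39 : System Boot Initiated : BIOS : Initiated by warm reset",
    ]
    for entry in find_resets(sample):
        print(entry["when"].isoformat(), entry["event"], "-", entry["detail"])
```

Comparing the timestamps of such reset entries with the times the compile jobs were started would at least confirm that each crash lines up with the load.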

Skyjacker
Contributor

Same problem.

I tried on three HP BL465c G6 blades (BIOS A13, 2x six-core Opteron 2435, all DIMM slots full):

ESXi4 u1 (208167) / u2 (261974) + FreeBSD8 amd64 + load (compiling software) = blade reset

But on a BL460c G6 (QC Xeon E5540) with ESXi4 u1 (208167) and FreeBSD8 amd64, everything works perfectly. It has never reset.

IML log:

POST Error: A Critical Error occurred prior to this power-up.

iLO: Caution Server reset

That's not much information.

vwal
Contributor

I started seeing the same problem on an ESXi 4.1 host (quad-core test system with 8GB) running FreeBSD 8.0 x64 when compiling software from ports. After the compilation had been running for a few moments, it would take down the entire system.

I haven't done exhaustive testing yet, but it seems the problem was remedied by removing the default checkmark from the "Unlimited" CPU usage option in Virtual Machine Properties > Resources tab > CPU > Resource Allocation. At least, a compilation process that crashed the host every time now completed without problems with that option turned off (with nothing else changed in the system configuration).

J1mbo
Virtuoso

Has anyone logged this with VMware?

As a point of interest, the host can also be repeatably crashed by disconnecting a VMDK from a guest with active IO when the VMDK is served from a StarWind iSCSI host that is using the StarPort driver.

http://blog.peacon.co.uk

Please award points to any useful answer.
