I've got a problem, well, something that is now
officially a problem. My PowerEdge 2970 VMware ESXi (3.5.0 build-169697) server has
now rebooted/crashed or killed all the VMs twice over the past few
months. I have a DRAC card in the box and it doesn't show
anything out of the ordinary. There is only one entry in the SEL, and
that is from when I cleared the SEL back when I first installed the server.
The logs on the ESXi server seem to wipe themselves
out when this happens, or perhaps that is by design? Whatever the
reason, both times when this has happened, all the ESXi logs
(Messages/Config/Management Agent) start from when the ESXi server
"boots" back up with this entry:
Date syslogd started: BusyBox v1.2.1
The logs then move into vmkernel hardware detection of CPUs, etc.
How can I figure out what the cause of this is? Is there any other way for me to log more information?
If the ESXi host is crashing, I'd look at the hardware config for the server.
Download the diagnostics disk for your server and run a full set of diagnostics - it could easily be faulty memory, a RAID controller, the motherboard, etc.
It is interesting that you have lost all your logs at reboot.
Make sure that your disk partitions are not corrupt, preventing the saving of changes.
Have a look at the following - very useful doc:
Please post back with your results.
/var/log/messages just shows what I see from within the console. The log starts after the reboot with this info:
Sep 22 02:48:27 syslogd started: BusyBox v1.2.1
Sep 22 02:48:27 vmkernel: TSC: 0 cpu0:0)Init: 277: cpu 0: early measured tsc speed is 2294250449 Hz
Sep 22 02:48:27 vmkernel: TSC: 16340 cpu0:0)Cpu: 341: id1.version 100f42
Sep 22 02:48:27 vmkernel: TSC: 24587 cpu0:0)CPUAMD: 204: Detecting xapic on AMD_K8:tcr = 0x4fc820
Sep 22 02:48:27 vmkernel: TSC: 32660 cpu0:0)Cpu: 400: APIC ID mask: 0xff000000
Sep 22 02:48:27 vmkernel: TSC: 38206 cpu0:0)Cpu: 826: initial APICID=0x0
Sep 22 02:48:27 vmkernel: TSC: 42764 cpu0:0)CPUAMD: 418: Microcode patch level 0x1000086.
hostd.log shows this error:
while getting partitions: Error: The partition table on /dev/disks/vml.02000000
0060022190c59b2c00115c3b22497fdbd6504552432035 is inconsistent. There are many reasons why this might be the case. However, the most likely reason is that Linux detected the BIOS geometry for /dev/disks/vml.020000000060022190c59b2c00115c3b22497fdbd6504552432035 incorrectly. GNU Parted suspects the real geometry should be 713472/64/32 (not 90954/255/63). You should check with your BIOS first, as this may not be correct. You can inform Linux by adding the parameter disks/vml.020000000060022190c59b2c00115c3b22497fdbd6504552432035=713472,64,32 to the command line. See the LILO or GRUB documentation for more information. If you think Parted's suggested geometry is correct, you may select Ignore to continue (and fix Linux later). Otherwise, select Cancel (and fix Linux and/or the BIOS now).
I posed this question on the DSLReports forum and someone responded with the following, does it make sense?
+Additionally, there has been some instability with certain hardware and
ESXi. What build are you running? The issue is with CIM and the kernel
running out of memory and crashing. You can search VMware's community
for the details. To disable CIM, go to the configuration tab, advanced
settings, Misc, and set Misc.CimEnabled and Misc.CimOemProvidersEnabled
to 0. Reboot each host to activate the changes.+
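If you'd rather make that change from the host's unsupported Tech Support Mode console instead of the VI Client, something like the following should be equivalent (a sketch; I'm assuming `/Misc/CimEnabled` and `/Misc/CimOemProvidersEnabled` are the advanced-option paths on your build, so verify with `esxcfg-advcfg -g` first):

```shell
# Disable the CIM providers (same as setting Misc.CimEnabled and
# Misc.CimOemProvidersEnabled to 0 in the VI Client advanced settings)
esxcfg-advcfg -s 0 /Misc/CimEnabled
esxcfg-advcfg -s 0 /Misc/CimOemProvidersEnabled

# Check what the values are now set to
esxcfg-advcfg -g /Misc/CimEnabled
esxcfg-advcfg -g /Misc/CimOemProvidersEnabled

# A reboot is still required for the change to take effect
```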
I would set up a syslog server or use the VMA appliance to capture logs outside the host. The logs do get lost in a reboot.
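For example, from the VMA (or the Remote CLI) you can point the host at a remote syslog collector; this is a sketch assuming `vicfg-syslog` is available in your CLI version, and the hostnames are placeholders for your own:

```shell
# Point the ESXi host's syslog at an external collector so the logs
# survive a crash/reboot (hostnames here are examples)
vicfg-syslog --server esxi-host.example.com --username root \
    --setserver syslog.example.com --setport 514

# Show the current remote syslog configuration to confirm
vicfg-syslog --server esxi-host.example.com --username root --show
```

With that in place, the vmkernel and hostd messages leading up to the next crash should land on the collector even if the host wipes its local logs.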
Are you using the generic ESXi install or the Dell-specific version? I would certainly run the diagnostics as suggested earlier.
As for the CIM modules causing the crash: it's possible, but be aware that if you disable them you will lose your hardware health monitoring.