Re: ESXi 3i 3.5.0 System Rebooting?

mevans336 · ‎09-21-2009

Hello Guys,

I've got a problem, well, something that is now officially a problem. My PowerEdge 2970 VMware ESXi (3.5.0 build-169697) server has now rebooted/crashed or killed all the VMs twice over the past few months. I have a DRAC card in the box and it doesn't show anything out of the ordinary. There is only one entry in the SEL and that is from where I cleared the SEL when I installed the server back in March.

The logs on the ESXi server seem to wipe themselves out when this happens, or perhaps that is by design? Whatever the reason, both times when this has happened, all the ESXi logs (Messages/Config/Management Agent) start from when the ESXi server "boots" back up with this entry:

Date syslogd started: BusyBox 1.2.1

The logs then move into vmkernel hardware detection of CPUs, etc.

How can I figure out what the cause of this is? Is there any other way for me to log more information?

krowczynski · ‎09-21-2009

Were your VMs located on the server's local storage?

Do you have only one ESXi host?

MCP, VCP3 , VCP4

mevans336 · ‎09-21-2009

Thanks for the quick reply.

They are located on an internal 8-disk RAID-10 SATA array attached to a PERC 6i I believe. And yes, this is the only ESXi server we have.

krowczynski · ‎09-21-2009

Have you checked the various log files in /var/log on your server?

MCP, VCP3 , VCP4

mevans336 · ‎09-21-2009

I have checked the logs via the console. Am I able to SSH into the IP of the ESXi server to get the log files?

bulletprooffool · ‎09-22-2009

If the ESX host is crashing, I'd look at the hardware config for the server.

Download the diagnostics disk for your server and run a full set of diagnostics. It could easily be faulty memory, RAID controller, motherboard, etc.

It is interesting that you have lost all your logs at reboot.

Make sure that your disk partitions are not corrupt, preventing the saving of changes.

Have a look at the following - very useful doc:

http://www.vm-help.com/esx/esx3i/check_system_partitions.php

Please post back with your results.

One day I will virtualise myself . . .

mevans336 · ‎09-22-2009

I am not sure how to get an SSH session to run those commands?

bulletprooffool · ‎09-22-2009

Full instructions:

http://www.yellow-bricks.com/2008/08/10/howto-esxi-and-ssh/

One day I will virtualise myself . . .

mevans336 · ‎09-22-2009

/var/log/messages just shows what I see from within the console. The log starts after the reboot with this info:

Sep 22 02:48:27 syslogd started: BusyBox v1.2.1

Sep 22 02:48:27 vmkernel: TSC: 0 cpu0:0)Init: 277: cpu 0: early measured tsc speed is 2294250449 Hz

Sep 22 02:48:27 vmkernel: TSC: 16340 cpu0:0)Cpu: 341: id1.version 100f42

Sep 22 02:48:27 vmkernel: TSC: 24587 cpu0:0)CPUAMD: 204: Detecting xapic on AMD_K8:tcr = 0x4fc820

Sep 22 02:48:27 vmkernel: TSC: 32660 cpu0:0)Cpu: 400: APIC ID mask: 0xff000000

Sep 22 02:48:27 vmkernel: TSC: 38206 cpu0:0)Cpu: 826: initial APICID=0x0

Sep 22 02:48:27 vmkernel: TSC: 42764 cpu0:0)CPUAMD: 418: Microcode patch level 0x1000086.

etc ...
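Since each boot begins with a "syslogd started" line, a saved copy of the messages log can be scanned for that line to list when the host came back up. The following is only a minimal Python sketch, assuming the log text has been copied off the host; the function name find_boot_times is made up for illustration, and the sample lines are the ones quoted above.

```python
import re

# Matches the first line BusyBox syslogd writes after a boot, e.g.
# "Sep 22 02:48:27 syslogd started: BusyBox v1.2.1".
BOOT_MARKER = re.compile(
    r"^(?P<stamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) syslogd started:"
)

def find_boot_times(lines):
    """Return the timestamp string of every 'syslogd started' line."""
    stamps = []
    for line in lines:
        match = BOOT_MARKER.match(line.strip())
        if match:
            stamps.append(match.group("stamp"))
    return stamps

# Sample lines taken from the log excerpt in this post.
sample = [
    "Sep 22 02:48:27 syslogd started: BusyBox v1.2.1",
    "Sep 22 02:48:27 vmkernel: TSC: 0 cpu0:0)Init: 277: cpu 0: "
    "early measured tsc speed is 2294250449 Hz",
]
boot_times = find_boot_times(sample)
print(boot_times)
```

On a log that survives reboots (for example one collected on another machine), each entry returned marks one restart of the host.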

hostd.log shows this error:

Error Stream from partedUtil while getting partitions: Error: The partition table on /dev/disks/vml.020000000060022190c59b2c00115c3b22497fdbd6504552432035 is inconsistent. There are many reasons why this might be the case. However, the most likely reason is that Linux detected the BIOS geometry for /dev/disks/vml.020000000060022190c59b2c00115c3b22497fdbd6504552432035 incorrectly. GNU Parted suspects the real geometry should be 713472/64/32 (not 90954/255/63). You should check with your BIOS first, as this may not be correct. You can inform Linux by adding the parameter disks/vml.020000000060022190c59b2c00115c3b22497fdbd6504552432035=713472,64,32 to the command line. See the LILO or GRUB documentation for more information. If you think Parted's suggested geometry is correct, you may select Ignore to continue (and fix Linux later). Otherwise, select Cancel (and fix Linux and/or the BIOS now).
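To put the two geometries from that message side by side, the cylinders/heads/sectors figures can be multiplied out into total sector counts. This is only a rough arithmetic sketch in Python; the 512-byte sector size is an assumption (the error message does not state it), and chs_sectors is an illustrative helper name.

```python
SECTOR_BYTES = 512  # assumption: 512-byte sectors, not stated in the log

def chs_sectors(cylinders, heads, sectors_per_track):
    """Total addressable sectors for a cylinders/heads/sectors geometry."""
    return cylinders * heads * sectors_per_track

suggested = chs_sectors(713472, 64, 32)   # geometry Parted suggests
reported = chs_sectors(90954, 255, 63)    # geometry Parted says was detected

print("suggested sectors:", suggested)
print("reported sectors: ", reported)
print("difference:       ", suggested - reported)
print("suggested size GB:", round(suggested * SECTOR_BYTES / 1e9, 1))
```

The two geometries come to 1,461,190,656 and 1,461,176,010 sectors, a difference of 14,646 sectors (roughly 7.5 MB under the 512-byte assumption). Both describe nearly the same capacity, so the warning appears to concern how the geometry is expressed rather than a large difference in disk size. That observation alone does not show whether the partition table is actually damaged.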

I posed this question on the DSLReports forum and someone responded with the following. Does it make sense?

"Additionally, there has been some instability with certain hardware and ESXi. What build are you running? The issue is with CIM and the kernel running out of memory and crashing. You can search VMware's community for the details. To disable CIM, go to the configuration tab, advanced settings, Misc, and set Misc.CimEnabled and Misc.CimOemProvidersEnabled to 0. Reboot each host to activate the changes."

DSTAVERT · ‎09-22-2009

I would set up a syslog server or use the vMA appliance to capture logs outside the host. The logs do get lost in a reboot.
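As an illustration of the "syslog server" idea, here is a minimal sketch of a UDP syslog receiver in Python that appends whatever it receives to a file on another machine. It is not a replacement for a real syslog daemon: the file name esxi-remote.log, the helper format_record and the handler class are all made up for this example, and it assumes the host has been configured to send its syslog output to this machine on UDP port 514.

```python
import socketserver

LOG_PATH = "esxi-remote.log"     # hypothetical output file
LISTEN_ADDR = ("0.0.0.0", 514)   # usual syslog UDP port; binding it
                                 # normally needs root/admin rights

def format_record(client_ip, data):
    """Turn one received datagram into a single log-file line."""
    text = data.decode("utf-8", errors="replace").rstrip("\r\n")
    return f"{client_ip} {text}\n"

class SyslogHandler(socketserver.BaseRequestHandler):
    def handle(self):
        # For UDP servers, self.request is a (data, socket) pair.
        data, _sock = self.request
        with open(LOG_PATH, "a", encoding="utf-8") as log:
            log.write(format_record(self.client_address[0], data))

def serve():
    """Listen until interrupted, appending every message to LOG_PATH."""
    with socketserver.UDPServer(LISTEN_ADDR, SyslogHandler) as server:
        server.serve_forever()
```

Calling serve() on the receiving machine starts the listener. Because the file lives off the host, the last messages sent before a crash are still there after the reboot.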

Are you using the generic ESXi install or the Dell-specific version? I would certainly run the diagnostics as suggested earlier.

As for the CIM modules causing the crash??? If you do disable them you will lose your hardware health monitoring.

-- David -- VMware Communities Moderator

bulletprooffool · ‎09-23-2009

Did you run through the instructions posted previously?

http://www.vm-help.com/esx/esx3i/check_system_partitions.php

One day I will virtualise myself . . .
