VMware Cloud Community
PeterT82
Contributor

ESXi 5.0 random reboot

I have an ESXi 5.0 host that randomly reboots. It's on a Win 2008 R2 server and is running two virtual machines with Win 7 OS. Has anyone else had this problem? Can you point me in the right direction to solve this issue?

Thank you

JagadeeshDev
Hot Shot

Are you running ESXi on Workstation or in Hyper-V?

Do you see any events in the logs (/var/log/messages, /var/log/vmksummary.log), such as a "DCUI: reboot" entry?

http://www.myitblog.in/
HarryHolt
Contributor

Same issue here.  Just started recently.

The server has been busier than usual lately, I have cleared off a lot of test VMs, and have done quite a bit of cloning and copying VMDKs to and from the server. Other than that, no changes to settings or ESXi software or configuration settings.

I've looked in the BIOS, and thermal throttling is enabled, but there is nothing like a max temp shutdown available or enabled there.  I also changed the power-off settings so that the BIOS will leave the computer off if it is shut down due to any power interruption. It has rebooted a couple of times since then, so it does not seem to be power related.

There is nothing in the logs indicating any kind of reboot request, just this in the vmksummary.log:

2014-08-05T00:00:01Z heartbeat: up 0d1h1m32s, 3 VMs; [[5904 vmx 827956kB] [7604 vmx 3065716kB] [7588 vmx 3178488kB]] [[5988 sfcb-hhrc 2%max] [5922 sfcb-vmware_bas 4%max] [5916 sfcb-pycim 16%max]]

2014-08-05T00:40:06Z bootstop: Host is rebooting

2014-08-05T00:42:51Z bootstop: Host has booted

2014-08-05T00:43:30Z bootstop: Host is rebooting

2014-08-05T00:46:04Z bootstop: Host has booted

Redirecting the logs to the VMFS storage seems to get reset on a reboot, BTW.  I have to re-direct them again after a reboot.

I found a VMware article that suggested checking messages.log, but no messages.log file is being created.

I have not attempted to configure core dumps, is that the next step?

King_Robert
Hot Shot

Check your server resources. It might be happening because the server is short on resources. How many times a day is it restarting?

HarryHolt
Contributor

The server resources are fine, it's not stressed at all.  Rarely does memory or CPU go over 50%, and in fact most of the times that it restarted the guests were mostly idle.

It has restarted three times within a 24-hour period, but sometimes it doesn't restart for days.  It hasn't restarted now for 5 days.  I bought a new cooler for the CPU, as I noticed it was running just over 65 C the last couple of times I checked it.  Waiting for the next random restart before I install that.

HarryHolt
Contributor

Well the server rebooted again this morning.  This is what was in the vmksummary.log:

2014-08-12T11:00:01Z heartbeat: up 5d19h41m0s, 3 VMs; [[5904 vmx 2097152kB] [310239 vmx 2900680kB] [77306 vmx 4119540kB]] [[5980 sfcb-hhrc 2%max] [5928 sfcb-vmware_bas 4%max] [5922 sfcb-pycim 16%max]]

2014-08-12T11:17:52Z bootstop: Host has booted

2014-08-12T12:00:01Z heartbeat: up 0d0h43m7s, 1 VM; [[5918 sfcb-pycim 13024kB] [4990 hostd-worker 40712kB] [5900 vmx 856708kB]] [[5976 sfcb-hhrc 2%max] [5924 sfcb-vmware_bas 4%max] [5918 sfcb-pycim 16%max]]

So no indication of why.
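One way to read vmksummary.log excerpts like these: a restart that ESXi itself carried out is normally preceded by a "bootstop: Host is rebooting" line, so a "Host has booted" line with only a heartbeat before it suggests the host lost power or was reset before it could log anything. Below is a minimal Python sketch of that check. It assumes the line format shown in this thread, and the function name classify_boots is made up for illustration.

```python
import re
from datetime import datetime

# Matches lines such as "2014-08-12T11:17:52Z bootstop: Host has booted".
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})Z (\w+): (.*)$")


def classify_boots(lines):
    """Return (boot_time, kind, time_of_previous_line) for every boot entry.

    kind is "clean" when the previous entry was "Host is rebooting",
    "unexpected" when it was anything else (for example a heartbeat),
    and "unknown" when the boot entry is the first line seen.
    """
    results = []
    prev_event = None
    prev_time = None
    for raw in lines:
        match = LINE_RE.match(raw.strip())
        if not match:
            continue  # skip blank or unrecognised lines
        stamp = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S")
        tag, message = match.group(2), match.group(3)
        if tag == "bootstop" and message.startswith("Host has booted"):
            if prev_event is None:
                kind = "unknown"
            elif prev_event == "rebooting":
                kind = "clean"
            else:
                kind = "unexpected"
            results.append((stamp, kind, prev_time))
            prev_event = "booted"
        elif tag == "bootstop" and message.startswith("Host is rebooting"):
            prev_event = "rebooting"
        else:
            prev_event = tag
        prev_time = stamp
    return results


# Lines taken from the excerpts in this thread; the heartbeat line is cut
# short after the VM count.
sample = [
    "2014-08-05T00:40:06Z bootstop: Host is rebooting",
    "2014-08-05T00:42:51Z bootstop: Host has booted",
    "2014-08-12T11:00:01Z heartbeat: up 5d19h41m0s, 3 VMs;",
    "2014-08-12T11:17:52Z bootstop: Host has booted",
]

results = classify_boots(sample)
for boot_time, kind, last_seen in results:
    print(boot_time.isoformat(), kind, "previous entry at", last_seen)
```

With the lines above, the 2014-08-05 boot is classed as clean and the 2014-08-12 boot as unexpected, with the last sign of life being the 11:00:01 heartbeat. That narrows the failure to a window of about 18 minutes, but it does not say what caused it.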

The last message in the vmkernel.log was this:

2014-08-12T00:01:02.983Z cpu0:4155)WARNING: VFAT: 4346: Failed to flush file times: Stale file handle

But that was several hours before the restart.  Nothing else in any of the other logs around the reboot time.

I'm pretty stumped.  I'm going to install the new CPU cooler just to throw that at the wall (it's a cheap try).

IntellinetSC
Contributor

The exact same thing started happening to us in the last 1.5 months.

No events pointing to a reboot.

Resources are perfect.

We get a software monitor alert that the vms are down.

By the time we remote in, host is back up and we have to manually start the vms back up.

Log of last night pasted below. Starts at 9:24pm at the bottom.

SFX-MAS is powered on info 11/5/2014 11:26:28 PM SFX-MAS root

SFX-MAS is starting info 11/5/2014 11:26:25 PM SFX-MAS root

SBS2011 is powered on info 11/5/2014 11:26:18 PM SBS2011 root

SBS2011 is starting info 11/5/2014 11:26:16 PM SBS2011 root

User root@68.115.251.146 logged in info 11/5/2014 11:25:57 PM root

User root@127.0.0.1 logged in info 11/5/2014 9:24:46 PM root

User root logged out info 11/5/2014 9:24:24 PM root

User root@127.0.0.1 logged in info 11/5/2014 9:24:24 PM root

VMware Host Agent started info 11/5/2014 9:24:24 PM localhost.localdomain

Host has booted. info 11/5/2014 9:24:23 PM localhost.localdomain

Physical NIC vmnic1 linkstate is up. info 11/5/2014 9:24:18 PM localhost.localdomain

Physical NIC vmnic1 linkstate is down. warning 11/5/2014 9:24:18 PM localhost.localdomain

Physical NIC vmnic1 linkstate is up. info 11/5/2014 9:24:18 PM localhost.localdomain

Physical NIC vmnic0 linkstate is up. info 11/5/2014 9:24:18 PM localhost.localdomain

Port vmk0 is now protected by Firewall. info 11/5/2014 9:24:17 PM localhost.localdomain

Firewall configuration has changed. Operation 'enable' for rule set netDump succeeded. info 11/5/2014 9:24:17 PM localhost.localdomain

Firewall configuration has changed. Operation 'add' for rule set netDump succeeded. info 11/5/2014 9:24:17 PM localhost.localdomain

Firewall configuration has changed. Operation 'add' for rule set remoteSerialPort succeeded. info 11/5/2014 9:24:17 PM localhost.localdomain

Firewall configuration has changed. Operation 'add' for rule set vSPC succeeded. info 11/5/2014 9:24:17 PM localhost.localdomain

Firewall configuration has changed. Operation 'enable' for rule set WOL succeeded. info 11/5/2014 9:24:17 PM localhost.localdomain

Firewall configuration has changed. Operation 'add' for rule set WOL succeeded. info 11/5/2014 9:24:17 PM localhost.localdomain

Firewall configuration has changed. Operation 'add' for rule set IKED succeeded. info 11/5/2014 9:24:17 PM localhost.localdomain

Firewall configuration has changed. Operation 'add' for rule set syslog succeeded. info 11/5/2014 9:24:17 PM localhost.localdomain

Firewall configuration has changed. Operation 'add' for rule set DVSSync succeeded. info 11/5/2014 9:24:17 PM localhost.localdomain

Firewall configuration has changed. Operation 'add' for rule set DHCPv6 succeeded. info 11/5/2014 9:24:17 PM localhost.localdomain

Firewall configuration has changed. Operation 'add' for rule set DVFilter succeeded. info 11/5/2014 9:24:17 PM localhost.localdomain

Firewall configuration has changed. Operation 'add' for rule set gdbserver succeeded. info 11/5/2014 9:24:17 PM localhost.localdomain

Firewall configuration has changed. Operation 'add' for rule set httpClient succeeded. info 11/5/2014 9:24:17 PM localhost.localdomain

Firewall configuration has changed. Operation 'add' for rule set ftpClient succeeded. info 11/5/2014 9:24:17 PM localhost.localdomain

Firewall configuration has changed. Operation 'enable' for rule set HBR succeeded. info 11/5/2014 9:24:17 PM localhost.localdomain

Firewall configuration has changed. Operation 'add' for rule set HBR succeeded. info 11/5/2014 9:24:17 PM localhost.localdomain

Firewall configuration has changed. Operation 'enable' for rule set NFC succeeded. info 11/5/2014 9:24:17 PM localhost.localdomain

One or more LVM devices have been discovered on this host. info 11/5/2014 9:24:17 PM localhost.localdomain

Firewall configuration has changed. Operation 'add' for rule set NFC succeeded. info 11/5/2014 9:24:17 PM localhost.localdomain

Firewall configuration has changed. Operation 'add' for rule set activeDirectoryAll succeeded. info 11/5/2014 9:24:17 PM localhost.localdomain

Firewall configuration has changed. Operation 'setrequired' for rule set vSphereClient succeeded. info 11/5/2014 9:24:17 PM localhost.localdomain

Firewall configuration has changed. Operation 'enable' for rule set vSphereClient succeeded. info 11/5/2014 9:24:17 PM localhost.localdomain

Firewall configuration has changed. Operation 'add' for rule set vSphereClient succeeded. info 11/5/2014 9:24:17 PM localhost.localdomain

Firewall configuration has changed. Operation 'enable' for rule set vMotion succeeded. info 11/5/2014 9:24:17 PM localhost.localdomain

Firewall configuration has changed. Operation 'add' for rule set vMotion succeeded. info 11/5/2014 9:24:17 PM localhost.localdomain

Firewall configuration has changed. Operation 'enable' for rule set webAccess succeeded. info 11/5/2014 9:24:17 PM localhost.localdomain

Firewall configuration has changed. Operation 'add' for rule set webAccess succeeded. info 11/5/2014 9:24:17 PM localhost.localdomain

Firewall configuration has changed. Operation 'enable' for rule set faultTolerance succeeded. info 11/5/2014 9:24:17 PM localhost.localdomain

Firewall configuration has changed. Operation 'add' for rule set faultTolerance succeeded. info 11/5/2014 9:24:17 PM localhost.localdomain

Firewall configuration has changed. Operation 'add' for rule set updateManager succeeded. info 11/5/2014 9:24:17 PM localhost.localdomain

Firewall configuration has changed. Operation 'enable' for rule set vpxHeartbeats succeeded. info 11/5/2014 9:24:17 PM localhost.localdomain

Firewall configuration has changed. Operation 'add' for rule set vpxHeartbeats succeeded. info 11/5/2014 9:24:17 PM localhost.localdomain

Firewall configuration has changed. Operation 'add' for rule set iSCSI succeeded. info 11/5/2014 9:24:17 PM localhost.localdomain

Firewall configuration has changed. Operation 'enable' for rule set CIMSLP succeeded. info 11/5/2014 9:24:17 PM localhost.localdomain

Firewall configuration has changed. Operation 'add' for rule set CIMSLP succeeded. info 11/5/2014 9:24:17 PM localhost.localdomain

Firewall configuration has changed. Operation 'enable' for rule set CIMHttpsServer succeeded. info 11/5/2014 9:24:17 PM localhost.localdomain

Firewall configuration has changed. Operation 'add' for rule set CIMHttpsServer succeeded. info 11/5/2014 9:24:17 PM localhost.localdomain

Firewall configuration has changed. Operation 'enable' for rule set CIMHttpServer succeeded. info 11/5/2014 9:24:17 PM localhost.localdomain

Firewall configuration has changed. Operation 'add' for rule set CIMHttpServer succeeded. info 11/5/2014 9:24:17 PM localhost.localdomain

Firewall configuration has changed. Operation 'enable' for rule set ntpClient succeeded. info 11/5/2014 9:24:17 PM localhost.localdomain

Firewall configuration has changed. Operation 'add' for rule set ntpClient succeeded. info 11/5/2014 9:24:17 PM localhost.localdomain

Firewall configuration has changed. Operation 'enable' for rule set snmp succeeded. info 11/5/2014 9:24:17 PM localhost.localdomain

Firewall configuration has changed. Operation 'add' for rule set snmp succeeded. info 11/5/2014 9:24:17 PM localhost.localdomain

Firewall configuration has changed. Operation 'enable' for rule set dns succeeded. info 11/5/2014 9:24:17 PM localhost.localdomain

Firewall configuration has changed. Operation 'add' for rule set dns succeeded. info 11/5/2014 9:24:17 PM localhost.localdomain

Firewall configuration has changed. Operation 'enable' for rule set dhcp succeeded. info 11/5/2014 9:24:17 PM localhost.localdomain

Firewall configuration has changed. Operation 'add' for rule set dhcp succeeded. info 11/5/2014 9:24:17 PM localhost.localdomain

Firewall configuration has changed. Operation 'add' for rule set nfsClient succeeded. info 11/5/2014 9:24:17 PM localhost.localdomain

Firewall configuration has changed. Operation 'add' for rule set sshClient succeeded. info 11/5/2014 9:24:17 PM localhost.localdomain

Firewall configuration has changed. Operation 'add' for rule set sshServer succeeded. info 11/5/2014 9:24:17 PM localhost.localdomain
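To put a number on how long the VMs stayed off after the host came back, the event export above can be parsed for the "Host has booted." entry and the first "is powered on" entry that follows it. This is only a rough Python sketch: it assumes one event per line in the "message severity date time target" layout of the single-line entries above, and the names parse_event and downtime_after_boot are invented for this example.

```python
import re
from datetime import datetime

# message, severity, then a timestamp like "11/5/2014 9:24:23 PM", then target/user
EVENT_RE = re.compile(
    r"^(?P<msg>.+?)\s+(?P<sev>info|warning|error)\s+"
    r"(?P<ts>\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{2}:\d{2} [AP]M)\s+(?P<rest>.+)$"
)


def parse_event(line):
    """Return (timestamp, severity, message), or None if the line does not match."""
    match = EVENT_RE.match(line.strip())
    if not match:
        return None
    stamp = datetime.strptime(match.group("ts"), "%m/%d/%Y %I:%M:%S %p")
    return stamp, match.group("sev"), match.group("msg")


def downtime_after_boot(lines):
    """Time between the first 'Host has booted' event and the first VM power-on after it."""
    events = sorted((e for e in map(parse_event, lines) if e), key=lambda e: e[0])
    boot = next((t for t, _, msg in events if msg.startswith("Host has booted")), None)
    if boot is None:
        return None
    first_on = next(
        (t for t, _, msg in events if t >= boot and msg.endswith("is powered on")),
        None,
    )
    return None if first_on is None else first_on - boot


# A few entries copied from the export above (newest first, as posted).
sample = [
    "SFX-MAS is powered on info 11/5/2014 11:26:28 PM SFX-MAS root",
    "SBS2011 is powered on info 11/5/2014 11:26:18 PM SBS2011 root",
    "User root@68.115.251.146 logged in info 11/5/2014 11:25:57 PM root",
    "Host has booted. info 11/5/2014 9:24:23 PM localhost.localdomain",
]

gap = downtime_after_boot(sample)
print(gap)  # 2:01:55
```

Note that this measures only the time from the host boot to the first manual VM start, not when the host actually went down. The export contains no entry for the moment of failure.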

admin
Immortal

Can you check the vmkernel logs and see if anything looks wrong there?

Thanks,

DJ

Alistar
Expert

Hi guys,

Can you please check vmkernel.log for Machine Check Errors and post the output of this?

# grep MCE /var/log/vmkernel.log

A dump of vmkwarning.log would also be helpful.

We had hosts crashing under random loads as well, and eventually it pointed to a hardware failure. That is the case 99% of the time when an ESXi host crashes on you without any PSOD.

Stop by my blog if you'd like 🙂 I dabble in vSphere troubleshooting, PowerCLI scripting and NetApp storage - and I share my journeys at http://vmxp.wordpress.com/
HarryHolt
Contributor

The only result from your suggested grep was:

TSC: 114497 cpu0:0)BootConfig: 89: mcaClearBanksOnMCE = TRUE

0:00:00:02.005 cpu0:4096)MCE: 2582: Detected 6 MCE banks. MCG_CAP MSR:0x106

There was nothing relevant in the warning file, so I didn't post that, it's just a bunch of latency warnings (during backups), and a few stale file handles.  None of it written near the reboot times.
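For anyone repeating that grep: the two lines it returned here look like boot-time initialization messages (a boot option and the number of MCE banks detected), not error reports. The Python sketch below drops those two so that any other line mentioning MCE stands out. The "benign" patterns are taken only from the output quoted in this post, so treat the list as an assumption, not a complete catalogue; suspicious_mce_lines is a made-up helper name.

```python
import re

# Patterns for the two boot-time lines quoted in this thread (assumed benign).
BENIGN_PATTERNS = [
    re.compile(r"BootConfig: \d+: mcaClearBanksOnMCE"),
    re.compile(r"MCE: \d+: Detected \d+ MCE banks"),
]


def suspicious_mce_lines(lines):
    """Return lines that mention MCE but match none of the benign patterns."""
    hits = []
    for line in lines:
        if "MCE" not in line:
            continue  # same first filter as `grep MCE`
        if any(pattern.search(line) for pattern in BENIGN_PATTERNS):
            continue
        hits.append(line.rstrip("\n"))
    return hits


# Lines copied from this thread: the two grep results plus the VFAT warning.
sample = [
    "TSC: 114497 cpu0:0)BootConfig: 89: mcaClearBanksOnMCE = TRUE",
    "0:00:00:02.005 cpu0:4096)MCE: 2582: Detected 6 MCE banks. MCG_CAP MSR:0x106",
    "2014-08-12T00:01:02.983Z cpu0:4155)WARNING: VFAT: 4346: Failed to flush file times: Stale file handle",
]

print(suspicious_mce_lines(sample))  # []
```

To run it against a real log, pass an open file instead of the sample list, for example `suspicious_mce_lines(open("/var/log/vmkernel.log"))`. An empty result only means nothing was logged. A host that loses power or resets hard may never get the chance to write an MCE entry.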

But I think my issue probably came down to some incompatibility between my UPS and the server's power supply.  I moved the laser printer off of the UPS to a different circuit, and I haven't had any random reboots since.
