I came in this morning to find that my VMware ESXi 3.5 Build 158869 was not responding. All VMs were unresponsive, and the console was there but neither F2 nor F12 did anything. VIC couldn't connect, and I could not connect with SSH, either. I had to physically power off the server and turn it back on again. /var/log/messages starts at that time, soooo... how do I find out what happened and how to stop it from happening again?
Well, here is what I would do:
Get a syslog server from the net, such as Kiwi or any other (most of them are freeware).
Have your ESX host feed the syslog server. You can set that up in the advanced options in the Configuration part of the ESX host.
Check whether the BIOS of your hardware is at a supported version, and upgrade it if necessary.
Install the latest patches for your ESX server.
Wait for the next occurrence, then look up the log files that were sent to the syslog server and give them to VMware Support.
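If you just want something quick to catch the messages while you pick a proper syslog server, a minimal sketch of a UDP syslog receiver in Python could look like the following. This is not Kiwi or any VMware tool. The names `SyslogHandler` and `make_server` and the file `esx-syslog.log` are made up for illustration, and it assumes the host sends plain UDP syslog to port 514 (binding to that port usually needs root/administrator rights).

```python
import socketserver


class SyslogHandler(socketserver.BaseRequestHandler):
    """Append every received UDP syslog datagram to a text file."""

    log_path = "esx-syslog.log"  # hypothetical output file

    def handle(self):
        # For UDP servers, self.request is a (data, socket) pair.
        data, _sock = self.request
        message = data.decode("utf-8", errors="replace").rstrip()
        with open(self.log_path, "a", encoding="utf-8") as log_file:
            # Prefix each line with the sender's IP so several hosts can share one file.
            log_file.write(f"{self.client_address[0]} {message}\n")


def make_server(host="0.0.0.0", port=514, log_path="esx-syslog.log"):
    """Build a UDP server that writes incoming syslog messages to log_path."""
    handler = type("BoundSyslogHandler", (SyslogHandler,), {"log_path": log_path})
    return socketserver.UDPServer((host, port), handler)
```

To run it, call `make_server().serve_forever()` on the machine you pointed the host's syslog setting at. The point is only that the log survives on another box when the host itself has to be power cycled.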
If you find this piece of information helpful, I'd be glad if you awarded some points for it.
Look at the following:
ESX Server host agent log - /var/log/vmware/hostd.log - Contains information on the agent that manages and configures the ESX Server host and its virtual machines
/var/log/vmware/vpx - for VC agent logs
/var/log/messages - for service console errors - It rotates to files with a .X extension, so you may want to check the date/time stamps
Also check whether anything new was added to your environment recently, or whether any hardware repairs were done.
You can try the "vm-support" command and upload the resulting file to VMware Support so they can analyze what the problem is. Otherwise, use the methods recommended above as a long-term monitoring and error-checking strategy.
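For the date/time stamp check on the rotated messages files, here is a rough Python sketch that lists a log and its rotated copies, oldest first. It assumes you have copied the logs to a machine with Python 3 on it (ESXi itself won't have it), and the function name `rotated_logs` is just something I made up for this example.

```python
import glob
import os
import time


def rotated_logs(base_path):
    """Return (path, modification time) pairs for base_path and its rotated copies, oldest first."""
    # Match the base file plus anything sharing its prefix, e.g. messages, messages.1, messages.2
    paths = [p for p in glob.glob(glob.escape(base_path) + "*") if os.path.isfile(p)]
    paths.sort(key=os.path.getmtime)
    return [
        (p, time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(os.path.getmtime(p))))
        for p in paths
    ]
```

Calling `rotated_logs("/var/log/messages")` (or wherever you copied the files) shows at a glance whether any file predates the hang or whether everything was recreated at boot.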
If you found this information useful, please consider awarding points for "Correct" or "Helpful". Thanks!!!
VMware vExpert 2009
iGeek Systems Inc.
VMware, Citrix, Microsoft Consultant
I'm getting the exact same symptoms on all our ESXi servers here, all at version 169697 (U4). BIOS and HBAs have all been updated. Really annoying: the DCUI doesn't work and you can't connect the VIC to the server, but you can ping it. HA doesn't kick in because VC doesn't see the host as dead, but you lose all network connectivity to the host's VMs. The only workaround I've got at the moment is to jump into the ESXi console and run /etc services.sh restart. Saves me having to cold boot the host. The weird thing is that the host realises it's isolated: our HA settings are to power down the VMs on isolation, and it does this. It seems to be random, and no two boxes fail at the same time.
I've got the logs getting pumped out to a syslog server, but they're about as much use as a chocolate teapot. There's nothing obvious in there; the only thing that gets me wondering is events like these:
13:49:19 CreateDefaultSelfCheckSettings failed to get TopLevelSystem
A couple of tasks and then
13:57:48 ***1369 Error accepting SSL connection -- exiting
13:59:04 CreateDefaultSelfCheckSettings failed to get TopLevelSystem
Couple more tasks
14:00:23 auto-backup.sh seems to kickoff
14:00:23 vmwarerootwatch.sh seems to kick off
14:00:24 Setting RTC date 'n' time
14:02:44 starts resetting the VMs
Lines in italics are direct copies of the log; lines not in italics are my interpretation of what the log says. We've got a mix of ESXi and ESX managed by the same VC, and this only happens on ESXi. All on HP DL380 G5s, SAN attached.
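If anyone wants to check their own syslog output for the same two messages, a small Python sketch like this would pull them out. The two patterns are taken straight from the log lines above; the function name `suspicious_lines` is only for illustration, and whether these messages are actually related to the hang is unconfirmed.

```python
import re

# Message fragments copied from the log excerpt in this post.
PATTERNS = [
    re.compile(r"CreateDefaultSelfCheckSettings failed"),
    re.compile(r"Error accepting SSL connection"),
]


def suspicious_lines(lines):
    """Return the lines that contain any of the message fragments above."""
    hits = []
    for line in lines:
        if any(pattern.search(line) for pattern in PATTERNS):
            hits.append(line.rstrip("\n"))
    return hits
```

Feed it the lines of whatever file your syslog server writes, e.g. `suspicious_lines(open(path))`, and compare the timestamps of the hits with the time the host dropped off.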
Anyone any ideas?
Same problem here. DL580s with Emulex HBAs. HA didn't kick in because the host could still be pinged; console F2 and F12 were unresponsive.
Hard booting the failed host meant none of the VMs could be restarted due to problems with all of the swap files. HA eventually kicked in, which I'm guessing unlocked the files.
I'm thinking I'll disable CIM as it will be some time before this gets patched if it's not even been confirmed as a bug by VMware yet.
I've not used the unsupported SSH mode for fear of breaking support agreements with VMware. I'm starting to think I should enable it, though, to give more options when situations like this happen. Can anyone tell me if SSH still works when a server has suffered this fault? I'm looking for a way to power cycle the box when this happens without needing someone on site to physically visit the server and hard power cycle it (no iLO in place at the moment).
I do not have Emulex HBAs. And my VMware completely stopped responding... no ping, no SSH, no VIC, and the console was frozen.
Nothing useful exists in the logs. It even looks like all the logs got deleted / reset. There is nothing older than Jul 30, which is when this happened.