sysjno
Contributor

VMware just stopped responding

I came in this morning to find that my VMware ESXi 3.5 Build 158869 host was not responding. All VMs were unresponsive, and the console was displayed but neither F2 nor F12 did anything. VIC couldn't connect, and I could not connect over SSH either. I had to physically power off the server and turn it back on again. /var/log/messages starts at that time, so how do I find out what happened and how do I stop it from happening again?

MHAV
Hot Shot

Here is what I would do:

  • Get a syslog server such as Kiwi or any other (most of them are freeware).

  • Set up your ESX host to feed the syslog server. You can do that in the Advanced Options in the Configuration section of the ESX host.

  • Check that your hardware's BIOS is a supported version, and upgrade it if necessary.

  • Install the latest patches for your ESX server.

  • The next time it happens, look up the log files that were sent to the syslog server and give them to VMware Support.
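If you just want to see what the host sends before installing a full syslog product, a throwaway receiver is enough. The following is a minimal sketch, not a replacement for a real syslog server: it assumes the host sends plain UDP syslog to port 514 (the conventional syslog port), and the output file name `esx-syslog.log` is a made-up example.

```python
import socketserver
from datetime import datetime

LOG_PATH = "esx-syslog.log"  # hypothetical output file, change as needed


def format_record(source_ip, payload, received_at):
    """Build one log line: receive time, sender address, raw syslog text."""
    text = payload.decode("utf-8", errors="replace").rstrip("\r\n\x00")
    return f"{received_at.isoformat(timespec='seconds')} {source_ip} {text}"


class SyslogHandler(socketserver.BaseRequestHandler):
    def handle(self):
        # For UDP servers, self.request is a (data, socket) pair.
        data, _sock = self.request
        line = format_record(self.client_address[0], data, datetime.now())
        with open(LOG_PATH, "a", encoding="utf-8") as fh:
            fh.write(line + "\n")


def serve(host="0.0.0.0", port=514):
    """Listen for UDP syslog datagrams until interrupted."""
    with socketserver.UDPServer((host, port), SyslogHandler) as server:
        server.serve_forever()
```

Call `serve()` to start listening. Binding to port 514 usually needs administrator or root rights on the receiving machine. Because the records are stored off the host, they survive even if the host's own logs start fresh after a reboot.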

If you find this piece of information helpful, I'd be glad if you awarded some points for it.

Regards,
Michael Haverbeck
Check out my blog: www.the-virtualizer.com
dickybird
Enthusiast

Look at the following:

ESX Server host agent log - /var/log/vmware/hostd.log - Contains information on the agent that manages and configures the ESX Server host and its virtual machines

/var/log/vmware/vpx - for VC agent logs

/var/log/vmksummary

/var/log/messages - for Service Console errors - It rotates to files with a .x extension, so you may want to check the date/time stamps

Also check whether anything new was added to your environment recently, and whether any hardware repairs were done recently.
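To check the date/time stamps on a log and its rotated copies in one go, something like this could help. It is only a sketch: it assumes Python is available where the logs are (for example on a machine holding a copy of them) and that rotated copies share the base file name as a prefix. The function name `rotated_logs` is made up for this example.

```python
import glob
import os
from datetime import datetime


def rotated_logs(base_path):
    """Return (path, modified_time) for base_path and its rotated copies, oldest first."""
    # Assumption: rotated files are named like the base file plus a suffix.
    pattern = glob.escape(base_path) + "*"
    paths = [p for p in glob.glob(pattern) if os.path.isfile(p)]
    stamped = [(p, datetime.fromtimestamp(os.path.getmtime(p))) for p in paths]
    return sorted(stamped, key=lambda item: item[1])
```

Calling `rotated_logs("/var/log/messages")` and printing each pair shows quickly whether any file predates the hang.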

dominic7
Virtuoso
azn2kew
Champion

You can try the "vm-support" command and upload the resulting file to VMware Support so they can analyze the problem. Otherwise, use the methods recommended above as a long-term monitoring and error-checking strategy.

If you found this information useful, please consider awarding points for "Correct" or "Helpful". Thanks!!!

Regards,

Stefan Nguyen

VMware vExpert 2009

iGeek Systems Inc.

VMware, Citrix, Microsoft Consultant

Dave_Mac
Contributor

I'm getting exactly the same symptoms on all our ESXi servers here, all at version 169697 (U4). The BIOS and HBAs have all been updated. It's really annoying: the DCUI doesn't work and the VIC can't connect to the server, but you can still ping it. HA doesn't kick in because VC doesn't see the host as dead, yet you lose all network connectivity to the host's VMs. The only workaround I've got at the moment is to jump into the ESXi console and run /sbin/services.sh restart, which saves me having to cold boot the host. The weird thing is that the host realises it's isolated: our HA settings are to power down the VMs, and it does this. It seems to be random, and no two boxes fail at the same time.

I've got the logs being pumped out to a syslog server, but they're about as much use as a chocolate teapot. There's nothing obvious in there; the only thing that gets me wondering is these events:

13:49:19 CreateDefaultSelfCheckSettings failed to get TopLevelSystem

A couple of tasks and then

13:57:48 ***1369 Error accepting SSL connection -- exiting

13:59:04 CreateDefaultSelfCheckSettings failed to get TopLevelSystem

Couple more tasks

14:00:23 auto-backup.sh seems to kickoff

14:00:23 vmwarerootwatch.sh seems to kick off

14:00:24 Setting RTC date 'n' time

14:02:44 starts resetting the VMs

Lines in italics are direct copies of the log; lines not in italics are my interpretation of what the log says. We've got a mix of ESXi and ESX managed by the same VC, and this only happens on ESXi. All are HP DL380 G5s, SAN attached.
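When comparing failures across several hosts, it can help to turn a timeline like the one above into gaps between events. Here is a rough sketch that does this for lines starting with an HH:MM:SS stamp. It assumes all events fall on the same day, and the helper names `parse_events` and `gaps` are made up for illustration.

```python
from datetime import datetime


def parse_events(lines):
    """Return (time, message) pairs for lines that start with HH:MM:SS."""
    events = []
    for line in lines:
        stamp, _, message = line.strip().partition(" ")
        try:
            when = datetime.strptime(stamp, "%H:%M:%S")
        except ValueError:
            continue  # skip lines without a leading timestamp
        events.append((when, message))
    return events


def gaps(events):
    """Return (seconds_since_previous_event, message) for each event after the first."""
    # Assumption: events are in order and do not cross midnight.
    return [
        (int((cur[0] - prev[0]).total_seconds()), cur[1])
        for prev, cur in zip(events, events[1:])
    ]


sample = [
    "13:49:19 CreateDefaultSelfCheckSettings failed to get TopLevelSystem",
    "13:57:48 ***1369 Error accepting SSL connection -- exiting",
    "13:59:04 CreateDefaultSelfCheckSettings failed to get TopLevelSystem",
]
for seconds, message in gaps(parse_events(sample)):
    print(f"+{seconds}s  {message}")
```

Feeding it the syslog output from each failed host makes it easier to see whether the same sequence repeats with similar timing.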

Anyone any ideas?

Dave_Mac
Contributor

Ignore the previous; I can confirm we also have Emulex HBAs.

At 14:01:08 we got: sfcb Process "emulex" PID is 1096879.

1 minute and 34 seconds later the first machine was reset. Looks like a bug.

spig777
Contributor

Same problem here: DL580s with Emulex HBAs. HA didn't kick in because the host could still be pinged, and F2 and F12 were unresponsive at the console.

After hard booting the failed host, none of the VMs could be restarted due to problems with all of the swap files. HA eventually kicked in, which I'm guessing unlocked the files.

I'm thinking I'll disable CIM as it will be some time before this gets patched if it's not even been confirmed as a bug by VMware yet.

I've not used the unsupported SSH mode for fear of breaking support agreements with VMware. I'm starting to think I should enable it, though, to give more options when situations like this happen. Can anyone tell me if SSH still works when a server has suffered this fault? I'm looking for a way to power cycle the box when this happens without needing someone on site to physically visit the server and hard power cycle it (no iLO in place at the moment).

sysjno
Contributor

I do not have Emulex HBAs. And my VMware completely stopped responding... no ping, no SSH, no VIC, and the console was frozen.

Nothing useful exists in the logs. It even looks like all the logs got deleted / reset. There is nothing older than Jul 30, which is when this happened.
