Re: ESXi 5.1 Host system stopped responding

Feanwulf · ‎02-14-2013

Hello,

I am running an ESXi host at a German provider (Hetzner) with just 3 VM guests at the moment. Since the beginning I have had trouble because ESXi stops responding:

* Guest systems not responding

* Host system could not be reached using vSphere Client

* SSH using PuTTY connected to the server, BUT root login did not work (= Access Denied)

I even tried to look at the console, but the screen was blank (just the startup information on how to manage the host) and I was not able to log in to ESXi from the console (there was no response there either).

What I did so far:

* I looked at the syslog and vmkernel log files but did not see any failures that could make the ESXi host stop servicing my requests

* I patched ESXi to the current build

* upgraded the BIOS (newest installed now)

* did a hardware check (=all OK)

As soon as I restart (power off / power on), the host and its guests start normally.

Where can I look to see what happened with the SSH logins at the time of the "crash"? Do you have any advice on how to debug this behavior? It is getting a little frustrating at the moment.

Thanks for your help

Patric

a_nut_in · ‎02-14-2013

Hey Patric,

Did these issues start with the 5.1 upgrade/install, or were they present even before the upgrade?

What is the hardware you are running the host on?

Also, if you do a gunzip /var/run/log/*.gz and then a cat /var/run/log/vmkernel* | grep -i esr, do you see any output?

If not, then manually checking /var/run/log/vmkwarning and vmkernel at the time of the crash should help figure out the exact issue.
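
To narrow the manual check to the time of the crash, a small filter can help. This is only a sketch: `log_window` is a made-up helper name, the times in the usage comment are placeholders, and it assumes that each log line starts with an ISO-8601 UTC timestamp (which sorts correctly as a plain string) and that awk is available in the shell.

```shell
# Hypothetical helper: print log lines whose leading timestamp lies in [start, end].
# Works because ISO-8601 timestamps compare correctly as plain strings.
log_window() {
    # $1 = start timestamp, $2 = end timestamp; log text is read from stdin
    awk -v s="$1" -v e="$2" '$1 >= s && $1 <= e'
}

# Example use on the host (placeholder times, adjust to the time of the hang):
# cat /var/run/log/vmkernel* | log_window 2013-02-15T11:00 2013-02-15T11:30
```

Partial timestamps such as 2013-02-15T11:00 work as bounds because a shorter prefix sorts before any full timestamp that begins with it.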

Regards

a

Do remember to mark my post as "helpful" or "correct" if I've helped resolve or answer your query!

Feanwulf · ‎02-15-2013

Hi,

Thanks for your answer. I asked my provider to tell me which hardware is used; hopefully I get an answer soon.

I tried your gunzip / cat commands but got no output, so I manually looked into both log files. I cannot see what happened, BUT I do see some error messages I had not seen before:

2013-02-15T11:09:51.567Z cpu3:6271)WARNING: UserLinux: 1331: unsupported: (void)
2013-02-15T11:10:14.663Z cpu0:4163)WARNING: VFAT: 4346: Failed to flush file times: Stale file handle
...

Any advice here?

a_nut_in · ‎02-15-2013

What is the SAN being used? Are you using NFS by any chance?

Can a copy of the vm-support be attached to this thread? It might help us figure out what the issue is. But it's OK if the end user objects, as we can always go back for more info :smileymischief:

ciandro · ‎02-15-2013

What back end storage are you using? Or is the provider presenting you storage? As requested above it would be helpful to know hardware type and protocol for storage communication.

Feanwulf · ‎02-16-2013

There are two NL-SAS/SATA drives built into this server, each with 3 TB. I created 2 datastores; there is no external storage attached.

griffinboy · ‎02-16-2013

Is your default route for management still there?

You can check by typing:

esxcli network ip route ipv4 list

We've had a few v5.x hosts randomly lose their default gateway recently for some unknown reason.
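
A quick scripted version of that check might look like this. It is a sketch only: `has_default_route` is a made-up helper name, and it assumes the default route row starts with the word "default" in the first column of the esxcli output.

```shell
# Hypothetical helper: succeed if the routing table text on stdin contains
# a row whose first column is "default" (assumed esxcli output layout).
has_default_route() {
    grep -q '^default[[:space:]]'
}

# Example use on the host:
# esxcli network ip route ipv4 list | has_default_route || echo "default route missing"
```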

When you run:

/sbin/services.sh restart

do all services stop and start OK?

If some fail then check relevant logs for errors.
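
The restart prints a lot of text, so it can help to keep only the lines that look like failures. A rough sketch, assuming failures mention "fail" or "error" somewhere in the line; `failed_lines` is a made-up helper name and the script path follows the /sbin/services.sh spelling used elsewhere in this thread.

```shell
# Hypothetical helper: keep only lines on stdin that mention a failure or error.
failed_lines() {
    grep -i -E 'fail|error'
}

# Example use on the host (stderr is merged in so nothing is missed):
# /sbin/services.sh restart 2>&1 | failed_lines
```

An empty result does not prove that every service is healthy; it only means no line matched the two keywords.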

VCPID: 40118 (VCP310, VCP4)

a_nut_in · ‎02-16-2013

Hmm. Sounds similar to http://kb.vmware.com/kb/1030265

What do you see on screen if you press ALT+F12 when you are seeing the issue?

griffinboy · ‎02-16-2013

Thanks for the link. I'll check to see if this has something to do with what we're experiencing.

Feanwulf · ‎02-16-2013

When I run /sbin/services.sh restart, I get these failure messages:

[...]

VobUserLib_Init failed with -1
Running SSH stop
SSH login disabled
VobUserLib_Init failed with -1
Connect to localhost failed: Connection failure
Errors:
Invalid operation requested: This ruleset is required and connot be disabled
Running SSH restart
Connect to localhost failed: Connection failure
SSH login enabled
VobUserLib_Init failed with -1
Running ESXShell restart
ESXi shell login enabled
VobUserLib_Init failed with -1
Running ntpd restart
Connect to localhost failed: Connection failure

[...]

Running memscrubd restart
The checkPages boot option is FALSE, hence memscrubd could not be started.

I am just waiting for another "crash" so I can try the advice I have received here so far. Lucky me, the server has been running for almost 2 days now.

a_nut_in · ‎02-16-2013

That doesn't look nice. What hardware is this? HP?

Feanwulf · ‎02-18-2013

My server has been running for about 4 days now (new record). I used the link you sent before; hopefully this was the problem.

a_nut_in · ‎02-18-2013

Cool :smileygrin: Fingers crossed!

Feanwulf · ‎02-22-2013

What I did in the past 7 days:

I took away the unlimited CPU resources from my running VMs and left some MHz for the host. That seemed to work: no crashes.

I set the resources back to unlimited yesterday evening, and today (30 minutes ago) the server stopped responding again.

I attached a KVM console to the server (see screenshot), but pressing any key (even Ctrl-Alt-Del) did not bring the ESXi host back to life!

I have put the CPU limits back in place, and maybe this will help for the next few days.

I have a silly question. I am running ESXi 5.1 at the latest build now. Is it possible to reinstall ESXi without losing the information stored on my datastores?

I am really thinking of getting rid of the ESXi pre-installed by my provider and installing it myself.

BTW: The hardware they use is a "self-made" server, not a server from Dell, HP, IBM or the like, so I only know some of the built-in hardware. Is there any tool to see which hardware is built in?

jdptechnc · ‎02-22-2013

Feanwulf wrote:
BTW: The hardware they use is a "self-made" server, not a server from Dell, HP, IBM or the like, so I only know some of the built-in hardware. Is there any tool to see which hardware is built in?

You could get some of that info from the lspci command.
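
lspci prints every PCI device, so filtering the output helps. A rough sketch: `pci_of_interest` is a made-up helper name, and the keyword list is only a guess at what matters for this problem (network and storage controllers).

```shell
# Hypothetical helper: keep only PCI device lines that look like
# network or storage controllers (keyword list is an assumption).
pci_of_interest() {
    grep -i -E 'ethernet|network|sata|sas|raid|storage'
}

# Example use on the host:
# lspci | pci_of_interest
```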

Please consider marking as "helpful", if you find this post useful. Thanks!... IT Guy since 12/2000... Virtual since 10/2006... VCAP-DCA #2222

jdptechnc · ‎02-22-2013

Are all of your VMs running on the same physical disks on which the ESXi system is installed? How many VMs?

I wonder if you have a VM with some runaway I/O processing causing the host system to become unresponsive (which might have been mitigated by reducing the CPU resources to the VM, as you mentioned earlier). 2 SATA drives don't offer very much in the way of available I/O or Kbps.

Feanwulf · ‎02-22-2013

I have 2 VMs running (both Debian) and 2 SATA drives as DataStore1 and DataStore2. I placed VM1 on DataStore1 and VM2 on DataStore2.

I want to install some more VMs, but this is barely possible at the moment.

a_nut_in · ‎02-22-2013

You could open a PuTTY session and run

#esxcfg-info | less

That will give you the list of hardware components on the host.

Trouble is, it's a large file, and searching through the list can get tiring unless you know exactly what you are looking for.

So, for example, if I want to know the details about the NICs, or let's say vmnic0, I can do a

#esxcfg-info -l | less

followed by

/vmnic0 and Enter

That will take me to the NICs

Or #esxcfg-info -l | grep -i vmnic0

Also, if the VMs are on the local datastore, take a backup before you do an install; ideally, move all VMs off the host and then do the install.

Regards

a

a_nut_in · ‎02-22-2013

Sorry typo

#esxcfg-info | less should read #esxcfg-info -l | less

griffinboy · ‎02-23-2013

What CPU options have you enabled in the BIOS? And what CPU model is it?
