Hello,
I am running an ESXi host at a German provider (Hetzner) with just 3 VM guests at the moment. From the beginning I have had trouble with ESXi ceasing to respond:
* Guest systems not responding
* Host system could not be reached using the vSphere Client
* SSH via PuTTY connected to the server, but the root login did not work (Access Denied)
I even tried the console, but the screen showed only the initial information on how to manage the host, and I was not able to log in to ESXi there (no response to any input either).
What I did so far:
* I looked through the syslog and vmkernel log files but did not see any failure that would make the ESXi host stop serving my requests
* I patched ESXi to the current build
* I upgraded the BIOS (the newest version is installed now)
* I ran a hardware check (all OK)
As soon as I power-cycle the host (power off / power on), it and its guests start normally.
Where can I see what happened to the SSH logins at the time of the "crash"? Do you have any advice on how to debug this behavior? It is getting a little frustrating at the moment.
Thanks for your help
Patric
Hey Patric,
Did these issues start with the 5.1 upgrade/install, or were they present even before the upgrade?
What is the hardware you are running the host on?
Also, if you do a gunzip /var/run/log/*.gz and then a cat /var/run/log/vmkernel* | grep -i esr, do you see any output?
If not, then manually checking /var/run/log/vmkwarning and vmkernel around the time of the crash should help figure out the exact issue.
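For the manual check, a small filter can narrow the logs down to the minutes around the hang. This is only a sketch: warnings_near is a made-up helper name, it assumes each log line starts with an ISO timestamp and that warnings carry a "WARNING:" tag, and it assumes the host's shell provides grep with the -h option.

```shell
# Hypothetical helper (not an ESXi command): print WARNING lines whose
# timestamp starts with the given prefix, from one or more log files.
# Assumes lines look like "2013-02-15T11:10:14.663Z cpu0:4163)WARNING: ...".
warnings_near() {
    prefix="$1"
    shift
    # First keep lines from the chosen time window, then keep only warnings.
    grep -h "^$prefix" "$@" | grep "WARNING:"
}

# Example call (paths assumed; adjust to the files present on your host):
# warnings_near "2013-02-15T11:1" /var/run/log/vmkernel.log /var/run/log/vmkwarning.log
```

A prefix such as 2013-02-15T11:1 covers 11:10 to 11:19; shorten or lengthen it to widen or narrow the window.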
Regards
a
Hi,
Thanks for your answer. I asked my provider to tell me which hardware is used; hopefully I get an answer soon.
I tried your gunzip / cat commands but got no output, so I looked through both log files manually. I cannot see what happened, but I do see some error messages I had not seen before:
2013-02-15T11:09:51.567Z cpu3:6271)WARNING: UserLinux: 1331: unsupported: (void)
2013-02-15T11:10:14.663Z cpu0:4163)WARNING: VFAT: 4346: Failed to flush file times: Stale file handle
...
Any advice here?
What is the SAN being used? Are you using NFS by any chance?
Can a copy of the vm-support be attached to this thread? Might help us figure out what the issue is. But it's OK if the end user objects, as we can always go back for more info :smileymischief:
What back end storage are you using? Or is the provider presenting you storage? As requested above it would be helpful to know hardware type and protocol for storage communication.
There are two NL-SAS/SATA drives built into this server, each with 3 TB. I created 2 datastores; there is no external storage attached.
Is your default route for management still there?
You can check by typing:
esxcli network ip route ipv4 list
We've had a few v5.x hosts randomly lose their default gateway recently for some unknown reason.
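If you want to check this without reading the table by eye, something like the following could do it. It is a rough sketch: has_default_route is a made-up name, and it assumes the route list prints the default route with the word "default" in the first (Network) column and that awk is available in the host's shell.

```shell
# Hypothetical helper: read a route table on stdin and succeed (exit 0)
# only if some row has "default" in its first column.
# The "default" label in the Network column is an assumption about the output.
has_default_route() {
    awk '$1 == "default" { found = 1 } END { exit !found }'
}

# Example call on the host:
# esxcli network ip route ipv4 list | has_default_route || echo "no default route"
```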
When you run:
/sbin/services.sh restart
do all services stop and start ok?
If some fail then check relevant logs for errors.
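The restart prints a lot, so it can help to filter the output down to the lines that look like failures. A minimal sketch, assuming failures mention "fail" or "error" somewhere in the line; restart_failures is a made-up name, and the pattern is a guess rather than a documented format:

```shell
# Hypothetical helper: keep only lines from stdin that look like failures.
# The pattern is a heuristic, not a documented services.sh output format.
restart_failures() {
    grep -i -E 'fail|error'
}

# Example call on the host (2>&1 so messages on stderr are filtered too):
# /sbin/services.sh restart 2>&1 | restart_failures
```

The matching lines tell you which service to look up in the logs.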
Hmm. Sounds similar to http://kb.vmware.com/kb/1030265
What do you see on screen if you type ALT+F12 when you are seeing the issue?
Thanks for the link. I'll check to see if this has something to do with what we're experiencing.
When I run /sbin/services.sh restart, I get these failure messages:
[...]
VobUserLib_Init failed with -1
Running SSH stop
SSH login disabled
VobUserLib_Init failed with -1
Connect to localhost failed: Connection failure
Errors:
Invalid operation requested: This ruleset is required and connot be disabled
Running SSH restart
Connect to localhost failed: Connection failure
SSH login enabled
VobUserLib_Init failed with -1
Running ESXShell restart
ESXi shell login enabled
VobUserLib_Init failed with -1
Running ntpd restart
Connect to localhost failed: Connection failure
[...]
Running memscrubd restart
The checkPages boot option is FALSE, hence memscrubd could not be started.
I am just waiting for another "crash" so I can try the advice I got here so far. Luckily, the server has been running for almost 2 days now.
That doesn't look nice. What hardware is this? HP?
My server has been running for about 4 days now (a new record). I followed the link you sent before; hopefully this was the problem.
Cool :smileygrin: Fingers crossed!
What I did over the past 7 days:
I took away the unlimited CPU resources from my running VMs and left some MHz for the host. That seemed to work: no crashes.
I put the unlimited resources back yesterday evening, and today (30 minutes ago) the server stopped responding again.
I attached a KVM console to the server (see screenshot), but pressing any key (even Ctrl-Alt-Del) did not bring the ESXi host back to life!
I have put the CPU limits back in place, and maybe this will help for the next few days.
I have a silly question. I am running ESXi 5.1 at the latest build now. Is it possible to reinstall ESXi without losing the information stored on my datastores?
I am seriously thinking of getting rid of the ESXi pre-installed by my provider and installing it myself.
BTW: The hardware they use is a "self-made" server, not one from Dell, HP, IBM or the like, so I only get to know some of the built-in hardware. Is there any tool to see what hardware is built in?
Feanwulf wrote:
BTW: The hardware they use is a "self-made" server, not one from Dell, HP, IBM or the like, so I only get to know some of the built-in hardware. Is there any tool to see what hardware is built in?
You could get some of that info from the lspci command.
Are all of your VMs running on the same physical disks on which the ESXi system is installed? How many VMs?
I wonder if you have a VM with some runaway I/O processing causing the host system to become unresponsive (which might have been mitigated by reducing the CPU resources to the VM, as you mentioned earlier). Two SATA drives don't offer very much in the way of available I/O or throughput.
I have 2 VMs running (both Debian) and 2 SATA drives as DataStore1 and DataStore2. I placed VM1 on DataStore1 and VM2 on DataStore2.
I want to install some more VMs, but that is barely possible at the moment.
You could open a PuTTY session and run
#esxcfg-info | less
That will give you the list of hardware components on the host.
Trouble is, the output is large, and searching through it can get tiring unless you know exactly what you are looking for.
So for example, if I want to know the details about the NICs, or let's say vmnic0, I can do a
#esxcfg-info -l | less
followed by
/vmnic0 and enter
That will take me to the nics
Or #esxcfg-info -l | grep -i vmnic0
Also, if the VMs are on the local datastore, take a backup before you do an install; ideally, move all VMs off the host and then do the install.
Regards
a
Sorry typo
#esxcfg-info | less should read #esxcfg-info -l | less
What CPU options have you enabled in the BIOS? And what CPU model is it?