What hardware do you use? It may be a good idea to run a prolonged stress test on the host.
When you say "crashing", what exactly do you mean? A kernel panic? Does the host seem unresponsive? Or does the host spontaneously reboot itself? What are the specs of your host machine? How are your guest VMs configured? How much RAM is each guest using?
The host is an AMD dual-core 64-bit 4200+ with 2 GB of RAM, running Debian with the 2.6.18-6-amd64 kernel. The guests are as follows:
2 Debian guests (2.6.18-6-486), each with 256 MB of RAM, both using bridged networking
1 Windows Server 2003 R2 64-bit guest with 768 MB of RAM, using host-only networking, with iptables on the host routing traffic to it.
All guests stop responding to pings and the system is not accessible. The system is at a hosting company, and they have told me that when they go to reboot it, the display is blank and the system does not respond. After they reboot it, everything works fine for about a week, or until I start installing software in one of the guests.
Try keeping a history of the load averages reported by uptime as you do this. If possible, keep an SSH session open while you work with VMware, as the shell is usually the last thing to go. It is possible that the system is overtaxed. I have experienced something similar with a web server and MySQL when the load averages reported by uptime exceeded 45. The server was still up, but so slow as to be practically unresponsive. I had thought it was crashing, when it was just being driven into the ground by sustained traffic.
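A minimal sketch of keeping that history, assuming a Python 3 interpreter is available on the host: it appends the 1-, 5- and 15-minute load averages (the same three numbers uptime prints) to a file once a minute and fsyncs each line, so the record survives even if the SSH session dies with the machine. The log path and interval are arbitrary, hypothetical choices.

```python
import os
import time
from datetime import datetime

# Hypothetical location; any path on a persistent local disk will do.
LOG_PATH = "/var/tmp/loadavg-history.log"


def format_sample(now, loads):
    """Return one log line: ISO timestamp, then the 1-, 5- and 15-minute load averages."""
    one, five, fifteen = loads
    return f"{now.isoformat(timespec='seconds')} {one:.2f} {five:.2f} {fifteen:.2f}\n"


def log_forever(path=LOG_PATH, interval_s=60):
    """Append one sample every interval_s seconds until the process is killed."""
    while True:
        line = format_sample(datetime.now(), os.getloadavg())
        with open(path, "a") as f:
            f.write(line)
            f.flush()
            os.fsync(f.fileno())  # force the line onto disk so it survives a hard hang
        time.sleep(interval_s)
```

Start log_forever() in the background (for example under nohup), and after the next hang, the last lines of the file show whether the load was climbing before the machine stopped responding.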
It just happened again. I installed VMware Tools on the 2 Debian guests, which seemed to make things run more smoothly, and I set the clocksource to pit on both. This time it went down after I had started a "build all thumbnails" task in Gallery2 in one of the guests.
Any shell windows that are open when this happens are terminated, and neither the host nor the guests respond to pings. What should I look for, since the logs don't show anything?
It sounds like the host may be underpowered RAM-wise. It is also possible that having 3 guests on the machine causes a lot of disk access, making it seem unresponsive. How long have you waited after it "hangs" to see if it starts responding again? I'm not sure if this is possible, but what if you turn off one or two VMs and re-run the "build all thumbnails" task? Does it still hang?
Also, the ISP says that when they go to reboot the machine, the screen is unresponsive. Is there any disk activity? Apart from a physical hardware fault, the most likely problem is that you are overtaxing the system, causing it to seem unresponsive.
I'm only giving the guests 256 MB, 256 MB and 768 MB, out of the system's total of 2 GB. The other 2 guests are pretty much idle. One has nothing on it but a plain Debian Etch install. The other has win2k3 on it and isn't doing much either.
I have let it sit for an hour and it still did not come back: no ping and no SSH access. The hosting company says there is no video on the screen, but didn't say whether there was disk activity. I will ask them to check that next time.
I just switched all guests to host-only mode and forwarded the traffic via iptables, because the hosting company says they keep seeing the card go into promiscuous mode (which is necessary when you are doing bridged networking). Since switching everything to host-only, network speeds have increased significantly, both between guests and between guest and host. Before, SCP'ing a file would cause the display to stall, and a 485 MB file would take 20 minutes to copy from host to guest. This now takes seconds. I'm hoping that this was what caused the crashing.
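The host-only setup described above can be sketched roughly as follows. This only builds the iptables command strings (nothing is executed), under these assumptions: the public interface is eth0, the guest addresses and ports are hypothetical examples, and each guest uses the host's host-only adapter address (typically vmnet1 on VMware Server) as its default gateway so replies are routed back through the host.

```python
from dataclasses import dataclass


@dataclass
class Forward:
    proto: str        # "tcp" or "udp"
    public_port: int  # port clients connect to on the host's public address
    guest_ip: str     # guest's host-only address (hypothetical in the example below)
    guest_port: int   # port the service listens on inside the guest


def iptables_rules(forwards, wan_if="eth0"):
    """Return shell commands that DNAT traffic arriving on wan_if to host-only guests."""
    rules = [
        "sysctl -w net.ipv4.ip_forward=1",  # let the host route between interfaces
        # accept replies belonging to connections that are already tracked
        "iptables -A FORWARD -m state --state ESTABLISHED,RELATED -j ACCEPT",
        # let guests reach the outside world through the host's public address
        f"iptables -t nat -A POSTROUTING -o {wan_if} -j MASQUERADE",
    ]
    for fw in forwards:
        dest = f"{fw.guest_ip}:{fw.guest_port}"
        # rewrite the destination of incoming packets to the guest
        rules.append(
            f"iptables -t nat -A PREROUTING -i {wan_if} -p {fw.proto} "
            f"--dport {fw.public_port} -j DNAT --to-destination {dest}"
        )
        # FORWARD sees the packet after DNAT, so match the guest's address and port
        rules.append(
            f"iptables -A FORWARD -p {fw.proto} -d {fw.guest_ip} "
            f"--dport {fw.guest_port} -j ACCEPT"
        )
    return rules
```

For example, printing iptables_rules([Forward("tcp", 8080, "192.168.100.10", 80)]) lists the commands to review and then run as root. If the FORWARD chain's policy is already ACCEPT, the two ACCEPT rules are redundant but harmless.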
I'm waiting for the hosting company to resolve an IP address issue before I try to rebuild thumbs again.
Keep an SSH shell open, and have top running as you rebuild the thumbnails. Pay close attention to the load averages shown in top's header line (the same values uptime reports). Here is a link in case you are unfamiliar with its use: http://en.wikipedia.org/wiki/Top_(Unix)
This should give you a good idea of what process is going awry, and how it is going awry. From there, you can begin to narrow down the issues.
The hosting company fixed my IP address issue, and I rebuilt the thumbnails, which previously crashed the server. It no longer crashes, and processor- or disk-intensive tasks no longer cause the other guests or the host to slow down.
I think the NIC didn't support promiscuous mode, and switching everything to host-only fixed it.
I'm still having a random crash issue. It's no longer as pronounced as it was before: my uptimes are usually in the 30-day range now. But it still crashes at random. Any thoughts on where I can look to see what's going on?
I haven't found anything useful in any logs but I may not be looking in all the right places.
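One way to check the usual places systematically is a crude keyword scan. The sketch below assumes Debian's default syslog files and uses an illustrative, non-exhaustive list of kernel message fragments that often precede a hang. If the machine locks up hard, nothing may reach the disk at all, which would itself be consistent with logs that "show nothing".

```python
from pathlib import Path

# Kernel message fragments that often show up before a hang or forced reboot.
# Illustrative only; matching is a case-insensitive substring test, so expect
# some false positives (for example, "debug:" also contains "bug:").
SUSPICIOUS = [
    "Out of memory",
    "oom-killer",
    "Call Trace",
    "BUG:",
    "soft lockup",
    "Kernel panic",
    "Machine check",
]

# Debian's default syslog targets; other distributions may use other files.
DEFAULT_LOGS = ["/var/log/kern.log", "/var/log/messages", "/var/log/syslog"]


def scan_lines(lines, patterns=SUSPICIOUS):
    """Yield (line_number, line) for every line containing any of the patterns."""
    lowered_patterns = [p.lower() for p in patterns]
    for number, line in enumerate(lines, start=1):
        lowered = line.lower()
        if any(p in lowered for p in lowered_patterns):
            yield number, line.rstrip("\n")


def scan_files(paths=DEFAULT_LOGS):
    """Print every suspicious line from each log file that exists."""
    for path in paths:
        p = Path(path)
        if not p.exists():
            continue
        with p.open(errors="replace") as f:
            for number, line in scan_lines(f):
                print(f"{path}:{number}: {line}")
```

Running scan_files() after a reboot and looking at the timestamps just before the crash is usually more productive than reading the logs end to end.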
I am having the same issue on a CentOS 5 host with 3 guests (1 w2k3 and 2 XP).
I have noticed that the NIC goes into promiscuous mode (an Intel dual-port Gigabit adapter).
I have tried everything. I came to believe the top command was rubbish: when I used the ps command to show the CPU load on each of the 8 CPUs (dual core as well), it showed minimal load on all of them, even though top would show ridiculous numbers like 120+ for the vmware-vmx processes.
Is it possible to start each of the VMs as a differently named process (i.e., something other than vmware-vmx) so I can tell them apart?
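The 120+ figure is likely not a bug. In its default (Irix) mode, procps top reports each process's %CPU relative to a single CPU (the `I` key toggles this), so a multi-threaded vmware-vmx process busy on several CPUs can exceed 100%. ps, by contrast, reports CPU time divided by how long the process has been running, which averages away short bursts. The arithmetic, as a small sketch (the clock-tick rate of 100 per second is an assumption; os.sysconf("SC_CLK_TCK") gives the real value on Linux):

```python
def cpu_percent(delta_ticks, elapsed_s, clk_tck=100):
    """CPU usage over an interval, relative to ONE CPU (top's default Irix mode).

    delta_ticks is the change in the process's user+system time, in clock ticks,
    measured over elapsed_s seconds of wall-clock time.
    """
    return 100.0 * delta_ticks / (clk_tck * elapsed_s)


def cpu_percent_of_machine(delta_ticks, elapsed_s, ncpus, clk_tck=100):
    """The same usage expressed as a share of all CPUs (top's Solaris mode)."""
    return cpu_percent(delta_ticks, elapsed_s, clk_tck) / ncpus
```

For instance, 240 ticks of CPU time in 2 seconds is 120% of one CPU, but only 15% of an 8-CPU machine.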
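Renaming the processes may not be necessary: each vmware-vmx process is normally started with the path of its .vmx configuration file on its command line, which identifies the VM (in top, the `c` key toggles display of full command lines). A minimal Linux-only sketch that reads /proc/&lt;pid&gt;/cmdline and maps PIDs to .vmx paths, assuming the .vmx path appears as its own argument, as it typically does:

```python
import os
from pathlib import Path


def vmx_config_from_cmdline(raw):
    """Given the raw bytes of /proc/<pid>/cmdline, return the .vmx path for vmware-vmx."""
    # Arguments are NUL-separated; drop empty pieces such as the trailing terminator.
    args = [a.decode(errors="replace") for a in raw.split(b"\0") if a]
    if not args or os.path.basename(args[0]) != "vmware-vmx":
        return None
    # The configuration file is usually the last argument, so search from the end.
    for arg in reversed(args):
        if arg.endswith(".vmx"):
            return arg
    return None


def list_vmware_vmx():
    """Return {pid: vmx_path} for every running vmware-vmx process."""
    found = {}
    for entry in Path("/proc").iterdir():
        if not entry.name.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        try:
            raw = (entry / "cmdline").read_bytes()
        except OSError:
            continue  # the process exited or access was denied
        vmx = vmx_config_from_cmdline(raw)
        if vmx is not None:
            found[int(entry.name)] = vmx
    return found
```

Matching these PIDs against the busy processes in top shows which guest is consuming the CPU.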
The server will stay up for a few days but then crash as you describe, with absolutely nothing from SSH or ping on either the guests or the host. I got a kernel panic when I updated to the latest x.x.19el5PAE kernel, but have since resolved that (it was a grub.conf issue).
What do you mean by switching everything to host-only? Any other suggestions, all?
I switched all of my VMs to host-only networking, and then set up iptables to forward the incoming traffic to the host-only IP addresses. That worked for me. I'm not sure what the issue is now. The server is back up and I don't see anything in:
Okay, now it has crashed twice in two days. The only change was adding BES software to my VM that is running Exchange. I didn't increase the memory allocated to that VM because I don't have much physical memory left and can't really add more to this box.
I currently have the "swap some memory" option turned on. Should I set it to "Fit all virtual machine memory into reserved host RAM", "Allow some", or "Allow most"?
It just happened again. I've waited hours and it's still unresponsive. I have asked the hosting company to switch the hardware and check for disk activity before rebooting. Every time, they say the screen is blank, probably because the screen saver kicked in.
Of course, my cable modem at home crapped out before the server crashed, so my open shells to it aren't helpful. It now seems to crash at random without my doing anything memory- or disk-intensive on the server.
Would a different OS be more stable? I'm using Debian, and since Ubuntu is Debian-based, is there any benefit to running Ubuntu or another OS? I just don't know where to look, as the logs show nothing and the VMs don't do this when run from another machine.