I have also run into similar problems with my CentOS 4.4 host, and I use a variety of CentOS 4 and 5 VM guests. I notice that after about 2-3 months of running, the NAT (vmnet8) network starts to slowly degrade, finally reaching the point where all connections are refused. It is most noticeable with my port forwarding from the host IP address to a VM guest web server. Once network activity starts dropping, things on web pages stop showing up... usually the first sign is that the CSS is not loading. I have also noticed general network problems, from the guest SMTP service to trying to SSH from the host or a guest to another guest, where it would take a few minutes to respond.
I haven't figured out exactly what is causing this issue, but if anyone else is experiencing similar problems, please let us know. I do know that this behavior has been occurring from VMware Server v1.0.1 all the way to v1.0.3 (i.e. throughout the upgrade process), so whatever is causing it still lurks out there. I have been experimenting some with tcpdump but haven't found anything suspicious yet. On a note of interest, once the network starts going haywire, all I have to do to get it back in shape is to issue this command at the shell: "/usr/lib/vmware/net-services.sh restart". Once the VMware network services have restarted, the network will run just fine for 2-3 more months.
As for my host NIC card, I have the following:
eth0: RXcsums LinkChgREG MIirq ASF Split WireSpeed TSOcap
eth0: dma_rwctrl dma_mask[64-bit]
For the guest NIC cards, I have the following for both CentOS 4 and CentOS 5 guests:
eth0: registered as PCnet/PCI II 79C970A
ACPI: PCI Interrupt 0000:00:12.0[A] -> GSI 19 (level, low) -> IRQ 185
pcnet32: PCnet/PCI II 79C970A at .... assigned IRQ 185.
Any help / advice is appreciated.
The obvious workaround would be to set up a cron job to restart the NAT service during the host maintenance window.
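For anyone wanting to try that, here is a minimal sketch of such a cron entry. The restart script path is the one mentioned earlier in the thread; the schedule (02:00 on the 1st of each month) and the log file path are just examples to adapt to your own maintenance window:

```shell
# /etc/cron.d/vmware-nat-restart (example; adjust the schedule to your window)
# Restart the VMware network services at 02:00 on the 1st of every month,
# logging the output so you can check it the next morning.
0 2 1 * *  root  /usr/lib/vmware/net-services.sh restart >> /var/log/vmware-nat-restart.log 2>&1
```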
This sounds very similar to our problem: vmware-natd gradually degrades towards 100% CPU usage, blocks all network traffic except ICMP, and can only be recovered by being restarted. This happens on average about once a week on two different Ubuntu 7.04 hosts, with Ubuntu 7.04 guests and Windows XP guests, that we use in our development/testing environment. There is no obvious trigger based on load or usage. We can see it start to happen as connections start being refused (log entries in syslog) or taking an extraordinarily long time to work. Also, at one point we did some fairly extensive poking around in a slowly dying server, and it appears that networking external to the NAT died before networking internal to the NAT did.
This problem only emerged after we upgraded to 1.0.4 (although a lot of other variables have changed). One machine is a dual-processor, dual-core Dell AMD system with 8 GB of memory; the other is a quad-processor, dual-core Dell system (8 cores), also with 8 GB of memory. Here's the uname -a output showing the kernel revisions.
Linux qasrv 2.6.20-16-server #2 SMP Sun Sep 23 19:57:25 UTC 2007 i686 GNU/Linux
Linux staging 2.6.20-16-server #2 SMP Sun Sep 23 19:57:25 UTC 2007 i686 GNU/Linux
tvleavitt: I have had this problem from at least v1.0.1 all the way to v1.0.3 (I have not upgraded to 1.0.4 yet, but from what I gather the problem still persists in that version). I have also noticed something similar to you, where the external networking died first (i.e. outbound connections to the internet) and then the internal network slowly degraded (although I have always been fortunate enough to restart the network services before it died completely). As in your case, there are no obvious triggers (CPU usage is normally in the 20% range on average for the host, and CPU usage for the guests tends to be normal as well when this is occurring).
As for my kernel information, using uname -a:
Linux vmhost 2.6.9-55.0.2.ELsmp #1 SMP Tue Jun 26 14:30:58 EDT 2007 i686 i686 i386 GNU/Linux
Linux vmguest4 2.6.9-42.0.8.ELsmp #1 SMP Tue Jan 30 12:33:47 EST 2007 i686 i686 i386 GNU/Linux
Linux vmguest5 2.6.18-8.1.8.el5 #1 SMP Tue Jul 10 06:50:22 EDT 2007 i686 i686 i386 GNU/Linux
in reply to peter_vm:
I actually thought about running that VMware network restart ("/usr/lib/vmware/net-services.sh restart") in a cron job on a monthly basis in the middle of the night before you mentioned it. I wanted to wait until a couple of our production applications wind down in December, in case I ran the job early Monday morning (say 2am) and the network was not running for some reason by the time I came into my office later that morning. Unfortunately, I have no real way to test this right now outside of the production realm.
Has there been a bug filed for this? This is a known issue, as far as I'm concerned, and apparently has been so for several revisions - I also see indications that it exists in other VMware products (Fusion?). I see numerous mentions of it here on the forums, and I see a few mentions of it out on the open net in Google - all describing an essentially similar problem.
The common workaround is to take a sledgehammer to the problem and restart vmware-natd on a regular basis, but this is really not a satisfactory solution for a production or even semi-production system. ... Admittedly, you get what you pay for, but this, along with some other experiences, really takes the shine off the product for me.
Anyone out there at VMware care to comment?
We have the same problem, running VMware Server under Windows...
After only a few minutes, the NAT'ed NIC8 stops working under CentOS 5...
We have compiled the VMware Tools but that didn't change anything.
The system is currently useless, as NIC8 becomes unusable after a few minutes/hours.
Is there any fix available?
And where can I make a backup of all the port forwardings, so that I don't have to type them in by hand again if I need to reinstall VMware Server?
Thanks a lot!
I am on a Linux box, but I just make a backup copy of this file:
Be sure to put it in a separate folder or on another machine. I'm not sure of the path on Windows, but search for nat.conf in your VMware install directory and maybe you will be lucky and find it.
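As a sketch of that backup step, something like the function below keeps timestamped copies instead of overwriting the previous one. The example paths in the comment (/etc/vmware/nat.conf and the backup directory) are assumptions about a typical Linux install, not something from this thread:

```shell
#!/bin/sh
# Back up a config file into a backup directory with a timestamp suffix,
# so repeated runs keep older copies instead of overwriting them.
backup_conf() {
    src="$1"
    dest="$2"
    mkdir -p "$dest"
    cp "$src" "$dest/$(basename "$src").$(date +%Y%m%d-%H%M%S)"
}

# Example (paths are assumptions; adjust for your install):
# backup_conf /etc/vmware/nat.conf /root/vmware-backups
```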
There has been a suggestion to move NAT (and the associated routing) off VMware Server and onto a host OS service. But make sure you thoroughly test that solution, because the VMware Server application tends to be sensitive to networking functions running on the host.
thanks for your reply!
Do you have a link to that suggestion for VMware Server on Windows?
That would be great
thanks for your reply!
I found the entries for the NAT-Port-Forwarding in the Windows Registry and exported them...
Hopefully the import will work out ^^
Can someone who is experiencing this problem run the vmware-natd process (on Linux) under strace and pipe the output to a file, so we can look at which files it is not closing?
Couldn't we also just run "lsof" when this happens, and tally which files VMware-natd has open?
Yeah, but strace might help show in which situations they don't get closed, etc. The more information the better, but lsof is certainly better than nothing. Where are all the people with this issue? (Or maybe they don't get it on Linux.) Windows probably has a way to list open files as well.
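As a sketch of the kind of snapshotting being suggested here, something like this could be logged periodically while the problem develops. On Linux you can count descriptors straight out of /proc even without lsof; the vmware-natd usage and the strace invocation in the comments are assumptions about how you'd point it at the daemon:

```shell
#!/bin/sh
# Count the open file descriptors of a process, so we can see whether
# the number grows over time (a descriptor leak shows a rising count).
fd_count() {
    pid="$1"
    # /proc/<pid>/fd has one entry per open descriptor (Linux-specific).
    ls "/proc/$pid/fd" | wc -l
}

# Example usage against vmware-natd (assumes a single natd process):
#   pid=$(pidof vmware-natd)
#   echo "$(date +%s) $(fd_count "$pid")" >> /var/log/natd-fds.log
#
# For the strace capture suggested above:
#   strace -f -e trace=open,close,socket -o /tmp/natd.strace -p "$pid"
```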
We're getting it on Linux.
Has anyone looked at the bugfix list for 2.x? Is this mentioned there as a resolved issue?