I have a dual Xeon machine with an Intel motherboard, 4.5 GB of RAM, and 1 TB of disk on a 3ware drive array, running Ubuntu LTS.
The hardware is proven and solid, having been in production as an email server for a couple of years; we just migrated email to a newer server.
I was running VMware Server 1.0.2 and having a problem, so yesterday I upgraded to 1.0.3, and I am having the same problem.
The host OS is running fine; however, my Windows server console is unable to connect to the server, and the web-based console serves pages but is also unable to log in. The web-based console reports this error:
Unexpected response from vmware-authd: 510 Could not create lock for vmware-serverd
I assume that vmware-serverd is simply not responding.
I get a similar result from vmware-cmd:
root@ubuntu2z280:/root# vmware-cmd -l
/usr/bin/vmware-cmd: Could not connect to vmware-authd
(VMControl error -14: Unexpected response from vmware-authd: 510 Could not create lock for vmware-serverd)
The virtual machines are unresponsive. ps shows that all of the expected processes exist, but none are consuming the CPU I would expect. There is plenty of free memory; the configured virtual machines only consume 25% of the available memory. The server is on private addresses and can access the net only via NAT. The VMs are on public addresses, running Ubuntu LTS with all updates applied.
As far as I can tell, I'm not doing anything stressful. The VMs are all clones of each other. Three out of four are running apache2, serving a 2-byte index.html file. The fourth is running cacti and nagios: cacti is scanning 4 SNMP devices, and nagios is scanning 140 hosts with pings.
Everything is set up with out-of-the-box settings, with very little custom configuration. The VMs are all running VMware Tools.
Does anyone have any idea what is causing my problems?
The server will typically run fine for one or two days, then the VMs stop responding at the same time I lose access to the console. A reboot always fixes it, though the shutdown is not graceful for the VMs as far as I can tell.
Obviously the host OS is healthy, as I can ssh into the machine and run utilities such as ps and top with no problem. I've been working around Unix machines for 20 years, so the only "new" ingredients are VMware and Ubuntu.
When it works, it works well.
The one other variable is that the Ethernet switch is not on UPS power, so as storms go past, the Ethernet link has cycled several times. The power IS protected by a chain of three large GE subpanel surge suppressors, plus surge suppression on the power strip. I mention all of this just in case losing the Ethernet link is a potential cause of a lockup in the VMware code.
Of course, my ssh sessions into the server go through the same Ethernet port that is used for everything else, so the Ethernet port itself does work.
I see vmware-rtc and httpd-vmware both consume small amounts of CPU from time to time, and over a half-hour period I see that vmware-serverd consumed 0.04 seconds of CPU, so it appears to be alive. None of the vmware-vmx processes have consumed any CPU, though the cacti/nagios VM should be consuming a fair amount. In the last 24 hours the vmx processes consumed 2 hours of CPU in total; in the last hour, none.
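For anyone who wants to reproduce those CPU numbers, here is a rough sketch of how the per-process CPU time can be sampled. It assumes a Linux /proc filesystem and Python 3, and it matches any process with "vmware" in its command name; the function names are mine, not part of any VMware tool. It reads utime and stime (fields 14 and 15 of /proc/<pid>/stat) twice and reports the difference.

```python
import os
import time


def command_name(stat_line):
    """Extract the command name, which sits between the first '(' and the last ')'."""
    return stat_line[stat_line.index("(") + 1 : stat_line.rindex(")")]


def parse_cpu_ticks(stat_line):
    """Return utime + stime (in clock ticks) from one /proc/<pid>/stat line."""
    # The command name may contain spaces, so split on the last ')'
    # before splitting the remaining fields on whitespace.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state); utime and stime are fields 14 and 15.
    return int(rest[11]) + int(rest[12])


def snapshot(name_fragment="vmware"):
    """Map pid -> (command, cpu ticks) for processes whose name contains name_fragment."""
    result = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/stat") as f:
                line = f.read()
        except OSError:
            continue  # the process exited between listdir() and open()
        name = command_name(line)
        if name_fragment in name:
            result[int(entry)] = (name, parse_cpu_ticks(line))
    return result


def cpu_deltas(before, after, ticks_per_second):
    """Seconds of CPU used between two snapshots, for pids present in both."""
    return {
        pid: (name, (after[pid][1] - ticks) / ticks_per_second)
        for pid, (name, ticks) in before.items()
        if pid in after
    }


def report(interval_seconds=60):
    """Take two snapshots interval_seconds apart and print CPU used per process."""
    hz = os.sysconf("SC_CLK_TCK")
    first = snapshot()
    time.sleep(interval_seconds)
    for pid, (name, secs) in sorted(cpu_deltas(first, snapshot(), hz).items()):
        print(f"{pid:>7} {name:<16} {secs:.2f}s CPU in {interval_seconds}s")
```

Calling report() on the host while the VMs are hung should show the vmware-vmx processes at 0.00s if they really have stopped. Note that the kernel truncates the command name in this file to 15 characters.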
It looks to me as if vmware-serverd has stopped processing interrupts from the vmware-vmx processes (or something like that).
Does anyone have any clues as to what might be the cause? Could it be my Ethernet issues?
I will move my switch to a UPS later today if I have time... Just to help debug this.
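To test the Ethernet theory in the meantime, a small logger along these lines could timestamp the moment the console stops answering and the moment the link drops, so the two can be compared. This is only a sketch under some assumptions: Python 3.7+, the interface is eth0, link state is read from /sys/class/net/eth0/carrier, and a hang is detected by looking for the same "Could not connect to vmware-authd" text that vmware-cmd -l printed above. The function names are made up for illustration.

```python
import subprocess
import time
from datetime import datetime

# Error text taken from the vmware-cmd output quoted earlier in this post.
AUTHD_ERROR = "Could not connect to vmware-authd"


def console_ok(output):
    """True unless vmware-cmd printed the authd connection error."""
    return AUTHD_ERROR not in output


def check_console():
    """Run 'vmware-cmd -l'; treat a timeout or a missing binary as a failure."""
    try:
        proc = subprocess.run(
            ["vmware-cmd", "-l"],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return console_ok(proc.stdout)


def link_up(carrier_text):
    """Interpret the contents of a sysfs carrier file ('1' means link up)."""
    return carrier_text.strip() == "1"


def read_carrier(iface="eth0"):  # interface name is an assumption
    try:
        with open(f"/sys/class/net/{iface}/carrier") as f:
            return f.read()
    except OSError:
        return "0"  # unreadable carrier file is treated as no link


def describe_change(previous, current):
    """Return a log line if the (console_ok, link_up) state changed, else None."""
    if current == previous:
        return None
    console, link = current
    return "console={} link={}".format(
        "ok" if console else "FAILED", "up" if link else "DOWN"
    )


def watch(interval_seconds=60, iface="eth0"):
    """Poll forever, printing a timestamped line whenever the state changes."""
    previous = None
    while True:
        current = (check_console(), link_up(read_carrier(iface)))
        line = describe_change(previous, current)
        if line:
            print(datetime.now().isoformat(timespec="seconds"), line, flush=True)
        previous = current
        time.sleep(interval_seconds)
```

Running watch() with its output redirected to a file would leave a record of when each state change happened. Polling once a minute will miss link drops shorter than the interval, so a shorter interval may be needed to catch brief outages.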