I have an ESX 4.0.0 Build 164009 box that disconnected from vCenter (4.0.0 - Build 258672) and can no longer be reached or managed by any means other than the console. The disconnect occurred about 3-4 hours after I added vCenter, upgraded the vSphere client, and upgraded the host's license (via vCenter) from ESX Standard to Enterprise Plus; the host was previously managed directly by the vSphere client as a standalone host. I also added a second ESX host (build 261974) to the Datacenter in vSphere, but turned on no features and performed no updates to either box. Update Manager does show that the box with issues needs quite a few updates, but that will have to wait until later, or until I can get the running machines off the box. Once I saw the number of updates needed, I disabled the Update Manager plug-in.
Here's a quick synopsis of what works and what doesn't:
Works:
1) Console access
2) All machines on the box are running with no issues
Not working:
1) Cannot connect via SSH. Telnet to port 22 shows nothing.
2) Unable to ping the box from other machines
3) From the console, if I attempt to ping out to the gateway or any other box, I receive "ping: sendmsg: Operation not permitted"
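For anyone who wants to repeat the port 22 check from a workstation without telnet, here is a minimal sketch in Python. It is only an illustration of the test I ran, not a diagnostic tool: the function names are my own, and checking 443 and 902 alongside 22 is an assumption on my part about which management ports are worth probing.

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False


def check_ports(host, ports, timeout=3.0):
    """Map each port to True (connectable) or False (not connectable)."""
    return {port: tcp_port_open(host, port, timeout) for port in ports}
```

Calling `check_ports("192.0.2.10", (22, 443, 902))` with the host's real address in place of the placeholder would report each port as connectable or not. A result of all False only confirms the host is unreachable over the network; it does not say why.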
What I've tried so far:
1) Restarting management agents based on http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=100349...
2) When I attempt to restart the management agents, the ESX Server Host Agent hangs on stopping. I then manually killed the PID for hostd. Once that is done, I am able to start the host agent, but I still cannot connect to the server via SSH, vCenter, etc. and have the same issues as above.
3) I have verified that all NICs are up and that the switch they are connected to shows UP on the appropriate ports.
4) I verified that disk space is not an issue (<40% used on / and only 4% used on /log)
Extra Info:
1) I have running production machines that have not been impacted other than difficulty in managing them.
2) I will be unable to help with this situation starting tomorrow for a few days (Murphy strikes) and am hoping there is an easy/safe fix that allows me to keep the host up without a restart. I am weighing leaving it as is vs. restarting and risking the entire system refusing to come up.
3) All storage is local.
4) esxcfg-route shows the VMkernel default gateway as 0.0.0.0. Not sure this is an issue, as other hosts that are working show the same.
5) I am showing 6 instances of vmware-watchdog running and 4 instances of vmware-vimsh running.
6) Looking at esxtop, load is 0.15, 0.25 and 0.21
7) Looking at top, load is 1.2, 1.1 and 1.1
Box is a Dell 2950 with two dual-core CPUs and 16GB of RAM
Any thoughts and suggestions are most welcome.