VMware Cloud Community
tomaddox
Enthusiast

ESX server not connected in VI client

Starting today, one of my ESX Servers shows up as not connected in the VI Client (VI 2.0.2). I tried disconnecting and reconnecting, and the reconnect took about half an hour to time out. I did some investigating, and it turned out that all the ESX servers had wildly different dates and times set. I reconfigured ntpd (we had decommissioned our old NTP server) and resynced the time with ntpdate. Now the rogue ESX Server gets far enough that I am eventually prompted for a username and password when reconnecting, but the host will not reconnect, eventually failing with the error "Unable to access the specified host. It either does not exist, the server software is not responding, or there is a network problem."

/var/log/messages shows the following output:

Dec 26 13:27:36 mcmsfobve03 /usr/lib/vmware/hostd/vmware-hostd[28207]: Accepted password for user vpxuser from 172.16.28.99
Dec 26 14:37:45 mcmsfobve03 /usr/lib/vmware/hostd/vmware-hostd[28207]: Accepted password for user root from 172.16.28.99
Dec 26 14:39:48 mcmsfobve03 /usr/lib/vmware/hostd/vmware-hostd[28207]: Accepted password for user root from 172.16.28.99
Dec 26 14:39:48 mcmsfobve03 passwd(pam_unix)[6938]: password changed for vpxuser

/var/log/vmware/hostd.log shows:

[2007-12-26 14:39:47.911 'App' 79944624 verbose] Accepted authd connection from:172.16.28.99:2833
[2007-12-26 14:39:48.069 'TaskManager' 134192048 info] Task Created : haTask--vim.SessionManager.login-32
[2007-12-26 14:39:48.239 'Vimsvc' 134192048 info] [Auth]: User root
[2007-12-26 14:39:48.355 'ha-eventmgr' 134192048 info] Event 15 : User root@172.16.28.99 logged in
[2007-12-26 14:39:48.356 'TaskManager' 134192048 info] Task Completed : haTask--vim.SessionManager.login-32
[2007-12-26 14:39:48.407 'TaskManager' 75213744 info] Task Created : haTask-ha-folder-root-vim.host.LocalAccountManager.createUser-33
[2007-12-26 14:39:48.477 'TaskManager' 75213744 info] Task Completed : haTask-ha-folder-root-vim.host.LocalAccountManager.createUser-33
[2007-12-26 14:39:48.481 'Vmomi' 75213744 info] Activation [N5Vmomi10ActivationE:0xa156978] : Invoke done [createUser] on [vim.host.LocalAccountManager:ha-localacctmgr]
[2007-12-26 14:39:48.489 'Vmomi' 75213744 info] Throw vim.fault.AlreadyExists
[2007-12-26 14:39:48.493 'Vmomi' 75213744 info] Result:(vim.fault.AlreadyExists) {
   name = "vpxuser"
   msg = ""
}
[2007-12-26 14:39:48.602 'TaskManager' 3076452480 info] Task Created : haTask-ha-folder-root-vim.host.LocalAccountManager.updateUser-34
[2007-12-26 14:39:48.879 'TaskManager' 3076452480 info] Task Completed : haTask-ha-folder-root-vim.host.LocalAccountManager.updateUser-34
[2007-12-26 14:39:48.923 'TaskManager' 79944624 info] Task Created : haTask--vim.AuthorizationManager.setEntityPermissions-35
[2007-12-26 14:39:48.950 'TaskManager' 79944624 info] Task Completed : haTask--vim.AuthorizationManager.setEntityPermissions-35
[2007-12-26 14:39:50.001 'ha-eventmgr' 66186160 info] Event 16 : User root logged out
[2007-12-26 14:39:50.023 'VmdbAdapter' 66186160 verbose] Removed vmdb connection /db/connection/#8/

DRS is enabled in the cluster, but HA is not.

Basically, the error seems to indicate that the vpxuser account already exists, which is true. Deleting the account with vipw seems to have improved the initial responsiveness of the ESX host when reattaching it to the cluster (i.e., it prompts for a username and password more quickly), but it still will not reconnect. The virtual machines are still running, so I'm reluctant to just reboot the server, especially without understanding what's going on.

Any thoughts?

20 Replies
mstahl75
Virtuoso

Verify that DNS resolution works correctly on the problem host. Also make sure that both the FQDN and the short name for the host exist in /etc/hosts, and that the host name part uses the same case in both:

IP hostname.domain.tld hostname
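
To make that /etc/hosts check concrete, here is a minimal Python sketch. It is only an illustration: the function name check_hosts_entry, the host name esx03.example.com, and the address 192.0.2.10 are hypothetical, and the function only looks at the text you pass in (it does not test DNS).

```python
def check_hosts_entry(hosts_text, fqdn):
    """Return a list of problems with the /etc/hosts-style entry for fqdn."""
    short = fqdn.split(".", 1)[0].lower()
    for raw in hosts_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # ignore comments and blank lines
        if not line:
            continue
        names = line.split()[1:]  # first field is the IP address
        full = [n for n in names if n.lower() == fqdn.lower()]
        if not full:
            continue  # this line is for some other host
        brief = [n for n in names if n.lower() == short]
        if not brief:
            return ["short name missing from the FQDN line"]
        # Compare the host part of the FQDN as written with the short name as written.
        if full[0].split(".", 1)[0] != brief[0]:
            return ["host name case differs between FQDN and short name"]
        return []
    return ["no line lists the FQDN"]


# Hypothetical entry in the "IP hostname.domain.tld hostname" layout.
sample = "192.0.2.10 ESX03.example.com esx03\n"
print(check_hosts_entry(sample, "esx03.example.com"))
```

An empty list means the line has both names with matching case. To check a real host, pass in the contents of its /etc/hosts and its own FQDN.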

jayolsen
Expert

I assume you are using Virtual Center. Have you tried restarting the Virtual Center service on your VC Windows server?

IB_IT
Expert

Check to make sure you have sufficient service console (SC) memory available. At the command prompt, type:

free

and see what you get.

admin
Immortal

The logs about vpxuser already existing are expected - the VC server always tries to create that account when adding a host. Is vpxa already installed on the host? Check out the vpxa logs in /var/log/vmware/vpx/ for any errors.

tomaddox
Enthusiast

Ah ha. Here's what's in that log:

[2007-12-27 08:44:06.385 'App' 6495152 info] [VpxLRO] -- BEGIN task-internal-33 --  -- [vpxa:getChanges]
[2007-12-27 09:03:58.878 'App' 7535536 error] [VpxdHalVmHostAgent] Call to GetSummary failed: vmodl.fault.HostNotReachable
[2007-12-27 09:44:34.026 'App' 6495152 error] [VpxaMoService] Throwing HostCommunication error: vmodl.fault.HostNotReachable
[2007-12-27 09:44:34.034 'App' 6495152 info] [VpxLRO] -- FINISH task-internal-33 --  -- [vpxa:getChanges]
[2007-12-27 10:04:14.228 'App' 7535536 error] [vm.GetConfig] Received exception in GetConfig: vmodl.fault.HostNotReachable
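
For anyone who wants to pull just the error-level entries out of a longer vpxa or hostd log, a rough Python sketch follows. It assumes every entry has the layout seen in the excerpt above ([date time 'Module' thread-id level] message); the function name error_entries and the regular expressions are my own and are not part of any VMware tool.

```python
import re

# Assumed layout: [date time 'Module' thread-id level] message
ENTRY = re.compile(
    r"^\[(?P<ts>\S+ \S+) '(?P<module>[^']+)' (?P<thread>\d+) (?P<level>\w+)\] (?P<msg>.*)$"
)
FAULT = re.compile(r"vmodl\.fault\.\w+")


def error_entries(log_text):
    """Return (timestamp, fault name or None, message) for each error-level line."""
    found = []
    for line in log_text.splitlines():
        m = ENTRY.match(line)
        if m and m.group("level") == "error":
            fault = FAULT.search(m.group("msg"))
            found.append((m.group("ts"), fault.group(0) if fault else None, m.group("msg")))
    return found


# Lines copied from the log excerpt above.
sample = """\
[2007-12-27 08:44:06.385 'App' 6495152 info] [VpxLRO] -- BEGIN task-internal-33 --  -- [vpxa:getChanges]
[2007-12-27 09:03:58.878 'App' 7535536 error] [VpxdHalVmHostAgent] Call to GetSummary failed: vmodl.fault.HostNotReachable
[2007-12-27 09:44:34.026 'App' 6495152 error] [VpxaMoService] Throwing HostCommunication error: vmodl.fault.HostNotReachable
"""
for ts, fault, msg in error_entries(sample):
    print(ts, fault)
```

Lines that do not match the assumed layout (for example, the continuation lines of a multi-line Result block) are simply skipped.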

I have verified that I can ping back and forth and that the hostnames (short and FQDN) and IP addresses are mutually resolvable. I opened up the firewall on the ESX server, to no effect. All physical interfaces show that they're up, and I can ping the VMKernel IP from the VC server. In short, everything looks healthy as far as I can see. Is there anything I'm missing?

Milton21
Hot Shot

Reboot the ESX box

Sorry, hit post before I was done.

The VC client on the ESX host seems to lock up and show as disconnected. I think the processes went zombie on you and the parent process needs to be reset. Well, the parent process is root, and that is the system.

admin
Immortal

It looks like vpxa is having trouble communicating with hostd. Do you see any other errors in the hostd log? Try restarting hostd (service mgmt-vmware restart).

tomaddox
Enthusiast

The only errors from hostd.log are the ones I posted originally. I have restarted hostd, and the first time I did so, it took a long time for the Host Agent service to shut down. More recently it restarts quickly, but the problem persists.

tomaddox
Enthusiast

I have tried restarting VC but to no avail. I can connect fine to the ESX server with the VI Client, and the VC server can connect to the other ESX servers, so it's something specific to this host.

admin
Immortal

Can you enable verbose logging on hostd and see if that shows any extra useful info?

Modify /etc/vmware/hostd/config.xml and change the log.level tag to verbose. You'll need to restart hostd for this to take effect.
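
If it helps to confirm the current setting before editing, here is a small Python sketch that reads it. It assumes the log.level tag is a level element nested inside a log element directly under the document root; that layout, the function name get_log_level, and the trimmed sample document are assumptions for illustration, so compare against the real file first.

```python
import xml.etree.ElementTree as ET


def get_log_level(xml_text):
    """Return the text of <log><level> under the root element, or None."""
    root = ET.fromstring(xml_text)
    node = root.find("./log/level")  # assumed location of the log.level tag
    if node is None or node.text is None:
        return None
    return node.text.strip()


# Hypothetical, heavily trimmed stand-in for /etc/vmware/hostd/config.xml.
sample = "<config><log><level>verbose</level></log></config>"
print(get_log_level(sample))
```

To check a real host, pass in the contents of /etc/vmware/hostd/config.xml. The sketch only reads the value; changing it still means editing the file and restarting hostd.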

tomaddox
Enthusiast

Free memory looks fine. It has more free memory, buffers, and swap than another server which is not having this problem. CPU load is also fairly minimal.

tomaddox
Enthusiast

I'm not seeing any zombie processes when I view the process list with ps. Looking at it from the other direction, are there any other services that I can restart non-destructively? I have 18 VMs running on this box, so I would prefer not to reboot. Alternatively, is it possible that a necessary process has just died?

tomaddox
Enthusiast

Logging is already at "verbose." Is there a "debug" or other higher setting?

jayolsen
Expert

Does this directory exist on your host: /tmp/vmware-root?

admin
Immortal

There is "trivial" logging but I don't think it will help in this case. Worth a try though...

tomaddox
Enthusiast

It did not, so I created it and chmod'd it 700 (owner root), but it doesn't seem to have changed anything.

jayolsen
Expert

Ugh, might be time to call support.

tomaddox
Enthusiast

Yeah, I'm trying to avoid that, but I may not have a choice. What's strange to me is that I get to the point where the VI Client shows me which VMs are running on the host and asks me which folder to put them in when I'm attempting to reconnect, but then it claims that it can't reach the host, so it does seem as though some process is not responding as expected.

Milton21
Hot Shot

I have called support with this problem. After they were in my system for about 2 hrs, they just told me to reboot.
