VMware Cloud Community
juanclau
Contributor

ESXi 5.1.0 hostd agent requires constant restart

Hello everyone,

I've been dealing with a problem for a while and haven't been able to find a solution via research, so hopefully I can get some help this way.

I'm using vCenter to manage my two ESXi servers and Veeam as my backup solution. I mention Veeam because it is how I detect the problem I'm about to describe.

The backup software is configured to notify me via email when backup jobs have problems. Every so often (2-3 times a week) I get notifications about multiple VM backups failing. The first thing I notice is that vCenter shows the ESXi server hosting the VMs whose backups failed as disconnected, and a reconnect attempt simply returns a connection error.

This is happening on both servers, although I haven't seen it on both at the same time yet.

While researching the issue, I found that restarting the management agents fixes it (VMware KB: Restarting the Management agents on an ESXi or ESX host). However, this is new behavior, and the servers have been in production for almost 2 years. I have since discovered that restarting the hostd agent alone fixes the problem.

Since they hadn't been restarted in such a long time, we rebooted both of them last Friday, only to see the issue again the following day, so that didn't work.

Has anyone experienced this issue? Why do I find myself restarting the agents so often? Is there a permanent fix for this?

Shirish_Madhari
Contributor

Have you recently performed any upgrades or applied patches to the affected hosts? There may be a driver incompatibility if a driver update was missed during the upgrade.

Can you also check the status of hostd using /etc/init.d/hostd status when this happens again?

Shirish Madharia, VCP 5 - DCV
juanclau
Contributor

Shirish_Madharia

Thanks for your response. I haven't done any upgrades or applied any patches lately. The last one was when I upgraded from 5.0 to 5.1.0 when it was released.

The system had worked without issues until about 2 months ago, when this started happening.

I'm sorry I don't have any logs or extra information from when the problem is happening, but I will definitely get it and post it here to see if it helps identify the problem.

For the time being, is there anyone else who has experienced the same problem?

Anjani_Kumar
Commander

Juan, I am quite sure the VMs being backed up are causing the errors and leaving the host isolated. I faced a similar issue and found that multiple VMs stopped responding when Veeam started creating snapshots for backup: the VMs hung (no network reachability), with high CPU and memory usage at the same time. Multiple hung VMs can cause the host to stop responding as well.

After 3-4 attempts, Veeam takes the snapshot and the VM comes back online.

To fix this, we upgraded the virtual hardware and updated VMware Tools to the current version. After that, the problem was resolved, and the VMs and host are running happily now.

Anjani Kumar | VMware vExpert 2014-2015-2016 | Infrastructure Specialist Twitter : @anjaniyadav85 Website : http://www.Vmwareminds.com
juanclau
Contributor

Shirish,


One of my servers is showing the problem again and I am doing some logging as I type. Here are my findings:


- The error displayed by vCenter when attempting to connect to the ESXi server is: "Reconnect host: Cannot contact specified host (hostname). The host may not be available on the network, a network configuration problem may exist, or the management services on this host may not be responding."

- The command /etc/init.d/hostd status confirms the service is currently running. It seems the service doesn't crash; however, restarting the server allows vCenter to reconnect.

- hostd.log shows several errors related to SSL. Here are some of the errors found in the log (IPs removed for security):

[28835B90 error 'Default'] SSLStreamImpl::DoServerHandshake (0b3d4778) SSL_accept failed with Unexpected EOF

[28835B90 warning 'Proxysvc'] SSL Handshake failed for stream TCP(local=X.X.X.X:Y, peer=Z.Z.Z.Z:W), error=SSL Exception: Unexpected EOF

[27D9DB90 warning 'Proxysvc Req07733'] Error reading from client while waiting for header: N7Vmacore15SystemExceptionE(Connection reset by peer)

[27002B90 error 'Default'] SSLStreamImpl::BIORead (0b66f108) timed out
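
To get a feel for how often each of these errors shows up, a rough sketch like the one below could tally them. It only assumes the "[thread level 'component'] message" layout seen in the lines above; the label names (ssl_unexpected_eof and so on) and the tally function are made up for illustration and are not part of any VMware tool.

```python
import re
from collections import Counter

# Layout assumed from the quoted lines: [<thread> <level> '<component>'] <message>
LINE_RE = re.compile(
    r"\[(?P<thread>[0-9A-Fa-f]+) (?P<level>\w+) '(?P<component>[^']*)'\] (?P<message>.*)"
)

# Hypothetical labels mapped to substrings copied from the errors quoted above.
SIGNATURES = {
    "ssl_unexpected_eof": "Unexpected EOF",
    "connection_reset": "Connection reset by peer",
    "ssl_read_timeout": "BIORead",
}


def tally(lines):
    """Count error/warning lines that contain one of the known substrings."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if not match or match.group("level") not in ("error", "warning"):
            continue
        for label, needle in SIGNATURES.items():
            if needle in match.group("message"):
                counts[label] += 1
    return counts


# The four lines quoted in this post, used as sample input.
SAMPLE = [
    "[28835B90 error 'Default'] SSLStreamImpl::DoServerHandshake (0b3d4778) "
    "SSL_accept failed with Unexpected EOF",
    "[28835B90 warning 'Proxysvc'] SSL Handshake failed for stream "
    "TCP(local=X.X.X.X:Y, peer=Z.Z.Z.Z:W), error=SSL Exception: Unexpected EOF",
    "[27D9DB90 warning 'Proxysvc Req07733'] Error reading from client while "
    "waiting for header: N7Vmacore15SystemExceptionE(Connection reset by peer)",
    "[27002B90 error 'Default'] SSLStreamImpl::BIORead (0b66f108) timed out",
]

if __name__ == "__main__":
    for label, count in sorted(tally(SAMPLE).items()):
        print(label, count)
```

To run it against a real log, the lines of a copy of hostd.log could be passed to tally instead of SAMPLE, for example tally(open(path)) where path is wherever the copy was saved. Counting the errors per day before and after a hostd restart might show whether they line up with the disconnects.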

Any ideas?
