Client dies after running successfully for several days

estebann · 01-02-2008

Hi,

I recently installed Hyperic and noticed that the client has a tendency to die after being up and running for several days. According to the server, the client continues (for a while) to report that the server is available, but it fails to report any other information. If I try to stop the client using the appropriate script, I get the following result:

Stopping agent ...
Failed to stop agent: Unable to connect to agent: already dead?

However, I can kill the client using SIGTERM and then start it up again, and everything works fine for ~1-2 weeks. Below is a snippet from the log file corresponding to the time of death.

Any help would be greatly appreciated.

Regards,
BD

-----
2007-12-31 19:46:01,871 ERROR [SenderThread] Error sending measurements: IO error: java.net.SocketTimeoutException: Read timed out
2007-12-31 20:28:11,892 ERROR [HttpMethodBase] I/O failure reading response body
java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(Unknown Source)
at java.io.FilterInputStream.read(Unknown Source)
at java.io.PushbackInputStream.read(Unknown Source)
at java.io.FilterInputStream.read(Unknown Source)
at org.apache.commons.httpclient.ContentLengthInputStream.read(ContentLengthInputStream.java:167)
at java.io.FilterInputStream.read(Unknown Source)
at org.apache.commons.httpclient.AutoCloseInputStream.read(AutoCloseInputStream.java:142)
at java.io.FilterInputStream.read(Unknown Source)
at org.apache.commons.httpclient.AutoCloseInputStream.read(AutoCloseInputStream.java:161)
at org.apache.commons.httpclient.HttpMethodBase.getResponseBody(HttpMethodBase.java:717)
at org.apache.commons.httpclient.HttpMethodBase.getResponseBodyAsString(HttpMethodBase.java:764)
at org.hyperic.lather.client.LatherHTTPClient.invoke(LatherHTTPClient.java:112)
at org.hyperic.hq.bizapp.client.AgentCallbackClient.invokeLatherCall(AgentCallbackClient.java:145)
at org.hyperic.hq.bizapp.client.MeasurementCallbackClient.measurementSendReport(MeasurementCallbackClient.java:62)
at org.hyperic.hq.measurement.agent.server.SenderThread.sendBatch(SenderThread.java:410)
at org.hyperic.hq.measurement.agent.server.SenderThread.run(SenderThread.java:541)
at java.lang.Thread.run(Unknown Source)
2008-01-02 10:12:08,032 ERROR [SenderThread] Error sending measurements: IO error: null


excowboy · 01-03-2008

Hi,

Maybe you could provide more details about your environment (HQ agent platform, agent version)?
Do you use any kind of firewall? By default, the server contacts the agent on port 2144, and the agent contacts the server on port 7080.

Cheers,
Mirko

estebann · 01-03-2008

Hi,

Thanks for the response. The server in question is running RH Linux with kernel version 2.4.20. The agent is version 3.1.4. There are a couple of firewalls involved, but since communication is not impeded for the first ~week after restarting the agent, I don't think the problem is one of firewall configuration.

Thanks,
BD

JohnMarkOrg · 01-03-2008

Hmm... well something is preventing HQ from reading data from the network - thus the timeout errors.
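For context, "Read timed out" is what Java raises when a connection is open but no data arrives within the socket's read timeout. Below is a minimal sketch of that situation, using only the JDK and a throwaway local server that accepts a connection and never replies. The class and method names are made up for illustration; this is not HQ's code, and the 200 ms timeout is arbitrary.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class ReadTimeoutDemo {

    /**
     * Connects to a local server that never sends a byte, then tries to read
     * with the given timeout. Returns true if the read ended in a
     * SocketTimeoutException.
     */
    static boolean readTimesOut(int timeoutMillis) {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        // Port 0 asks the OS for any free port. The server side never writes.
        try (ServerSocket silentServer = new ServerSocket(0, 1, loopback);
             Socket client = new Socket(loopback, silentServer.getLocalPort())) {
            client.setSoTimeout(timeoutMillis); // longest a single read() may block
            InputStream in = client.getInputStream();
            try {
                in.read(); // the connection is open, but no data ever arrives
                return false;
            } catch (SocketTimeoutException e) {
                System.out.println("Caught: " + e.getMessage());
                return true;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("timed out: " + readTimesOut(200));
    }
}
```

The point of the sketch is that the connection itself can look healthy while reads still time out, for example when something between the two ends silently drops traffic.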

Just to cover the bases - do any of the machines involved change IP addresses? At the time of the errors, how much disk space remains on the agent's partition?

Also, just out of curiosity - did you ever kill the agent while it was starting up? I once did that and found that I needed to blow away the agent and re-install.

And finally, I wonder if you have some kind of weekly cron job that somehow interferes with the agent. Do you find that the errors begin at a specific time on a specific day?
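One way to check for a recurring time is to bucket the ERROR lines in the agent log by weekday and hour. The sketch below assumes the timestamp format shown in the log snippet above (`yyyy-MM-dd HH:mm:ss,SSS` followed by `ERROR`); the class name and the log path passed on the command line are hypothetical.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ErrorTimes {
    // Matches lines such as "2007-12-31 19:46:01,871 ERROR [SenderThread] ..."
    private static final Pattern ERROR_LINE =
            Pattern.compile("^(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}),\\d{3} ERROR ");
    private static final DateTimeFormatter STAMP =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    /** Counts ERROR lines per "WEEKDAY HH:00" bucket; other lines are skipped. */
    static Map<String, Integer> countByDayAndHour(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            Matcher m = ERROR_LINE.matcher(line);
            if (!m.find()) {
                continue; // stack-trace lines and non-ERROR entries
            }
            LocalDateTime t = LocalDateTime.parse(m.group(1), STAMP);
            String key = t.getDayOfWeek() + " " + String.format("%02d:00", t.getHour());
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) throws Exception {
        // args[0]: path to the agent log file (location depends on the install)
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        countByDayAndHour(lines).forEach((k, v) -> System.out.println(k + "  " + v));
    }
}
```

If the same bucket shows up after every failure, a scheduled job at that time would be a reasonable suspect.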

-John Mark

roger_symonds · 01-03-2008

Hi,

I agree with John Mark. I've had the same experience when I killed an agent instead of shutting it down correctly.

A quick reinstall of the agent fixed this for me, so I recommend giving that a try first.

If it still doesn't work, double check the network connection between agent and server.
A good program for this is Hping (http://www.hping.org/), as you can craft appropriate network packets and trace them through the network to find the problem:

(quote from Hping site) Hping supports TCP, UDP, ICMP and RAW-IP protocols, and has a traceroute mode.
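As a much cruder first check than Hping, a plain TCP connect with a timeout shows whether a port is reachable at all from a given machine. This is only a sketch: the class name is made up, the default host is a placeholder, and 2144 and 7080 are simply the two default ports mentioned earlier in the thread. It tests one direction only, so it would need to be run from each side against the other side's port.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class PortCheck {

    /** Returns true if a TCP connection to host:port succeeds within the timeout. */
    static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket s = new Socket()) {
            s.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Refused, unreachable, unresolvable, or timed out
            return false;
        }
    }

    public static void main(String[] args) {
        // args[0]: host to probe; "localhost" is only a placeholder default
        String host = args.length > 0 ? args[0] : "localhost";
        for (int port : new int[] {2144, 7080}) {
            System.out.println(host + ":" + port + " reachable: "
                    + canConnect(host, port, 3000));
        }
    }
}
```

A successful connect does not rule out a firewall that drops long-lived or idle connections, so a passing result here narrows the problem rather than clearing the network.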

I hope this helps.

Regards,
Roger

estebann · 01-04-2008

Hi,

Thanks for all the suggestions. I'll try a reinstall and look for connectivity problems. It still seems a little odd to me that the agent wouldn't recover after running into a transient connectivity problem. I call it transient because the agent has never failed to connect properly after being restarted, so at least by that time connectivity is possible.

Also, I am curious as to why the server continues to receive server availability data several hours after it stops receiving all other data from the client. Is server availability communicated differently from other things? And why would connectivity problems make the agent non-responsive to a shutdown command executed locally?

Again thanks for the help,

BD
