I have been running Hyperic HQ since 3.2.3 on a setup with about 10 Linux and Windows servers. I recently upgraded from 4.0.3 to 4.1.2. All of my agents were working fine in 4.0.3, but now a few of them are acting strangely. The agent installation went fine on every server: I associated each agent with the server, the handshake completed, and everything looked good. I've done this many times, so I am confident about the installation procedure. All of my agents on Red Hat/CentOS work fine. The ones misbehaving are an agent on an Ubuntu box and two agents on Windows boxes. Before the upgrade, all three were working fine.
When I start one of these agents, it loads the plugins, sends the autodiscovery report to the server, and runs the runtime autodiscovery, logging lines like the following:
2009-06-01 04:16:17,774 INFO [Thread-1] [RuntimeAutodiscoverer] Running runtime autodiscovery for HQ Agent
2009-06-01 04:16:17,787 INFO [Thread-1] [RuntimeAutodiscoverer] HQ Agent discovery took 0
Unless I do something to force output, a line like the above is usually the last line I will see in agent.log, no matter how long I wait. If anything has changed on the platform, I see the changes appear in the AIQ window of the HQ Dashboard. The problem is that at this point the agent just goes idle and never does anything else. It does not die; the process is clearly still running. If I do something to the platform in HQ that causes the server to contact the agent, I do see something in the log. For example, if I define a Script Service with a non-existent filename, the agent logs a file-not-found exception, so I know the server can still talk to the agent.
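To put a number on how long the log has been quiet, something like the following Python sketch could be used. It assumes every regular log line starts with a timestamp in the format shown in the two lines above; the function names are my own and not part of HQ.

```python
from datetime import datetime

# Assumed timestamp layout, copied from the sample lines above:
# "2009-06-01 04:16:17,774 INFO [Thread-1] ..."
TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"
TS_LENGTH = 23  # length of "2009-06-01 04:16:17,774"


def parse_timestamp(line):
    """Return the leading timestamp of a log line, or None if it has none."""
    try:
        return datetime.strptime(line[:TS_LENGTH], TS_FORMAT)
    except ValueError:
        # Continuation lines (e.g. stack traces) carry no timestamp.
        return None


def last_timestamp(lines):
    """Return the timestamp of the last timestamped line, or None."""
    last = None
    for line in lines:
        ts = parse_timestamp(line)
        if ts is not None:
            last = ts
    return last


def seconds_silent(lines, now):
    """Return seconds between the last timestamped line and `now`."""
    last = last_timestamp(lines)
    if last is None:
        return None
    return (now - last).total_seconds()
```

Calling `seconds_silent(open(path), datetime.now())`, with `path` pointing at wherever agent.log lives on the box, would give the length of the gap in seconds, assuming the agent logs in the machine's local time.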
No metric data is being collected, and there is no other activity in agent.log after the initial startup or whatever activity I force. HQ shows the platform as down, with all indicators red. What could be causing the agent to go brain-dead like this?
Does anyone have any tips on how I can start troubleshooting this problem? I've also been looking in server.log, but I have no idea where to begin, and I don't see any errors in there that jump out at me. What kinds of entries in server.log might be relevant to this problem?
Lee