VMware Cloud Community
lfeistel_hyperi
Contributor

Agent starts, sends autodiscovery info, then just sits idle, no metrics

I have been running Hyperic HQ since 3.2.3 on a setup of about 10 Linux and Windows servers. I recently upgraded from 4.0.3 to 4.1.2. All of my agents were working fine in 4.0.3, but now a few of them are acting strangely. The agent installation went fine on every server: I associated the agents with the server, they completed the handshake, and everything looked good. I've done this many times, so I am confident about the installation procedure. All of my agents on Red Hat/CentOS work fine, but one agent on an Ubuntu box and two agents on Windows boxes are misbehaving. Before the upgrade, they were working fine.

Now, what happens is that when I start the agent, it loads the plugins, sends the autodiscovery report to the server, and runs the runtime autodiscovery, logging lines like the following:
2009-06-01 04:16:17,774 INFO [Thread-1] [RuntimeAutodiscoverer] Running runtime autodiscovery for HQ Agent
2009-06-01 04:16:17,787 INFO [Thread-1] [RuntimeAutodiscoverer] HQ Agent discovery took 0

Unless I do something to force activity, a line like the above is usually the last line I will see in agent.log, no matter how long I wait. If anything on the box changes, the changes do appear in the AIQ window of the HQ Dashboard. The problem is that at this point the agent just goes idle and never does anything else again. It does not die; it is clearly still running. If I do something to the platform in HQ that causes the server to contact the agent, I can see something in the log. For example, if I define a Script Service with a non-existent filename, the agent logs a file-not-found exception, so I know the server can still talk back to the agent.

There is just no metric data being collected, and no other activity at all in agent.log after the initial startup or some other forced activity. HQ shows the platform as down, with all indicators red. What could possibly be causing the agent to be brain-dead like this?

Does anyone have any tips on how I can start troubleshooting this problem? I've also been looking in server.log, but I have no idea where to begin, and I don't see any errors in there that jump out at me. What kinds of things in server.log might be relevant to this problem?

Lee

jvalkeal_hyperi

When you upgraded the server, were there any error messages or anything similar that could give some idea of the problem? I believe there should be an upgrade/install log somewhere.

lfeistel_hyperi
Contributor

I looked over the upgrade log and can't spot any errors. I also watched the server start up and saw no errors there either. After the server was started, I stopped and restarted one of the failing agents. I watched server.log and still saw no errors. There is one error in agent.log about a copy of Tomcat that is no longer running, but I think that error is innocuous. I captured everything and am attaching the logs here in case someone would be kind enough to take a look.

Any other ideas as to what would cause an agent to just sit and perform no data collection even though it is running and communicating bi-directionally with the server?

jvalkeal_hyperi

Yeah, there was nothing special in those log files... at least nothing I was able to spot. So some agents are working fine and some have failed. Strange...

This is a long shot, but here are the different steps I would try:

1. Shut down the agent, delete its data directory, and start the agent again (a rough sketch of this, combined with step 3, follows this list).
2. Select one service and try changing its metric collection interval from the server side.
3. Enable agent debug logging; it is INFO by default, and sometimes the debug messages reveal something useful.
4. Depending on how important the collected data is, you could also delete the agent from HQ's inventory and re-install the agent (by doing step 1).
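
If it helps, here is a rough Python sketch of steps 1 and 3 on a Linux agent box. The install path, the bin/hq-agent.sh control script, the data directory, and the agent.logLevel property in conf/agent.properties are assumptions about a default agent install, so adjust them to match yours, and back up anything you care about before deleting.

#!/usr/bin/env python
# Rough sketch of steps 1 and 3. Paths, the hq-agent.sh control script,
# and the agent.logLevel property are assumptions about a default
# install -- adjust them for your own setup.
import os
import shutil
import subprocess

AGENT_HOME = "/opt/hyperic/hyperic-hq-agent"                  # assumption: your install path
CTL = os.path.join(AGENT_HOME, "bin", "hq-agent.sh")          # agent control script
DATA_DIR = os.path.join(AGENT_HOME, "data")                   # runtime state wiped in step 1
PROPS = os.path.join(AGENT_HOME, "conf", "agent.properties")  # agent configuration file

def set_log_level(level="DEBUG"):
    # Rewrite (or append) the assumed agent.logLevel property for step 3.
    with open(PROPS) as f:
        lines = f.readlines()
    found = False
    with open(PROPS, "w") as f:
        for line in lines:
            if line.strip().startswith("agent.logLevel="):
                f.write("agent.logLevel=%s\n" % level)
                found = True
            else:
                f.write(line)
        if not found:
            f.write("agent.logLevel=%s\n" % level)

def main():
    subprocess.check_call([CTL, "stop"])   # step 1: stop the agent
    if os.path.isdir(DATA_DIR):
        shutil.rmtree(DATA_DIR)            # step 1: delete the data directory
    set_log_level("DEBUG")                 # step 3: turn on debug logging
    subprocess.check_call([CTL, "start"])  # agent re-registers on startup

if __name__ == "__main__":
    main()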

lfeistel_hyperi
Contributor

I enabled DEBUG logging on the agent and noticed that when it goes into its "coma", it is actually running in a continuous loop doing this:

2009-06-01 15:37:00,379 DEBUG [ScheduleThread] [ScheduleThread] Platform schedule is null
2009-06-01 15:37:00,380 DEBUG [ScheduleThread] [UnitsFormat] format(1.243888621379E12) -> 6/1/09 3:37:01 PM
2009-06-01 15:37:00,380 DEBUG [ScheduleThread] [ScheduleThread] Waiting 1000 ms until 6/1/09 3:37:01 PM
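
A quick way to check whether another agent is stuck in the same state is to scan the tail of its agent.log for that message. Here is a minimal Python sketch; the log path is an assumption, so point it at your own agent's log directory.

#!/usr/bin/env python
# Minimal check: does the tail of agent.log show the idle
# "Platform schedule is null" loop? The log path is an assumption.
import collections

LOG = "/opt/hyperic/hyperic-hq-agent/log/agent.log"  # assumption: adjust to your install
PATTERN = "Platform schedule is null"
TAIL = 200  # only look at the last N lines of the log

def count_idle_lines(path=LOG):
    with open(path) as f:
        last = collections.deque(f, maxlen=TAIL)  # keep only the tail of the file
    return sum(1 for line in last if PATTERN in line)

if __name__ == "__main__":
    hits = count_idle_lines()
    print("%d '%s' lines in the last %d lines of %s"
          % (hits, PATTERN, TAIL, LOG))
    print("agent looks stuck" if hits else "no sign of the idle loop")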

Then, I decided to try your suggestion to alter a metric from the server. I changed the system Availability metric collection interval from 1 min to 2 mins. Activity immediately appeared in the agent's log. See the attached agent.log for the activity that occurred (DEBUG-level logging).

The availability status for all three offending platforms has now changed to green. The problem is, it appears I might have to do this for each set of metrics. It looks like it only affected the metric I actually changed.

So, what this proves is that the agent can actually send this metric to the server, but for some reason it never got the list of metrics (the "schedule") from the server to begin with. So now I can narrow my question: if an agent has no schedule after an upgrade or fresh install on the agent side, what does this mean? Is there anything I can do to force the server to send a new copy of the entire schedule to the agent?

Since I already reinstalled the agent, that inherently involved a fresh data directory, and deleting the data directory again would just make me re-associate with the server (enter IP, credentials, etc.). I have already done that dozens of times, so that's not it.

Thanks again for those tips. That really helps to at least see some sign of life here.

Lee

jvalkeal_hyperi (Accepted Solution)

Deleting the data directory does not change that much on the server side. I believe deleting the platform from the server and then re-instantiating the agent would do the trick, but that may not be something you want to do if you want to preserve the collected metrics.

I wonder if this is a new bug introduced by the upgrade process. A quick look in JIRA didn't turn up anything similar.

I hope somebody from the HQ developers picks this up. I don't know of any direct way to refresh those schedules. Maybe through Groovy it would be possible to gather the metric IDs from the failed platforms and call some specific method on them, but that is far too long a road to follow and would be a rather weird thing to do anyway.

lfeistel_hyperi
Contributor

Well, you point out something that brings me to a general question about Hyperic:

I could live with occasionally losing all the metric data for a platform in order to correct a problem; I can rest assured knowing plenty of new metrics will come in over just a few days to replace those I deleted. What I would prefer not to have to do, however, is reconfigure all the collection intervals, customized servers/services, and alert profiles associated with the platform. It takes a lot of time, attention to detail, and just plain patience to sit down and configure everything for a Platform. Having to delete the Platform just to get the agent to reset/resync seems unacceptable to me.

I need a way to delete the Platform but save it in some way, and then bring it back. What do you do when you have a Platform that is something like, say, a www1 machine, and now you want to set up a www2 machine that is an exact replica of www1, identical hardware and software? Now I have to manually recreate all the settings I originally created on the www1 Platform for the www2 Platform. Isn't this basically a terrible oversight in the UI? A golden rule of UI design: whenever the user has to spend a large amount of time and effort inputting data, don't make them repeat the exact same steps over and over.

Am I missing something about how to replicate/rename/copy/backup Platforms here? If I could do that, then I would not be so worried about the mundane task of deleting a Platform and recreating it in order to force a buggy Agent to reload its schedule.

jvalkeal_hyperi

Well, that is the downside if you modify individual metrics: if that metric is later removed, its settings are also lost.

I always try to modify the global settings (monitoring defaults) for platform, server, and service types. That way you always have common settings for similar resource types. The downside is that applying them will overwrite individual metric collection settings, but in this case you just have to choose which option to use.

lfeistel_hyperi
Contributor

I see. Well, trying to put as much in the global settings as possible does sound like a good idea. I hadn't thought of doing that, but I could probably cut repetition down by 75% that way. I'll give that a try and see if it simplifies management of the system over time.

At this point, I have gone through and nudged all the metric sets that were affected, so I will say this problem is now resolved in my case. Fully resolved? No; I guess we will see if a developer picks it up as a bug or knows a more elegant way to overcome the problem. The agent should request a fresh copy of the entire platform schedule if it has a null schedule, or the server should send it if the agent isn't giving any reports. At the very least, there could be an option somewhere in the HQ server-side UI to resync the agent. There is definitely some kind of initialization/resync bug in the new version that I haven't seen before.
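
Just to sketch the behavior I mean, here is some Python-style pseudocode; the class and method names below are made up for illustration and are not Hyperic's actual agent internals.

# Illustration only: made-up names, not Hyperic's real agent code.

class ScheduleStore(object):
    # Stands in for wherever the agent keeps its metric schedule.
    def __init__(self, entries=None):
        self.entries = entries or []

    def is_empty(self):
        return len(self.entries) == 0

class ServerConnection(object):
    # Stands in for the agent's channel back to the HQ server.
    def request_full_schedule(self, agent_token):
        # Ask the server to push every metric schedule for the platform,
        # not just the one metric whose interval was edited by hand.
        print("requesting full schedule resync for %s" % agent_token)
        return []  # the server would answer with the complete schedule

def on_agent_startup(store, server, agent_token):
    # The missing step: an agent that comes up with a null/empty schedule
    # should ask for the whole thing instead of sitting idle.
    if store.is_empty():
        store.entries = server.request_full_schedule(agent_token)

if __name__ == "__main__":
    on_agent_startup(ScheduleStore(), ServerConnection(), "agent-123")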

Anyway, thanks again for your help on this one. I was about to resort to banging my head on the wall. I learned a little bit more about how the agent actually works and how to find more info in the logs using the DEBUG level, so that is good.

lfeistel_hyperi
Contributor

I am still having this issue. I just upgraded the server to 4.2.0.7 and the agent to 4.2.0.7 as well (on Windows 2003 Server). Now the agent doesn't have the metric monitoring schedule, so all items show red or no data. The only workaround I know of is to manually go through every level of the associated platform on the server side and edit the metric collection intervals. This forces the server to push the schedule data for those metrics back to the agent, and those items start collecting metrics again. It is tedious to go through and do this by hand each time I upgrade the agent, though. Surely there is a way it can automatically detect a change to the agent and push the entire schedule back down.

I notice this issue seems related:
http://jira.hyperic.com/browse/HHQ-1999

Has anyone else experienced this issue? I've used Hyperic for years and upgrading the agent was always effortless up until sometime after 4.x. Now everything is down after an agent upgrade until I go through and kick all those settings by hand.

lfeistel_hyperi
Contributor

Bingo. I think I solved my own problem. I followed the instructions at the bottom of the message linked below to remove some derelict Auto Discovery information that would not import. I think the issue boils down to the fact that the Auto Discovery mechanism on the server side was not working properly because of these entries, which could neither be imported, nor ignored, nor deleted. I had to remove them by hand from psql. Then, when I re-paired the agent with the server, everything was copacetic again.

http://communities.vmware.com/thread/352872
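
For anyone doing the same cleanup, here is a read-only Python sketch for checking how many rows are sitting in the auto-discovery queue tables before deleting anything by hand. The connection string and the table names are my assumptions (guessed from Hyperic's EAM/AIQ naming), so verify them against your own schema in psql first, and back up the HQ database before touching it.

#!/usr/bin/env python
# Read-only look at the auto-discovery queue. The connection info and
# table names are assumptions -- verify them in psql before relying on
# them, and back up the HQ database before any manual deletes.
import psycopg2

CONN_INFO = "dbname=hqdb user=hqadmin host=localhost"          # assumption: your HQ DB settings
TABLES = ["eam_aiq_platform", "eam_aiq_server", "eam_aiq_ip"]  # assumed table names

def dump_queue_counts(conn_info=CONN_INFO):
    conn = psycopg2.connect(conn_info)
    try:
        cur = conn.cursor()
        for table in TABLES:
            cur.execute("SELECT count(*) FROM %s" % table)  # table names are hardcoded above
            print("%-18s %d rows" % (table, cur.fetchone()[0]))
    finally:
        conn.close()

if __name__ == "__main__":
    dump_queue_counts()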

Hope this helps anyone else who runs across this later.