gnovak
Contributor

Reading log files

Hi there Hyperic users!

I am trying to read through some of the logs to find out information about an event that happened over the weekend.

The company I work for has machines in the home office, where the Hyperic server is located, and also in a remote office. Sometime over the weekend, the internet connection at the remote office apparently went down, and Hyperic sent out alerts about it. Which is good!

However, what I'm trying to do is go through the logs on the Hyperic server to find out when the services came back up. I haven't seen anything in the logs yet that would indicate when that happened. Where in particular should I be looking? I assumed the server logs on the Hyperic server would tell me. Can anyone help? I do see messages in the logs saying that an agent was reporting for a non-existing entity:

2007-08-19 23:57:48,175 ERROR [org.hyperic.hq.measurement.server.session.SRNManagerEJBImpl] Agent's reporting for non-existing entity: 2:10042

Is the entity actually one of the machines being monitored?
2 Replies
gnovak
Contributor

Well, I did find something interesting in one of the agent logs. I checked a log on one of the machines running the agent, and I could see where Hyperic said "i can't generate information for the metric CPU" because the connection was down. When I started seeing messages in the log again, with Hyperic telling me that the time on the machine running the agent wasn't perfect (I have to fix that!!!), I knew the machine was back up and running, and I could get an idea of when it came back.

Is there any other place I might want to look?
ama_hyperic
Hot Shot

Can you be a bit more specific? From the first post, it sounds like none of your machines or the services running on them were actually down; it was just your network connection.

Did you have an agent running on the server and a ping test going out to the remote office? If so, I would check the availability chart for the ping test to see when it started responding again.

Barring that, the metrics are timestamped during collection, and if there is a network error or the HQ server is down, they are kept in a local spool on the agent side. The agent retries the connection, and once it can reconnect, the server receives metrics with timestamps in the past and knows to backfill this data into the database. That way you do not lose metric data, whether the HQ server is down or there is a network outage.
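To make the spool-and-backfill idea concrete, here is a toy sketch in Python. It is not the HQ agent's actual code; the `Spool` class, its method names, and the `send` callback are all made up for illustration. It only shows why samples collected during an outage still arrive with their original timestamps.

```python
import time
from collections import deque


class Spool:
    """Toy agent-side buffer (hypothetical, not HQ code).

    Samples are stamped when collected and held until a send succeeds.
    """

    def __init__(self):
        self.pending = deque()

    def record(self, name, value, collected_at=None):
        # The timestamp is taken at collection time, not at delivery time.
        stamp = time.time() if collected_at is None else collected_at
        self.pending.append((stamp, name, value))

    def flush(self, send):
        """Deliver the oldest samples first; stop at the first failure."""
        delivered = 0
        while self.pending:
            sample = self.pending[0]
            try:
                send(sample)
            except ConnectionError:
                # Server unreachable: keep everything and retry later.
                break
            self.pending.popleft()
            delivered += 1
        return delivered
```

When `flush` finally succeeds, the receiver gets samples whose timestamps lie in the past, which is what lets it fill the gap in the stored data.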

If looking in the agent logs, I would grep -n through them for a string similar to this:

ERROR [SenderThread] Error sending measurements: Unable to contact server @ http://xxx.xxx.xxx.xxx:7080/jboss-lather/JBossLather: Connection refused

I would then focus on the timestamp of the last entry I see, since that is roughly when the agent last failed to reach the server.
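If you would rather script it than eyeball the grep output, something like the following rough Python sketch would do. It assumes the agent log lines start with the same timestamp layout as the server log line quoted in the first post (`2007-08-19 23:57:48,175`), which may not hold for your setup. The function name `last_send_error` is just a name I picked.

```python
import re
import sys
from datetime import datetime

# Assumed timestamp layout, copied from the server log line quoted earlier.
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3})")
NEEDLE = "Error sending measurements"


def last_send_error(lines):
    """Return (line_number, timestamp) of the last matching line, or None."""
    last = None
    for number, line in enumerate(lines, start=1):
        if NEEDLE not in line:
            continue
        match = TIMESTAMP.match(line)
        if match is None:
            # Matching text without a leading timestamp: skip it.
            continue
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        stamp = stamp.replace(microsecond=int(match.group(2)) * 1000)
        last = (number, stamp)
    return last


if __name__ == "__main__" and len(sys.argv) > 1:
    # Pass the path to your agent log as the first argument.
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        print(last_send_error(handle))
```

The line number it returns is the same one grep -n would show, so you can jump straight to that spot in the log and read what follows.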