hyperik
Contributor

Hyperic HQ Agent availability up/down

My agents are flapping between up and down every 10 to 15 minutes, sometimes every 5 minutes. Often the Recent Alerts section will indicate that they became unavailable and then available (recovery alert) in the same minute. The ones digit of the minute is always 0 or 5.

I've scoured the documentation, these forums, and Google, and can only guess that I'm seeing the dreaded false negative.

Condition-wise, I've tried Availability = 0 with the recovery alert Availability > 0, as well as = 0 / = 100. I've tried "Once every X times conditions are met within a time period of Y minutes". I've tried running the agent as root and as the hyperic user.

NTP seems to be the most common suggestion, but that is already running and the offset is very low.

What else might be causing this endless alert spam?

admin
Immortal

Hi

Which Hyperic version are you using?

I have never run into this situation. Can you please clarify: do you really have an availability issue? Do you see your agents actually going down and up on the monitored machines (not just in Hyperic)? Or is this a result of the alert definition you added?

hyperik
Contributor

I was using 5.7.0 but yesterday I upgraded to 5.7.1. So far so good.

Hopefully it's not an issue that just takes a few days to occur.

admin
Immortal

Hi,

It seems the Java process crashed and the wrapper is trying to bring the agent back up.

Which platform are the agents running on?

Please attach agent.log and wrapper.log.

Good luck

hyperik
Contributor

The agents are running on Red Hat Enterprise Linux 6 (64-bit).

wrapper.log is mostly empty, with a few entries from last week about starting the agent. The following error appears from time to time in agent.log:

28-01-2014 13:56:32,214 UTC ERROR [SenderThread] [SenderThread@478] Error sending measurements: Unable to contact server @ http://<IP removed>:7080/lather: org.apache.http.conn.HttpHostConnectException: Connection to http://<IP removed>:7080 refused
org.hyperic.hq.bizapp.client.AgentCallbackClientException: Unable to contact server @ http://<IP removed>:7080/lather: org.apache.http.conn.HttpHostConnectException: Connection to http://<IP removed>:7080 refused
        at org.hyperic.hq.bizapp.client.AgentCallbackClient.invokeLatherCall(AgentCallbackClient.java:176)
        at org.hyperic.hq.bizapp.client.AgentCallbackClient.invokeLatherCall(AgentCallbackClient.java:146)
        at org.hyperic.hq.bizapp.client.MeasurementCallbackClient.measurementSendReport(MeasurementCallbackClient.java:62)
        at org.hyperic.hq.measurement.agent.server.SenderThread.sendBatch(SenderThread.java:451)
        at org.hyperic.hq.measurement.agent.server.SenderThread.sendData(SenderThread.java:623)
        at org.hyperic.hq.measurement.agent.server.SenderThread.run(SenderThread.java:613)
        at java.lang.Thread.run(Unknown Source)

At the time above (13:56), the error appeared twice but the availability metric was unaffected. However, the exact same error appeared about 8 times between 14:29 and 14:32 and an alert was generated. Both the alert (availability = 0) and the recovery alert (availability > 0) triggered at 14:30.
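For anyone wanting to line up these errors against alert times, here is a rough sketch that counts the "refused" errors per minute. It assumes the timestamp layout shown in the excerpt above (DD-MM-YYYY HH:MM:SS,mmm followed by a zone and level); the function names are made up for illustration and are not part of Hyperic.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed layout, taken from the excerpt: "28-01-2014 13:56:32,214 UTC ERROR [SenderThread] ..."
LINE_RE = re.compile(
    r"^(\d{2}-\d{2}-\d{4} \d{2}:\d{2}):\d{2},\d{3} \S+ ERROR \[SenderThread\]"
)


def refused_per_minute(lines):
    """Count SenderThread 'refused' errors, keyed by the minute they were logged."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        # Stack-trace continuation lines have no timestamp, so they are skipped.
        if match and "refused" in line:
            minute = datetime.strptime(match.group(1), "%d-%m-%Y %H:%M")
            counts[minute] += 1
    return counts


def print_report(path):
    """Print one 'YYYY-MM-DD HH:MM count' row per minute, in time order."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        for minute, count in sorted(refused_per_minute(handle).items()):
            print(minute.strftime("%Y-%m-%d %H:%M"), count)
```

Calling print_report("agent.log") and comparing the busy minutes with the Recent Alerts list should show whether an alert only fires once several sends in a row are refused.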

hyperik
Contributor

Another oddity... Either closing the Hyperic HQ browser tab or signing out is related to all of these alerts, or it's extremely coincidental.

If I leave the Dashboard open in a browser tab, the alerts do not seem to go off. I refresh the dashboard and see no new recent alerts.

However, if I close the tab, sign out, or find on refresh that the session has been signed out automatically (timeout?), then when I log back in there are availability alerts for the HQ agents from that time frame. Logging back on to the Hyperic Dashboard seems to stop the alerts for as long as I keep the tab open and refreshed.

Like I said, it seems coincidental, but then again I can fairly reliably generate alerts just by signing out and stop them by signing in. What is going on??

admin
Immortal

This is very strange. I highly recommend opening a ticket with Support so a full analysis of the logs and settings can be performed.

Do you have any kind of dynamic firewall that could open and close ports based on things like browser connections from an administrator desktop?
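One way to test the firewall theory is to probe the server port from an agent host while logged in to the dashboard and again while logged out. The sketch below is only an illustration: the port 7080 comes from the log excerpt earlier in the thread, while the host value, function names, and intervals are placeholders.

```python
import socket
import time


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def watch_port(host, port, interval=30.0, checks=10):
    """Probe the port repeatedly and return a list of (timestamp, reachable) samples."""
    samples = []
    for i in range(checks):
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        samples.append((stamp, port_is_open(host, port)))
        if i < checks - 1:
            time.sleep(interval)
    return samples
```

For example, watch_port("<server IP>", 7080) run from the agent machine would show whether the port stops accepting connections around the times the alerts fire.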

rwmastel
Contributor

I opened a case yesterday for a similarly strange availability issue. I'm running vFabric Hyperic 5.0 agents and server. The availability will go to 0 until I RDP into the Win2008 R2 platform, at which point availability returns, but only sometimes. Other times I have to run \hq-agent.bat setup to get it to come back. I'll report back here when the case is resolved.

hyperik
Contributor

For what it's worth, these configuration changes improved our agent stability and seem to have stopped the false alerts.

When setting up the agent:

Should Agent communications to HQ be unidirectional [default=no]: yes

Should Agent communications to HQ always be secure [default=yes]: no


In the file /...path to agent directory.../conf/agent.properties, set accept.unverified.certificates=true
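To confirm the setting across several agents, a small check like the following may help. It is a simplified sketch: the parser only handles plain key=value lines (not the full Java properties syntax), and the function names are my own, not anything shipped with Hyperic.

```python
def read_properties(text):
    """Parse simple key=value lines, skipping blank lines and # or ! comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#!":
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def accepts_unverified_certificates(text):
    """Return True if accept.unverified.certificates is set to true."""
    value = read_properties(text).get("accept.unverified.certificates", "")
    return value.lower() == "true"
```

Reading the contents of conf/agent.properties and passing them to accepts_unverified_certificates() shows whether the change has been applied on that host.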

rwmastel
Contributor

We use, in order: no, no, true.

My issue is with just one Win2008r2 platform.  Our other 270+ Linux and Windows platforms don't behave this way.
