My agents keep flapping between up and down every 10 to 15 minutes, sometimes every 5 minutes. Often the Recent Alerts section will indicate that they became unavailable and then available again (recovery alert) in the same minute. The ones digit of the alert's minute is always 0 or 5.
I've scoured the documentation, these forums, and Google, and can only guess that I'm seeing the dreaded false negative.
Condition-wise, I've tried Availability = 0 with the recovery alert Availability > 0, as well as = 0 / = 100. I've tried "Once every X times conditions are met within a time period of Y minutes". I've tried running the agent as root and running it as the hyperic user.
NTP seems to be the most common suggestion, but that is already running and the offset is very low.
What else might be causing this endless alert spam?
Which Hyperic version are you using?
I have never come across such a situation. Can you please clarify: do you really have an availability issue? Do you see your agents actually going down and up (on the monitored machines, not just in Hyperic)? Or is it a result of the alert definition you added?
The agents are running on Red Hat Enterprise Linux 6 (64-bit).
Wrapper.log is mostly empty, with a few entries from last week about starting the agent. The following error appears from time to time in agent.log:
28-01-2014 13:56:32,214 UTC ERROR [SenderThread] [SenderThread@478] Error sending measurements: Unable to contact server @ http://<IP removed>:7080/lather: org.apache.http.conn.HttpHostConnectException: Connection to http://<IP removed>:7080 refused
org.hyperic.hq.bizapp.client.AgentCallbackClientException: Unable to contact server @ http://<IP removed>:7080/lather: org.apache.http.conn.HttpHostConnectException: Connection to http://<IP removed>:7080 refused
at java.lang.Thread.run(Unknown Source)
At the time above (13:56), the error appeared twice but the availability metric was unaffected. However, the exact same error appeared about 8 times between 14:29 and 14:32 and an alert was generated. Both the alert (availability = 0) and the recovery alert (availability > 0) triggered at 14:30.
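To compare the send errors against the alert times more systematically, something like the following could help. It is only a rough sketch: it assumes every relevant agent.log line looks like the excerpt above (a day-month-year timestamp, then a timezone label, then "ERROR [SenderThread] ... Error sending measurements"), and the function name `bucket_send_errors` is made up for illustration.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed line layout, taken from the agent.log excerpt above:
# "28-01-2014 13:56:32,214 UTC ERROR [SenderThread] ... Error sending measurements: ..."
LINE_RE = re.compile(
    r"^(\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2}),\d{3} \S+ ERROR \[SenderThread\]"
    r".*Error sending measurements"
)


def bucket_send_errors(lines, bucket_minutes=5):
    """Count 'Error sending measurements' lines per time bucket.

    Returns a dict mapping the start of each bucket (a datetime floored
    to a multiple of bucket_minutes) to the number of matching lines.
    """
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # stack-trace lines and unrelated entries are skipped
        ts = datetime.strptime(match.group(1), "%d-%m-%Y %H:%M:%S")
        floored = ts.replace(
            minute=ts.minute - ts.minute % bucket_minutes, second=0
        )
        counts[floored] += 1
    return dict(sorted(counts.items()))
```

Calling `bucket_send_errors(open("agent.log"))` and lining up the buckets with the Recent Alerts timestamps should show whether alerts only fire when several send errors land in the same 5-minute window.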
Another oddity... Either closing the Hyperic HQ browser tab or signing out is related to all of these alerts, or it's extremely coincidental.
If I leave the Dashboard open in a browser tab, the alerts do not seem to go off. I refresh the dashboard and see no new recent alerts.
However, if I close the tab, sign out, or try to refresh and find that it had automatically signed out (timeout?) then I will log back in and there will be availability alerts for the HQ agents during that time frame. Logging back on to the Hyperic Dashboard seems to stop the alerts for as long as I keep the tab open and refreshed.
Like I said, it seems coincidental, but then again I can fairly reliably generate alerts just by signing out and stop them by signing in. What is going on??
This is very strange. I highly recommend opening a ticket with Support so that a full analysis of the logs and settings can be performed.
Do you have any kind of dynamic firewall that could open and close ports based on things like browser connections from an administrator desktop?
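One way to test the firewall theory would be to probe the server port from an agent machine while you are logged out of the dashboard and again while logged in. The sketch below is only an illustration: the helper names `can_connect` and `probe` are made up, and the host and port (7080 in the log excerpt) are whatever your agents actually use.

```python
import socket
import time
from datetime import datetime, timezone


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def probe(host, port, attempts, interval_seconds=10.0):
    """Try to connect `attempts` times.

    Returns a list of (UTC timestamp, reachable) pairs, one per attempt.
    """
    results = []
    for i in range(attempts):
        results.append((datetime.now(timezone.utc), can_connect(host, port)))
        if i < attempts - 1:
            time.sleep(interval_seconds)
    return results
```

If `probe("<server IP>", 7080, attempts=60)` shows refused connections only while nobody is signed in to the dashboard, that would point at the network path or the server rather than the agents.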
I opened a case yesterday for a similarly strange availability issue. I'm running vFabric Hyperic 5.0 agents and server. The availability will go to 0 until I RDP into the Win2008 R2 platform, at which point availability returns, but only sometimes. Other times I have to run \hq-agent.bat setup to get it to come back. I'll report back here when the case is resolved.
For what it's worth, these configuration changes improved our agent stability and seem to have stopped the false alerts.
When setting up the agent:
Should Agent communications to HQ be unidirectional [default=no]: yes
Should Agent communications to HQ always be secure [default=yes]: no
In the file /...path to agent directory.../conf/agent.properties, set accept.unverified.certificates=true
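If you have many agents, a quick check that the property really made it into each agent.properties might save some guessing. This is a minimal sketch with a deliberately simplified parser (it only handles `key=value` lines and `#` or `!` comments, not the full Java properties syntax); both function names are made up for illustration.

```python
def read_properties(text):
    """Parse simple key=value lines, ignoring blank lines and comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:  # skip lines that have no '=' at all
            props[key.strip()] = value.strip()
    return props


def accepts_unverified_certificates(text):
    """Return True if accept.unverified.certificates is set to true."""
    value = read_properties(text).get("accept.unverified.certificates", "")
    return value.lower() == "true"
```

Pass it the contents of the file, for example `accepts_unverified_certificates(open(path).read())`, where `path` points at your own agent.properties.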