I am monitoring a Java process and have an alert defined on it that says: if "Availability < 100%", send me an email. However, I've received a few emails while the process still appears to be alive. Does anyone know how availability is determined?
- Does the server track this, or the client? For example, if the network connection between the server and client is down, would that be considered 0% availability?
- Is there a different metric to distinguish whether the HQ agent is unreachable from whether the applications monitored via that agent are failing?
I too found this (if Availability < 100%) to cause problems at times. I never figured out why, but I changed it to "if Availability = 0.0%, then alert". So if the Java process is gone, it will notify me, and that seems to work just fine.
Thanks for your input, Brian. I tried this change and continue to see erroneous "process down" alerts. I have it configured to alert if "availability == 0% for 3 minutes over a 3-minute period". I wish I knew how it determines availability; then I might be able to make a smarter alert. That said, I think it is a bug that this definition doesn't work.
If anyone knows a way to work around this, please let me know. It is critical that I can determine whether my application has actually crashed, but I also want to know if it is in a slow or unhealthy state that is causing the unavailable notifications.
I guess I just need to understand this metric a little better.
I started with "avail < 100%", then simply "avail == 0%", but continued to see erroneous process-unavailable messages, so I started adding a condition that it must be down for some period of time. I set it to 1 minute, then 2, now 3... I guess the UI makes it seem like I also need to fill in the larger time period. I was thinking that if it is recorded as down for 3 minutes, it surely must be down (nope...).
I didn't think setting that condition would change the fact that it checks every 10 minutes. I thought it would just note that, over a 3-minute span within that 10-minute window, my app was unavailable for 3 minutes. Is that not correct?
I'm also having this problem. The weird thing is that the availability graph doesn't show it moving; it's always at 100%, but I still get erroneous emails. I have been increasing the alert duration too, and I'm currently at 35 minutes!
Well, the collection interval is not affected by your alert definition. The alert definition considers a value to hold until it changes. So if a 0% availability value is collected at some point and then 3 minutes pass, the alert will fire.
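To make the "value holds until it changes" rule concrete, here is a minimal sketch in Python. It assumes samples arrive as (timestamp, availability) pairs and that a duration condition is satisfied once a run of 0% values has been held for the configured time; the function name and data shape are hypothetical and this is not Hyperic's actual code. It illustrates why, with a 10-minute collection interval, a single 0% reading is enough to satisfy a 3-minute duration condition.

```python
from typing import List, Optional, Tuple

# One collected data point: (timestamp in seconds, availability in percent).
Sample = Tuple[int, float]


def first_fire_time(samples: List[Sample], duration: int) -> Optional[int]:
    """Return when an "availability == 0% for `duration` seconds" condition
    would first hold, assuming each collected value is held until the next
    sample replaces it. Returns None if the condition never holds."""
    run_start: Optional[int] = None
    for i, (ts, avail) in enumerate(samples):
        if avail == 0.0:
            if run_start is None:
                run_start = ts  # start of a run of consecutive 0% samples
            fire_at = run_start + duration
            # The 0% value is held until the next sample (or indefinitely).
            next_ts = samples[i + 1][0] if i + 1 < len(samples) else None
            if next_ts is None or fire_at <= next_ts:
                return fire_at
        else:
            run_start = None  # availability recovered; reset the run
    return None


# 10-minute collection interval with one spurious 0% reading at t=600s.
samples = [(0, 100.0), (600, 0.0), (1200, 100.0)]
print(first_fire_time(samples, duration=180))  # 780: 3 minutes after the single 0% sample
print(first_fire_time(samples, duration=900))  # None: the next 100% sample arrives first
```

In this model, lengthening the duration only suppresses the alert when it exceeds the gap until the next healthy sample; it does not make the agent collect more often.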
When the alert fires, is there a corresponding 0% availability data point? ncrukrepairs, you say you've set the alert duration to 35 minutes. Does that mean that when you look back at the availability chart, there are definitely data points at 100% availability within those 35 minutes?
My problem is a little different. Generally this latest definition is working pretty well for me:
If Condition: Availability = 0.0%
Enable Action(s): Once every 2 times conditions are met within a time period of 30 minutes.
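The "once every 2 times within 30 minutes" setting behaves like a dampening counter. The sketch below models it in Python under stated assumptions: matches older than the window are discarded, and the counter resets after the action fires. Both of those are guesses about the semantics, and the `DampenedAlert` class is hypothetical, not part of Hyperic.

```python
from collections import deque
from typing import Deque


class DampenedAlert:
    """Hypothetical model of "fire once every N times the condition is met
    within a time window". Assumed: matches older than the window are
    dropped, and the count starts over after the action fires."""

    def __init__(self, times: int, window_seconds: int) -> None:
        self.times = times
        self.window = window_seconds
        self.matches: Deque[int] = deque()

    def on_condition_met(self, ts: int) -> bool:
        """Record a matching sample at time `ts`; return True if the action fires."""
        self.matches.append(ts)
        # Discard matches that have fallen out of the window.
        while self.matches and ts - self.matches[0] > self.window:
            self.matches.popleft()
        if len(self.matches) >= self.times:
            self.matches.clear()  # assumed: counting starts afresh after firing
            return True
        return False


alert = DampenedAlert(times=2, window_seconds=30 * 60)
print(alert.on_condition_met(0))    # False: first 0% reading
print(alert.on_condition_met(600))  # True: second 0% reading within 30 minutes
```

Under this model, two spurious 0% readings in half an hour are enough to trigger the action, which matches what is reported next.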
However, Hyperic is getting a report of 0% when the app isn't actually down. My graph shows two readings of 0% in the past half hour, so I received an alert, but my app was never down. This brings me back to my earlier question: how is the availability of a process determined? I don't understand where the miscommunication is occurring. If my app is running, how can Hyperic report 0% availability?
Hi, apologies for the threadomancy, but I just dealt with a similar problem and found this thread via search; I figured some more information might benefit others.
The default Tomcat plugin monitors availability simply by checking whether the process is running in the OS process list. It uses this query:
What that does is grep through the process list looking for instances of java.exe that have catalina in their arguments. This works well as a generic way to find instances of Tomcat.
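The matching rule can be modeled with a small Python filter over an in-memory process list. This is only an illustration: it is not PTQL and does not call SIGAR, and the `Proc` record, the process names, and the PIDs are all made up. It shows how a generic "java.exe with catalina in its arguments" rule can lock onto the wrong process, so that availability tracks an unrelated instance.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Proc:
    """Hypothetical stand-in for one row of the OS process list."""
    pid: int
    name: str
    args: List[str] = field(default_factory=list)


def generic_tomcat_match(procs: List[Proc]) -> List[Proc]:
    """Model of the default rule described above: a process named java.exe
    with 'catalina' appearing somewhere in its arguments."""
    return [
        p for p in procs
        if p.name == "java.exe" and any("catalina" in arg for arg in p.args)
    ]


def is_available(procs: List[Proc], matcher: Callable[[List[Proc]], List[Proc]]) -> bool:
    """In this model, availability is simply: did the matcher find any process?"""
    return len(matcher(procs)) > 0


# A development Tomcat (matches the generic rule) and the real app, which
# runs under a different executable name. Both entries are invented.
dev_tomcat = Proc(101, "java.exe", ["org.apache.catalina.startup.Bootstrap", "start"])
my_app = Proc(202, "myapp-service.exe", ["run"])

print([p.pid for p in generic_tomcat_match([dev_tomcat, my_app])])  # [101]
print(is_available([my_app], generic_tomcat_match))  # False: app is up, dev instance is down
```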
However, in my case, I was getting false availability readings because the default setup wasn't even monitoring my instance! The application I wanted to monitor doesn't run as java.exe at all; the query above was discovering a development instance that goes up and down all the time independently of my app, so the availability data made no sense.
I'm on a Windows box, so I changed the query to:
So the plugin now properly monitors only my specific Tomcat instance.
The query can be found in the Configuration Properties under the Inventory tab for the Server.
You can test whether a query finds your Tomcat instance by running:
<AgentHomeDir>\>jre\bin\java -jar pdk\lib\sigar.jar ps your_query_here
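If you test several candidate queries, a small Python wrapper around that same command can help. This is a sketch that assumes the Windows agent layout shown above (`jre\bin\java` and `pdk\lib\sigar.jar` under the agent home directory); the function names and the `C:\hq-agent` path are hypothetical, and `run_sigar_ps` only works on a machine with the agent installed.

```python
import subprocess
from pathlib import PureWindowsPath
from typing import List


def build_sigar_ps_command(agent_home: str, query: str) -> List[str]:
    """Build the argument list for the test command, using absolute paths
    under the agent home directory (Windows layout assumed)."""
    home = PureWindowsPath(agent_home)
    java = str(home / "jre" / "bin" / "java")
    sigar_jar = str(home / "pdk" / "lib" / "sigar.jar")
    return [java, "-jar", sigar_jar, "ps", query]


def run_sigar_ps(agent_home: str, query: str) -> str:
    """Run the command from the agent home directory and return its output.
    Raises CalledProcessError if the command exits with a non-zero status."""
    result = subprocess.run(
        build_sigar_ps_command(agent_home, query),
        cwd=agent_home,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Hypothetical agent location; substitute your own directory and query.
print(build_sigar_ps_command(r"C:\hq-agent", "your_query_here"))
```

Passing the query as a separate list element avoids having to quote it for the Windows command prompt.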