Hi, I'm scratching my head trying to figure out how to get HQ (open source version) to notify me when the win32 server is down or interrupted, without flooding me with messages.
It appears the availability parameter is actually an average instead of an actual value, which makes it difficult to come up with good criteria (so far!). For example, I would like to know when the actual availability is less than 100% anytime within an hour, but I don't want false messages, so I assume I should wait for 3 or so values less than 100% (default measurement period is 1 minute). The problem is, the average stays below 100% long after the network/server has recovered and is actually at 100% availability.
The best idea seems to be to enable the action when the availability=0 for at least 5 minutes in one hour. However, since the availability is an average I would not get a message until the average was dragged down to 0, which could be some time. If I try to pick a percentage, then I'm kind of guessing at when the threshold is crossed. Plus if I just get a couple of random dropouts in an hour that happen to add up to 5 minutes below my threshold, I don't want to get beeped (I already know my network is very unreliable!)
I'm hoping I have just overlooked an easier way to set it up, so if anyone knows the secret, I would appreciate knowing. Is there maybe a way to make availability be an average of only 1 measurement point?
Actually, each discrete availability metric value is going to be either 0% or 100%. When you see an average, it's because it's applied to a time range. To get the alert you are looking for, create one for Availability<100% that fires when the condition is met for 5 minutes within a time period of 1 hour, if you expect your server or network to have occasional dropouts. With this setting, the alert fires once the total downtime reaches 5 minutes (within an hour timeframe). However, keep in mind that once the 5 minutes is reached, even if it's in the first 5 minutes of the hour, everything gets reset, and another alert will fire whenever the next 5 minutes of total downtime is reached (within an hour timeframe).
You can, of course, select the option to Disable alert until re-enabled manually. Then as a workflow, once you get an alert and investigate your server, you would then manually turn the alert back on.
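To make the counting behavior concrete, here is a rough Python sketch of the rule as described in this reply. It is not HQ's actual implementation: the function name `simulate_alerts`, its parameters, and the sliding one-hour window are all assumptions made for illustration, and it assumes one availability sample (0 or 100) per minute.

```python
from collections import deque


def simulate_alerts(samples, required=5, window=60):
    """Return the minute indexes at which an alert would fire.

    samples  -- one availability value per minute, each 0 or 100
    required -- minutes of downtime needed before the alert fires
    window   -- how far back (in minutes) downtime is counted

    Hypothetical model of the behavior described in the thread,
    not code taken from HQ.
    """
    down_minutes = deque()  # minute indexes with availability < 100
    fired = []
    for minute, value in enumerate(samples):
        # Forget downtime that has slid out of the window.
        while down_minutes and down_minutes[0] <= minute - window:
            down_minutes.popleft()
        if value < 100:
            down_minutes.append(minute)
            if len(down_minutes) >= required:
                fired.append(minute)
                # After firing, the count starts over from zero.
                down_minutes.clear()
    return fired
```

With a server that is down for the first 10 minutes and then up, `simulate_alerts([0] * 10 + [100] * 50)` returns `[4, 9]`: one alert on the fifth failed sample and a second on the tenth, because the count resets after each alert. Five isolated one-minute dropouts spaced 20 minutes apart never fire, since no 60-minute window contains five of them.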
Let me add a follow-up. If the server was down from 7:00 to 7:05 AM, I would get an alert at 7:05 because there was 5 minutes of downtime since 6:05 AM. Would I get another alert at 7:06, due to the 5 minutes total having occurred since 6:06? I don't want to have to set the option to disable the alert and then manually re-enable it, but once I get an alert, I don't want another one until 8:05 AM (in case I am lazy or forgetful and don't fix the problem). It didn't seem to work that way when I tried it.
Not exactly. Let's say the server goes down at 7am. You will get an alert at 7:05am (well, not quite, since with the agent not running, it takes us a while to detect that, but for the sake of the argument, let's stick to the exact time). Then, if the server stays down, you'll get another alert at 7:10am, because another 5 minutes of downtime has been detected. There is currently no way to configure the alert so that it is suppressed for an hour before firing again.
Ok, thanks, I understand now. Although this might not be the correct forum, I would ask that the alert config be updated to allow a new option (assuming it doesn't already exist in the enterprise edition): send an alert after x time with the condition met, but only allow 1 alert per y time periods.
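For anyone reading later, the requested option could look roughly like the Python sketch below. This is purely hypothetical: it is not an HQ feature or API. The function name `simulate_alerts_with_cooldown` and the `cooldown` parameter are made up for illustration, and it again assumes one availability sample (0 or 100) per minute.

```python
from collections import deque


def simulate_alerts_with_cooldown(samples, required=5, window=60, cooldown=60):
    """Return the minute indexes at which an alert would fire.

    Same counting rule as "x minutes of downtime within a window",
    plus a hypothetical cooldown: after an alert fires, no further
    alert is sent for `cooldown` minutes.
    """
    down_minutes = deque()  # minute indexes with availability < 100
    fired = []
    next_allowed = 0  # earliest minute at which an alert may fire
    for minute, value in enumerate(samples):
        # Forget downtime that has slid out of the window.
        while down_minutes and down_minutes[0] <= minute - window:
            down_minutes.popleft()
        if value < 100:
            down_minutes.append(minute)
            # Fire only on a failing sample, and only outside the cooldown.
            if len(down_minutes) >= required and minute >= next_allowed:
                fired.append(minute)
                down_minutes.clear()
                next_allowed = minute + cooldown
    return fired
```

If the server stays down for two hours, `simulate_alerts_with_cooldown([0] * 120)` returns `[4, 64]`, which corresponds to an alert at about 7:05 and the next one at about 8:05. If the server recovers after 10 minutes, only the first alert at minute 4 is sent.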