Why do I get data when I divide by a lowpass(x) which returns no data?

AbhishekSK · ‎07-28-2017

I have an alert which is intended to show that, if a disk continues filling for the next hour as rapidly as it did on average for the last hour, it will fill up:

lowpass(3600, -1 * ts("*.df-*.percent_bytes-free") as pu / lowpass(0, mavg(3600s, deriv($pu))))

The first lowpass is to limit to 'things which will fail in less than one hour', the second is to ensure it is only for 'things which are trending towards full'. Otherwise you get results for things which are trending down from full and would have been full less than an hour ago!
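As a rough illustration of what the query computes for a single point, here is a minimal Python sketch. The helper names (`lowpass` as a per-point filter, `seconds_until_full`) and the sample numbers are made up for illustration and are not Wavefront's implementation. The sketch also assumes that dividing by a missing point yields a missing point.

```python
from typing import Optional


def lowpass(threshold: float, value: Optional[float]) -> Optional[float]:
    """Keep a point only if it is below the threshold; otherwise report no data (None)."""
    if value is None or value >= threshold:
        return None
    return value


def seconds_until_full(pct_free: float, avg_rate: Optional[float]) -> Optional[float]:
    """Per-point mirror of the query.

    pct_free is the percentage of free bytes; avg_rate is the hourly moving
    average of its derivative, in percent per second (hypothetical units).
    """
    falling = lowpass(0, avg_rate)  # keep only negative slopes (disk filling up)
    if falling is None:
        return None  # assumption: dividing by no data gives no data
    return lowpass(3600, -1 * pct_free / falling)  # keep only "full within the hour"


# 10% free, shrinking by 0.005% per second: roughly 2000 seconds left, so the point is kept.
print(seconds_until_full(10.0, -0.005))
# Free space is growing: the inner lowpass drops the point, so the result is None.
print(seconds_until_full(10.0, 0.005))
# Filling, but more than an hour away: the outer lowpass drops the point.
print(seconds_until_full(50.0, -0.005))
```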

My problem is this: when the second lowpass returns No Data, the alert still fires. Why is this? Is it a bug, or something I've failed to understand?

I have an alternative version (using '(X - at(X))/period') which appears to work, so please don't try to help me write something better; I am only interested in why the alert fires instead of seeing No Data.

AbhishekSK · ‎07-28-2017

Hi justin_rowles! I believe I understand your question correctly and would love to help. I see that the alert you are asking about has already been updated to the alternative version you referred to, so I will focus on the period when the original version was in use.

The following link shows that while values are returned at times, there are several gaps of missing data which correspond to the sections of 'No Data'. I added align(1m,) to the query to mimic the per-minute summarized values that are evaluated for an alert.

Link: https://connectedhome.wavefront.com/u/MgpQ7CBR7J

For each alert check, our system looks at the last "Minutes to Fire" window (in this case 15 minutes) and checks whether there is at least 1 true value reported and no false values. Since there is no specific condition associated with this alert (e.g. > 24), any non-zero reported value is considered true and any zero reported value is considered false. Each time the alert fired, there was at least 1 non-zero reported value present in the last 15 minutes. Gaps of missing data are considered neither true nor false in our system, so a window that contains one non-zero value and otherwise only gaps meets the firing condition, and the alert is expected to fire. To handle those gaps of missing data, you can apply a default(0,) around the entire expression. This replaces the gaps with a value of 0, which is considered false in this scenario.
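The check described above can be sketched in a few lines of Python. This is only a simplified model of the rule as stated in this post, not the actual alerting code; `should_fire` and `default` are hypothetical helper names, and `None` stands for a minute with no data.

```python
from typing import List, Optional


def should_fire(window: List[Optional[float]]) -> bool:
    """Return True if the window has at least one true value and no false values.

    Non-zero values count as true, zeros count as false, and None (a gap)
    counts as neither.
    """
    reported = [v for v in window if v is not None]
    has_true = any(v != 0 for v in reported)
    has_false = any(v == 0 for v in reported)
    return has_true and not has_false


def default(fill: float, window: List[Optional[float]]) -> List[float]:
    """Replace every gap with the fill value, in the spirit of default(0, ...)."""
    return [fill if v is None else v for v in window]


# 15 one-minute slots: 14 gaps followed by a single non-zero value.
window = [None] * 14 + [1800.0]

print(should_fire(window))              # True: one true value, gaps are ignored
print(should_fire(default(0, window)))  # False: the gaps are now zeros, i.e. false values
print(should_fire([None] * 15))         # False: nothing reported at all
```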

I also noticed that the alert seemed to go back and forth between fired and resolved quite often. This seems to be tied to the 'Minutes to Fire' and 'Minutes to Resolved' fields being 15 minutes and 2 minutes, respectively. I'd be glad to provide additional context on this topic if you have questions about it.

Did this explanation help to answer your question?

AbhishekSK · ‎07-28-2017

Thank you for that, it's improved my understanding. I had naively thought that a trigger would fire if 'continuously positive for n minutes', but I can see that that is not valid, as the data is discrete.

The rule 'any positives and no negatives within the trigger period' for starting an alert makes sense and, combined with 'no data is not negative data', clears up most of my misunderstanding.

I can also see that wrapping in default(0, lowpass()) would have fixed the problem, by forcing the 'no data' to be 'not a problem'.

However, it doesn't seem right that I get an event at, for example, 11:34:26:
https://connectedhome.wavefront.com/chart#(c:(cs:(customTags:!(),fixedLegendDisplayStats:!(CURRENT),...

The 'unfire' rule would appear to be 'no positives during a period'. This means that I can see a new event caused by data which occurred in a previous, now closed, event. This is definitely counter-intuitive.
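To make the effect concrete, here is a small Python simulation of the two rules as I understand them from this thread. It is a toy model, not the real alerting engine: the per-minute checks, the `simulate` function, and the window handling are my own assumptions, and the resolve rule is only my inference.

```python
from typing import List, Optional, Tuple

FIRE_WINDOW = 15     # 'Minutes to Fire'
RESOLVE_WINDOW = 2   # 'Minutes to Resolved'


def is_true(v: Optional[float]) -> bool:
    return v is not None and v != 0


def is_false(v: Optional[float]) -> bool:
    return v is not None and v == 0


def simulate(series: List[Optional[float]]) -> List[Tuple[int, str]]:
    """Check once per minute and return (minute, event) pairs. None means no data."""
    events = []
    firing = False
    for t in range(len(series)):
        if not firing:
            # Fire rule: at least one true value and no false values in the window.
            window = series[max(0, t - FIRE_WINDOW + 1): t + 1]
            if any(is_true(v) for v in window) and not any(is_false(v) for v in window):
                firing = True
                events.append((t, "fired"))
        else:
            # Inferred resolve rule: no true values in the (shorter) window.
            window = series[max(0, t - RESOLVE_WINDOW + 1): t + 1]
            if not any(is_true(v) for v in window):
                firing = False
                events.append((t, "resolved"))
    return events


# A single non-zero point at minute 0, followed only by gaps.
events = simulate([1800.0] + [None] * 19)
print(events[:3])  # [(0, 'fired'), (2, 'resolved'), (3, 'fired')]
```

In this model the second 'fired' event at minute 3 is caused purely by the point at minute 0, which already belonged to the first, closed event.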

Perhaps the fire rule could be 'any positives and no negatives during a trigger period, unless a subsequent un-trigger period of no positives has been seen'.
