VMware Modern Apps Community
corey_r
Community Manager

How can I detect missing time series?

I have a time series which, if present, indicates that a job is running.  The job should always be running.

I would therefore like to alert on count(ts("X")) = 0, i.e. if there are no statistics being sent.

But if ts("X") is 'No Data', then count(ts("X")) is also 'No Data', at least in the ten minute view.  In a longer view which stretches back to a time when the job had been running, it shows zero.  The value shown depends on the view, which seems... wrong.

How can I detect a missing time series?


Accepted Solutions
corey_r
Community Manager

I love the questions justin_rowles -- Keep 'em coming!

I think the following discussion should help: What's the best way to alert when a source or series stops reporting?

You're right that the 'NO DATA' gaps make it hard to alert on a series that is no longer reporting. Our best practice suggestion is to wrap your query in mcount(), which counts the number of points reported within a moving time window. Say, for example, that a particular series should report 5 values in every 5 minute window. If you take the base query and wrap it with mcount(5m,), you can simply add = 0 to the end for your alert.

mcount(5m, ts(my.metric)) = 0

This will tell you when a series has stopped reporting completely. You can adjust the condition if you simply want to know when a series is not reporting as often as you'd like. For example, you can switch = 0 to < 2 if you want to know when that same series only reports a single value in a given 5 minute window.
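To make the moving-window idea concrete, here is a toy Python model of the check. It is a sketch, not Wavefront's implementation: the helper name mcount only mirrors the query function, and the timestamps, the sample series, and the exact window boundaries are assumptions chosen for illustration.

```python
from typing import Dict


def mcount(window_s: int, points: Dict[int, float], now_s: int) -> int:
    """Count the points whose timestamp falls in (now_s - window_s, now_s]."""
    return sum(1 for t in points if now_s - window_s < t <= now_s)


def alert_fires(points: Dict[int, float], now_s: int, window_s: int = 300) -> bool:
    """Toy equivalent of `mcount(5m, ts(my.metric)) = 0`."""
    return mcount(window_s, points, now_s) == 0


# Hypothetical series: one point per minute, last point at t=300s, then silence.
series = {t: 1.0 for t in range(60, 301, 60)}

healthy = alert_fires(series, now_s=300)  # window still holds 5 points
stopped = alert_fires(series, now_s=600)  # window holds no points
```

In this model, replacing the `== 0` comparison with `< 2` gives the "not reporting often enough" variant described above.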

5 Replies
corey_r
Community Manager

Eventually, I found a solution which alerts on EITHER no data OR bad data.  Others may wish to use something like this:

mavg(1m, rate(default(0, align(1m, sum(ts("X")))))) > Y

This tests that a count continues to increase over time.  If the metric stops being sent (job crashed), or the count remains the same (job hung), then I get an alert.

corey_r
Community Manager

Glad to see you're building out some more complex queries justin_rowles! Over the years, customers have shared several queries with us. This experience has helped us to understand where improvements can be made in order to maintain proper performance. I'd like to make a few suggestions to your query if that's okay.

I see that you are using sum() and rate() together in your query. This combination shows up in many customer environments. What we've learned over time is that it's better to apply rate() to the counter metric(s) first, and then apply sum(). Each counter metric will typically reset to zero at some point, but those resets do not always occur at the same time. If you apply sum() to a set of counters that reset at different times, the resulting single time series can behave like a non-counter metric (a gauge). Since rate() is meant to be used with counter metrics, this can have an adverse effect on the data that rate() returns.

It would also make sense to change sum() to rawsum() in this case. Since this is an alert, sum() can cause the alert to fire when the alerting check occurs, but (because sum() interpolates) give the false impression that the alert should not have fired when you investigate after the fact. Therefore something like rawsum(align(1m, rate(ts("X")))) would be a better approach.
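The ordering point can be illustrated with a small Python sketch. This is a toy model under stated assumptions, not Wavefront's rate() or sum(): points are assumed to be exactly 60 seconds apart, a decrease is treated as a counter reset that yields no rate value for that interval, and both counter series are made up.

```python
from typing import List, Optional

STEP_S = 60  # assumed spacing between consecutive points, in seconds


def rate(values: List[float]) -> List[Optional[float]]:
    """Per-second increase between consecutive points; None where the value decreased."""
    out: List[Optional[float]] = []
    for prev, cur in zip(values, values[1:]):
        delta = cur - prev
        out.append(delta / STEP_S if delta >= 0 else None)
    return out


def add_series(a: List[Optional[float]], b: List[Optional[float]]) -> List[Optional[float]]:
    """Point-wise sum of two series, treating None as an absent point."""
    result: List[Optional[float]] = []
    for x, y in zip(a, b):
        present = [v for v in (x, y) if v is not None]
        result.append(sum(present) if present else None)
    return result


# Hypothetical counters: A resets between its 2nd and 3rd point, B never resets.
counter_a = [60, 120, 30, 90]
counter_b = [0, 120, 240, 360]

rate_then_sum = add_series(rate(counter_a), rate(counter_b))  # [3.0, 2.0, 3.0]
sum_then_rate = rate(add_series(counter_a, counter_b))        # [3.0, 0.5, 3.0]
```

In the middle interval the summed series still goes up (240 to 270), so the reset of counter A is not recognized as a reset and the computed rate drops to 0.5, even though counter B alone was increasing at 2.0 per second. Applying the rate per counter first avoids that dip.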

I'd also like to make a suggestion about default(). This is based on some recent changes to its behavior. Prior to the change, default() would apply a default value to a series that wasn't reporting data, but it would stop applying that value once the last reported value was more than 4 weeks in the past. We came across an issue where, at times, default() would not apply a default value in small time windows (< 5 minutes). Some improvements were made to the default() function (and are in the process of being rolled out), but default() will now also apply a default value to a series even if its last reported value was more than 4 weeks ago.

This means that by using default() in your query, you are also retrieving series that may have legitimately stopped reporting over a month ago. Not only can this cause you to evaluate series that you did not expect, it can also slow down performance.

My suggestion would be either to add a second argument to this alert expression instead of using default(), or to apply a time parameter to default() so that considerably old data is not included in your results. For the latter approach, you could go with something like default(2w, 0). I would suggest the first approach though, which you could accomplish with an expression like the one below:

mcount(5m, ts("X", not tag="decommissioned")) = 0 or mcount(5m, highpass(0, rawsum(align(1m, rate(ts("X")))))) = 0

In the example query above, you would be alerted if there are no reported values at all (mcount() approach) or if the total summed rate remains unchanged for X minutes (based on Minutes to Fire). In the first argument, I specify not tag="decommissioned" in the query so that any series that did report data in the past but are no longer reporting for valid reasons are not evaluated. It requires that you properly apply a "decommissioned" source tag to the sources that are no longer reporting data. This can be done through a manual process or a script. We talk about the "decommissioned" approach here.

A small note on the approach above: you'll see that I'm using 'or' in this expression. Used this way, 'or' and 'and' are mathematical operators. If both arguments return 2 or more series, a series needs to appear in both arguments in order to be evaluated properly. For example, if the first argument includes series A, B, C but the second argument includes series C, D, E, then only C would be properly evaluated by the query. In the example above, though, the second argument results in a single series (since it's using rawsum()), and therefore all series in the first argument will be evaluated properly. We discuss this in more depth in our Series Matching doc.

This mathematical 'or' does require data to be present in both arguments in order to function properly. If no data was being reported, which is the case the first argument captures, then the second argument wouldn't have any data either if we simply went with rawsum(align(1m, rate(ts("X")))) = 0. Because of this, I applied highpass(0,) to the query and then wrapped mcount(5m,) around that. highpass(0,) keeps only the values where the rate is above 0. If data was reporting but the counter was not increasing (job hung), the rate would be 0 and highpass(0,) would drop those points. The mcount(5m,) around it then counts how many non-zero rate values occurred over the last 5 minutes. If there were none at all in that span, the mcount() value would be 0 and cause the alert to fire. You could also consider separating these arguments into two alerts.
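The highpass-then-mcount step can be sketched in Python as follows. Again, this is a toy model rather than Wavefront's implementation: the helper names only mirror the query functions, and the rate values and timestamps are hypothetical.

```python
from typing import Dict


def highpass(threshold: float, points: Dict[int, float]) -> Dict[int, float]:
    """Keep only the points whose value is strictly above the threshold."""
    return {t: v for t, v in points.items() if v > threshold}


def mcount(window_s: int, points: Dict[int, float], now_s: int) -> int:
    """Count the points whose timestamp falls in (now_s - window_s, now_s]."""
    return sum(1 for t in points if now_s - window_s < t <= now_s)


# Hypothetical per-minute rate values: the job works until t=120s, then hangs.
# It keeps reporting, but the counter stops increasing, so the rate is 0.
rates = {60: 1.0, 120: 1.0, 180: 0.0, 240: 0.0,
         300: 0.0, 360: 0.0, 420: 0.0, 480: 0.0}

plain_count = mcount(300, rates, now_s=480)                  # zeros still count as points
filtered_count = mcount(300, highpass(0, rates), now_s=480)  # zeros are dropped first
hung = filtered_count == 0
```

Without the filter, the hung job still produces 5 points in the window and a plain count never reaches 0. With the filter, the count of non-zero rates drops to 0 and the condition becomes true.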

Lastly, you may have noticed that my example query does not include mavg(1m,) on the 2nd argument. I removed it because of how alerting checks are executed. Wavefront runs the alert query approximately every 60 seconds. Instead of evaluating all of the raw data associated with the query, the alerting check summarizes (averages) the raw data into one-minute buckets. In this query, align(1m,) is already being applied, so the mavg(1m,) value stays the same for an entire minute before it changes. Since the alerting check is going to summarize those values into minute buckets anyway, the reported mavg(1m,) values are averaged together at 1m intervals and result in the same value you get with align(1m,) alone. Because of this, I would suggest removing mavg(1m,) from the query completely.

I hope these suggestions help! The best way to learn about the nuances of the query language is to start playing with it. I appreciate you sharing your example in this thread and hope I can be of more assistance in the future as well!

corey_r
Community Manager

Crikey, what a read.

Yep, understood all that, thanks.  This is slowly becoming clearer.

When we relaunch a job, it gets a new job id, so it gets a new metric path (at one level only).

e.g.:

blah.blah.myapp.1.batches

blah.blah.myapp.2.batches

I want to alert if I am getting a non-zero rate for any "blah.blah.myapp.*.batches" OR no agents sending "blah.blah.myapp.*.batches".

So, instead of using 'decommissioned', I would rather find the mcount of all contributing time series, instead of one per time series.   I've changed the mcount duration to 1m too.

This gives me:
sum(mcount(1m, ts("blah.blah.myapp.*.batches") as t)) = 0 or sum(mcount(1m, highpass(0.05, rawsum(align(1m, rate($t)))))) = 0

Thoughts please?!

corey_r
Community Manager

I think you're pretty close on that query, based on my understanding. With both arguments using sum()/rawsum(), the only way the alert would trigger is if all sources/series meet the condition (e.g. all sources have stopped reporting data or all sources are reporting a rate of zero). If you'd like either argument to fire based on the behavior of a single source (instead of all sources aggregated), then you'll want to remove sum()/rawsum() from that argument. Other than that I think you're good to go!
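The difference between the aggregated and the per-series condition can be shown with a short Python sketch. The point counts below are made-up values for the two example metric paths from this thread, and the comparisons are only a toy model of how the alert condition is evaluated.

```python
from typing import Dict

# Hypothetical per-series point counts over the last minute.
counts: Dict[str, int] = {
    "blah.blah.myapp.1.batches": 0,  # old job id, no longer reporting
    "blah.blah.myapp.2.batches": 4,  # current job id, still reporting
}

# Aggregated condition, like sum(mcount(...)) = 0:
# true only if every contributing series is silent.
aggregated_fires = sum(counts.values()) == 0

# Per-series condition, like mcount(...) = 0 without sum():
# true as soon as any single series is silent.
per_series_fires = any(c == 0 for c in counts.values())
```

With relaunched jobs leaving old metric paths behind, the per-series form would fire on the retired job id, while the aggregated form stays quiet as long as at least one job is still sending data.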
