VMware Modern Apps Community
AbhishekSK

What does integrate() do? Or, what's the inverse of deriv()?

I'm looking for the inverse of deriv(), but integrate() does not seem to be that. What is a "moving integration" and how is it different from a moving sum?

A particular use case: say I have one metric for items put into a queue and another for items taken out. I want to graph the number of items in the queue (offset by however many items were in the queue at the beginning).

If my metrics were monotonically increasing counters (like SNMP counters frequently are, modulo rollovers), then I could just subtract them and graph the difference.

But what if my metrics are "per unit time" rates, such as what statsd reports for its "counters"?
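To make that concrete, here's roughly the calculation I'm after, sketched in plain Python rather than the query language (the 60-second reporting interval and the sample values are made up):

REPORT_INTERVAL_S = 60  # assumed reporting interval, in seconds

added_per_s   = [2.0, 3.0, 1.0, 0.5, 4.0]   # made-up "items added per second" samples
removed_per_s = [1.5, 2.5, 2.0, 1.0, 3.0]   # made-up "items removed per second" samples

depth = 0.0          # unknown starting queue depth; the whole graph is offset by it
queue_depth = []
for a, r in zip(added_per_s, removed_per_s):
    # net items added during this reporting interval
    depth += (a - r) * REPORT_INTERVAL_S
    queue_depth.append(depth)

print(queue_depth)   # [30.0, 60.0, 0.0, -30.0, 30.0]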

AbhishekSK

Hi Phil,

A moving sum simply adds together the values of all data points of a particular time series within the time window, while the moving integration function calculates, at every second, a definite integral over a closed interval [a, b], where (b - a) is the width of the time window.
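As a toy illustration of that distinction (plain Python, not Wavefront syntax), assume a series that holds a constant value v and reports one point per minute, and that the moving integration treats it as a continuous constant function over the window:

WINDOW_S = 120           # a 2-minute moving window
REPORT_INTERVAL_S = 60   # one reported point per minute
v = 5.0                  # constant value of every reported point

points_in_window = WINDOW_S // REPORT_INTERVAL_S

moving_sum = points_in_window * v   # adds the two point values in the window: 10.0
moving_integration = v * WINDOW_S   # definite integral of the constant over [a, b]: 600.0

print(moving_sum, moving_integration)

So, when the data is evenly spaced, the two differ roughly by a factor of the reporting interval.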

Having said that, the easiest way to calculate the difference (assuming the unit is "items per second" and metrics are reported once a minute) would be to simply multiply the values by 60 and subtract - that would give you the difference over 1 minute:

(ts(queue.items.added.per.second) * 60) - (ts(queue.items.removed.per.second) * 60)

Please let us know if this helps!

-Vasily

AbhishekSK

How is the "width of the time window" defined for a single point? What happens when points are missing due to the source failing to report metrics temporarily?

And thanks for the suggestion of subtracting the metrics, but this is not my objective. I don't want the difference between two metrics at one point in time (or some moving window over some point in time), but rather the cumulative difference.

AbhishekSK

Let me clarify that further: if the metrics are reported at regular intervals, then an integral and a sum over the same period should be proportional. Yet they're not, and I especially can't make sense of the behavior when there isn't a metric for every single second: either because metrics are collected at some longer interval, or because some are missing due to a transient failure of the collector, or because some collectors (notably CloudWatch) don't report anything when the value is 0.

AbhishekSK

Phil,

Generally, converting "rate per unit of time" metrics back into monotonically increasing counters has the same challenges on every platform: getting an accurate cumulative value requires scanning the entire range of data since the beginning of time (when the metric started reporting), which may not be practical from a performance standpoint, and any missing data points are data loss that lowers the overall accuracy of the calculation.

Normally, for counter metrics statsd creates a stats. metric and a stats_counts. metric: the first represents a "per second" rate of change, and the second represents a "per flush interval" change (the default flush interval is 10 seconds). So, if you don't need absolute accuracy going back to the beginning of time, a moving sum over the stats_counts. metric would be appropriate: for example, if looking at 2 hours' worth of data, msum(2h, ts(stats_counts....)) would give you the same result as Graphite's integral(), which likewise only looks at the selected time window, simply adds together the visible values, and doesn't handle missing values well. (We also have an experimental and undocumented integral() function, which works like Graphite's and has the same limitations.)

With identical moving time window parameters the sums should indeed be proportional - could you please send us an example of these metrics? If you would prefer not to share them here, email us at support@wavefront.com and we'll take a look. If your metrics are missing a notable percentage of data points, you can somewhat improve the accuracy of the calculation by interpolating the missing values: msum(2h, align(10s, mean, interpolate(ts(stats_counts....)))), where 10s is the configured statsd flush interval. However, if you need to calculate the number of "in-flight" messages in the queue very precisely, using statsd gauges instead of counters would be ideal.
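As a rough sketch of why a moving sum over per-flush-interval counts approximates a cumulative counter, and why missing flushes hurt (plain Python, not Wavefront syntax; the counts are invented, and the "repeat the previous value" fill below is only a crude stand-in for what interpolate() does):

# Counts per 10-second flush interval; None marks a flush that never arrived.
counts = [4, 7, None, 3, 5]

def running_total(counts, fill_missing=False):
    total, prev, out = 0, 0, []
    for c in counts:
        if c is None:
            # either drop the point entirely (data loss) or fill it from its neighbour
            c = prev if fill_missing else 0
        total += c
        prev = c
        out.append(total)
    return out

print(running_total(counts))                     # [4, 11, 11, 14, 19] - the missing flush is simply lost
print(running_total(counts, fill_missing=True))  # [4, 11, 18, 21, 26]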

Let us know if this helps!

-Vasily

AbhishekSK

I'm sorry, but that really does not help. The use case is comparing two counters which on average should be equal, like "orders received" and "orders processed". If the difference between them is not zero in the long run, something's wrong.

If I'm looking at this week, I want to see how we did at processing every order that was received this week. I don't want to see, for each second of this week, how we did in the 604800 seconds prior to that point. I want to calculate results based on what's in the time period I'm looking at right now, which might be the past 5 minutes or the past 5 years, and I don't want to include information from time periods beyond what I can see. For example, I don't want the time John's dog ate the order list last week to make it seem like this week's counts are way off. I'm concerned about this week.

You say "if looking at 2 hours' worth of data, msum(2h, ts(stats_counts....)) would give you the same result as Graphite's integral()", but I think that's true only for the very last point. There's value in having calculations take effect only over the current time window, especially when talking about running sums. Would you rather see a running balance on your bank statement, or, for each transaction, the average balance of the last 30 transactions? Yes, missing data can interfere with accuracy (as missing data interferes with the accuracy of everything), and practicality dictates not taking that running sum back to the dawn of time. Nevertheless, this can be a useful thing to do.
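To put the bank-statement analogy in code (plain Python, with invented transaction amounts; I've used a trailing sum rather than a trailing average, but the point is the same):

transactions = [100, -30, -30, 50, -80, 10]   # invented transaction amounts

# Running balance over exactly the visible period - what a bank statement shows.
running = []
balance = 0
for t in transactions:
    balance += t
    running.append(balance)

# Trailing sum over the last 3 transactions - a fixed moving window.
WINDOW = 3
trailing = [sum(transactions[max(0, i - WINDOW + 1):i + 1])
            for i in range(len(transactions))]

print(running)    # [100, 70, 40, 90, 10, 20]
print(trailing)   # [100, 70, 40, -10, -60, -20]

Only the first of these answers "what has happened since the start of the period I'm looking at".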

Regarding sharing an example of metrics where integral behaves oddly, simply try this expression in a point chart:

integral(10s, align(1m, last, 1))

I hope we can agree that any definite integral of the function x(t) = 1 over a 10-second interval is 10. And yet that is not what happens: at the top of each minute the result is 10, then over the next 10 seconds it decreases to zero. Then for the last 50 seconds of each minute the integral isn't defined at all (or is it? I've never been clear on how WF handles missing data, but it would seem odd that I'm supposed to somehow interpolate the missing points of the integral when that's not what integral() is doing for its input function). The mean of this integral function is 5, which makes no kind of sense. And if I change the constant-1 function to be aligned to 5-second intervals, then the mean of the integral goes up to 8?!

AbhishekSK

We believe integral() would work in your case - please take a look and let us know if this is what you had in mind: https://metrics.wavefront.com/u/3vdGwHDrdY

-Vasily

AbhishekSK

That looks promising, thanks. I'll try incorporating it into some of our dashboards and let you know if any issues crop up.
