Some of the metrics from my service record ongoing progress towards completing a large, long-term task. While running, each instance of the service reports, among other metrics, a count of items completed. As a long-running, stateful process, it checkpoints its progress as it goes, so that when it is restarted (or crashes) it can resume close to the point where it left off.
Naturally, the counter metrics it reports reset when it restarts (or crashes).
I would like a long-term sum of these counts across all instances and restarts. So far, I’ve not been able to accomplish this. I have tried applying an integral to the summed rate of all instances, but the result is clearly not accurate.
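To make concrete what I mean by a long-term sum, here is a minimal Python sketch of the calculation I'm after. It is not something my monitoring system provides. The function names and the sample data are hypothetical. It assumes that each counter restarts from zero, that each series begins when the process starts, and that any decrease between consecutive samples means a restart happened.

```python
from typing import Dict, List


def reset_aware_total(samples: List[float]) -> float:
    """Total items counted by one instance, surviving counter resets.

    Assumption: the counter restarts from zero, so a drop between two
    consecutive samples is treated as a restart.
    """
    total = 0.0
    prev = None
    for value in samples:
        if prev is None or value < prev:
            # First sample, or a reset: everything counted so far in this
            # run (since zero) is new progress.
            total += value
        else:
            # Normal case: add only the increase since the last sample.
            total += value - prev
        prev = value
    return total


def long_term_count(series_by_instance: Dict[str, List[float]]) -> float:
    """Sum the reset-aware totals across all instances (hypothetical helper)."""
    return sum(reset_aware_total(s) for s in series_by_instance.values())


if __name__ == "__main__":
    # Hypothetical samples: instance "a" restarts after reporting 12,
    # instance "b" restarts after reporting 4.
    series = {
        "a": [0, 5, 12, 3, 9],
        "b": [2, 4, 1],
    }
    print(long_term_count(series))
```

Even this is only approximate: items completed between the last sample and a crash are never reported, and a restart that goes unnoticed because the counter has already climbed past its previous value would be undercounted.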
Can someone suggest a means to track such a long-term count?