1 Reply Latest reply on Jul 27, 2017 2:40 PM by CloudCredManager

    Alerting question



      This question has been Answered.

      piotr.wreczycki (Level 1)

      Hey guys,

       

      I have a question about alerting. Please look at the event log below:

       

      I received the alert below at 00:13:17 UTC.

       

      -- SNIP --

       

      OPENED

       

      Alert: DP Disk Full Failure - 24 hours

      Condition: 2h * lowpass(12, ts("dp.*.df-*.percent_bytes-free", not source="i-*") / highpass(0, lag(2h, ts("dp.*.df-*.percent_bytes-free"))-ts("dp.*.df-*.percent_bytes-free"))) and ts("dp.*.df-*.percent_bytes-free") < mmin(1d, lag(2h, ts("dp.*.df-*.percent_bytes-free"))) * 0.9

      Created: 07/29/2016 09:02:06 +0000

       

      Affected since: 09/06/2016 23:58:17 +0000

      Event started: 09/07/2016 00:13:17 +0000

       

      Sources/Labels Affected:

        dp-beta-sdriver-a-1 (dp.beta.sdriver.host.df-root.percent_bytes-free)

        dp-prod-kafka-a-1 (dp.prod.kafka.host.df-run.percent_bytes-free)

        dp-prod-kafka-b-1 (dp.prod.kafka.host.df-run.percent_bytes-free)

        dp-prod-kafka-c-1 (dp.prod.kafka.host.df-run.percent_bytes-free)

        dp-prod-kafka-c-2 (dp.prod.kafka.host.df-run.percent_bytes-free)

        dp-prod-kafka-a-2 (dp.prod.kafka.host.df-run.percent_bytes-free)

       

      -- SNIP --

       

      Two minutes later (00:15:17 UTC) I received a RECOVERY message saying:

       

      -- SNIP --

      RECOVERED


      Alert: DP Disk Full Failure - 24 hours
      Condition: 2h * lowpass(12, ts("dp.*.df-*.percent_bytes-free", not source="i-*") / highpass(0, lag(2h, ts("dp.*.df-*.percent_bytes-free"))-ts("dp.*.df-*.percent_bytes-free"))) and ts("dp.*.df-*.percent_bytes-free") < mmin(1d, lag(2h, ts("dp.*.df-*.percent_bytes-free"))) * 0.9
      Created: 07/29/2016 09:02:06 +0000

       

      Affected since: 09/06/2016 23:58:17 +0000
      Event started: 09/07/2016 00:13:17 +0000
      Event ended: 09/07/2016 00:15:17 +0000

       

      Sources/Labels recovered:

        dp-beta-sdriver-a-1 (dp.beta.sdriver.host.df-root.percent_bytes-free)

        dp-prod-kafka-c-1 (dp.prod.kafka.host.df-run.percent_bytes-free)

        dp-prod-kafka-a-1 (dp.prod.kafka.host.df-run.percent_bytes-free)

        dp-prod-kafka-c-2 (dp.prod.kafka.host.df-run.percent_bytes-free)

        dp-prod-kafka-a-2 (dp.prod.kafka.host.df-run.percent_bytes-free)

        dp-prod-kafka-b-1 (dp.prod.kafka.host.df-run.percent_bytes-free)

      -- SNIP --

       

      When I looked at the graph, I saw that the non-zero condition actually started at around 10:28 PM UTC on 09/06/2016 and finished at around 12:00 AM UTC.

      The alert is configured with a 15-minute "Minutes to fire" property as well as a 2-minute "Minutes to resolve".

       

      My understanding is that if, according to the graph, the condition became true at around 10:28 PM, then the alert should fire 15 minutes later at around 10:43 PM, and the recovery message should be sent at around 12:02 AM.

       

      Please find the graph below:

      Dropbox - Screen Shot 2016-09-08 at 14.30.32.png

       

      Can you let me know if my understanding of Wavefront alerting is correct?

       

      Thanks,

      Piotr

        • 1. Re: Alerting question
          Correct Answer

          jason_goocher (Level 5)

          This is a great one to look at, piotr.wreczycki! While it has several layers, I'm sure we can provide a proper answer.

           

          I was looking at the chart window for this alert and I see there are 6 hosts that caused the alert to fire (listed in your message above). Based on the minutes to fire field being 15 minutes, it does look like each of those 6 hosts did update the alert 15 minutes after the condition was true. For example, dp-prod-kafka-b-1 became true around 10:55p UTC and updated the alert around 11:10p UTC.

           

          According to the chart here, it looks like the alert stopped firing shortly after 12:03a UTC, but then started to fire again around 12:06a UTC. This pattern continues until ~12:15a UTC. The reason this pattern occurs is the Minutes to Resolve field and how we evaluate missing data. In Wavefront, gaps of missing data are considered neither true nor false for alerting purposes. During the alerting check, our system evaluates the data associated with the alert query and determines whether there is at least 1 true value and no false values present in the given "Minutes to Fire" window. For example, around 12:06a UTC, our alerting check looked at data within the last 15 minutes (~11:51p - 12:06a) and identified true values between 11:51p and 12:00a. Since gaps of missing data are neither true nor false, the 15-minute window included several true values before 12:00a UTC and no false values. This caused the alert to fire at ~12:06a UTC.
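
The evaluation rule described here — fire when the look-back window contains at least one true value and no false values, with missing points counting as neither — can be sketched in Python. This is a simplified illustrative model, not Wavefront's actual implementation; `True`, `False`, and `None` stand for true, false, and missing points:

```python
def should_fire(window):
    """Decide one alerting check: fire only if the window holds at least
    one true value and no false values. None marks a gap of missing data
    and is ignored (neither true nor false)."""
    observed = [v for v in window if v is not None]
    return len(observed) > 0 and all(observed)

# Like the ~12:06a UTC check: true values before 12:00a, then only gaps
print(should_fire([True, True, True, None, None, None]))  # fires
print(should_fire([True, False, None]))                   # a false value blocks firing
print(should_fire([None, None, None]))                    # all missing: no trigger
```

Note that an all-missing window does not fire, but it also does not count as false — which is exactly what allows the re-firing behavior described above.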

           

          When an alert is in a firing state, the alerting check then starts to evaluate the last "Minutes to Resolve" window. In this case, that would be 2 minutes (side note: if there is no "Minutes to Resolve" set, then it is considered the equivalent of "Minutes to Fire"). So at the next alerting check interval (~12:07a UTC), the alerting check sees no data (since the data stopped around 12:00a UTC), therefore identifies no true/false values, and resolves the alert. Then the alerting check at ~12:09a UTC repeats this process of looking at the last 15 minutes, identifies true values just prior to 12:00a UTC, and fires. That's why this alert seemingly fires/resolves several times until ~12:15a UTC. At ~12:15a UTC, the 15-minute window associated with "Minutes to Fire" would only see missing data and would not trigger the alert.
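
The whole fire/resolve cycle can be replayed end to end with a small simulation. This is again a hedged sketch: the one-minute check interval, the exact timeline, and the resolve rule ("no true values left in the resolve window") are assumptions for illustration, not Wavefront's documented internals:

```python
def simulate(series, minutes_to_fire=15, minutes_to_resolve=2, firing=True):
    """Replay per-minute alerting checks. series maps a minute index to
    True (condition met), False (condition not met), or None (missing
    data). Returns the list of (minute, state) transitions."""
    transitions = []
    for t in sorted(series):
        if not firing:
            # Not firing: look back over the "Minutes to Fire" window.
            window = [series[m] for m in range(t - minutes_to_fire + 1, t + 1)
                      if m in series]
            observed = [v for v in window if v is not None]
            if observed and all(observed):        # >= 1 true, no false
                firing = True
                transitions.append((t, "FIRE"))
        else:
            # Firing: look back over the "Minutes to Resolve" window and
            # resolve once no true values remain in it.
            window = [series[m] for m in range(t - minutes_to_resolve + 1, t + 1)
                      if m in series]
            if not any(v for v in window if v is not None):
                firing = False
                transitions.append((t, "RESOLVE"))
    return transitions

# Minute 0 ~ 11:45p UTC: the condition is true through minute 15 (~12:00a),
# then the series stops reporting entirely (None = missing data).
series = {t: (True if t <= 15 else None) for t in range(36)}
for minute, state in simulate(series):
    print(minute, state)
```

Running this shows the flapping pattern from the explanation above: a first RESOLVE once the 2-minute resolve window is empty, then alternating FIRE/RESOLVE transitions while the 15-minute fire window still reaches back to the true values, and silence once that window contains only missing data.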

           

          I hope this explanation helps. We also noticed that there may be some places to improve the actual alert query and would love to hear more about what you are attempting to track. Feel free to provide those details here, or you could reach out to us directly at support@wavefront.com if you'd prefer to go that route. Based on your response, we may suggest some updates to better track your use case. For example, we could provide a way for you to evaluate missing data as "False" if that is your preference.

           

          Thanks,

          Jason