1 Reply Latest reply on Jul 28, 2017 10:46 AM by AbhishekSK

    Unclear why alert fired

    AbhishekSK Enthusiast
    VMware Employees

  I am struggling to understand why my alert fired. After reading Alert States and Lifecycle, I noticed that I'm using an aggregate function, so I suspect the answer relates to item b under "I don't think my alert should've fired. Why did it?" in that document, but I'm not sure how. I performed the following steps:

      - Narrowed the time period shown in the Wavefront graph to 5 minutes, so I could concentrate on just the period before and after the alert fired.

      - Put my cursor on the alert. From here I could see details indicating that the host First Affected by this alert was "web0228".

      - Changed Chart Type from "Line Plot" to "Point Plot" so I could more easily see what values were being reported. From this I could see that there were stretches of time during which no data was received.

      - Split the aggregate query into 3 separate per-host queries so I could see the actual values present. The values in question are HTTP response codes, and the alert condition is ">= 400". The maximum value received during the period in question was 204, so we never received a value that met the >= 400 threshold, and yet the alert fired.
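      To show the shape of what I tried (the metric name here is a stand-in, and only web0228 is a real hostname from my environment; the other sources are illustrative), the original aggregate condition versus the three per-host queries I split it into looked roughly like:

```
max(ts(http.response.code)) >= 400
ts(http.response.code, source=web0228)
ts(http.response.code, source=web0229)
ts(http.response.code, source=web0230)
```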

      - Modified the query to append the string cname=web0228... and unticked the 2 individual queries that were unrelated to this alert firing, so I could focus only on data points for this host. I didn't see any unusual pattern here; from what I can see, data was being reported at a standard cadence.

      - Looked at the configuration of the alert in question. The Alert History shows the alert was modified the day before it fired: Alert minutes was updated from 10 to 15. I don't think this is relevant.

      - Noticed that when I choose Backtesting on the Edit Alert page, it does not indicate that an alert would have fired at the time it did. Does this suggest that data was received late and interpolation was being used?

       

      I'm missing something in my understanding here. But what?

        • 1. Re: Unclear why alert fired
          AbhishekSK Enthusiast
          VMware Employees

          Hi Owen,

           

          Actually, I didn't see the aggregation function in the query that you sent over. At first glance it would appear that the alert shouldn't have fired. One potential cause (and the most common cause we typically see) is that your data was delayed by more than the 15-minute alert threshold. Looking at the underlying data points, however, I can't find any gaps longer than a few minutes. The fact that someone changed the alerting threshold from 10 to 15 minutes is interesting, though. It's possible the data is delayed sometimes, and every once in a while it's delayed by more than 10 minutes. Once the points come in, you can't tell they were delayed. I've sent this over to engineering to have a look.
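          To make the "once the points come in, you can't tell" problem concrete, here's a rough Python sketch (the timestamps, arrival delays, and values are invented for illustration, not taken from your data). A real-time alert evaluator only sees points that have arrived by evaluation time, while backtesting runs against the fully backfilled series, so the two can legitimately disagree:

```python
from dataclasses import dataclass

@dataclass
class Point:
    ts: int        # minute the value describes
    arrival: int   # minute the point actually reached the server
    value: float

# Hypothetical series: values stay well under a 400 threshold,
# but one batch of points arrives 12 minutes late.
points = [
    Point(0, 0, 200), Point(1, 1, 204),
    Point(2, 14, 200), Point(3, 14, 204),  # delayed batch
]

def visible_at(points, eval_minute):
    """Points the alert evaluator could actually see at eval_minute."""
    return [p for p in points if p.arrival <= eval_minute]

# Real-time view at minute 12: the series appears to stop at minute 1,
# leaving a gap the alert engine must interpolate across or treat as missing.
realtime = visible_at(points, 12)
assert max(p.ts for p in realtime) == 1

# Backtesting view, long after all points arrived: no gap at all, so a
# backtest cannot reproduce what the evaluator saw at minute 12.
backtest = visible_at(points, 10**6)
assert max(p.ts for p in backtest) == 3
```

          If something like the delayed batch above happened in production, the real-time evaluation at minute 12 would have seen an 11-minute gap that is invisible to any backtest run later.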

           

          In 3.0 we'll have a feature we can activate that records the data points as they appeared at the time an alert fires. As soon as we upgrade you, we can activate that feature and get a better idea of whether this is caused by a delay in your data or some other issue.