I have a couple of servers that are alerting for 'Virtual machine has chronic CRITICAL CPU workload leading to CPU stress', which isn't an event-based alert but instead uses data from the past 30 days.
Both servers are actually right-sized and are only alarming because of one application issue a week or so ago that caused the CPUs to spike for a while.
Do I have to wait for the 30-day cycle to end before these alerts go away, or is there a way to 'reset' them on these servers?
I can't change the 30-day time setting, and cancelling these alerts only brings them right back.
That Alert Definition is tied to a Symptom Definition which has a cancel cycle of 1, which should mean 5 minutes. That said, the critical threshold is set by default at CPU|Stress greater than 50%, so if you are still exceeding 50%, it will still alert.
Both are editable.
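To make the cycle arithmetic above concrete, here is a rough sketch. It assumes a 5-minute collection interval (which is what "1 cycle = 5 minutes" implies) and the 50% default threshold mentioned above. The function names are made up for illustration and are not part of any vROps API.

```python
# Assumption: one collection cycle lasts 5 minutes, as implied in the reply above.
COLLECTION_INTERVAL_MIN = 5


def cancel_delay_minutes(cancel_cycles, interval_min=COLLECTION_INTERVAL_MIN):
    """Minutes the symptom must stay clear before it cancels (hypothetical helper)."""
    return cancel_cycles * interval_min


def symptom_active(cpu_stress_pct, threshold_pct=50.0):
    """True when CPU|Stress is strictly greater than the critical threshold."""
    return cpu_stress_pct > threshold_pct


print(cancel_delay_minutes(1))   # cancel cycle of 1
print(symptom_active(60.0))      # above the 50% default
print(symptom_active(50.0))      # exactly at the threshold does not trigger
```

So with a cancel cycle of 1, the symptom should clear one collection after the metric drops to 50% or below, provided the metric really has dropped.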
So let me understand... If the alert is cancelled and the condition doesn't occur again within 5 minutes, it shouldn't come back.
However I do have my stress criteria in the policy set to >80%, 120min peak, 30 day sample.
That policy setting is different from the alert itself. The policy is more capacity related, whereas the alert indicates an immediate and precise issue.
Ok, got that. But here's where I'm having trouble... My alert is based on a 'critical' symptom where CPU% > 75. But after cancelling this alert, the CPU didn't get above or near 50% and it came back again.
Is the second image showing CPU | Stress, or CPU | Workload or Usage?
Here is the screen you asked for, along with another view that shows the CPU stress from back on 9/25. My thinking is that since the stress timeframe goes back 30 days, this might be why the alarm isn't clearing?
That could explain why it is recurring. Where in the policy did you set the 30 day time frame?
It's set in the policy. 30 days is the default setting for non-trend-based analytics. But there has to be a way to stop these alerts from showing up when the stress was caused by some short-term issue with the server, instead of having to wait for the entire 30 days to go by.
I really don't think that is the issue, but change it to 1 day and see. It's easy enough to change back.
OK, try this.
1) Check the policy that is being applied to this VM to ensure you edit the right policy.
2) Edit that policy and make sure to choose the vCenter Adapter - Virtual Machine.
3) In the stress field, change CPU from Sliding Analysis Window: Any, to Entire Range.
4) In the time field, change the Date Range to 1 day.
This works. My badge score was 60 earlier, but now that it is based on the last day for the VM, it is 4.
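For anyone who finds this later, here is a toy model of why shortening the date range clears things up. It is not the actual vROps stress calculation. It simply checks whether any sample inside the analysis window exceeds a threshold, using made-up hourly data with a short spike a week ago. The 80% threshold matches the policy setting mentioned earlier; everything else (names, sample rate, values) is an assumption for illustration.

```python
def window_exceeds(samples, window_days, threshold_pct, samples_per_day=24):
    """Return True if any sample in the last `window_days` is above the threshold.

    Simplified stand-in for a stress check over a date range; NOT the real
    vROps formula.
    """
    if window_days <= 0:
        raise ValueError("window_days must be positive")
    recent = samples[-window_days * samples_per_day:]
    return max(recent) > threshold_pct


# Hypothetical data: 30 days of hourly CPU readings at a 20% baseline,
# with a 6-hour spike to 95% that started seven days ago.
history = [20.0] * (30 * 24)
spike_start = len(history) - 7 * 24
for i in range(spike_start, spike_start + 6):
    history[i] = 95.0

print(window_exceeds(history, 30, 80.0))  # 30-day range still contains the spike
print(window_exceeds(history, 1, 80.0))   # 1-day range no longer sees it
```

With the 30-day range the week-old spike is still inside the window, so the condition keeps being met. With a 1-day range only recent, healthy samples are considered.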