VMware Cloud Community
SWARob
Contributor

ALERTS - CPU Workload leading to CPU stress

I've been working on this alert for some time now in our vRealize Ops 6.0 environment.

We constantly get hundreds of these alerts, which trigger every collection cycle, and I'm trying to tune this so it is less sensitive.

Why do I get this critical alert...

cpu2.jpg

When this is the current performance?

cpu1.jpg

If anyone could shed some light on this that would be great!

Many thanks in advance.

7 Replies
greco827
Expert

The CPU demand on this particular VM is higher than the threshold will allow.  In the case of the example you have provided, it is above 100, so it is going to alert regardless of the buffer, unless you disable the alert altogether, which I do not recommend.

Your policy will dictate some of these things.  The alert is set to go critical when the stress level is greater than 80%.  This is 80% of the usable capacity, which is to say, 80% of (total capacity - buffer), where the buffer is set in the policy.
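To make the arithmetic concrete, here is a minimal Python sketch of the relationship described above. It is not vROps code: the function names, the idea of expressing the buffer as a percentage of total capacity, and the example numbers (4000 MHz, 10% buffer) are all assumptions for illustration.

```python
def usable_capacity(total_capacity: float, buffer_pct: float) -> float:
    """Capacity left after reserving the policy buffer (buffer_pct is 0-100).

    Assumption: the buffer is a percentage of total capacity.
    """
    return total_capacity * (1 - buffer_pct / 100.0)


def stress_threshold(total_capacity: float, buffer_pct: float, threshold_pct: float) -> float:
    """Demand level (same units as capacity) at which the threshold is breached."""
    return usable_capacity(total_capacity, buffer_pct) * threshold_pct / 100.0


# Hypothetical VM: 4000 MHz total CPU, 10% buffer, 80% critical threshold.
limit = stress_threshold(4000, 10, 80)
print(f"Threshold is breached above {limit:.0f} MHz of demand")
```

With these made-up numbers the threshold lands at 2880 MHz, which is only 72% of the raw 4000 MHz. Under this model, the buffer is why an alert can fire while demand still looks comfortably below the configured percentage of total capacity.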

The graph you show is from a very small range.  If you look at a broader scope, you probably have much higher spikes in CPU demand.  In the policy, under stress at the virtual machine level, are you using the "Any" parameter or the "Entire Range" parameter?  If the latter, what is the time range set to?

vROps_Policy_Stress.jpg

vROps_Policy_TimeFrame.jpg

If you find this or any other answer useful please mark the answer as correct or helpful https://communities.vmware.com/people/greco827/blog
SWARob
Contributor

Thanks for the reply.

I only showed the "realtime" chart because it covers the timeframe in which the alert was triggered.  Looking back further, I can see that CPU utilization has never come close to 100%, which is why I'm puzzled that the alert is triggering.  Here is the setting we have for CPU stress.

cpu3.jpg

Here is another example of an alert that I'm getting when it states the stress level is 59.43 > 80.  Why is it triggering a critical alert when it's LESS than the maximum setting?

cpu4.jpg

Thanks for any assistance...I'm still learning!

Rob

greco827
Expert

Can you share a screenshot of the Analysis --> Capacity Remaining tab?  CPU Demand does not need to reach 100%, or even exceed 80% (which is where your stress threshold seems to be set), to trigger the alert.  A lot depends on 1) whether spikes and peaks are accounted for, 2) the time range that Stress looks at in the policy applied to the VM, and 3) how much buffer is set in the policy applied to the VM.

SWARob
Contributor

I guess I need to learn more about what is going on here.  Here is the Capacity Remaining tab.

cpu5.jpg

I don't understand why the stress level is so high and why it's recommending more than doubling the vCPUs when the average demand is less than 1 vCPU.

The ESX values are below as well.

cpu6.jpg

greco827
Expert (accepted solution)

This is a common misunderstanding, so don't worry.

In your screenshot, you can see that you have a spike around 70 days ago (give or take a few days).  Since your setting is for Any 60-minute period, over 90 days, and accounts for spikes and peaks, the alert is valid.  In 20 days or so, when that spike is more than 90 days old, the alerts should stop.

You have told vROps that if a VM has a spike or peak, as an average over 60 minutes, in the past 90 days, which breaches 70% of the usable capacity, to consider that VM to be under stress, thus triggering the alert.  The average is of no consequence.  Your VM could have had a hung OS for every hour but that one in the past 90 days, and you'd still have the alert.  It comes back because after you clear it, when vROps checks, this VM still meets the criteria you have set in the policies to be considered under stress.
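The "Any 60 minute period over 90 days" behaviour can be sketched in a few lines of Python. This is a simplified model, not the actual vROps algorithm: the function name, the 5-minute sample interval, and the synthetic demand history are assumptions, and the spikes-and-peaks option is not modelled.

```python
from typing import Sequence


def is_stressed(demand_pct: Sequence[float], window: int, threshold_pct: float) -> bool:
    """Return True if ANY run of `window` consecutive samples averages above the threshold.

    demand_pct: demand as a percentage of usable capacity, oldest sample first.
    window: samples per window (12 five-minute samples = 60 minutes).
    """
    if window <= 0 or len(demand_pct) < window:
        return False
    running = sum(demand_pct[:window])
    if running / window > threshold_pct:
        return True
    # Slide the window one sample at a time, updating the running sum.
    for i in range(window, len(demand_pct)):
        running += demand_pct[i] - demand_pct[i - window]
        if running / window > threshold_pct:
            return True
    return False


SAMPLES_PER_DAY = 24 * 12                    # assumed 5-minute samples
history = [5.0] * (90 * SAMPLES_PER_DAY)     # 90 days of a nearly idle VM
spike_start = 20 * SAMPLES_PER_DAY           # roughly 70 days before "now"
history[spike_start:spike_start + 12] = [95.0] * 12  # one busy hour

print("90-day average:", round(sum(history) / len(history), 2))
print("stressed:", is_stressed(history, window=12, threshold_pct=70.0))
```

In this made-up history the 90-day average stays close to 5%, yet the check still reports stress, because a single 60-minute window averaged above 70%. Once the busy hour falls outside the 90-day history, the same check returns False.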

You could change the policy settings, at the VM level specifically, to NOT account for spikes and peaks.  You could also set it to Any with a longer period (90 minutes, 120 minutes, whatever) and/or increase the threshold, which is set in your policy at 70%.  Play with the policy a bit.  It really helps to learn what affects what.  Make one change at a time so that you can see what impact it has and easily revert it.  Just make sure you are doing it at the Virtual Machine level, not the cluster or datacenter, etc.
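As a rough illustration of why a longer "Any" period makes the alert less sensitive, the Python sketch below compares the peak windowed average for 60, 90 and 120 minute windows. It is a simplified model rather than vROps code: the function name, the 5-minute sample interval, and the sample data are assumptions.

```python
def peak_window_average(samples, window):
    """Highest average over any `window` consecutive samples."""
    if window <= 0 or len(samples) < window:
        raise ValueError("need at least one full window of samples")
    running = sum(samples[:window])
    best = running
    for i in range(window, len(samples)):
        running += samples[i] - samples[i - window]
        best = max(best, running)
    return best / window


# Hypothetical data: 5-minute samples, 5% baseline, one 60-minute burst at 95%.
samples = [5.0] * 100 + [95.0] * 12 + [5.0] * 100
THRESHOLD = 70.0  # percent of usable capacity, as in the policy discussed above

for minutes in (60, 90, 120):
    peak = peak_window_average(samples, minutes // 5)
    print(f"{minutes:>3} min window: peak average {peak:.1f}% -> alert: {peak > THRESHOLD}")
```

For this data the 60-minute window peaks at 95% and breaches the 70% threshold, while the 90 and 120 minute windows peak at 65% and 50% and do not. Raising the threshold instead of the window changes only the comparison, not the peak itself.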

Does this help?

vROps_Stress.jpg

SWARob
Contributor

Thanks so much for explaining that.  It does in fact make more sense now.

This is truly a great tool but there is a ton of configuration information to learn and understand!

Thanks again.

Rob

greco827
Expert

Yes there is, and there is no one place to find it ... at least not in lay terms that make sense.  VMTN is the best place to get help for sure.
