Solved: Re: Question about workload alerts

mhauck · ‎06-30-2014

vCOPs has been up and running in our environment for 10+ months. Up to this point it has been used more for critical alerting than analysis. We are trying to get a deeper understanding.

Currently running 5.8.1

I have read the following posts, but am still a little foggy on why we are seeing what we are seeing:

https://communities.vmware.com/message/2396460#2396460

https://communities.vmware.com/thread/449924

We are receiving network warning alerts that demand is reaching 100% of available resources.

New alert Type:Health, Sub-Type:Workload, State:Immediate, Kind:HostSystem

Info:Object`s demand is 100 percent of its available resource capacity. Network is the most constrained resource.
Alert Type : Health
Alert Sub-Type : Workload
Alert State : Immediate
Resource Kind : HostSystem

We are also receiving critical network alerts that demand is reaching 100% of available resources.

New alert Type:Health, Sub-Type:Workload, State:Critical, Kind:HostSystem,
Info:Object`s demand is 100 percent of its available resource capacity. Network is the most constrained resource.
Alert Type : Health
Alert Sub-Type : Workload
Alert State : Critical

If you look at the attached doc you can see the actual usage during the time of the warning alert is about 55% of capacity, and actual usage during the critical alert is about 77% of capacity.

If the alerts aren't an indication of actual NIC utilization I am not sure what the alerts are telling us.

Any help would be appreciated.

mark_j · ‎06-30-2014

Your configuration policies need to be adjusted. Those alerts are firing because your config policy is telling them to. If you adjust your thresholds for alerts on these badges, you won't have these types of false positive. You may also want to adjust your capacity planning policies.

If you double-click an alert in the vSphere UI, you'll see more details, including the threshold that was violated, e.g. 56 > 50.

This content is covered in the vC Ops getting started guide for the vSphere UI, available on the vC Ops pubs site.

If you find this or any other answer useful please mark the answer as correct or helpful.

mhauck · ‎06-30-2014

"If you adjust your thresholds for alerts on these badges, you won't have these types of false positive"

These are the settings for our infrastructure badges:

and here is the "reason" given for the alert:

Metric Name: Badge | Workload (%)
Values: 100.0 > 90.0

I get that we will be alerted for badge changes. What I don't get is why we are getting the 100% demand alert (critical and immediate) when the usage rate is only around 70%.

mark_j · ‎06-30-2014

That Network I/O bar only shows the last collected metric. You need to bring up a metric graph that charts the overall badge|workload for this host along with the individual workloads for cpu/mem/disk/network. Look back over the past 24 hours or 7 days and you'll see them spiking (vC Ops isn't incorrect in saying you're at 100 > 90). It's these spikes that can cause the alerts.

One option is to drop the alerting for infrastructure|workload in the config policy so it doesn't alert you when this happens (the host system workload alert). Or, instead of dropping the alert altogether, set the warning/immediate levels to not alert, set critical at 100, and see what happens (for infrastructure workload badge changes). However, if you let this run longer, vC Ops will most likely learn the maxObserved values of the network adapter and move that watermark well beyond your normal 80/90/95 percentage levels.
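To make the point about spikes concrete, here is a minimal Python sketch. It is not how vC Ops evaluates alerts internally; the threshold values, the sample data, and the names `THRESHOLDS`, `alert_state`, and `alerts_in_window` are all made up for illustration. It only shows that the latest sample can sit well under every threshold while earlier samples in the same window crossed one.

```python
from typing import List, Tuple

# Hypothetical alert levels, loosely based on the 80/90/95 levels mentioned
# above. Real values come from your own configuration policy.
THRESHOLDS: List[Tuple[str, float]] = [
    ("Critical", 95.0),
    ("Immediate", 90.0),
    ("Warning", 80.0),
]


def alert_state(workload_pct: float) -> str:
    """Return the most severe level whose threshold the sample exceeds."""
    for state, limit in THRESHOLDS:
        if workload_pct > limit:
            return state
    return "None"


def alerts_in_window(samples: List[float]) -> List[Tuple[int, float, str]]:
    """List (index, value, state) for every sample that crosses a threshold."""
    return [(i, v, alert_state(v)) for i, v in enumerate(samples)
            if alert_state(v) != "None"]


# Made-up network workload samples (%) for one host, oldest first.
samples = [42.0, 55.0, 100.0, 61.0, 48.0, 77.0, 100.0, 70.0]

latest = samples[-1]
fired = alerts_in_window(samples)

print(f"latest sample: {latest}% -> {alert_state(latest)}")
for index, value, state in fired:
    print(f"sample {index}: {value}% -> {state}")
```

The last collected value (70%) raises nothing on its own, yet two earlier spikes in the window would have, which is the same mismatch as the usage bar versus the alert history.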

For capacity planning and the like, you can ignore disk I/O and network I/O, but that is not the case with health badge calculations. If you need to exclude particular components from the overall workload calculation for a vSphere resource, the only way is to stop alerting on the overall 'workload' badge and alert on the child workload metrics instead, e.g. cpu|workload, mem|workload, etc.

Lots of options. Give them a try and see what works best for you.
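As a rough illustration of alerting on child metrics instead of the overall badge, the sketch below assumes the overall workload simply follows the most constrained child resource, which matches the "Network is the most constrained resource" wording in the alert but is an assumption here, not a documented formula. The function names, the 90% threshold, and the host values are hypothetical.

```python
from typing import Dict, Set, Tuple


def overall_workload(children: Dict[str, float]) -> Tuple[str, float]:
    """Assumed model: the overall badge follows the highest child workload."""
    name = max(children, key=children.get)
    return name, children[name]


def child_alerts(children: Dict[str, float], watched: Set[str],
                 threshold: float = 90.0) -> Dict[str, float]:
    """Return only the watched child workloads that exceed the threshold."""
    return {name: value for name, value in children.items()
            if name in watched and value > threshold}


# Made-up child workloads (%) for one host during a network spike.
host = {"cpu": 35.0, "mem": 62.0, "disk": 20.0, "net": 100.0}

print(overall_workload(host))               # overall badge driven by net
print(child_alerts(host, {"cpu", "mem"}))   # net and disk are not watched
```

Watching only cpu and mem means the network spike no longer produces an alert, at the cost of not hearing about genuine network saturation either.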


mhauck · ‎06-30-2014

Ah! I see what you are saying.

I took a look at a metric chart for one of the problem hosts over an extended period of time and do see the spikes.

As you said, the numbers in the alerts are accurate based on vCOPs workload percentage algorithms. I was getting hung up on correlating those numbers with the red usage rate bar, but there really is no correlation.

Thank you for offering that thorough answer. That helped a lot.
