VMware Cloud Community
gravesg
Enthusiast

Usage vs Demand

Apologies, this might have been asked to death.

I have instances of CPU-intensive VMs where the usage is LESS than the demand. In this case they are Exchange 2010 servers front-ended by users in cached mode. Because of this, it's hard to say whether there is a developing performance issue or not.

How should a dashboard like this be interpreted? How do I confirm this is not a ready time issue?

vcops_msgmbx1.gif

19 Replies
kitcolbert
VMware Employee

The way to think of it is that Demand is what a VM wants to use, while Usage is what it gets. So in this case, the VM wants to use more CPU than it's getting.

The question is why, and whether it's causing a performance issue for the guest OS/app. In terms of why, you should look at the host for this VM to see if it's overloaded on CPU. If you see the host is maxed out on CPU, then likely this VM is contending for CPU with other VMs and possibly suffering because of it. You can also look at CPU Usage | Contention or CPU Usage | Ready for the VM to see how much it's being affected.

It's hard to tell whether it's an actual issue for the guest OS/app unless you talk to the VM owner or can look inside the VM. Some applications are very sensitive to the additional latency induced by the virtualization layer, but other applications or app instances aren't so sensitive. So it's hard to say 100% that this is a problem. However, there is certainly risk of a problem happening.

But one thing I do notice is that CPU usage is higher than it normally is for this VM.  (You can see the blue line underneath the CPU bar graphs and the fact that Demand is larger than the high end of normal demand.)  So it's possible there's some workload spike happening inside the VM.  This may be contributing to the contention at the host level.

So in the end, this situation isn't ideal. Some apps and some customers are OK with this sort of difference between demand and usage, but you'll probably want to investigate a bit more and see how tolerant the application is to the extra CPU scheduling latency the VM is seeing.
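To put a rough number on that gap, here is a minimal sketch (an approximation only: treating the demand/usage difference as cycles the VM wanted but didn't get, and the 10% cut-off, are assumptions for the sketch, not vC Ops' internal formula):

# Illustrative only: size the gap between what a VM wants (demand)
# and what it actually gets scheduled (usage).

def demand_usage_gap(demand_mhz, usage_mhz):
    """Return the absolute and relative gap between demand and usage."""
    gap = max(demand_mhz - usage_mhz, 0.0)
    gap_pct = (gap / demand_mhz * 100.0) if demand_mhz else 0.0
    return {
        "gap_mhz": gap,
        "gap_pct": gap_pct,
        # Arbitrary cut-off for the sketch: a sustained gap above ~10%
        # of demand is worth a closer look at ready/contention on the host.
        "investigate": gap_pct > 10.0,
    }

if __name__ == "__main__":
    print(demand_usage_gap(demand_mhz=5200, usage_mhz=4300))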

gravesg
Enthusiast

The host is healthy as an ox, so I'm not sure how to rule out CPU contention. What is an acceptable value? The ready time values of the server itself seem high, but on a host with 9 VMs and no other VM as noisy as this one, where do I dig next?

vcops_msgesx11.gif

vcops_msgmbx1_ready.gif

elgreco81
Expert

Hi,

Check this link for a list of values

http://www.yellow-bricks.com/esxtop/

It's for esxtop. You could verify your vcOPS values with the ones in esxtop...they should be the same.
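One wrinkle when comparing: vCenter/vC Ops expose ready time as cpu.ready.summation in milliseconds per sample interval, while esxtop shows %RDY. A minimal conversion sketch (assuming the usual 20-second real-time interval; the per-vCPU division is only an approximation, since esxtop reports %RDY per world):

# Convert a vCenter cpu.ready.summation sample (milliseconds of ready time
# accumulated over one sample interval) into an esxtop-style percentage so it
# can be compared with the %RDY thresholds in the linked article.

def ready_ms_to_pct(ready_ms, interval_s=20, num_vcpus=1):
    """Ready % = ready_ms / (interval_s * 1000) * 100, roughly per vCPU."""
    pct = ready_ms / (interval_s * 1000.0) * 100.0
    return pct / num_vcpus

if __name__ == "__main__":
    # e.g. 2200 ms of ready time in a 20 s sample on a 4-vCPU VM
    print(f"{ready_ms_to_pct(2200, 20, 4):.1f}% ready per vCPU")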

Regards,

elgreco81

Please remember to mark as answered this question if you think it is and to reward the persons who helped you giving them the available points accordingly. IT blog in Spanish - http://chubascos.wordpress.com
vkaranam
Enthusiast

Hey gravesg/kitcolbert,

gravesg, you raised a nice discussion; I had some of the same questions and half of them have already been answered by kitcolbert. I appreciate both of you for this.

But I have a few more questions regarding usage and demand.

How is the normal demand (the blue line) calculated? Is it the average of the last 4, 7, or 30 days of CPU or memory demand?

What does Phys actually mean? What does it represent and measure?

Thanks

VK

gravesg
Enthusiast

The problem is that the vC Ops contention % doesn't appear to correlate to any one esxtop metric. It might be some cumulative % based on many metrics, and no guide book really goes into depth on how to interpret some of these vC Ops-specific numbers.

kitcolbert
VMware Employee

gravesg,

What does the CPU usage of your host look like over time?  I see in the screenshot it's at 44%, but does it go higher than that?

The fact that the host is seeing ready time means there is queueing going on. There are times when all physical CPU cores have VM vCPUs scheduled on them, and other VMs are unable to run right then because of this. So we know there is queueing/latency.

One thing to check is whether any one VM is experiencing most of the ready time or if it's equally distributed across some or all VMs. If it's just one VM, is there anything special about that VM (more vCPUs, an interesting workload)? If it's distributed across some/all VMs, then do these VMs have similar types of workloads running in them?

You can have situations where all VMs are active and idle at the same time. So all VMs are idle, then they all simultaneously become active. They all rush to use the CPU, but only some of them will be able to and others will have to wait, generating ready time. Finally those other VMs get to run, then all VMs go idle again. The interesting thing about cases like this is that the overall CPU usage can be lower than 90% or 100%, yet there can still be significant ready time (in your case, you're seeing 11% or so ready time, which I'd consider fairly high).

If this is a problem (which again you'd need to work with the VM/app owner to determine), then you can put anti-affinity rules on these VMs so that they stay on separate hosts.
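As a rough illustration of that first check (just a sketch; the 50% cut-off and the per-VM numbers are made up, not taken from the screenshots above):

# Illustrative helper: given average ready % per VM on a host, report
# whether ready time is concentrated in one VM or spread across many.
# The 50% "concentrated" cut-off is arbitrary.

def ready_distribution(ready_pct_by_vm):
    total = sum(ready_pct_by_vm.values())
    if total == 0:
        return "no ready time recorded"
    top_vm, top_val = max(ready_pct_by_vm.items(), key=lambda kv: kv[1])
    share = top_val / total * 100
    if share > 50:
        return f"concentrated: {top_vm} accounts for {share:.0f}% of the ready time"
    return f"distributed: the top VM ({top_vm}) accounts for only {share:.0f}%"

if __name__ == "__main__":
    print(ready_distribution({"mbx1": 11.0, "vm2": 1.5, "vm3": 0.8, "vm4": 1.1}))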

Hope that helps.

kitcolbert
VMware Employee

VK,

To answer your questions:

1. How is the blue line calculated?

This all comes down to vC Ops' analytics.  We have analytics that try to determine the "normal behavior" of every metric in the system.  This is done by observing the behavior of the metric over all the historical data vC Ops has for that metric and applying a variety of formulas to it.  (So it's not just the last 4, 7, or 30 days, but everything it's ever collected.)  It looks for various cycles in that data to make predictions about future behavior.  The different formulas are pitted against one another to see which produces the most accurate forecast.  The winner is then used to determine the official upper and lower bound for the metric over time.  This is what's shown in the blue line.
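As a purely conceptual illustration of "pitting formulas against one another" (this is not vC Ops' actual analytics; the two toy forecasters and the error band are invented for the sketch):

# Conceptual illustration only: fit a few candidate forecasters to a metric's
# history, keep whichever predicts a hold-out window best, and use its error
# as a crude "normal" band around the next forecast.
import statistics

def fit_constant(history):            # predict the historical mean
    mean = statistics.mean(history)
    return lambda n: [mean] * n

def fit_trend(history):               # naive linear trend from first to last point
    slope = (history[-1] - history[0]) / max(len(history) - 1, 1)
    return lambda n: [history[-1] + slope * (i + 1) for i in range(n)]

def pick_model(history, holdout):
    candidates = {"constant": fit_constant(history), "trend": fit_trend(history)}
    def err(model):
        pred = model(len(holdout))
        return sum(abs(p - a) for p, a in zip(pred, holdout)) / len(holdout)
    name, best = min(candidates.items(), key=lambda kv: err(kv[1]))
    band = err(best)                  # residual error as a rough +/- band
    forecast = best(1)[0]
    return name, forecast - band, forecast + band

if __name__ == "__main__":
    past, recent = [40, 42, 41, 43, 45, 44, 46, 48], [49, 50]
    print(pick_model(past, recent))   # -> winning model, lower bound, upper bound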

2. What does Phys mem actually mean?

We do something a little bit interesting in vC Ops in that we separate "physical" from "virtual" memory.  The reason is that memory shortfalls can occur at two levels: the "virtual" level, where the VM's configured memory size is too small, and the "physical" level, where there are too many VMs on the host and memory is overcommitted (or a limit is set on the VM), meaning some VMs will have memory reclaimed from them.  We break out these two levels so you can see very clearly which one is acting as the limiting factor.  If the Virt layer is maxed out, the VM is sized too small.  If the Phys layer is maxed out, there is contention for memory at the host level.  This lets you quickly determine where the problem is.
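A tiny hypothetical helper to make that decision concrete (the parameter names and the 90% cut-off are assumptions for the sketch, not vC Ops internals):

# Hypothetical: decide which memory layer is the likely bottleneck for a VM.

def memory_limiting_factor(virt_workload_pct, phys_workload_pct):
    if virt_workload_pct >= 90 and phys_workload_pct < 90:
        return "virtual layer: the VM's configured memory is likely too small"
    if phys_workload_pct >= 90:
        return "physical layer: host memory contention or a limit on the VM"
    return "neither layer is maxed out"

if __name__ == "__main__":
    print(memory_limiting_factor(virt_workload_pct=95, phys_workload_pct=60))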

kitcolbert
VMware Employee

gravesg,

To answer your question, many of the metrics in vC Ops don't have equivalents in vCenter or in esxtop.  The reason is that the metrics shown in those two are all "raw" metrics, in that they come directly from the host without any processing.  With vC Ops, we prefer to show "derived" metrics, or metrics that vC Ops computes based on the raw metrics plus other info.  The reason we do this is that each question you ask has a subtly different answer.  For instance, the questions "do I have a *performance* problem with memory for this VM" and "do I have a *capacity* problem with memory for this VM?" require different metrics in order to properly answer them.  For both, you want to use something like Memory Demand / Memory Effective Capacity, but exactly how vC Ops calculates each of those is different depending on whether the context is real-time performance vs long-term capacity planning.  First of all the granularity of the data is different: performance is all about real-time while capacity wants to look for trends over time and thus should take more data points into account than just the current one.  Moreover, capacity is looking at sizing and various factors play into that like the amount of non-pageable memory in a guest OS.  For performance however, we don't want to take that into account since that memory can be reclaimed by the hypervisor if needed.

In any case, these are all just examples.  The basic gist of the matter is that the raw data describe the raw state of the system.  They are not designed to answer operations management questions.  vC Ops creates derived metrics that bring to bear the right set of underlying raw data in order to present you with the more accurate number to answer the question you're asking of the system.  And thus there's no direct equivalent in esxtop.
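For example (purely illustrative, not the formulas vC Ops actually uses), the same raw demand samples can feed two different derived views:

# Illustrative only: two "derived" views of the same raw samples, along the
# lines described above. The nonpageable_mb input is an assumed extra factor
# for the capacity view, not a documented vC Ops metric.

def performance_view(demand_samples_mb, capacity_mb):
    """Real-time question: only the most recent sample matters."""
    return demand_samples_mb[-1] / capacity_mb * 100

def capacity_view(demand_samples_mb, capacity_mb, nonpageable_mb=0):
    """Planning question: trend over many samples, plus memory that
    can't be reclaimed from the guest."""
    avg = sum(demand_samples_mb) / len(demand_samples_mb)
    return (avg + nonpageable_mb) / capacity_mb * 100

if __name__ == "__main__":
    samples = [2048, 2300, 2100, 2600, 3100]        # MB demanded over time
    print(f"performance: {performance_view(samples, 4096):.0f}% of capacity")
    print(f"capacity:    {capacity_view(samples, 4096, nonpageable_mb=512):.0f}% of capacity")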

vkaranam
Enthusiast

I got what I needed. Is there any guide where I can find more detail on the dashboards? Thanks a lot, kitcolbert.

Thanks

VK

kitcolbert
VMware Employee

VK,

What info are you looking for on the dashboard?  And which dashboard, the customizable dashboard?

vkaranam
Enthusiast

Hello Kitcolbert,

I want to find info on how vCenter collects the metrics and how the data is analyzed (whether any algorithms are involved).

Also, I can see only global thresholds on both the standard and custom dashboards, and those cover only VM and infrastructure thresholds.

I figured out that by assigning attribute packages to a particular resource kind or application we can apply different thresholds (custom thresholds, say) to different applications. Is this a good procedure, or is there another way of assigning different custom threshold levels to a particular resource kind or application?

Thanks

VK

kitcolbert
VMware Employee

VK,

For technical details on vC Ops algorithms and such, please check out my talk from VMworld last year: http://www.youtube.com/watch?v=LTesggnOocE.

In terms of data collection, vC Ops just uses the standard VIM API PerformanceManager interface to get all stats data.
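For reference, a minimal pyVmomi sketch against that same PerformanceManager interface (assumes pyVmomi is installed; the vCenter host name, credentials, and VM name are placeholders, not values from this thread):

# Pull recent cpu.ready.summation samples for one VM via PerformanceManager.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

# Lab-style connection; certificate verification is disabled here.
si = SmartConnect(host="vcenter.example.com", user="readonly@vsphere.local",
                  pwd="********", sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()
perf = content.perfManager

# Resolve the counter id for cpu.ready.summation once.
ids = {f"{c.groupInfo.key}.{c.nameInfo.key}.{c.rollupType}": c.key
       for c in perf.perfCounter}

vm = content.searchIndex.FindByDnsName(dnsName="mbx1.example.com", vmSearch=True)
spec = vim.PerformanceManager.QuerySpec(
    entity=vm, intervalId=20, maxSample=15,
    metricId=[vim.PerformanceManager.MetricId(counterId=ids["cpu.ready.summation"],
                                              instance="")])
result = perf.QueryPerf(querySpec=[spec])
for ready_ms in result[0].value[0].value:
    print(f"{ready_ms} ms ready in a 20 s sample ({ready_ms / (20 * 1000) * 100:.1f}%)")

Disconnect(si)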

In terms of configuration, we wanted to make the configuration through the vC Ops "vSphere UI" very simple, hence the limited set of sliders for VMs and infrastructure objects.  The "Custom UI" has attribute packages, as you mention, which allow a much greater level of configuration.  Our goal is that the simplified sliders should be sufficient for most use cases, but there may be some exceptional cases that need the attribute packages.

vkaranam
Enthusiast

Thanks a lot KitColbert.

I will go through the videos and get back to you if I have any questions.

Thanks

VK

sorina
Contributor

Thanks for answering question 2.

I have more questions about memory workload:

For physical memory, what does reserved memory mean? There is reserved memory for all of my VMs!

And what is the difference between physical demand and virtual demand?

Thanks

gravesg
Enthusiast

"vC Ops creates derived metrics that bring to bear the right set of underlying raw data in order to present you with the more accurate number to answer the question you're asking of the system"

The question then is how you can tell when any of these derived values have crossed a best-practice threshold. I can't just rely on the big badges because, frankly, there have been instances of VMs with performance hiccups that were not reflected in the badge system.

kitcolbert
VMware Employee

gravesg,

While we are deriving new metrics, conceptually they are similar to the existing metrics you're used to using.  For instance, if you monitor CPU Usage %, then you can use similar thresholds for CPU Demand %.  Demand is more accurate in terms of whether a performance problem is occurring or not, and the thresholds you use can be the same.  But now you can have higher confidence that if a threshold is being surpassed, then there is a problem occurring.

sorina
Contributor

kitcolbert

What do you mean by "So it's possible there's some workload spike happening inside the VM. This may be contributing to the contention at the host level."?

thanks

admin
Immortal

Hi,

Regarding the question about why there is memory reserved for all VMs:

This is due to the memory overhead of a VM. That memory is always reserved.

kitcolbert
VMware Employee

Sorina,

"Host contention" means many different VMs trying to get access to the same set of physical resources (CPUs, memory, etc).  If VMs have more demand than there are physical resources to meet that demand, then it will cause "contention" or latency (VMs will have to wait to access the physical resources while other VMs use them.  This is obviously bad for performance.  Does that make sense?
