Resource group idea, looking for feedback

JCL_MDOT · ‎02-09-2019

Hello,

I am looking to possibly use resource groups (resource pools, in vSphere terms) to gain efficiency instead of using separate clusters. So: have one or a few large clusters, and then separate VMs, such as production and development, within them. Yes, I can already feel people rolling their eyes as they read this. After spending a fair bit of time reading everything I can, it seems they are either done incorrectly (e.g. resource group pie paradox, or used as folders) or just become too complex or too much overhead to be worth it. But I have an idea, and I want to see if anyone out there might have some insight on it. I may even open a case with VMware because I would really like to find the answer.

Consider the scenario:

- Around 1,000 VMs overall on a bunch of powerful blade servers

- I want to mix DEV and PROD workloads, mainly from a CPU and RAM perspective

- They will have separate storage and networking

- Not for security really, things like DMZ or PCI are hosted on separate physical hardware (I know, NSX, we don't have that)

Now, I think some of the most important pieces of information here are:

- "DEV" is effectively prod to the server team, because if it goes down the developers can't do any work

- DEV just means that it is not user or public facing (DEV to the server team would be different)

- The SLA would mostly be 9-5 business hours, with occasional large overnight or over weekend jobs

- There are many levels (QA, Test, etc.) and a massive SQL presence

- DEV machines almost outnumber PROD and sometimes consume more resources; there are over 100 developers, DBAs, QA, etc.

- Our old DEV environment constantly ran RED while resources sat unused in PROD, even after I turned off HA

Here is my idea, the novelty here being that it is deceptively simple, which is the goal

- Put all of the PROD machines in the root of the cluster (default resource group) or in a single "PROD" resource group

- Put all the DEV machines in a child resource group of PROD

- Thus, the entirety of the DEV machines exists in a single object that is a sibling to all the PROD machines (the DEV machines are all child objects)

- There could be more levels, such as TEST machines as a child to DEV, but you get the idea

- No limits or reservations, and if I am correct, it almost* won't matter what the shares are set to

- No overhead, never need to change any calculations, just sort machines into the right group

So, rather than using two resource groups (PROD and DEV) as siblings and constantly recalculating their shares as things change, even with a script, this layout means PROD essentially always trumps DEV whenever there is contention. And because the DEV resource group is expandable, it can use any resources that PROD is not using, even up to the majority of all resources.

Now, most of the time we likely won't even have contention if I am allowed to mix PROD and DEV, so much of this may be irrelevant. A lot of what this is doing is to assuage the fears of my coworkers, whose eyes bulge at this idea because it appears radical; they have always kept these things separate (and don't understand resource groups yet). Management is concerned as well, but none of them are technical enough to really understand either way, so it is up to me to research this enough that I can assure them and not have to eat crow later. I am prepared to test it first of course, but that would have to be done with at least a subset of real workloads, because it could be difficult and time consuming to create an adequate simulation.

So, realistically, the only time I can foresee this being used is when/if there was a substantial outage, or if we were purposely taking down some portion of resources to do upgrades. The remaining hosts would be running much hotter. We have very few of those types of outages historically, and most maintenance work is done off-hours. It is also possible for a really noisy machine that is running hot, either on purpose or because something is wrong, to cause a problem, so that is a concern. We do have vROps and we monitor for that with humans 24/7, so we would start to get alerts in that situation, likely with enough time to address it before it gets serious. I would also have DRS rules that would not even bring DEV machines up if there were not enough resources.

I am trying to find someone or some information that would support or refute this idea. I am trying to think outside the box, along the lines of the Matrix movie: "There is no spoon". But it seems resource groups are unpopular enough that it is hard to find.

*almost: This is one of the things I am not sure about, I am thinking you could leave the shares at defaults because it is the hierarchy that is important here. Obviously I could imagine share values that certainly would be a problem. So, if someone has an idea of what these should be set to (if this is a good idea at all), that would be great.
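One way to sanity-check the share question is a quick worst-case calculation. The sketch below is a simplified proportional-share model (every VM demanding 100%, full contention), not how the ESXi scheduler is actually implemented. The VM counts are made up, and the default values used (1000 CPU shares per vCPU for a VM at Normal, 4000 CPU shares for a resource pool at Normal) are assumptions that should be checked against the vSphere version in use. The function name `worst_case_split` is hypothetical.

```python
from fractions import Fraction

# Assumed vSphere defaults at the "Normal" level (verify for your version).
VM_NORMAL_CPU_SHARES_PER_VCPU = 1000
POOL_NORMAL_CPU_SHARES = 4000


def worst_case_split(prod_vms, vcpus_per_prod_vm, dev_pool_shares):
    """Return (fraction per PROD VM, fraction for the whole DEV pool).

    Models PROD VMs sitting directly in the parent, with the DEV pool
    as a single sibling object next to them, all demanding 100%.
    """
    per_prod_vm = VM_NORMAL_CPU_SHARES_PER_VCPU * vcpus_per_prod_vm
    total = prod_vms * per_prod_vm + dev_pool_shares
    return Fraction(per_prod_vm, total), Fraction(dev_pool_shares, total)


# Hypothetical numbers: 500 PROD VMs with 2 vCPUs each, DEV pool left at Normal.
per_vm, dev_pool = worst_case_split(500, 2, POOL_NORMAL_CPU_SHARES)
print(f"each PROD VM: {float(per_vm):.4%}, entire DEV pool: {float(dev_pool):.4%}")
```

Under these assumptions the whole DEV pool carries the weight of two PROD VMs, so DEV is squeezed hard under full contention. Raising the pool's share value moves that line, which is why the setting is not entirely irrelevant.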

So:

- Is this idea just crazy enough to work?

- Am I missing something key that will be a problem?

- Does it simply not work the way I think it does?

- Has anyone done anything remotely similar?

Some feedback or pointing me to some better info would be greatly appreciated.

Thank you

-JCL

sk84 · ‎02-09-2019

Basically, there is nothing wrong with resource pools. They are often used in the service provider space. vCloud Director, for example, uses RPs extensively, and vRealize Automation sometimes uses them as well. They can simplify resource management by eliminating VM-level resource micromanagement. On the other hand, they can also make resource management within a cluster more complex, depending on how deeply you nest them, and there are some pitfalls. Furthermore, you only benefit from RPs in some cases.

In my opinion, resource pools are useful when you have an overcommitted cluster with many different workloads with different resource requirements and want to separate them. This doesn't apply to many enterprise setups. Either there are not many VMs running, so you can configure the resource management (shares, limits, reservations) for single VMs at VM level, or all VMs are equally important, in which case the default values are sufficient. Or the cluster resources are not overcommitted, and then you gain nothing from resource management because each VM always gets the resources it wants. I think for these reasons, there is not much positive information about resource pools. And sometimes they make things more complex. But I'll come to that below.

In your case, using resource pools can be a good idea if you talk about resource conflicts. But I would make 2 resource pools on the same level. One for DEV and one for PROD. With shares, you can control that PROD VMs get more resources when the cluster is busy. But shares are a double-edged sword because they are not really a fixed number. The share value depends on the number of VMs in this RP and is also compared with other RPs on the same level. This is usually overlooked.

For example: if you have set the shares for DEV to 1000 and 10 machines are running there, each VM gets 100 shares. And if you have a PROD RP with 2000 shares and 40 VMs, each of these VMs only gets half the shares of a DEV VM (50 shares). In reality, you have to adjust the share value every time you power a virtual machine on or off.

This makes the management of resource pools sometimes complex and is one of the reasons why RP are so unpopular.
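The arithmetic in the example above can be sketched in a few lines. This is only an illustration: it assumes every VM inside a pool has equal shares, so the pool's value is spread evenly, and the function names `per_vm_weight` and `pool_shares_for_ratio` are made up for this sketch.

```python
def per_vm_weight(pool_shares, running_vms):
    """Pool shares spread evenly over the running VMs in the pool."""
    return pool_shares / running_vms


def pool_shares_for_ratio(base_pool_shares, base_vms, other_vms, ratio):
    """Shares the other pool needs so that each of its VMs weighs
    `ratio` times as much as a VM in the base pool."""
    return round(ratio * per_vm_weight(base_pool_shares, base_vms) * other_vms)


dev = per_vm_weight(1000, 10)    # DEV: 1000 shares over 10 VMs
prod = per_vm_weight(2000, 40)   # PROD: 2000 shares over 40 VMs
print(dev, prod)                 # 100.0 50.0

# Shares the PROD pool would need so each PROD VM weighs twice a DEV VM:
print(pool_shares_for_ratio(1000, 10, 40, 2))  # 8000
```

With these counts the PROD pool would need 8000 shares, not 2000, and the value has to be recomputed whenever the number of running VMs in either pool changes.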

And in case of resource overcommitment and contention, I'd also consider using reservations. Of course it's uncomfortable if the developers can't work anymore, but usually it's worse if external customers are affected. The only important thing with reservations is that you don't set them too high.

The golden rule is:

Set the reservation just high enough that the VMs can still handle their task, not to the value they normally consume. The rest of their resource demand is then covered by the higher shares setting. If you set the reservations too high, you lose too much flexibility.

To sum it up:

- Resource pools simplify resource management by eliminating resource micro management at VM level.

- Resource management is only necessary if the cluster is overcommitted and resource conflicts occur.

- Reservations can be useful, but should not be set too high.

- If you have multiple RPs on the same level, shares on an RP are not fixed values but have to be adapted to the number of running VMs to be effective.

- Limits can help to cap the resource consumption.

--- Regards, Sebastian VCP6.5-DCV // VCP7-CMA // vSAN 2017 Specialist

JCL_MDOT · ‎02-11-2019

Hello sk84, thank you for the detailed reply.

I did a little research, and I finally found an article that goes into a little more detail on what I am talking about:

Mixing Resource Pools and Virtual Machines on the same hierarchical level

http://frankdenneman.nl/2012/05/09/mixing-resource-pools-and-virtual-machines-on-the-same-hierarchic...

Actually, this guy has a great blog and I found numerous other articles about resource pools and shares that were really helpful, but this one seems most germane to what I am talking about here.

Interestingly, this article reads more like a warning not to mix resource pools and VMs at the same level. But, then go all the way to the end of the article, and you see this statement:

"Note: Shares are not simply a weighting system for resources. All scenarios to demonstrate way shares work are based on a worst-case scenario situation: every virtual machine claims 100% of their resources, the system is overcommitted and contention occurs. In real life, this situation (hopefully) does not occur very often. During normal operations, not every virtual machine is active and not every active virtual machine is 100% utilized. Activity and amount of contention are two elements determining resource entitlement of active virtual machines. For ease of presentation, we tried to avoid as many variable elements as possible and used a worst-case scenario situation in each example."

Which is my whole point. I don't think anyone runs a healthy environment where resource pools become a day-to-day factor. If you did, that would mean that your environment was running north of 80-90% utilization all the time, and it would already be past the point where most people would say you needed to expand capacity. That being the case, they may have other uses, but for the most part I see resource pools more as a safety net when something happens to greatly reduce capacity to the point where contention is reached.

Consider some scenarios and what we might expect to happen with and without nested resource groups:

- Non-prod / DEV machines start using too many resources or there is a run-away machine

  - Without resource pools, if these were in the same cluster with PROD machines, they would be siblings and compete, which is why I think many people separate them, fearing that the DEV machines would crowd out the PROD machines.

  - With all the non-prod in a child resource pool, the DEV machines can expand to use resources that PROD is not using, but not more

- You have a bunch of host failures, or you take some down for maintenance

  - Same as the previous example, the DEV machines would only get anything after the PROD machines were served first
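To get a feel for when the host-failure scenario actually reaches contention, a back-of-the-envelope calculation helps. The sketch below assumes identical hosts and a flat demand figure. The host count and the 60% demand are hypothetical, and the function name `utilization_after_failures` is made up for illustration.

```python
def utilization_after_failures(hosts, failed, demand_fraction):
    """Cluster utilization once `failed` of `hosts` equal-sized hosts are gone.

    demand_fraction is current demand as a fraction of full-cluster capacity.
    A result above 1.0 means demand exceeds what is left, i.e. contention.
    """
    remaining = hosts - failed
    if remaining <= 0:
        raise ValueError("no hosts left")
    return demand_fraction * hosts / remaining


# Hypothetical: 16 identical blades, demand at 60% of total capacity.
for failed in range(0, 9, 2):
    u = utilization_after_failures(16, failed, 0.60)
    print(f"{failed} hosts down -> {u:.0%} of remaining capacity")
```

In this made-up case it takes the loss of roughly half the hosts before demand exceeds capacity, which is the only point at which the share hierarchy would start to decide who gets what.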

But, here is where I am not so sure, consider this scenario:

- You have everything configured as I suggest, all PROD machines at one level, and all DEV machines as a child to that

- You have the child set to be expandable, so the DEV machines can consume any resources that PROD is not using

- Let's say someone runs a bunch of jobs in DEV, and actually uses up all the spare capacity (say they run something big overnight)

- So, for the most part, things are good, the PROD has priority so DEV can't starve them out

- Now let's say PROD suddenly needs a bunch more resources for something, what would happen? How quickly could DEV give up the resources back to PROD?

- This is conceivable, an overnight job that is still running in the morning when everyone comes in, and PROD starts to wake up

- With CPU resources, I can see this being OK, because the PROD machines could take capacity back dynamically, the DEV machines would just be slowed down

- But with memory, if all that memory is in use, would ESXi be able to take the memory back from those machines quickly?

I think there is a possibility that in this situation, PROD wants memory back and can't get it immediately. So I would want to better understand how VMware reclaims expandable resources. If it starts to use a lot of ballooning and swapping to compensate in this scenario, I could imagine reaching a "meltdown" condition where PROD wants a large amount of resources back from DEV, and ESXi keeps swapping and ballooning until it can get them back.

This may be where reservations or limits come in, but I am specifically trying to do this without having to use any reservations or limits, because as soon as you do, then you have to potentially have extensive knowledge of all the VMs running in the environment so that you can calculate the reservations, and this would need to be maintained as things change.

Basically I am trying to see if this idea can work, and whether there are any unforeseen / unintended consequences. And for the most part, not in the conceptual design of how these features work, but in real-world scenarios as to how things actually play out. There are a lot of things we use in the IT world that work exactly as intended, and there are others that sound great in theory but in reality are buggy or unstable.

-JCL