I am looking to possibly use resource groups (resource pools, in VMware's terms) to gain efficiency instead of using separate clusters. The idea is to have one or a few large clusters, and then separate VMs such as production and development within them. Yes, I can already feel people rolling their eyes as they read this. After spending a fair bit of time reading everything I can find, it seems resource groups are either done incorrectly (e.g. the resource group pie paradox, or used as folders) or just become too complex or carry too much overhead to be worth it. But I have an idea, and I want to see if anyone out there might have some insight on it. I may even open a case with VMware because I would really like to find the answer.
Consider the scenario:
- Around 1,000 VMs overall on a bunch of powerful blade servers
- I want to mix DEV and PROD workloads, mainly from a CPU and RAM perspective
- They will have separate storage and networking
- Not for security really, things like DMZ or PCI are hosted on separate physical hardware (I know, NSX, we don't have that)
Now, I think some of the most important pieces of information here are:
- "DEV" is effectively prod to the server team, because if it goes down the developers can't do any work
- DEV just means that it is not user or public facing (DEV to the server team would be different)
- The SLA would mostly be 9-5 business hours, with occasional large overnight or over weekend jobs
- There are many levels (QA, Test, etc.), with some massive SQL presence
- DEV machines almost outnumber PROD and sometimes consume more resources; there are over 100 developers, DBAs, QA, etc.
- Our old DEV environment constantly ran RED while resources sat unused in PROD, even after I turned off HA
Here is my idea, the novelty being that it is deceptively simple, which is the goal:
- Put all of the PROD machines in the root of the cluster (default resource group) or in a single "PROD" resource group
- Put all the DEV machines in a child resource group to prod
- Thus, the entirety of the DEV machines exists in a single object that is a sibling to all the PROD machines (the DEV machines themselves are all child objects of it)
- There could be more levels, such as TEST machines as a child to DEV, but you get the idea
- No limits or reservations, and if I am correct, it almost* won't matter what the shares are set to
- No overhead, never need to change any calculations, just sort machines into the right group
So, rather than trying to use two resource groups (PROD and DEV) as siblings and trying to constantly re-calculate them as things change, even with a script, PROD essentially always trumps DEV whenever there is contention. But, the DEV resource group is expandable, so it could use any resources that PROD is not using, even up to using the majority of all resources.
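To make the expected behavior concrete, here is a minimal toy model of share-based allocation among siblings, capped by demand. It is only a sketch under stated assumptions, not how the ESXi scheduler actually computes entitlements: capacity is a single number, there are no reservations or limits, and the VM counts, demands, and share values are hypothetical. It covers both cases described above: DEV picking up whatever PROD leaves idle, and DEV being squeezed when PROD is busy.

```python
def allocate(capacity, demands, shares):
    """Share-proportional allocation among siblings, capped by demand.

    Siblings that need less than their proportional slice get exactly their
    demand; the leftover is re-split among the rest, again by shares.
    """
    alloc = {name: 0.0 for name in demands}
    active = {name for name, d in demands.items() if d > 0}
    remaining = float(capacity)
    while active and remaining > 1e-9:
        total_shares = sum(shares[n] for n in active)
        # Offer each still-hungry sibling its slice of what is left.
        offers = {n: remaining * shares[n] / total_shares for n in active}
        satisfied = {n for n in active if demands[n] <= offers[n]}
        if not satisfied:
            # Full contention: everyone just gets their proportional slice.
            alloc.update(offers)
            break
        for n in satisfied:
            alloc[n] = demands[n]
            remaining -= demands[n]
        active -= satisfied
    return alloc


# Hypothetical layout: 4 PROD VMs and one DEV pool as siblings in the root.
# Share values are assumptions (2000 per PROD VM, 4000 for the DEV pool).
shares = {"prod-1": 2000, "prod-2": 2000, "prod-3": 2000, "prod-4": 2000,
          "DEV-pool": 4000}
CAPACITY = 100  # arbitrary units, e.g. GHz

# PROD is quiet: DEV can expand into everything PROD is not using.
quiet = allocate(CAPACITY, {**{f"prod-{i}": 10 for i in range(1, 5)},
                            "DEV-pool": 80}, shares)

# PROD is busy: everything is contended and shares decide the split.
busy = allocate(CAPACITY, {**{f"prod-{i}": 30 for i in range(1, 5)},
                           "DEV-pool": 80}, shares)

print("quiet PROD:", quiet)
print("busy PROD: ", busy)
```

In this toy model, DEV is squeezed in proportion to the share ratio rather than cut off entirely, so how strongly PROD wins under contention depends on how many PROD siblings there are and what the DEV group's shares are set to.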
Now, most of the time we likely won't even have contention if I am allowed to mix PROD and DEV, so much of this may be irrelevant. A lot of what this is doing is to assuage the fears of my coworkers, whose eyes bulge at this idea because it appears radical; they have always kept these things separate (and don't understand resource groups yet). Management is concerned as well, but none of them are technical enough to really understand either way, so it is up to me to research this enough that I can assure them and not have to eat crow later. I am prepared to test it first of course, but that would have to be done with at least a subset of real workloads, because it could be difficult and time-consuming to create an adequate simulation.
So, realistically, the only time I can foresee this coming into play is if there were a substantial outage, or if we were purposely taking down some portion of resources to do upgrades. The remaining hosts would be running much hotter. We have very few of those types of outages historically, and most maintenance work is done off-hours. It is also possible for a really noisy machine that is running hot, either on purpose or because something is wrong, to cause a problem, so that is a concern. We do have vROps and we monitor for that with humans 24/7, so we would start to get alerts in that situation, likely with enough time to address it before it gets serious. I would also have DRS rules that would not even bring DEV machines up if there were not enough resources.
I am trying to find someone or some information that would support or refute this idea. I am trying to think outside the box, along the lines of the Matrix movie: "There is no spoon". But it seems resource groups are unpopular enough that it is hard to find anything.
*almost: This is one of the things I am not sure about. I am thinking you could leave the shares at defaults because it is the hierarchy that is important here. Obviously, I can imagine share values that certainly would be a problem. So, if someone has an idea of what these should be set to (if this is a good idea at all), that would be great.
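As a rough way to reason about the share question, the sketch below computes what fraction of fully contended capacity a single DEV group would be entitled to as a sibling of many PROD VMs, for a few share settings. The numbers are assumptions for illustration only: 500 PROD VMs, 2000 CPU shares per PROD VM (what a 2-vCPU VM at the Normal setting is usually given), and 4000 as the Normal CPU shares for the group. Check the actual values in your environment. It also assumes every sibling wants more than it can get, and ignores reservations and limits.

```python
def dev_pool_fraction(prod_vm_count, shares_per_prod_vm, dev_pool_shares):
    """Fraction of contended capacity the DEV group is entitled to when it
    competes as one sibling against prod_vm_count individual PROD VMs."""
    total = prod_vm_count * shares_per_prod_vm + dev_pool_shares
    return dev_pool_shares / total


PROD_VM_COUNT = 500        # hypothetical
SHARES_PER_PROD_VM = 2000  # assumed: 2 vCPUs at Normal

# Sweep a few DEV group share values, from low to deliberately extreme.
for dev_pool_shares in (2000, 4000, 8000, 1_000_000):
    frac = dev_pool_fraction(PROD_VM_COUNT, SHARES_PER_PROD_VM, dev_pool_shares)
    print(f"DEV shares {dev_pool_shares:>9}: {frac:.2%} of contended capacity")
```

Under these assumptions the default-sized DEV group ends up with well under 1% of contended capacity for all of its VMs combined, while a very large custom value (1,000,000 here) would put it on equal footing with all of PROD. That is the kind of share value that "certainly would be a problem" for the PROD-first goal.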
- Is this idea just crazy enough to work?
- Am I missing something key that will be a problem?
- Does it simply not work the way I think it does?
- Has anyone done anything remotely similar?
Some feedback or pointing me to some better info would be greatly appreciated.