VMware Cloud Community
nyjz1298
Contributor

DRS not doing a good job with 192 GB and 96 GB hosts in cluster

We have a cluster with 9 ESX hosts: 2 with 128 GB of RAM, 2 with 96 GB, and 5 with 192 GB.  DRS is set to the middle of the migration threshold slider, and we have 195 VMs in this cluster.  Since I cranked DRS down from 4 to 3, the cluster seems very unbalanced.  We have some hosts with 50%-55% RAM used and other hosts with 94% used.  Recently, one of the hosts got a bit busy and actually went to 94%, which caused the balloon driver to kick in on two VMs.  I can understand that the balloon driver would kick in at 94% per VMware's documentation, but there was no VMotion.  None at all for several hours.  I noticed a VM on the 94% host finally migrated off 4 hours later once it got just a little busier.  I'm pretty annoyed that DRS let one host get to the "danger level" of resource consumption without doing a simple migration to another host with PLENTY of room.  If I bump DRS to 4, there seems to be a constant stream of VMotions, which is something we don't want since there is very little benefit.

So I called VMware, and they said that since there are large differences between the RAM amounts, the DRS algorithm will not be as effective, causing issues like this.  He suggested breaking the different hosts out into separate clusters.  I told him that's a bad idea, since we only want to keep N+1 and that configuration would require basically three standby hosts.  He went on to recommend that I use DRS groups, but failed to explain how well it would work for us.  I basically just want the ballooning thing not to happen again and not to increase the rate of VMotions drastically like going to 4 would do.  DRS groups sound cool, and they sound like a cluster within a cluster (sort of), but I'm not seeing the capability.  It sounds like I'd have to create three groups (based on the memory sizes I listed above), somehow divide the VMs between those groups, and "manually" balance them when I set up the groups.  Hmm....  And what if new VMs join the cluster?  There's no quick and easy way to see what "group" they're in.  Sounds like a rabbit hole....
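For what it's worth, here's roughly what I think the support engineer was describing, pieced together from the docs.  It's only a sketch: the New-DrsClusterGroup / New-DrsVMHostRule cmdlets come from newer PowerCLI releases, the vCenter, cluster, and group names are placeholders, and picking which VMs go in which group is exactly the manual part I'm worried about.

# Sketch of the "DRS groups" idea: steer big-memory VMs toward the 192 GB hosts
# with a "should run" rule so HA/N+1 failover is still possible.
Connect-VIServer -Server "vcenter.example.com"            # placeholder vCenter

$cluster = Get-Cluster -Name "ProdCluster"                # placeholder cluster name

# Host group: the 192 GB hosts
$bigHosts  = Get-VMHost -Location $cluster | Where-Object { $_.MemoryTotalGB -gt 180 }
$hostGroup = New-DrsClusterGroup -Name "Hosts-192GB" -Cluster $cluster -VMHost $bigHosts

# VM group: the memory-heavy VMs (choosing these is the manual part)
$bigVMs  = Get-VM -Location $cluster | Where-Object { $_.MemoryGB -ge 16 }
$vmGroup = New-DrsClusterGroup -Name "VMs-LargeMem" -Cluster $cluster -VM $bigVMs

# "ShouldRunOn" (not "MustRunOn") leaves DRS/HA free to violate the rule in a pinch
New-DrsVMHostRule -Name "LargeMem-on-192GB" -Cluster $cluster `
    -VMGroup $vmGroup -VMHostGroup $hostGroup -Type ShouldRunOn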

Help.

Joe

9 Replies
Troy_Clavell
Immortal

Besides the recommendation from VMware Support, you could also try to get more aggressive on your migration threshold.  Beyond that, manually balancing the cluster may help as well.  Once you get the cluster manually balanced, you may want to look at putting the migration threshold back in the middle.

I don't know if DRS groups would help; to me, it seems as though they may confuse DRS even more and make things worse.

nyjz1298
Contributor

We had the setting at 4 and the cluster looked to be balanced, but there was a constant stream of VMotions, possibly 70 per day.  Now we've bumped it down to 3, and after a week or two things seem very unbalanced.  And now DRS is letting a host get to 94% and balloon...  This is exactly why we have DRS enabled: to stop things like this from happening.  I asked if there's a way to make sure hosts stay below 90%, but DRS doesn't have that configuration option.  I was thinking about setting up a PowerCLI script that puts the cluster to aggressive during the night for a small window, then puts it back to 3 to keep things a little extra balanced.
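Something along these lines is what I had in mind.  It's only a rough sketch: the vCenter and cluster names are placeholders, and from what I've read the API's vmotionRate is inverted relative to the UI slider (1 = most aggressive, 5 = most conservative), which I'd want to verify before actually scheduling it.

# Sketch: crank DRS to aggressive for an overnight window, then put it back to the middle.
# The vmotionRate mapping is assumed (reportedly inverted from the UI) - verify before use.
Connect-VIServer -Server "vcenter.example.com"                 # placeholder vCenter

$clusterView = Get-Cluster -Name "ProdCluster" | Get-View      # placeholder cluster name

function Set-DrsMigrationThreshold {
    param($ClusterView, [int]$Rate)
    $spec = New-Object VMware.Vim.ClusterConfigSpecEx
    $spec.DrsConfig = New-Object VMware.Vim.ClusterDrsConfigInfo
    $spec.DrsConfig.VmotionRate = $Rate
    # second argument $true = merge this change into the existing cluster config
    $ClusterView.ReconfigureComputeResource_Task($spec, $true) | Out-Null
}

Set-DrsMigrationThreshold -ClusterView $clusterView -Rate 1    # assumed "aggressive"
Start-Sleep -Seconds (2 * 60 * 60)                             # 2-hour overnight window
Set-DrsMigrationThreshold -ClusterView $clusterView -Rate 3    # back to the middle setting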

Thanks

RParker
Immortal

Lot of misinformation here.

DRS works fine; let's get that straight right now.

DRS prioritizes based upon CPU, THEN memory.  CPU is more important.  Just because YOU feel like the hosts are not balanced doesn't mean that ESX / vCenter agree with you, since that's how VMware programmed it.

94% of 192 GB of RAM is about 180 GB.  ESX only needs about 1 GB for overhead; the more RAM you have, the more there is for VMs and the smaller the percentage of ESX overhead.

DRS will move VMs based upon need, not simply because the numbers are skewed.

In the FIRST place, you have a mix of machines with different amounts of RAM.  That works, but as you can see, there are differences.  DRS will ALSO take into account moving a VM from a 192 GB machine to a 128 GB or smaller machine, how much load that adds, and the impact on THAT machine as well.

You lack the technical knowledge to casually announce DRS isn't doing its job; it works fine.  If your ESX hosts are NOT having performance issues and there is very little or NO swap, you really shouldn't judge a book by its cover; you have NO idea what is going on behind the scenes.  Also, COMMITTED memory, ACTIVE memory, and USED memory are not the same thing.

There is a lot more to DRS than just looking and seeing "oh, there is more memory over there, let me use it!"  That's not how it works at all.  CPU plays a more important part.  I BET that if you look at your hosts, the machines with LESS RAM probably have higher CPU usage, and that's more critical.

RAM is really a much smaller piece of the decision than just FREE or available RAM on a host.

nyjz1298 wrote:

So I called VMware, and they said that since there are large differences between the RAM amounts, the DRS algorithm will not be as effective, causing issues like this.  He suggested breaking the different hosts out into separate clusters.  I told him that's a bad idea, since we only want to keep N+1 and that configuration would require basically three standby hosts.

So let me get this straight: you didn't design VMware ESX, but you contend DRS is doing a lousy job of management.  VMware RECOMMENDS, mind you, and they SUPPORT this product... and YOU don't want to take their advice.  So riddle me this: how can DRS do a more effective job if you WON'T configure the hosts the way they are SUPPOSED to work?

Do you tell your mechanic how to tune your engine and change the oil in your car as well?  If you want to do it yourself, that's fine, but don't complain when it doesn't work.  It's NOT working because you don't WANT to follow the recommendation; that's NOT the product's fault.

nyjz1298
Contributor

All of the CPUs are between 15-25% used.  You probably know that when an ESX host has only 6% free memory, it will trigger the balloon driver in one or more VMs to free up some space, which can cause performance issues.  I wish DRS were able to detect that VMs are ballooning due to host memory pressure.  Do you think DRS is doing the right thing in this case?  Is there something more I can do to improve my situation?

Here is my hardware to be specific to this cluster:

esx01 - bl460c G6 - 8 cores - 16 logical processors - 192 GB RAM
esx02 - bl460c G6 - 8 cores - 16 logical processors - 192 GB RAM
esx03 - bl460c G6 - 8 cores - 16 logical processors - 192 GB RAM
esx04 - bl460c G6 - 8 cores - 16 logical processors - 96 GB RAM
esx05 - bl460c G6 - 8 cores - 16 logical processors - 96 GB RAM
esx06 - bl460c G6 - 8 cores - 16 logical processors - 192 GB RAM
esx07 - bl460c G6 - 8 cores - 16 logical processors - 192 GB RAM
esx08 - BL480c G5 - 16 cores - 16 logical processors - 128 GB RAM
esx09 - BL480c G5 - 16 cores - 16 logical processors - 128 GB RAM

esx06 was the server that got extra busy, causing some VMs to balloon instead of initiating a VMotion.  It seems the DRS algorithm doesn't factor in the 6% free-memory mark where ballooning starts.
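For what it's worth, here's the kind of quick check I've been running to see which hosts are under memory pressure and actively ballooning.  It's just a sketch: the vCenter and cluster names are placeholders, and mem.vmmemctl.average is the balloon-driver counter (reported in KB).

# Sketch: show memory usage and active ballooning per host in the cluster
Connect-VIServer -Server "vcenter.example.com"             # placeholder vCenter

Get-Cluster -Name "ProdCluster" | Get-VMHost | ForEach-Object {
    # latest real-time sample of the balloon-driver counter for this host
    $balloonKB = (Get-Stat -Entity $_ -Stat "mem.vmmemctl.average" -Realtime -MaxSamples 1 |
                  Select-Object -First 1).Value
    [PSCustomObject]@{
        Host        = $_.Name
        MemUsedPct  = [math]::Round(100 * $_.MemoryUsageGB / $_.MemoryTotalGB, 1)
        BalloonedMB = [math]::Round($balloonKB / 1024, 0)
    }
} | Sort-Object MemUsedPct -Descending | Format-Table -AutoSize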

Thanks

nyjz1298
Contributor

Anyone?  I was hoping I could use somewhat dissimilar hardware (CPU count / memory amount) in one cluster without an issue, but it looks like I might be wrong as far as DRS is concerned.  Has anyone else encountered the ballooning issue in a case similar to mine?

Thanks!

Joe

mchangscmp
Contributor

I'm seeing a similar issue on my 3-node cluster with identically configured hosts.  Node 1 sits at 10% memory utilization with the other two at 85%.  I ran a test by putting even more load on my cluster, and DRS seems to think activating ballooning on one of my nodes is better than putting even one more VM on Node 1.  No idea what the reasoning is.  If this is the way it's supposed to work, you'd think there'd be a whitepaper on the topic.  I've not really come across anything that explains the behavior.

mchangscmp
Contributor

@RParker,

Your response is a bit harsh.  I don't think his intention is to declare to the world that DRS is broken.  It's just that the behavior in his scenario is counter-intuitive and the cause is not obvious.  I'm also having a similar issue in my cluster and have similar questions.

You seem to have in-depth knowledge of how DRS makes its determinations.  For example, you say that it weighs CPU usage more heavily than memory.  This is the first time I've heard this, and it is useful to know in order to understand how DRS is making its decisions.  Do you have any links or references you could share with us?  I've been to the past couple of VMworlds and done a lot of looking online, but haven't really found much on this topic.  Most resources give a general overview of how DRS works, best practices, specific configurations, slot calculations, etc.  Nothing I've read addresses scenarios where DRS may "seem" to the admin to be unbalanced on memory, but it's really for the best because...

nyjz1298
Contributor

Thanks mchangscmp.  I agree with you; RParker should apologize for his attitude in that response.  Upon second reading, it's pretty repulsive.  I've never read something with such arrogance that wasn't intended as a joke.

Thanks

mcowger
Immortal

That's what the 'report abuse' button on every post is there for.

--Matt VCDX #52 blog.cowger.us