VMware

7 Replies Last post: May 15, 2007 1:19 AM by eliot  

DRS and patching ESX hosts posted: Mar 7, 2007 10:11 AM

DigitalVoodoo (Hot Shot, 110 posts since Mar 7, 2006)
With the advent of the new ESX patches, I placed one of my 4 ESX nodes in a DRS/HA cluster (set to apply recommendations with 3 or more stars) into maintenance mode, applied the patches, rebooted the host, and exited maintenance mode. After 10 minutes or so DRS moved a couple of VMs to the recently rebooted node, so I know that DRS was alive and well. I then placed a different node in the cluster into maintenance mode, and watched DRS only send VMs to the 2 nodes that had not been touched yet. With 100 VMs now effectively running on 2 hosts (the first patched node still only running 2 VMs), the hosts' resource utilization skyrocketed.

After all the VMs had been cleared off the host now in maintenance mode and running in this state for around 10 minutes, DRS began moving VMs off to the first server. My question is this: why didn't DRS send most/all of the VMs to the first server in the first place, as it was extremely underutilized at the time in comparison to the other 2 servers? Does DRS not "trust" a host that has been recently rebooted or exited from maintenance mode until a certain amount of time has passed? Do I have to patch one server per day to let DRS get comfortable with it and then move VMs appropriately? This makes the prospect of patching DRS/HA clusters a much longer proposition than I think it should be.

Since this is now the second time that I've experienced this, I wanted to get everyone's thoughts - am I the only one seeing this, why is it doing this, and how are other people with similar or larger clusters dealing with this?

Thanks!

Re: DRS and patching ESX hosts

1. Mar 7, 2007 10:35 AM in response to: DigitalVoodoo
hicksj (Master, 1,250 posts since May 6, 2005)
Does DRS not "trust" a host that has been recently rebooted or
exited from maintenance mode until a certain amount of time has passed?

Interesting theory. There have been discussions about which algorithm DRS uses, and I don't remember this point being brought up before. It's certainly plausible. All ideas submitted so far have been speculation AFAIK... VMware will likely keep this proprietary, but I wish some details would be published (for example, what do "significant," "good," and "moderate" improvements to the cluster performance mean? See http://www.vmware.com/pdf/vmware_drs_wp.pdf).
And what all factors in besides memory & CPU?

Re: DRS and patching ESX hosts

2. Mar 7, 2007 10:58 AM in response to: hicksj
kix1979 (Champion, User Moderators, 3,769 posts since Oct 14, 2004)
And what all factors in besides memory & CPU?
That's it. As far as my testing goes, disk makes NO difference. In the future, disk and network I/O should really be added. Think about one VM that sucks up all the disk I/O for a path/HBA. Right now that host would suffer because DRS doesn't know about it, and VirtualCenter doesn't really show it either. :(

Re: DRS and patching ESX hosts

3. Mar 7, 2007 11:01 AM in response to: hicksj
Jae Ellers (Master, 1,097 posts since Feb 6, 2004)
Looks to me like it doesn't care if things are highly utilized, and migrates based on contention. So it's OK to have a full server as long as nothing wants more CPU or memory than it's getting. At least on the setting I'm running on, moderate.

Re: DRS and patching ESX hosts

4. Mar 7, 2007 11:05 AM in response to: Jae Ellers
hicksj (Master, 1,250 posts since May 6, 2005)
Agreed. I'm trying to find the other thread that discussed this, but I think the general consensus there was 'contention', not true load balancing.
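
To make the distinction in this sub-thread concrete, here is a toy Python sketch contrasting the two triggers being discussed: one that reacts to a utilization gap between hosts, and one that reacts only when some VM is getting less than it demands. This is not VMware's algorithm, which is unpublished as far as anyone here knows; the class names, the MHz figures, and the 25% gap threshold are all made up for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VM:
    demand_mhz: int     # CPU the guest wants right now (hypothetical figure)
    delivered_mhz: int  # CPU the host is actually giving it


@dataclass
class Host:
    name: str
    capacity_mhz: int
    vms: List[VM]

    def utilization(self) -> float:
        """Fraction of host CPU capacity currently delivered to VMs."""
        return sum(vm.delivered_mhz for vm in self.vms) / self.capacity_mhz

    def has_contention(self) -> bool:
        """True if any VM is receiving less CPU than it is asking for."""
        return any(vm.delivered_mhz < vm.demand_mhz for vm in self.vms)


def utilization_trigger(hosts: List[Host], max_gap: float = 0.25) -> bool:
    """Hypothetical 'true load balancing' rule: act on a large utilization gap."""
    loads = [h.utilization() for h in hosts]
    return max(loads) - min(loads) > max_gap


def contention_trigger(hosts: List[Host]) -> bool:
    """Hypothetical contention rule: act only when some VM is being starved."""
    return any(h.has_contention() for h in hosts)


# A full host whose VMs all get what they ask for, next to a nearly idle host.
busy = Host("busy", 10_000, [VM(1_000, 1_000) for _ in range(9)])
idle = Host("idle", 10_000, [VM(200, 200)])
hosts = [busy, idle]

print(utilization_trigger(hosts))  # True: 90% vs 2% is a large gap
print(contention_trigger(hosts))   # False: nobody wants more than they get
```

With these made-up numbers the cluster looks badly unbalanced by utilization, yet a contention-only rule has no reason to act, which would match the behaviour described above.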

Re: DRS and patching ESX hosts

6. May 14, 2007 4:13 PM in response to: DigitalVoodoo
eliot (Hot Shot, 145 posts since Sep 29, 2005)
I’ve experienced this exact behaviour twice in production (taking a box down for patching) and both times I put it down to the fact that there wasn't enough contention to justify moving stuff onto the newly re-joined server.

However, I have been playing around with resource pools on a new non-prod cluster and have created plenty of contention by overloading two servers (running 15 memtest86 sessions). Exiting maintenance mode on the third server should have resulted in DRS moving things around to level the loads out, but it didn't.

I have the cluster summary page showing two servers only delivering 60-70% of their resources and one (the newly joined) server showing 100% delivered - as it only has one VM running on it, which was automatically moved onto it shortly after joining the cluster. (so DRS works slightly)

The hosts tab shows two hosts at 100% and 95% utilisation and the newly joined server sat idle at 2%. I have reproduced this behaviour before, running 10-15 Iometer sessions for soak testing, and saw the same thing, but assumed it was down to the synthetic environment.

Should mention that the ESX boxes aren’t fully patched (March 07 only) – but I have yet another system that is fully patched that I can duplicate the same tests on soon.
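
For anyone wanting to put a number on the imbalance, here is a small Python sketch using the utilisation figures from the hosts tab above (100%, 95% and 2%). The host names are placeholders, and the spread and gap are just simple illustrative metrics, not something DRS is known to compute.

```python
from statistics import mean, pstdev

# Utilisation percentages as read off the hosts tab; host names are hypothetical.
host_utilisation = {"host-a": 100, "host-b": 95, "host-c": 2}

values = list(host_utilisation.values())
average = mean(values)            # what a perfectly level cluster would show per host
spread = pstdev(values)           # population standard deviation across hosts
gap = max(values) - min(values)   # busiest host minus idlest host

print(f"average {average:.1f}%, spread {spread:.1f}, gap {gap} points")
```

A level cluster would have every host near the average and a spread close to zero; here the gap between the busiest and idlest host is 98 percentage points.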

Re: DRS and patching ESX hosts

7. May 15, 2007 1:19 AM in response to: eliot
eliot (Hot Shot, 145 posts since Sep 29, 2005)
To add, from the training manual:
"Drs also rebalances when you add hosts to your cluster. So when you add more hosts to add capacity to your cluster, we will automatically migrate VM's onto that new host in order to exploit the new capacity that just came on line"

I've left it overnight: still two hosts maxed out and one host doing nothing, with one VM running.
