VMware Cloud Community
MikeOD
Enthusiast

DRS didn't seem to be working after HA event

We have 12 HP BL465C G7 blades with AMD processors in a vSphere cluster.  The blades have been in production without problems for about a year. We rebuilt each blade with vSphere 5 last December and applied Update 1 about a week ago.

The cluster is running with HA and DRS (set to "automatic").  We've not had any issues with moving machines, either manually, as part of a maintenance mode, or automatic DRS.  We have enough capacity that we could handle at least two host failures.

Sunday morning, we had a blade fail with about 12 VMs on it.  HA took care of relocating the VMs and restarting them OK. Due to an unrelated issue, I didn't get notified about the failure.

Several hours later, we received a warning that one of the running blades had high memory usage.  I logged into vCenter and checked the system.  It showed the failed blade as unreachable, all of the others with the yellow diamond and the warning that the "HA agent on the host couldn't reach some of the management network addresses of the other hosts", and the one blade that triggered the memory alert with the red diamond.

The odd thing was that the blade that triggered the memory warning was at over 95% memory usage, but there were at least three blades under 30% load.  It looks like DRS wasn't working.  I manually moved a couple of VMs from the high-memory blade to one of the more lightly loaded ones, and they migrated with no issue.

Why didn't DRS move some of the VMs off the overloaded blade?  Does DRS stop working when the cluster is in an HA "faulted" situation?  Is there some way to have it continue to move things as needed, even if one host is down?

Any comments would be appreciated.

Mike O'Donnell

9 Replies
Troy_Clavell
Immortal

Where is the "slider" set for your DRS automation level? And you are set up for fully automated, correct?  You might think about changing the setting to be a little more aggressive to see if DRS kicks in.

Another thing: if you look at the Summary tab of your cluster, you should see a box for vSphere DRS.  If it shows a green check mark and "Load balanced", DRS will do nothing.

MikeOD
Enthusiast

We are set to "fully automated", and the slider is set at the middle, to apply priority 1, 2, and 3 (I don't think we've ever changed it from the default).

The "summary" tab shows a "current host load standard deviation" of .112 and says "load imbalanced", but looking at the DRS tab doesn't show any recommendations.

Looking at the logs, during the time we had the failed blade (and the others in the "warning" state), several times the one blade went over 95% memory for over 15 minutes at a time, while others were under 30% memory and CPU.   Wouldn't that have been enough to trigger a DRS move, even with the slider set at the middle setting?

Troy_Clavell
Immortal

Move the slider to the right, maybe just one tick for now, to see if it helps.

MikeOD
Enthusiast

I just did that before I read your message.  We've got three blades over 80%, and three under 10%, so I would think we'd see something with the priority set to pick up 1-4.

If it is going to make any changes, about how long should it take to see something happen?

Would it do anything to disable DRS on the cluster and then re-enable it?  I know that used to correct some issues with the "old" 4.x HA.

weinstein5
Immortal

One thing to remember is that DRS's function is to ensure that the VMs are receiving sufficient resources. If your VMs are receiving sufficient resources, then DRS will not move them. That means that after an HA event you might end up in the situation you are in, where it appears a host is overloaded but in reality the VMs are not constrained for either memory or CPU. So even disabling and restarting DRS will not do anything as long as nothing else changes.

MikeOD
Enthusiast

I realize DRS won't evenly balance the load, but I would have thought I should have seen some movement, especially at the most aggressive setting.

At the normal (middle) setting, the summary page was showing a current host load deviation (I don't remember the exact term) of .11, a target of <.08, and a yellow triangle with the word "unbalanced", so shouldn't it be moving some things?
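
To make the comparison above concrete, here is a rough sketch of the kind of check I have in mind: take a load figure for each host, compute the standard deviation, and compare it to the target. This is only an illustration. The function names and the sample load values are made up, and it uses raw usage fractions, whereas as I understand it DRS works from resource entitlements, so this is not the actual DRS calculation.

```python
from statistics import pstdev


def host_load_std_dev(host_loads):
    """Population standard deviation of per-host load fractions (0.0 to 1.0)."""
    return pstdev(host_loads)


def is_imbalanced(host_loads, target):
    """True when the spread of host loads exceeds the target deviation."""
    return host_load_std_dev(host_loads) > target


if __name__ == "__main__":
    # Hypothetical load fractions for the 11 surviving blades:
    # three heavily loaded, three nearly idle, the rest in between.
    loads = [0.85, 0.82, 0.81, 0.08, 0.07, 0.05, 0.45, 0.50, 0.40, 0.55, 0.48]
    target = 0.08  # the target shown on the summary page

    deviation = host_load_std_dev(loads)
    print(f"deviation={deviation:.3f} target={target} "
          f"imbalanced={is_imbalanced(loads, target)}")
```

With a spread like that the deviation is well above the target, which is why I expected at least some recommendations to show up.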

MikeOD
Enthusiast

It's working now.

I first tried disabling DRS then re-enabling it, but nothing changed.

I then rebooted the vCenter server.  As soon as it came back up, it started moving VMs.  The cluster is now below the "target" deviation and is showing as "Load Balanced".

chriswahl
Virtuoso

For the future, you can force DRS to evaluate your environment by clicking the "Run DRS" link at the top right corner of the DRS tab.

I believe it evaluates the environment on a 10 minute cycle.

VCDX #104 (DCV, NV) ஃ WahlNetwork.com ஃ @ChrisWahl ஃ Author, Networking for VMware Administrators
MikeOD
Enthusiast

I had tried that, but it didn't work.  Before the reboot, DRS wasn't even coming up with any recommendations.  It was showing as "unbalanced" on the summary tab, but on the DRS tab there were no DRS Recommendations listed.
