Anders_Gregerse
Hot Shot

Maintenance mode will not do VMotion (HA enabled cluster)

Hi

I have a cluster that consists of 6 hosts where manual VMotion works fine, but when I put a host into maintenance mode, it just doesn't VMotion the VMs to the other hosts in the cluster. This problem started after we had some major network problems, and every host has been rebooted since the network problem was solved. I've done vmkping and looked in different logs without finding any clues. Any idea how to fix this (so that I can update to Update 2 without putting too much manual work into it)?
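
For the record, the connectivity checks I did from each host's service console were along these lines (the VMkernel IP here is just an example):

    # list this host's VMkernel NICs and their IPs
    esxcfg-vmknic -l
    # ping the VMotion VMkernel IP of every other host in the cluster
    vmkping 192.168.10.12

All hosts answer, so the VMotion network itself looks fine.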

Anders

weinstein5
Immortal

Confirm that DRS is in fully automatic mode - if it is not in fully automatic, maintenance mode will not cause the VMs to be VMotioned off.

If you find this or any other answer useful, please consider awarding points by marking the answer correct or helpful
Anders_Gregerse
Hot Shot

DRS is set to fully automatic, and I can see VMs being moved around from time to time, but not when entering maintenance mode.

weinstein5
Immortal

Can you manually VMotion one of the VMs sitting on the host that you want to put into maintenance mode?

If you find this or any other answer useful, please consider awarding points by marking the answer correct or helpful
Anders_Gregerse
Hot Shot

Yes, as I said, manual VMotion works fine, and once all the VMs have been VMotioned off, the maintenance mode task continues.

Anders_Gregerse
Hot Shot

Restarting the VirtualCenter service didn't make a difference either.

wallbreaker
Contributor

Hi, yesterday we updated our 6 ESX hosts from 3.5 U1 to 3.5 U2, and we have the same issue.

Maintenance mode will not do VMotion, but DRS is OK and manual VMotion works too.

Vincent

ets_vm_atl
Contributor

If you've updated VC to Update 2, the following is listed in the known bugs:

Virtual Machine Migrations Are Not Recommended When the ESX Server Host Is Entering the Maintenance or Standby Mode

No virtual machine migrations will be recommended (or performed, in fully automated mode) off of a host entering maintenance or standby mode, if the VMware HA failover level would be violated after the host enters the requested mode. This restriction applies whether strict HA admission control is enabled or not.
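
To put some illustrative numbers on that: in a 6-host cluster with a configured failover capacity of 1, HA has to see at least one host's worth of spare capacity left over after the evacuating host is gone. If the cluster is already (mis)reporting a current failover capacity of 0, then any host entering maintenance mode "violates" the failover level, so DRS recommends no migrations at all - even in fully automated mode.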

Since the upgrades from ESX 2.x/VC 1.x, some of our clusters still inaccurately report that the HA failover capacity is 0 (still working on troubleshooting that one). Is your VC showing HA failover capacity at 1 or more for the cluster with the issue?

Anders_Gregerse
Hot Shot

#¤&"#%#/¤&, had to get it out...

My VC is also showing an HA failover capacity of 0. Nice. I only did the update because of ESX 3i Update 2.

scot21
Contributor

Great. Guess I am done with Update 2 until this is fixed.

ets_vm_atl
Contributor

If you have plenty of resources but HA is showing 0 available, check your guests' CPU/memory reservations and limits. If you choose the farm in the left pane and the Resource Allocation tab in the right, you can see them all in one place. Make sure the Reservations are set to 0 (zero) and the Limits are set to Unlimited for both CPU and memory for all guests. This will not change each guest's hardware allocation. There's a tech article or post somewhere that mentions that ESX 2.x defaulted to non-zero values, and those values got carried over if you upgraded the guests. VI3 defaults to 0/Unlimited, which HA seems to prefer. Apparently, HA doesn't always calculate correctly if these still have the carryover values from ESX 2.x. I changed them all to 0/Unlimited on one cluster, and that resolved HA incorrectly showing 0 on that cluster.
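
If you'd rather not click through every guest, the carried-over settings should also be visible from the service console - this is just a rough sketch, and I'm assuming the upgraded guests still carry their resource settings as sched.* entries in the .vmx files (the datastore paths are examples):

    # list every sched.* resource setting across all registered VMs
    grep -H "^sched\." /vmfs/volumes/*/*/*.vmx

Anything showing a non-zero reservation or a capped limit is a candidate for resetting to 0/Unlimited.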

wallbreaker
Contributor

On my second cluster (2 IBM x3650 hosts with 23 VMs), I have plenty of resources, and HA is showing a current failover capacity of 1 and a configured failover capacity of 1.

CPU Reservation Used: 800 MHz

CPU Unreserved: 31698 MHz

Memory Reservation Used: 2373 MB

Memory Unreserved: 56471 MB

If HA is disabled, maintenance mode will do VMotion and all works fine. But if HA is enabled, maintenance mode will not do VMotion.

Anders_Gregerse
Hot Shot

Experienced that in our 3.0.2 to 3.5 migration. We only have a few reservations/limits set (for good reason), but the current HA calculation, where the hosts are sliced into slots based on the biggest VM, doesn't come close to the real usage. We have two big VMs that are far bigger in terms of memory and CPU than the rest of our VMs, which reduces our HA capacity below our true capacity. When peaking I'm using 50% of the memory and 25% of the CPU, and with 6 hosts I should be able to lose at least 2 hosts before it starts to get tight.

ets_vm_atl
Contributor

I opened a support case yesterday and found out the issue I've been having with HA falsely reporting no failover capacity is fixed in Update 2. Apparently in the first version, HA used 1 vCPU for each guest to calculate failover capacity. Since that wasn't enough, they changed the calculation in Update 1 to use the number of vCPUs from the largest guest, times the number of guests. So if you have one 4-vCPU guest and all the rest are 1-vCPU, HA capacity is calculated as if all guests had 4 vCPUs. That method is flawed, of course, and shows less failover capacity than is truly available. Update 2 is said to use the actual number of vCPUs on each guest, reporting HA failover capacities correctly.
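
A quick worked example with made-up numbers - say a cluster runs 20 guests, one with 4 vCPUs and nineteen with 1 vCPU:

    Update 1 math: 20 guests x 4 vCPUs (largest guest) = 80 vCPUs assumed
    Update 2 math: (1 x 4 vCPUs) + (19 x 1 vCPU)       = 23 vCPUs actual

With Update 1 overcounting by almost a factor of 4, it's easy to see how HA could conclude there's no failover capacity left on a cluster that really has plenty.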

The calculation also takes reservations into account. If you've got reservations set, you could change them to zero and your issue might go away. My understanding is that unless you're resource-constrained, reservations don't do anything.

Anders_Gregerse
Hot Shot

After submitting an SR, the current status is that HA has been tightened, and even though I can lose 50% of my hosts, HA says that I have 0 failover capacity. Because of the 0 failover capacity, setting a host into maintenance mode will not automatically VMotion VMs to other hosts (according to HA there are no resources to move the VMs to), so it's up to the humans to make that decision and do it manually (keep in mind that this requires knowledge of which resource pools the VMs are in). Powered-off VMs are moved off the host automatically, since they do not require any resources.

I'm going to pursue this issue further, both through the SR and other channels.

tomaddox
Enthusiast

In the cluster settings, under VMware HA, there's an option box for Admission Control which allows you to choose whether to power on virtual machines even if they violate availability constraints. Try setting that to "Allow virtual machines to be powered on . . ." if it is not already set. Since VMware HA has always reported zero hosts available in our environment, I always leave this set.

mdparker
Contributor

I am having this same problem. VC is Update 2 and all ESX servers are Update 2. I have turned the "Allow virtual machines to be powered on . . ." option on and off ... no difference. Weird thing is, I noticed this problem when my ESX servers were on Update 1 (VC on Update 2), and I have other servers in another cluster still running Update 1 which work fine. For my broken cluster, the only thing that works is to disable HA. I'm working with a VMware support person. If I get any results, I'll post here, but so far he's not provided any useful answers.

Mike

tomaddox
Enthusiast

I ran into similar problems at one point, and it turned out that the VMotion flag had been disabled on the VMkernel interface for one of the ESX servers in the cluster. DRS and manual VMotion operations could succeed because the VMs would go to other hosts, but putting a host into maintenance mode would cause problems because all the VMs would try to migrate at once, and a bunch would time out. In any case, check the VMkernel interfaces on all your ESX hosts and make sure that VMotion is enabled on all of them.

Also, if you're using VLANs, make sure that you have the VLAN ID set correctly for your VMkernel interfaces.
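
The VMotion enable/disable checkbox itself lives in the VI Client as far as I know, but you can at least sanity-check the VMkernel networking from each host's service console with the standard esxcfg commands:

    # VMkernel interfaces and their IP settings on this host
    esxcfg-vmknic -l
    # vSwitch and port group layout, including VLAN IDs
    esxcfg-vswitch -l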

epping
Expert

Hi all

Just to confirm, this is now a known bug with 3.5 Update 2 and HA. So far support are saying the only workaround is to disable HA while entering maintenance mode ... remember this will also cause problems with Patch Manager.

If anyone gets an answer on this, please post back.

Very frustrating.

epping
Expert

Anders, is there any chance to change the title of the thread to "HA enabled cluster", just to help others who are searching?

thanks
