Has anyone been having a problem with Maintenance Mode getting stuck at 2%? This is an intermittent issue that was discovered while testing a new ESX rollout. Occasionally, when a host is put into maintenance mode, only some, or none, of the virtual machines are migrated off the host, and the Maintenance Mode task stays at 2%. This only happens when HA is enabled. The process does not error out or produce an error message.
ESX 3.5 Update 1 with VCMS 2.5 Update 1
2 hosts (16 CPU's, 32 GB RAM)
4 virtual machines (4 vCPU's, 3.5 GB RAM)
During a call with VMware support (with a live example of the problem), we changed the HA admission control setting to allow virtual machines to be powered on even if they violate availability constraints. Within 10 seconds the remainder of the virtual machines began to migrate and the Maintenance Mode task completed properly.
After changing this setting, we made numerous attempts to recreate the issue, and it has not come back.
Has anyone else seen this or reported it to VMware Support?
If entering maintenance mode is getting stuck, it means that VC cannot migrate the VMs away to another ESX server. What you describe seems to be not a bug but a feature (violating M$'s copyright here). If you only have 2 ESX servers, then putting one into maintenance mode leaves only 1, which would violate your HA settings: it is no longer possible to restart the VMs on another host, because there is only one left. So to make this work you have the following possibilities:
Use at least 3 ESX hosts for VMware HA so that there are always two hosts left for HA.
Or violate the HA configuration by entering the maintenance mode on one of the hosts and keep the remaining ESX server without the possibility to restart virtual machines on a host-failure (and so without HA).
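The reasoning above can be sketched as a simple host count. This is only a simplified model that counts hosts and ignores the resource calculations HA actually performs; the function name and parameters are made up for illustration.

```python
def violates_ha_constraint(total_hosts, hosts_in_maintenance,
                           host_failures_to_tolerate=1):
    """Return True if the hosts left over cannot satisfy the HA setting.

    Simplified assumption: the cluster needs at least one host to run
    the VMs plus one spare host per tolerated host failure.
    """
    remaining = total_hosts - hosts_in_maintenance
    return remaining < 1 + host_failures_to_tolerate


# 2-host cluster, one host entering maintenance mode: constraint violated
print(violates_ha_constraint(2, 1))
# 3-host cluster, one host entering maintenance mode: two hosts remain
print(violates_ha_constraint(3, 1))
```

With the default of one tolerated host failure, the 2-host case reports a violation and the 3-host case does not, which matches the two options listed above.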
This issue only comes up once in a while at this client's site. It hung only once in 5 attempts. When you attempt to put a host into maintenance mode, it tells you that you are about to violate the constraints of the HA cluster and asks if you really want to do it. How else do you patch hosts or perform hardware maintenance? You don't have to disable HA to patch a host in a 2-host cluster. I have done this numerous times in my lab at work without an issue. VMware support has confirmed that this is an issue and not a feature.
Typically I've seen it when the VM CDROM is set up for a host device instead of being configured for the client. Either that, or someone left a Client CDROM connection enabled and closed the management interface.
That will stop vmotion in its tracks...
Double-check to make sure you don't have any DRS anti-affinity rules in place.
Also, how many VM's do you have created between two hosts?
Check this as well regarding HA failover capacity, http://www.vmwarewolf.com/ha-failover-capacity
There are only 4 virtual machines in the cluster at this point. I have ensured that any CDROMs and floppies are disconnected from all of the VM's. There are no snapshots on any of the VM's. There are also no affinity or anti-affinity rules set on the cluster.
Have you tried to disable and re-enable HA? How about removing one server from the cluster and adding it back in, and then doing the same with the other.
Verify also on the vm properties that you don't have additional reservations/limits set.
I have disabled and re-enabled everything to do with HA and DRS. I went as far as rebuilding the cluster from scratch and rebuilding the host. We removed all resource pools and double-checked each VM for reservations and limits. Once we got on the phone with VM Support, we went through every single setting again. The support guy was dumbfounded.
On the running ESX hosts, how much memory do you see in VC? How much memory on the summary tab does it say is available and reserved and/or in-use?
The VM Support person checked that as well. There was more than enough available and barely anything reserved or in use.
At this point, I wanted to check and see if the NICs you are using are sharing interrupts, which is why I asked for that output. It's a strange issue, and so if it's not a resource issue, maybe there is still an underlying communication problem.
In VI 3.5, HA and DRS have strict rules. In your HA configuration (on the Summary tab of the cluster), what does "Current Failover Capacity" show? It should show 1 for the "Enter Maintenance Mode" command to work.
Hope this helps
It looks like on your esx01 and esx02, your onboard NICs are all sharing IRQ 16, as well as 1 port on the quad Intel NIC.
How are you teaming your NICs?
I would use this teaming scheme so as not to share any IRQs on your NICs.
vmnic0 --> vmnic2
vmnic1 --> vmnic3
vmnic4 --> vmnic5
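As a rough way to check a teaming scheme like this against IRQ assignments, here is a minimal sketch. The IRQ numbers in `nic_irqs` are invented for illustration and are not taken from the poster's hosts; only the three teams come from the scheme above.

```python
def teams_sharing_irq(nic_irqs, teams):
    """Return the teams in which at least two member NICs share an IRQ."""
    shared = []
    for team in teams:
        irqs = [nic_irqs[nic] for nic in team]
        # Duplicate IRQ numbers within a team mean its members share an interrupt
        if len(set(irqs)) < len(irqs):
            shared.append(team)
    return shared


# Hypothetical IRQ assignments (assumption, not real output from esx01/esx02)
nic_irqs = {
    "vmnic0": 16, "vmnic1": 16,   # onboard NICs on IRQ 16
    "vmnic2": 17, "vmnic3": 18,
    "vmnic4": 16, "vmnic5": 19,   # one quad-port NIC port also on IRQ 16
}

proposed = [("vmnic0", "vmnic2"), ("vmnic1", "vmnic3"), ("vmnic4", "vmnic5")]
print(teams_sharing_irq(nic_irqs, proposed))
print(teams_sharing_irq(nic_irqs, [("vmnic0", "vmnic1")]))
```

With real hosts, the mapping would have to be filled in from the actual interrupt assignments on each ESX server.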
Also, if you have any vm's that have begun to "Install Tools", it will not migrate. Be sure that if you right click on the machine name, confirm that you do not have something like "End Install/Update Tools". If you do, then end it and try again. Some OS's like Netware leave the Install active, and you must manually end it when completed.
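The per-VM checks suggested in this thread (CDROM backing, a client CDROM left connected, an unfinished Tools install) can be collected into one checklist. This is a minimal sketch over hand-entered data: `VmState` and its field names are hypothetical and do not correspond to any VMware API.

```python
from dataclasses import dataclass


@dataclass
class VmState:
    name: str
    cdrom_backing: str          # "host", "client", or "none" (hypothetical labels)
    cdrom_connected: bool
    tools_install_active: bool  # True if "End Install/Update Tools" is still offered


def migration_blockers(vm):
    """List the reasons mentioned in this thread why a VM might not migrate."""
    reasons = []
    if vm.cdrom_backing == "host":
        reasons.append("CD-ROM is set to a host device")
    if vm.cdrom_backing == "client" and vm.cdrom_connected:
        reasons.append("client CD-ROM connection left enabled")
    if vm.tools_install_active:
        reasons.append("Tools install/update still active")
    return reasons


# Example data (made up): one VM with a leftover Tools install, one clean VM
vms = [
    VmState("vm01", "none", False, True),
    VmState("vm02", "client", False, False),
]
for vm in vms:
    print(vm.name, migration_blockers(vm))
```

An empty list for every VM only rules out these particular causes; it does not cover HA constraints or DRS rules.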
For redundancy we designed the following:
vmnic0 --> vmnic3 (Service Console)
vmnic1 --> vmnic4 (VMnetwork)
vmnic5 (unused at this time)