Re: Maintenance Mode stuck at 2%

pgifford · ‎04-26-2008

Hi,

Has anyone been having a problem with Maintenance Mode getting stuck at 2%? It is an intermittent issue that we discovered while testing a new ESX rollout. Occasionally, when a host is put into maintenance mode, some or all of the virtual machines are not migrated off the host and the Maintenance Mode task stays at 2%. This only happens when HA is enabled. The task does not error out or produce an error message.

Sample Configuration:

ESX 3.5 Update 1 with VCMS 2.5 Update 1

2 hosts (16 CPUs, 32 GB RAM)

4 virtual machines (4 vCPUs, 3.5 GB RAM)

During a call with VMware support (with a live example of the problem), we changed the HA admission control setting to allow virtual machines to be powered on even if they violate availability constraints. Within 10 seconds the remaining virtual machines began to migrate and the Maintenance Mode task completed properly.

After changing this setting, we made numerous attempts to recreate the issue and it has not come back.

Has anyone else seen this or reported it to VMware Support?

Paul Gifford Virtualization Practice Lead | Mainland Information Systems Ltd.

fejf · ‎04-26-2008

Hi,

If entering maintenance mode gets stuck, it means that VC cannot migrate the VMs away to another ESX server. What you describe seems to be not a bug but a feature (violating M$'s copyright here). With only 2 ESX servers, putting one into maintenance mode leaves a single host, which violates your HA settings because it is no longer possible to restart the VMs on another host (there is only one left). So to make this work you have the following possibilities:

Use at least 3 ESX hosts for VMware HA so that there are always two hosts left for HA.

Or violate the HA configuration by entering the maintenance mode on one of the hosts and keep the remaining ESX server without the possibility to restart virtual machines on a host-failure (and so without HA).
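The host-count reasoning above can be sketched in a few lines of Python. This is only a simplified model of that argument, not VMware's actual admission control algorithm (which also looks at resources); the function name and parameters are made up for illustration.

```python
def maintenance_mode_violates_ha(total_hosts, hosts_in_maintenance, failover_capacity=1):
    """Simplified host-count model (an assumption, not VMware's algorithm).

    A cluster configured to tolerate `failover_capacity` host failures needs
    at least one more active host than that, so there is somewhere left to
    restart the VMs after a failure.
    """
    remaining = total_hosts - hosts_in_maintenance
    return remaining < failover_capacity + 1


# 2-host cluster, one host entering maintenance mode: constraint violated.
print(maintenance_mode_violates_ha(2, 1))  # True
# 3-host cluster, one host entering maintenance mode: two hosts remain.
print(maintenance_mode_violates_ha(3, 1))  # False
```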

--

There are 10 types of people. Those who understand binary and the rest. And those who understand gray-code.


pgifford · ‎04-26-2008

Hi,

This issue only comes up once in a while at this client's site. It hung only once in 5 attempts. When you attempt to put a host into maintenance mode, it tells you that you are about to violate the constraints of the HA cluster and asks if you really want to do it. How else do you patch hosts or perform hardware maintenance? You don't have to disable HA to patch a host in a 2-host cluster. I have done this numerous times in my lab at work without an issue. VMware support has confirmed that this is an issue and not a feature.


xcomiii · ‎04-26-2008

I've had similar issues when there are snapshot or CD-ROM problems on the VMs that need to be migrated to the other host.

Rumple · ‎04-26-2008

Typically I've seen it when the VM CD-ROM is set up for the host instead of being configured for the client. Either that, or someone left a client CD-ROM connection enabled and closed the management interface.

That will stop vmotion in its tracks...

kjb007 · ‎04-26-2008

Double-check to make sure you don't have any DRS anti-affinity rules in place.

Also, how many VMs have you created across the two hosts?

Check this as well regarding HA failover capacity, http://www.vmwarewolf.com/ha-failover-capacity

-KjB

vExpert/VCP/VCAP vmwise.com / @vmwise -KjB

pgifford · ‎04-26-2008

There are only 4 virtual machines in the cluster at this point. I have ensured that any CD-ROMs and floppies are disconnected from all of the VMs. There are no snapshots on any of the VMs. There are also no affinity or anti-affinity rules set on the cluster.


kjb007 · ‎04-26-2008

Have you tried disabling and re-enabling HA? How about removing one server from the cluster and adding it back in, and then doing the same with the other?

Also verify in the VM properties that you don't have additional reservations or limits set.

-KjB


pgifford · ‎04-27-2008

I have disabled and re-enabled everything to do with HA and DRS. I went as far as rebuilding the cluster from scratch and rebuilding the host. We removed all resource pools and double-checked each VM for reservations and limits. Once we got on the phone with VMware Support, we went through every single setting again. The support guy was dumbfounded.


kjb007 · ‎04-27-2008

On the running ESX hosts, how much memory do you see in VC? How much memory does the Summary tab show as available, reserved, and in use?

-KjB


pgifford · ‎04-27-2008

The VMware Support person checked that as well. There was more than enough available and barely anything reserved or in use.


kjb007 · ‎04-27-2008

Can you post the output of esxcfg-nics -l and lspci -v?

-KjB


pgifford · ‎04-27-2008

I don't have access to the hosts at this point. I will try to get a hold of them on Monday.


kjb007 · ‎04-27-2008

At this point, I wanted to check and see if the NICs you are using are sharing interrupts, which is why I asked for that output. It's a strange issue, and so if it's not a resource issue, maybe there is still an underlying communication problem.
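For anyone who wants to run the same check on their own hosts, here is a rough Python sketch that groups the Ethernet controllers in lspci -v output by IRQ. It assumes the usual pciutils layout (a device header line followed by indented detail lines, one of which contains "IRQ <n>"); the sample text and the function name are made up for illustration and are not the output from the hosts in this thread.

```python
import re
from collections import defaultdict

# Hypothetical sample, NOT the real output from esx01/esx02.
SAMPLE = """\
02:00.0 Ethernet controller: Example NIC A
\tFlags: bus master, fast devsel, latency 0, IRQ 16
03:00.0 Ethernet controller: Example NIC B
\tFlags: bus master, fast devsel, latency 0, IRQ 16
04:00.0 Ethernet controller: Example NIC C
\tFlags: bus master, fast devsel, latency 0, IRQ 17
00:1f.2 IDE interface: Example storage controller
\tFlags: bus master, IRQ 16
"""


def nics_sharing_irqs(lspci_output):
    """Return {irq: [pci addresses]} for IRQs used by more than one NIC."""
    by_irq = defaultdict(list)
    current = None  # PCI address of the Ethernet device being read, if any
    for line in lspci_output.splitlines():
        if line and not line[0].isspace():
            # Unindented line: start of a new device entry.
            current = line.split(" ", 1)[0] if "Ethernet controller" in line else None
        elif current:
            match = re.search(r"\bIRQ (\d+)", line)
            if match:
                by_irq[int(match.group(1))].append(current)
    return {irq: devs for irq, devs in by_irq.items() if len(devs) > 1}


print(nics_sharing_irqs(SAMPLE))  # {16: ['02:00.0', '03:00.0']}
```

Only network devices are compared here; the storage controller on IRQ 16 in the sample is deliberately ignored.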

-KjB


Rajeev_S · ‎04-27-2008

In VI 3.5, HA and DRS have strict rules. In your HA configuration (on the Summary tab of the cluster), what does "Current failover capacity" show? It should show 1 for the "Enter maintenance mode" command to work.

Hope this helps

pgifford · ‎04-28-2008

Here are the configs from esx01


pgifford · ‎04-28-2008

and here are the configs from esx02


kjb007 · ‎04-29-2008

It looks like on your esx01 and esx02, your onboard NICs are all sharing IRQ 16, as well as 1 port on the quad Intel NIC.

How are you teaming your NICs?

I would use this teaming scheme so as not to share any IRQs on your NICs.

vmnic0 --> vmnic2

vmnic1 --> vmnic3

vmnic4 --> vmnic5
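A quick way to sanity-check a teaming layout against this idea is sketched below in Python. The IRQ numbers are placeholders, not the values from these hosts; substitute whatever your own lspci -v output reports. The function name is also just for illustration.

```python
# Placeholder vmnic-to-IRQ mapping; replace with values from your own hosts.
HYPOTHETICAL_IRQS = {"vmnic0": 16, "vmnic1": 16, "vmnic2": 17, "vmnic3": 18}


def teams_sharing_irq(nic_irqs, teams):
    """Return the teams in which at least two member NICs use the same IRQ."""
    return [
        team for team in teams
        if len({nic_irqs[nic] for nic in team}) < len(team)
    ]


# Pairing two NICs that sit on the same IRQ gets flagged.
print(teams_sharing_irq(HYPOTHETICAL_IRQS, [("vmnic0", "vmnic1")]))
# Pairing NICs on different IRQs returns an empty list.
print(teams_sharing_irq(HYPOTHETICAL_IRQS, [("vmnic0", "vmnic2"), ("vmnic1", "vmnic3")]))
```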

Message was edited by: kjb007 : changed iternal to onboard


cryptonym · ‎04-29-2008

Also, any VM that has started "Install Tools" will not migrate. Right-click the machine name and confirm that you do not see something like "End Install/Update Tools". If you do, end it and try again. Some OSes, like NetWare, leave the install active, and you must end it manually once it has completed.

pgifford · ‎04-29-2008

Hi,

For redundancy we designed the following:

vmnic0 --> vmnic3 (Service Console)

vmnic1 --> vmnic4 (VMnetwork)

vmnic2 (VMotion)

vmnic5 (unused at this time)

