VMware Cloud Community
pgifford
Contributor

Maintenance Mode stuck at 2%

Hi,

Has anyone been having a problem with Maintenance Mode getting stuck at 2%? This is a random issue that doesn't occur every time. It was discovered while testing a new ESX rollout. Occasionally, when a host is put into maintenance mode, some or all of the virtual machines are not migrated off the host and the Maintenance Mode task stays at 2%. This only happens when HA is enabled. The process does not error out or produce an error message.

Sample Configuration:

ESX 3.5 Update 1 with VCMS 2.5 Update 1

2 hosts (16 CPUs, 32 GB RAM)

4 virtual machines (4 vCPUs, 3.5 GB RAM)

During a call with VMware Support (with a live example of the problem), we changed the HA admission control setting to allow virtual machines to be powered on even if they violate availability constraints. Within 10 seconds the remaining virtual machines began to migrate and the Maintenance Mode task completed properly.

After this change, numerous attempts were made to recreate the issue with the new HA setting, and it has not come back.

Has anyone else seen this or reported it to VMware Support?

Paul Gifford Virtualization Practice Lead | Mainland Information Systems Ltd.
35 Replies
fejf
Expert

Hi,

If entering maintenance mode gets stuck, it means that VC cannot migrate the VMs away to another ESX server. What you describe seems not to be a bug but a feature (violating M$'s copyright here ;-) ). If you only have 2 ESX servers, putting one into maintenance mode leaves just 1 ESX host, and that would violate your HA settings because it would no longer be possible to restart the VMs on another host (there is only one left). So to make this work you have the following possibilities:

Use at least 3 ESX hosts in the VMware HA cluster, so that when one host is in maintenance mode there are still two hosts left for HA.

Or accept violating the HA configuration by entering maintenance mode on one of the hosts, leaving the remaining ESX server without the possibility to restart virtual machines on a host failure (and so effectively without HA).

--

There are 10 types of people. Those who understand binary and the rest. And those who understand gray-code.

pgifford
Contributor

Hi,

This issue only comes up once in a while at this client's site. It only hung once in 5 different attempts. When you attempt to put a host into maintenance mode, it tells you that you are about to violate the constraints of the HA cluster and asks if you really want to do it. How else do you patch hosts or perform hardware maintenance? You don't have to disable HA to patch a host in a 2-host cluster; I have done this numerous times in my lab at work without an issue. VMware Support has confirmed that this is an issue and not a feature.

Paul Gifford Virtualization Practice Lead | Mainland Information Systems Ltd.
xcomiii
Contributor

I've had similar issues if there are any snapshot or CD-ROM issues on the VMs that need to be migrated to the other host.

Rumple
Virtuoso

Typically I've seen it when the VM's CD-ROM is set up to use a host device instead of being configured for the client. Either that, or someone left a client CD-ROM connection enabled and closed the management interface.

That will stop vmotion in its tracks...

kjb007
Immortal

Double-check to make sure you don't have any DRS anti-affinity rules in place.

Also, how many VMs do you have created between the two hosts?

Check this as well regarding HA failover capacity, http://www.vmwarewolf.com/ha-failover-capacity

-KjB

vExpert/VCP/VCAP vmwise.com / @vmwise -KjB
pgifford
Contributor

There are only 4 virtual machines in the cluster at this point. I have ensured that any CD-ROMs and floppy drives are disconnected from all of the VMs. There are no snapshots on any of the VMs, and there are no affinity or anti-affinity rules set on the cluster either.

Paul Gifford Virtualization Practice Lead | Mainland Information Systems Ltd.
kjb007
Immortal

Have you tried disabling and re-enabling HA? How about removing one server from the cluster and adding it back in, and then doing the same with the other?

Also verify in the VM properties that you don't have additional reservations or limits set.

-KjB

vExpert/VCP/VCAP vmwise.com / @vmwise -KjB
pgifford
Contributor

I have disabled and re-enabled everything to do with HA and DRS. I went as far as rebuilding the cluster from scratch and rebuilding the host. We removed all resource pools and double-checked each VM for reservations and limits. Once we got on the phone with VMware Support, we went through every single setting again. The support engineer was dumbfounded.

Paul Gifford Virtualization Practice Lead | Mainland Information Systems Ltd.
kjb007
Immortal

On the running ESX hosts, how much memory do you see in VC? On the Summary tab, how much memory does it say is available, and how much is reserved and/or in use?

-KjB

vExpert/VCP/VCAP vmwise.com / @vmwise -KjB
pgifford
Contributor

The VMware Support person checked that as well. There was more than enough memory available and barely anything reserved or in use.

Paul Gifford Virtualization Practice Lead | Mainland Information Systems Ltd.
kjb007
Immortal

Can you post an esxcfg-nics -l and an lspci -v?
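
Something like this from the ESX service console should capture both (a rough sketch, assuming root access on the console; the grep is just to narrow down the lspci output):

esxcfg-nics -l                      # each vmnic with its driver, PCI address, speed and link state
lspci -v | grep -A 3 -i ethernet    # each Ethernet controller; the Flags line a few lines below shows the IRQ it uses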

-KjB

vExpert/VCP/VCAP vmwise.com / @vmwise -KjB
pgifford
Contributor

I don't have access to the hosts at this point. I will try to get a hold of them on Monday.

Paul Gifford Virtualization Practice Lead | Mainland Information Systems Ltd.
kjb007
Immortal

At this point I wanted to check whether the NICs you are using are sharing interrupts, which is why I asked for that output. It's a strange issue, so if it's not a resource problem, maybe there is still an underlying communication problem.

-KjB

vExpert/VCP/VCAP vmwise.com / @vmwise -KjB
Rajeev_S
Expert

In VI 3.5, HA and DRS have strict rules. In your HA configuration (on the Summary tab of the cluster), what does the "Current Failover Capacity" show? It should show 1 for the "Enter Maintenance Mode" command to work.

Hope this helps :)

pgifford
Contributor

Here are the configs from esx01

Paul Gifford Virtualization Practice Lead | Mainland Information Systems Ltd.
pgifford
Contributor

And here are the configs from esx02.

Paul Gifford Virtualization Practice Lead | Mainland Information Systems Ltd.
kjb007
Immortal

It looks like on both esx01 and esx02, your onboard NICs are all sharing IRQ 16, as well as one port on the quad-port Intel NIC.

How are you teaming your NICs?

I would use this teaming scheme so as not to share any IRQs on your NICs.

vmnic0 --> vmnic2

vmnic1 --> vmnic3

vmnic4 --> vmnic5
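
As a rough sketch, the relinking would look something like this from the service console (vSwitch0/1/2 are only placeholders, so substitute your actual vSwitch names; the active/standby failover order itself is still set in the NIC teaming policy in VC):

esxcfg-vswitch -l                   # see which vmnics are linked to which vSwitch today
esxcfg-vswitch -U vmnicX vSwitchY   # unlink a NIC from its current vSwitch before moving it
esxcfg-vswitch -L vmnic0 vSwitch0   # then link the pairs: vmnic0 + vmnic2...
esxcfg-vswitch -L vmnic2 vSwitch0
esxcfg-vswitch -L vmnic1 vSwitch1   # ...vmnic1 + vmnic3...
esxcfg-vswitch -L vmnic3 vSwitch1
esxcfg-vswitch -L vmnic4 vSwitch2   # ...and vmnic4 + vmnic5
esxcfg-vswitch -L vmnic5 vSwitch2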


vExpert/VCP/VCAP vmwise.com / @vmwise -KjB
cryptonym
Enthusiast

Also, if you have any VMs that have begun to "Install Tools", they will not migrate. Right-click on the machine name and confirm that you do not see something like "End Install/Update Tools". If you do, end it and try again. Some OSes, like NetWare, leave the install active, and you must manually end it when completed.

pgifford
Contributor

Hi,

For redundancy we designed the following:

vmnic0 --> vmnic3 (Service Console)

vmnic1 --> vmnic4 (VMnetwork)

vmnic2 (VMotion)

vmnic5 (unused at this time)
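
If we do switch to the pairing you suggested, my understanding is the change from this layout would be roughly the following on the service console (treat this as a sketch; vSwitch0/1/2 are placeholders for the Service Console, VMnetwork and VMotion switches):

esxcfg-vswitch -U vmnic3 vSwitch0   # take vmnic3 off the Service Console vSwitch
esxcfg-vswitch -U vmnic4 vSwitch1   # take vmnic4 off the VMnetwork vSwitch
esxcfg-vswitch -U vmnic2 vSwitch2   # take vmnic2 off the VMotion vSwitch
esxcfg-vswitch -L vmnic2 vSwitch0   # Service Console: vmnic0 + vmnic2
esxcfg-vswitch -L vmnic3 vSwitch1   # VMnetwork: vmnic1 + vmnic3
esxcfg-vswitch -L vmnic4 vSwitch2   # VMotion: vmnic4 + vmnic5
esxcfg-vswitch -L vmnic5 vSwitch2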

Paul Gifford Virtualization Practice Lead | Mainland Information Systems Ltd.