VMware Cloud Community
STTH
Enthusiast

ESX 3.5.0 Update 3 Build 123630 - Bug causing HA to restart VMs after VMotion completes

Since we recently upgraded our ESX hosts with the latest release (ESX 3.5.0 Update 3 Build 123630), HA does not work properly after VMotion completes.

When manually moving a running production VM from one host to another, it apparently loses the heartbeat, which is bad because it causes HA to restart the VM a few minutes after the migration, even though it is running fine.

What is especially strange is that VMotion itself appears to work fine, and RDP sessions to the VMs throughout the VMotion process show no loss of network whatsoever; only the heartbeat apparently gets lost.

The same thing happens when DRS initiates migrations automatically. The VM gets moved but connectivity is lost soon thereafter (apparently heartbeat loss), causing HA to restart the VM.

Again this is bad because it inevitably causes guest filesystem corruption and possible data loss on the VM and perturbs production significantly during business hours.

Nothing has changed in the VirtualCenter (2.5.0 Build 119598). Only the ESX hosts have been upgraded (and restarted) to Update 3 (last week after VMware announced it was GA).

Like after every major update, HA and DRS were disabled and re-enabled, but it did not help. All ESX hosts were restarted but it did not help either.

Is anybody out there having the same problem? Is this a bug that came with this latest release? Update 2 introduced major bugs (the license timebomb), so anything is possible...

For now, we have disabled DRS because it would cause VMs to go down constantly after they are automatically migrated (load balancing)... but we really need DRS back as soon as possible.

Help!

18 Replies
gdesmo
Enthusiast

You don't have the upgrade tools option checked on your VMs, do you?

http://communities.vmware.com/message/1062122#1062122

STTH
Enthusiast

Thank you. I checked your post and compared it with our environment. None of our VMs have that option checked. So you believe that if that option were checked, it could have caused our issue, like it did when upgrading ESX to Update 2? By the way, what is the purpose of this option?

gdesmo
Enthusiast

This option will automatically update your VM Tools after a reboot if they are "out of date". A second reboot is then done once the upgrade has completed.

STTH
Enthusiast

Interesting... It sounds like that option would be beneficial, but we do not have it checked.

After what kind of reboot will it automatically update the VM Tools? Any reboot? Including those initiated by the Windows Task Scheduler in the Windows guest OS, or by an admin restarting the VM manually?

If it does not take long for the VM Tools to install, then it should not be a problem to use that option, as long as reboots are scheduled in large enough windows of time during which the systems can be taken out of production, correct?

So in the end, what is the relationship between this option and the reboots caused by HA not picking up the heartbeat properly?

gdesmo
Enthusiast

Upon any startup the tools state will be checked. If the tools are out of date, they will be upgraded and the VM rebooted. You lose some control but gain a way to keep your VMs' tools more up to date.

There is currently a bug if you have this option checked. During a VMotion, if your tools are out of date, they could be upgraded and the VM rebooted. Not nice!

VMware says they are working on a patch to be released soon.

STTH
Enthusiast

While this is good to know, unfortunately it does not specifically apply to our problem, because this option has always been unchecked, so we have not been using that feature.

We have upgraded all the guests to the latest VM Tools, and when we migrated some VMs within the cluster, HA still caused them to reboot soon after. Any idea why that is?

markus_herbert
Enthusiast

I had trouble with HA (also with U2) until I separated the VMotion network from the Service Console network.

I often got an HA agent error during VMotion operations.

After separating these two networks, I no longer had problems. I now have two NICs per ESX host and two separate Ethernet switches.

What's your network configuration?

RS_1
Enthusiast

actixsupport
Contributor

I can confirm this is definitely a U3 bug. I have suspected something funny for the last 2 weeks or so; tonight I saw it happen, and this post confirms it.

I was running 4 hosts with DRS and HA without a problem, then upgraded 2 of them to U3 for a while and did notice HA kick in, but it was happening intermittently, so it was too hard to track.

I upgraded all of them to U3 last weekend. All VMs have the latest VM Tools, and it is only happening to the ones with the Upgrade Tools option ticked.

What a bloody rubbish bug! Anyone know if there's a fix?

atbnet
Expert

I updated our ESX 3.5 servers to U3 today and tested VMotion, and it appears to work fine. However, the VMs are not set to upgrade VMware Tools. I will test that tomorrow, see if that's the cause, and let you know.

Andy Barnes
VCP / VCA-DT / MCITP:EA / CCIA
Help, Guides and How Tos... www.VMadmin.co.uk

If you found this information useful please award points using the buttons at the top of the page accordingly.

carrmic
Contributor

I can confirm the same problem here. It has been happening since the very first host in my cluster was updated. I haven't been able to confirm specifically that it's the HA feature. What logs should I look at to find out if HA is restarting a VM? There is nothing in VirtualCenter indicating a restart that I can see. I'd like to diagnose this further and open a support ticket. Any tips for checking logs?

djuengst
Contributor

Hello, I see this when "virtual machine monitoring" is enabled in the cluster HA settings, but with this feature disabled the VMs stay online without a reset after a simple VMotion or after putting the host into "enter maintenance mode".

carrmic
Contributor

I can confirm djuengst's information. If I disable "virtual machine monitoring" before I VMotion, I do not have any trouble. DRS has not kicked in for me yet since I have plenty of capacity right now. Has anyone opened a ticket or gotten a hotfix for this issue?

actixsupport
Contributor

Yep, I submitted a ticket last week and they're working on a patch. I will post here when VMware gets back to me.

Ray

STTH
Enthusiast

I can confirm it is a bug, and VMware has known about it for about a month. They said we should disable "VM Monitoring", which is the particular component that has the problem (legacy ESX host monitoring works fine).

VMware introduced VM Monitoring as an option within HA in Update 2, so it is a fairly new feature. The problem is that they forced this option on instead of requiring admins to activate it manually. When Update 2 was released, most people did not notice that it had been implemented because it worked fine in the background. Unfortunately, Update 3 brought this new bug, which prompted many to investigate why VMs were hot rebooting.

I am guessing that it will probably be fixed in the next update. For now, VMware said we should disable VM Monitoring, which sets HA functionality back to the way it was before Update 2 (with only ESX host monitoring).

That said, it is a pretty bad bug, since it can quickly cost the guest its filesystem, registry or database consistency (especially after multiple reboots), and it also severely impacts production usage. So I am not sure why VMware did not inform their customers sooner and more openly, or at least hurry to release a VUM patch for it.
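To make the failure mode a bit more concrete, here is a toy Python model of heartbeat-based monitoring. It is only a sketch of the general idea (reset a VM whose last heartbeat is older than some interval) and not VMware's actual implementation; the function name `should_reset_vm` and the 30-second interval are assumptions made up for illustration.

```python
from typing import Optional


def should_reset_vm(last_heartbeat: Optional[float],
                    now: float,
                    failure_interval: float = 30.0) -> bool:
    """Toy model: decide to reset when no heartbeat has been seen
    for longer than failure_interval seconds.

    All names and the default interval are hypothetical.
    """
    if last_heartbeat is None:
        # No heartbeat has ever been recorded, so there is nothing to compare.
        return False
    return (now - last_heartbeat) > failure_interval


# The last heartbeat was recorded at t=100 s, just before a migration.
print(should_reset_vm(last_heartbeat=100.0, now=110.0))  # False
print(should_reset_vm(last_heartbeat=100.0, now=140.0))  # True
```

In a model like this, a guest that is running fine but whose heartbeats stop being recorded after a migration looks exactly like a hung guest, which would match the symptom described in this thread.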

carrmic
Contributor

Great news. I don't think I'll bother opening a new ticket since they are aware of the situation.

Allowencer
Contributor

actixsupport, just curious if you have heard anything back about this issue. It's been 2 weeks since you last posted, so I was wondering whether you have received an update yet.

Thanks,

-Eric

NTurnbull
Expert

FYI, VMware KB1007899

Thanks,

Neil
