VMware Cloud Community
ktwebb68
Contributor

VMotion fails at 10%

Just started doing this after fully patching the hosts.

This is what gets written in the vmkernel log

Apr 18 12:59:51 ilmgmtblade03 vmkernel: 1:02:28:11.295 cpu3:1057)Migrate: vm 1058: 6849: Setting migration info ts = 3509083115, src ip = <192.168.3.3> dest ip = <192.168.3.5> Dest wid = 1071

Apr 18 12:59:51 ilmgmtblade03 vmkernel: 1:02:28:11.295 cpu3:1057)World: vm 1103: 693: Starting world migSendHelper-1058 with flags 1

Apr 18 12:59:51 ilmgmtblade03 vmkernel: 1:02:28:11.295 cpu3:1057)World: vm 1104: 693: Starting world migRecvHelper-1058 with flags 1

Apr 18 13:01:06 ilmgmtblade03 vmkernel: 1:02:29:26.300 cpu0:1103)WARNING: MigrateNet: 285: 3509083115: Connect to <192.168.3.5>:8000 failed: Timeout

Apr 18 13:01:06 ilmgmtblade03 vmkernel: 1:02:29:26.301 cpu0:1103)WARNING: Migrate: 1153: 3509083115: Failed: Timeout (0xbad0020) @0x8bd968

Apr 18 13:01:06 ilmgmtblade03 vmkernel: 1:02:29:26.301 cpu0:1103)World: vm 1103: 3867: Killing self with status=0xbad0020:Timeout

Apr 18 13:01:06 ilmgmtblade03 vmkernel: 1:02:29:26.301 cpu1:1104)World: vm 1104: 3867: Killing self with status=0x0:Success
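For anyone comparing logs across several hosts, here is a minimal Python sketch that pulls the failed connection attempts out of a vmkernel log copied to a workstation. The regular expression is modelled only on the `MigrateNet` WARNING line quoted above; the function name `find_connect_failures` is made up for illustration, and other ESX builds may word the message differently.

```python
import re

# Pattern modelled on the single WARNING line in the excerpt above.
# Assumption: other ESX builds may format this message differently.
CONNECT_FAILED = re.compile(
    r"MigrateNet: \d+: (?P<migration_id>\d+): "
    r"Connect to <(?P<dest_ip>[\d.]+)>:(?P<port>\d+) failed: (?P<reason>\w+)"
)


def find_connect_failures(log_text):
    """Return one dict per failed VMotion connection attempt in log_text."""
    failures = []
    for line in log_text.splitlines():
        match = CONNECT_FAILED.search(line)
        if match:
            failures.append({
                "migration_id": match.group("migration_id"),
                "dest_ip": match.group("dest_ip"),
                "port": int(match.group("port")),
                "reason": match.group("reason"),
            })
    return failures


# Two of the lines from the excerpt above, used as sample input.
sample = (
    "Apr 18 13:01:06 ilmgmtblade03 vmkernel: 1:02:29:26.300 cpu0:1103)"
    "WARNING: MigrateNet: 285: 3509083115: Connect to <192.168.3.5>:8000 failed: Timeout\n"
    "Apr 18 13:01:06 ilmgmtblade03 vmkernel: 1:02:29:26.301 cpu0:1103)"
    "WARNING: Migrate: 1153: 3509083115: Failed: Timeout (0xbad0020) @0x8bd968\n"
)

failures = find_connect_failures(sample)
for failure in failures:
    print(failure["dest_ip"], failure["port"], failure["reason"])
```

Grouping the results by `dest_ip` shows quickly whether every failure points at the same destination VMkernel address.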

Times out. Something definitely changed after patching. I am also having a problem where I lose connectivity when I reboot. I have to go in, remove the second NIC from the vswitch (esxcfg-vswitch vSwitch -U vmnic1) and remove the VMotion portgroup, bounce the server, add the vmnic1 back and then recreate the VMotion interface.

ESX 3.0.1 and VC 2.0.1.

I was looking at patching all my hosts, but hell no, I am not now. Patching is done via an IIS repository and done in sequence. The host I am troubleshooting right now was built from scratch; I just installed the OS. It's a blade, so only two NICs. Currently they're teamed and all portgroups are on the same multi-NIC vSwitch. I know, I should have the VMotion interface segmented. Tried that. That's the first time I saw the 10% issue, and I thought it was the configuration I had done. But all my hosts are having trouble with VMotion after patching.

Any advice? Jeez what a mesxs.

41 Replies
rigathor
Contributor

Sounds like mine are set up the same. Separate vSwitch for service console, VMkernel, and VM network, with consistent vmnics assigned to each, e.g. VMkernel is always vmnic2. It "shouldn't" make a difference anyway (but you never know...).

Lee_Sargeant
Enthusiast

We are having the same problem. Originally we thought this was an issue with a single server in the 2-host cluster that we were running. However, we have just removed the 2 original hosts and added 4 new hosts. We came in this morning and suddenly we are having the same issues. All hosts are affected. In addition to the 10% VMotion problem, everything that we do on a guest machine fails, e.g. if we try to change the config of a machine, this also times out after 15 minutes. We are fully patched, as we patched all the servers when we put them in 2 weeks ago. This includes the patch mentioned above.

Oli_L
Enthusiast

Just an idea, but make sure your VMotion NIC is not caught in an IRQ / vector share between the COS and the VMkernel.

cat /proc/vmware/pci

or

cat /proc/vmware/interrupts

Also, are your BIOS and ROMPaqs up to date?

I know the workaround is to edit the VMotion vSwitch properties, say click around in the VMkernel gateway settings... that seems to kick-start it.

TiBoReR
Enthusiast

I got the same problem here today.

I have 4 ESX 3.0.1 on blades with 2 physical nics teamed with Service Console, VMotion and VMs on 1 vSwitch. VC 2.01 Patch 2.

All was running well before I applied the last 3 patches from July 2007 on 1 host. Any VMotion from the other hosts to that updated host was failing at 10% with "operation timed out".

To get it working, I removed one physical nic from the vSwitch and clicked okay then put the physical nic again in the vSwitch and clicked okay. Started to work right after that.

I hope VMware will find a real solution to that problem.

matzoni1
Contributor

We have the same problem here. Couldn't fix it.

I hope there will be a patch soon.

THX to Oli L for the workaround!!

flanster
Contributor

It's doing my head in. I have the same problem and nothing will kick VMotion off. HA works a treat. I can vmkping and ping all of the VMkernel IPs. I will keep trying, but I might do a reinstall from scratch as this is in a test lab. I have been on the VI3 Install and Configure course, so I'm not a newbie to VM, and I still cannot get it to work.

mforbes
Enthusiast

Had same issue, tried this and worked for us as well.

Thanks for the post

Mike Forbes
Allsopp
Contributor

What a lifesaver!!!

I've been searching the forums for a solution to this (one of many) issue. I'm upgrading from 3.01 to 3.5.

The first host I tried to upgrade died with a GRUB message, not a prompt and I ended up doing a complete re-install. That wasn't too bad, but I could not use vmotion to proceed with the upgrades on my other hosts.

After 6 hours of trying everything else, I removed the NIC from the VMkernel switch, closed the dialog box, and then added the same NIC. It worked for me.

Thanks again.

VirtualKenneth
Virtuoso

Same issue here. I've got about 30 ESX 3.0.1 hosts (with the minimal patches required to connect to VC 2.5).

Running VC 2.5, and while VMotioning between the 3.0.1 and the 3.5 hosts it dies at 10% as well. I haven't tested the workarounds mentioned here; will do that later.

ZakL
Contributor

I had the same issue here. I installed 3.5 on a server, and when VMotioning from 3.0.1 to 3.5, VMotion failed at 10%. Removing the NIC from the VMotion vSwitch and adding it again solved the issue for me.

VirtualKenneth
Virtuoso

My problem was solved by doing a vmkping between the VMotion IPs.

It remains madness, however, that this must be done to solve the problem.
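With more than two hosts it is easy to miss a pair, so here is a small Python sketch that lists which `vmkping` commands to run on each host so that every VMotion interface is tested against every other one. It only prints the commands; it does not run anything. The host names are hypothetical, the two addresses are the ones from the log earlier in the thread, and `vmkping_plan` is a made-up helper name.

```python
from itertools import permutations


def vmkping_plan(vmotion_ips):
    """Map each host to the vmkping commands it should run against the others.

    vmotion_ips: dict of host name -> VMotion (VMkernel) IP address.
    """
    plan = {}
    for src, dst in permutations(sorted(vmotion_ips), 2):
        plan.setdefault(src, []).append("vmkping " + vmotion_ips[dst])
    return plan


# Host names are hypothetical; the addresses come from the log in the first post.
hosts = {"esx-a": "192.168.3.3", "esx-b": "192.168.3.5"}

for host, commands in vmkping_plan(hosts).items():
    print("on " + host + ":")
    for command in commands:
        print("  " + command)
```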

tharpy
Contributor

Found another solution for this problem. At our most recent deployment, the VMotion VLAN had been created on a Cisco 3750, and although we don't know if the network dude did this knowingly or not, the bottom line was that a Cisco management interface had been created with an IP of X.X.X.1, the same address as the VMkernel address assigned to the first host. We found it by clicking on the bubble next to the virtual switch and noticing that the Cisco management address showed up on the VMkernel switch of all three ESX hosts... that got us looking closer.

We're using ESX 3.5 and VC 2.5. We originally thought patching was to blame, but were able to duplicate it on a non-patched host.

Changed the IP address on host1 and VMotion worked cleanly from then on.

It's important to note that removing the vSwitch and reinstalling it seemed to fix the problem in the short term (i.e. VMotion would work)... but if left alone without traffic, the problem would come back. (Thought I'd add this as I read other posts with a similar gist.)

In retrospect... shutting down the bad interface and vmkping'ing it should have uncovered the duplicate IP on the VMotion subnet.

....credit goes to VMware support for chasing this down.
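A duplicate address like the one described in this post tends to show up as one IP answering with more than one MAC address. Below is a minimal Python sketch that flags such IPs from a list of (IP, MAC) observations collected however you like (switch ARP tables, host ARP caches, packet captures). The function name and all MAC addresses are hypothetical sample data, not values from this thread.

```python
from collections import defaultdict


def find_duplicate_ips(observations):
    """Return {ip: sorted MACs} for every IP seen with more than one MAC.

    observations: iterable of (ip, mac) string pairs.
    """
    macs_by_ip = defaultdict(set)
    for ip, mac in observations:
        macs_by_ip[ip].add(mac.lower())  # normalise case before comparing
    return {ip: sorted(macs) for ip, macs in macs_by_ip.items() if len(macs) > 1}


# Hypothetical sample data: the .1 address answers from two different MACs.
observations = [
    ("192.168.3.1", "00:50:56:aa:bb:01"),
    ("192.168.3.1", "00:1B:54:CC:DD:02"),
    ("192.168.3.3", "00:50:56:aa:bb:03"),
    ("192.168.3.3", "00:50:56:AA:BB:03"),  # same MAC, different case
]

for ip, macs in find_duplicate_ips(observations).items():
    print(ip, "is claimed by", ", ".join(macs))
```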

Cheers!

jeffko
Contributor

Just wanted to add some insight I had when battling this problem. I had the comm team check the ARP table for all my servers' VMotion NIC IPs. None of the servers I was having a problem with showed up in the table. We fixed it by opening the network card config and closing it again, but if there were an easier way to make the VMotion NIC update its ARP record, that seems like it would be a fix. We seem to have this on all our 3.0.2 machines (I need to double check patch levels) but don't have it on our 3.5 servers.
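The ARP check described here can be repeated quickly if the network team can paste you a text dump of the switch's ARP table. This is a rough Python sketch that reports which expected VMotion IPs do not appear anywhere in that dump. It deliberately ignores the dump's column layout and just extracts anything that looks like an IPv4 address. The sample dump line, its MAC address, and the VLAN name are invented for illustration.

```python
import re

# Matches dotted-quad strings; it does not validate that each octet is <= 255.
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def missing_from_arp(expected_ips, arp_dump):
    """Return the expected IPs that never appear in the ARP table text."""
    seen = set(IPV4.findall(arp_dump))
    return sorted(set(expected_ips) - seen)


# Hypothetical dump: only one of the two VMotion addresses has an entry.
arp_dump = "Internet  192.168.3.3  12  0050.56aa.bb03  ARPA  Vlan30\n"
vmotion_ips = ["192.168.3.3", "192.168.3.5"]

print(missing_from_arp(vmotion_ips, arp_dump))
```

Any address this reports is a candidate for the "open and close the NIC config" workaround mentioned above.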

Just in case this helps anyone.

Jeff

rookie_c
Contributor

We have been having similar problems. We have one operational environment on servers with ESX 2.5.4 and a testing environment using 3.0.1. We have looked at everything said on this thread, including the last post from Jeffko77. Has there been any update on this VMotion NIC/ARP table update issue? (Our network guys say there is nothing they can do... I am not so sure.)

nbcasey
Contributor

I've been having similar issues. Doing some research, I found this article

Sure enough, all 5 of my ESX 3.0.1 boxes are Dell 6650s with the Broadcom chipset. I'd be interested in how many of your servers have the same chipset. VMware support basically had no other answer than to say "use another brand NIC". A quick fix for me is to simply pull the patch cable from the NIC for a second, then pop it back in. The NIC instantly comes back up.....

I find it rather irritating that VMware is unable to work with Broadcom to resolve this issue; after all, the server (and its internal NICs) is on the qualified hardware list.

abaum
Hot Shot

Funny thing happened last night. Out of the blue, my VMotions started failing at 10%. Sure enough, I am using the onboard NIC, which uses the Broadcom chipset. I run an almost fully patched 3.0.2 farm. The only patches missing are the ones that came out in the last two weeks. Never had this problem before. What I did find is that a LOW PRIORITY VMotion works fine. Strange.

adam

Ajay_Nabh
Enthusiast

Abaum

Spot on, my friend!!! My VMotion works with low priority too. I would still like to talk to VMware Support.

Thanks for your help

Ajay

Ajay_Nabh
Enthusiast

Hi All

Now I don't know why, but today, after a couple of error messages, VMotion has started to work. I did not apply any fix; however, I was using VMotion at low priority. I was thinking a lot about this issue, reading, and thinking of reporting it to support; yes, that is all I did to fix it. Anyway, it is working...

Ajay

weinstein5
Immortal

What error do you get when trying to VMotion?

If you find this or any other answer useful please consider awarding points by marking the answer correct or helpful
Ajay_Nabh
Enthusiast

Hi Weinstein5

1) A specified parameter was not correct.

2) A general system error occurred: Failed to initialize migration at destination. Error 0xbad00a3. VMotion failed to start due to lack of cpu or memory resources.

The funny thing was that I can migrate on low priority but not on high.

Cheers

Ajay
