Just started doing this after fully patching the hosts.
This is what gets written in the vmkernel log
Apr 18 12:59:51 ilmgmtblade03 vmkernel: 1:02:28:11.295 cpu3:1057)Migrate: vm 1058: 6849: Setting migration info ts = 3509083115, src ip = <192.168.3.3> dest
ip = <192.168.3.5> Dest wid = 1071
Apr 18 12:59:51 ilmgmtblade03 vmkernel: 1:02:28:11.295 cpu3:1057)World: vm 1103: 693: Starting world migSendHelper-1058 with flags 1
Apr 18 12:59:51 ilmgmtblade03 vmkernel: 1:02:28:11.295 cpu3:1057)World: vm 1104: 693: Starting world migRecvHelper-1058 with flags 1
Apr 18 13:01:06 ilmgmtblade03 vmkernel: 1:02:29:26.300 cpu0:1103)WARNING: MigrateNet: 285: 3509083115: Connect to <192.168.3.5>:8000 failed: Timeout
Apr 18 13:01:06 ilmgmtblade03 vmkernel: 1:02:29:26.301 cpu0:1103)WARNING: Migrate: 1153: 3509083115: Failed: Timeout (0xbad0020) @0x8bd968
Apr 18 13:01:06 ilmgmtblade03 vmkernel: 1:02:29:26.301 cpu0:1103)World: vm 1103: 3867: Killing self with status=0xbad0020:Timeout
Apr 18 13:01:06 ilmgmtblade03 vmkernel: 1:02:29:26.301 cpu1:1104)World: vm 1104: 3867: Killing self with status=0x0:Success
It times out. Something definitely changed after patching. I am also having a problem where I lose connectivity when I reboot. I have to go in, remove the second NIC from the vSwitch (esxcfg-vswitch -U vmnic1 vSwitch), remove the VMotion port group, bounce the server, add vmnic1 back, and then recreate the VMotion interface.
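For anyone who wants to script that reboot workaround, here is a rough sketch of the equivalent service-console sequence. The vSwitch name (vSwitch0), port group name (VMotion), and IP/netmask are placeholders from my setup; substitute your own.

```shell
# Unlink the second uplink and drop the VMotion port group
# (vSwitch0 / VMotion / the IP are placeholders -- adjust to your config)
esxcfg-vswitch -U vmnic1 vSwitch0
esxcfg-vswitch -D VMotion vSwitch0

# ... reboot the host here ...

# Re-link the uplink and recreate the VMotion port group and vmkernel NIC
esxcfg-vswitch -L vmnic1 vSwitch0
esxcfg-vswitch -A VMotion vSwitch0
esxcfg-vmknic -a -i 192.168.3.3 -n 255.255.255.0 VMotion
```

You may still need to re-tick the "Enable VMotion" box on the vmkernel NIC from the VI Client afterwards.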
ESX 3.0.1 and VC 2.0.1.
I was looking at patching all my hosts, but hell no, I am not now. Patching is done via an IIS repository and done in sequence. The host I am troubleshooting right now is from scratch; I just installed the OS. It's a blade, so only two NICs. Currently they're teamed and all port groups are on the same multi-NIC vSwitch. I know, I should have the VMotion interface segmented. Tried that. That's the first time I saw the 10% issue, and I thought it was the configuration I had done. But all my hosts are having trouble with VMotion after patching.
Any advice? Jeez what a mesxs.
Sounds like mine are set up the same. Separate vSwitches for the service console, vmkernel, and VM network, with consistent vmnics assigned to each, e.g. vmkernel always on vmnic2. It "shouldn't" make a difference anyway (but you never know...)
We are having the same problem. Originally we thought this was an issue with a single server in the 2-host cluster that we were running. However, we have just removed the 2 original hosts and added 4 new hosts. We came in this morning and suddenly we are having the same issues. All hosts are affected. In addition to the 10% VMotion problem, everything that we do on a guest machine fails, e.g. if we try to change the config of a machine, it also times out after 15 minutes. We are fully patched, as we patched all the servers when we put them in 2 weeks ago. This includes the patch mentioned above.
Just an idea, but make sure your VMotion NIC is not conflicting with a COS/VMkernel IRQ (vector sharing):
cat /proc/vmware/pci
or
cat /proc/vmware/interrupts
Also, are your BIOS and ROMPaqs up to date?
I know the workaround is to edit the VMotion vSwitch properties, say, click around in the VMkernel gateway settings... that seems to kick-start it.
I got the same problem here today.
I have 4 ESX 3.0.1 hosts on blades with 2 physical NICs teamed, with Service Console, VMotion, and VMs on one vSwitch. VC 2.0.1 Patch 2.
All was running well before I applied the last 3 patches from July 2007 on one host. Any VMotion from the other hosts to that updated host failed at 10% with "operation timed out".
To get it working, I removed one physical NIC from the vSwitch and clicked OK, then put the physical NIC back in the vSwitch and clicked OK. It started working right after that.
I hope VMware will find a real solution to that problem.
We have the same problem here. Couldn't fix it.
I hope there will be a patch soon.
Thanks to Oli L for the workaround!!
It's doing my head in. I have the same problem and nothing will kick VMotion off. HA works a treat. I can vmkping and ping each of the vmkernel IPs. I will keep trying, but I might be doing a reinstall from scratch, as this is in a test lab. I have been on the VI3 Install and Configure course, so I'm not a newbie to VMware, and I still cannot get it to work.
Had same issue, tried this and worked for us as well.
Thanks for the post
What a lifesaver!!!
I've been searching the forums for a solution to this issue (one of many). I'm upgrading from 3.0.1 to 3.5.
The first host I tried to upgrade died with a GRUB message, not a prompt, and I ended up doing a complete re-install. That wasn't too bad, but I could not use VMotion to proceed with the upgrades on my other hosts.
After 6 hours of trying everything else, I removed the NIC from the VMkernel switch, closed the dialog box, and then added the same NIC back. It worked for me.
Thanks again.
Same issue here. I've got about 30 ESX 3.0.1 hosts (with the minimal patches required to connect to VC 2.5).
Running VC 2.5, and VMotioning between the 3.0.1 and 3.5 hosts dies at 10% as well. I haven't tested the workarounds mentioned here; I will do that later.
I had the same issue here. I installed 3.5 on a server, and when VMotioning from 3.0.1 to 3.5, VMotion failed at 10%. Removing the NIC from the VMotion vSwitch and adding it again solved the issue for me.
My problem was solved by doing a vmkping between the VMotion IPs.
It remains madness, however, that this must be done to solve the problem.
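For reference, a quick way to check VMotion connectivity from the service console. The IP below is just the destination address from the log earlier in the thread; substitute your own VMotion addresses.

```shell
# List the vmkernel NICs and confirm the VMotion interface and its IP
esxcfg-vmknic -l

# Ping the remote host's VMotion address through the vmkernel stack
# (a plain COS ping uses the service console stack, not the vmkernel one)
vmkping 192.168.3.5
```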
Found another solution for this problem. At our most recent deployment, the VMotion VLAN had been created on a Cisco 3750, and although we don't know whether the network guy did this knowingly or not, the bottom line was that a Cisco management interface had been created with an IP of X.X.X.1, the same address as the vmkernel address assigned to the first host. We found it by clicking on the bubble next to the virtual switch and noticing that the Cisco management address showed up on the vmkernel switch of all three ESX hosts... that got us looking closer.
We're using ESX 3.5 and VC 2.5; we originally thought patching was to blame, but we were able to duplicate it on a non-patched host.
We changed the IP address on host1 and VMotion worked cleanly from then on.
It's important to note that removing the vSwitch and reinstalling it seemed to fix the problem in the short term (i.e. VMotion would work)... but if left alone without traffic, the problem would come back. (Thought I'd add this as I read other posts with a similar gist.)
In retrospect... shutting down the bad interface and vmkping'ing it should have uncovered the duplicate IP on the VMotion subnet.
Credit goes to VMware support for chasing this down.
Cheers!
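Building on the duplicate-IP tip: a rough way to confirm a duplicate on the VMotion subnet is to remove the vmkernel NIC so the host no longer answers for its address, then see whether anything else does. The port group name (VMotion) and addresses below are placeholders.

```shell
# On the suspect host: delete the vmkernel NIC so its IP goes unanswered
esxcfg-vmknic -d VMotion

# From ANOTHER host: if this still gets replies, something else owns the IP
vmkping 192.168.3.3

# Recreate the vmkernel NIC on the suspect host afterwards
esxcfg-vmknic -a -i 192.168.3.3 -n 255.255.255.0 VMotion
```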
Just wanted to add some insight I had when battling this problem. I had the comms team check the ARP table for all my servers' VMotion NIC IPs. All the servers I was having a problem with did not show up in the table. We fixed it by opening the network card config and closing it again, but if there were an easier way to make the VMotion NIC update its ARP record, that would seem like a fix. We seem to have it on all our 3.0.2 machines (I need to double-check patch levels) but don't have this on our 3.5 servers.
Just in case this helps anyone.
Jeff
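The "open the config and close it again" nudge Jeff describes can also be done from the service console: deleting and immediately re-adding the vmkernel NIC forces it to re-announce its IP (a gratuitous ARP), which should repopulate the switch's ARP table. The port group name and IP/netmask below are placeholders.

```shell
# Delete and immediately recreate the VMotion vmkernel NIC
# so it re-announces its address on the wire
esxcfg-vmknic -d VMotion
esxcfg-vmknic -a -i 192.168.3.3 -n 255.255.255.0 VMotion
```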
We have been having similar problems. We have one operational environment on servers with ESX 2.5.4 and a testing environment using 3.0.1. I have looked at all that has been said in this thread, including the last post from Jeffko77. Has there been any update to this VMotion NIC/ARP table update issue? (Our network guys say there is nothing they can do... I am not so sure.)
I've been having similar issues; doing some research, I found this article.
Sure enough, all 5 of my ESX 3.0.1 boxes are Dell 6650s with the Broadcom chipset; I'd be interested in how many of your servers have the same chipset. VMware support basically had no answer other than "use another brand of NIC". A quick fix for me is to simply pull the patch cable from the NIC for a second and then pop it back in. The NIC instantly comes back up...
I find it rather irritating that VMware is unable to work with Broadcom to resolve this issue; after all, the server (and its internal NICs) are on the qualified hardware list.
Funny thing happened last night. Out of the blue, my VMotions started failing at 10%. Sure enough, I am using the onboard NIC, which uses the Broadcom chipset. I run an almost fully patched 3.0.2 farm; the only patches missing are the ones that came out in the last two weeks. Never had this problem before. What I did find is that a LOW PRIORITY VMotion works fine. Strange.
adam
Abaum
Spot on, my friend!!! My VMotion works with low priority too. I would still like to talk to VMware Support.
Thanks for your help
Ajay
Hi All
Now, I don't know, but today, after a couple of error messages, VMotion has started to work. I did not apply any fix; however, I was using VMotion at low priority. I was thinking a lot about this issue, reading, thinking of reporting it to support; yes, that's all I did to fix it. Anyway, it is working...
Ajay
What error do you get when trying to VMotion?
Hi Weinstein5
1) A specified parameter was not correct.
2) A general system error occurred: failed to initialize migration at destination. Error 0xbad00a3. VMotion failed to start due to lack of CPU or memory resources.
The funny thing was that I can migrate at low priority but not at high.
Cheers
Ajay
