VMware Cloud Community
mcowger
Immortal

vMotion Fails At 14%

Hi Everyone,

Having a bit of a strange issue with vMotion today.  It's been working in my lab for months, and was still working fine after I upgraded to 5.1 last week.

Now, today, for whatever reason, it just refuses to progress beyond 14%, failing with a timeout.  Here's the log snippet:

2012-09-18T23:52:14.117Z| vmx| I120: VMXVmdbCbVmVmxMigrate: Got SET callback for /vm/#_VMX/vmx/migrateState/cmd/##1_6438/op/=to
2012-09-18T23:52:14.117Z| vmx| I120: Could not identify IP address family of in/srcLogIp:
2012-09-18T23:52:14.117Z| vmx| I120: Could not identify IP address family of in/dstLogIp:
2012-09-18T23:52:14.117Z| vmx| I120: VMXVmdbVmVmxMigrateGetParam: type: 1 srcIp=<10.5.132.60> dstIp=<10.5.132.61> mid=4ca02aa12688e uuid=4c4c4544-004c-4a10-8031-c8c04f4c4b31 priority=high checksumMemory=no maxDowntime=0 encrypted=0 resumeDuringPageIn=no latencyAware=yes diskOpFile=
2012-09-18T23:52:14.117Z| vmx| I120: VMXVmdbVmVmxMigrateGetParam: type 1 unsharedSwap 0 memMinToTransfer 0 cpuMinToTransfer 0 numDisks 0 numStreamIps 1
2012-09-18T23:52:14.117Z| vmx| I120: Received migrate 'to' request for mid id 1348012698921102, src ip <10.5.132.60>, dst ip <10.5.132.61>(invalidate source config).
2012-09-18T23:52:14.117Z| vmx| I120: SVGA: Maximum display topology 2560x1600.
2012-09-18T23:52:14.120Z| vmx| I120: MigrateSetInfo: state=1 srcIp=<10.5.132.60> dstIp=<10.5.132.61> mid=1348012698921102 uuid=4c4c4544-004c-4a10-8031-c8c04f4c4b31 priority=high
2012-09-18T23:52:14.120Z| vmx| I120: MigrateSetState: Transitioning from state 0 to 1.
2012-09-18T23:52:14.120Z| vmx| I120: VMXVmdb_SetMigrationHostLogState: hostlog state transits to emigrating for migrate 'to' mid 1348012698921102
2012-09-18T23:53:44.117Z| vmx| I120: VMXVmdb_SetMigrationHostLogState: hostlog state transits to failure for migrate 'to' mid 1348012698921102
2012-09-18T23:53:44.121Z| vmx| I120: MigrateSetStateFinished: type=1 new state=5
2012-09-18T23:53:44.121Z| vmx| I120: MigrateSetState: Transitioning from state 1 to 5.
2012-09-18T23:53:44.121Z| vmx| I120: Migrate_SetFailureMsgList: switching to new log file.
2012-09-18T23:53:44.122Z| vmx| I120: Migrate_SetFailureMsgList: Now in new log file.
2012-09-18T23:53:44.139Z| vmx| I120: [msg.migrate.expired] Timed out waiting for migration start request.
2012-09-18T23:53:44.139Z| vmx| I120: Migrate: cleaning up migration state.
2012-09-18T23:53:44.139Z| vmx| I120: MigrateSetState: Transitioning from state 5 to 0.

I can vmkping between the relevant interfaces just fine:

# vmkping 10.5.132.61
PING 10.5.132.61 (10.5.132.61): 56 data bytes
64 bytes from 10.5.132.61: icmp_seq=0 ttl=64 time=0.131 ms
64 bytes from 10.5.132.61: icmp_seq=1 ttl=64 time=0.162 ms
64 bytes from 10.5.132.61: icmp_seq=2 ttl=64 time=0.122 ms

--- 10.5.132.61 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.122/0.138/0.162 ms

I've tried restarting vCenter itself, with no luck.  I've tried multiple different VMs, and various combinations of my 4 hosts, and none of them work.

I just don't know where else to go. Ideas?

Edit: Other things I've confirmed: time sync is good, disk free space is good, and forward and reverse name resolution are good.
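One more check worth running from each host (a sketch, not from the original post; vmk1 and the jumbo-frame size below are assumptions): a plain vmkping picks an interface based on the vmkernel routing table, so forcing the vMotion vmknic explicitly, and testing the full frame size if jumbo frames are in play, rules out a few more variables:

# esxcli network ip interface ipv4 get
(confirm which vmk carries the vMotion address, e.g. vmk1)
# vmkping -I vmk1 10.5.132.61
(forces the ping out of the vMotion vmknic rather than whatever the routing table picks)
# vmkping -I vmk1 -d -s 8972 10.5.132.61
(only relevant with a 9000-byte MTU; -d sets don't-fragment, 8972 leaves room for headers)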

--Matt VCDX #52 blog.cowger.us
34 Replies
frankdenneman
Expert

Hi Roberto,

Thank you for this feedback.

1. We have been discussing support for routable vMotion internally, and official product feature requests from customers would definitely help us prioritize this feature correctly. http://frankdenneman.nl/2012/11/12/vmware-feature-request/

2. vMotion checks the path, but not as far as you would like. Checking the entire path could increase the overhead substantially, as some customers have extreme network configurations and we would need to take every configuration into account if we promised to check it all. As with the previous point, please submit a feature request if you want to get this in front of our product managers. http://frankdenneman.nl/2012/11/12/vmware-feature-request/

3. Multi-NIC vMotion can split the traffic of a single VM's vMotion operation across the available NICs, and will do so to leverage the available bandwidth. The sooner vMotion copies over the "dirty" pages, the sooner the complete state is transferred, and the less overhead we incur from pages that are dirtied again during the copy process. (A configuration sketch follows below.)
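For anyone wanting to set up multi-NIC vMotion on a standard vSwitch, a rough host-side sketch (the portgroup name, vmk number, uplink and IP address below are placeholders, not taken from this thread):

# esxcli network ip interface add -i vmk2 -p vMotion-2
(second vMotion portgroup, pinned to its own uplink)
# esxcli network vswitch standard portgroup policy failover set -p vMotion-2 -a vmnic3
(one active uplink per vMotion portgroup; the others standby or unused)
# esxcli network ip interface ipv4 set -i vmk2 -I 10.5.132.70 -N 255.255.255.0 -t static
# vim-cmd hostsvc/vmotion/vnic_set vmk2
(enables vMotion on the new vmknic)

Each vMotion portgroup should mirror this with the active/standby uplinks reversed, so the two vmknics land on different physical NICs.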

Regards,

Frank

Blogging: frankdenneman.nl Twitter: @frankdenneman Co-author: vSphere 4.1 HA and DRS technical Deepdive, vSphere 5x Clustering Deepdive series
dmholmes000
Contributor

I had the same problem in my lab environment after updating one host from ESXi 5.0 to ESXi 5.1.  I am running a four-host cluster using two distributed switches.  I performed the update using Update Manager.  After the host updated, everything looked good with the exception of vMotion.  After doing a little digging, I found that my vMotion virtual adapter's IP address had changed.  I then updated a second host to 5.1 and had no issues immediately vMotioning VMs onto it.  Root cause undetermined, but the fix was a simple correction of the IP address on the vmk virtual adapter.
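For anyone who wants to check this from the ESXi shell rather than the client, a minimal sketch (vmk1 and the address/netmask are placeholders):

# esxcli network ip interface ipv4 get
(lists every vmknic with its IPv4 address, netmask and address type)
# esxcli network ip interface ipv4 set -i vmk1 -I 10.5.132.60 -N 255.255.255.0 -t static
(re-applies the intended static address on the vMotion vmknic)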

wikusvanderwalt
Contributor

Thanks for your reply, dmholmes000. My issue turned out to be the ACL on my NFS export.
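For anyone hitting this variant: both hosts need working access to the datastore backing the VM, so it's worth confirming the mount on each host and the export ACL on the filer. A quick sketch (the filer hostname is made up):

# esxcli storage nfs list
(on each ESXi host: the volume should show Accessible: true and Read-Only: false)
# showmount -e nfs-filer.example.com
(from the NFS server or any Linux box: confirm every host's vmkernel IP or subnet is allowed by the export)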

bmontalban
Contributor

Had the same issue here. Two pNICs per host dedicated to iSCSI/vMotion: one pNIC per vSwitch, with iSCSI port binding enabled and vMotion enabled. They attach to dedicated iSCSI switches, on a separate VLAN and subnet. vMotion failed at 14%, and the logs show which vmk ports it was trying to use. I disabled vMotion on those particular ones and vMotion now works using the other specified ports.
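For reference, vMotion can also be toggled per vmknic from the shell; a minimal sketch, assuming the vmk numbers below (and that the vim-cmd sub-commands are remembered correctly):

# vim-cmd hostsvc/vmotion/vnic_unset vmk2
(disables vMotion on the iSCSI-bound vmknic)
# vim-cmd hostsvc/vmotion/vnic_set vmk3
(enables it on the vmknic that should carry vMotion)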

kramsen
Contributor

I've added a second vMotion vmkernel port on every ESXi host. The IP address is in the same subnet as the first vMotion vmkernel port. Everything works fine after that change.

Thanks, community!

Matthias

J3anss0n
Contributor

I recently ran into the same issue, but it was isolated to migrations to and from one particular ESXi 5.1 host. After verifying that there were no IP conflicts or other connectivity issues on the vMotion network, I restarted the management services on the host (services.sh restart). This instantly resolved the issue and it has not returned since.
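If a full services.sh restart feels heavy-handed, the two agents most relevant to vMotion orchestration can be bounced individually (a sketch; running VMs are not affected by either restart):

# /etc/init.d/hostd restart
(restarts the host agent)
# /etc/init.d/vpxa restart
(restarts the vCenter agent on the host)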

Would have been nice to know the actual root cause but I was unable to figure that out.

How hard can it be, it's only ones and zeros
mefendi
Contributor

Hi All,

I had a similar problem when upgrading a host from 5.0 U1 to 5.1.

I have a vSwitch0 on each host, consisting of a vMotion portgroup (Host 1: 192.168.1, Host 2: 192.168.1.2) and a management portgroup (Host 1: 10.33.10.16, Host 2: 10.33.10.18).

When I tried to vMotion between the two hosts, the task failed at 14% (see attached vMotion error.JPG). I then disabled vMotion on the vMotion portgroup and enabled it on the management portgroup instead, and after that vMotion succeeded.

I hope that helps.

Thanks,

Mansur

JonBelanger2011
Enthusiast

I just had the same problem after adding a new 5.1 U1 host to our cluster.  The virtual network configs were identical, and yet I couldn't vMotion to the new host... It worked for 4 VMs and then nothing... They all failed at 14%.

After reading a few posts on here, I asked my networking team to permit all VLANs on the physical switch (dedicated Cisco switches for vMotion in a C7000 enclosure) and it worked right away.  They then reconfigured the switches to permit only the correct VLANs (the same config as when it was not working) and it has been working flawlessly since...

nlongstreet
Contributor

Had a similar issue updating to 5.1 U1. I'm running a six-host cluster in a c7000 HP blade chassis, with all hosts on 5.0. I used Update Manager to update one server to 5.1 U1. The installation completed successfully and I was able to log into ESXi, reconnect to vCenter, the whole nine yards. All my network settings transferred over with no issues, except vMotion failing at 14%.

I logged a support ticket with VMware and they told me my vMotion NICs were not able to see the VLAN associated with those vMotion NICs; however, the engineer put one NIC into standby and vMotion worked... or so I thought. I updated my second host to 5.1 U1 and hit the same vMotion issue after upgrading, but this time placing one NIC into standby did not resolve the error.

At that point the first host I put on 5.1 U1 started acting really strange: the console was showing fork errors, and I couldn't change any settings in vCenter, reconnect, or restart the management services. Only a hard reboot corrected those issues, but vMotion still fails at 14%. The strange part is that even though vMotion failed, and everything else I tried failed, when I rebooted the host the guest VMs did transfer over without a reboot of the guests. I placed another call into VMware today, but I'm really feeling a rollback may be in order to maintain some stability on my systems.

JonBelanger2011
Enthusiast

I just had the same problem in my c7000 also... Try permitting all VLANs on your interconnect switches (we have 3020s) and see if it solves the problem.

nlongstreet
Contributor

I've got all the 3020s set up as trunk ports with all VLANs allowed. The vSwitch network adapter properties are still only showing three VLANs (none of them the vMotion network). What's odd is that the same switches with the same port setups are working fine on the 5.0 ESX hosts. I'm wondering what the big change is with 5.1 or U1.
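For what it's worth, the allowed-VLAN list on the Cisco side can be checked and amended like this (a sketch in IOS syntax; the interface name and VLAN ID are made up):

switch# show interfaces trunk
(the vMotion VLAN should appear under "Vlans allowed on trunk" for the blade-facing ports)
switch# configure terminal
switch(config)# interface GigabitEthernet0/10
switch(config-if)# switchport trunk allowed vlan add 132
(adds the missing VLAN without replacing the existing allowed list)

Also note that the VLANs shown under the adapter's observed networks are derived from traffic the uplink actually sees, so a quiet VLAN may not appear there even when it is trunked correctly.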

Mikeyyb
Contributor

Thanks for saving me the time.  This was my issue.

ugob
Contributor

In my case, it was the upstream switch that didn't pass the vMotion VLAN.  When one host was using a NIC connected to Virtual Connect 1 and the other host used a NIC connected to Virtual Connect 2, the traffic needed to go through the upstream switch, which didn't allow the VLAN.

ovdleun
Contributor

Hi all,

Just encountered the same problem, or at least the same symptoms. Mine was caused by an orphaned vmx-**.vswp file. You'll find my solution here:

http://www.b00z.nl/blog/2013/12/vmotion-fails-at-14-with-at-least-one-solution/
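Presumably the check boils down to spotting a leftover VMX swap file in the VM's working directory. A rough sketch, not necessarily the exact procedure from the post (datastore, VM and file names are placeholders):

# ls -lh /vmfs/volumes/datastore1/MyVM/*.vswp
(a running VM should normally have one <vmname>.vswp and one vmx-*.vswp; extras with old timestamps are suspect)
# rm /vmfs/volumes/datastore1/MyVM/vmx-MyVM-stale.vswp
(remove only the stale file, ideally with the VM powered off, after confirming it is not in use)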

Hope it will help someone.

Regards,

Onno.
