VMware Cloud Community
mcowger
Immortal

vMotion Fails At 14%

Hi Everyone,

Having a bit of a strange issue with vMotion today.  It's been working in my lab for months, and was still working fine after I upgraded to 5.1 last week.

Now, today, for whatever reason, it just refuses to progress beyond 14%, failing with a timeout.  Here's the log snippet:

2012-09-18T23:52:14.117Z| vmx| I120: VMXVmdbCbVmVmxMigrate: Got SET callback for /vm/#_VMX/vmx/migrateState/cmd/##1_6438/op/=to
2012-09-18T23:52:14.117Z| vmx| I120: Could not identify IP address family of in/srcLogIp:
2012-09-18T23:52:14.117Z| vmx| I120: Could not identify IP address family of in/dstLogIp:
2012-09-18T23:52:14.117Z| vmx| I120: VMXVmdbVmVmxMigrateGetParam: type: 1 srcIp=<10.5.132.60> dstIp=<10.5.132.61> mid=4ca02aa12688e uuid=4c4c4544-004c-4a10-8031-c8c04f4c4b31 priority=high checksumMemory=no maxDowntime=0 encrypted=0 resumeDuringPageIn=no latencyAware=yes diskOpFile=
2012-09-18T23:52:14.117Z| vmx| I120: VMXVmdbVmVmxMigrateGetParam: type 1 unsharedSwap 0 memMinToTransfer 0 cpuMinToTransfer 0 numDisks 0 numStreamIps 1
2012-09-18T23:52:14.117Z| vmx| I120: Received migrate 'to' request for mid id 1348012698921102, src ip <10.5.132.60>, dst ip <10.5.132.61>(invalidate source config).
2012-09-18T23:52:14.117Z| vmx| I120: SVGA: Maximum display topology 2560x1600.
2012-09-18T23:52:14.120Z| vmx| I120: MigrateSetInfo: state=1 srcIp=<10.5.132.60> dstIp=<10.5.132.61> mid=1348012698921102 uuid=4c4c4544-004c-4a10-8031-c8c04f4c4b31 priority=high
2012-09-18T23:52:14.120Z| vmx| I120: MigrateSetState: Transitioning from state 0 to 1.
2012-09-18T23:52:14.120Z| vmx| I120: VMXVmdb_SetMigrationHostLogState: hostlog state transits to emigrating for migrate 'to' mid 1348012698921102
2012-09-18T23:53:44.117Z| vmx| I120: VMXVmdb_SetMigrationHostLogState: hostlog state transits to failure for migrate 'to' mid 1348012698921102
2012-09-18T23:53:44.121Z| vmx| I120: MigrateSetStateFinished: type=1 new state=5
2012-09-18T23:53:44.121Z| vmx| I120: MigrateSetState: Transitioning from state 1 to 5.
2012-09-18T23:53:44.121Z| vmx| I120: Migrate_SetFailureMsgList: switching to new log file.
2012-09-18T23:53:44.122Z| vmx| I120: Migrate_SetFailureMsgList: Now in new log file.
2012-09-18T23:53:44.139Z| vmx| I120: [msg.migrate.expired] Timed out waiting for migration start request.
2012-09-18T23:53:44.139Z| vmx| I120: Migrate: cleaning up migration state.
2012-09-18T23:53:44.139Z| vmx| I120: MigrateSetState: Transitioning from state 5 to 0.

I can vmkping between the relevant interfaces just fine:

# vmkping 10.5.132.61
PING 10.5.132.61 (10.5.132.61): 56 data bytes
64 bytes from 10.5.132.61: icmp_seq=0 ttl=64 time=0.131 ms
64 bytes from 10.5.132.61: icmp_seq=1 ttl=64 time=0.162 ms
64 bytes from 10.5.132.61: icmp_seq=2 ttl=64 time=0.122 ms

--- 10.5.132.61 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.122/0.138/0.162 ms

I've tried restarting vCenter itself, no help.  Tried multiple different VMs.  Tried various combinations of my 4 hosts, and none of them work.

I just don't know where else to go. Ideas?

Edit:  Other things I've confirmed: time sync is good, disk free space is good, and forward and reverse name resolution is good.
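
For anyone else chasing this, a couple of extra checks that help narrow it down (a minimal sketch, assuming standard ESXi 5.x tooling; vmk1 and the 8972-byte payload are placeholders for your own vMotion vmknic and MTU):

# esxcfg-vmknic -l
# vmkping -I vmk1 10.5.132.61
# vmkping -I vmk1 -d -s 8972 10.5.132.61

The first command lists the vmkernel NICs with their IPs and MTUs, -I sources the ping from a specific vmkernel interface instead of letting the stack pick one, and -d -s 8972 verifies the full jumbo-frame path if the vMotion network runs at 9000 MTU.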

--Matt VCDX #52 blog.cowger.us
34 Replies
chriswahl
Virtuoso

Matt,

Has anything changed other than the 5.1 upgrade a few weeks back?

It seems like most of your work has centered around the vSphere equation. How much digging have you done on your network, such as looking at the physical switches (perhaps in debug mode) to find any issues? Any new gear plugged in, IP conflicts, etc.?

VCDX #104 (DCV, NV) ஃ WahlNetwork.com ஃ @ChrisWahl ஃ Author, Networking for VMware Administrators
mcowger
Immortal

Well, with a reboot of all 4 hosts, everything has cleared up.

I don't like this answer, but such is life. :(

--Matt VCDX #52 blog.cowger.us
agarciape
Contributor

I have the same problem after upgrading to 5.1.

appica_ian
Contributor

Any solution for this?  I have a 5.1 host that also fails vMotions at 14%.  Rebooting all my hosts didn't resolve it.


Thanks,
Ian

appica_ian
Contributor

I was able to get this resolved.  In my case, the problematic host didn't have proper access to the NFS exports.  I use NetApp's VSC to connect the storage, and it had only assigned read/write access instead of root access on this one host.
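
If you suspect the same thing, a quick comparison between hosts helps (a sketch, assuming ESXi 5.x esxcli; the actual fix still happens on the array side, where the export has to grant root access, not just read/write, to the host's NFS vmkernel IP):

# esxcli storage nfs list

Run it on a working host and on the failing host and compare the Accessible and Read-Only columns for the datastore the VM lives on.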

Ian  

kozzy3032011101
Contributor

I was having the same issue when I upgraded one of the hosts from ESXi 5.0 Update 1 to 5.1. I originally had the vMotion network set up with only one vmkernel port, so I added an extra vmkernel port to each host as per the guide:

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=200746...

Initially I added an IP address within the same range to the new vmkernel port, e.g.:

Host 1 vMotion: IP1 = 172.19.19.10, IP2 = 172.19.19.11
Host 2 vMotion: IP1 = 172.19.19.12, IP2 = 172.19.19.13

vMotion was still failing at 14%.

I then changed the second vmkernel port to use a different subnet:

Host 1 vMotion: IP1 = 172.19.19.10, IP2 = 172.19.20.10
Host 2 vMotion: IP1 = 172.19.19.12, IP2 = 172.19.20.12

After the new subnet was added, vMotion started working again.
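
For reference, the same second vmkernel port can also be added from the CLI (a sketch, assuming a standard vSwitch portgroup named vMotion-2; a vDS takes different options on the add command, and vMotion still has to be enabled on the new vmknic in the vSphere Client):

# esxcli network ip interface add --interface-name=vmk2 --portgroup-name=vMotion-2
# esxcli network ip interface ipv4 set --interface-name=vmk2 --ipv4=172.19.20.10 --netmask=255.255.255.0 --type=static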

elgreco81
Expert

Hi all,

Same situation here. :( This is the second "major" error since the recent upgrade from 5.0 to 5.1. The first one was solved in this discussion:

http://communities.vmware.com/message/2150582#2150582

In my environment I have an Openfiler and my 2 ESXi hosts are nested on the same host.

If I tail -f messages.log on the Openfiler I find this. I will try to get to the bottom of it, but if anyone knows where I should start looking, it would be very much appreciated:

The log looks like this while the migration is stuck at 14%:

kern.info<6>: Nov 20 16:18:08 openfiler kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun1 by sid:84552445757952 (Unknown LUN)

Then it just repeats these messages over and over again (right after the first one):

kern.info<6>: Nov 20 16:18:10openfiler kernel: last message repeated 26 times

So, from my ignorance, I'm guessing that for some reason after the upgrade the path is pointing to a place that is not right... As stated before, I'll try to find out exactly what's happening, but PLEASE :) any help is more than welcome!
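
A place to start on the ESXi side (a sketch, assuming software iSCSI and the standard 5.x esxcli namespaces) is to watch whether either host is dropping paths or aborting its sessions to the Openfiler target while the migration sits at 14%:

# esxcli iscsi session list
# esxcli storage core path list
# grep -i abort /var/log/vmkernel.log | tail -n 20

Dead paths or repeated session rebuilds on one host would line up with the Abort Task messages the target is logging.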

Best regards,

elgreco81

Please remember to mark this question as answered if you think it is, and to reward the people who helped you by giving them the available points accordingly. IT blog in Spanish - http://chubascos.wordpress.com
TraversiRoberto
Contributor

Hello,

similar problem here with multiple vMotion networks.

I have 3 hosts on BL460c Gen8 blades inside one HP C7000 enclosure with VirtualConnect FlexFabric, and the VirtualConnect Ethernet networks are configured in VLAN tunneling.

~ # esxcfg-route -l
VMkernel Routes:
Network          Netmask          Gateway          Interface
10.2.106.0       255.255.255.0    Local Subnet     vmk3
10.102.5.0       255.255.255.0    Local Subnet     vmk0
10.102.88.0      255.255.255.0    Local Subnet     vmk1
192.168.12.0     255.255.255.0    Local Subnet     vmk2
192.168.14.0     255.255.255.0    Local Subnet     vmk4
default          0.0.0.0          10.102.5.1       vmk0
~ #

vmk0 is used for management traffic, while vmk1 to vmk4 should be used for vMotion traffic.

vmk0, vmk1, and vmk2 belong to the first vDS (uplinks vmnic0 and vmnic1), while vmk3 and vmk4 belong to the second vDS (uplinks vmnic2 and vmnic3).

I get the 14% vMotion failure if I enable more than one vmknic for vMotion, while if I keep just one vmknic for vMotion it works perfectly no matter which vmknic I choose (the same vmknic on all 3 hypervisors, of course).

I found this http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=203748... but I don't think it applies to my case, since that problem should arise even with only one vmknic, not only with more than one.

My ESXi version is VMware ESXi 5.1.0 build-799733.

Just to complete the scenario: when I get the error with multiple vMotion networks (all of them tagged on different VLANs), it seems that host1 tries to contact host2, for example from the 10.2.106.0 network to 192.168.12.0, while it should try the connection on 10.102.106.0.

Does anyone have suggestions?

Best regards, Roberto Traversi.

siegfriedLH
Contributor

Hello,

I've got the same problem. I've just upgraded one host from 5.0 Update 1 to 5.1 and I can't vMotion.

I've tried deleting and reconfiguring the vmkernel port; it works fine, but if I reboot the host the problem appears again.

TraversiRoberto
Contributor

Hello, I can confirm the same problem as you. Suddenly, and without any further action (just flagging or unflagging vMotion on the vmk NICs), everything worked fine; but if I reboot the hypervisor the problem comes back, and moreover vMotion fails towards the rebooted hypervisor while it works fine among the other hypervisors that were not rebooted.

I opened a support case; I'll keep you posted with updates.

Best regards, Roberto.

siegfriedLH
Contributor

Hello,

I've tried taking this host out of the cluster and adding it back again, and vMotion works.

I'm installing this patch too: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=203454...

See you :)

PS: the patch didn't work for me; I reverted back to 5.0 Update 1.

serbo
Contributor

Probably totally unrelated, but I got the same error at 14%. After a bit of digging, it turned out I had duplicate IPs on our vMotion VLAN. It may help some of you.
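
One quick way to check for that from a host (a sketch, assuming the esxcli network ip neighbor namespace available on 5.x; 10.5.132.61 stands in for whichever vMotion address you suspect) is to ping the address and then look at the ARP cache:

# vmkping 10.5.132.61
# esxcli network ip neighbor list

If the MAC listed for that IP keeps changing, or belongs to something other than the host you expect, you have a duplicate.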

Cheers

admin
Immortal

Multiple vmkernel interfaces with vMotion enabled will cause this issue. I disabled vMotion on all interfaces but one and confirmed it worked.

TraversiRoberto
Contributor

In our environment, just leaving one vMotion NIC enabled (we tried them all, one by one) solves the issue, so it is not related to a duplicate IP address.

I finally succeeded in uploading the log file to VMware support; I hope to get an answer from them.

TraversiRoberto
Contributor

Hello,

Just a quick update: I got an answer from VMware support. Multi-NIC vMotion is supported only on the same subnet and the same VLAN (if VLAN tagging is used). Honestly, reading the documentation I hadn't understood that; I'll have to review it more carefully.
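
For anyone rearranging their setup around that answer, the supported layout (a sketch based on the multi-NIC vMotion articles Frank linked above; the portgroup names, uplinks, VLAN, and addresses are only examples) keeps every vMotion vmknic in one subnet and one VLAN and separates the NICs with the failover order instead:

vMotion-01 portgroup: vmnic0 active, vmnic1 standby - vmk1 = 172.19.19.10/24, VLAN 19
vMotion-02 portgroup: vmnic1 active, vmnic0 standby - vmk2 = 172.19.19.11/24, VLAN 19

Each host repeats the same pattern with its own pair of addresses in that subnet.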

Best regards, Roberto.

elgreco81
Expert

Hi all!

From what I see in this discussion and other places on the web, failures at 14% are solved in different ways, so they must be caused by several different errors, most of them network or virtual-network configuration issues.

The solutions quoted in this discussion didn't work in my case, but most likely they will work for others.

The good thing is that if other community members keep feeding this discussion, we will all have a pretty good document about vMotion failing at 14% and ways to solve that error! :)

It would also be good to completely understand the vMotion mechanism, in order to know what vMotion does up until 14% and what it tries to do after 14%... hopefully someone out there knows the answer and posts it here :)

Thank you all for your answers and for making this community such a great tool!!!

Regards,

elgreco81

Please remember to mark this question as answered if you think it is, and to reward the people who helped you by giving them the available points accordingly. IT blog in Spanish - http://chubascos.wordpress.com
frankdenneman
Expert

Roberto,

I have written two articles about the correct configuration of the vMotion network; it might be helpful to check your network configuration against them:

http://frankdenneman.nl/vmotion/2879/ (Designing your vMotion network)

http://frankdenneman.nl/vmotion/multi-nic-vmotion-failover-order-configuration/ (Multi-NIC vMotion – failover order configuration)

elgreco81,

It's highly unlikely that we are going to publicly share the intimate details about the vMotion process. I would rather see everybody file an SR if they experience this problem, as this will provide us a lot of feedback to enhance and improve our vMotion code. After GSS has provided them with the answer, they can share it with the community on this board.

Blogging: frankdenneman.nl Twitter: @frankdenneman Co-author: vSphere 4.1 HA and DRS technical Deepdive, vSphere 5x Clustering Deepdive series
wikusvanderwalt
Contributor

Hi everyone,

I've got a similar issue with my lab setup.  I started a discussion and then noticed you guys have this one: http://communities.vmware.com/message/2174867#2174867

I've tried a few things myself and was wondering: can anyone do a Storage vMotion whilst they have the 14% vMotion issue?

Cheers,
Wikus

TraversiRoberto
Contributor

Hello Frank,

I read your articles and I understand that I had imagined multi-NIC vMotion differently.

Although vMotion in ESXi 5 works better than in 4, I think some improvements should be considered (I'll try to submit them as feature requests, hoping they can be taken into consideration):

  1. Multi-subnet and multi-VLAN support (not necessarily meaning we have to set static routes in ESXi; an attached L2 network is just fine)
    • this feature could lead to better use of switch backplane bandwidth (if you use a single addressing scheme and a single VLAN, the traffic may pass across the switch interlink even in non-faulty scenarios)
    • some customers (us, for example) have different groups of switches that are not interconnected with each other but are suitable for vMotion traffic
  2. ESXi should try to send vMotion traffic on a "path" only after having checked that the path is available (with keepalive packets exchanged among the hosts, multicast or unicast depending on the needs)
    • this would prevent failures like the one we are experiencing now, and could even be useful for choosing the best path based on keepalive round-trip time
    • keepalive round-trip times could even reveal whether a 10G link is efficient or not; the link speed reported by the network interface card driver is not reliable for determining the real speed a vMotion can achieve
  3. I imagined multi-NIC vMotion as a sort of multipath communication, with the data to transfer split among all the available NICs and reassembled at the destination; after closing the case I understood that it gives an advantage only if more than one VM is vMotioned, otherwise only one NIC is used

Best regards, Roberto Traversi.
