xiong023
Enthusiast

vMotion from ESXi 5.0 to 5.1 U1 stalls at 65% but eventually completes

We recently upgraded a few hosts in one of our clusters from ESXi 5.0 build 702118 (Dell R710) to ESXi 5.1 U1 build 1117900 (Dell R720), and we're noticing that during vMotion operations (maintenance mode and manual) from the 5.0 hosts to the 5.1 hosts, some VMs hang at 65% and lose multiple pings (anywhere from 5 to 12) before eventually completing.  With that much ping loss, the applications are affected.  This only happens going from a 5.0 host to a 5.1 host, not the other way around.

It's also completely random: some VMs experience this while others don't, and trying to reproduce it on a single VM multiple times is just as random.  Our cluster initially had three 5.0 hosts, and each time one was upgraded to 5.1 U1 we saw the problem, again with random VMs on different VLANs.  We saw the same behavior using a single 1GB NIC vMotion vSwitch, a multi-NIC 1GB vMotion vSwitch, and a single 10GB NIC vMotion vSwitch setup.
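
For what it's worth, we were measuring the drops with a plain continuous ping against the guest while the vMotion ran, roughly like this (just a sketch from a Linux admin box; the guest name is an example and it assumes the guest answers ICMP):

ping -i 1 myguest01

The gap in the icmp_seq numbers around the switchover is where the 5 to 12 lost pings show up.  On Windows, "ping -t myguest01" and counting the "Request timed out." lines works just as well.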

Our setup:

vCenter 5.1.0 Build 947673

3 hosts upgraded from ESXi 5.0 build 702118 to ESXi 5.1 U1 build 1117900, with the hardware swapped from Dell R710s to Dell R720s during each upgrade.  All Intel NICs.
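
For anyone comparing environments, the exact version and build on each host can be confirmed from the ESXi shell (a quick sketch; either command works on 5.x):

vmware -vl

esxcli system version get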

This issue has us worried about upgrading our other clusters.  That said, other clusters with all ESXi 5.1 U1 build 1117900 hosts have no problems.

KB 2036892 doesn't apply to us, since the vMotions don't fail and the 5.1 U1 build already includes that fix.  Thoughts?  We have not opened a support case on this yet.

5 Replies
vmroyale
Immortal

Note: Discussion successfully moved from VMware vCenter™ to vMotion & Resource Management

Brian Atkinson | vExpert | VMTN Moderator | Author of "VCP5-DCV VMware Certified Professional-Data Center Virtualization on vSphere 5.5 Study Guide: VCP-550" | @vmroyale | http://vmroyale.com
LordISP
Enthusiast

Use esxtop to determine the world ID of the affected VMs, then grep /var/log/vmkernel.log on both the source and destination host for that ID.  That should give you the relevant event information about what went wrong.  I probably don't have to mention checking the HCL (VMware Compatibility Guide: System Search) for your hardware configuration and firmware levels.
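
Something along these lines from the ESXi shell (just a sketch; substitute the name and world ID of one of your affected VMs):

esxcli vm process list

Note the "World ID" of the affected VM in that output, then on both the source and destination host:

grep <worldID> /var/log/vmkernel.log

grep -i vmotion /var/log/vmkernel.log

Comparing the timestamps on both sides should show where the time goes during the stall.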

http://rafaelcamison.wordpress.com http://communities.vmware.com/people/LordISP/blog
xiong023
Enthusiast

All hardware is on the HCL.  The 1GB NIC is a few driver versions behind, but the 10GB NIC has the latest driver installed and vMotion over it produces the same random issue (the commands we used to check the driver versions are at the end of this post).  Looking at one of the affected VMs in the vmkernel.log, I don't see any errors:

Source Host:

2013-06-05T15:04:31.431Z cpu5:5282792)Migrate: vm 5282793: 3234: Setting VMOTION info: Source ts = 1370444633537231, src ip = <10.100.12.206> dest ip = <10.100.12.205> Dest wid = 400487 using SHARED swap

2013-06-05T15:04:31.433Z cpu5:5282792)Tcpip_Vmk: 1059: Affinitizing 10.100.12.206 to world 6541311, Success

2013-06-05T15:04:31.433Z cpu5:5282792)VMotion: 2425: 1370444633537231 S: Set ip address '10.100.12.206' worldlet affinity to send World ID 6541311

Destination Host:

2013-06-05T15:04:31.196Z cpu8:400447)World: vm 400487: 1421: Starting world vmm0:LNMTCODCPD20 with flags 8

2013-06-05T15:04:31.196Z cpu8:400447)Sched: vm 400487: 6416: Adding world 'vmm0:LNMTCODCPD20', group 'host/user/pool2', cpu: shares=-3 min=0 minLimit=-1 max=-1, mem: shares=-3 min=2097152 minLimit=-1 max=-1

2013-06-05T15:04:31.196Z cpu8:400447)Sched: vm 400487: 6431: renamed group 695847 to vm.400447

2013-06-05T15:04:31.196Z cpu8:400447)Sched: vm 400487: 6448: group 695847 is located under group 695838

2013-06-05T15:04:31.311Z cpu8:400447)Migrate: vm 400487: 3273: Setting VMOTION info: Dest ts = 1370444633537231, src ip = <10.100.12.206> dest ip = <10.100.12.205> Dest wid = 0 using SHARED swap

2013-06-05T15:04:31.315Z cpu8:400447)Hbr: 3308: Migration start received (worldID=400487) (migrateType=1) (event=0) (isSource=0) (sharedConfig=1)

2013-06-05T15:05:40.766Z cpu14:400447)VSCSI: 3780: handle 8222(vscsi0:0):Creating Virtual Device for world 400487 (FSS handle 3226637)

2013-06-05T15:05:41.034Z cpu7:400487)VMMVMKCall: 208: Received INIT from world 400487

2013-06-05T15:05:41.035Z cpu7:400487)LSI: 1986: LSI: Initialized rings for scsi0 async=1, record=0 replay=0

2013-06-05T15:05:41.039Z cpu7:400487)VMotion: 5679: 1370444633537231 D: Received all changed pages.

2013-06-05T15:05:41.044Z cpu7:400487)VmMemMigrate: vm 400487: 5005: Regular swap file bitmap checks out.

2013-06-05T15:05:41.056Z cpu7:400487)VMotion: 5458: 1370444633537231 D: Resume handshake successful

2013-06-05T15:05:41.240Z cpu8:400745)Hbr: 3405: Migration end received (worldID=400487) (migrateType=1) (event=1) (isSource=0) (sharedConfig=1)

2013-06-05T15:05:41.270Z cpu9:400751)Swap: vm 400487: 3254: Starting prefault for the migration swap file

2013-06-05T15:05:41.341Z cpu9:400751)Swap: vm 400487: 3429: Finish swapping in migration swap file. (faulted 0 pages, pshared 0 pages). Success.

2013-06-05T16:27:57.958Z cpu8:400487)NMP: nmp_ThrottleLogForDevice:2319: Cmd 0x1a (0x412401d0d1c0, 0) to dev "mpx.vmhba35:C0:T0:L0" on path "vmhba35:C0:T0:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0. Act:NONE

2013-06-05T16:27:57.958Z cpu8:400487)ScsiDeviceIO: 2331: Cmd(0x412401d0d1c0) 0x1a, CmdSN 0xb0b502 from world 0 to dev "mpx.vmhba35:C0:T0:L0" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
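
On the driver versions mentioned at the top of this post, this is roughly how we compared them against the HCL (a sketch; it assumes uplink names like vmnic0 and vmnic4, so adjust to your own):

esxcli network nic list

esxcli network nic get -n vmnic0

ethtool -i vmnic4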

kpoppen6
Contributor

Were you or anyone else able to find a valid answer to this?  I have a very similar issue where the vMotion completes successfully and there are no real errors in the vmkernel log.  All components are on the HCL, all 10GB.  I have another cluster with an identical setup that doesn't have this issue.  VMware has not been able to pinpoint the problem so far and I am running out of ideas, so I'm hoping someone found an answer.

xiong023
Enthusiast

We didn't get an answer as to why it was happening, but we got a workaround.  Support suggested changing the vMotion VMkernel port's security setting for Promiscuous Mode to "Accept" (basically turning that port into a hub port), and that seems to have worked; we haven't been able to reproduce the problem since.  In our environment the Management and vMotion networks are on the same vSwitch but separated from the VM guest networks, so it's not too much of a concern.
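
If anyone wants to script that change, it can also be done from the ESXi shell with esxcli (a rough sketch; it assumes a standard vSwitch and that the vMotion port group is literally named "vMotion", so adjust the names to your environment):

esxcli network vswitch standard portgroup policy security set --portgroup-name=vMotion --allow-promiscuous=true

Or, to override at the vSwitch level instead:

esxcli network vswitch standard policy security set --vswitch-name=vSwitch0 --allow-promiscuous=true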
