GreyhoundHH
Enthusiast

VMotion failure: timeout

Hello!

I'm facing migration failures in a setup of two ESX 3.5 hosts. They share a LUN connected via FC. I can VMotion VMs from host#2 to host#1, but when I try to VMotion from host#1 to host#2 I get an "operation timed out" error. I've attached a screenshot of the VirtualCenter log showing the migration attempt.

This is what the vmkernel.log says on host#1:

Mar 19 10:25:21 egmagvmw01 vmkernel: 1:19:25:03.908 cpu5:1037)SCSI: vm 1037: 109: Sync CR at 64

Mar 19 10:25:22 egmagvmw01 vmkernel: 1:19:25:04.882 cpu5:1037)SCSI: vm 1037: 109: Sync CR at 48

Mar 19 10:27:28 egmagvmw01 vmkernel: 1:19:27:11.471 cpu6:1093)VSCSIFs: 439: fd 4128 status Busy

Mar 19 10:27:28 egmagvmw01 vmkernel: 1:19:27:11.479 cpu2:1043)SCSI: vm 1043: 109: Sync CR at 64

Mar 19 10:27:29 egmagvmw01 vmkernel: 1:19:27:12.495 cpu2:1043)SCSI: vm 1043: 109: Sync CR at 48

Mar 19 10:27:30 egmagvmw01 vmkernel: 1:19:27:13.466 cpu2:1043)SCSI: vm 1043: 109: Sync CR at 32

Mar 19 10:27:31 egmagvmw01 vmkernel: 1:19:27:14.510 cpu2:1043)SCSI: vm 1043: 109: Sync CR at 16

Mar 19 10:27:33 egmagvmw01 vmkernel: 1:19:27:15.539 cpu2:1043)SCSI: vm 1043: 109: Sync CR at 0

Mar 19 10:27:33 egmagvmw01 vmkernel: 1:19:27:15.539 cpu2:1043)WARNING: SCSI: 119: Failing I/O due to too many reservation conflicts

Mar 19 10:27:33 egmagvmw01 vmkernel: 1:19:27:15.539 cpu2:1043)WARNING: FS3: 4784: Reservation error: SCSI reservation conflict

Mar 19 10:27:33 egmagvmw01 vmkernel: 1:19:27:15.539 cpu2:1043)WARNING: FS3: 4978: Reclaiming timed out heartbeat failed: SCSI reservation conflict

Mar 19 10:27:33 egmagvmw01 vmkernel: 1:19:27:16.090 cpu2:1043)FS3: 4828: Reclaimed timed out heartbeat
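The "Sync CR at N" values above appear to count down remaining retries for a SCSI reservation (64, 48, ... 0), and once they reach 0 the I/O is failed with a reservation conflict. A minimal sketch (assuming this log format holds) to pull the countdown per world out of a vmkernel log:

```python
import re

def reservation_countdown(log_lines):
    """Collect the 'Sync CR at N' retry counters per world id
    from ESX 3.5 vmkernel log lines."""
    pattern = re.compile(r"cpu\d+:(\d+)\)SCSI: vm \d+: \d+: Sync CR at (\d+)")
    counts = {}
    for line in log_lines:
        m = pattern.search(line)
        if m:
            world, remaining = m.group(1), int(m.group(2))
            counts.setdefault(world, []).append(remaining)
    return counts

sample = [
    "Mar 19 10:27:28 egmagvmw01 vmkernel: 1:19:27:11.479 cpu2:1043)SCSI: vm 1043: 109: Sync CR at 64",
    "Mar 19 10:27:29 egmagvmw01 vmkernel: 1:19:27:12.495 cpu2:1043)SCSI: vm 1043: 109: Sync CR at 48",
    "Mar 19 10:27:33 egmagvmw01 vmkernel: 1:19:27:15.539 cpu2:1043)SCSI: vm 1043: 109: Sync CR at 0",
]
print(reservation_countdown(sample))  # {'1043': [64, 48, 0]}
```

A steadily repeating countdown to 0 like this usually points at something on the fabric or array holding reservations, rather than a one-off conflict.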

This is what the vmkernel.log says on host#2:

Mar 19 10:23:30 egmagvmw02 vmkernel: 15:18:44:14.355 cpu1:1040)SCSI: vm 1040: 109: Sync CR at 64

Mar 19 10:23:37 egmagvmw02 vmkernel: 15:18:44:21.460 cpu1:1041)World: vm 1169: 895: Starting world vmware-vmx with flags 4

Mar 19 10:23:50 egmagvmw02 vmkernel: 15:18:44:34.547 cpu4:1169)FS3: 1974: Checking if lock holders are live for lock [type 10c00005 offset 33717248 v 334, hb offset 3704320

Mar 19 10:23:50 egmagvmw02 vmkernel: gen 88, mode 1, owner 47de7983-d27b28f2-26b9-001a4bee92c8 mtime 147710]

Mar 19 10:24:02 egmagvmw02 vmkernel: 15:18:44:46.633 cpu4:1169)Res3: 3463: Rank violation threshold reached: cid 0xc1d00002, resType 4, cnum 95

Mar 19 10:24:14 egmagvmw02 vmkernel: 15:18:44:58.820 cpu4:1169)FS3: 1974: Checking if lock holders are live for lock [type 10c00005 offset 33717248 v 334, hb offset 3704320

Mar 19 10:24:14 egmagvmw02 vmkernel: gen 88, mode 1, owner 47de7983-d27b28f2-26b9-001a4bee92c8 mtime 147710]

Mar 19 10:24:26 egmagvmw02 vmkernel: 15:18:45:10.709 cpu4:1169)Res3: 3463: Rank violation threshold reached: cid 0xc1d00002, resType 4, cnum 95

Mar 19 10:24:31 egmagvmw02 vmkernel: 15:18:45:15.135 cpu4:1169)FS3: 1974: Checking if lock holders are live for lock [type 10c00005 offset 33717248 v 334, hb offset 3704320

Mar 19 10:24:31 egmagvmw02 vmkernel: gen 88, mode 1, owner 47de7983-d27b28f2-26b9-001a4bee92c8 mtime 147710]

Mar 19 10:24:43 egmagvmw02 vmkernel: 15:18:45:27.844 cpu5:1066)<6>qla24xx_abort_command(0): handle to abort=1117

Mar 19 10:24:44 egmagvmw02 vmkernel: 15:18:45:28.616 cpu4:1169)Res3: 3463: Rank violation threshold reached: cid 0xc1d00002, resType 4, cnum 95

Mar 19 10:24:55 egmagvmw02 vmkernel: 15:18:45:39.798 cpu4:1169)FS3: 1974: Checking if lock holders are live for lock [type 10c00005 offset 33717248 v 334, hb offset 3704320

Mar 19 10:24:55 egmagvmw02 vmkernel: gen 88, mode 1, owner 47de7983-d27b28f2-26b9-001a4bee92c8 mtime 147710]

Mar 19 10:25:09 egmagvmw02 vmkernel: 15:18:45:53.790 cpu4:1169)Res3: 3463: Rank violation threshold reached: cid 0xc1d00002, resType 4, cnum 95

Mar 19 10:25:22 egmagvmw02 vmkernel: 15:18:46:06.870 cpu5:1066)<6>qla24xx_abort_command(0): handle to abort=1196

Mar 19 10:25:23 egmagvmw02 vmkernel: 15:18:46:07.050 cpu4:1169)FS3: 1974: Checking if lock holders are live for lock [type 10c00005 offset 33717248 v 334, hb offset 3704320

Mar 19 10:25:23 egmagvmw02 vmkernel: gen 88, mode 1, owner 47de7983-d27b28f2-26b9-001a4bee92c8 mtime 147710]

Mar 19 10:25:34 egmagvmw02 vmkernel: 15:18:46:18.230 cpu4:1169)Res3: 3463: Rank violation threshold reached: cid 0xc1d00002, resType 4, cnum 95

Mar 19 10:25:34 egmagvmw02 vmkernel: 15:18:46:18.230 cpu4:1169)Fil3: 9825: Max retries (5) exceeded for caller 0x943fd1 (status 'Lock rank violation detected')

Mar 19 10:25:37 egmagvmw02 vmkernel: 15:18:46:21.026 cpu4:1169)FS3: 1974: Checking if lock holders are live for lock [type 10c00005 offset 33717248 v 334, hb offset 3704320

Mar 19 10:25:37 egmagvmw02 vmkernel: gen 88, mode 1, owner 47de7983-d27b28f2-26b9-001a4bee92c8 mtime 147710]

Mar 19 10:25:46 egmagvmw02 vmkernel: 15:18:46:30.886 cpu5:1066)<6>qla24xx_abort_command(0): handle to abort=1272

Mar 19 10:25:49 egmagvmw02 vmkernel: 15:18:46:33.362 cpu4:1169)Res3: 3463: Rank violation threshold reached: cid 0xc1d00002, resType 4, cnum 95

Mar 19 10:25:59 egmagvmw02 vmkernel: 15:18:46:43.764 cpu4:1169)FS3: 1974: Checking if lock holders are live for lock [type 10c00005 offset 33717248 v 334, hb offset 3704320

Mar 19 10:25:59 egmagvmw02 vmkernel: gen 88, mode 1, owner 47de7983-d27b28f2-26b9-001a4bee92c8 mtime 147710]

Mar 19 10:26:13 egmagvmw02 vmkernel: 15:18:46:57.009 cpu4:1169)Res3: 3463: Rank violation threshold reached: cid 0xc1d00002, resType 4, cnum 95

Mar 19 10:26:17 egmagvmw02 vmkernel: 15:18:47:01.175 cpu4:1169)FS3: 1974: Checking if lock holders are live for lock [type 10c00005 offset 33717248 v 334, hb offset 3704320

Mar 19 10:26:17 egmagvmw02 vmkernel: gen 88, mode 1, owner 47de7983-d27b28f2-26b9-001a4bee92c8 mtime 147710]

Mar 19 10:26:28 egmagvmw02 vmkernel: 15:18:47:11.995 cpu4:1169)Res3: 3463: Rank violation threshold reached: cid 0xc1d00002, resType 4, cnum 95

Mar 19 10:26:40 egmagvmw02 vmkernel: 15:18:47:24.103 cpu4:1169)FS3: 1974: Checking if lock holders are live for lock [type 10c00005 offset 33717248 v 334, hb offset 3704320

Mar 19 10:26:40 egmagvmw02 vmkernel: gen 88, mode 1, owner 47de7983-d27b28f2-26b9-001a4bee92c8 mtime 147710]

Mar 19 10:26:48 egmagvmw02 vmkernel: 15:18:47:32.053 cpu4:1169)Res3: 3463: Rank violation threshold reached: cid 0xc1d00002, resType 4, cnum 95

Mar 19 10:26:56 egmagvmw02 vmkernel: 15:18:47:40.820 cpu4:1169)FS3: 1974: Checking if lock holders are live for lock [type 10c00005 offset 33717248 v 334, hb offset 3704320

Mar 19 10:26:56 egmagvmw02 vmkernel: gen 88, mode 1, owner 47de7983-d27b28f2-26b9-001a4bee92c8 mtime 147710]

Mar 19 10:27:06 egmagvmw02 vmkernel: 15:18:47:50.638 cpu4:1169)Res3: 3463: Rank violation threshold reached: cid 0xc1d00002, resType 4, cnum 95

Mar 19 10:27:06 egmagvmw02 vmkernel: 15:18:47:50.638 cpu4:1169)Fil3: 9825: Max retries (5) exceeded for caller 0x943fd1 (status 'Lock rank violation detected')

Mar 19 10:27:06 egmagvmw02 vmkernel: 15:18:47:50.867 cpu6:1169)World: vm 1170: 895: Starting world vmm0:EGSRVW032 with flags 8

Mar 19 10:27:06 egmagvmw02 vmkernel: 15:18:47:50.867 cpu6:1169)Sched: vm 1170: 5333: adding 'vmm0:EGSRVW032': group 'host/user': cpu: shares=-3 min=0 minLimit=-1 max=-1

Mar 19 10:27:06 egmagvmw02 vmkernel: 15:18:47:50.867 cpu6:1169)Sched: vm 1170: 5352: renamed group 47 to vm.1169

Mar 19 10:27:06 egmagvmw02 vmkernel: 15:18:47:50.867 cpu6:1169)Sched: vm 1170: 5366: moved group 47 to be under group 4

Mar 19 10:27:06 egmagvmw02 vmkernel: 15:18:47:50.874 cpu6:1169)Swap: vm 1170: 2169: extending swap to 524288 KB

Mar 19 10:27:09 egmagvmw02 vmkernel: 15:18:47:53.141 cpu1:1041)Config: 414: "Enabled" = 1, Old Value: 1, (Status: 0x0)

Mar 19 10:27:09 egmagvmw02 vmkernel: 15:18:47:53.475 cpu0:1024)Host: 2655: irq 4 vector 0x39 (host 0x49)

Mar 19 10:27:09 egmagvmw02 vmkernel: 15:18:47:53.475 cpu0:1024)Host: 2587: irq 4 vector 0x39 (host 0x49)

Mar 19 10:27:09 egmagvmw02 vmkernel: 15:18:47:53.475 cpu0:1024)Host: 2599: irq 4 happened while being masked

Mar 19 10:27:09 egmagvmw02 vmkernel: 15:18:47:53.475 cpu0:1024)Host: 2655: irq 3 vector 0x31 (host 0x41)

Mar 19 10:27:09 egmagvmw02 vmkernel: 15:18:47:53.475 cpu0:1024)Host: 2587: irq 3 vector 0x31 (host 0x41)

Mar 19 10:27:09 egmagvmw02 vmkernel: 15:18:47:53.475 cpu0:1024)Host: 2599: irq 3 happened while being masked

Mar 19 10:27:09 egmagvmw02 vmkernel: 15:18:47:53.688 cpu0:1024)Host: 2655: irq 4 vector 0x39 (host 0x49)

Mar 19 10:27:09 egmagvmw02 vmkernel: 15:18:47:53.688 cpu0:1024)Host: 2587: irq 4 vector 0x39 (host 0x49)

Mar 19 10:27:09 egmagvmw02 vmkernel: 15:18:47:53.688 cpu0:1024)Host: 2599: irq 4 happened while being masked

Mar 19 10:27:09 egmagvmw02 vmkernel: 15:18:47:53.688 cpu0:1024)Host: 2655: irq 3 vector 0x31 (host 0x41)

Mar 19 10:27:09 egmagvmw02 vmkernel: 15:18:47:53.688 cpu0:1024)Host: 2587: irq 3 vector 0x31 (host 0x41)

Mar 19 10:27:09 egmagvmw02 vmkernel: 15:18:47:53.688 cpu0:1024)Host: 2599: irq 3 happened while being masked

Mar 19 10:27:09 egmagvmw02 vmkernel: 15:18:47:53.809 cpu6:1169)Migrate: vm 1170: 7332: Setting migration info ts = 1205918554140625, src ip = <192.168.240.23> dest ip = <0.0.0.0> Dest wid = -1 using SHARED swap

Mar 19 10:27:09 egmagvmw02 vmkernel: 15:18:47:53.809 cpu6:1169)World: vm 1171: 895: Starting world migSendHelper-1170 with flags 1

Mar 19 10:27:09 egmagvmw02 vmkernel: 15:18:47:53.809 cpu6:1169)World: vm 1172: 895: Starting world migRecvHelper-1170 with flags 1

Mar 19 10:27:13 egmagvmw02 vmkernel: 15:18:47:57.415 cpu0:1024)Host: 2655: irq 4 vector 0x39 (host 0x49)

Mar 19 10:27:13 egmagvmw02 vmkernel: 15:18:47:57.415 cpu0:1024)Host: 2587: irq 4 vector 0x39 (host 0x49)

Mar 19 10:27:13 egmagvmw02 vmkernel: 15:18:47:57.415 cpu0:1024)Host: 2599: irq 4 happened while being masked

Mar 19 10:27:13 egmagvmw02 vmkernel: 15:18:47:57.415 cpu0:1024)Host: 2655: irq 3 vector 0x31 (host 0x41)

Mar 19 10:27:13 egmagvmw02 vmkernel: 15:18:47:57.415 cpu0:1024)Host: 2587: irq 3 vector 0x31 (host 0x41)

Mar 19 10:27:13 egmagvmw02 vmkernel: 15:18:47:57.415 cpu0:1024)Host: 2599: irq 3 happened while being masked

Mar 19 10:27:13 egmagvmw02 vmkernel: 15:18:47:57.467 cpu1:1039)Config: 414: "VMOverheadGrowthLimit" = -1, Old Value: 0, (Status: 0x0)

Mar 19 10:27:20 egmagvmw02 vmkernel: 15:18:48:03.948 cpu5:1066)<6>qla24xx_abort_command(0): handle to abort=1672

Mar 19 10:27:29 egmagvmw02 vmkernel: 15:18:48:12.954 cpu5:1066)<6>qla24xx_abort_command(0): handle to abort=1788

Mar 19 10:27:47 egmagvmw02 vmkernel: 15:18:48:30.966 cpu5:1066)<6>qla24xx_abort_command(0): handle to abort=1892

Mar 19 10:28:05 egmagvmw02 vmkernel: 15:18:48:49.619 cpu1:1040)Config: 414: "VMOverheadGrowthLimit" = 0, Old Value: -1, (Status: 0x0)

Mar 19 10:28:05 egmagvmw02 vmkernel: 15:18:48:49.916 cpu0:1041)Config: 414: "HostLocalSwapDirEnabled" = 0, Old Value: 0, (Status: 0x0)

Mar 19 10:28:09 egmagvmw02 vmkernel: 15:18:48:53.811 cpu6:1169)WARNING: Migrate: 1345: 1205918554140625: Migration considered a failure by the VMX. It is most likely a timeout, but check the VMX log for the true error.

Mar 19 10:28:09 egmagvmw02 vmkernel: 15:18:48:53.811 cpu6:1169)WARNING: Migrate: 1242: 1205918554140625: Failed: Migration determined a failure by the VMX (0xbad0091) @0x9dd8c5

Mar 19 10:28:09 egmagvmw02 vmkernel: 15:18:48:53.812 cpu6:1169)Sched: vm 1170: 1031: name='vmm0:EGSRVW032'

Mar 19 10:28:09 egmagvmw02 vmkernel: 15:18:48:53.812 cpu6:1169)CpuSched: vm 1170: 13864: zombified unscheduled world: runState=NEW

Mar 19 10:28:09 egmagvmw02 vmkernel: 15:18:48:53.812 cpu6:1169)World: vm 1170: 2483: deathPending set; world not running, scheduling reap
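The FS3 "Checking if lock holders are live" lines include an owner UUID; as far as I know, its last 12 hex digits are normally the MAC address of the host that holds the on-disk lock, which can help identify which host is blocking. A small sketch (assuming that UUID convention) to extract it:

```python
import re

def lock_owner_mac(line):
    """Pull the lock-owner UUID's trailing 12 hex digits from an FS3 lock
    line and format them as a MAC address (the usual owner-host MAC)."""
    m = re.search(r"owner [0-9a-f]+-[0-9a-f]+-[0-9a-f]+-([0-9a-f]{12})", line)
    if not m:
        return None
    mac = m.group(1)
    return ":".join(mac[i:i + 2] for i in range(0, 12, 2))

line = "gen 88, mode 1, owner 47de7983-d27b28f2-26b9-001a4bee92c8 mtime 147710]"
print(lock_owner_mac(line))  # 00:1a:4b:ee:92:c8
```

Comparing that MAC against the vmknic/service-console MACs of both hosts shows whether the stale lock belongs to host#1, host#2, or some third host that still sees the LUN.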

Sorry for the long logs, but this is just the timeframe of the migration attempt.

I'm wondering about the "dest ip = <0.0.0.0>" entry in the log of the 2nd host. Is this normal?

I've already tried rebinding the physical NIC of each virtual switch that has the VMotion interface attached, and re-enabling the VMotion feature on each of the VMkernel interfaces. But this didn't change anything.

Any help/hints appreciated!

thx

DCasota
Hot Shot

Hi,

Just to be sure: can you vmkping the VMotion port of the 2nd ESX from the 1st ESX, and vice versa? If not, and if you don't have a VMotion gateway defined, there may be a name-resolution problem: one test would be to put the IPs and names of both ESX hosts in the /etc/hosts file (on each ESX).
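For that test, the entries would look something like this (hostnames taken from the logs above; the IPs and domain are placeholders, use your real ones):

```
# /etc/hosts on each ESX host -- IPs and domain below are placeholders
192.168.240.23   egmagvmw01.example.local   egmagvmw01
192.168.240.24   egmagvmw02.example.local   egmagvmw02
```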

Are the ESX hosts connected through an FC switch (or point-to-point)? If through an FC switch, check that the port speed is set to fixed and the port type to F-port only.

Hope it helps.

GreyhoundHH
Enthusiast

Hi, thanks for your answer!

Yes, I can vmkping in each direction; forgot to post that earlier. Name resolution is also working: I have a DNS server defined and the hostnames in each hosts file.

Unfortunately I don't know much about the SAN setup. It's a shared environment, so I assume we're connected through FC switches.

The strange thing is that it used to work properly. I discovered this issue by accident this morning 😕

cya

DCasota
Hot Shot

Hi again

Just as a test: rescan the FC HBA. Do you see red error messages on the ESX server's console screen?

If yes, ask your SAN specialist whether there was a change and/or whether he can check the SAN LUN mapping to both ESX hosts. Can you post the error message from the screen?

Bye

Daniel

GreyhoundHH
Enthusiast

Hi!

Do you mean the screen of the local console? I did a rescan and checked the screen via the server's iLO board. All I got was the regular ESX console start screen. I pressed Alt-F1 to log in; there are a couple of messages about attached SCSI disks, but as far as I can see those aren't errors. I've attached a screenshot.

cya

DCasota
Hot Shot

Hi again,

Yes, I meant the local console of your HP server. For the moment I have no further ideas. Have a look at the following threads:

http://communities.vmware.com/message/497689#497689

http://communities.vmware.com/message/763093#763093

Maybe one of those cases is similar to yours.

Hope it helps.

Bye

Daniel

DCasota
Hot Shot

I noticed in the console output that you have a Hitachi SAN. Have a look at the following KB article:
