richardt
Contributor

Storage vMotion stops at 18%

Hi all,

Got a problem with Storage vMotion in a cluster of 4 ESX hosts (vSphere 4.0.0, build 208167).

vMotion works OK, with no errors/warnings in the hosts' vmkernel or vmkwarning logs. I checked the storage and paths and found no problems. Running a test with a VM, it can only be Storage vMotioned on two of the four ESX hosts in the cluster. The same error is logged in the VM's vmware.log each time.
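In case it's useful, this is roughly how the logs were searched (a sketch only; log paths assume the classic ESX 4.0 service console, and the VM directory path is this VM's, yours will differ):

# Search the host logs for the allocation error (paths assume classic ESX 4.0)
grep -i "cannot allocate memory" /var/log/vmkernel /var/log/vmkwarning

# Search the VM's own log on its home datastore for the failed copy
grep -i "failed to move data" /vmfs/volumes/46f9b29c-8959b018-376f-001aa025a31a/ns1/vmware.log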

Host usage: memory 77%, CPU 34%.

The VM's vmware.log reports the following error:

May 06 15:40:07.831: Worker#0| DISKLIB-VMFS :CopyData '/vmfs/volumes/4ae53235-da7ef348-3f56-001e4f1dd095/ns1/ns1-flat.vmdk' : failed to move data (Cannot allocate memory:0xc0009).

Any ideas? Here's a fuller excerpt from the vmware.log around the failure:

May 06 15:40:02.104: vmx| VMXVmdbCbVmVmxMigrate: Got SET callback for /vm/#_VMX/vmx/migrateState/cmd/##9_20cc2/op/=to

May 06 15:40:02.105: vmx| VmxMigrateGetParam: type: 2 srcIp=<10.1.2.7> dstIp=<10.1.2.7> mid=485e65f566323 uuid=44454c4c-3600-1033-8058-b9c04f323253 priority=none checksumMemory=no maxDowntime=0 encrypted=0 resumeDuringPageIn=no

May 06 15:40:02.107: vmx| VmxMigrateGetParam: type 2 unsharedSwap 1 memMinToTransfer 0 cpuMinToTransfer 0 numDisks 1

May 06 15:40:02.111: vmx| Received migrate 'to' request for mid id 1273124395311907, src ip <10.1.2.7>, dst ip <10.1.2.7>.

May 06 15:40:02.142: vmx| MigrateSetInfo: state=1 srcIp=<10.1.2.7> dstIp=<10.1.2.7> mid=1273124395311907 uuid=44454c4c-3600-1033-8058-b9c04f323253 priority=low

May 06 15:40:02.143: vmx| MigrateStateUpdate: Transitioning from state 0 to 1.

May 06 15:40:06.707: vmx| VMXVmdbCbVmVmxMigrate: Got SET callback for /vm/#_VMX/vmx/migrateState/cmd/##9_20ccc/op/=start

May 06 15:40:06.718: vmx| VmxMigrateGetStartParam: mid=485e65f566323 dstwid=165341

May 06 15:40:06.718: vmx| Received migrate 'start' request for mig id 1273124395311907, dest world id 165341.

May 06 15:40:06.720: vmx| WORKER: Creating new group with numThreads=1 (4)

May 06 15:40:06.723: vmx| CPT current = 0, requesting 1

May 06 15:40:06.724: vmx| MigrateStateUpdate: Transitioning from state 1 to 2.

May 06 15:40:06.725: vmx| Set thread 5 stack size to 1048576: Success

May 06 15:40:06.726: vcpu-0| Sync monModules(1).

May 06 15:40:06.729: vcpu-0| Done Sync monModules(1).

May 06 15:40:06.729: vmx| CPT: monitor ACKing mode 1

May 06 15:40:06.730: vmx| Destroying virtual dev for scsi0:0 vscsi=44625

May 06 15:40:06.730: vmx| VMMon_VSCSIStopVports: No such target on adapter

May 06 15:40:06.731: vmx| DISKLIB-VMFS : "/vmfs/volumes/46f9b29c-8959b018-376f-001aa025a31a/ns1/ns1-flat.vmdk" : closed.

May 06 15:40:06.732: vmx| CPT current = 1, requesting 2

May 06 15:40:06.733: vmx| CPT: monitor ACKing mode 2

May 06 15:40:07.553: vmx| CPT current = 2, requesting 6

May 06 15:40:07.563: vmx| Checkpoint_Unstun: vm stopped for 840003 us

May 06 15:40:07.564: vcpu-0| Sync monModules(6).

May 06 15:40:07.564: vcpu-0| Done Sync monModules(6).

May 06 15:40:07.564: vmx| CPT: monitor ACKing mode 6

May 06 15:40:07.569: vmx| DISK: OPEN scsi0:0 '/vmfs/volumes/46f9b29c-8959b018-376f-001aa025a31a/ns1/ns1.vmdk' persistent R[]

May 06 15:40:07.590: vmx| DISKLIB-VMFS : "/vmfs/volumes/46f9b29c-8959b018-376f-001aa025a31a/ns1/ns1-flat.vmdk" : open successful (10) size = 18376324096, hd = 101089547. Type 3

May 06 15:40:07.590: vmx| DISKLIB-DSCPTR: Opened : "ns1-flat.vmdk" (0xa)

May 06 15:40:07.590: vmx| DISKLIB-LINK : Opened '/vmfs/volumes/46f9b29c-8959b018-376f-001aa025a31a/ns1/ns1.vmdk' (0xa): vmfs, 35891258 sectors / 17.1 GB.

May 06 15:40:07.590: vmx| DISKLIB-LIB : Opened "/vmfs/volumes/46f9b29c-8959b018-376f-001aa025a31a/ns1/ns1.vmdk" (flags 0xa).

May 06 15:40:07.591: vmx| DISK: OPEN '/vmfs/volumes/46f9b29c-8959b018-376f-001aa025a31a/ns1/ns1.vmdk' Geo (2234/255/63) BIOS Geo (0/0/0)

May 06 15:40:07.592: vmx| DISKLIB-CTK : Auto blocksize for size 35891258 is 128.

May 06 15:40:07.603: vmx| DISKLIB-CBT : Initializing ESX kernel change tracking for fid 101089547.

May 06 15:40:07.604: vmx| DISKLIB-CBT : Successfuly created cbt node 606810b-cbt.

May 06 15:40:07.604: vmx| DISKLIB-CBT : Opening cbt node /vmfs/devices/cbt/606810b-cbt

May 06 15:40:07.605: vmx| Creating virtual dev for scsi0:0

May 06 15:40:07.605: vmx| DumpDiskInfo: scsi0:0 createType=11, capacity = 35891258, numLinks = 1, allocationType = 0

May 06 15:40:07.606: vmx| SCSIDiskESXPopulateVDevDesc: Using FS backend

May 06 15:40:07.653: Worker#0| DISKLIB-LIB : CREATE: "/vmfs/volumes/4ae53235-da7ef348-3f56-001e4f1dd095/ns1/ns1.vmdk" -- vmfs capacity=0 (0 bytes) adapter=buslogic cowGran=0

May 06 15:40:07.655: Worker#0| DISKLIB-LIB : DiskLib_CreateChildCreateParam: grain size set to : 1

May 06 15:40:07.656: Worker#0| DISKLIB-LIB : CREATE: "/vmfs/volumes/4ae53235-da7ef348-3f56-001e4f1dd095/ns1/ns1.vmdk" -- vmfs capacity=35891258 (17.1 GB) adapter=buslogic cowGran=1

May 06 15:40:07.730: Worker#0| DISKLIB-LINK : "/vmfs/volumes/4ae53235-da7ef348-3f56-001e4f1dd095/ns1/ns1.vmdk" : creation successful.

May 06 15:40:07.731: Worker#0| DISKLIB-VMFS : "/vmfs/volumes/4ae53235-da7ef348-3f56-001e4f1dd095/ns1/ns1-flat.vmdk" : open successful (17) size = 18376324096, hd = 0. Type 3

May 06 15:40:07.770: Worker#0| DISKLIB-VMFS : "/vmfs/volumes/4ae53235-da7ef348-3f56-001e4f1dd095/ns1/ns1-flat.vmdk" : closed.

May 06 15:40:07.773: Worker#0| DISKLIB-VMFS : "/vmfs/volumes/4ae53235-da7ef348-3f56-001e4f1dd095/ns1/ns1-flat.vmdk" : open successful (528) size = 18376324096, hd = 151765263. Type 3

May 06 15:40:07.773: Worker#0| DISKLIB-DSCPTR: Opened : "ns1-flat.vmdk" (0x210)

May 06 15:40:07.774: Worker#0| DISKLIB-LINK : Opened '/vmfs/volumes/4ae53235-da7ef348-3f56-001e4f1dd095/ns1/ns1.vmdk' (0x210): vmfs, 35891258 sectors / 17.1 GB.

May 06 15:40:07.774: Worker#0| DISKLIB-LIB : Opened "/vmfs/volumes/4ae53235-da7ef348-3f56-001e4f1dd095/ns1/ns1.vmdk" (flags 0x210).

May 06 15:40:07.817: Worker#0| DISKLIB-DDB : "longContentID" = "2594bb631353b27093f8f350dd11abc3" (was "329c79216ae7078d034051b559a20051")

May 06 15:40:07.831: Worker#0| DISKLIB-VMFS :CopyData '/vmfs/volumes/4ae53235-da7ef348-3f56-001e4f1dd095/ns1/ns1-flat.vmdk' : failed to move data (Cannot allocate memory:0xc0009).

May 06 15:40:07.831: Worker#0| DISKLIB-LIB : DiskLib_Clone: Could not perform clone using vmkernel data mover. Falling back to non-accelerated clone. Cannot allocate memory

May 06 15:40:11.880: vcpu-0| HBACommon: First write on scsi0:0.fileName='/vmfs/volumes/46f9b29c-8959b018-376f-001aa025a31a/ns1/ns1.vmdk'

May 06 15:40:11.898: vcpu-0| DISKLIB-CHAIN : UpdateContentID: old = 0xdd11abc3, new = 0x8ae3b0a2

May 06 15:40:11.899: vcpu-0| DISKLIB-DDB : "longContentID" = "b5392b788526fa1622fe17f58ae3b0a2" (was "2594bb631353b27093f8f350dd11abc3")

May 06 15:40:11.904: vcpu-0| DISKLIB-CHAIN : UpdateContentID: updated long content id

May 06 15:44:47.816: Worker#0| DISKLIB-LIB : numIOs = 50000 numMergedIOs = 0 numSplitIOs = 0

May 06 15:44:47.824: Worker#0| DISKLIB-LIB : numIOs = 50000 numMergedIOs = 0 numSplitIOs = 0

May 06 15:49:52.996: Worker#0| DISKLIB-LIB : numIOs = 100000 numMergedIOs = 0 numSplitIOs = 0

May 06 15:49:53.031: Worker#0| DISKLIB-LIB : numIOs = 100000 numMergedIOs = 0 numSplitIOs = 0

May 06 15:55:42.037: Worker#0| DISKLIB-LIB : numIOs = 150000 numMergedIOs = 0 numSplitIOs = 0

May 06 15:55:42.047: Worker#0| DISKLIB-LIB : numIOs = 150000 numMergedIOs = 0 numSplitIOs = 0


7 Replies
golddiggie
Champion

Are all four hosts identical? What are the specs of the hosts? Are you actually using local datastores for the VMs (on each host)? How are the networks configured for each host (vSwitch, pNIC and vNIC configurations, specifications)?

Storage vMotion is primarily for moving VMs from one LUN to another on shared storage/SANs. While you can (probably) use it to shift between hosts' local storage (as you can with half your hosts), it's not recommended, nor really designed with that in mind.
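If it helps to gather that info, the usual service console commands (assuming classic ESX 4.0) are along these lines:

# List vSwitches, port groups and their uplinks
esxcfg-vswitch -l

# List physical NICs with link state and speed
esxcfg-nics -l

# List Service Console network interfaces
esxcfg-vswif -l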

VMware VCP4

Consider awarding points for "helpful" and/or "correct" answers.

richardt
Contributor

More information:

All hosts are the same hardware: Dell R710. Storage is EMC CX3-20 and AX4-5.

Storage vMotion is between SAN volumes; all LUNs are visible to all hosts.
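(For what it's worth, this is roughly how to confirm that from each host's service console; commands assume classic ESX 4.0:)

# Map VMFS volumes to the devices backing them
esxcfg-scsidevs -m

# List all paths to the storage devices
esxcfg-mpath -l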

I ran a test with a VM:

  • VM on datastore "A" vMotions between all 4 hosts OK.

  • The VM is Storage vMotioned to datastore "B" using host "8".

  • VM on datastore "B" vMotions between all 4 hosts OK.

  • The VM fails Storage vMotion between the same datastores as above on 2 of the 4 hosts.

  • The error is the same on the 2 failing hosts.

The VM can run on all hosts using both datastores, but cannot Storage vMotion between datastores when on 2 of the hosts.

The vmkernel log from a failed Storage vMotion:

May 7 06:37:00 vm5 vmkernel: 144:14:21:37.944 cpu7:217926)FSR: 1136: 1273178212912056 S: Swap pre-copy iteration 0: 0 swapped pages copied

May 7 06:37:00 vm5 vmkernel: 144:14:21:37.944 cpu7:217926)FSR: 1148: 1273178212912056 S: Completed pre-copy iteration, with 0 changed pages. Changed page count less than threshold (4096), ready to stop.

May 7 06:37:00 vm5 vmkernel: 144:14:21:37.951 cpu1:217915)VSCSI: 6011: handle 69388(vscsi0:0):Destroying Device for world 217916 (pendCom 0)

May 7 06:37:01 vm5 vmkernel: 144:14:21:38.749 cpu4:217915)CBT: 1031: Created device 3b78f1e-cbt for cbt driver with filehandle 62361374

May 7 06:37:01 vm5 vmkernel: 144:14:21:38.750 cpu3:217915)VSCSI: 3472: handle 69389(vscsi0:0):Creating Virtual Device for world 217916 (FSS handle 53825311)

May 7 06:37:01 vm5 vmkernel: 144:14:21:38.963 cpu4:217927)FS3DM: 1595: status Out of memory copying 16777216 bytes from file offset 0 to file offset 0, bytesTransferred = 0 extentsTransferred: 0

esexon
Contributor

Found this link which may help.

http://communities.vmware.com/thread/196142

It talks about Service Console memory. Also, are all your hosts at the same revision?
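A quick way to compare hosts (run in each host's service console; commands assume classic ESX 4.0, where esxupdate is the patch tool):

# Show the ESX version and build number
vmware -v

# List installed patch bulletins
esxupdate query

# Check Service Console memory usage
free -m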

Evan

richardt
Contributor
Contributor

Got an answer from VMware support:

"This problem is caused by the VMFS3-DM

(Data Mover) having to use contiguous memory space on the host.

Apparently, when host's kernel memory

usage is >50% and memory has been fragmented, the DM cannot allocate more

space and throws the errors.

Note from the example above, the host

has been up for 161 days.

Resolution:-

Until the fix is published (scheduled in

upcoming releases), migrate sufficient VMs off the host to reduce memory

consumption and attempt the sVmotion again."
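In practice the workaround looks something like this (a sketch only; it assumes the classic ESX 4.0 service console, and the reboot step is my addition rather than support's exact wording):

# Check how long the host has been up; long uptimes make fragmentation more likely
uptime

# After migrating enough VMs off via vCenter, retry the Storage vMotion.
# If it still fails, evacuate the host and reboot to get unfragmented kernel memory back
# (on classic ESX, vmware-vim-cmd is the vimsh wrapper)
vmware-vim-cmd hostsvc/maintenance_mode_enter
reboot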

jhanekom
Virtuoso

Thanks for posting the workaround; it saved me quite a bit of time today. I had this once before and ended up cancelling the migration and forcing it to start on another host (shutting down the "faulty" host).

A bit of extra information: in my case today the host had only been up for 13 days, so don't feel too bad about leaving your hosts running for longer periods!

GarethAW
Contributor

Thanks Richard, you have saved us a few hours of digging around by posting your response.

Did VMware allocate you a bug reference that we can watch out for in the release notes?

Many Thanks,

Gareth

richardt
Contributor

Hi Gareth,

I didn't get a bug ref :(.

Richard Taylor

