I am trying to Storage vMotion (SVmotion) some machines to new storage. Most worked fine, but I have 4 machines that continually fail. When trying to move one, an error pops up that says "Error caused by file /vmfs/volumes/<datastore>/<servername>/<config>.vmdk" (see attached image). I looked at the vmkernel and hostd logs. hostd.log doesn't really have any information. vmkernel.log has the following error:
2013-07-10T12:43:10.945Z cpu12:3494)FS3DM: 1814: status I/O error copying 1 extents between two files, bytesTransferred = 0 extentsTransferred: 0
I tried using Veeam, but it also fails, with the error "Error: Client error: VDDK error: 1.Unknown error Unable to retrieve next block transmission command. Number of already processed blocks: [8036]".
I also tried VMware Converter, but that also failed, with "FAILED: A general system error occurred: TaskInfo indicates error but no exception is specified". Copying the file with the datastore browser also fails. I even unregistered the VM and registered it again, but that didn't seem to help.
Any ideas on how to move these machines?
I see this type of problem quite regularly. This is what I do:
0. Assume your backup tool (Veeam or whatever) that uses CBT is doing a poor job: disable CBT for that VM, disable any existing backup jobs, and disconnect the backup tool from that datastore.
1. Check if the vmdk is locked. If there is no lock, or only one from 00000..., reboot the host that has the VM registered.
2. If vmkfstools -i fails: create a snapshot so that the partly damaged vmdk will only be used read-only, then use vmkfstools -i suspicious-000001.vmdk fixed.vmdk.
3. If the VM still boots: use a Linux LiveCD and clone the vmdks with dd to a newly created vmdk.
4. Try to clone the vmdks with Converter; also use an empty snapshot of the VM to work with.
5. If 1 - 4 fail: read the VMFS volume with vmfs-fuse from Linux and use gddrescue to copy the vmdk to USB or network.
6. If 5 fails and the vmdk is thick provisioned: find the start and end point of the vmdk on the volume and copy that area with gddrescue from a Linux LiveCD.
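The first two steps can be sketched from the ESXi shell like this. All datastore, VM, and file names below are placeholders, not taken from this thread:

```shell
# 1. Check whether the vmdk is locked and, if so, which host owns the lock
#    (substitute your own datastore and VM names):
vmkfstools -D /vmfs/volumes/datastore1/myvm/myvm-flat.vmdk
# An "owner" field of all zeros in the output means no host holds an
# exclusive lock on the file.

# 2. Take a snapshot of the VM so the damaged base disk is only opened
#    read-only, then clone the snapshot chain into a fresh disk:
vmkfstools -i /vmfs/volumes/datastore1/myvm/myvm-000001.vmdk \
              /vmfs/volumes/datastore2/myvm/fixed.vmdk
```

This is only a sketch of the approach described above; vmkfstools -i still aborts if it hits an unreadable sector, which is why the later steps fall back to Linux tools.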
Can you check that the 4 VMs are not currently doing any writes to their disks? I know it might sound clumsy, but the reason I am asking is that
bytesTransferred = 0 extentsTransferred: 0
might indicate that no svmotion started at ALL. Also, do the 4 VMs have any disk types different from the others, or any snapshots?
One VM that I am testing with is turned off, so it isn't writing to its disk at all. The VMs don't have any snapshots, and the virtual disks are all standard (Thick Provision Lazy Zeroed).
This is what I am seeing in the vmkernel log just before the "status: I/O error" line:
2013-07-10T12:43:10.929Z cpu10:2058)ScsiDeviceIO: 2316: Cmd(0x412441b19e00) 0x28, CmdSN 0x2ae9a67 to dev "naa.60060160c20c1d01bc22cae7a81ae111" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x3 0x11 0x0.
H:0x0 D:0x2 P:0x0 Valid sense data: 0x3 0x11 0x0 -- D:0x2 is a SCSI CHECK CONDITION from the device, sense key 0x3 is MEDIUM ERROR, and ASC/ASCQ 0x11/0x00 is UNRECOVERED READ ERROR. This sense key may also be returned if the device server is unable to distinguish between a flaw in the medium and a specific hardware failure.
1. I am not sure whether naa.60060160c20c1d01bc22cae7a81ae111 is the source or the destination datastore, or whether it is relevant at all. You can find this out from the vSphere client itself by going to the datastore properties.
2. At timestamp 2013-07-10T12:43:10.929 I don't see any failures (Storage vMotion failure) in hostd.log.
3. Also, when I search hostd.log I do not see any entries for the VM jag-lic. This might happen if we do not have the logs from the time when the issue happened.
4. What is the source and destination datastore type?
5. Is there any other VM on the source datastore which successfully migrated?
6. Is there any other VM which successfully migrated to the same destination datastore?
OK, let's see:
1. Can you try to vMotion these machines to a different host in the cluster, and then try a Storage vMotion to the target datastore?
2. Also, check the MPP for the target datastore on the ESX host that is presently hosting the VM.
3. The last thing I would check is the FAs presented to the target LUN on your SAN. If there are 2, ask your SAN admin if he can give you 4.
I have done tons of Storage vMotions from DMX to VMAX on vSphere and at times found issues ranging from the SAN to the host to the VM for failing Storage vMotions. Let me know if this helps.
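For point 2, the multipathing setup for the device backing the target datastore can be inspected from the ESXi 5.x shell. The device ID below is a placeholder, not one of the LUNs from this thread:

```shell
# Show which multipathing plugin (NMP by default) and which path
# selection policy own a given device:
esxcli storage nmp device list -d naa.xxxxxxxxxxxxxxxx

# List every path to the device and its state (active/dead/standby):
esxcli storage core path list -d naa.xxxxxxxxxxxxxxxx
```

A dead or flapping path to the target LUN would be one storage-side explanation for copies failing partway through.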
AT10:
1. This is the source datastore.
2 and 3. The hostd log file is from when SVmotion failed. I grabbed it off of the server about 30 seconds after the error.
4. I'm not sure what you are asking. The source is a Dell (EMC) AX-150 connected via FC, with two connections to each server. The destination is an EMC AX4-5, also connected via FC with two connections to each server. They are all direct connections; there isn't an FC switch.
5. Yes, there are other VMs that were on the source that migrated fine.
6. Yes, VMs were successfully migrated to the destination datastores.
AMANVCP:
1. I can vMotion the machines between hosts, but Storage vMotion doesn't work from either host for these VMs.
2. What is MPP?
3. I don't understand what you are saying.
I see in hostd.log that you enabled SSH on that host. Did you enable SSH after the migration or before the migration?
After.
You said SVmotion worked well for some VMs and only a few are having the problem.
Are you sure your vmdk is healthy? You can check by simply copying it to another (or the same) datastore; if you still cannot copy it, then your vmdk might be corrupted.
They seem to be corrupt, as a copy doesn't work. Is there a way to fix them?
A failed copy won't necessarily tell you whether the vmdk is corrupted. If the copy is not working, the problem seems to be on the storage side. If the vmdk itself were corrupted, you would not be able to power on the VM. I would say check with your storage vendor to see if there is any problem. As I said before, 0x3 0x11 0x0 is not a good sign.
I checked your kernel log again; it's filled with this sense code, and I also see aborts for that datastore, which is also not a good sign. I'm not sure how some VMs migrated successfully.
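A quick way to see how widespread this is, from the ESXi shell. The log path is the ESXi 5.x default; older releases keep the vmkernel log elsewhere:

```shell
# Count occurrences of the medium-error sense code in the vmkernel log:
grep -c "0x3 0x11 0x0" /var/log/vmkernel.log

# Show any aborts issued against the source device:
grep "naa.60060160c20c1d01bc22cae7a81ae111" /var/log/vmkernel.log | grep -i abort
```

If the count keeps growing during a copy attempt, the array is returning unrecovered read errors live, not just logging stale history.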
If you can power off the VM, can you try this command and post the error message:
vmkfstools -v 0 -t 0 <vm.vmdk>
This may take a long time depending on the size of the disk; it will scan the complete flat disk.
It only ran for a couple seconds, this was the output:
/vmfs/volumes/4ed6788c-87fd085e-2102-782bcb544683/jag-lic # vmkfstools -v 0 -t 0 jag-lic.vmdk
DISKLIB-VMFS : "./jag-lic-flat.vmdk" : open successful (14) size = 42949672960, hd = 111480549. Type 3
DISKLIB-DSCPTR: Opened [0]: "jag-lic-flat.vmdk" (0xe)
DISKLIB-LINK : Opened 'jag-lic.vmdk' (0xe): vmfs, 83886080 sectors / 40 GB.
DISKLIB-LIB : Resuming change tracking.
DISKLIB-LIB : Opened "jag-lic.vmdk" (flags 0xe, type vmfs).
Mapping for file jag-lic.vmdk (42949672960 bytes in size):
[ 0: 23068672] --> [VMFS -- LVID:4ed6788a-32b093d0-d527-782bcb54468 3/4ed6788a-070e711e-ea04-782bcb544683/1:( 376647450624 --> 376670519296)]
[ 23068672: 5242880] --> [VMFS Z- LVID:4ed6788a-32b093d0-d527-782bcb54468 3/4ed6788a-070e711e-ea04-782bcb544683/1:( 376670519296 --> 376675762176)]
[ 28311552: 3145728] --> [VMFS Z- LVID:4ed6788a-32b093d0-d527-782bcb54468 3/4ed6788a-070e711e-ea04-782bcb544683/1:( 376676810752 --> 376679956480)]
[ 31457280: 5242880] --> [VMFS -- LVID:4ed6788a-32b093d0-d527-782bcb54468 3/4ed6788a-070e711e-ea04-782bcb544683/1:( 376679956480 --> 376685199360)]
[ 36700160: 9437184] --> [VMFS Z- LVID:4ed6788a-32b093d0-d527-782bcb54468 3/4ed6788a-070e711e-ea04-782bcb544683/1:( 376685199360 --> 376694636544)]
[ 46137344: 58720256] --> [VMFS Z- LVID:4ed6788a-32b093d0-d527-782bcb54468 3/4ed6788a-070e711e-ea04-782bcb544683/1:( 1289955049472 --> 1290013769728)]
[ 104857600: 12050235392] --> [VMFS -- LVID:4ed6788a-32b093d0-d527-782bcb54468 3/4ed6788a-070e711e-ea04-782bcb544683/1:( 1290013769728 --> 1302064005120)]
[ 12155092992: 26214400] --> [VMFS Z- LVID:4ed6788a-32b093d0-d527-782bcb54468 3/4ed6788a-070e711e-ea04-782bcb544683/1:( 1302064005120 --> 1302090219520)]
[ 12181307392: 13619953664] --> [VMFS -- LVID:4ed6788a-32b093d0-d527-782bcb54468 3/4ed6788a-070e711e-ea04-782bcb544683/1:( 1302090219520 --> 1315710173184)]
[ 25801261056: 17146314752] --> [VMFS Z- LVID:4ed6788a-32b093d0-d527-782bcb54468 3/4ed6788a-070e711e-ea04-782bcb544683/1:( 1315710173184 --> 1332856487936)]
[ 42947575808: 1048576] --> [VMFS -- LVID:4ed6788a-32b093d0-d527-782bcb54468 3/4ed6788a-070e711e-ea04-782bcb544683/1:( 1332856487936 --> 1332857536512)]
[ 42948624384: 1048576] --> [VMFS Z- LVID:4ed6788a-32b093d0-d527-782bcb54468 3/4ed6788a-070e711e-ea04-782bcb544683/1:( 1332857536512 --> 1332858585088)]
DISKLIB-VMFS : "./jag-lic-flat.vmdk" : closed.
AIOMGR-U : stat o=1 r=0 w=0 i=13 br=0 bw=0
AIOMGR-S : stat o=1 r=3 w=0 i=0 br=49152 bw=0
I also tried vmkfstools -i jag-lic.vmdk test.vmdk, and it failed at 45% with an Input/output error (327689).
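Since vmkfstools -i aborts on the first unreadable sector, the vmfs-fuse + gddrescue route suggested earlier may get further, because ddrescue skips bad sectors and records them in a map file for later retries. The device path and mount points below are assumptions; adjust them to your setup:

```shell
# On a Linux LiveCD with vmfs-tools and gddrescue installed.
# /dev/sdb1 stands for the partition holding the VMFS volume.
mkdir -p /mnt/vmfs /mnt/target
vmfs-fuse /dev/sdb1 /mnt/vmfs

# First pass copies everything readable and logs bad areas in the map
# file; re-running the same command later retries only the bad areas:
ddrescue -v /mnt/vmfs/jag-lic/jag-lic-flat.vmdk \
            /mnt/target/jag-lic-flat.vmdk \
            /mnt/target/jag-lic.map
```

Whatever falls in the unreadable 55% region stays zeroed in the copy, but a disk that is mostly intact will often still boot or at least mount.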
I assume you pasted the complete output of vmkfstools. I don't see any errors in it, which means your vmdk descriptor and flat vmdk look fine. As I said before, this seems to be something on your storage side. Could you please check in your storage management console whether any error or warning is showing up?
I see two LUN IDs in the logs.
Does your VM span across 2 LUNs? Do you have any LUN extents? Please check these two and try:
naa.60060160c20c1d01bc22cae7a81ae111
naa.600508e000000000780ee98a724e3203
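One way to check whether a datastore spans multiple LUN extents, from the ESXi shell. The datastore name here is a placeholder:

```shell
# vmkfstools -Ph prints the extents backing a VMFS volume; a spanned
# datastore lists more than one "naa." device under "Partitions spanned":
vmkfstools -Ph /vmfs/volumes/datastore1

# On ESXi 5.x, esxcli can list the extents of all VMFS datastores at once:
esxcli storage vmfs extent list
```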
I don't see any errors on either AX.
naa.60060160c20c1d01bc22cae7a81ae111 - The source LUN
naa.600508e000000000780ee98a724e3203 - The local datastore on the ESX server, nothing is stored on it.
The VM is on a single LUN. We don't have any LUN extents either.
Is it possible for you to check with your storage vendor to make sure everything is fine?