I am trying to Storage vMotion (SVmotion) some machines to new storage. Most worked fine, but I have 4 machines that continually fail. When trying to move one, an error pops up that says "Error caused by file /vmfs/volumes/<datastore>/<servername>/<config>.vmdk" (see attached image). I looked at the vmkernel and hostd logs. hostd.log doesn't really have any information. vmkernel.log has the following error:
2013-07-10T12:43:10.945Z cpu12:3494)FS3DM: 1814: status I/O error copying 1 extents between two files, bytesTransferred = 0 extentsTransferred: 0
I tried using Veeam, but it also fails, with the error "Error: Client error: VDDK error: 1.Unknown error Unable to retrieve next block transmission command. Number of already processed blocks: [8036]".
I also tried VMware Converter, but that also failed, with "FAILED: A general system error occurred: TaskInfo indicates error but no exception is specified". Copying the file with the datastore browser also fails. I even unregistered the VM and registered it again, but that didn't seem to help.
Any ideas on how to move these machines?
I see this type of problem quite regularly. This is what I do:
0. Assume your backup tool (Veeam or whatever) that uses CBT is doing a poor job: disable CBT for that VM, disable any existing backup jobs, and disconnect the backup tool from that datastore.
1. Check if the vmdk is locked. If there is no lock, or only one from 00000..., reboot the host that has the VM registered.
2. If vmkfstools -i fails: create a snapshot so that the partly damaged vmdk will only be used read-only, then use vmkfstools -i suspicious-000001.vmdk fixed.vmdk.
3. If the VM still boots: use a Linux LiveCD and clone the vmdks with dd to a newly created vmdk.
4. Try to clone the vmdks with Converter; also use an empty snapshot of the VM to work with.
5. If 1 - 4 fail: read the VMFS volume with vmfs-fuse from Linux and use gddrescue to copy the vmdk to USB or network.
6. If 5 fails and the vmdk is thick provisioned: find the start and end point of the vmdk on the volume and copy that area with gddrescue from a Linux LiveCD.
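The first two steps can be sketched from the ESXi shell like this. All datastore, VM, and file names below are placeholders, not taken from this thread:

```shell
# 1. Check whether the vmdk is locked and, if so, which host owns the lock
#    (substitute your own datastore and VM names):
vmkfstools -D /vmfs/volumes/datastore1/myvm/myvm-flat.vmdk
# An "owner" field of all zeros in the output means no host holds an
# exclusive lock on the file.

# 2. Take a snapshot of the VM so the damaged base disk is only opened
#    read-only, then clone the snapshot chain into a fresh disk:
vmkfstools -i /vmfs/volumes/datastore1/myvm/myvm-000001.vmdk \
              /vmfs/volumes/datastore2/myvm/fixed.vmdk
```

This is only a sketch of the approach described above; vmkfstools -i still aborts if it hits an unreadable sector, which is why the later steps fall back to Linux tools.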
Can you check that the 4 VMs are not currently doing any writes to their disks? I know it might sound clumsy, but the reason I am asking is that
bytesTransferred = 0 extentsTransferred: 0
might indicate that no svmotion started at ALL. Also, do the 4 VMs have any disk types different from the others, or any snapshots?
One VM that I am testing with is turned off, so it isn't writing to its disk at all. The VMs don't have any snapshots, and the virtual disks are all standard (Thick Provision Lazy Zeroed).
This is what I am seeing in the vmkernel log just before the "status: I/O error" line:
2013-07-10T12:43:10.929Z cpu10:2058)ScsiDeviceIO: 2316: Cmd(0x412441b19e00) 0x28, CmdSN 0x2ae9a67 to dev "naa.60060160c20c1d01bc22cae7a81ae111" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x3 0x11 0x0.
H:0x0 D:0x2 P:0x0 Valid sense data: 0x3 0x11 0x0 -- D:0x2 is a SCSI CHECK CONDITION from the device, sense key 0x3 is MEDIUM ERROR, and ASC/ASCQ 0x11/0x00 is UNRECOVERED READ ERROR. This sense key may also be returned if the device server is unable to distinguish between a flaw in the medium and a specific hardware failure.
1. I am not sure whether naa.60060160c20c1d01bc22cae7a81ae111 is the source or the destination datastore, or whether it is relevant at all. You can find this out from the vSphere client itself by going to the datastore properties.
2. At timestamp 2013-07-10T12:43:10.929 I don't see any failures (Storage vMotion failure) in hostd.log.
3. Also, when I search hostd.log I do not see any entries for the VM jag-lic. This might happen if we do not have the logs from the time when the issue happened.
4. What is the source and destination datastore type?
5. Is there any other VM on the source datastore which successfully migrated?
6. Is there any other VM which successfully migrated to the same destination datastore?
OK, let's see:
1. Can you try to vMotion these machines to a different host in the cluster, and then try a Storage vMotion to the target datastore?
2. Also, check the MPP for the target datastore on the ESX host that is presently hosting the VM.
3. The last thing I would check is the FAs presented to the target LUN on your SAN. If there are 2, ask your SAN admin if he can give you 4.
I have done tons of Storage vMotions from DMX to VMAX on vSphere and at times found issues ranging from the SAN to the host to the VM for failing Storage vMotions. Let me know if this helps.
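For point 2, the multipathing setup for the device backing the target datastore can be inspected from the ESXi 5.x shell. The device ID below is a placeholder, not one of the LUNs from this thread:

```shell
# Show which multipathing plugin (NMP by default) and which path
# selection policy own a given device:
esxcli storage nmp device list -d naa.xxxxxxxxxxxxxxxx

# List every path to the device and its state (active/dead/standby):
esxcli storage core path list -d naa.xxxxxxxxxxxxxxxx
```

A dead or flapping path to the target LUN would be one storage-side explanation for copies failing partway through.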
AT10:
1. This is the source datastore.
2 and 3. The hostd log file is from when SVmotion failed. I grabbed it off of the server about 30 seconds after the error.
4. I'm not sure what you are asking. The source is a Dell (EMC) AX-150 connected via FC, with two connections to each server. The destination is an EMC AX4-5, also connected via FC with two connections to each server. They are all direct connections; there isn't an FC switch.
5. Yes, there are other VMs that were on the source that migrated fine.
6. Yes, VMs were successfully migrated to the destination datastores.
AMANVCP:
1. I can vMotion the machines between hosts, but Storage vMotion doesn't work from either host for these VMs.
2. What is MPP?
3. I don't understand what you are saying.
I see in hostd.log that you enabled SSH on that host. Did you enable SSH after the migration or before the migration?
After.
You said SVmotion worked well for some VMs and only a few are having the problem.
Are you sure your vmdk is healthy? You can check by simply copying it to another (or the same) datastore; if you still cannot copy it, then your vmdk might be corrupted.
They seem to be corrupt, as a copy doesn't work. Is there a way to fix them?
A failed copy won't necessarily tell you whether the vmdk is corrupted. If the copy is not working, the problem seems to be on the storage side. If the vmdk itself were corrupted, you would not be able to power on the VM. I would say check with your storage vendor to see if there is any problem. As I said before, 0x3 0x11 0x0 is not a good sign.
I checked your kernel log again; it's filled with this sense code, and I also see aborts for that datastore, which is also not a good sign. I'm not sure how some VMs migrated successfully.
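A quick way to see how widespread this is, from the ESXi shell. The log path is the ESXi 5.x default; older releases keep the vmkernel log elsewhere:

```shell
# Count occurrences of the medium-error sense code in the vmkernel log:
grep -c "0x3 0x11 0x0" /var/log/vmkernel.log

# Show any aborts issued against the source device:
grep "naa.60060160c20c1d01bc22cae7a81ae111" /var/log/vmkernel.log | grep -i abort
```

If the count keeps growing during a copy attempt, the array is returning unrecovered read errors live, not just logging stale history.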
If you can power off the VM, can you try this command and post the error message:
vmkfstools -v 0 -t 0 <vm.vmdk>
This may take a long time depending on the size of the disk; it will scan the complete flat disk.
It only ran for a couple seconds, this was the output:
/vmfs/volumes/4ed6788c-87fd085e-2102-782bcb544683/jag-lic # vmkfstools -v 0 -t 0 jag-lic.vmdk
DISKLIB-VMFS : "./jag-lic-flat.vmdk" : open successful (14) size = 42949672960, hd = 111480549. Type 3
DISKLIB-DSCPTR: Opened [0]: "jag-lic-flat.vmdk" (0xe)
DISKLIB-LINK : Opened 'jag-lic.vmdk' (0xe): vmfs, 83886080 sectors / 40 GB.
DISKLIB-LIB : Resuming change tracking.
DISKLIB-LIB : Opened "jag-lic.vmdk" (flags 0xe, type vmfs).
Mapping for file jag-lic.vmdk (42949672960 bytes in size):
[ 0: 23068672] --> [VMFS -- LVID:4ed6788a-32b093d0-d527-782bcb54468 3/4ed6788a-070e711e-ea04-782bcb544683/1:( 376647450624 --> 376670519296)]
[ 23068672: 5242880] --> [VMFS Z- LVID:4ed6788a-32b093d0-d527-782bcb54468 3/4ed6788a-070e711e-ea04-782bcb544683/1:( 376670519296 --> 376675762176)]
[ 28311552: 3145728] --> [VMFS Z- LVID:4ed6788a-32b093d0-d527-782bcb54468 3/4ed6788a-070e711e-ea04-782bcb544683/1:( 376676810752 --> 376679956480)]
[ 31457280: 5242880] --> [VMFS -- LVID:4ed6788a-32b093d0-d527-782bcb54468 3/4ed6788a-070e711e-ea04-782bcb544683/1:( 376679956480 --> 376685199360)]
[ 36700160: 9437184] --> [VMFS Z- LVID:4ed6788a-32b093d0-d527-782bcb54468 3/4ed6788a-070e711e-ea04-782bcb544683/1:( 376685199360 --> 376694636544)]
[ 46137344: 58720256] --> [VMFS Z- LVID:4ed6788a-32b093d0-d527-782bcb54468 3/4ed6788a-070e711e-ea04-782bcb544683/1:( 1289955049472 --> 1290013769728)]
[ 104857600: 12050235392] --> [VMFS -- LVID:4ed6788a-32b093d0-d527-782bcb54468 3/4ed6788a-070e711e-ea04-782bcb544683/1:( 1290013769728 --> 1302064005120)]
[ 12155092992: 26214400] --> [VMFS Z- LVID:4ed6788a-32b093d0-d527-782bcb54468 3/4ed6788a-070e711e-ea04-782bcb544683/1:( 1302064005120 --> 1302090219520)]
[ 12181307392: 13619953664] --> [VMFS -- LVID:4ed6788a-32b093d0-d527-782bcb54468 3/4ed6788a-070e711e-ea04-782bcb544683/1:( 1302090219520 --> 1315710173184)]
[ 25801261056: 17146314752] --> [VMFS Z- LVID:4ed6788a-32b093d0-d527-782bcb54468 3/4ed6788a-070e711e-ea04-782bcb544683/1:( 1315710173184 --> 1332856487936)]
[ 42947575808: 1048576] --> [VMFS -- LVID:4ed6788a-32b093d0-d527-782bcb54468 3/4ed6788a-070e711e-ea04-782bcb544683/1:( 1332856487936 --> 1332857536512)]
[ 42948624384: 1048576] --> [VMFS Z- LVID:4ed6788a-32b093d0-d527-782bcb54468 3/4ed6788a-070e711e-ea04-782bcb544683/1:( 1332857536512 --> 1332858585088)]
DISKLIB-VMFS : "./jag-lic-flat.vmdk" : closed.
AIOMGR-U : stat o=1 r=0 w=0 i=13 br=0 bw=0
AIOMGR-S : stat o=1 r=3 w=0 i=0 br=49152 bw=0
I also tried vmkfstools -i jag-lic.vmdk test.vmdk, and it failed at 45% with an Input/output error (327689).
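Since vmkfstools -i aborts on the first unreadable sector, the vmfs-fuse + gddrescue route suggested earlier may get further, because ddrescue skips bad sectors and records them in a map file for later retries. The device path and mount points below are assumptions; adjust them to your setup:

```shell
# On a Linux LiveCD with vmfs-tools and gddrescue installed.
# /dev/sdb1 stands for the partition holding the VMFS volume.
mkdir -p /mnt/vmfs /mnt/target
vmfs-fuse /dev/sdb1 /mnt/vmfs

# First pass copies everything readable and logs bad areas in the map
# file; re-running the same command later retries only the bad areas:
ddrescue -v /mnt/vmfs/jag-lic/jag-lic-flat.vmdk \
            /mnt/target/jag-lic-flat.vmdk \
            /mnt/target/jag-lic.map
```

Whatever falls in the unreadable 55% region stays zeroed in the copy, but a disk that is mostly intact will often still boot or at least mount.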
I assume you pasted the complete output of vmkfstools. I don't see any errors in it, which means your vmdk descriptor and flat vmdk look fine. As I said before, this seems to be something on your storage side. Could you please check in your storage management console whether any error or warning is showing up?
I see two LUN IDs in the logs.
Does your VM span across 2 LUNs? Do you have any LUN extents? Please check these two and try:
naa.60060160c20c1d01bc22cae7a81ae111
naa.600508e000000000780ee98a724e3203
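One way to check whether a datastore spans multiple LUN extents, from the ESXi shell. The datastore name here is a placeholder:

```shell
# vmkfstools -Ph prints the extents backing a VMFS volume; a spanned
# datastore lists more than one "naa." device under "Partitions spanned":
vmkfstools -Ph /vmfs/volumes/datastore1

# On ESXi 5.x, esxcli can list the extents of all VMFS datastores at once:
esxcli storage vmfs extent list
```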
I don't see any errors on either AX.
naa.60060160c20c1d01bc22cae7a81ae111 - The source LUN
naa.600508e000000000780ee98a724e3203 - The local datastore on the ESX server, nothing is stored on it.
The VM is on a single LUN. We don't have any LUN extents either.
Is it possible for you to check with your storage vendor to make sure everything is fine?