OCCDave
Contributor

vMotion failing at 82% with "A general system error occurred: Source detected that destination failed to resume"

Hi,

I'm trying to diagnose a problem with our ESX 3.5 environment. One of the hosts is unable to vMotion any VMs off to other hosts. Each time, the process seems to start normally, but at 82%, the following message appears and the vMotion fails. "A general system error occurred: Source detected that destination failed to resume".

The cluster consists of 7 hosts, all using FibreChannel storage LUNs - we have no NFS partitions within the system.

The machines which are on this host need to be migrated off live (shutting down and migrating the VMs isn't really an option as they are pretty much "mission critical"), so rebooting the ESX host isn't possible until the machines are safely running elsewhere.

Any thoughts or help would be greatly appreciated.

10 Replies
f10
Expert

Reset the migrate.enabled value as described in article http://kb.vmware.com/kb/1013150 and let me know if this helps :)

If you found this or other information useful, please consider awarding points for "Correct" or "Helpful".

f10

VCP3,VCP4,HP UX CSA

Regards, Arun Pandey VCP 3,4,5 | VCAP-DCA | NCDA | HPUX-CSA | http://highoncloud.blogspot.in/ If you found this or other information useful, please consider awarding points for "Correct" or "Helpful".
OCCDave
Contributor

No, that didn't fix it, f10...

I reset the value on each of the 7 hosts and tried migrating to 3 of them and it failed every time at the same point - 82%, and with the same error.

f10
Expert

What version of ESX are you using? You may want to check http://kb.vmware.com/kb/1006052. The symptom is the same: VMotion fails at 82%, which indicates that the storage is inconsistent across the hosts. If the KB does not resolve the issue, we would have to take a look at /var/log/vmware/hostd.log.

All the best :)

f10
OCCDave
Contributor

Not the exact answer - but that was very helpful thanks... I found the log file to take a look through based on what you sent me - and found that the swapfiles were failing on transfer. Now I know why as well...

Last week, we had a server failure (IBM Blade) and had to have a motherboard replaced. A spare server was put into the ESX environment while the original was being fixed, and because we boot from FibreChannel, it all seemed to work fine - however, the INTERNAL drive of the replacement host had a different Hardware ID and so wasn't recognised by the other hosts in the cluster - hence we could migrate ONTO the problematic host, but not off.

I've now changed the swapfiles to be stored with the VM rather than in the host's store and the machines are migrating off. I'll be putting the original machine back into the cluster later today so hopefully we shouldn't see the problem again!

Thanks a million for your help f10 - like I said - greatly appreciated!

Dave.

f10
Expert

Hey Dave,

I am glad to hear that the issue is resolved. Time to celebrate :)

f10

OCCDave
Contributor

Ok - might be a red herring actually - I've put the original blade back in and I'm still getting the same errors at the same point. Attached is a dump of the TARGET hostd.log file for a failed vMotion - any ideas?

hostd-8 is from the target, esxhost03.

hostd-6 is from the source where all the problems are happening.

I can migrate machines around in any of the other hosts of the 7 in that cluster, but for some reason, things seem to be failing from host-1.

f10
Expert

Hi Dave,

On destination host

-=> Vmotion has been initiated

VMotionPrepare (1280912830082420): Sending 'from' srcIp=10.206.240.101 dstIp=10.206.240.103

-=> Error with which VMotion fails

ResolveCb: Failed with fault: (vmodl.fault.SystemError) {
   dynamicType = ,
   reason = "Failed to open the swap file.",
   msg = ""
}

State Transition (VM_STATE_IMMIGRATING -> VM_STATE_OFF)

On Source host VMotion fails with error

ResolveCb: Failed with fault: (vmodl.fault.SystemError) {
   dynamicType = ,
   reason = "Source detected that destination failed to resume.",
   msg = ""
}

State Transition (VM_STATE_EMIGRATING -> VM_STATE_ON)
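For anyone comparing source and destination logs by hand, the fault blocks and state transitions quoted above can be pulled out of a hostd.log with a short script. This is only a rough sketch in Python: the function name `summarize_hostd_log` and the `SAMPLE` text are made up for illustration, and the patterns are based solely on the lines quoted in this thread, so other hostd.log versions may need different expressions.

```python
import re

# Patterns derived only from the log lines quoted in this thread (assumption:
# other hostd.log versions may format these entries differently).
FAULT_RE = re.compile(r'Failed with fault: \(([\w.]+)\)')
REASON_RE = re.compile(r'reason = "([^"]*)"')
TRANSITION_RE = re.compile(r'State Transition \((\w+) -> (\w+)\)')


def summarize_hostd_log(text):
    """Return (faults, transitions) found in hostd.log text.

    faults is a list of (fault_type, reason) tuples; transitions is a list
    of (old_state, new_state) tuples, both in order of appearance.
    """
    faults = []
    transitions = []
    pending_fault = None  # fault type seen, still waiting for its reason line
    for line in text.splitlines():
        match = FAULT_RE.search(line)
        if match:
            pending_fault = match.group(1)
            continue
        match = REASON_RE.search(line)
        if match and pending_fault is not None:
            faults.append((pending_fault, match.group(1)))
            pending_fault = None
            continue
        match = TRANSITION_RE.search(line)
        if match:
            transitions.append((match.group(1), match.group(2)))
    return faults, transitions


# Hypothetical sample built from the destination-host lines quoted above.
SAMPLE = '''ResolveCb: Failed with fault: (vmodl.fault.SystemError) {
dynamicType = ,
reason = "Failed to open the swap file.",
msg = ""
}
State Transition (VM_STATE_IMMIGRATING -> VM_STATE_OFF)
'''

if __name__ == "__main__":
    found_faults, found_transitions = summarize_hostd_log(SAMPLE)
    for fault_type, reason in found_faults:
        print(fault_type, "->", reason)
    for old_state, new_state in found_transitions:
        print(old_state, "->", new_state)
```

Running it against the hostd.log from both hosts should make it easier to see that the destination reports the swap file fault first and the source only reports the knock-on failure.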

Does the VMotion fail for only one VM?

If it's for only one VM, try to remove the virtual disks and unregister the VM, then register the VM again and add the virtual disks back.

You may also power off the VM and unregister it. Then go to the console, rename the .vswp (swap) file to .vswp.old, and in the .vmx file clear the existing swap file location so that its value is left empty ("").
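The last step, blanking the swap file location in the .vmx file, could be sketched in Python roughly as below. The helper name `clear_vmx_value` is hypothetical, and so is the sample path. The key `sched.swap.derivedName` is an assumption about which .vmx entry holds the swap file location, since it is not named above, so check your own .vmx file before relying on it. The sketch only transforms text; it does not touch any file.

```python
def clear_vmx_value(vmx_text, key):
    """Return vmx_text with the value of `key` replaced by an empty string.

    Lines for other keys are returned unchanged. Only the text is
    transformed; reading and writing the .vmx file is left to the caller.
    """
    out = []
    for line in vmx_text.splitlines(keepends=True):
        name, sep, _value = line.partition("=")
        if sep and name.strip() == key:
            newline = "\n" if line.endswith("\n") else ""
            out.append(f'{name.rstrip()} = ""{newline}')
        else:
            out.append(line)
    return "".join(out)


# Hypothetical .vmx fragment; the datastore path is made up for illustration.
SAMPLE_VMX = (
    'displayName = "vm01"\n'
    'sched.swap.derivedName = "/vmfs/volumes/example/vm01.vswp"\n'
)

if __name__ == "__main__":
    # Assumed key name; confirm it against the actual .vmx file first.
    print(clear_vmx_value(SAMPLE_VMX, "sched.swap.derivedName"), end="")
```

As with the manual edit, this should only be applied to a copy of the .vmx file while the VM is powered off and unregistered.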

f10
OCCDave
Contributor

No...

Actually, it's really strange - I can migrate VMs into esxhost01, but I can't get them back out again without intervention... If I edit the settings of the running VM and select that the swapfile should be stored with the VM (overriding the default), then 8 out of 10 servers will migrate off to other hosts. Some will not move off though, even if I change that setting - and those VMs have to be shut down and brought back up on a different host to migrate them...

f10
Expert

I would suggest that you log a call with support.

Regards,

f10

BearHuntr
Contributor

I know that it's been a while, but was there any resolution to this issue for you?

I've started seeing exactly the same thing in my test environment after I started experimenting with moving the vswp files to a separate datastore from the rest of the VM files. I have 2 hosts in the cluster, and host #1 seems to be plagued with the same problem. I can vmotion off of host #2 onto host #1 with no problems, but trying to vmotion off of host #1 fails. Strangely, it only seems to affect the 5 Windows VMs; the one Linux VM I have can vmotion back and forth with no issue. If I edit the settings for the Windows VMs so that the vswp is stored with the VM files, it works for 2 of them. Then I found that if I disable HA on the cluster altogether, they all move. So, at the very least, I don't have to power anything down. However, I would hate for this to be the norm for every vmotion.

I was hoping to move all of my production servers' vswp files to a non-replicated datastore, but if this is the result in my production environment, that would be a disaster.
