VMware Cloud Community
dranik416
Contributor

vMotion failure

Hi There,

We've always had working vMotion in our cluster and suddenly it stopped working.

There must be a reason for that, but we just can't seem to find it.

What happens is as follows: some machines cannot be vMotioned. Cold migration always works.

All ESXi hosts are in the same cluster. The logs do not provide any details, and the error message is generic:

"Timed out waiting for migration start request. The vMotion failed because the destination host did not receive data from the source host on the vMotion network. Please check your vMotion network settings and physical network configuration and ensure they are correct."

Here is what we've checked:

1. All settings (VMkernel ports, IPs, switches, jumbo frames).

2. We checked all logs and found nothing that would indicate why the migration fails.

3. We turned VAAI off and tested vMotion.

For some reason some VMs can be vMotioned and some cannot. All Windows 7 VDI machines vMotion across hosts with no problem. However, the majority of VMs running Windows 2008 R2 or Windows 2012 cannot be vMotioned. This is something new.

Any ideas / suggestions are appreciated.

Thank you

20 Replies
RajeevVCP4
Expert

Are you able to vmkping the vMotion IP from source to destination and vice versa?

If yes, try to migrate using the Web Client.

Rajeev Chauhan
VCIX-DCV6.5/VSAN/VXRAIL
Please mark as helpful or correct if my answer is useful to you
dranik416
Contributor

vmkping works, but vMotion does not.

What do you mean by migration using Web client?

It does not work.

RJB3ST
Enthusiast

Hello,

Please see the following useful KB articles:

The first contains the exact error message when looking at the vmware.log file.

Performing vMotion fails at 14% despite vmkping succeeding from source to target IP address (2042654...

The second is just a handy KB article about each step of a vMotion and what to troubleshoot.

Understanding and troubleshooting vMotion (1003734) | VMware KB

I hope this is useful to you!

Kind Regards,

RJ

dranik416
Contributor

Thank you. I will try this. I went through it a few days ago, but not in every detail. Will try again.

RJB3ST
Enthusiast

Have the KB articles helped resolve the issue, or are you still having a problem?

Kind Regards,

RJ

a_nut_in
Expert

Try this:

1. On every host, check which port group is tagged for vMotion. If you have only one port group carrying both management traffic and vMotion, try unchecking vMotion on all nodes, re-enabling it, and testing again.

2. If the vMotion port group is separate, disable vMotion on it (uncheck) and try vMotion across the management port group instead to see if this works.

3. Why are *some* VMs working and some not? Is it across the same hosts that some VMs are able to vMotion? If so, what is the difference between the VMs that work and the ones that do not?

Do remember to mark my post as "helpful" or "correct" if I've helped resolve or answer your query!
dranik416
Contributor

Hi RJ,

We've tried; however, the problem is still there and we're kind of at a loss.

Thank you

PCTechStream
Hot Shot

Maybe this can help! ERROR: "Timed out waiting for migration start request. The vMotion failed because the destination host did not receive data from the source host on the vMotion network."

The above error indicates that the remote host did not accept the connection within the allowed time limit.

NOTE: It could be an issue with Jumbo frames and MTU settings on the NICs and switches.

Multi NIC vMotion with jumbo frames on directly connected ESXi

LINK: https://yuridejager.wordpress.com/2012/07/06/multi-nic-vmotion-with-jumbo-frames-on-directly-connect...

Performing vMotion fails despite vmkping succeeding from source to target IP address (2042654)

LINK: https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=20426...

Raul.

VMware VDI Administrator

http://ITCloudStream.com/

www.ITSA.Cloud
dranik416
Contributor

vMotion and Management use the same port group.

vMotion fails across all hosts for the same machine. In other words, if a machine fails to migrate, it fails across all hosts, whereas a machine that does migrate will migrate to any host in the cluster.

dranik416
Contributor

Thank you Raul.

We had jumbo frames enabled. Then we changed the settings back to MTU 1500, and that did not work.

We're now in the process of changing back to jumbo frames.

It's all frustrating. It seems there is something obvious that we are overlooking, but we don't know what, and the lack of information in the VMware logs does not help.

PCTechStream
Hot Shot

It's definitely a connection issue from the destination host to the source host. You must check every single VMkernel IP. I had the same issue when VMkernel ports on different vSwitches (VSS) got the same IPs. I fixed the problem by replacing them with new IPs, assigned in order, and creating an "Exclusion Range" in DHCP. Try that!
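If it helps, here is a minimal Python sketch of the duplicate-IP check described above. It assumes you have already collected the VMkernel IPs from every host into a list by hand or by script; the host labels and addresses below are made up for illustration and do not come from this thread.

```python
from collections import defaultdict


def find_duplicate_ips(vmk_ips):
    """Return {ip: [owners]} for every IP assigned to more than one VMkernel port.

    vmk_ips is an iterable of (owner_label, ip_string) pairs.
    """
    owners = defaultdict(list)
    for owner, ip in vmk_ips:
        owners[ip].append(owner)
    # Keep only the addresses that appear on two or more ports.
    return {ip: names for ip, names in owners.items() if len(names) > 1}


# Hypothetical inventory: (host/vmk label, IP address).
inventory = [
    ("esx01/vmk1", "10.0.10.11"),
    ("esx02/vmk1", "10.0.10.12"),
    ("esx03/vmk1", "10.0.10.11"),  # clashes with esx01/vmk1
]

duplicates = find_duplicate_ips(inventory)
print(duplicates)
```

Any address that shows up in the result is assigned to more than one VMkernel port and would be worth re-addressing.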

Raul.

VMware VDI Administrator

http://ITCloudStream.com/

www.ITSA.Cloud
dranik416
Contributor

I will do that; I'm working on it now.

By the way: did you have the same issue with some machines migrating from the same host and some not?

RJB3ST
Enthusiast

I would try (if you haven't already) creating a VMkernel port just for vMotion on its own separate vSwitch on each host. This way it's completely separate from everything else, and you can then enter all the configuration again and make sure the IPs, subnet masks, default gateway and possibly VLAN are correct.

For the MTU settings like Raul said, I guess you checked both the NIC and the physical switch?
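As a rough aid for the IP and subnet mask check, the following Python sketch uses the standard `ipaddress` module to confirm that a set of vMotion VMkernel addresses all fall in one subnet. The addresses are placeholders, not values from this thread.

```python
import ipaddress


def same_subnet(interfaces):
    """Return True if every 'ip/prefix' or 'ip/netmask' string is in one network."""
    networks = {ipaddress.ip_interface(i).network for i in interfaces}
    return len(networks) == 1


# Hypothetical vMotion VMkernel addresses, one per host.
vmotion_vmks = [
    "10.0.10.11/255.255.255.0",
    "10.0.10.12/24",
    "10.0.10.13/24",
]

print(same_subnet(vmotion_vmks))
```

A mistyped mask (for example /25 on one host and /24 on the rest) makes the function return False even when the addresses look similar.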

Kind Regards,

RJ

PCTechStream
Hot Shot

Yes! Some VMs migrated and others did not, from the same host, same cluster, same datacenter, same domain.

www.ITSA.Cloud
dranik416
Contributor

Oh, this is the same as what happens here. I will create exclusion ranges in DHCP.

Thank you! Will test and let you know.

dranik416
Contributor

Yes, RJ, we've tried that with two hosts.

I have 12 hosts in the production cluster. We took two of them, separated vMotion and Management on each, and tried migrating a VM from one host to the other. vMotion failed.

I am now checking every IP on the network. The hosts' IPs have been checked already, but everything needs to be double-checked.

Thank you

VladimirMihailo
Contributor

Hello everyone. I have the same issue as you. Did you fix it in the end?

Best regards

Vladimir

mguidini
Enthusiast

Check whether you have more than one VMkernel port enabled for vMotion. If you do, disable the others and keep just one vMotion vmk per host, all in the same subnet.

Try this to check your jumbo frames end to end:

vmkping -I <vmotion_vmk_number> -d -s 8000 <dst_vmkernel_ip>

vmkping -I <vmotion_vmk_number> -d -s 1300 <dst_vmkernel_ip>
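With `-d` (don't fragment) set, the largest ICMP payload that fits is the MTU minus 28 bytes (20 for the IPv4 header, 8 for the ICMP header), so 8972 for MTU 9000 and 1472 for MTU 1500. A small Python sketch that builds the matching command line follows; the `vmk1` name and the destination IP are placeholders, and only the `-I`, `-d` and `-s` flags shown above are used.

```python
IPV4_HEADER = 20  # bytes, IPv4 header without options
ICMP_HEADER = 8   # bytes, ICMP echo header


def max_vmkping_payload(mtu):
    """Largest -s value that fits in one unfragmented IPv4 ICMP packet."""
    return mtu - IPV4_HEADER - ICMP_HEADER


def vmkping_command(vmk, dst_ip, mtu):
    """Build a vmkping command that tests the full MTU with fragmentation disabled."""
    return f"vmkping -I {vmk} -d -s {max_vmkping_payload(mtu)} {dst_ip}"


# Placeholder interface name and destination address.
for mtu in (1500, 9000):
    print(vmkping_command("vmk1", "10.0.10.12", mtu))
```

If the full-size test fails while a smaller size such as 1300 succeeds, some device in the path is probably not passing the configured MTU.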

I would also try enabling vMotion only on the management vmk (usually vmk0).

Lastly, check that the vMotion port, TCP 8000, is allowed. It should be allowed in and out by the ESXi internal firewall, but also check any firewall or port filtering in your network.
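One hedged way to probe the port is a plain TCP connect test from a machine that can route to the vMotion network. This is only a sketch: it tests the path from wherever the script runs, not from the source host's vMotion vmk, so a success here does not prove that the host-to-host path is open.

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

For example, `tcp_port_open("<dst_vmkernel_ip>", 8000)` with your destination vMotion address filled in.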

SrVMwarer
Hot Shot

Hello,

What does vmkernel.log say around that timestamp?

Regards, İlyas