Hello all,
I have been looking into an issue that is happening only on a couple of ESXi hosts that are part of a cluster.
Any vMotion migration from other hosts to these ones fails at 14% with the following message:
WARNING: MigrateNet: 1309: 1458908406025172 S: failed to connect to remote host <x.x.x.x> from host <y.y.y.y>: Timeout
WARNING: Migrate: 269: 1458908406025172 S: Failed: The ESX hosts failed to connect over the VMotion network (0xbad010b) @0x41802ba56f9a
I checked the configuration and compared the values with other working ESXi hosts, as the following KB article describes:
The MTU, VMkernel settings, LAN settings, routing table and so on look identical to other working hosts in the same cluster.
I can even successfully ping the hosts over the vMotion network using the configured vmk interface.
I have been comparing the VMkernel logs while performing migrations from different ESXi hosts to identify differences, and I spotted the following:
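For reference, the connectivity test was done with vmkping, forcing the traffic out of the vMotion VMkernel interface (vmk1 here is just an example name; substitute the vmk actually tagged for vMotion on your hosts):

```shell
# Ping the destination host's vMotion IP, forcing the packet out of a
# specific VMkernel interface (vmk1 is an example; use your vMotion vmk).
vmkping -I vmk1 <dest-vmotion-ip>

# If jumbo frames are configured, also verify the MTU end to end:
# -d disables fragmentation, -s 8972 fills a 9000-byte MTU frame.
vmkping -I vmk1 -d -s 8972 <dest-vmotion-ip>
```

The `-d -s` variant is worth running even when a plain vmkping succeeds, since an MTU mismatch along the path can pass small pings but break vMotion traffic.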
- Between two ESXi hosts where vMotion works correctly:
Migrate: vm 747618: 3286: Setting VMOTION info: Dest ts = AAAAAAAAAAAA, src ip = <x.x.x.x> dest ip = <x.x.x.z> Dest wid = 0 using SHARED swap
The SRC and DST IP addresses belong to the same LAN, which (ironically) is not the vMotion network at all, but the management one.
- Between two ESXi hosts where vMotion does not work:
Migrate: vm 727726: 3286: Setting VMOTION info: Dest ts = AAAAAAAAAAAA, src ip = <x.x.x.x> dest ip = <y.y.y.y> Dest wid = 0 using SHARED swap
The SRC and DST IP addresses belong to different LANs: SRC is on the Management network and DST on the vMotion one.
I am running out of ideas. Does anyone know why I am seeing these differences?
Any help would be much appreciated.
Have you compared a traceroute to the vMotion network from both the good hosts and the bad hosts, to see which vmk each host is using and whether it differs from the other hosts?
Which vmk on each host is on your vMotion VLAN, and are they the only vmks that have the "Use this adapter for vMotion" setting checked? Maybe post screenshots of your settings so other eyes can have a look.
You don't by chance have any custom TCP/IP stacks configured on any of the hosts?
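As a quick way to answer that, the netstacks and the vmk-to-netstack mapping can be listed from the ESXi shell (standard esxcli commands; output will vary by release):

```shell
# List the TCP/IP netstacks defined on the host; anything beyond the
# built-in stacks would indicate a custom configuration.
esxcli network ip netstack list

# Show each VMkernel interface, its netstack, and its IP settings,
# to compare a good host against a bad one side by side.
esxcli network ip interface list
esxcli network ip interface ipv4 get
```

Running these on one working and one failing host and diffing the output is usually faster than eyeballing the UI.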
If the hosts are connected to different network switches, I would make sure the switch configurations are correct.
If you have a dedicated VMkernel interface just for vMotion, make sure no other VMkernel interface has vMotion traffic enabled/selected.
Yes, I tested that... a traceroute over the vMotion network shows the same info across the different hosts (one hop only).
I have several vmks in each host; only one of them is selected for vMotion...
It's true, though, that I have some vmk interfaces on the vMotion network that are not used for vMotion but for iSCSI. These do not have the vMotion checkbox selected (I checked several times...), but the iSCSI port binding option instead.
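The iSCSI port binding can be double-checked from the CLI as well, to confirm those vmks are bound only to the iSCSI adapter (vmhba33 is an example adapter name; list yours first):

```shell
# Find the software iSCSI adapter name on the host.
esxcli iscsi adapter list

# List the VMkernel ports bound to that adapter; only the intended
# iSCSI vmks should appear here.
esxcli iscsi networkportal list --adapter=vmhba33
```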
No, I don't use custom TCP/IP stacks...
The hosts are connected to the same dvSwitch and are part of the same port group.
Yes, there is only one VMKernel interface with the vMotion check enabled in each of the hosts.
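To rule out a stale or duplicate tag that the UI might not show, the per-interface service tags can also be read from the shell (vmk1 is an example; run it for each vmk — "VMotion" should appear on exactly one interface per host):

```shell
# Show the service tags assigned to a given VMkernel interface
# (available on releases with interface tagging support).
esxcli network ip interface tag get -i vmk1

# On older builds, the current vMotion network selection can be read with:
vim-cmd hostsvc/vmotion/netconfig_get
```

If the tag output disagrees between a good host and a bad host, that would line up with the src/dest IP mismatch in the VMkernel log above.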
- Between two ESXi hosts where vMotion does not work:
Migrate: vm 727726: 3286: Setting VMOTION info: Dest ts = AAAAAAAAAAAA, src ip = <x.x.x.x> dest ip = <y.y.y.y> Dest wid = 0 using SHARED swap
The SRC and DST IP addresses belong to different LANs: SRC is on the Management network and DST on the vMotion one.
As Richardson pointed out earlier, please check whether the source host has a VMkernel interface marked for both management and vMotion. Also double-check the vmk IP assignments themselves to make sure there is no mistake in those details.
Update:
Please also refer to the following official KB for some more input.