I am trying to do a compute and storage vMotion (svMotion) of a multi-TB VM (FTT=1) from one vSAN cluster to another over a 1 Gbit uplink.
After several hours (around 10 hours, and more than 50% complete) it fails with the event: "Migration to host xx.xx.xx.xx failed with error Connection closed by remote host, possibly due to timeout (195887167)".
Source vSAN is 6.2U2 and destination vSAN is 6.7U2
vmkping -I shows no ping loss from the source vmk to the destination vmk when I test: RTT min=0.173 avg=0.460 max=1.218 ms.
Both clusters/hosts are connected to the same switch.
The VM is not busy but still powered on; would it help if I tried again with the VM powered off?
I selected "Schedule vMotion with high priority (recommended)" - not sure if selecting "Schedule regular vMotion" instead would help?
I am thinking this might be saturating the uplink during the migration and timing out. Could that be the case, and if so, wouldn't it just lower the transfer speed rather than fail?
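For a sense of whether the 1 Gbit uplink alone explains the ~10-hour runtime, a rough back-of-the-envelope sketch (the 4 TB figure and 80% link efficiency are illustrative assumptions, not measured values):

```python
def transfer_hours(size_tb: float, link_gbit: float, efficiency: float = 0.8) -> float:
    """Estimate wall-clock hours to move size_tb terabytes over a
    link_gbit gigabit/s link at the given utilisation efficiency."""
    size_bits = size_tb * 1e12 * 8            # TB -> bits (decimal units)
    usable_bps = link_gbit * 1e9 * efficiency  # effective bits/s on the wire
    return size_bits / usable_bps / 3600

# A hypothetical 4 TB VM over a saturated 1 Gbit uplink:
print(f"{transfer_hours(4, 1):.1f} hours")  # roughly 11 hours
```

So a multi-TB move genuinely needs many hours of sustained, uninterrupted transfer on this link, which is why even a brief glitch late in the copy is so costly.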
Any specific logs I should check or any other methods I can try to migrate?
I tested a migration (compute and svMotion) of a very small VM from the same cluster (same source host) to the same destination host in the other cluster and it worked fine, so I think it must be failing because of its size.
On the source host's vmkernel log I see events like:
S: failed to read stream keepalive: Connection reset by peer
S: Migration considered a failure by the VMX. It is most likely a timeout...
Destroying Device for world xxxxxxxx
Destroying Device for world xxxxxxxx
disabled port xxxxxxxxxxxx
XVMotion: 2479: Timed out while waiting for disk 2's queue count to drop below the minimum limit of 32768 blocks. This could indicate network or storage problems...
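On the "which logs" question: the lines above come from /var/log/vmkernel.log on the source host; hostd.log and the VM's own vmware.log are also worth a look. A small sketch for sifting a saved log copy (the helper and its keyword list are my own illustration, not a VMware tool):

```python
# Keywords commonly seen in migration-related vmkernel/vmware.log lines (assumption).
MIGRATION_KEYWORDS = ("XVMotion", "Migration", "vMotion", "keepalive", "timeout")

def find_migration_errors(lines):
    """Return log lines mentioning any migration-related keyword (case-insensitive)."""
    return [ln for ln in lines
            if any(k.lower() in ln.lower() for k in MIGRATION_KEYWORDS)]

sample = [
    "2019-07-01T10:00:00Z cpu3: S: failed to read stream keepalive: Connection reset by peer",
    "2019-07-01T10:00:01Z cpu1: unrelated storage heartbeat",
    "2019-07-01T10:00:02Z cpu3: XVMotion: 2479: Timed out while waiting for disk 2's queue count",
]
for ln in find_migration_errors(sample):
    print(ln)
```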
You should consider moving this to the general vSphere sub-forum: while there is vSAN at either end of the vMotion, this is likely failing for generic reasons (unless of course either cluster is having issues - do check this).
"vmkping -I does not show any ping loss from source to destination vmk when I test"
Unless you were pinging this for the 10 hours that this was running, this is not exactly a valid test, e.g. that you see no connectivity issue now does not mean one did not occur.
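One way to make that check meaningful is to run the ping continuously for the whole migration window and log any batch that shows loss. A sketch that builds the standard ESXi vmkping invocation (the interface name vmk1, target IP, and 1472-byte payload for a 1500 MTU path are all assumptions - adjust for your environment, e.g. jumbo frames):

```python
import subprocess
import time

def build_vmkping_cmd(interface: str, dest_ip: str, count: int = 60, size: int = 1472) -> list:
    """Build a vmkping command pinned to a source vmkernel interface.
    -d sets don't-fragment, which also surfaces MTU mismatches on the path."""
    return ["vmkping", "-I", interface, "-d", "-s", str(size), "-c", str(count), dest_ip]

def monitor(interface: str, dest_ip: str):
    """Run vmkping in 60-ping batches and log any batch reporting loss."""
    while True:
        result = subprocess.run(build_vmkping_cmd(interface, dest_ip),
                                capture_output=True, text=True)
        if " 0% packet loss" not in result.stdout:
            print(time.strftime("%H:%M:%S"), "possible loss:",
                  result.stdout.splitlines()[-2:])

# Example (hypothetical interface/IP): monitor("vmk1", "10.0.0.42")
print(" ".join(build_vmkping_cmd("vmk1", "10.0.0.42", count=10)))
```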
Using high priority or regular shouldn't matter unless you are vMotioning other VMs at the same time.
If it is going over 1 Gb uplinks and the Management network (which it will go over if cold) also has 1 Gb uplinks, then doing this cold may be beneficial as the data will be static.
Is the VM comprised of multiple VMDKs or one relatively large one? If multiple, you could consider moving it using the advanced vMotion options, one or a few disks at a time (which, by the law of odds, has less chance of timing out).
"I tested a migration (compute and svMotion) of a very small VM from the same cluster (same source host) to the same destination host in other cluster and migration worked fine, so think it must be failing because of its size."
Yes, it surely is: if you have an X chance of something timing out per hour (for whatever reason) and you multiply the duration by multiple orders of magnitude, then the odds of this occurring are multiple orders of magnitude higher.
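That scaling argument can be made concrete: with an independent per-hour chance p of a fatal glitch, the failure probability compounds with duration (the 5%-per-hour figure below is purely illustrative, not a measured value):

```python
def failure_prob(p_per_hour: float, hours: float) -> float:
    """P(at least one fatal glitch) over the window, assuming each
    hour fails independently with probability p_per_hour."""
    return 1 - (1 - p_per_hour) ** hours

# Hypothetical 5%/hour glitch rate:
print(f"{failure_prob(0.05, 0.1):.1%}")  # a tiny test VM, minutes-long move
print(f"{failure_prob(0.05, 10):.1%}")   # a 10-hour multi-TB move
```

Under those assumptions the short test migration is near-certain to succeed while the long one fails a large fraction of the time, which matches what was observed.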
Both clusters look to be fine (no errors/connectivity issues, and VMs have been running on them for a long time); the source vSAN cluster has a warning on Limits, hence the reason I want to migrate this VM before I start upgrading the hosts.
Agreed that the vmkping check and the small test VM are not really valid tests, given the VM's size and the hours the migration takes.
Unfortunately there are only 2 VMDKs, with one of them holding around 95% of the data.
vMotion is shared with Management on 1 Gbps uplinks on a Standard vSwitch, so the next option looks to be shutting down the VM first and then trying the migration again (cold migration).
Open to any other suggestions, and yes, I will consider moving this to the general vSphere sub-forum in case I am out of options after the above fails.
In an FTT=1 setup, does vSAN copy a single copy of the VM components over the network and then create the second copy on the destination vSAN, or does it copy both VM components straight away? I assume live vs. cold migration does not change anything here?
In case the cold migration fails, I can also consider cloning the existing vSAN Default Storage Policy and changing the Primary level of failures to tolerate to 0 on the newly cloned copy (e.g. an FTT0 vSAN policy) by doing the below steps:
This is a secondary VM, so this might be an option to consider if the above makes sense.
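The capacity arithmetic behind the FTT0 idea, for reference: with RAID-1 mirroring, vSAN stores FTT+1 data replicas, so dropping from FTT=1 to FTT=0 halves the raw capacity consumed (the 2 TB VM size below is an assumption for illustration):

```python
def raw_capacity_tb(vm_size_tb: float, ftt: int) -> float:
    """Raw vSAN capacity consumed by RAID-1 mirroring at a given FTT
    (data replicas only; witness components are comparatively tiny)."""
    return vm_size_tb * (ftt + 1)

print(raw_capacity_tb(2, 1))  # FTT=1: 4.0 TB raw consumed
print(raw_capacity_tb(2, 0))  # FTT=0: 2.0 TB raw, but no redundancy
```

The trade-off is of course that at FTT=0 a single disk or host failure loses the VM, which is only tolerable here because it is a secondary VM.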