VMware Cloud Community
DAsh69
Contributor

VMs hanging on migration at 68% when destination is ESXi 8.0

We have a mixed ESXi 7.0 U3 and 8.0 U2 environment with vCenter 8.0 to manage it.  Since upgrading a number of hosts to ESXi 8 we are seeing vMotion operations sometimes cause a VM to hang during its migration.  The VM will sit at 68% until action is taken to resolve it, and the only way to recover is to reboot the destination host to clear the file locks.  When this happens it is also not possible to reboot the destination host cleanly, and it needs to be reset from the iLO.

This has only started happening since the hosts were upgraded from ESXi 7 to 8.  Prior to that, no problems at all.  The hardware is HPE kit, and the upgrades have been done on Gen10 and above hosts.  We do have an SR open to investigate and have been advised that the issue is caused by a parameter on the VM which is too small - vmotion.maxSwitchoverSeconds, default 100 seconds.  The KB sent to us suggests we should increase this to 200 seconds - vMotion or Storage vMotion of a VM fails with the error: The migration has exceeded the maximum swit...
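For anyone following along, the KB workaround amounts to adding one line to the VM's .vmx file (or the equivalent per-VM advanced setting), which needs the VM powered off to take effect; the 200-second value below is the figure from the KB, not a tuned recommendation:

```
vmotion.maxSwitchoverSeconds = "200"
```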

However, as I've stated, this problem did NOT exist when we only had ESXi 7 hosts and has only occurred when migrating to the ESXi 8 hosts.  I am currently testing the increased timeout on a handful of machines; we have several hundred VMs, so it will be difficult to get the downtime to alter the parameter on all of them as suggested.

Has anyone else seen this issue and if so how did you resolve it?

10 Replies
Shen88
Hot Shot

@DAsh69,

I've not come across this with ESXi 8.0 so far, but I have seen it in previous versions, where vmware.log would contain the errors outlined in the KB - https://kb.vmware.com/s/article/1010045 - and increasing the value resolved the issue.  Did this work in your current case?

The migration exceeded the maximum switchover time of 100 second(s). ESX has preemptively failed the migration to allow the VM to continue running on the source host.
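If it helps anyone triaging this, here's a small sketch (my own, not from the KB) for scanning exported vmware.log lines for that message; the optional "has" covers the slight wording variation between the log line and the KB title:

```python
import re

# Pattern for the vMotion switchover-timeout message seen in vmware.log.
# Exact wording may vary between builds ("exceeded" vs "has exceeded").
SWITCHOVER_RE = re.compile(
    r"migration (?:has )?exceeded the maximum switchover time of (\d+) second"
)

def find_switchover_failures(log_lines):
    """Return (line_number, timeout_seconds) for each matching log line."""
    hits = []
    for n, line in enumerate(log_lines, start=1):
        m = SWITCHOVER_RE.search(line)
        if m:
            hits.append((n, int(m.group(1))))
    return hits
```

Run it over the output of `grep -i switchover vmware.log` (or the whole file) to confirm which timeout value the host was actually enforcing when the migration failed.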

If you think your queries have been answered, mark this response as "Correct" or "Helpful" and consider giving kudos!

Regards,
Shen
DAsh69
Contributor

@Shen88 

After posting this I went ahead and set up some VMs with an increased timeout.  Initially I thought it was working, but not long after I had a failure.

I'm gathering the logs and will send in via the SR that is open against it.  I'll let you know...

Thanks
Dave

mattydub
Contributor

We have the exact same issue having upgraded from ESXi 7.0.3 to 8.0.2b.

When we contacted VMware support, they made the same suggestion to increase the vMotion timeout, along with two other recommendations.

We have referenced this post and asked them to investigate further for an underlying issue, as we had no issues running at version 7.  We also have an SR open with HPE, who are just blaming the OS.

Have you had any updates to your SR?

DAsh69
Contributor

@mattydub 

I'll let you know later; I've been on annual leave, so I'm catching up on where this has got to.

DAsh69
Contributor

@mattydub Can I ask, did you upgrade your ESXi hosts or do a completely fresh install?  At the moment I have only done upgrades.  Maybe I'm clutching at straws but it's another thing I want to rule out.

goranmw1
Contributor

We are experiencing the same problem after upgrading vSphere from 8.0 to 8.0 U2b. VMware is actively working on resolving the issue. Until the resolution, we must disable DRS to prevent VMs from being vMotioned and becoming unavailable due to file locks preventing power-on.

mattydub
Contributor

We have had word that this is a bug that will be fixed in 8.0 U3 (the tentative release as of now is June) - no official bug reference number yet, though.

As a workaround we have been asked to change the default swap file location from a dedicated datastore to the Virtual Machine Directory.
No reboots are required; it applies as and when a VM vMotions.  This obviously requires a check that you have enough available storage to accommodate the swap files.  (Our dedicated swap file DS is 15TB+.)
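A quick way to sanity-check that storage requirement is to total the worst-case swap file sizes that would land in the VM directories.  A rough sketch, assuming the standard rule that a VM's .vswp is its configured memory minus its memory reservation:

```python
def vm_swapfile_bytes(mem_mb, reservation_mb):
    """Worst-case .vswp size for one VM: configured memory minus reservation."""
    return max(mem_mb - reservation_mb, 0) * 1024 * 1024

def extra_space_needed_gb(vms):
    """Total swap space the VM directories must absorb, in GiB.

    `vms` is a list of (memory_mb, reservation_mb) tuples, one per VM
    on the datastores in question.
    """
    total = sum(vm_swapfile_bytes(m, r) for m, r in vms)
    return total / 1024 ** 3
```

Feed it the memory figures exported from vCenter (e.g. via a VM inventory report) and compare the result against free space on the datastores that hold the VM directories.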

 

mattydub
Contributor

@DAsh69 yes, ours are upgrades too, applied through Lifecycle Manager.

DAsh69
Contributor

Thanks for the replies here, guys.  At least it's not just something odd with our setup.

I have been asked to do some testing with increased logging, which I'll continue with until I hear otherwise.  However, it is sounding more like VMware are now aware of an issue.

 

Gamester17
Contributor

We have got exactly the same problem, where vMotion sometimes deadlocks after upgrading from ESXi 7.0 U3 (7.0.3) to ESXi 8.0 U2 (8.0.2), as in the original post above.

 

The upgrade was done as a complete clean reinstallation, as we moved from booting ESXi on local disk to SAN boot at the same time.

 

We have had a support case open for two weeks and they have now confirmed that it is a bug in the vSphere ESXi 8.0 U2 (8.0.2) hypervisor, without a permanent fix so far.

 

The VMware support technician mentioned that he has heard the same problem symptoms reported by many other customers as well, so it is definitely a known issue.

 

They say the problem apparently only appears when the default swap file location has been set to use a specific datastore ("Use a specific datastore").

 

VMware support said the only workaround is to change the default "Swap file location" on all hosts to "Virtual machine directory" instead of "Use a specific datastore".

 

https://kb.vmware.com/s/article/1004082

 


 

VMware support also said that they expect a fix will come with 8.0 U3 (8.0.3), which should be released sometime this summer.

 

We are currently not happy with this reply and are trying to push VMware for a bug-fix patch for 8.0 U2 (8.0.2) that can be applied ASAP.  Apparently it does not yet meet the criteria for back-porting, as not enough customers have reported this as a problem in their production environments.

 

An update from VMware support said that they have an internal knowledge base article on this, so we have also pushed for them to post a public KB.

 

VMware support said that the bug does not exist in 8.0 U1 (8.0.1) or earlier, so a possible further workaround is to downgrade.

 

For us that suggested workaround is not an option, as we use VMware Site Recovery Manager (SRM) with array-based replication, where you really need to set a specific datastore:

 

https://docs.vmware.com/en/Site-Recovery-Manager/8.8/srm-administration/GUID-CC632E80-63AA-4CEF-9D0A...

 
