VMware Cloud Community
mzhao5
Contributor

vMotion always fails at 72%, message: Failed to receive migration

Cluster:

vCSA 6.7u2, ESXi 6.7u2, 4 nodes.

All-flash vSAN, single disk group per host.

Today we added one additional flash disk to the disk group to extend capacity. After several hours of rebalancing everything looked good, and vSAN Health was all green.

Then we upgraded vCSA from 6.7u2 to 6.7u3, which succeeded.

Next we upgraded ESXi from 6.7u2 to 6.7u3. As we don't have a DRS license, we did the upgrade manually:

1. host1: we manually moved all VMs to another host, put the host into maintenance mode with "ensure data accessibility", and patched it with Update Manager.

   Patching succeeded. host1 came back online, exited maintenance mode, and rebalanced.

2. host2: same as host1, succeeded.

3. host3: when we move the VMs to other hosts, the migration fails as shown in the picture, always at 72%.

 

Are there any clues for this situation? Thanks.

 

vmotion.jpg

8 Replies
a_p_
Leadership

Please check whether the VM's vmware.log contains further details about the error.

André

mzhao5
Contributor

Thanks for your reply.

Attached is that VM's vmware.log. I don't know how to read it, but I'm trying to understand the log.

kastlr
Expert

Hi,

This seems to be caused by a CPU feature mismatch between hosts.

2021-10-23T15:00:25.602Z| vmx| I125: [msg.checkpoint.migration.failedReceive] Failed to receive migration.
2021-10-23T15:00:25.602Z| vmx| I125: [msg.vpmc.unavailcounters] A performance counter used by the guest is not available on the host CPU.
2021-10-23T15:00:25.602Z| vmx| I125: Msg_Post: Error
2021-10-23T15:00:25.602Z| vmx| I125: [msg.vpmc.unavailcounters] A performance counter used by the guest is not available on the host CPU.
2021-10-23T15:00:25.602Z| vmx| I125: [msg.checkpoint.migration.failedReceive] Failed to receive migration.
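If the full vmware.log is hard to read, the bracketed message IDs can be pulled out with a few lines of Python. This is only a minimal sketch, assuming every relevant line follows the layout of the excerpt above (timestamp| thread| level: [msg.id] text). The function name `extract_messages` is made up for illustration and is not part of any VMware tool.

```python
import re

# Assumed line layout, taken from the excerpt above:
#   <timestamp>| <thread>| <level>: [msg.some.id] Human-readable text
LINE_RE = re.compile(
    r"^(?P<ts>\S+)\|\s*(?P<thread>[^|]+)\|\s*\S+:\s*"
    r"\[(?P<msg_id>msg\.[\w.]+)\]\s*(?P<text>.*)$"
)


def extract_messages(lines):
    """Return {message ID: first text seen}, in order of first appearance."""
    found = {}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and match.group("msg_id") not in found:
            found[match.group("msg_id")] = match.group("text")
    return found


# Lines copied from the log excerpt in this thread.
sample = [
    "2021-10-23T15:00:25.602Z| vmx| I125: [msg.checkpoint.migration.failedReceive] Failed to receive migration.",
    "2021-10-23T15:00:25.602Z| vmx| I125: Msg_Post: Error",
    "2021-10-23T15:00:25.602Z| vmx| I125: [msg.vpmc.unavailcounters] A performance counter used by the guest is not available on the host CPU.",
]

messages = extract_messages(sample)
for msg_id, text in messages.items():
    print(f"{msg_id}: {text}")
```

For a real log, pass the open file object instead of `sample`. The distinct message IDs are usually the quickest thing to search for in the knowledge base.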
 

 


Hope this helps a bit.
Greetings from Germany. (CEST)
mzhao5
Contributor

Yes, I noticed that, but I still have no idea where the problem is.

mzhao5
Contributor

This looks like KB https://kb.vmware.com/s/article/81191, but I'm not sure.

 

kastlr
Expert

Is your cluster using identical CPUs on each node?

If so, are some of your nodes on a different vSphere Version than others?

Your VM does use vpmc.enable=true, so the KB might describe the reason for your problem.
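To check several VMs for this setting, the .vmx files can be scanned for `vpmc.enable`. The following is a rough sketch, assuming the usual `key = "value"` layout of a .vmx file. The helper names `parse_vmx` and `uses_virtual_pmc` and the sample content are hypothetical.

```python
def parse_vmx(text):
    """Parse 'key = "value"' lines into a dict with lower-cased keys."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines, comment lines, and anything without '='.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip().lower()] = value.strip().strip('"')
    return settings


def uses_virtual_pmc(settings):
    """True if virtual CPU performance counters are enabled for the VM."""
    return settings.get("vpmc.enable", "FALSE").upper() == "TRUE"


# Hypothetical .vmx fragment, for illustration only.
sample_vmx = '''
numvcpus = "4"
vpmc.enable = "TRUE"
'''

print(uses_virtual_pmc(parse_vmx(sample_vmx)))
```

A VM that reports True here is a candidate for the behaviour described in the KB. A VM without the setting should not be affected by it.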


Hope this helps a bit.
Greetings from Germany. (CEST)
mzhao5
Contributor

host1,2,3

name: Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz    codename: Skylake EP/EN/EX

host4.

name: Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz   codename: Cascade Lake

 

host1, host2   6.7.0 build-17700523

host3, host4  6.7.0 build-16075168

 

The KB mentions the CPUs below. Are those exactly the CPUs my cluster is using? I don't know how to match mine against them:

[1] Intel® Xeon® Processor E3 v5 and v6 Family (codename Skylake, Kaby Lake)
    Intel® Xeon® D (code name Skylake-D)
    Intel® Xeon® Scalable Processor and 6th, 7th, and 8th Generation Intel® Core™ i7 and i5 (code name Skylake, Kaby Lake, Coffee Lake and Whiskey Lake)

 

Thanks

kastlr
Expert

Hi,

If I understand you correctly, the current situation looks like this.

  • Host 1 & 2
    • Skylake CPUs, updated to ESXi 6.7 P05 (2021/03/18)
  • Host 3
    • Skylake CPUs, still on ESXi 6.7 P02 (2020/04/28)
  • Host 4
    • Cascade Lake CPUs, still on ESXi 6.7 P02 (2020/04/28)
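The same overview can be produced with a short script once the host details are collected. This is just a sketch using the CPU models and build numbers reported earlier in this thread. The function names `group_by` and `mismatched_pairs` are invented for illustration, and a reported difference only marks a host pair worth a closer look. It does not prove that vMotion between them will fail.

```python
from collections import defaultdict

# Host details as reported in this thread.
hosts = {
    "host1": {"cpu": "Xeon Gold 6130", "build": 17700523},
    "host2": {"cpu": "Xeon Gold 6130", "build": 17700523},
    "host3": {"cpu": "Xeon Gold 6130", "build": 16075168},
    "host4": {"cpu": "Xeon Gold 6230", "build": 16075168},
}


def group_by(hosts, field):
    """Group host names by the value of one field ('cpu' or 'build')."""
    groups = defaultdict(list)
    for name, info in sorted(hosts.items()):
        groups[info[field]].append(name)
    return dict(groups)


def mismatched_pairs(hosts):
    """List host pairs that differ in CPU model or ESXi build."""
    names = sorted(hosts)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            diffs = [f for f in ("cpu", "build") if hosts[a][f] != hosts[b][f]]
            if diffs:
                pairs.append((a, b, diffs))
    return pairs


print(group_by(hosts, "build"))
for a, b, diffs in mismatched_pairs(hosts):
    print(f"{a} <-> {b}: differs in {', '.join(diffs)}")
```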

I might be wrong, but I assume you don't have EVC enabled on your cluster.

If I'm right, you'll definitely be affected by the KB article you mentioned, as the VMs running on Host 3 can't be migrated to any other host.

Hosts 1 & 2 are now on ESXi 6.7 P05, which would prevent the migration, and Host 4 is equipped with Cascade Lake CPUs, which already include the microcode update.

So you should stop those VMs and run a cold migration.


Hope this helps a bit.
Greetings from Germany. (CEST)