vadm168
Enthusiast

Extremely slow vMotion performance

Environment:

ESXi 6.5U1/U2

vCenter 6.7U1b

I have several ESXi clusters under one vCenter, and the clusters are physically located around the world. I don't vMotion all the time, but it has been working pretty well, especially in the clusters on 10Gb networks. I recently upgraded my vCenter from 6.5 to 6.7U1b but had not tried vMotion until today, when I needed to upgrade the ESXi hosts from 6.5 to 6.7. I then realized vMotion performance is so horrible that it's impractical. Migrating only the compute state (not the storage; the backend datastore is shared NFS) used to take less than a minute over 10Gb and now takes an hour for just one VM. Digging into the logs did not turn up anything useful. Since there have been no network changes, it does not make sense that one of the common causes, such as a mismatched MTU, would be the root cause (I checked them anyway and they are fine).

Are there other logs to check besides the VM's vmware.log and the host's vmkernel.log? Has anyone had a similar issue?

Thank you.

7 Replies
vadm168
Enthusiast

I'd like to add that it occurs on all clusters I've tried so far, not just one.

a_p_
Leadership

That's interesting, because the vMotion network is configured at the host level and the migration is only triggered by vCenter, so the vCenter version shouldn't have any influence. From what I understand, the hosts have not been upgraded yet.

Do you see any related entries in the hosts' vmkernel logs?

André

vadm168
Enthusiast

Correct, VCSA was upgraded to 6.7U1b while ESXi hosts are still on 6.5U1/U2.

What's strange is that the VMs I've tested all took about an hour to complete. For the first 29-30 minutes the task is stuck at 14%, then the percentage moves forward quickly and takes < 2 minutes to reach 98%. It then sits there for another 29-30 minutes before finishing. I can confirm on both the source and destination hosts that the only spike in network traffic was during those 1-2 minutes in the middle. This is vMotion only, not Storage vMotion. There were no network changes, and it occurs in all the data centers/clusters at different physical locations for the VMs I've tested so far. Very strange...

Thanks,

vadm168
Enthusiast

Fascinating! I started another vMotion on another VM and waited exactly half an hour from the start time (not a second off!), and the vMotion moved forward from 14%. So the sequence of events is:

1. vMotion starts and reaches 14% quickly.

2. It stays at 14% for exactly half an hour.

3. It moves forward quickly from 14% to 98%.

4. It stops at 98% for another half an hour.

5. vMotion completes.

I searched the values in Advanced Settings but did not find anything set to 1,800 seconds (i.e. half an hour)...

vadm168
Enthusiast

More clues: in vCenter's vpxd log, it looks like those two 1,800-second waits were spent on the entries below. Does anyone know what they mean? Which service was it trying to log in to when it failed?

* Right after the first 1800 seconds:

2019-03-04T22:16:02.693Z error vpxd[19613] [Originator@6876 sub=sms opID=jry8o5o4-426463-auto-9528-h5:70045337-da-01] [ConnectLocked] Failed to login to service: N7Vmacore16TimeoutExceptionE(Operation timed out)

2019-03-04T22:16:02.697Z error vpxd[19613] [Originator@6876 sub=sms opID=jry8o5o4-426463-auto-9528-h5:70045337-da-01] Received exception from SMS: N7Vmacore9ExceptionE(Operation timed out)

2019-03-04T22:16:02.700Z warning vpxd[19613] [Originator@6876 sub=VmProv opID=jry8o5o4-426463-auto-9528-h5:70045337-da-01] InvokeListenerPreCallback [StorageListeners] took 1800029 ms

* Right after the second 1800 seconds:

2019-03-04T22:46:12.626Z error vpxd[19613] [Originator@6876 sub=sms opID=jry8o5o4-426463-auto-9528-h5:70045337-da-01] [ConnectLocked] Failed to login to service: N7Vmacore16TimeoutExceptionE(Operation timed out)

2019-03-04T22:46:12.629Z error vpxd[19613] [Originator@6876 sub=sms opID=jry8o5o4-426463-auto-9528-h5:70045337-da-01] Received exception from SMS: N7Vmacore9ExceptionE(Operation timed out)

2019-03-04T22:46:12.633Z warning vpxd[19613] [Originator@6876 sub=VmProv opID=jry8o5o4-426463-auto-9528-h5:70045337-da-01] InvokeListenerPostCallback [StorageListeners] took 1800033 ms

* This looks like a summary: 2 x 1800 seconds ≈ 3,600 seconds, which is nearly all of the 3,610-second total:

2019-03-04T22:46:12.759Z warning vpxd[19613] [Originator@6876 sub=VpxProfiler opID=jry8o5o4-426463-auto-9528-h5:70045337-da-01] VpxLro::LroMain [TotalTime] took 3610111 ms
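
For anyone chasing the same symptom, a small script can pull these "took N ms" entries out of a vpxd log. This is a minimal sketch, not an official tool: the regular expression is inferred only from the lines quoted above, and the names `TOOK_RE`, `slow_steps`, and the 60-second threshold are illustrative choices.

```python
import re

# Pattern inferred from the vpxd lines quoted in this thread (an assumption,
# not a documented log format): timestamp, level, vpxd[pid], [Originator ...],
# then a free-text step name followed by "took <N> ms".
TOOK_RE = re.compile(
    r"^(?P<ts>\S+)\s+(?P<level>\w+)\s+vpxd\[\d+\]\s+"
    r"\[Originator@\d+ sub=(?P<sub>\S+) opID=(?P<opid>[^\]]+)\]\s+"
    r"(?P<what>.*?) took (?P<ms>\d+) ms\s*$"
)


def slow_steps(lines, threshold_ms=60_000):
    """Return (timestamp, sub, step, seconds) for each 'took N ms' line
    whose duration is at or above threshold_ms."""
    hits = []
    for line in lines:
        m = TOOK_RE.match(line.strip())
        if m and int(m.group("ms")) >= threshold_ms:
            hits.append(
                (m.group("ts"), m.group("sub"), m.group("what"),
                 int(m.group("ms")) / 1000)
            )
    return hits


# Sample input: lines copied from the log excerpt above.
OPID = "jry8o5o4-426463-auto-9528-h5:70045337-da-01"
SAMPLE = [
    f"2019-03-04T22:16:02.693Z error vpxd[19613] [Originator@6876 sub=sms opID={OPID}] "
    "[ConnectLocked] Failed to login to service: N7Vmacore16TimeoutExceptionE(Operation timed out)",
    f"2019-03-04T22:16:02.700Z warning vpxd[19613] [Originator@6876 sub=VmProv opID={OPID}] "
    "InvokeListenerPreCallback [StorageListeners] took 1800029 ms",
    f"2019-03-04T22:46:12.633Z warning vpxd[19613] [Originator@6876 sub=VmProv opID={OPID}] "
    "InvokeListenerPostCallback [StorageListeners] took 1800033 ms",
    f"2019-03-04T22:46:12.759Z warning vpxd[19613] [Originator@6876 sub=VpxProfiler opID={OPID}] "
    "VpxLro::LroMain [TotalTime] took 3610111 ms",
]

if __name__ == "__main__":
    for ts, sub, what, seconds in slow_steps(SAMPLE):
        print(f"{ts}  sub={sub}  {what}: {seconds:.1f} s")
```

With the sample lines it flags the two StorageListeners callbacks and the overall TotalTime entry, and skips the error line because it carries no duration. To run it against a real log, pass an open file object instead of `SAMPLE`; the log path depends on your vCenter deployment.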

Thanks,

richcopey
Contributor

Did you ever get a resolution to this problem? We are having the exact same issue across all our vCenters, with exactly the symptoms you describe: two 1,800-second halts, at 13% and 98%, during the vMotion task. I cannot find the answer anywhere. What used to be a 10-minute task to migrate 50 VMs and put a host into maintenance mode now takes hours. Really frustrating!

richcopey
Contributor

Managed to resolve this after working with our storage guy. The error in the logs pointed to Storage Providers, and he had recently upgraded some of the NetApp plugins (VSC/VASA, etc.), which roughly coincided with when our issues started.

Restarting the "VMware vSphere Profile-Driven Storage Service" on the affected vCenters has completely resolved the issue, and vMotions are running like lightning again.
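
For reference, here is a rough sketch of how that restart could be scripted. It assumes a vCenter Server Appliance where the `service-control` utility is available and where the service's short name is `vmware-sps`; neither detail is stated in this thread, so check both on your own system first. The helper names and the dry-run default are purely illustrative.

```python
import subprocess

# Assumption: "vmware-sps" is the short name of the
# "VMware vSphere Profile-Driven Storage Service" on the appliance.
SPS_SERVICE = "vmware-sps"


def restart_commands(service=SPS_SERVICE):
    """Build the stop/start command pair; nothing is executed here."""
    return [
        ["service-control", "--stop", service],
        ["service-control", "--start", service],
    ]


def restart(service=SPS_SERVICE, dry_run=True):
    """Print each command, and run it only when dry_run is False."""
    for cmd in restart_commands(service):
        print(" ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError if a command fails.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    restart()  # dry run: only prints the two commands
```

Calling `restart(dry_run=False)` from a root shell on the appliance would actually stop and start the service. On a Windows-based vCenter, the restart would go through the Windows service manager instead.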