VMware Cloud Community
JoJoGabor
Expert

vMotion Issues

In one of our datacentres we have suddenly started having issues with vMotion. When we migrate any VM from one host to another, it drops off the network for up to 2m20s. Sometimes it works absolutely fine; other times the dropout is shorter.

The hosts each have 2x 10Gb NICs connected to a standard vSwitch carrying management, vMotion and VM traffic. Each 10Gb NIC connects to a separate Nexus 2000 fabric extender (FEX) hanging off a Nexus 5000, which in turn connects to Cisco 4500 cores. I have used esxtop to check which vmnic the VM is registered to and which one it moves to on the target host. Testing with two hosts, the VM can eventually ping out on all 4 NICs (2 NICs on each of 2 hosts), so I know the VLAN is trunked correctly. At the same time, VMs on the same VLAN on the target host can still ping out, so I know the ports aren't flapping. The same behaviour is observed whether the vmnic chosen during vMotion happens to be on the same switch or on the opposite one.
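In case anyone wants to check the same thing on their side: the teaming and failover policy on the vSwitch can be dumped from the ESXi shell as below (vSwitch0 is a placeholder for the vSwitch name). The Notify Switches setting shown there is what makes the destination host send RARP frames on the VM's behalf after a vMotion, so it should be enabled.

    esxcli network vswitch standard policy failover get -v vSwitch0

The output lists the load balancing policy, Notify Switches, Failback and the active/standby adapters.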

I captured the MAC address table on the physical switches while vMotioning, and when we get the failure the MAC does not re-register against the new host's port for the same period (up to 2m20s).
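To watch this from the Nexus side, repeating something like the following during the vMotion shows when, and on which interface, the entry reappears (the MAC address and VLAN are placeholders):

    show mac address-table address 0050.5601.0203
    show mac address-table dynamic vlan 100
    show mac address-table aging-time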

There are no errors in the vmkernel log either, but I suspect it's a switch issue. I took a physical VMware host and tested moving the cable from one NIC to the other, and again observed the MAC taking 2m20s to register on the new port.
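Since even a physical cable move takes 2m20s to relearn, the FEX port configuration itself may be suspect. On the Nexus, something like the following shows the spanning-tree state and running config of the port (the interface name is a placeholder for the relevant FEX port):

    show spanning-tree interface ethernet 100/1/1
    show running-config interface ethernet 100/1/1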

We are using ESXi 5.0 Update 3.

It happens on various VMs running Windows Server 2008 and 2012, using either E1000 or VMXNET3.

It doesn't happen in another datacentre with the same build and configuration of VMware.

Any ideas?

45 Replies
JMachieJr
Enthusiast

JoJo, so it looks like our issue ended up being very similar to yours, and it was caused by the same thing: teamed NICs on some physical servers. It just sucks that my network guys overlooked it even though I brought it up when you posted this. This would all have been resolved a long time ago if my network team had actually known we had servers with NIC teaming configured.
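For anyone else chasing something similar: MAC-move notification on the Catalyst side makes flapping like this visible in syslog. A sketch, from global configuration mode:

    mac address-table notification mac-move

With that enabled, flaps typically show up as %SW_MATM-4-MACFLAP_NOTIF messages naming the MAC and the two ports it is bouncing between.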

VCP-DCV | MCP | Linux+ Twitter: @James_Machie_Jr LinkedIn: https://www.linkedin.com/in/jmachiejr
ipsingh
Contributor

Environment:

Our ESX farm consists of multiple ESX hosts on HP BL460 blades across different HP C7000 chassis. Each chassis has Virtual Connect (VC) modules, which were connected via 2-4 uplinks to Cisco 65xx switches in a port-channel configuration.

Recently these were migrated to a Nexus 5000 switch with fibre connectivity. We now have only 1 uplink, on a 10Gb Nexus port, per VC.
Issue:
Earlier, when we did a vMotion from one ESX host (on chassis A) to another (on chassis B), it was a seamless operation with 1 or 2 timeouts (RTOs). Now we observe major packet drops (40-60 or more). This behaviour is seen for many Linux VMs during vMotion and has caused application outages.

I suspect it is related to the HP Virtual Connect and Nexus integration. Can you advise whether this issue is specific to the Nexus, and how to overcome it?
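A rough way to quantify the outage while reproducing it is to timestamp pings against the VM from a Linux box and, in parallel, watch the Nexus MAC table (the IP and MAC below are placeholders):

    ping -D 192.0.2.10
    show mac address-table address 0050.5601.0203

The gap in the ping timestamps gives the real dropout window, and the MAC table shows whether the entry moves to the new uplink promptly.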

JMachieJr
Enthusiast

The first thing I'd have your network team look at is the MAC address table on the switch. Have them make sure dynamic address learning isn't disabled when you are experiencing network timeouts after a successful vMotion.
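On Catalyst IOS, something like the following confirms the learning and aging behaviour (the MAC address is a placeholder):

    show mac address-table learning
    show mac address-table aging-time
    show mac address-table address 0050.5601.0203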

VCP-DCV | MCP | Linux+ Twitter: @James_Machie_Jr LinkedIn: https://www.linkedin.com/in/jmachiejr
ipsingh
Contributor

I checked with the network team and they confirmed that dynamic address learning is not disabled in our setup. As per them, there is a delay in convergence during vMotion because of a delay in getting the MAC address from the server side; the VM is not advertising its MAC while the vMotion is in progress.
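One note on that explanation: the VM itself never advertises its MAC during a vMotion. With Notify Switches enabled on the vSwitch, it is the destination ESXi host that sends RARP frames on the VM's behalf. A way to verify those frames actually reach the switch is to SPAN the host's uplink port to a capture box and filter on RARP (the interface name is a placeholder):

    tcpdump -eni eth0 rarp

The -e flag prints the link-level header, so the source MAC of each RARP frame is visible.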


gerryluke
Contributor

Hi, I hit this issue today. We have 2 core switches, which are 3750s, and all the VM LAN cards are E1000.

We have two ESXi servers connected to different core switches. vMotion of the VM itself works fine. After we changed the core switches to c3850x and migrated a VM from one ESXi host to another, the VM went offline.

I found this is a MAC address table issue. I checked the core switch MAC address table for the VM's MAC address: the master core switch does not update its MAC address table after the VM changes ESXi server, so when we try to connect to the VM, packets go to the wrong port. We need to clear the MAC address on the core switch manually, and then the VM comes back online.

But I find that if the VM LAN card is changed to VMXNET2, there is no issue: any time we migrate the VM to another ESXi host, both core switches update their MAC address tables immediately. I think the new Cisco IOS no longer handles vMotion with the E1000 adapter, because there was no issue doing vMotion on my old 3750 IOS with an E1000 LAN card. After changing to the c3850x with the new IOS, vMotion has network issues with the E1000 LAN card but is fine with VMXNET2. One thing I really don't understand is why VMXNET3 doesn't work either: I tried VMXNET3 and E1000 and the network failed, and only VMXNET2 vMotioned successfully.

Hope this information is helpful.
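P.S. For reference, the manual clear I mentioned looks like this on the core switch (the MAC address is a placeholder):

    clear mac address-table dynamic address 0050.5601.0203
    show mac address-table address 0050.5601.0203

The second command confirms the entry has been relearned on the correct port.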

JMachieJr
Enthusiast

My entire network infrastructure is Cisco, and we have 3850s that have no problem with either the E1000 or the VMXNET3 NIC driver. I don't know what IOS version we are running, though. I'm curious what IOS version you are running on the 3850s you are having the issue with.
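If you want to grab it quickly, something like this on the 3850 prints the running version:

    show version | include IOS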

VCP-DCV | MCP | Linux+ Twitter: @James_Machie_Jr LinkedIn: https://www.linkedin.com/in/jmachiejr