Re: Vmotion Issues

JoJoGabor · ‎09-09-2014

In one of our datacentres we have suddenly started having issues with Vmotion. When we migrate any VM from one host to another, it drops off the network for up to 2m20s. Sometimes it works absolutely fine, sometimes the dropout is not as long.

Each host has 2x 10Gb NICs connected to a standard vSwitch, carrying MGMT, Vmotion and VM traffic. Each of the 10Gb NICs on the host connects to a separate Nexus 2000 switch FEX’d out of a Nexus 5000 switch, connected to Cisco 4500 cores. I have used esxtop to check which vmnic the VM is registered to, and which one it moves to on the target host. Testing with two hosts, the VM can eventually ping out on all 4 NICs (2 NICs on each of 2 hosts), so I know the VLAN is trunking fine. Additionally, at the same time, other VMs on the same VLAN on the target host can still ping out, so I know the ports aren’t flapping. The same behaviour is observed whether the vmnic chosen on Vmotion happens to be on the same switch or on the opposite switch.

I checked the ARP table on the physical switches during a Vmotion, and when we get the failure the MAC does not re-register against the new host for the same length of time (up to 2m20s).

There are no errors in vmkernel log either, but I suspect it’s a switch issue. I got a physical VMware host and tested moving the cable from one NIC to the other and again observed the MAC taking 2m20s to register on the new port.
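For anyone wanting to put a number on the dropout rather than eyeballing a ping window, here is a minimal Python sketch. It assumes you have already logged probe results as (timestamp in seconds, reply received) pairs; the function name `longest_outage` and the sample data are made up for illustration and are not output from this environment.

```python
from typing import Iterable, Tuple


def longest_outage(samples: Iterable[Tuple[float, bool]]) -> float:
    """Return the longest run of failed probes, in seconds.

    Each sample is (timestamp_in_seconds, reply_received). An outage is
    measured from the first failed probe to the next successful one.
    An outage that is still open when the samples end is not counted.
    """
    longest = 0.0
    outage_start = None
    for timestamp, ok in samples:
        if not ok and outage_start is None:
            outage_start = timestamp          # outage begins
        elif ok and outage_start is not None:
            longest = max(longest, timestamp - outage_start)
            outage_start = None               # outage ends
    return longest


# Synthetic example: one probe per second, replies missing from t=10 to t=149.
synthetic = [(float(t), not (10 <= t < 150)) for t in range(200)]
print(longest_outage(synthetic))  # 140.0 seconds, i.e. 2m20s
```

Running the same measurement against the migrated VM and against a neighbouring VM on the same vmnic would make the comparison between the two repeatable.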

We are using ESXi 5.0 Update 3

Happens on various VMs, running Windows Server 2008 and 2012 using either E1000 or VMXNET3.

It doesn't happen in another datacentre with the same build and configuration of VMware.

Any ideas?

JPM300 · ‎09-09-2014

Hey JoJoGabor,

On your VSS, what are your teaming settings: originating port ID or IP hash?

I'm assuming your network is set up somewhat like this, according to your description:

If so, what happens if you stick to the left side of the spectrum (ESXi01, 10Gb NIC1) and keep the vMotions on that side? Does the same problem happen?

JoJoGabor · ‎09-09-2014

VSS is set up with originating virtual port ID, Link Status Detection only, Notify Switches: Yes, Failback: Yes. Both NICs are active.

The diagram is almost correct, but there are 2x Nexus 5000s. Each Nexus 2000 is attached to a single N5000, then the 5000s are crosslinked plus meshed to the 2 core switches.

I used esxtop to monitor a Vmotion and confirmed that even when the VM stays within the same N2000/N5000 switch, the outage still occurs. I.e. vmnic0 on each host is connected to N2000-switch1, yet even if the VM is registered on ESXi-server1-vmnic0 and migrates to ESXi-server2-vmnic0, the outage can still occur.

JPM300 · ‎09-09-2014

Does the same thing happen if you spawn up a new VMkernel port group for vmotion and isolate it to vmnic1?

JoJoGabor · ‎09-12-2014

So I created a new portgroup for Vmotion on the same vSwitch on the two hosts in question and disabled Vmotion on the old Vmotion portgroup. On the new portgroups I set vmnic1 as the only active NIC. Still the same result.

Note that most of the time, if I test this out of hours, Vmotion works fine. So it's as though it's network load related, but the NICs on the hosts are nowhere near maximum.

JPM300 · ‎09-12-2014

Hmm, that is really weird, JoJoGabor.

I did find this:

http://www.experts-exchange.com/Software/VMWare/Q_28441096.html

Which seems to point to two possible causes: 1.) Heavy load can cause this, but like you said, the load isn't really there. 2.) ARP caching on the switches. Could it be that ARP isn't catching up fast enough on the switches?

I also found this:

http://www.lincvz.info/2013/03/06/vmware-network-guest-issue-during-a-storage-vmotion-operation/

It points to VMs with a low memory limit.

Lastly, I found this weird issue with a VDS not accepting all VLANs:
http://www.virtxpert.com/vms-on-a-cisco-nexus-1000v-vds-lose-network-connectivity/

Another thing you can try is adding two vMotion networks so vMotion will load balance. With two vMotion networks, if you have multiple vMotions running at the same time it round-robins between them. Or, if you have a VDS set up, you can also enable NIOC, set up user-defined I/O control and give vMotion more of the pie out of the 10Gb connections. Just some thoughts.

The more I read about this or look into it, the more I think one of your switches or Nexus devices isn't updating or notifying the switches fast enough on the vMotion. It seems like the vMotion takes place but then it takes around 2 min for it to update the table.
The vSwitch "Notify Switches" setting | Rickard Nobel
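To make the "table not updating" idea concrete, here is a toy Python model of a switch forwarding table. It is only a sketch under stated assumptions: the `MacTable` class, the port names, the MAC address and the 300-second aging value are all made up for illustration and are not taken from any of the switches in this thread.

```python
class MacTable:
    """Toy layer-2 forwarding table: MAC address -> (port, time learned)."""

    def __init__(self, aging_seconds: float):
        self.aging_seconds = aging_seconds
        self.entries = {}

    def learn(self, mac: str, port: str, now: float) -> None:
        # A switch learns from the source MAC of any frame it receives.
        self.entries[mac] = (port, now)

    def lookup(self, mac: str, now: float):
        entry = self.entries.get(mac)
        if entry is None:
            return None
        port, learned_at = entry
        if now - learned_at >= self.aging_seconds:
            del self.entries[mac]   # entry aged out; frame would be flooded
            return None
        return port


table = MacTable(aging_seconds=300.0)   # aging value assumed for illustration
vm_mac = "00:50:56:aa:bb:cc"            # hypothetical VM MAC

table.learn(vm_mac, "host1-vmnic0", now=0.0)

# The VM moves at t=60. If no frame from its MAC reaches this switch on the
# new port, the old entry is still used and traffic goes to the wrong host.
print(table.lookup(vm_mac, now=61.0))   # host1-vmnic0

# Once a frame sourced from the VM's MAC arrives on the new port (for example
# the notification sent when "Notify Switches" is enabled), the entry moves.
table.learn(vm_mac, "host2-vmnic0", now=62.0)
print(table.lookup(vm_mac, now=63.0))   # host2-vmnic0
```

In this model the outage lasts until either a frame from the VM is seen on the new port or the stale entry ages out, which is why a lost or ignored notification would show up as a long but finite dropout.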

I think looking into this avenue is your best bet, as you know your VLANs are good since the VM does come back online; it just takes a while after vMotioning. The fact that it only happens when there is load is odd, however.

Let us know if you stumble across the fix.

JMachieJr · ‎09-12-2014

JoJoGabor,

We are having the same exact issue on our network running VMware 5.0u3. It can take anywhere from 10s to 2mins for network connectivity to reestablish itself after a vmotion. It does the same thing whether it's on a 10gb or 1gb connection. We have been trying to figure out how to fix this for a while. I'm leaning towards a physical switch issue. I would very much appreciate it if you share any solution you find.

Thanks.

VCP-DCV | MCP | Linux+ Twitter: @James_Machie_Jr LinkedIn: https://www.linkedin.com/in/jmachiejr

JoJoGabor · ‎09-12-2014

Interesting that you are also getting it. I do think it's an ARP issue. I've done some testing with a physical VMware host. I took the cables out of the 2 NICs and started swapping them over. I saw that the NIC's MAC address was taking 2m20s (again) to update in the ARP table for the new switchport. However, I don't know what could be causing that, and the network guys just point the finger at VMware.

@JMachieJr - what switches are you connected to, out of interest?

JMachieJr · ‎09-12-2014

We are running Cisco Nexus 7009 and 5548 switches. We have been investigating an ARP issue as well due to the symptoms. If I make any progress on this I will share what I find. The VMs most affected are our production SQL servers, so we have an added incentive to figure it out haha.


JoJoGabor · ‎09-12-2014

Have you noticed the issue happening with physical servers? I.e. if you move the cable from one switchport to another? If the ARP updates correctly, ping should respond within a second.

JPM300 · ‎09-12-2014

It's been my experience with these things that the network team always says it can't be the network or the physical switches, but then when you finally prove it has to be the physical switches and they take a deeper look, they find some quirky issue or bug that causes the problem. It probably has to do with the Notify Switches feature in VMware and the switches not getting the notification or something.

JMachieJr · ‎09-12-2014

Honestly we have not tried that yet.


JMachieJr · ‎09-12-2014

I agree they are quick to pawn it off on VMware. But what I have going for me is that a few months back there was an ARP bug in the version of IOS they were using. They had to update all the switches to fix it. I'm thinking it could be related. Only time will tell.


JMachieJr · ‎09-24-2014

Just a follow up on our research on this issue. What we are experiencing really fits with the Spanning Tree issue yet we are configured with PortFast as the KB recommends. I'm going to do a bit of research down this road to see if I can find some sort of resolution. Almost forgot to mention I did open a ticket with HP since they support our servers and VMware. They are telling me to disable spanning tree on our switches and that will fix the problem. We are in the process of getting our lab ready to do this. But I personally feel that we shouldn't have to disable STP to fix this.

VMware KB: STP may cause temporary loss of network connectivity when a failover or failback event oc...


Random_Gimp · ‎09-24-2014

Thanks for your reply, we are still having issues and VMware support haven't helped much yet.

My thoughts on STP (and I am not a network engineer): if it was an STP issue and the port was shut down, no other VMs registered to that NIC would have network connectivity. But in our environment, I tested that other VMs registered to the same vmnic as the Vmotioned machine still had network connectivity. My understanding is that STP works at the switchport level, so it would shut down the port rather than an individual MAC on a trunked switchport.
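The reasoning above can be written down as a tiny decision helper. This is only a sketch of the argument, not a diagnostic tool; the function name `likely_fault_scope` and its return strings are made up for illustration.

```python
def likely_fault_scope(moved_vm_reachable: bool, neighbours_reachable: bool) -> str:
    """Classify an observation taken right after a migration.

    moved_vm_reachable:   does the migrated VM answer pings?
    neighbours_reachable: do other VMs on the same vmnic and VLAN answer?
    """
    if moved_vm_reachable:
        return "no fault observed"
    if neighbours_reachable:
        # The uplink still carries traffic, so the port itself is forwarding.
        return "per-MAC: a blocked port is unlikely"
    # Everything on the uplink is down, so a port-level cause is possible.
    return "port-level: blocked or flapping port possible"


# Observation described in the post: only the migrated VM drops out.
print(likely_fault_scope(moved_vm_reachable=False, neighbours_reachable=True))
```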

Happy to be proven wrong as I have nowhere else to go on this one

JoJoGabor · ‎09-24-2014

Sorry - I posted under a colleague's username above; that post was from me (the original poster).

JMachieJr · ‎09-24-2014

We have tested this by disabling STP on the ports in our lab this morning and it fixed the issue. My problem at this point is that I don't feel we should have to disable STP in order to run VMware. I am working with our networking team now to make sure all of the physical ports connected to our ESXi hosts are configured with "spanning-tree port type edge trunk" in order to verify all the ports on our NX Switches are using portfast and trunking. As of right now it's official that STP is what was causing our particular problem.
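For checking a long list of ports, something like the following Python sketch could help. It assumes you have the switch running-config saved as text, with each `interface ...` header followed by indented settings; the function names, the sample config and the interface names are hypothetical, and only the `spanning-tree port type edge trunk` line comes from the post above.

```python
from typing import Dict, List

REQUIRED = "spanning-tree port type edge trunk"


def parse_interfaces(config_text: str) -> Dict[str, List[str]]:
    """Group indented config lines under their 'interface ...' header."""
    interfaces: Dict[str, List[str]] = {}
    current = None
    for raw in config_text.splitlines():
        line = raw.rstrip()
        if line.startswith("interface "):
            current = line[len("interface "):].strip()
            interfaces[current] = []
        elif current is not None and line.startswith((" ", "\t")) and line.strip():
            interfaces[current].append(line.strip())
        else:
            current = None   # blank or unindented line ends the block
    return interfaces


def ports_missing_edge_trunk(config_text: str, esxi_ports: List[str]) -> List[str]:
    """Return the ESXi-facing ports that lack the edge trunk setting.

    A port that does not appear in the config at all is also reported.
    """
    interfaces = parse_interfaces(config_text)
    return [p for p in esxi_ports if REQUIRED not in interfaces.get(p, [])]


# Hypothetical sample config, for illustration only.
sample = """\
interface Ethernet101/1/1
  switchport mode trunk
  spanning-tree port type edge trunk

interface Ethernet101/1/2
  switchport mode trunk
"""
print(ports_missing_edge_trunk(sample, ["Ethernet101/1/1", "Ethernet101/1/2"]))
# ['Ethernet101/1/2']
```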


JoJoGabor · ‎09-24-2014

Hmmm, interesting. Are you in a position to do the same test that I did, where you Vmotion a machine over, check which vmnic it gets assigned to, and verify that other VMs registered on the same vmnic (and on the same VLAN, to be sure) are still passing traffic?

I'm suspecting our symptom may be the same but the root cause is different.

JMachieJr · ‎09-24-2014

I have a few meetings starting in about 5 minutes. I'll try to do it this afternoon after my meetings and get back to you.


JMachieJr · ‎09-24-2014

Looks like I was incorrect. It seems they didn't actually disable STP like I had thought. So I am back to square one again haha. Those are the difficulties when you have to deal with a separate group of people to do certain things. A meeting was canceled so I was able to run the test you asked about. I did not lose connectivity to the other VMs on the same vmnic after a successful migration. Only the one VM. And I agree with you: that would make me believe it's not an STP issue as well.

