VMware Cloud Community
AaronKCollege
Enthusiast

vMotion causes network failure

We have a vSphere cluster of 6 hosts that has been running quite well for years.  We have vMotioned VMs in the past without incident, but since it's not something we need to do on a regular basis, it's been a while.  One of the hosts had one of its OS disks go bad, so I started vMotioning systems off of it so I could put it in maintenance mode before replacing the disk.  That wasn't strictly necessary (the disk was hot swappable) but hey, better safe than sorry.  However, when the VMs finished moving to another host...they stopped responding on the network entirely.  Couldn't even ping the gateway.  This has never happened before.  Move them back to the original host, and they immediately start pinging again.  This also happens if I shut down the VM, THEN migrate it, and power it back up again.  Some VMs, after being moved, will have intermittent network outages: 8 pings go fine, then 10 time out, then 8 or 9 are fine.  Sometimes with a pattern like that, and sometimes not.  Again, move them back to the original host and all is well.

The hosts are connected to a Cisco switch stack with two NICs each, configured with cross-stack EtherChannel.  The NIC teaming is configured for IP hash with switch notification.  This configuration has been in production for years, and vMotion worked fine on many occasions.  We have checked the MAC address table on the switch stack and it is being properly updated as soon as the vMotion finishes.  Traffic sniffing shows that some packets from a moved VM are getting out to the network, but not all.  For example, after moving one VM, Wireshark on a system on the same VLAN saw ARP broadcasts from that VM requesting the MAC address of the network gateway.

The other odd thing is that if I create a brand NEW VM, I can vMotion it from host to host to host all day long with zero problems.  I also had another VM that exhibited issues, but when I switched out its virtual network "card", the problems ceased and I was able to vMotion it at will without any issues.  Also, no network issues have been seen from any VMs that stay put on their hosts.  Everything seems to work just great...as long as I don't try to move anything to a new host.  :)  From everything I've seen, it's almost as if something has "locked" each VM to its respective host and moving it anywhere else causes the networking to fail.  And before you ask, no...port security is not turned on on the switch.  :)
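For reference, these are roughly the IOS commands we used on the stack to verify the channel and the MAC table after a migration (the MAC address below is just a placeholder for a VM's vNIC; VMware vNICs start with the 00:50:56 OUI):

    show etherchannel summary
    show etherchannel load-balance
    show mac address-table address 0050.56xx.xxxx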

Full disclosure: these hosts are still running ESXi 5.1 (though vCenter was upgraded this summer to 5.5).  I haven't gotten around to upgrading, and now I'm in a bit of a pickle because even though I WANT to upgrade, if I can't vMotion the production systems to other hosts, it becomes rather difficult.  Plus, even though our support contracts are paid up, I cannot get any support from VMware on this issue.  I was hoping that someone here may have run into a similar problem and could provide some guidance as to things I could investigate.

I *suspect* that this is something switch related.  We recently (a few weeks ago) had a bunch of network problems that were traced back to a new switch on the network doing IP Device Tracking.  I don't know what that could have done to the VMware cluster, but it's the only strange network occurrence we've had lately.  Plus, since it's probably been months since we vMotioned anything, I can't draw a correlation between those network issues and this one.  The only thing that makes me suspect it is that our monitoring system (which is on the production VMware cluster) suddenly started having trouble reaching one particular VM on a DEV VMware host in another building.  It's intermittent, and it has no trouble reaching other VMs on that same host.  I feel like the two issues are related, but I can't put my finger on it.

6 Replies
AaronKCollege
Enthusiast

Quick update.  I just discovered something interesting in the configuration of the Cisco switch stack that these hosts are connected to.  The hosts are set for "Route based on IP hash", which they need to be since the ports on the switch are configured for EtherChannel.  However, I've discovered that the switch's EtherChannel load-balancing setting is set to "src-mac" instead of "src-dst-ip", which this article seems to indicate it needs to be: Sample configuration of EtherChannel / Link Aggregation Control Protocol (LACP) with ESXi/ESX and Ci... Since most of the issues I've seen seem to revolve around MAC addresses, and I *have* seen inconsistencies with what vMotioned systems can or cannot ping, this seems like as close to a smoking gun as I've gotten.  BUT, if this is a problem...how is anything working at all?  To my knowledge, this switch's configuration hasn't changed since it was installed, and we haven't had an issue until recently.  And why would vMotion cause a problem (and vMotioning the VM BACK fix it?) while other systems are humming along fine with no issue?
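For anyone who wants to check the same thing on their own stack, this is roughly what it looks like on a Catalyst (syntax can vary by IOS version, so verify before changing anything in production):

    show etherchannel load-balance
    ! to change it, in global configuration mode:
    port-channel load-balance src-dst-ip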

SebastianGrugel
Hot Shot

Hello AaronKCollege

Do you use a vDS or a standard vSwitch?

Maybe your vDS is out of sync...

The vNetwork Distributed Switch configuration on some hosts differed from that of VMware vCenter Ser...

Have you seen this KB?

Troubleshooting virtual machine network connection issues (1003893) | VMware KB

Sebastian

vExpert VSAN/NSX/CLOUD | VCAP5-DCA | VCP6-DCV/CMA/NV ==> akademiadatacenter.pl
AaronKCollege
Enthusiast

It's a standard switch.  I've looked through the KB you linked, and most of those steps are geared towards troubleshooting a VM that has no network connectivity at all.  In our case, all of our VMs currently have network connectivity, but lose it if they are vMotioned to a different host.  They then regain it if moved back to their original one.  I'm really suspecting the mismatch in the NIC load-balancing setting between the hosts and the switch: the hosts are using IP hash, but the switch is set to use source MAC address.  The only thing that's keeping me from just changing that setting on the switch right now is that I can't figure out why things are working at ALL with that mismatch.  The only explanation I can think of is that each host only has two NICs in its EtherChannel, so maybe we've just gotten lucky?  Or it could be that there are systems that can't talk to each other but normally don't, so we haven't noticed?
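For what it's worth, VMware's KB describes the IP hash roughly as an XOR of the least-significant byte of the source and destination IPs, modulo the number of uplinks.  A quick back-of-the-envelope illustration (the addresses here are made up):

    # hypothetical flow: VM 10.0.0.5 talking to gateway 10.0.0.1, 2 uplinks in the team
    src=5; dst=1; uplinks=2
    echo $(( (src ^ dst) % uplinks ))    # prints 0 -> this flow egresses on uplink 0

As far as I can tell, each side of a channel picks its egress link independently, so a hash mismatch mostly just skews the load distribution rather than breaking anything outright, which might explain why it "works at all".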

vXav
Expert

One thing you can try to pin down the problem:

  • Take a test VM with a continuous ping to the gateway (check that it works).
  • Move it to a host where it doesn't work (check that the ping fails).
  • Run esxtop, press n, and look at which vmnic the VM is using (say vmnic5).
  • Edit the vSwitch the VM is on and set vmnic5 as unused.
  • Check in esxtop that the VM switched to the other NIC (say vmnic3).

If the ping succeeds, then something's wrong on the port vmnic5 is connected to, most likely on the physical switch, as it's likely that vmnic5 and vmnic3 go to different switches in the stack.
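If you'd rather flip the uplink from the command line than the UI, something like this should work on 5.x (the vSwitch and vmnic names here are just examples):

    esxcli network vswitch standard policy failover set --vswitch-name=vSwitch0 --active-uplinks=vmnic3

Setting only vmnic3 as active effectively sidelines vmnic5 for that vSwitch while you test.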

I had a similar issue last month when the network team made some dodgy changes on the LAN switches; it turned out they had messed something up in the config, and an STP compatibility problem created a loop.

AaronKCollege
Enthusiast

Although that's a good idea, it won't work in our case because we are using "Route based on IP hash", so each VM uses all of the NICs in the team instead of being pinned to just one.  We have actually discovered today that the second switch in the stack suddenly stopped sending or receiving traffic on *all* of its ports, all at the same time, a few weeks ago.  We're going to get Cisco tech support involved to find out why.
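For anyone following along, the stack-level checks that pointed us at the second switch were along these lines (member and port numbers will obviously differ):

    show switch
    show etherchannel summary
    show interfaces counters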

AaronKCollege
Enthusiast

Thought I would post this here for posterity.  Cisco looked at the switch stack, and it turns out that one of the two switches had succumbed to a memory-leak bug in the code related to IP Device Tracking.  The switch was "up", and all the links were "up" according to both the switch stack and the VMware hosts, but traffic down any connection on the affected switch wasn't passing at all.  Last night we reloaded that particular switch, and now vMotion works just fine again.
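If anyone else runs into this, the usual fixes are an IOS upgrade to a release with the leak patched, or disabling the feature outright.  The exact syntax varies by platform and release, so check yours, but on many Catalysts it's along the lines of:

    ! global configuration mode, where supported:
    no ip device tracking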
