So lately I've been running into some disconnect issues when vMotioning VMs. It won't happen every time; in fact, I'd say it only happens about once in every 20 or 30 vMotions. We have a total of about 45 VMs across 2 hosts (Dell R910s). The networking configs are exactly the same across the two. Most vMotions work perfectly fine, reporting no errors.
So this morning I was upgrading some RAM in them. I vMotioned all the VMs over to Host B, upgraded the RAM on Host A, then vMotioned everything to Host A and upgraded the RAM on Host B. Everything worked fine and dandy. Just as I was wrapping up, I vMotioned about 5 or 6 VMs over to Host A (which at the time had no running VMs on it yet). They appeared to vMotion fine (no errors or anything), but suddenly I was getting calls about apps being down, and I saw that I couldn't ping the servers (4 of the 6 I vMotioned wouldn't ping). Since I had this happen to me a couple of weeks ago, I just quickly vMotioned them back, and they could ping again. I didn't get a chance to console into the VMs to see if they showed the NIC as being disconnected.
After a bit of research and googling, all I can see is some references to the vSwitch running out of ports (as I think 32 ports was the default in 3.x). All of my VMs use the default vSwitch 0 which has 120 ports. I ran an esxcfg-vswitch -l, and the numbers seem to indicate that between the two, there are only 45 ports taken up (which sounds correct, as I have about 45 VMs). So unless ESXi isn't properly releasing the ports, I don't think that would be it.
Thoughts? It seems odd, and doesn't seem to be a 'vswitch out of ports' thing... Anything I can check?
Can you check the ARP and MAC address tables on the physical switch? See if the MAC addresses of the VMs that are not working are associated with the right switch port after the vMotion.
Make sure that the physical switch ports are in portfast or portfast trunk mode.
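If it helps, the MAC table check above can be sketched in a few lines of Python. This is only a rough illustration: the MAC addresses, port names, and the `find_misplaced_macs` helper are all made up, and you would have to fill in the table by hand from whatever your switch reports.

```python
def normalize(mac):
    """Lower-case a MAC address and use ':' separators so entries compare equal."""
    return mac.replace("-", ":").lower()


def find_misplaced_macs(mac_table, vm_macs, expected_ports):
    """Return {mac: learned_port} for VM MACs not learned on an expected port.

    mac_table:      {mac: switch_port} as copied from the physical switch
    vm_macs:        MACs of the VMs that were just vMotioned
    expected_ports: switch ports cabled to the destination host's uplinks
    """
    table = {normalize(m): p for m, p in mac_table.items()}
    misplaced = {}
    for mac in vm_macs:
        key = normalize(mac)
        port = table.get(key)  # None means the switch has not learned this MAC
        if port not in expected_ports:
            misplaced[key] = port
    return misplaced


# Hypothetical values for illustration only.
mac_table = {
    "00:50:56:AA:00:01": "Gi1/0/1",
    "00:50:56:aa:00:02": "Gi1/0/9",
}
vm_macs = ["00:50:56:aa:00:01", "00:50:56:aa:00:02", "00:50:56:aa:00:03"]
host_a_ports = {"Gi1/0/1", "Gi1/0/2"}

result = find_misplaced_macs(mac_table, vm_macs, host_a_ports)
print(result)
```

Any MAC that shows up in the result is either still learned on a port belonging to the old host or has not been learned at all, which is where I would start looking.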
Hope this helps.
When you find a VM has lost network connectivity after a vMotion, edit the VM settings and untick network connectivity, save this configuration, and then re-enable network connectivity. This is a problem I often see with VMs that use anything other than VMXNET3 network cards, especially in a vSphere 4.x environment.
You may also just try initiating a ping out from the VM.
Here is a KB with some other troubleshooting tips: http://kb.vmware.com/kb/1003969
Problem Solved, I think!
So I was messing around with this a bit last night, as I had a chance to troubleshoot the network issue on a VM that had failed after a vMotion but wasn't mission critical. The NIC was still connected in the VM, so I tried reconnecting it via the VM properties, which didn't help. I couldn't ping out from the VM, but strangely enough I could ping other VMs on the same vSwitch on the same host.
Long story short: stupid human error. I made a mistake hooking the cabling back up after the RAM upgrade. Despite the fact that I have the cables colour coded and meticulously labeled, I forgot about a minor change we had made to accommodate a new VM that had special networking requirements. This error was only on Host A though, which would explain some of the weirdness. I could vMotion the failed VM back to Host B and it worked fine, but when I vMotioned it back to Host A again, it would fail again. I guess something in the background was forcing the VM to use vmnic7, which in this case was plugged into the wrong physical switch on Host A, hence the troubles when it lived on Host A.
I haven't been able to confirm this quite yet, but it makes perfect sense otherwise. It also explains the randomness of why some VMs would be fine while others were disconnected.
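For anyone who wants to catch this kind of cabling mismatch before it bites, here is a minimal sketch in Python that compares which physical switch each uplink goes to on the two hosts. The uplink-to-switch maps and the switch names are invented for illustration (only vmnic7 comes from my actual setup), and you would need to write down the real cabling yourself.

```python
def cabling_differences(host_a, host_b):
    """Return {uplink: (switch_on_a, switch_on_b)} where the two hosts differ.

    Each argument maps an uplink name (e.g. "vmnic7") to the physical switch
    it is cabled to. A missing uplink is reported as None.
    """
    diffs = {}
    for nic in sorted(set(host_a) | set(host_b)):
        a, b = host_a.get(nic), host_b.get(nic)
        if a != b:
            diffs[nic] = (a, b)
    return diffs


# Hypothetical cabling records; switch names are placeholders.
host_a = {"vmnic0": "switch1", "vmnic1": "switch1", "vmnic7": "switch2"}
host_b = {"vmnic0": "switch1", "vmnic1": "switch1", "vmnic7": "switch1"}

diffs = cabling_differences(host_a, host_b)
print(diffs)
```

With these made-up values only vmnic7 is flagged, which matches the kind of single-cable mistake described above.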
Good stuff. Glad you were able to track down the issue and it was something simple and easy to fix.
Thanks for reporting back. It is always good when we hear of the resolution.