vSphere vNetwork

Occasional network dropouts between specific machines

  • 1.  Occasional network dropouts between specific machines

    Posted Feb 01, 2010 07:20 PM

    We have several Windows Server 2008 x64 guests that occasionally lose connectivity to specific physical Windows Server 2008 machines on the same IP subnet. When this happens, one specific VM is unable to communicate with one specific physical machine. Both machines can still communicate with other physical and virtual servers on the same subnet and on other subnets, just not with each other, and a ping between them returns "Destination host unreachable". The VM and physical server involved are not always the same from one occurrence to the next.

    This behaviour was already happening on ESXi 3.5. We were hoping an upgrade to vSphere 4.0 would fix it, but unfortunately it has not. vSphere is at 4.0 U1, and VMware Tools is up to date in the guests. I've checked for duplicate MACs and duplicate IPs, but can't find any.
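
    For anyone wanting to repeat the duplicate check, here is a minimal sketch. It assumes you have already collected (IP, MAC) pairs from somewhere, for example from `arp -a` output or an inventory export. The function name `find_duplicates` and the sample addresses are made up for illustration.

```python
from collections import defaultdict


def find_duplicates(pairs):
    """Return (ips_with_several_macs, macs_with_several_ips) for (ip, mac) pairs."""
    macs_by_ip = defaultdict(set)
    ips_by_mac = defaultdict(set)
    for ip, mac in pairs:
        # Normalise Windows-style 00-50-56-... notation to lowercase with colons.
        mac = mac.lower().replace("-", ":")
        macs_by_ip[ip].add(mac)
        ips_by_mac[mac].add(ip)
    dup_ips = {ip: sorted(m) for ip, m in macs_by_ip.items() if len(m) > 1}
    dup_macs = {mac: sorted(i) for mac, i in ips_by_mac.items() if len(i) > 1}
    return dup_ips, dup_macs


# Hypothetical sample data: 10.0.0.5 is claimed by two different MACs.
sample = [
    ("10.0.0.5", "00-50-56-AA-BB-01"),
    ("10.0.0.6", "00:50:56:aa:bb:02"),
    ("10.0.0.5", "00:50:56:aa:bb:03"),
]
dup_ips, dup_macs = find_duplicates(sample)
print("IPs seen with more than one MAC:", dup_ips)
print("MACs seen with more than one IP:", dup_macs)
```

    Note that one MAC with several IPs can be legitimate (a multi-homed interface), so the second result is a list of things to look at rather than a list of faults.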

    When the error occurs, unchecking the NIC in VMware Tools or in "Edit Settings", applying, waiting for Windows to lose sight of the network, and then turning the NIC back on fixes the issue.

    Clearing ARP caches, flushing DNS, and so on does not fix the issue.

    This is a 3-node vSphere 4 cluster. The behaviour happens on all physical nodes and virtual guests, roughly once every few days. It seems to happen most often after a vMotion, but can occur without one. The Virtual Machine network is on a NIC team with no VLAN tagging. There are approximately 10 virtual machines, so we are nowhere near the 56-port limit on the vSwitch.

    We have another 2 node ESX 3.5 cluster with a mix of Windows 2003 and 2008 servers that is not exhibiting this problem.

    Anyone seen this behaviour before or have any ideas?



  • 2.  RE: Occasional network dropouts between specific machines

    Posted Feb 01, 2010 09:32 PM

    I have run into something similar. For us, the fix appeared to be turning off TCP Chimney in the guest, which disables TOE. The problem was especially visible after a vMotion. It seems that TOE hasn't been integrated with vMotion very well. I haven't had the problem since changing that setting.

    I only have VMXNET3 adapters.

    I hope it helps.

    Happy virtualizing!

    JP

    Please consider awarding points for correct and/or helpful answers



  • 3.  RE: Occasional network dropouts between specific machines

    Posted Feb 02, 2010 04:04 PM

    Upgrading to VMXNET3 has not helped, upgrading virtual hardware to version 7 has not helped, and disabling TOE has not helped. It happens during vMotion, right at the cutover when the VM is brought online on the new node. I've tested it multiple times, with 3 or 4 different pings running on the console to various other servers on the same subnet during the vMotion. All ping sessions usually drop one ping (or at least show one long-delay ping), but in general at least one of the sessions doesn't come back: it gives one "Request timed out" and then switches to "Destination Host Unreachable". Disabling and re-enabling the NIC, or switching the virtual network to something different and then back, gets the ping replies going again.
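
    If you want to log this rather than watch the consoles, a rough sketch of classifying saved Windows `ping -t` output is below. The helper names (`classify_ping_line`, `final_state`) and the sample lines are hypothetical. One assumption worth calling out: Windows prints the unreachable case as a "Reply from ..." line too, so the unreachable check has to come before the normal reply check.

```python
def classify_ping_line(line):
    """Classify one line of Windows ping output."""
    text = line.strip().lower()
    # Checked first: Windows reports this as "Reply from <ip>: Destination host unreachable."
    if "destination host unreachable" in text:
        return "unreachable"
    if "request timed out" in text:
        return "timeout"
    # A genuine echo reply carries a TTL field.
    if text.startswith("reply from") and "ttl=" in text:
        return "reply"
    return "other"


def final_state(lines):
    """Return the state of the last meaningful ping line, or None if there is none."""
    states = [s for s in map(classify_ping_line, lines) if s != "other"]
    return states[-1] if states else None


# Hypothetical capture of one ping session across a vMotion cutover.
session = [
    "Reply from 10.0.0.20: bytes=32 time<1ms TTL=128",
    "Request timed out.",
    "Reply from 10.0.0.11: Destination host unreachable.",
    "Reply from 10.0.0.11: Destination host unreachable.",
]
print([classify_ping_line(l) for l in session])
print("session ended in state:", final_state(session))
```

    A session that ends in "unreachable" instead of "reply" is one of the stuck ones described above.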



  • 4.  RE: Occasional network dropouts between specific machines

    Posted Feb 02, 2010 06:32 PM

    Grasping at straws...

    Could there be some switch security around MAC addresses that is causing the problem? It sounds something like a MAC address table update issue.

    Also, can you try running the continuous ping from another VM on the target machine, same port group? This should help narrow it down to an issue in the virtual networking vs. the physical networking.

    JP




  • 5.  RE: Occasional network dropouts between specific machines

    Posted Feb 04, 2010 03:44 PM

    It's looking more and more like an HP Virtual Connect issue. I've now seen the problem happen between blades in the same HP blade enclosure. Running Wireshark on both the VM and the physical box when the ping fails, I see the ARP request leave the VM, the ARP request arrive at the physical box, and the ARP reply leave the physical box, but I never see the ARP reply come back to the VM.
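
    The same comparison can be done on exported captures. Below is a minimal sketch that lists ARP requests with no matching reply in a capture. It assumes each frame has been reduced to a dict with `op`, `sender_ip` and `target_ip` keys; those field names, the function name and the addresses are hypothetical, not a Wireshark export format.

```python
def unanswered_arp_requests(frames):
    """Return (requester_ip, wanted_ip) pairs that never got a reply in this capture.

    frames: dicts with 'op' ('request' or 'reply'), 'sender_ip', 'target_ip',
    in capture order.
    """
    pending = []
    for frame in frames:
        if frame["op"] == "request":
            pending.append((frame["sender_ip"], frame["target_ip"]))
        elif frame["op"] == "reply":
            # A reply from X to Y answers every outstanding request from Y for X.
            answered = (frame["target_ip"], frame["sender_ip"])
            pending = [p for p in pending if p != answered]
    return pending


# Hypothetical captures: the VM side only sees its own request,
# the physical side sees the request and sends a reply.
vm_capture = [
    {"op": "request", "sender_ip": "10.0.0.11", "target_ip": "10.0.0.20"},
]
physical_capture = [
    {"op": "request", "sender_ip": "10.0.0.11", "target_ip": "10.0.0.20"},
    {"op": "reply", "sender_ip": "10.0.0.20", "target_ip": "10.0.0.11"},
]
print("unanswered on VM side:", unanswered_arp_requests(vm_capture))
print("unanswered on physical side:", unanswered_arp_requests(physical_capture))
```

    An unanswered request on the VM side combined with none on the physical side is the pattern described above: the reply is sent but lost somewhere on the way back.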



  • 6.  RE: Occasional network dropouts between specific machines

    Posted Feb 04, 2010 04:31 PM

    That's interesting. I wonder where the ARP reply is getting dropped. Can you use the port monitor feature of Virtual Connect to monitor the traffic? I haven't used it yet, but it's supposed to allow you to mirror port traffic to a physical/uplink port...

    Happy virtualizing!

    JP




  • 7.  RE: Occasional network dropouts between specific machines

    Posted Feb 04, 2010 04:38 PM

    As said, this looks like a MAC address relocation issue. Are the vSwitches set to "Notify Switches"? Where does the physical switching infrastructure think the VM's MAC address is connected during a failure period?

    Please award points to any useful answer.



  • 8.  RE: Occasional network dropouts between specific machines

    Posted Feb 12, 2010 02:24 PM

    We now have it isolated down to one enclosure and specific blades. So it's not a general configuration or Virtual Connect problem; it's a problem specific to either those blades, those slots, or that Virtual Connect interconnect module. If we set up NIC teaming and Virtual Connect in such a way as to force all traffic out to the Cisco backbone switches, the problem does not occur. If we set up the NIC teaming in such a way as to force all traffic to stay on a single Virtual Connect module, and in the same Virtual Connect network profile, the problem occurs between specific blades (but not others).

    We are shuffling things around to narrow down the issue, but because of the nature of the production systems, outages are few and short.



  • 9.  RE: Occasional network dropouts between specific machines

    Posted Feb 10, 2010 02:47 AM

    Hi,

    I was googling network issues and came across this post.

    I am having a very similar issue. I am seeing a total network freeze where no virtualised server responds for a period of up to 30 seconds. This might occur 10 times a day. The servers do not respond via the VM console either, and then they all spring back to life like nothing ever happened.

    We have four Windows Server 2008 R2 64-bit servers and we aren't running vMotion.



  • 10.  RE: Occasional network dropouts between specific machines

    Posted Feb 11, 2010 05:38 PM

    I had a similar issue with a 2008 server in my ESX cluster. At one point I thought it had to do with my physical (and older) Cisco switch (a 3550-12T in this case), but I still saw the issue as recently as yesterday. I see it happen when vMotioning the server to another ESX host.

    Edit: my ESX hosts are HP DL380G5s.

    Message was edited by: mcvosi



  • 11.  RE: Occasional network dropouts between specific machines

    Posted Feb 11, 2010 08:29 PM

    I have been reading the knowledge base articles, and there is a network dropout issue with some Cisco gear. I spoke to VMware yesterday. I have to say their support is absolutely brilliant! The network issues and dropouts we were experiencing in fact had nothing to do with the network at all. We had an iSCSI device that went offline, and the ESXi host became incredibly busy trying to find and recover it. The device did come back online again, but there was a known bug in the version of ESXi I was running. Anyway, watching tail -f /var/log/vmkernel showed that the host was very busy.

    We have updated our hosts and rebooted, and the problem is gone. The faulty iSCSI device has also been removed.