4 Replies Latest reply on Jun 30, 2011 1:32 PM by alvinswim

    VMs Intermittently Losing Connectivity with other VMs on the same VLAN

    alvinswim Hot Shot



      I posted this question before as part of a similar situation another member was experiencing. I'll re-post my situation here and include the link to that thread.


      Here's the deal:


      We are experiencing odd network connectivity between VMs, sometimes on different hosts and sometimes even on the same host. VMs on the same network segment (whether on the same blade, different blades in the same chassis, or different blades in different chassis) lose network connectivity with each other.


      The VM wouldn't see SOME guests on the same network segment, but it could still communicate outside the segment (for example, through a VPN tunnel to our office via RDP, or to a DB server on the DB network segment). It would be able to ping a few other VMs on the same network, but not all of them. Sometimes flapping the VM's network interface brought it back; sometimes moving the VM to another chassis/server would work. The most drastic step was to fully reboot the VM, but sometimes even that wouldn't work. We've seen this behaviour for the last few months.
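      To snapshot exactly which peers an affected guest can see at a given moment, a quick sweep like this from the guest's console helps (the IPs below are placeholders; substitute the guest addresses on the affected VLAN):

```shell
#!/bin/sh
# Placeholder IPs -- substitute the guest addresses on the affected VLAN
for ip in 10.0.1.11 10.0.1.12 10.0.1.13; do
    if ping -c 1 -W 2 "$ip" >/dev/null 2>&1; then
        echo "$ip: reachable"
    else
        echo "$ip: UNREACHABLE"
    fi
done
```

      Running it during and after an outage window makes the "sees some guests but not others" pattern concrete instead of anecdotal.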


      First we thought it was a glitch with 4.1, so we brought every host we have up to Update 1 in late April. Everything was fine for a while, then it came back.


      Here's a little background: we've run ESX 3.5u3 and 4.0 on these same blade servers over the last 2-3 years. Same setup, no changes, and everything worked fine. We upgraded to ESX 4.1 in November/December and didn't have any issues until late March, when we started seeing VMs losing connectivity.


      First we saw VMs losing all network connectivity after vMotion; that was fixed by vMotioning to another host. When that didn't work, we'd flap (disconnect/reconnect) the VM interface and it would return to life.


      The problem here is that we're not seeing any pattern. For the time being we've suspended DRS vMotion, so that we don't get any network handoffs or changes in state on the VM network, but it still happens.


      We've even disabled physical VM host NICs (removing them from the vSwitch), and that alleviated the problem. Sometimes pulling vmnic0 out would return traffic to normal; sometimes we'd have to swap it out for vmnic1. We've also observed that bringing a NIC back in would make the connectivity problem show up again. We don't want to run a single-NIC vSwitch environment in prod.
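      For reference, the NIC shuffling above was done with the usual esxcfg commands (the vSwitch and vmnic names here are examples; `esxcfg-vswitch -l` lists the real ones on each host):

```shell
# Unlink the suspect uplink from the vSwitch (example names)
esxcfg-vswitch -U vmnic0 vSwitch0
# Link the other fabric's uplink in its place
esxcfg-vswitch -L vmnic1 vSwitch0
```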


      Initially we thought some of our NICs were failing on that fabric, but it's not consistent; it happens on different hosts, and on different NICs on those hosts.


      We've also replaced all our blade-switch to core Cisco-switch connections with brand-new cables.


      We've checked and double-checked our blade switch configs as well as our core switch configs, but we can't see why this is happening, because we've had this exact same setup for the last 2-3 years.


      About the hardware: we have 2 x Dell M1000e blade chassis, both with dual Dell M6220 blade switches, connected to two Cisco 2960G core switches.


      The VM hosts are 12 x M600 blades and 4 x M610 blades, with dual fabrics for the VM network and the iSCSI network. (The M600s are in one cluster and the M610s in another because of the drastic difference in CPUs.)


      We have 3 VLAN segments for our VM networks, and our servers have 24 or 48 vSwitch ports; some of the newer blades defaulted to 56 or 128 ports. We suspected we could have been running out of vSwitch ports, but esxcfg-vswitch doesn't seem to show that.
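      For completeness, this is what we're reading the port counts from; comparing "Used Ports" against "Num Ports" for each vSwitch is what would show exhaustion:

```shell
# Lists each vSwitch with its Num Ports / Used Ports / Configured Ports
# columns -- Used Ports approaching Num Ports would indicate exhaustion
esxcfg-vswitch -l
```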


      Our vSwitches are set to Port ID load balancing. Our blade switches are connected to the core switches via port-channel (2 ports per blade switch).
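      One thing worth double-checking in that combination (a general note, not a confirmed diagnosis): VMware's teaming guidance pairs the "Route based on originating virtual port ID" policy with plain, un-channeled switch ports, and calls for "Route based on IP hash" when the uplinks terminate in a static EtherChannel. On the Cisco 2960G side, the channel mode can be inspected with:

```
! Shows each channel group, its mode (on = static, LACP/PAgP = negotiated)
! and its member ports
show etherchannel summary
```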


      At this point it looks like it's a VMware thing, or it could be a bit of Dell/Cisco/VMware combined... But now that I know someone else is experiencing this, I tend to want to look towards VMware...


      Any help is appreciated, and if you need more information I'll be glad to post it.



      Thanks ahead of time.


      Oh, and here's the link to the other thread somewhat related to my issue: