VMware Cloud Community
alvinswim
Hot Shot

VMs Intermittently Losing Connectivity with other VMs on the same VLAN

Folks,

I had posted this question before as part of a similar situation another member had been experiencing. I'll repaste my situation as well as include the link to that post.

Here's the deal:

We are experiencing odd network connectivity between VMs, sometimes on different hosts and sometimes even on the same host. VMs on the same network segment, whether on the same blade, on different blades in the same chassis, or on different blades in different chassis, lose network connectivity with each other.

The VM wouldn't see SOME guests on the same network segment, but it could communicate out of the segment (for example through a VPN tunnel to our office via RDP, or to a DB server on the DB network segment). It would be able to ping a few other VMs on the same network, but not all. Sometimes flapping the VM's network interface brought it back, and sometimes moving it to another chassis/server would work. The most drastic step would be to fully reboot the VM, but sometimes that wouldn't work either. We've seen this behaviour for the last few months.
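For anyone wanting to capture which peers drop out at a given moment, here is a minimal sketch of a reachability sweep run from inside an affected guest. It is only an illustration: the function names are made up for this post, the peer addresses you pass in are your own, and `ping_once` assumes a Linux-style `ping` with `-c` and `-W` flags.

```python
import subprocess
from typing import Callable, Dict, Iterable, List


def ping_once(host: str, timeout_s: int = 1) -> bool:
    """Send a single ICMP echo via the system ping.

    Assumption: Linux-style flags (-c count, -W timeout in seconds).
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def sweep(peers: Iterable[str],
          probe: Callable[[str], bool] = ping_once) -> Dict[str, bool]:
    """Probe every peer once and record whether it answered."""
    return {peer: probe(peer) for peer in peers}


def unreachable(results: Dict[str, bool]) -> List[str]:
    """Return the peers that did not answer, sorted for stable logging."""
    return sorted(peer for peer, ok in results.items() if not ok)
```

Calling `unreachable(sweep([...]))` with the addresses of the other guests on the segment, on a timer, and logging the result with a timestamp would at least show whether the same peers go missing each time.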

First we thought it was a glitch with 4.1, so we upgraded every host we have to Update 1 in late April. Everything was fine for a while, then it came back.

Here's a little bit of background: we've run ESX 3.5u3 and 4.0, all on the same blade servers, over the last 2-3 years. Same setup, no changes, and everything had worked fine. We upgraded to ESX 4.1 in November/December and didn't have any issues until late March, when we started seeing VMs losing connectivity.

First we saw VMs losing all network connectivity after vMotion; that was fixed by vMotioning to another host. When that didn't work, we'd flap (connect/disconnect) the VM interface and it would return to life.

The problem here is that we're not seeing any pattern. For the meantime we've suspended DRS vMotion so that we don't get any network handoffs or changes in state on the VM network, but it still happens.

We've even disabled the physical VM host NICs (removing them from the vSwitch), and that alleviated the problem. Sometimes we'd take vmnic0 out and the traffic would be normal again; sometimes we'd have to swap it for vmnic1. We've even observed that bringing a NIC back in would make the connectivity problem show up again. We don't want to run a single-NIC vSwitch environment in prod.

Initially we thought some of our NICs were failing on that fabric, but it's not consistent: it happens on different hosts and on different NICs on those hosts.

We've also replaced all our blade switch to core Cisco switch connections with brand new cables...

We've checked and double checked our blade switch configs as well as core switch configs.. But we can't see why this is happening because we've had this exact same setup for the last 2-3 years.

About the hardware: we have 2 x Dell M1000 blade chassis, both with dual Dell M6220 blade switches, connected to two Cisco 2960G core switches.

The VM hosts are 12 x M600 blades and 4 x M610 blades with dual fabrics: a VM network and an iSCSI network. (The M600s are in one cluster and the M610s are in another because of the drastic difference in the CPUs.)

We have 3 VLAN segments for our vm networks and our servers have 24 or 48 vswitch ports, some of the newer blades defaulted to 56 or 128 ports. We suspected we could have been running out of vswitch ports, but esxcfg-vswitch doesn't seem to show that.
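To make the port-count check repeatable across all 16 hosts, here is a rough sketch that reads the text printed by `esxcfg-vswitch -l` and flags any vSwitch that is close to full. It assumes the listing has a "Switch Name / Num Ports / Used Ports" header followed by one row per switch, as in the sample below; the sample values are made up for illustration and the function names are hypothetical.

```python
from typing import Dict, List, Union

Row = Dict[str, Union[str, int]]


def parse_vswitch_ports(listing: str) -> List[Row]:
    """Extract name, total ports and used ports for each vSwitch.

    Assumption: each switch is printed as an unindented header line that
    starts with "Switch Name", followed by one data row for that switch.
    """
    rows: List[Row] = []
    expect_row = False
    for line in listing.splitlines():
        if line.startswith("Switch Name"):
            expect_row = True
            continue
        if expect_row and line.strip():
            expect_row = False
            name, num_ports, used_ports = line.split()[:3]
            rows.append({"name": name,
                         "num_ports": int(num_ports),
                         "used_ports": int(used_ports)})
    return rows


def nearly_full(rows: List[Row], threshold: float = 0.9) -> List[str]:
    """Names of switches whose used ports reach the given fraction of total."""
    return [str(r["name"]) for r in rows
            if int(r["used_ports"]) >= threshold * int(r["num_ports"])]


# Made-up sample in the assumed layout, not output from our hosts.
SAMPLE = """\
Switch Name      Num Ports   Used Ports  Configured Ports  MTU     Uplinks
vSwitch0         24          23          24                1500    vmnic0,vmnic1

  PortGroup Name        VLAN ID  Used Ports  Uplinks
  VM Network 10         10       20          vmnic0,vmnic1

Switch Name      Num Ports   Used Ports  Configured Ports  MTU     Uplinks
vSwitch1         128         9           128               1500    vmnic2,vmnic3
"""
```

Feeding each host's listing through `nearly_full(parse_vswitch_ports(text))` would show at a glance whether any of the smaller 24-port vSwitches is running out of ports.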

Our vswitches are set to Port ID load balancing. Our Blade switch to core switches are connected via portchannel (2 ports per blade switch)
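As a way of reasoning about what Port ID load balancing does when a NIC is pulled from the team, here is a deliberately simplified model. It assumes the vSwitch picks an uplink from the virtual port ID and the number of active uplinks using a plain modulo; the real selection logic is internal to ESX and may differ, so treat this as a thinking aid rather than the actual algorithm. The function name and port numbers are made up.

```python
from typing import List


def uplink_for_port(port_id: int, active_uplinks: List[str]) -> str:
    """Pick an uplink for a virtual port.

    Assumption: simple modulo over the active uplinks. This is a model of
    "route based on originating virtual port ID", not VMware's code.
    """
    if not active_uplinks:
        raise ValueError("no active uplinks in the team")
    return active_uplinks[port_id % len(active_uplinks)]


def placement(port_ids: List[int], active_uplinks: List[str]) -> dict:
    """Map every virtual port ID to the uplink the model would choose."""
    return {pid: uplink_for_port(pid, active_uplinks) for pid in port_ids}
```

Under this model, `placement([4, 5], ["vmnic0", "vmnic1"])` splits the two ports across both NICs, while removing one NIC from the list forces every port onto the remaining uplink. That would at least be consistent with traffic changing behaviour when a vmnic is removed from or returned to the vSwitch.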

At this point it looks like it's a VMware thing, or it could be a little bit of Dell/Cisco/VMware... But now that I know someone else is experiencing this, I tend to want to look towards VMware...

Any help is appreciated, and if you all need more information I'll be glad to post it up..

thanks ahead of time

oh here's the link to the other article somewhat related to my issue:

http://communities.vmware.com/thread/317191

4 Replies
MauroBonder
VMware Employee

Sorry, what is the model of your network adapter? Did you check it against the HCL?

Good luck

*Please don't forget to award points for "helpful" and/or "correct" answers.* Thank you
alvinswim
Hot Shot

Broadcom 5708 and 5709

they're on the HCL.. or they should be cause Dell lists the servers as VMware compliant

MauroBonder
VMware Employee

I checked, and the 5709 is not listed as supported on ESX 4.x. The 5708 is supported. Try setting the 5709 adapter to "standby" and check whether everything works fine.

Broadcom BCM5709 (Network), listed releases:
ESX 3.5 U5
ESX 3.5 U4
ESX 3.5 U3
ESX 3.5 U2
ESX 3.5 U1
ESX 3.5
ESXi 3.5 Embedded U5
ESXi 3.5 Embedded U4
ESXi 3.5 Embedded U3
ESXi 3.5 Embedded U2
ESXi 3.5 Embedded U1
ESXi 3.5 Embedded
ESXi 3.5 Installable U5
ESXi 3.5 Installable U4
ESXi 3.5 Installable U3
ESXi 3.5 Installable U2
ESXi 3.5 Installable U1
ESXi 3.5 Installable
ESX 3.0.3 U1
ESX 3.0.3
ESX 3.0.2 U1
alvinswim
Hot Shot

The problem is that we are having this issue with our 12 M600 blades, which have the Broadcom 5708s. Only the 4 new blades have the Broadcom 5709s Ethernet chipset.

However, I do see this on the HCL

Broadcom BCM 5708S (Network), listed releases:
ESX / ESXi 4.1 U1
ESX / ESXi 4.1
ESX / ESXi 4.0 U3
ESX / ESXi 4.0 U2
ESX / ESXi 4.0 U1
ESX / ESXi 4.0
ESX 3.5 U5
ESX 3.5 U4
ESX 3.5 U3
ESX 3.5 U2
ESX 3.5 U1
ESX 3.5
ESXi 3.5 Embedded U5
ESXi 3.5 Embedded U4
ESXi 3.5 Embedded U3
ESXi 3.5 Embedded U2
ESXi 3.5 Embedded U1
ESXi 3.5 Embedded
ESXi 3.5 Installable U5
ESXi 3.5 Installable U4
ESXi 3.5 Installable U3
ESXi 3.5 Installable U2
ESXi 3.5 Installable U1
ESXi 3.5 Installable
ESX 3.0.3 U1
ESX 3.0.3
ESX 3.0.2 U1
ESX 3.0.2
ESX 3.0.1
Broadcom BCM 5709S (Network), listed releases:
ESX / ESXi 4.1 U1
ESX / ESXi 4.1
ESX / ESXi 4.0 U3
ESX / ESXi 4.0 U2
ESX / ESXi 4.0 U1
ESX / ESXi 4.0

I might have had a typo earlier and meant to say that we have the 5708S and 5709S

Everything server wise is standard Dell because they are all blades and are also listed on the HCL

thanks
