VMware Cloud Community
bsherman54
Contributor

New VMs can't ping on vSphere Distributed Switch

I've got an interesting problem where newly created VMs can't ping ANYTHING if they are given a port group at VM creation.  If the VM is moved to another port group (VLAN-based) and the IP is changed, it can then ping out.  Move it back and restore the original settings, and it goes dark again.

I'm completely at a loss.  On our switch I can look up its ARP entry and it's complete, but there's no Layer 3 action.

Our setup is a C7000 with Virtual Connects going back to Nexus 5020s in a vPC configuration.

The vDS is set up with a primary NIC and a secondary NIC in failover mode.  Load balancing is route based on originating virtual port, with beacon probing.

Any insight would be helpful.  Also, if there's a better way to set up load balancing, I'd like to hear it; I'm not sure IP hash is relevant here since the EtherChanneling is done at the Virtual Connect level.
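For reference, here's a rough pyVmomi sketch of how I'd dump the teaming settings vCenter actually holds for each distributed port group (load balancing policy, beacon probing, active/standby uplinks).  The vCenter address and credentials are placeholders:

    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim

    # Placeholder vCenter address and credentials.
    si = SmartConnect(host='vcenter.example.local', user='administrator@vsphere.local',
                      pwd='changeme', sslContext=ssl._create_unverified_context())
    content = si.RetrieveContent()

    pgs = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.dvs.DistributedVirtualPortgroup], True)
    for pg in pgs.view:
        teaming = getattr(pg.config.defaultPortConfig, 'uplinkTeamingPolicy', None)
        if teaming is None:
            continue  # skip port groups without VMware teaming settings
        print(pg.name,
              'policy=%s' % teaming.policy.value,                       # e.g. loadbalance_srcid
              'beacon=%s' % teaming.failureCriteria.checkBeacon.value,  # beacon probing on/off
              'active=%s' % teaming.uplinkPortOrder.activeUplinkPort,
              'standby=%s' % teaming.uplinkPortOrder.standbyUplinkPort)
    Disconnect(si)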

10 Replies
chriswahl
Virtuoso

If you're using a single vDS with any port group that has VLAN tagging, you must be tunneling VLANs (trunking) at the Virtual Connect (VC) layer. Are you using a VLAN tag on the "problem" port group, or is it using the native VLAN?

Additionally, beacon probing has no real value when there are only two uplinks. http://kb.vmware.com/kb/1005577
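If it helps, here's a rough pyVmomi sketch that prints how each distributed port group is tagged: a single VLAN ID, a trunk range, or VLAN 0 (untagged/native). The vCenter address and credentials are placeholders:

    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim

    si = SmartConnect(host='vcenter.example.local', user='administrator@vsphere.local',
                      pwd='changeme', sslContext=ssl._create_unverified_context())
    content = si.RetrieveContent()

    pgs = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.dvs.DistributedVirtualPortgroup], True)
    for pg in pgs.view:
        vlan = getattr(pg.config.defaultPortConfig, 'vlan', None)
        if isinstance(vlan, vim.dvs.VmwareDistributedVirtualSwitch.TrunkVlanSpec):
            print(pg.name, 'trunk', [(r.start, r.end) for r in vlan.vlanId])
        elif isinstance(vlan, vim.dvs.VmwareDistributedVirtualSwitch.VlanIdSpec):
            print(pg.name, 'vlan', vlan.vlanId)   # 0 = untagged / native VLAN
        elif vlan is not None:
            print(pg.name, type(vlan).__name__)   # e.g. a private VLAN spec
    Disconnect(si)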

VCDX #104 (DCV, NV) ஃ WahlNetwork.com ஃ @ChrisWahl ஃ Author, Networking for VMware Administrators
bsherman54
Contributor

Yes, it's trunking at the VC level.  I've got 150+ VMs, and they are spread out over 30 or so different VLANs.  It's just when I create a new VM and assign it a VLAN (port group) at creation that it can't connect out at all... but if it's moved onto that port group after creation, it works.

One other thing: if it's left alone for an hour or more, it then starts working, which makes me think it's an ARP issue somewhere.

bsherman54
Contributor

And yes, all port groups are tagged.

chriswahl
Virtuoso

Hmm. So you are creating a VM and assigning it a port group that has other active, working VMs on it, then configuring the IP/mask/gateway and it cannot ping other VMs on that port group or its default gateway? But it works when you leave it idle for an hour?

VCDX #104 (DCV, NV) ஃ WahlNetwork.com ஃ @ChrisWahl ஃ Author, Networking for VMware Administrators
bsherman54
Contributor

Yessir...see my quandary?

mitchellm3
Enthusiast

Interesting...

We too use a mix of FlexFabric/Flex-10 with uplinks to Cisco Nexus equipment, as well as a vDS.  For the most part, HP firmware updates aside, this configuration has been running quite solidly for us.  We have both ESXi 4.1 and ESXi 5 farms.  With both versions, our setup is a little different from yours:

We have the c7000s connecting back to Nexus 7000s using LACP.  The vDS is set up with both uplinks active and load balancing set to route based on physical NIC load.  So far this works great.
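In case it's useful, this is roughly how that teaming mode can be applied to a port group with pyVmomi.  The port group name, vCenter address, and credentials are placeholders, and this assumes a vDS 4.1 or later where loadbalance_loadbased ("route based on physical NIC load") is available:

    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVim.task import WaitForTask
    from pyVmomi import vim

    si = SmartConnect(host='vcenter.example.local', user='administrator@vsphere.local',
                      pwd='changeme', sslContext=ssl._create_unverified_context())
    content = si.RetrieveContent()

    pgs = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.dvs.DistributedVirtualPortgroup], True)
    pg = next(p for p in pgs.view if p.name == 'dvPG-VLAN100')        # placeholder port group name

    spec = vim.dvs.DistributedVirtualPortgroup.ConfigSpec()
    spec.configVersion = pg.config.configVersion                      # required for a reconfigure
    setting = vim.dvs.VmwareDistributedVirtualSwitch.VmwarePortConfigPolicy()
    teaming = vim.dvs.VmwareDistributedVirtualSwitch.UplinkPortTeamingPolicy(inherited=False)
    teaming.policy = vim.StringPolicy(inherited=False, value='loadbalance_loadbased')
    setting.uplinkTeamingPolicy = teaming
    spec.defaultPortConfig = setting

    WaitForTask(pg.ReconfigureDVPortgroup_Task(spec=spec))
    Disconnect(si)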

Your post caught my eye because I just upgraded one farm from 4.1 to 5.0.  After the upgrade everything was working great.  I then upgraded my vDS from 4.1 to 5.0 and it worked fine, except that I missed that I had two VMs with "flexible" type virtual NICs.  I started seeing errors and they couldn't connect to the network.  Since they were no longer on the network, I replaced the flexible adapters with VMXNET3 adapters, thinking that would fix the problem.  I couldn't get them to connect with the new adapters.  So I tried a different VLAN and all worked well.  Put them back on the original VLAN and there was no connectivity again.  So I built a new VM, put it on the VLAN we were having the issues with, and it had no connectivity either.  It seemed like the two VLANs that held the two VMs with flexible adapters just fizzled out during the upgrade of the vDS.  I even deleted and recreated the port group on the vDS with no luck.
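For what it's worth, the adapter swap itself can be scripted.  Here's a rough pyVmomi sketch of what I mean; the VM name is a placeholder, the "flexible" adapter appears as VirtualPCNet32 in the API as far as I know, and this assumes the NIC is backed by a distributed port group.  It only mirrors the manual swap above and obviously didn't fix the underlying VLAN problem:

    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVim.task import WaitForTask
    from pyVmomi import vim

    si = SmartConnect(host='vcenter.example.local', user='administrator@vsphere.local',
                      pwd='changeme', sslContext=ssl._create_unverified_context())
    content = si.RetrieveContent()

    vms = content.viewManager.CreateContainerView(content.rootFolder, [vim.VirtualMachine], True)
    vm = next(v for v in vms.view if v.name == 'problem-vm01')        # placeholder VM name

    changes = []
    for dev in vm.config.hardware.device:
        if isinstance(dev, vim.vm.device.VirtualPCNet32):             # the "Flexible" adapter type
            # Remove the flexible NIC...
            changes.append(vim.vm.device.VirtualDeviceSpec(
                operation=vim.vm.device.VirtualDeviceSpec.Operation.remove, device=dev))
            # ...and add a VMXNET3 NIC on the same distributed port group.
            conn = vim.dvs.PortConnection(portgroupKey=dev.backing.port.portgroupKey,
                                          switchUuid=dev.backing.port.switchUuid)
            nic = vim.vm.device.VirtualVmxnet3(key=-1)
            nic.backing = vim.vm.device.VirtualEthernetCard.DistributedVirtualPortBackingInfo(port=conn)
            changes.append(vim.vm.device.VirtualDeviceSpec(
                operation=vim.vm.device.VirtualDeviceSpec.Operation.add, device=nic))

    if changes:
        WaitForTask(vm.ReconfigVM_Task(vim.vm.ConfigSpec(deviceChange=changes)))
    Disconnect(si)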

The VMware support guy suggested that the Cisco switches were caching the MAC addresses and causing problems.  I thought that was way off base; nothing changed except that I upgraded the vDS.  I had a second farm, on ESXi 4.1, running on the same two c7000 chassis with VMs on the same VLANs, and it was working fine.  While I was waiting for VMware to get back to me on analyzing the log files, I took the time to reboot each ESXi host and VOILA!!! my VLANs were working again.

Weird stuff!

bsherman54
Contributor

See...that makes me think it's either the Virtual Connects OR the vDS.  If you had to reboot EACH blade, that makes me lean more towards the VC, but what would have changed between 4.1 and 5.0 that would make the VC act this way?

I agree with you that it's not the Nexus switches caching the MAC; that just seems way too far-fetched.  Out of curiosity, how do you have your active/active links configured?  That's the exact same version of VC we're running: two links in a vPC going to one VC, so four links, with each pair connected to one vNIC.  I started to go to IP hash, but then I started experiencing weirdness where machines would drop and then come back... I'll try to figure that one out later.  I've got to get this VLAN issue fixed, because rebooting each of the blades every time a new VM is created is just out of the question.

usulsuspct
Contributor

Curious if you had any luck resolving this?  This sounds like it could be similar to an issue we are having with our new ESXi/vDS 5 cluster.  We are running on HP G7 blades connected via two 10Gb uplinks through Virtual Connect 3.30.

VMs will be running fine on a host, but a vMotion to some (not all) hosts in the cluster ends with the guest offline, with VMware reporting a policy violation and blocking the port.  VMware thinks the VM is trying to change its MAC address during the vMotion (to a MAC of all zeros).

This is all VLAN-tagged traffic.  We are running active/active with load-based teaming, relying on NIOC for priority.
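For anyone else hitting this, here's a rough pyVmomi sketch (vCenter address and credentials are placeholders) that dumps the security policy on each distributed port group, since the MAC-address-changes and forged-transmits settings are what come into play when vSphere decides a guest's MAC has changed:

    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim

    si = SmartConnect(host='vcenter.example.local', user='administrator@vsphere.local',
                      pwd='changeme', sslContext=ssl._create_unverified_context())
    content = si.RetrieveContent()

    pgs = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.dvs.DistributedVirtualPortgroup], True)
    for pg in pgs.view:
        sec = getattr(pg.config.defaultPortConfig, 'securityPolicy', None)
        if sec is None:
            continue
        print(pg.name,
              'macChanges=%s' % sec.macChanges.value,
              'forgedTransmits=%s' % sec.forgedTransmits.value,
              'promiscuous=%s' % sec.allowPromiscuous.value)
    Disconnect(si)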

NV1
Contributor

Hi Folks

This problem sounds similar to the one we have just encountered.

We have HP BL620c G7 blades running ESXi 5.0.0, build 515841.  They have Emulex NC553i 10Gb 2-port FlexFabric converged network adapters.  These hosts have stopped passing traffic on VLAN 71.  If I remove the physical adapters from the vDS and reconnect them, the VLAN starts to work again.  VLAN 71 had been working correctly on all hosts for about six weeks before this issue occurred.  The only difference between the BL620c hosts that are working with this VLAN and the ones that are not appears to be the load on the hosts: the ones that are not working carry more load than the ones that are working.  However, none of the hosts have more than 61% of memory allocated to VMs or are using more than 25% CPU.
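Since removing and re-adding the physical adapters clears it for us, here's a rough pyVmomi sketch (vCenter address and credentials are placeholders) that lists which vmnics each host currently has attached to the vDS, so the loaded hosts where VLAN 71 fails can be compared against the ones where it still works:

    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim

    si = SmartConnect(host='vcenter.example.local', user='administrator@vsphere.local',
                      pwd='changeme', sslContext=ssl._create_unverified_context())
    content = si.RetrieveContent()

    hosts = content.viewManager.CreateContainerView(content.rootFolder, [vim.HostSystem], True)
    for host in hosts.view:
        for proxy in host.config.network.proxySwitch:                 # one entry per vDS on the host
            uplinks = [key.split('-')[-1] for key in proxy.pnic]      # pnic keys end in e.g. 'vmnic2'
            print(host.name, proxy.dvsName, uplinks)
    Disconnect(si)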

These hosts are using the following firmware and drivers on the NICs:

Firmware: be2net device firmware 4.0.360.15

Driver: be2net driver 4.0.355.1

C7000 Enclosure:

OA Firmware: 3.31

VC Flex-10 Enet Module Firmware: 3.18
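A rough pyVmomi sketch (vCenter address and credentials are placeholders) can also confirm every host reports the same NIC driver as the versions listed above; the exact driver and firmware version strings are easier to read per host, e.g. with "esxcli network nic get -n vmnic0" on ESXi 5.x:

    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim

    si = SmartConnect(host='vcenter.example.local', user='administrator@vsphere.local',
                      pwd='changeme', sslContext=ssl._create_unverified_context())
    content = si.RetrieveContent()

    hosts = content.viewManager.CreateContainerView(content.rootFolder, [vim.HostSystem], True)
    for host in hosts.view:
        for pnic in host.config.network.pnic:
            # PhysicalNic exposes the bound driver name (e.g. 'be2net'); the exact
            # driver/firmware versions are not in this object, so check those on the host.
            print(host.name, pnic.device, pnic.driver)
    Disconnect(si)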



We have moved the VMs on VLAN 71 to different hardware that does not have this problem, but we are very concerned that other VLANs will also fail at some point.

Any ideas would be much appreciated.

Cheers

mitchellm3
Enthusiast

Just an update... We upgraded another 4.1 farm to 5.0 and everything was fine.  This time we made sure there were no "flexible" NICs on the dvSwitch before we upgraded it to 5.0.  After the upgrade all VMs pinged, so we thought we were good.  Then someone deployed a new VM on the farm and it didn't have network connectivity.  Once I rebooted a host and moved the VM over to it, it worked just fine.  I ended up having to reboot all our hosts.  What made this a bigger issue is that when I cleared VMs off hosts, I moved them to hosts that hadn't been rebooted yet, and some of them lost their network connectivity.  After seeing that, as long as I manually moved them to hosts that had already been rebooted, they worked just fine.  I'm just glad this farm wasn't fully utilized yet.
