Re: VLAN Trunking on ESX 3.x

Mark_Bradley1 · ‎03-21-2007

I have just spent some time getting VLAN trunking working on ESX 2.5.3 in our lab environment. Once I had this working in the version of ESX that I am most comfortable with, I figured I would do a new install of ESX 3 and set up VLAN trunking.

I am however having connectivity issues in ESX 3.

We have been testing with 2 NICs bonded, each one connected to its own switch to allow some redundancy.

I have 2 VMs on 2 different VLANs; each one runs a ping to its gateway and to the other VM.

When I disconnect one of the network cables that carry the trunk, I initially lose one ping, then a few seconds later the VMs stop pinging altogether, both their gateways and each other. They do not recover until I replace the cable.

Obviously we want to be able to survive a hardware fault in either a NIC or a switch if we are using trunking.

There are a lot more configuration options in ESX 3, so I suspect there is a config step I have missed. Currently I have Load Balancing set to Route based on source MAC hash and Network Failover Detection set to Beacon Probing. I chose these because my 2.5.3 server that had working trunking was set to out-mac and had beaconing enabled.

Any help will be greatly appreciated.

Thank you in advance.

acr · ‎03-21-2007

I had issues getting it to work with ESX 3, but after the Cisco config was done, I just played with the IP and MAC hash settings. They both worked for me, although I ended up using the IP hash.

Mark_Bradley1 · ‎03-21-2007

That is the strange thing... this is exactly the same server on which I had trunking working with 2.5.3. There have been no changes on the network switch side.

vmmeup · ‎03-21-2007

Use route based on IP Hash.

Sid Smith ----- VCP, VTSP, CCNA, CCA(Xen Server), MCTS Hyper-V & SCVMM08 [http://www.dailyhypervisor.com] - Don't forget to award points for correct and helpful answers. 😉

biekee · ‎03-21-2007

What brand of physical switches do you use? I saw some weird things with HP Procurve switches and nic teaming.

bk

bggb29 · ‎03-21-2007

I ran into a very odd observation with trunking and ESX 3 today. I could not get the trunked VLANs to allow a guest to do anything on the network. On the vSwitch, one of the NICs was showing an odd network, 0.0.0.xx

xx = I cannot remember the octets.

I shut down the port on the pSwitch and all of a sudden had network connectivity; I re-enabled the pSwitch port and it still works. I only have 2 VLANs on the system currently.

The switches are Cisco 3750s, stacked, with 2 EtherChannels into the core. The vSwitch has 2 discrete connections, not EtherChanneled, with one port from the ESX host into each switch.

I have not rebooted either the ESX server or the pSwitches to see if the behaviour reoccurs.

vmmeup · ‎03-21-2007

I am using HP ProCurves and the only problem I have is that when I reboot one of my ESX servers the ports go into blocked mode. If I enable them they work fine again until I reboot the server again.

Mark_Bradley1 · ‎03-21-2007

We are using Cisco switches; I would have to check with the network team for the exact model. I know this config worked with 2.5.3 and VLAN trunking, so I was hoping it would just be a case of configuring ESX 3 with the same VLANs...

I now have this config:

Load Balancing - Route based on ip hash

Network Failover Detection - Beacon Probing

Notify Switches - No

Rolling Failover - No

Failover Order - Not configured

Pull NIC1 - All pings fail - never recovers

Replace NIC1 - pings recover after approx 45-60 seconds

Pull NIC2 - No loss of ping

Replace NIC2 - No loss of ping

So with the IP hash it is better but not perfect
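For anyone comparing the two load balancing policies mentioned in this thread, here is a minimal sketch of how they are commonly described as picking an uplink. This is a simplified model, not VMware's actual code: the function names are hypothetical, and the exact hash ESX uses is an assumption (IP hash is usually described as an XOR of the source and destination addresses taken modulo the number of uplinks, MAC hash as the last byte of the VM's MAC modulo the number of uplinks).

```python
import ipaddress


def ip_hash_uplink(src_ip: str, dst_ip: str, n_uplinks: int) -> int:
    """Pick an uplink index from the XOR of source and destination IP.

    Assumption: only the low byte of the XOR is used, then taken
    modulo the number of uplinks in the team.
    """
    src = int(ipaddress.ip_address(src_ip))
    dst = int(ipaddress.ip_address(dst_ip))
    return ((src ^ dst) & 0xFF) % n_uplinks


def mac_hash_uplink(mac: str, n_uplinks: int) -> int:
    """Pick an uplink index from the last byte of the source MAC.

    Assumption: the least significant byte modulo the uplink count.
    """
    last_byte = int(mac.split(":")[-1], 16)
    return last_byte % n_uplinks


# The same VM can land on different uplinks depending on the destination.
print(ip_hash_uplink("10.0.0.11", "10.0.0.1", 2))
print(ip_hash_uplink("10.0.0.11", "10.0.0.2", 2))
# With MAC hash, a given VM always uses the same uplink.
print(mac_hash_uplink("00:50:56:aa:bb:01", 2))
```

Under this model, IP hash can send one VM's traffic out of either NIC depending on the destination, while source MAC hash pins each VM to a single NIC. The addresses above are made-up examples.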

Mark_Bradley1 · ‎03-22-2007

I have the settings set at the vSwitch level with no over-ride set for each port group.

When the cable is removed from vmnic0 I get a complete failure. vmnic1 can be removed and replaced with no loss of ping.

vmmeup · ‎03-22-2007

Try without beacon probing.....

Mark_Bradley1 · ‎03-22-2007

I have tried the following...

Load Balancing - Route based on IP hash

Network Failover Detection - Link Status Only

Notify Switches - Yes

Rolling Failover - No

Pulled the first cable - No loss of ping

Replaced the first cable - lost pings to VMs for approx 20-30 seconds

Pulled the second cable - no loss of ping

Replaced the second cable - lost pings to Gateways for 20-30 seconds

This is similar to what I was seeing in ESX 2.5.3 prior to enabling beaconing... but beaconing does not seem to improve the situation in ESX 3.
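The timings above were read off running pings. If it helps to compare configurations more precisely, the outage length can be computed from logged ping results. A minimal sketch, assuming one sample per ping in the form (timestamp in seconds, reply received); the function name and sample format are hypothetical, not output from any ESX tool.

```python
def outage_windows(samples):
    """Return (start, end) pairs for each run of failed pings.

    samples: list of (timestamp_seconds, reply_received) in time order.
    end is the timestamp of the first successful ping after the outage,
    or None if pings never recovered within the samples.
    """
    windows = []
    start = None
    for ts, ok in samples:
        if not ok and start is None:
            start = ts                    # first lost ping of an outage
        elif ok and start is not None:
            windows.append((start, ts))   # first reply after the outage
            start = None
    if start is not None:
        windows.append((start, None))     # still down at end of the log
    return windows


# Example: replies lost from t=1 until t=3, then lost again at t=5.
log = [(0, True), (1, False), (2, False), (3, True), (4, True), (5, False)]
print(outage_windows(log))
```

A window ending in None corresponds to the "never recovers" case seen with beacon probing.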

Mark_Bradley1 · ‎03-23-2007

We are using two Cisco 6509 switches running CatOS. Trunking is set up on both switches, and the ESX server has a connection to each switch. I have set up a vSwitch with both NICs included and then set up the port groups for the required VLANs. We want to be able to use two physical switches to allow for hardware redundancy.

postfixreload · ‎03-23-2007

Is EtherChannel set up on the switch? You do need IP hash on the ESX side, and make sure portfast is turned on on the switch side.

meistermn · ‎03-25-2007

Maybe this helps:

http://virtrix.blogspot.com/2006/11/vmware-switch-load-balancing.html

jhanekom · ‎04-24-2007

I know this is quite a late response, but my 2c's:

Beacon probing is broken in many situations. The exact situations are not well documented, but the fact that it's broken and that it will be "fixed" by 3.0.2 has been stated by VMware.

The 20-30 second timeouts you're seeing are most likely due to Spanning Tree. With Cisco switches, spanning tree blocking/learning is applied to trunked ports even if you have portfast enabled.
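As a rough check on that figure, here is a small sketch of the arithmetic, assuming the classic 802.1D default forward delay of 15 seconds (the timers on these particular switches are not stated in the thread). A port that comes up without an effective portfast spends one forward-delay interval in listening and one in learning before it forwards traffic.

```python
DEFAULT_FORWARD_DELAY_S = 15  # 802.1D default; assumed, not confirmed for these switches


def stp_time_to_forwarding(forward_delay_s=DEFAULT_FORWARD_DELAY_S,
                           portfast_effective=False):
    """Seconds a port waits after link-up before forwarding traffic.

    Without an effective portfast, the port passes through the
    listening and learning states, each lasting forward_delay_s.
    """
    if portfast_effective:
        return 0
    listening = forward_delay_s
    learning = forward_delay_s
    return listening + learning


print(stp_time_to_forwarding())                          # default timers
print(stp_time_to_forwarding(portfast_effective=True))   # port skips both states
```

With default timers that is 30 seconds, which is in line with the 20-30 seconds of lost pings reported when a cable is plugged back in.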

You need to set the "portfast trunk" option on the necessary ports. On CatOS, it would appear that that command is:

set spantree portfast 5/1 enable trunk

...substituting 5/1 for your particular blade and port number, of course.

In my particular environment (very old Nortel switches), there was no way of doing "portfast trunk", so we opted for active/standby adapters in the vSwitches instead. This also solved the "my VMs go down when I reconnect the failed cable" problem, as it's unlikely we'll have another cable failure within 30 seconds of the originally failed cable being reconnected.

While this theoretically limits bandwidth, we're not even close to using a full Gigabit link yet, so we're ok on that front for now.