Contributor

Failback when using Etherchannel and FCoE

I have an environment that I'm able to poke around in for learning purposes, but I don't manage it.  Since I can't make any changes to the environment and experiment, I wanted to ask some questions here.  I'm trying to figure out what the failback option on virtual portgroups should be set to when using FCoE and etherchannel.

The environment consists of:

  • FCoE
  • 2 CNA cards
  • Etherchannel
  • vPLEX
  • Metro Cluster
  • Nexus 5000


Each host connects to a single vDS and each portgroup is configured to use:

  • Load balancing policy: Route based on IP hash
  • Active uplinks: the two 10 GbE CNAs
  • Standby & Unused uplinks: none
  • Physical switch is configured for etherchannel.

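For context, the physical-switch side of a setup like this is typically a static EtherChannel whose hash matches the vSwitch policy. A rough NX-OS sketch of what that might look like (interface and channel-group numbers are made up here, and syntax should be verified against your NX-OS release):

```
! Hash on src/dst IP so the switch side agrees with "Route based on IP hash"
port-channel load-balance ethernet source-dest-ip

interface Ethernet1/1
  description ESXi host uplink (CNA port 1)
  channel-group 10 mode on        ! "mode on" = static LAG, no LACP negotiation

interface port-channel 10
  switchport mode trunk
```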

The vSphere 5.5 networking guide says the following for the failback option on vDS portgroups:


Failback: Select Yes or No to enable or disable failback. This option determines how a physical adapter is returned to active duty after recovering from a failure. If failback is set to Yes (the default), the adapter is returned to active duty immediately upon recovery, displacing the standby adapter that took over its slot, if any. If failback is set to No, a failed adapter is left inactive even after recovery until another currently active adapter fails and requires replacement.
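The behavior the guide describes can be sketched for an active/standby team (which is where failback classically matters). This is a hypothetical illustration, not a VMware API; the class and NIC names are made up:

```python
# Toy model of the failback behavior described in the vSphere networking
# guide, for a team with one active and one standby adapter (not a LAG).
class Team:
    def __init__(self, active, standby, failback):
        self.active, self.standby, self.failback = active, standby, failback
        self.failed = None

    def link_down(self, nic):
        # The standby adapter takes over the failed active adapter's slot.
        if nic == self.active and self.standby:
            self.failed, self.active = nic, self.standby
            self.standby = None

    def link_up(self, nic):
        if nic != self.failed:
            return
        self.failed = None
        if self.failback:
            # Failback = Yes: the recovered adapter displaces its replacement.
            self.active, self.standby = nic, self.active
        else:
            # Failback = No: the recovered adapter waits as standby.
            self.standby = nic

team = Team(active="vmnic0", standby="vmnic1", failback=False)
team.link_down("vmnic0")   # vmnic1 takes over
team.link_up("vmnic0")     # failback=No: vmnic0 only becomes standby
print(team.active)         # vmnic1
```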


This mentions active & standby links, and since etherchannel requires that all links be active with no standby links, does this mean the failback option doesn't apply to etherchannel configurations?  Or does it apply to how load is redistributed when a link comes back up after being down?


For example, say we have a portgroup with two uplinks (uplink1 & uplink2), the load balancing policy set to IP hash, and the physical switch ports in an etherchannel configuration.  During normal operation the IP hash calculations (http://blogs.vmware.com/kb/2013/03/troubleshooting-network-teaming-problems-with-ip-hash.html) are performed and load is spread across both uplinks.

Now say the physical switch that uplink2 connects to goes down.  Do the ESXi hosts connected to this vDS immediately take uplink2 out of the equation, so that uplink1 is the only possible result of the IP hash calculation?  What's the impact of the failback option when uplink2 comes back up?  If failback is set to No, would uplink2 still be excluded from the calculation?  If so, how would you get it back in?  Or is load sent down the link as soon as it comes back up, regardless of the failback option?  It feels like I'm missing something here.
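The calculation from the KB linked above can be sketched in a few lines: XOR the 32-bit source and destination IPv4 addresses, then take the result modulo the number of active uplinks. The addresses and uplink names below are invented for illustration:

```python
# Simplified sketch of the IP-hash uplink selection described in the
# VMware KB on troubleshooting IP-hash teaming.
import ipaddress

def chosen_uplink(src, dst, active_uplinks):
    h = int(ipaddress.IPv4Address(src)) ^ int(ipaddress.IPv4Address(dst))
    return active_uplinks[h % len(active_uplinks)]

# Normal operation: the hash is taken over both active uplinks.
print(chosen_uplink("10.0.0.10", "10.0.0.200", ["uplink1", "uplink2"]))
print(chosen_uplink("10.0.0.10", "10.0.0.201", ["uplink1", "uplink2"]))

# uplink2's switch is down: the host hashes over the remaining link,
# so every flow lands on uplink1.
print(chosen_uplink("10.0.0.10", "10.0.0.200", ["uplink1"]))
```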

Also, I read that setting Failback to No is recommended when using IP-based storage because it can result in port flapping, which can cause performance issues with the storage.  But I also read that traffic from an etherchannel appears as a single MAC address/connection.  Would this recommendation apply in an etherchannel configuration?  Are port flapping and MAC flapping the same thing?  Or is port flapping just when the link goes up and down, and MAC flapping when the MAC flips between two ports?

Sorry if these are basic questions, but not being a networking guy and not being able to experiment makes it hard.  Thanks for any help.

4 Replies
Virtuoso

FCoE doesn't use IP (it's Fibre Channel encapsulated directly in Ethernet frames), so you can safely rule it out of the equation here. In fact, your CNA likely just exposes an HBA that is completely separate from your VMware port groups anyway.

Static LAGs (link aggregation groups) such as an EtherChannel will not use a member port (a link that is part of the LAG) when it is not showing an "up" status. In a 2-member LAG, the traffic pattern in a 1-link-loss scenario simply uses the last remaining link. The MAC address does not move, because - as you correctly stated - the MAC address lives on the logical LAG interface, not on any one physical link.

I would use Failback in this scenario, unless you want an admin to physically inspect the status of the LAG post-repair and manually put the link back into active status once repairs are completed and validated.

VCDX #104 (DCV, NV) ஃ WahlNetwork.com ஃ @ChrisWahl ஃ Author, Networking for VMware Administrators
Contributor

Thanks for the response.  Sorry I didn't make this part clear, but in terms of IP hash + etherchannel: if failback is set to Yes, is it possible that once the link is detected as up, the ESXi host could start sending traffic down it before the switch port is fully forwarding, so the traffic wouldn't be delivered?  I read about this happening, but it wasn't specific to the load balancing policy.  They mentioned making sure you have things like portfast enabled on the ESXi-facing switch ports, so maybe if that's done, the impact would be mitigated.  Would using LACP make any difference in this scenario?  Thanks again.

Virtuoso

The member link would not show as up unless the link is up on both sides.

I agree with the portfast statement. Any links going to an ESXi server should not participate in spanning tree. The goal is to transition into a forwarding state immediately upon the link turning up. This should be set regardless of the use of a LAG.

LACP will dynamically build the LAG and validate that the configuration on both sides is correct. I prefer LACP (dynamic LAG) over any sort of static LAG, but this requires a Distributed Switch on the vSphere side. If you have the VDS, use LACP. However, LACP will not solve the spanning tree issue (you still want to avoid using spanning tree on the LAG going to the ESXi host).
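To make the static-vs-LACP and portfast points concrete, a hypothetical NX-OS fragment (interface and channel-group numbers invented; confirm syntax for your platform) might look like:

```
interface Ethernet1/1
  channel-group 20 mode active        ! LACP: negotiates and validates the bundle
                                      ! (a static LAG would use "mode on" instead)

interface port-channel 20
  switchport mode trunk
  spanning-tree port type edge trunk  ! edge/"portfast": forward immediately on link up;
                                      ! still needed whether or not LACP is used
```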

VCDX #104 (DCV, NV) ஃ WahlNetwork.com ஃ @ChrisWahl ஃ Author, Networking for VMware Administrators


Contributor

Thanks a lot Chris.
