VMware Cloud Community
zenomorph
Contributor

Teamed NIC network not failing over

We've been running ESXi Ent/Plus on our HP DL380/580 servers for a few years now. For network resilience and traffic performance, we typically configure either a vSwitch or a Distributed vSwitch in our clusters with 2 teamed NICs (route based on IP hash) and a port channel on the physical switch.
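As a rough sketch of the setup described above (assuming a standard vSwitch named vSwitch0 with uplinks vmnic0 and vmnic1; all names are placeholders), the teaming policy could be applied from the ESXi shell like this:

```shell
# Placeholder names: vSwitch0, vmnic0, vmnic1; adjust for your host.
# Route based on IP hash requires a matching static port channel
# on the physical switch side.
esxcli network vswitch standard policy failover set \
    --vswitch-name=vSwitch0 \
    --load-balancing=iphash \
    --active-uplinks=vmnic0,vmnic1
```

These commands require an ESXi host, so treat them as a configuration sketch rather than something to copy verbatim.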

What I'd like some insight on is how everyone else does this, and whether any downtime is typically experienced when there are faults in the network cards.

In our experience so far, when we test this teamed NIC configuration before production by removing the network cable from either NIC in the team, things work fine: the vSwitch fails over thanks to the teamed NICs, with minimal ping packet loss.

However, once in production we occasionally get physical hardware issues on the NIC cards, and in most cases we find the failover or load balancing doesn't work: we get packet drops, or the VMs cannot communicate at all. Even though we have a second NIC in the vSwitch, the traffic doesn't fail over. In the end we need to either remove the faulty NIC from the vSwitch or vMotion the VMs to another host.

Has anyone had the same or a similar experience, and is there any actual way to overcome this so that there's no downtime for the VMs?

1 Reply
MKguy
Virtuoso

we do testing of removing network cable from either NIC from the team and things work fine there's fail over in the vSwitch due to the teamed NICs and minimal ping packet drop.

Yes, if a link goes physically down, then failover will work fine.

every now and then we get physical hardware issues on the NIC cards

Can you elaborate on these "hardware issues" a bit? Do they cause the physical link of the NIC to go down or not? The vSwitch will only initiate a failover to the other NIC if the physical link is down (unless you use beacon probing, which you probably shouldn't for other reasons). If there are silent configuration or other issues where the link stays up, it will keep sending traffic down that link.
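One way to check what the host itself sees is the reported link state of each uplink; if a failing card still shows "Up" here, a link-state-based failover will never trigger (vmnic0 is a placeholder name):

```shell
# Show all physical NICs with their link status, speed and driver
esxcli network nic list

# Driver and firmware details for one NIC (placeholder name vmnic0)
esxcli network nic get -n vmnic0
```

Again, these need to run on the ESXi host itself (or via SSH to it).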

This is not only an issue on the ESXi side, but on the physical switch as well. With a static etherchannel, your physical switch will likewise keep sending frames down every link that is physically up, regardless of the operational state on the other end, which it has no idea about.
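To illustrate the switch side, in Cisco IOS-style syntax (interface and group numbers are placeholders) the difference looks roughly like this; with `mode on` the switch bundles the ports unconditionally, while LACP `mode active` negotiates with the peer and can evict a member that stops responding:

```
interface range GigabitEthernet1/0/1 - 2
 ! Static etherchannel: no negotiation; a member stays in the bundle
 ! as long as its own link is physically up
 channel-group 1 mode on
 ! LACP alternative: peers exchange LACPDUs, so a port whose far end
 ! stops responding is removed from the bundle
 ! channel-group 1 mode active
```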

You should investigate these "hardware issues" a bit more closely. What NIC and physical switching gear are you using? Do you have the latest firmware and driver for the NICs?

The following is only a side note, because it wouldn't help you in your specific case: failover for non-LAG configurations works the same way, triggering only when the NIC link is physically down.

On a side note, I wouldn't exactly recommend using static etherchannel. If you have Distributed vSwitches, use LACP channeling instead, and even then only in very specific cases. LAG channels typically add unnecessary complexity and in most cases bring little real performance gain. Check out these excellent posts by Chris Wahl on the pros and cons:

http://wahlnetwork.com/2014/01/13/vsphere-need-lag-bandaids/

http://wahlnetwork.com/2014/02/05/revenge-lag-networks-paradise/

-- http://alpacapowered.wordpress.com