Solved: Network design verification question

kgottleib · ‎10-25-2013

Attention VMware networking gurus:

I was recently asked to troubleshoot a networking issue at a customer site. Here is what I discovered:

- The customer has a single vSwitch which is configured for IP Hash load balancing, and so were all port groups within the vSwitch except for the VM production network which was configured with the default "port ID" setting.

From my understanding, the IP Hash setting is used when aggregated links / EtherChannel configurations are in place on the physical switch, and if the links are not aggregated then Port ID would be used.

This configuration has been in place for quite some time and was working until recently. I believe the recent issue was the result of vmnic2 being set to unused in the parent vSwitch but set to active in the port group. A VM lost connectivity, and I believe it was due to failing over to vmnic2 in the port group.

There is a KB about the unused vmnic and I am ready to recommend a remedy for this, but I need some advice regarding the mismatch between the IP Hash config on the vSwitch and the Port ID setting on the port group that resides on it.

Please advise, thanks in advance.

MKguy · ‎10-25-2013

With IP hashing (or LACP), you need to have all links as active. This is because the physical switch at the other side of the channel has no information about such configurations and will always try to forward traffic on the respective physical link it deems appropriate for the applied hash. If that link is "unused" for a port group on the ESXi host, the connected vNICs will not receive traffic arriving on that uplink.

- The customer has a single vSwitch which is configured for IP Hash load balancing, and so were all port groups within the vSwitch except for the VM production network which was configured with the default "port ID" setting.

This is a bad configuration as well and should actually cause issues too. Either ALL your uplinks, and thus ALL connected port groups, are part of a channel or none of them are. Again, the physical switch forms the channel at the level of the physical links, not per logical VLAN/port group, and assumes the other end is configured like that as well.

Long story short: With the IP-hash load balancing policy/etherchannel all physical uplink vmnics need to be set to active for the whole vSwitch and all port groups on it. All port groups need to be set to the IP-hash policy.
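To make the failure mode concrete, here is a minimal Python sketch of the behaviour described above. Everything in it is an assumption for illustration: the four-uplink channel, the IP addresses, and the XOR-modulo hash, which only stands in for whatever algorithm the physical switch is actually configured with.

```python
import ipaddress

def switch_member_for(src_ip, dst_ip, members):
    """Pick the channel member chosen by an illustrative src/dst IP XOR hash.

    Real switches offer several hash algorithms; this one is a stand-in.
    """
    src = int(ipaddress.IPv4Address(src_ip))
    dst = int(ipaddress.IPv4Address(dst_ip))
    return members[(src ^ dst) % len(members)]

def host_accepts(uplink, active_uplinks):
    """The host only delivers frames that arrive on an uplink that is active
    for the port group; frames on an unused uplink are dropped."""
    return uplink in active_uplinks

# Hypothetical setup: the switch sees a four-member channel,
# but the port group has vmnic2 marked as unused.
channel = ["vmnic0", "vmnic1", "vmnic2", "vmnic3"]
port_group_active = {"vmnic0", "vmnic1", "vmnic3"}

vm_ip = "10.0.0.50"                                  # hypothetical VM address
clients = ["10.0.0.%d" % i for i in range(1, 9)]     # hypothetical clients

# Clients whose traffic the switch hashes onto the unused uplink.
dropped = [
    c for c in clients
    if not host_accepts(switch_member_for(c, vm_ip, channel), port_group_active)
]
print(dropped)
```

In this sketch only the clients that happen to hash onto vmnic2 lose connectivity to the VM, which is why such a misconfiguration can look like it "mostly works".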

-- http://alpacapowered.wordpress.com

kgottleib · ‎10-29-2013

Thanks for confirming what I suspected simply from a logical view of the setup. I was actually hoping for a deeper dive that would shed light as to why the Port ID config was still working underneath the aggregated links even though it should be set to IP hash, but I'm happy with the answer regardless.

The reason I am stating this is that there are other engineers on my team who aren't thinking this through logically and were arguing, "because the scheme was working, why should we change it to IP Hash?" Yes, there are still guys out there who think like this. Scary, huh? But without a deeper explanation of what is happening with the packets between the aggregated links and the vSwitch, I'm not certain I can convince them. Can you point me to a VMware networking white paper of some sort that might have details on this?

It could be that the only result of using Port ID on the vSwitch when the links on the physical switch are aggregated is that the traffic will always stick to the same vmnic until it completely fails.

MKguy · ‎10-30-2013

I don't think more detailed explanations are going to convince people like this who already fail to understand the basic concept/implications of such configurations.

To stick to some hard references: it's strictly unsupported to run an EtherChannel with any load balancing policy other than IP hash, or with standby/unused NICs.

Point your colleagues to these references if they have doubts:

http://kb.vmware.com/kb/1001938

- the virtual switch must have its load balancing method set to Route based on IP hash

- The only load balancing option for vSwitch or vDistributed Switch that can be used with EtherChannel is IP HASH.

- Do not configure standby or unused uplinks with IP HASH load balancing.
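The three rules quoted above can be turned into a quick sanity check. The following is only a sketch: the dictionary layout, the policy labels ("iphash", "portid"), the uplink count, and the "Management" port group are hypothetical and are not read from a real host or any VMware API.

```python
def teaming_violations(vswitch):
    """Return a list of rule violations; an empty list means none were found.

    Checks the vSwitch itself and every port group on it, because a port
    group can override the teaming settings it inherits.
    """
    problems = []
    scopes = [vswitch] + vswitch.get("port_groups", [])
    for scope in scopes:
        name = scope["name"]
        # Rule: the only load balancing policy allowed with EtherChannel is IP hash.
        if scope["policy"] != "iphash":
            problems.append(
                f"{name}: load balancing is '{scope['policy']}', expected 'iphash'"
            )
        # Rule: no standby or unused uplinks with IP hash.
        for state in ("standby", "unused"):
            for nic in scope.get(state, []):
                problems.append(f"{name}: {nic} is {state}, all uplinks must be active")
    return problems

# Hypothetical model of the setup described in this thread.
customer = {
    "name": "vSwitch0",
    "policy": "iphash",
    "active": ["vmnic0", "vmnic1", "vmnic3"],
    "standby": [],
    "unused": ["vmnic2"],
    "port_groups": [
        {"name": "VM Production", "policy": "portid",
         "active": ["vmnic0", "vmnic1", "vmnic2", "vmnic3"],
         "standby": [], "unused": []},
        {"name": "Management", "policy": "iphash",
         "active": ["vmnic0", "vmnic1", "vmnic2", "vmnic3"],
         "standby": [], "unused": []},
    ],
}

violations = teaming_violations(customer)
for problem in violations:
    print(problem)
```

Run against this model, the check flags both the unused vmnic2 on the vSwitch and the Port ID policy on the production port group.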

Also: http://www.yellow-bricks.com/2010/08/06/standby-nics-in-an-ip-hash-configuration/

-- http://alpacapowered.wordpress.com

chriswahl · ‎11-01-2013

I was actually hoping for a deeper dive that would shed light as to why the Port ID config was still working underneath the aggregated links even though it should be set to IP hash, but I'm happy with the answer regardless.

Each side of the network (the host and the upstream switch) gets to decide how traffic is placed on the wire. In this case, you're using a static Link Aggregation Group (LAG) in the form of EtherChannel.

The reason the "Port ID config" worked is because the host was still able to send traffic to the upstream switch, but not using a hashing algorithm. This isn't optimal, but it works. The upstream switch will accept the traffic because it has no authority over which member port receives traffic - the entire LAG is one large logical port and will accept traffic on any member port.
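A rough Python sketch of the outbound direction may help here. It assumes a simple modulo mapping from virtual port ID to uplink, which is a common way to picture the Port ID policy but is an assumption on my part; the port ID, uplink names, and switch port names are all hypothetical.

```python
def uplink_by_port_id(port_id, active_uplinks):
    """Sticky uplink choice: the same virtual port always maps to the same
    uplink, regardless of destination (assumed modulo mapping)."""
    return active_uplinks[port_id % len(active_uplinks)]

def lag_accepts(ingress_port, lag_members):
    """The upstream switch treats the LAG as one logical port and accepts
    frames arriving on any member port."""
    return ingress_port in lag_members

host_uplinks = ["vmnic0", "vmnic1", "vmnic2", "vmnic3"]

# Hypothetical cabling between host uplinks and switch ports in the LAG.
cabling = {
    "vmnic0": "Gi1/0/1",
    "vmnic1": "Gi1/0/2",
    "vmnic2": "Gi1/0/3",
    "vmnic3": "Gi1/0/4",
}
lag = set(cabling.values())

vm_port_id = 5  # hypothetical virtual port ID of the VM's vNIC
chosen = uplink_by_port_id(vm_port_id, host_uplinks)
accepted = lag_accepts(cabling[chosen], lag)
print(chosen, accepted)
```

The host keeps sending that VM's traffic out of one uplink, and the switch accepts it because that uplink lands on a LAG member port. Nothing in this direction breaks, so the mismatch stays hidden until return traffic is hashed onto a link the host is not listening on.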

Because the host adapter was marked "unused" - but not actually unplugged - the upstream switch still thought the member port was available and sent traffic to it. The host would drop the traffic due to the "unused" state.

VCDX #104 (DCV, NV) ஃ WahlNetwork.com ஃ @ChrisWahl ஃ Author, Networking for VMware Administrators