I'm having an odd problem with link aggregation between an ESXi 5 host and a 3COM Baseline 2924-SFP Plus switch. This is a standalone ESXi host with a dual-port NIC card. The networking is set up with the management network and the VM Network on the same virtual switch. I set up the ESXi host for static link aggregation as per the EtherChannel link aggregation article in the knowledge base: Sample configuration of EtherChannel / Link Aggregation Control Protocol (LACP) with ESXi/ESX and Cisco/HP switches (1004048).
On the 3COM switch, I created a static link aggregation group of two ports. The parameters on the switch for these ports are:
STP - enabled
Port Fast - disabled
Root Guard - disabled
Port State - Forwarding
Port Role - Designated
RSTP Link Type - Auto
Duplex Mode - Full
Flow Control - disabled
I'm not a switch maven, so I don't know what all of these parameters mean, but it's set up the same as all of the other ports on the switch.
The issue is that the management network becomes unavailable after a period of time. I can't connect to the host through any method and can't ping it. However, the virtual machines are operating normally and fully accessible. The other day, we did a test by disconnecting one of the ports from the switch, and the host became available again. Then I removed and recreated the LAG and was once again able to connect to the ESXi host through the vSphere client. I left a command line window open for a couple of hours constantly pinging the host IP address, and never saw a failed ping. However, a few days later I tried connecting to the host through vSphere again and got a connection error, and it won't ping again either.
I'm at a loss as to what's going on. Any ideas?
hypercat wrote:
I'm at a loss as to what's going on. Any ideas?
The basic conclusion is that the link aggregation is not working as it should and is most likely misconfigured all the time, but because of the specific way traffic is balanced across the links, it can appear functional for some traffic and not for other traffic.
Before starting any troubleshooting on the setup, I'd like to ask why you are using the IP Hash load balancing option in the first place: do you have some specific need for the small advantage it gives?
I don't understand the point of your question, but the answer is pretty much that I'll take any advantage I can get. Perhaps you can explain in more detail what you mean by it being a "small advantage." It's entirely possible that I don't fully understand the pros and cons of link aggregation as related to VMware.
hypercat wrote:
Perhaps you can explain in more detail what you mean by it being a "small advantage."
It is quite a small advantage over the default policy, called Port ID.
With the Port ID NIC teaming policy you get decent load balancing across the VMs, you get fault tolerance, you can connect to two physical switches for increased network redundancy, and you need no specific link aggregation setup on the physical switches.
For IP Hash, the only advantage is that a single VM could, if it has multiple connections to different outside IP hosts, use both vmnics (physical NIC ports) at the same time, whereas with Port ID a single VM only ever uses a single vmnic. The disadvantage is that it does not allow standards-based connections to more than one physical switch, and it is critical that both the vSwitch and the physical switch are correctly configured as a static LAG.
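The difference is easier to see with a small sketch. This is illustrative Python only, assuming the commonly described IP Hash behavior (XOR the source and destination IP addresses and take the result modulo the number of active uplinks); the function name is hypothetical, not VMware's actual code:

```python
import ipaddress

def ip_hash_uplink(src_ip: str, dst_ip: str, num_uplinks: int) -> int:
    """Pick an uplink index from a source/destination IP pair.

    The same src/dst pair always maps to the same uplink, but
    different destinations can land on different uplinks, which is
    why one VM can use several vmnics concurrently under IP Hash.
    """
    src = int(ipaddress.IPv4Address(src_ip))
    dst = int(ipaddress.IPv4Address(dst_ip))
    return (src ^ dst) % num_uplinks

# One VM talking to two different outside hosts can hash to two
# different uplinks on a two-vmnic team:
print(ip_hash_uplink("192.168.1.10", "10.0.0.5", 2))
print(ip_hash_uplink("192.168.1.10", "10.0.0.6", 2))
```

This also shows why a misconfigured LAG can look "partly broken": only the conversations that happen to hash onto the bad link fail, while the rest keep working.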
In your case it does seem like the physical switch is not set up the way it should be to work with vSphere IP Hash, which is what is causing these strange connection losses. Hence my first question: do you in fact really need IP Hash, or would going back to Port ID be a good enough setup?
Being a 3Com, it's probably a switch that predates a lot of the implementation guides out there.
To make this easier, can you post a screenshot of your VMware vSwitch configuration, and a copy of the config on the switch relating to the team?
I can't connect to this host right now, but IIRC, there is no option to route based on virtual port ID. Is it possibly because this is ESXi 5.0, not 5.1? Or because this is a standalone server without vCenter?
hypercat wrote:
I can't connect to this host right now, but IIRC, there is no option to route based on virtual port ID. Is it possibly because this is ESXi 5.0, not 5.1? Or because this is a standalone server without vCenter?
The Port ID load balancing option is available in both 5.0 and the standalone version; it is actually the default setting.
I looked at another ESXi 5.x host where I'm also having similar aggregation issues. I see that it does indeed have the option you mentioned. However, the article I followed about static link aggregation says to use the Route based on IP Hash option. It even says: "Note: The only load balancing option for vSwitches or vDistributed Switches that can be used with EtherChannel is IP HASH." Is this just an old article, or perhaps inaccurate for 3COM switches (as it refers specifically only to Cisco and HP switches)? And if so, is there a newer/better tech article I can use to understand the options when setting up link aggregation for ESXi 5.x?
hypercat wrote:
However, following the article I used about static link aggregation, it says to use the Route based on IP Hash option. It even says: "Note: The only load balancing option for vSwitches or vDistributed Switches that can be used with EtherChannel is IP HASH."
The text is actually correct, since it only relates to (static) EtherChannel on the physical switch, where you must use IP Hash.
However, if you just make sure the VLAN tagging is correct on the switch ports, you could keep them without any link aggregation and use Port ID, with the advantages described above.
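For contrast with the IP Hash approach, the Port ID policy can be sketched like this (illustrative Python only, assuming the documented behavior that each virtual switch port is pinned to exactly one active uplink; the class and method names are hypothetical):

```python
class PortIdTeaming:
    """Sketch of virtual-port-ID load balancing: each virtual port
    (roughly, each VM vNIC) is pinned to one active uplink, and the
    ports are spread across the uplinks. No LAG is needed on the
    physical switch, since each vmnic behaves as an independent link.
    """

    def __init__(self, num_uplinks: int):
        self.num_uplinks = num_uplinks

    def uplink_for_port(self, virtual_port_id: int) -> int:
        # One fixed uplink per virtual port; all of that port's
        # traffic uses this single vmnic until a failover occurs.
        return virtual_port_id % self.num_uplinks

team = PortIdTeaming(num_uplinks=2)
print([team.uplink_for_port(p) for p in range(4)])
```

The trade-off from earlier in the thread is visible here: ports are balanced across both vmnics, but any single VM's traffic is limited to one vmnic's bandwidth.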
I think my stubborn brain is finally understanding what you're saying. I guess maybe my problem is that I'm too accustomed to thinking about hardware aggregation where both ends of the aggregation have to be set up exactly the same way. So, if I am in fact getting it finally, what you're saying is to set up the virtual switch to do link aggregation with routing based on the Port ID, and then the hardware switch doesn't have any aggregation at all. So, the virtual switch handles the routing and the physical switch just has two independent ports hooked up to the NICs on the ESXi host. Right?
Does this mean that I need to have two virtual NICs on the virtual machine and set them up for some sort of aggregation in order for that machine to benefit from the link aggregation, or is that a moot point also? I've always assumed that any virtual machine only needed one NIC to communicate with the virtual switch and would still get the benefit of any aggregation between the host and the physical network.
I did as you suggested, removing the link aggregation on the physical switch and setting the virtual switch to route based on virtual port ID. However, now I seem to be having some browsing and authentication issues. Browsing seems to be very slow, particularly on Windows XP workstations, or when two or more users are browsing the same set of folders at the same time. Also, users are getting some messages indicating that they don't have permissions to access resources that they clearly do have permissions to access. If they close and reopen their computer browsing window, they can get access to those same folders. Any ideas on these issues?
Those kinds of issues seem very strange by themselves, but they should not really be caused by any remaining error in the switch setup. Error messages like access denied in the Windows filesystem occur at a much higher level.
But, just to verify the physical and virtual setup again: you have Port ID on your vSwitch and all port groups? No active/standby settings or anything else?
Could you post the configuration from the physical switch on the ports connecting to the ESXi host?
Well, it may be unusual to see those symptoms, but what I finally did, because it was getting worse and worse, was to remove one of the physical NICs from the vSwitch. So, now the vSwitch has only one physical NIC allocated to it, and everything is working fine. However, the reason I wanted to do aggregation in the first place on this system was that this server is ultimately going to be serving out very large image files to a number of different users through a SQL database that runs on a different server. I want to be sure that the server hosting the image files has very good response time. So, in other words, I want to be able to make this aggregation work.
Here is the configuration on the physical switch for those ports:
Port State: Enabled
Flow Control: Disabled
Speed: Auto (1000M)
Duplex: Auto (Full)
Spanning Tree:
Port: 23
STP: Enable
Port Fast: Disable
Root Guard: Disable
Port State: Forwarding
Port Role: Designated
Speed: 1000M
Path Cost: 100
Priority: 128
RSTP Link Type: Auto
Designated Bridge ID: 32768-00:22:57:f5:91:c0
Designated Port ID: 128-23
Designated Cost: 0
Forward Transitions: 1
There is only one VLAN.
