Bug: LACP NIC Teaming not working, possibly due to long vmnic name on ESXi

PTXtreZ · ‎04-15-2016

Hello Everybody

We wanted to make the community aware of a bug of sorts we found while upgrading a customer's switches to a virtual chassis environment. This customer has more than 10 ESXi 6.0 hosts managed by vCenter. He uses dual-port Intel 10Gb NICs for his networking backbone. He wanted redundant, fault-tolerant connections for his cluster, but was using his switches as basic dumb switches, with IP Hash for teaming when required. Since budget was an issue, we ended up going with dual Dell 8024F switches in a stack configuration so that each host could run a dual uplink, one on each switch, on an LACP bond to provide the fault tolerance he needed should one switch fail.

We followed all the guides in the Knowledge Base and for the most part everything went fine. All hosts are members of a Distributed Virtual Switch, which allows an LACP uplink to each host. Enhanced LACP functions are enabled. No VLANs or 802.1Q trunking are being used on the VMware side, and all the switch ports and LACP LAGs are on the same VLAN. All LAGs were configured as LACP Active.

The issue we saw is that 3 of his servers refused to work with the LACP configuration. We verified everything: all the servers were using the same Intel NICs with the same firmware and the same VMware Intel driver type and version. The configurations were identical, and yet the LAG would not work, for no apparent reason. After a couple of hours of looking at the servers with the issue, we noticed that only servers which had an unusually long physical NIC (vmnic) name had this problem. Instead of the usual vmnic2 we were looking at vmnic_100600. We asked the client, and it seems he replaced or moved the NICs on those 3 hosts because of the issues he had with his old switches, which caused the name changes.
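
For anyone who wants to check their own hosts for the same pattern, here is a minimal sketch in Python. It assumes the "usual" format is vmnic followed only by digits (as in vmnic2), and that you have already collected the NIC names from each host by whatever means you prefer. The function name is made up for illustration, and we do not know the exact rule ESXi applies internally, so treat a match only as a hint.

```python
import re

# Assumption: a "usual" name is "vmnic" followed only by digits, e.g. vmnic2.
USUAL_VMNIC = re.compile(r"vmnic\d+")


def unusual_vmnics(names):
    """Return the NIC names that do not look like vmnic<N>."""
    return [name for name in names if not USUAL_VMNIC.fullmatch(name)]


# Names taken from this post: vmnic2 is the usual form, vmnic_100600 is not.
print(unusual_vmnics(["vmnic0", "vmnic2", "vmnic_100600"]))
```

Any name this returns is worth a second look before you add that host to an LACP LAG.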

We debugged the LACP process and saw the following:

Fri Apr 15 02:55:17 2016[DEBUG]:147, LACP service is starting...

Fri Apr 15 02:57:49 2016[DEBUG]:147, Init LAG for portset DvsPortset-0(group 1683830441, mode 1)

Fri Apr 15 02:57:51 2016[DEBUG]:147, Add uplink vmnic_100600 into lag of portset DvsPortset-0

Fri Apr 15 02:57:51 2016[DEBUG]:147, Failed to get port index from uplink name: vmnic_100600

Fri Apr 15 02:57:51 2016[DEBUG]:147, Failed to initialize lacp port instance for uplink vmnic_100600

Fri Apr 15 02:57:51 2016[DEBUG]:147, Failed to add lacp port for uplink vmnic_100600

Fri Apr 15 02:57:51 2016[DEBUG]:147, Failed to update port

Fri Apr 15 02:57:51 2016[DEBUG]:147, Failed to process data
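
If you have many hosts to go through, a small script can pull the affected uplink names out of this debug output. The sketch below is only an illustration: it matches the exact "Failed to get port index from uplink name" message shown above, takes the log text as a string however you obtained it, and uses a hypothetical function name.

```python
import re

# Matches the failure message exactly as it appears in the debug output above.
FAILED_UPLINK = re.compile(r"Failed to get port index from uplink name: (\S+)")


def failed_uplinks(log_text):
    """Return each uplink name that failed port index lookup, in first-seen order."""
    seen = []
    for line in log_text.splitlines():
        match = FAILED_UPLINK.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


sample = (
    "Fri Apr 15 02:57:51 2016[DEBUG]:147, Add uplink vmnic_100600 into lag of portset DvsPortset-0\n"
    "Fri Apr 15 02:57:51 2016[DEBUG]:147, Failed to get port index from uplink name: vmnic_100600\n"
)
print(failed_uplinks(sample))
```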

This seemed to indicate the issue was on the host side. Since the unusual name was the common factor, we tried to rename the vmnics to the usual format by editing /etc/vmware/esx.conf, as is suggested on several websites, but we could not get the names to change in the UI or in vCenter. In the end we reinstalled the faulty hosts, since it was easier than following the procedure outlined in the VMware KB article "ESXi/ESX host loses network connectivity after adding new NICs or an upgrade". After the reinstall we were able to get LACP working normally on the hosts, which now show normal vmnic names.

We are not direct customers of VMware, so we can't just open a ticket requesting a bug fix. But since we did not find any reference anywhere to the vmnic name being a possible cause for LACP to fail, we thought we would share our experience in the hope that it will help others in the VMware community, as we have been helped in a similar manner.

Cheers!
