VMware Cloud Community
Gabriel_Chapman
Enthusiast

ESXi 4.1u1 and Etherchannel refuse to work

Ok here is one that has me totally stumped.

3 identical hosts (IBM x3690 X5) with IBM-branded QLogic CNAs; both ports are 10G Ethernet and set up as an EtherChannel into a pair of VSS'd Cisco 6509 switches. Currently using dvSwitches with half a dozen VLANs, route based on IP hash, and pretty much the standard set of defaults. This is a new cluster that has been operational for only a little over a month. Roughly two weeks ago, out of the blue, all network connectivity dropped on the VMs in this cluster; the only workaround was to remove one leg of the EtherChannel, which restored all inbound and outbound traffic.

Moved all the VMs off of a single host in the cluster for some testing, which shows some pretty odd results. The only way to restore all connectivity to the host is to remove the leg attached to vNIC7. The attached image shows that if I switch the cables, swapping vNIC6 with vNIC7, inbound traffic is restored but outbound traffic fails; the reverse happens if I go back to the original physical configuration.

The odd thing we see is that when vNIC6 is connected to port 1/2/3 or 2/2/3, all connectivity flows without issue. If we add vNIC7, all connectivity fails.
vNIC7 on port 1/2/3 and vNIC6 on 2/2/3 causes all outbound traffic routing through the vSwitch to cease; the opposite occurs if we put vNIC7 onto port 2/2/3 and vNIC6 onto port 1/2/3: traffic in works but traffic out does not. vNIC7 on its own in either of the two active ports results in no traffic at all.
I have a ticket open with VMware but wanted to see if anyone else has a clue as to what would cause this. I have a second cluster of x3850 M2s with single-port 10G Intel NICs that is unaffected and has worked for almost two years without an issue. The switch configs are identical across the board. Firmware and drivers for the QLogic CNAs are all up to date and on the VMware HCL approved list.
Ex Gladio Equitas
6 Replies
depping
Leadership

can you post the config as well?

Duncan (VCDX)

Available now on Amazon: vSphere 4.1 HA and DRS technical deepdive

tschutte
Contributor

Perhaps this is a simple hardware failure of the CNA? Have you tried swapping it for a spare?

Gabriel_Chapman
Enthusiast

I'm assuming you want the switch configs:

interface TenGigabitEthernet1/2/5
description VMWare ESXi3 - HBA1
switchport
switchport trunk encapsulation dot1q
switchport mode trunk
switchport nonegotiate
channel-group 33 mode on
!

interface TenGigabitEthernet2/2/5
description VMWare ESXi3 - HBA2
switchport
switchport trunk encapsulation dot1q
switchport mode trunk
switchport nonegotiate
channel-group 33 mode on
!

Group  Port-channel  Protocol    Ports
33     Po33(SU)         -        Te1/2/5(P)     Te2/2/5(D)

On the dvSwitch, it's Route based on IP hash, Link Status Only for failover detection, and Notify Switches/Failback set to Yes; pretty much the default dvSwitch config.
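For anyone following along: with Route based on IP hash, each flow is deterministically pinned to one uplink, which is why pulling a leg "fixes" things for every VM at once. VMware describes the hash as an XOR of the source and destination IPv4 addresses, modulo the number of active uplinks. A minimal sketch of that selection (the function name and the addresses are mine, purely for illustration):

```python
import ipaddress

def ip_hash_uplink(src_ip: str, dst_ip: str, num_uplinks: int) -> int:
    """Index of the uplink a flow is pinned to under IP-hash teaming."""
    src = int(ipaddress.IPv4Address(src_ip))
    dst = int(ipaddress.IPv4Address(dst_ip))
    # XOR the two 32-bit addresses, then take modulo of the active uplinks.
    return (src ^ dst) % num_uplinks

# With both legs up, flows spread across uplinks 0 and 1:
print(ip_hash_uplink("10.0.0.10", "10.0.0.20", 2))  # uplink 0
print(ip_hash_uplink("10.0.0.11", "10.0.0.20", 2))  # uplink 1
# With one leg pulled, every flow lands on the surviving uplink:
print(ip_hash_uplink("10.0.0.11", "10.0.0.20", 1))  # uplink 0
```

So if one leg (here, the one behind vNIC7) silently blackholes traffic, roughly half of the flows die while the rest keep working, and removing that leg restores everything.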

I'll have to plead ignorance on parts of the network side, as I'm more storage/server oriented.

As for the CNAs, I originally had two in each host, but after speaking with VMware support we pulled one from each host, since the max config is four 10Gb ports, or two 10Gb and eight 1Gb. I originally thought that this might have been the problem, but almost immediately after pulling one of the cards and running both 10G connections out of the remaining card, the issue presented on all 3 hosts. The issue presents on all 3 of the currently active cards, and I've never seen an actual "batch" of bad NICs (HDs, on the other hand, yes).

On a side note, Duncan, thanks for the wonderful book. It's been very helpful.

Ex Gladio Equitas
Blaishon
Contributor

Gabriel,

Have you had any movement on this issue? I have a client that is seeing a very similar problem, from what I can tell. I'm not fully up to speed yet on what they've seen and tried.

Doug...

Gabriel_Chapman
Enthusiast

Nope, no movement yet, other than the issue refusing to show up for a few weeks now. The real odd thing is that it's almost as if certain virtual machines being on the cluster cause the issue to show up. I have several production systems on a second cluster that is not affected (different NICs and hosts, but on the same dvSwitch).

I'm running packet captures that I need to send to IBM once the issue arises again.

Ex Gladio Equitas
Gabriel_Chapman
Enthusiast

Update:

Apparently these cards are not supported on *this* particular server. Yes, the IBM Redbooks say the card supports ESXi 4.1, but the IBM ServerProven site says that the 7148 (x3690 X5) server does not support ESXi 4.1 with this particular card. Go figure. One would have thought that IBM support would have caught this very early on, and not after level 3 engineering had to get involved.

I've had to move to the 4x-as-expensive Broadcom 10Gb single-port adapters for full support. Yes, my CNAs, which cost $750 each, have to be replaced with 2 cards at $3600 a pop. So now I have to go back to the well and beg for 20k to get some new cards. Suffice it to say I am not very happy with IBM right now.

One more thing: the cards that do support ESXi 4.1 have a nice little asterisk beside them noting that IBM will no longer be marketing these with this server, meaning they will no longer be an option if you order a new x3690. So, word of warning: these awesome boxes that are marketed for VMware use don't support the current release when it comes to 10Gb Ethernet connectivity.

http://www-03.ibm.com/systems/info/x86servers/serverproven/compat/us/xseries/lan/matrix.html

Ex Gladio Equitas