Ok here is one that has me totally stumped.
Three identical hosts (IBM x3690 X5) with IBM-branded QLogic CNAs; both ports are 10G Ethernet and configured as an EtherChannel into a pair of VSS'd Cisco 6509 switches. Currently using dvSwitches with half a dozen VLANs, Route based on IP hash, and pretty much the standard set of defaults. This is a new cluster that has only been operational for a little over a month. Roughly two weeks ago, out of the blue, all network connectivity dropped on the VMs in this cluster; the only workaround was to remove one leg of the EtherChannel, which restored all inbound and outbound traffic.
Moved all the VMs off of a single host in the cluster for some testing, which shows some pretty odd results. The only way to restore all connectivity to the host is to remove the leg attached to vNIC7. The attached image shows that if I swap the cables between vNIC6 and vNIC7, inbound traffic is restored but outbound traffic fails; the reverse happens if I go back to the original physical configuration.
I'm assuming you want switch configs?:
description VMWare ESXi3 - HBA2
switchport trunk encapsulation dot1q
switchport mode trunk
channel-group 33 mode on
Group Port-channel Protocol Ports
33 Po33(SU) - Te1/2/5(P) Te2/2/5(D)
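Side note for anyone comparing configs: Route based on IP hash requires a static EtherChannel (channel-group mode on, no LACP/PAgP negotiation, since the vSwitch doesn't speak either protocol), which matches the member-port config above. The corresponding port-channel interface would look roughly like this (a sketch inferred from the member config, not our exact running config):

interface Port-channel33
 description VMWare ESXi3 - HBA2
 switchport
 switchport trunk encapsulation dot1q
 switchport mode trunk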
On the dvSwitch, it's Route based on IP hash, Link status only for failure detection, and Notify Switches/Failback both set to Yes — pretty much the default dvSwitch config.
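For anyone trying to work out which uplink a given flow should land on under that policy, my understanding of IP hash (as commonly described, not pulled from VMware's code) is that the vSwitch XORs the 32-bit source and destination IPs and takes the result modulo the number of active uplinks. A quick sketch:

```python
import ipaddress

def ip_hash_uplink(src_ip: str, dst_ip: str, num_uplinks: int) -> int:
    """Sketch of the 'Route based on IP hash' uplink choice: XOR the
    32-bit source and destination addresses, then take the result
    modulo the number of active uplinks. (My understanding of the
    documented behavior -- not VMware's actual implementation.)"""
    src = int(ipaddress.IPv4Address(src_ip))
    dst = int(ipaddress.IPv4Address(dst_ip))
    return (src ^ dst) % num_uplinks

# Two flows from the same VM can land on different uplinks:
print(ip_hash_uplink("10.0.0.1", "10.0.0.2", 2))  # -> 1
print(ip_hash_uplink("10.0.0.1", "10.0.0.3", 2))  # -> 0
```

Which is why pulling one leg "fixes" things: with a single uplink every flow hashes to the same port, so an inconsistency between the two channel members stops mattering.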
I'll have to plead ignorance on parts of the network side, as I'm more storage/server oriented.
As for the CNAs, I originally had two in each host, but after speaking with VMware support we pulled one from each host, as the max config is four 10Gb, or two 10Gb plus eight 1Gb. I originally thought that this might have been the problem, but almost immediately after pulling one of the cards and running both 10G connections out of the remaining card, the issue presented on all three hosts. The issue presents on all three of the currently active cards, and I've never seen an actual "batch" of bad NICs (hard drives, on the other hand, yes).
On a side note, Duncan, thanks for the wonderful book. It's been very helpful.
Have you had any movement on this issue? I have a client that's seeing a very similar problem, from what I can tell. I'm not fully up to speed yet on what they've seen and tried.
Nope, no movement yet, other than the issue refusing to show up for a few weeks now. The really odd thing is that it's almost as if certain virtual machines being on the cluster cause the issue to show up. I have several production systems on a second cluster that is not affected (different NICs and hosts, but on the same dvSwitch).
I'm running packet captures that I need to send to IBM once the issue arises again.
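In case it helps anyone else, one way to grab those captures without touching the hosts is a local SPAN session on the 6509 mirroring the suspect member port. A sketch (the source interface matches the earlier port-channel output; the destination interface is a placeholder for wherever the capture box is plugged in):

monitor session 1 source interface Te1/2/5 both
monitor session 1 destination interface Gi1/3/1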
Apparently these cards are not supported on *this* particular server. Yes, the IBM Redbooks say the card supports ESXi 4.1, except the ServerProven site at IBM says that the 7148 (x3690 X5) server does not support ESXi 4.1 with this particular card. Go figure. One would have thought that IBM support would have caught this very early on, and not after level 3 engineering had to get involved.
I've had to move to the 4x-as-expensive Broadcom 10Gb single-port adapters for full support. Yes, my CNAs, which cost $750 each, have to be replaced with two cards at $3600 a pop. So now I have to go back to the well and beg for 20k to get some new cards. Suffice it to say, I am not very happy with IBM right now.
One more thing: the cards that do support ESXi 4.1 have a nice little asterisk beside them noting that IBM will no longer be marketing these with this server, meaning they will no longer be an option if you order a new x3690. So, word of warning: these awesome boxes that are marketed for VMware use don't support the current release when it comes to 10Gb Ethernet connectivity.