spascoe64
Contributor

Mixed Cluster 4.1/5.1 & Cisco Nexus 5020 in the mix; strange vmotion failures

I'm an independent consultant working with one of my clients. We are in the process of upgrading our farm from 4.1 to 5.1 and ran into a performance problem, as follows:

The servers are Dell R710s, each with dual QLA8152 CNA cards (2 ports each). We have two Cisco Nexus 5020 switches, and each CNA has one port plugged into each Nexus (crossing ports and switches).

The farm is configured with a vCenter 4.1 Distributed Switch with 4 uplinks (one per CNA port). Long ago, we enabled jumbo frames on the switches, ports, and Nexus switches.

At this point, I've got vCenter upgraded to 5.1, but have not upgraded the vDS (it is still at 4.1). I've upgraded two of the hosts to ESXi 5.1 and patched them. I am running the latest firmware and drivers on everything.

My Nexus switches are not running a current NX-OS release (they are on 4.2(1)NT(1a)), and I noticed last week that the 'show interface ethernet x/x' command shows the MTU as 1500. That got me concerned, until I found information saying this is 'just a cosmetic' issue and does not reflect the MTU actually in effect.

Fast forward to today: after upgrading the second server, I was running some basic vMotion tests on non-critical VMs. A single vMotion works every time, but when I attempt to vMotion 5 VMs at once, I get very inconsistent results: sometimes one fails to vMotion, and sometimes all of them fail.

The vmkernel logs were showing that the connection between the hosts was failing.  (Note, I didn't see any notifications of physical failures in the Nexus.)

After multiple attempts to resolve the problem, I got a different failure as documented by http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=203689...

I went in and set the vMotion MTU on all servers to 1500, and the problem appears to have disappeared.

So, at this point, I'm left wondering:

1) Does ESX 5.1 behave differently than ESX 4.1 when connected to a Nexus that is reporting an MTU of 1500?

2) Should I upgrade NX-OS to the current version?

Any thoughts would be appreciated.

5 Replies
GRiveros
Contributor

Interesting. If possible, could you try v6 of NX-OS? v4 may be too old. Regards.


sflanders
Commander

Mismatched MTU sizes are known to cause issues (as outlined in KB 2036890). Have you verified that the MTU issue on the 5020 is really cosmetic? Have you tried a vmkping with a jumbo packet to make sure (see kb.vmware.com/kb/1003728)?
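In case it helps with the jumbo-packet test: the ping payload has to leave room for the IPv4 and ICMP headers, so it is smaller than the MTU itself. Here is a rough Python sketch of that arithmetic, assuming a 20-byte IPv4 header with no options and an 8-byte ICMP header. The helper names and the target address 192.0.2.10 are made up for illustration; -d (don't fragment) and -s (payload size) are the vmkping options used for this kind of test.

```python
IPV4_HEADER_BYTES = 20  # assumes an IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo header


def vmkping_payload(mtu: int) -> int:
    """Largest ICMP payload that fits in one unfragmented packet at this MTU."""
    overhead = IPV4_HEADER_BYTES + ICMP_HEADER_BYTES
    if mtu <= overhead:
        raise ValueError(f"MTU {mtu} leaves no room for a payload")
    return mtu - overhead


def vmkping_command(target_ip: str, mtu: int) -> str:
    """Build a vmkping command line (hypothetical helper, not a VMware tool).

    -d sets the don't-fragment bit, -s sets the payload size in bytes.
    """
    return f"vmkping -d -s {vmkping_payload(mtu)} {target_ip}"
```

For a 9000-byte MTU this gives a payload of 8972 bytes, and for 1500 it gives 1472. If the 8972-byte ping with don't-fragment set fails while the 1472-byte one succeeds, something in the path is probably not passing jumbo frames.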

spascoe64
Contributor

I have checked the MTU, and actually pushed it back down to 1500 across the board. I have tried a vmkping as described and it successfully pings.

I believe I've narrowed it down to a combination of ESXi 5.1 using the QLA8152 (latest drivers/firmware) and the Nexus 5020 on the networking side of the CNA. The SAN side doesn't appear to be having any difficulties.

I'm hoping to get a case open with Cisco to figure out how to diagnose any flow control or packet issues on its end, but at this point it feels like the issue is with the QLogic card.

spascoe64
Contributor

I found this forum thread today.  I am experiencing the same issues: http://communities.vmware.com/message/2181032

Based on my research, I tried turning off the 'ql2xenablemsix' module parameter for the qla2xxx module. It has had some positive impact, but during tests I am noticing that sometimes I can vMotion multiple machines, and sometimes only one.

spascoe64
Contributor

Instead of repeating myself here, I answered this here: http://communities.vmware.com/message/2232588#2232588
