VMware Cloud Community
manfriday
Enthusiast

Major issues with HP DL580 G5 and Intel X520-DA2

Hi,

We are experiencing major issues with our HP DL580 G5s and Intel X520-DA2 NICs. You might want to grab a cup of coffee. This could take a while...

We currently have 5 DL580 G5's running ESXi 4.1 with all of the latest patches. All of these hosts are running the latest firmware revisions. All of these hosts are exhibiting the problematic behavior.

We HAD been using the HP branded NetXen cards (NC522SFP) but had a lot of issues with those cards. If you do a search on the message board here, you should be able to find plenty of information on the trouble these cards can cause.

So, in order to save myself some aggravation, I decided to go with Intel X520-DA2 NICs. At first, everything seemed OK. However, we have been experiencing strange issues since switching over to these cards.

We have two standard vSwitches set up. vSwitch0 has a pair of 1Gb copper uplinks (vmnic0, vmnic1). It handles the management traffic, as well as vMotion.

Everything else is trunked in on a pair of 10Gb fiber ports plugged into the Intel X520s. These serve as uplinks for vSwitch1 (vmnic2, vmnic4), which handles all of the VM data, as well as iSCSI traffic to a pair of EqualLogic arrays. We are using the EqualLogic Multipathing Plugin.

Now for the problem. Every so often, vmnic2 freaks out. It still appears to be in a "connected" state, but it no longer passes any traffic. VMs that were using that NIC for an uplink lose network connectivity. They cannot ping out, nor do they respond to pings. Removing vmnic2 from the vSwitch uplinks restores network connectivity, as they fail over to vmnic4.
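
For what it's worth, the same workaround can be done from the console instead of the vSphere Client. A rough sketch, using the vSwitch and vmnic names above (ESXi 4.1 Tech Support Mode; syntax from memory):

~ # esxcfg-nics -l                        (check the reported link state and speed of vmnic2 and vmnic4)
~ # esxcfg-vswitch -U vmnic2 vSwitch1     (unlink the bad uplink so everything fails over to vmnic4)
~ # esxcfg-vswitch -L vmnic2 vSwitch1     (re-link it after the host has been rebooted or the card replaced)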

Shortly after this happens, the host will PSOD, as requested by the HP NMI driver. For grins, I tried uninstalling the HP NMI driver from some of those hosts.

When this occurs on a host without the NMI driver, I just get a message saying:

"cpu0:4120) NMI: 2540: LINT1 motherboard interrupt (1 forwarded so far). This is a hardware problem; please contact your hardware vendor."

My incredible deductive reasoning skills led me to believe this was a hardware problem, so I contacted my vendor.

They have been unable to find the issue.

I ran hardware diagnostics on several servers. On one server, I went so far as to run over 3000 iterations of the hardware diagnostics over two weeks, and no problem was ever discovered.

When the NMI driver is not installed, the host will not PSOD. However, it will not behave properly again until it is rebooted.

We are, of course, plugged into two switches. One is a Cisco 6509, and the other is a Nexus 5000. I thought perhaps there was a problem with one of the switches, so I swapped all of the network cables (so what was plugged into the 6509 is now plugged into the 5000, and vice versa).

However, the problem occurred again, and it was still vmnic2 that freaked out. It did not follow the switch.

I have logged a support ticket with VMware. It has been open since about Dec. 13th, I think.

Also, I logged a support ticket with HP around the same time. Nobody seems to know what to do.

If anyone has an idea, I'd be quite grateful to hear it. Thanks!

Jason

84 Replies
markzz
Enthusiast

I'd just like to add that we are running the latest versions.

NC523SFP

~ # ethtool -i vmnic12
driver: qlcnic
version: 5.0.741
firmware-version: 4.8.22
bus-info: 0000:81:00.0
~ #

And it's still terribly unstable.

Although if I keep my vNIC MTU at 1500, I only suffer the link-loss issues. If I push the vNIC MTU to 9000 (the vSwitch is running an MTU of 9000), the NC523's ports will, after some time, stop passing packets.
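
For anyone wanting to check the same thing, something like this from the console shows the MTU actually in play (vSwitch1 below is just a placeholder name for the switch carrying the 9000 MTU port groups):

~ # esxcfg-vswitch -l                     (lists each vSwitch with its MTU and uplinks)
~ # esxcfg-vmknic -l                      (lists the VMkernel ports and the MTU each one is using)
~ # esxcfg-vswitch -m 9000 vSwitch1       (sets the vSwitch MTU back to 9000 after testing at 1500)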

markzz
Enthusiast

Just a very quick update on this debacle of a situation.

After a few weeks battling with HP support, who seem more confused by this issue than I am, and VMware, who were really no help at all, I managed to get hold of our Enterprise account manager.

He has been very helpful and we have made significant progress by involving the local Australian Technical team and making some rather radical changes.

HP agreed to send over 2x NC552SFP (10Gb Emulex), which replaced the 2x NC523SFPs, and 2x NC365T (Intel 1Gb), which replaced the 2x NC375Ts.

This combination has now been running for 5 days.

The new NICs have not reported any failures, link loss, anything at all.

To put this into perspective: these servers have been running for over 12 months, and there has never been a period of 5 days where they have not experienced link loss or some other NIC port failure.

The onboard NC375i has been behaving better, but it is still the one thing I'm not confident about. I've seen a couple of vMotions fail. I've not done any monitoring or diagnosis yet, but it seems that when these ports hit 90% utilisation they pause and no longer pass packets (sounds like some other QLogic NICs).
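
I haven't captured it yet, but a batch esxtop run during a busy vMotion window should show whether the ports really stall at high utilisation (the interval, sample count and output file below are arbitrary examples):

~ # esxtop -b -d 5 -n 720 > /tmp/nic-stats.csv    (an hour of 5-second samples; the network columns show per-vmnic throughput and drops)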

I did read a forum entry where it was stated that HP and QLogic recognise there is an issue with the onboard NC375i chipset and have a replacement riser available which resolves it.

At this point I'm going to be kind to HP and begin discussing replacing the NC523SFPs and NC375Ts in our other DL585 G7s. I'm not sure if this will be a swap-out or a purchase situation; either way, I'm just happy to see some improvement in stability.

caledunn
Contributor

I've also been testing with the X520-DA2, NC550SFP and NC552SFP cards, and without making a single configuration change they have yet to go down.

I have several vSphere clusters I've been testing with, and I have at least one ESXi host with the NC523SFP cards in each cluster and the others with the Intel or Emulex cards. The hosts with the NC523SFP cards go down every few days, but the others stay up. The only thing that changed was replacing the cards. We are moving forward with purchasing more NC552SFP cards, and that will be our solution to the problem. We will just eat the cost on the 20+ cards we have; we are hoping we can reuse them with our Windows servers.

I would have responded earlier, but I wanted to give it a couple of weeks to make sure the Emulex and Intel cards were stable. I'll let you know if I run into any problems with the new cards. I'll add that VMware support was actually pretty helpful for us. They didn't offer a solution, but they helped troubleshoot and narrow down where the issue is, and kept the ticket open.

markzz
Enthusiast

I'd like to update this post with where we are at and the stability of the NC375T, NC523SFP, NC375i.

I would like to start by saying our endgame solution with these QLogic network cards has simply been to replace them. I do, however, have some lab servers which still use the NC375T NIC.

It would be wonderful to report there was an achievable solution which stabilised these network cards, BUT THERE IS NOT.

QLogic have continued to release firmware and drivers in an attempt to resolve the various performance and stability issues; nonetheless, it does not appear they have achieved an acceptable result.

My advice is to just avoid these network cards.

Emulex, Brocade and Intel chipset cards are available from HP. These may be marginally more expensive, but they work. If I were asked for a recommendation:

NC552SFP are stable and fast

NC365T are again stable and fast.

Intel cards are always expensive, but they simply work.

(my opinions are my own, my experience is what I share)

cypherx
Hot Shot

Are these cards related to the QLogic QLE3242 dual port 10GbE adapter?

We were having trouble maintaining new NFS storage connectivity across these adapters in ESXi 5 U3, build 1489271. So far the fix (I hope it's a fix) was to update the firmware and driver to these versions:

driver: qlcnic
version: 5.1.178
firmware-version: 4.16.34

Originally we had driver 5.0.727 and firmware 4.9.x.

I found another thread on here with the same NIC and poor iSCSI stability when using jumbo frames. Going back to 1500 MTU would stabilize it for them, but then they upgraded to firmware 4.12.x and jumbo was stable. That post was quite some time ago, so as you can see, 4.16.34 is out now. I also installed the QLogic CIM provider on each host and the vCenter Server plugin, so I can now view and manage these cards.
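
For anyone wanting to compare versions on their own hosts, the driver and firmware can be checked per host the same way as earlier in the thread (vmnic2 below is just an example name; the VIB check assumes ESXi 5.x):

~ # ethtool -i vmnic2                          (reports the qlcnic driver and firmware versions)
~ # esxcli software vib list | grep qlcnic     (shows which qlcnic driver VIB is actually installed)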

I made this change only a week ago, but so far so good. Here's knocking on wood...
