manfriday
Enthusiast

Major issues with HP DL580 G5 and Intel X520-DA2

Hi,

We are experiencing major issues with our HP DL580 G5 servers and Intel X520-DA2 NICs. You might want to grab a cup of coffee. This could take a while...

We currently have 5 DL580 G5's running ESXi 4.1 with all of the latest patches. All of these hosts are running the latest firmware revisions. All of these hosts are exhibiting the problematic behavior.

We HAD been using the HP-branded NetXen cards (NC522SFP) but had a lot of issues with them. If you search the message board here, you should find plenty of information on the trouble these cards can cause.

So, in order to save myself some aggravation, I decided to go with Intel X520-DA2 NICs. At first, everything seemed OK. However, we have been experiencing strange issues since switching over to these cards.

We have two standard vSwitches set up. vSwitch0 has a pair of 1Gb copper uplinks (vmnic0, vmnic1). It handles the management traffic, as well as vMotion.

Everything else is trunked in on a pair of 10Gb fiber links plugged into the Intel X520s. These serve as uplinks for vSwitch1 (vmnic2, vmnic4), which handles all of the VM data, as well as iSCSI traffic to a pair of EqualLogic arrays. We are using the EqualLogic Multipathing Plugin.

Now for the problem. Every so often, vmnic2 freaks out. It still appears to be in a "connected" state, but it no longer passes any traffic. VMs that were using that NIC as an uplink lose network connectivity. They cannot ping out, nor do they respond to pings. Removing vmnic2 from the vSwitch uplinks restores network connectivity, as the VMs fail over to vmnic4.

Shortly after this happens, the host will PSOD, as requested by the HP NMI driver. For grins, I tried uninstalling the HP NMI driver from some of those hosts.

When this occurs on a host without the NMI driver, I just get a message saying:

"cpu0:4120) NMI: 2540: LINT1 motherboard interrupt (1 forwarded so far). This is a hardware problem; please contact your hardware vendor."

My incredible deductive reasoning skills led me to believe this was a hardware problem, so I contacted my vendor.

They have been unable to find the issue.

I ran hardware diagnostics on several servers. On one server, I went so far as to run over 3000 iterations of the hardware diagnostics over two weeks, and no problem was ever discovered.

When the NMI driver is not installed, the host will not PSOD. However, it will not behave properly again until it is rebooted.
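
For anyone who wants to watch for this condition on a host that does not PSOD, here is a minimal sketch that scans log lines for the NMI message quoted above. The pattern is modeled only on that single line, so other ESXi builds may word it differently. The function name is made up for illustration, and reading the number after `cpu0:` as the world ID is my assumption.

```python
import re

# Pattern modeled on the one NMI line quoted in this post (assumption:
# other builds may format the message differently).
NMI_RE = re.compile(
    r"cpu(?P<cpu>\d+):(?P<world>\d+)\)\s*NMI:\s*\d+:\s*"
    r"(?P<source>LINT\d+) motherboard interrupt "
    r"\((?P<count>\d+) forwarded so far\)"
)

def find_nmi_events(lines):
    """Return one dict per log line that matches NMI_RE."""
    events = []
    for line in lines:
        m = NMI_RE.search(line)
        if m:
            events.append({
                "cpu": int(m.group("cpu")),
                # Assumed to be the world ID; not confirmed in this thread.
                "world": int(m.group("world")),
                "source": m.group("source"),
                "forwarded": int(m.group("count")),
            })
    return events

sample = [
    "cpu0:4120) NMI: 2540: LINT1 motherboard interrupt (1 forwarded so far). "
    "This is a hardware problem; please contact your hardware vendor.",
    "some unrelated log line",
]

for event in find_nmi_events(sample):
    print(event)
```

Feeding it the lines of a saved vmkernel log instead of `sample` would give a rough count of how often a host hits this before anyone notices the dead uplink.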

We are, of course, plugged into two switches. One is a Cisco 6509, and the other is a Nexus 5000. I thought perhaps there was a problem with one of the switches, so I swapped all of the network cables (so what was plugged into the 6509 is now plugged into the 5000, and vice versa).

However, the problem occurred again, and it was still vmnic2 that freaked out. It did not follow the switch.

I have logged a support ticket with VMware. It has been open since about Dec. 13th, I think.

Also, I logged a support ticket with HP around the same time. Nobody seems to know what to do.

If anyone has an idea, I'd be quite grateful to hear it. Thanks!

Jason

84 Replies
vmproteau
Enthusiast

You're a real black cloud, Rumple 🙂 I'll sound the alarm and see what the decision makers say... not sure our timeline will allow for a replacement. Looking forward to a very uneasy datacenter migration. At least I'll know what to monitor for and maybe get some warning. Appreciate all the information and insight, though.

Rumple
Virtuoso

Trust me…I wasn’t happy about it either…when we hit it, we had just migrated from one datacenter to a new datacenter with all new network gear, new ESX environment on 10G…then things started falling over…

/me was not the popular boy in town let me tell you…

What also bit us was that when it was set up by the other consultant, they forgot that you can have 4x10G cards…or 2x10G cards and 1G together…on the NC522 you cannot disable any of the ports on the 10G cards, so even though they plugged in 2x 10G ports…VMware would see 4x ports…so while the 1G would work…it was an unsupported configuration, and upon reboot there is always the possibility that, depending on memory load order, your 10G ports could get knocked out…

Sigh…

manfriday
Enthusiast

I was able to disable ports on the NC522's without any problem.

I just disabled the pci device for that port in the server bios.

vmproteau
Enthusiast

Wow, I hadn't heard about the 4 x 10GbE maximum, but I just found it: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=102080....

I was only planning to connect 1 port on each card to start but also planned to utilize the 4-1GB Onboard NICs.

  1. Is it then true that if the 2 unused 10GbE ports are disabled, I will technically have 2 x 10GbE ports with respect to this KB and be allowed to use the 1Gb ports?
  2. Manfriday - What server model are you using?

Rumple
Virtuoso

I worked with HP and QLogic, and in the DL380 the only thing that showed up in the BIOS device list was port 1. I could disable the entire card easily enough, but I could not disable port 2 on each card and keep using port 1 for connectivity.

Rumple
Virtuoso

If you can get the 2 unused ports on the NICs to disable, then perfect...

QLogic and HP both indicated it could not be done...and in the device section I only saw port 1.

My suspicion was that with port 2 unplugged it never showed in the BIOS, but I worked with VMware and they showed all 4 ports enumerating...

MichaelW007
Enthusiast

Hi Rumple,

Sorry to hear you're having so much trouble with your systems. I'm the author of longwhiteclouds.com. I'm running the Intel X520-T2 and I'm not having any problems at all. The cards have been rock solid. I understand that the SFP version of the same card type is also pretty rock solid. The customer that I had with the NC522SFP's is also now stable after the last driver and firmware updates.

Have you considered switching to vSphere 5? The maximums for NIC ports are much better than on 4.x. On vSphere 5 you can have up to 6 x 10Gb/s ports AND 4 x 1Gb/s ports. In case you decide to go down this path, the config maximums document is at this location: http://www.vmware.com/pdf/vsphere5/r50/vsphere-50-configuration-maximums.pdf
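
To make the port arithmetic concrete, here is a minimal sketch that checks a planned port count against the vSphere 5 figures quoted above (6 x 10Gb/s and 4 x 1Gb/s). The function and constant names are made up for illustration, and the limits are taken only from this post, so treat the configuration maximums document as the authority. As noted earlier in the thread, count every port the host enumerates, not only the ones that are cabled.

```python
# Limits as quoted in this post for vSphere 5 (assumption: verify against
# the configuration maximums document before relying on them).
VSPHERE5_MAX_10G_PORTS = 6
VSPHERE5_MAX_1G_PORTS = 4

def within_vsphere5_nic_maximums(ports_10g, ports_1g):
    """Return True if the enumerated port counts fit the quoted limits."""
    return (ports_10g <= VSPHERE5_MAX_10G_PORTS
            and ports_1g <= VSPHERE5_MAX_1G_PORTS)

# Two dual-port 10G cards (4 enumerated ports) plus 4 onboard 1G ports.
print(within_vsphere5_nic_maximums(ports_10g=4, ports_1g=4))

# Four dual-port 10G cards would exceed the quoted 10G limit.
print(within_vsphere5_nic_maximums(ports_10g=8, ports_1g=4))
```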

I hope you get a new driver that works, or have some success with vSphere 5. IMHO vSphere 5 is well worth the upgrade.

Rumple
Virtuoso

Since we replaced all 14 of the hp 10g 522 cards with the X520 single port sfp versions we have not had a single incident.

JonesytheGreate
Contributor

We have 20 DL380 G7s in 2 separate datacenters. Each server has 2 NC523SFP dual-port cards. We are connecting one port on each NIC to 2 Nexus 5548s. We are EtherChanneling, and are using one vmkernel with active/active NICs. Randomly, we see one NIC drop for about 2 seconds, which triggers a redundancy lost alarm. We have been working with HP because this http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?objectID=c02964542&aoid=35252 didn't solve the problem. We are using the 4.0.727 driver with firmware 4.8.22. When the problem happens, we see "firmware hang detected" in the message logs.

We then ordered 2 NC522SFP to put into one of the servers and that just ended up worse.  When the nic flapped on this one, the network connection would not come back up until I bounced the server.

We have involved HP, VMware, and Cisco, and all fingers seem to point to HP firmware.  Please tell me that I am not the only one out here having this issue.  Unless I can come up with some other ideas, we are now looking into the Intel® Ethernet Server Adapter X520-DA2.

Any help would be appreciated,

Matt

Rumple
Virtuoso

We were experiencing the issues you describe when we were running the NC522SFP NICs in EtherChannel mode with the same Nexus line (maybe the smaller 5520 series). We ended up swapping all 14 NICs out for the single-port Intel X520-SR1 (non-HP-branded) and have not had a single issue since we did that over 2 months ago…previous to that, we’d have a server fall over every day or 3.

We had a single port on each HP NetXen SFP card connected, and when one failed it would take out the entire server. The switch guys were seeing a massive amount of port flooding prior to and during the outage. As you found, only a reboot of the server brought it back.

Check out this thread as well

http://wahlnetwork.com/2011/08/16/identifying-and-resolving-netxen-nx_nic-qlogic-nic-failures/

damicall
Contributor

Same problem here, Matt.

1 x NC523 with the latest firmware and VMware driver, both ports connected to a Cisco 3750-X running the latest IOS; DL380 G6, vSphere 4.1 build 348481, lightly loaded host with a few VMs.

006248: Dec 30 19:57:23.911: %LINEPROTO-5-UPDOWN: Line protocol on Interface TenGigabitEthernet1/1/2, changed state to down
006249: Dec 30 19:57:23.945: %LINEPROTO-5-UPDOWN: Line protocol on Interface TenGigabitEthernet3/1/1, changed state to down
006250: Dec 30 19:57:24.918: %LINK-3-UPDOWN: Interface TenGigabitEthernet1/1/2, changed state to down
006251: Dec 30 19:57:25.086: %LINK-3-UPDOWN: Interface TenGigabitEthernet3/1/1, changed state to down
006252: Dec 30 19:57:36.628: %LINK-3-UPDOWN: Interface TenGigabitEthernet3/1/1, changed state to up
006253: Dec 30 19:57:36.628: %LINK-3-UPDOWN: Interface TenGigabitEthernet1/1/2, changed state to up
006254: Dec 30 19:57:38.725: %LINEPROTO-5-UPDOWN: Line protocol on Interface TenGigabitEthernet1/1/2, changed state to up
006255: Dec 30 19:57:38.742: %LINEPROTO-5-UPDOWN: Line protocol on Interface TenGigabitEthernet3/1/1, changed state to up
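
A quick way to quantify a flap like this is to pair each LINK down event with the next up event on the same interface. This is a minimal sketch, assuming the IOS syslog format shown above. The function name is hypothetical, and the year is an assumption because the log lines do not carry one.

```python
import re
from datetime import datetime

# Matches only %LINK-3-UPDOWN lines; %LINEPROTO lines are skipped.
LINK_RE = re.compile(
    r"^\d+: (?P<ts>\w{3} +\d+ \d{2}:\d{2}:\d{2}\.\d{3}): "
    r"%LINK-3-UPDOWN: Interface (?P<intf>[^,\s]+), "
    r"changed state to (?P<state>up|down)"
)

def link_outages(lines, year=2011):
    """Pair each link-down event with the next link-up on the same port.

    Returns a list of (interface, seconds_down) tuples. The year default
    is an assumption; the syslog timestamps above do not include one.
    """
    down_since = {}
    outages = []
    for line in lines:
        m = LINK_RE.match(line.strip())
        if not m:
            continue
        ts = datetime.strptime(f"{year} {m.group('ts')}",
                               "%Y %b %d %H:%M:%S.%f")
        intf = m.group("intf")
        if m.group("state") == "down":
            down_since[intf] = ts
        elif intf in down_since:
            delta = ts - down_since.pop(intf)
            outages.append((intf, round(delta.total_seconds(), 3)))
    return outages

sample = [
    "006248: Dec 30 19:57:23.911: %LINEPROTO-5-UPDOWN: Line protocol on Interface TenGigabitEthernet1/1/2, changed state to down",
    "006250: Dec 30 19:57:24.918: %LINK-3-UPDOWN: Interface TenGigabitEthernet1/1/2, changed state to down",
    "006251: Dec 30 19:57:25.086: %LINK-3-UPDOWN: Interface TenGigabitEthernet3/1/1, changed state to down",
    "006252: Dec 30 19:57:36.628: %LINK-3-UPDOWN: Interface TenGigabitEthernet3/1/1, changed state to up",
    "006253: Dec 30 19:57:36.628: %LINK-3-UPDOWN: Interface TenGigabitEthernet1/1/2, changed state to up",
]

for intf, seconds in link_outages(sample):
    print(f"{intf} was down for {seconds} s")
```

For the log above, both ports come out at roughly 11.5 to 11.7 seconds of link-down time, and they dropped within about 0.2 seconds of each other.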

MichaelHopwood
Contributor

We have had problems with the NC522SFP for about 18 months now. Each time we upgrade the firmware and/or drivers the problems morph but never go away. We continue to see transmit timeouts, excessive Xoff pause frames, port resets, and PSOD.

Even our new ESXi 5.0 hosts with the most current NC522SFP firmware and drivers still have the problems.

We still have about 60 hosts with NC522SFP adapters.

  • HP ProLiant DL380 G6, G7, and DL580 G7 servers
  • NC522SFP ports connected to separate Cisco Nexus 5000 switches
  • ESXi 4.1 U1, U2, and ESXi 5.0
  • NC522SFP firmware = 4.0.579
  • ESXi 5.0 nx_nic driver = 5.0.601

We have open and active cases with HP and VMware. Both have acknowledged a problem, but as of today we still don’t have a fix. I have lost all confidence in the NC522SFP.

Time to move on...

JonesytheGreate
Contributor

Yeah, we started with the 523 and then tried out the 522 (made things worse).  Just yesterday I replaced 4 NC523SFP with Intel X520-DA2 cards in two of our servers.  I will post in about a week if the cards are stable.

damicall
Contributor

That would be great. I hope it goes well. I think we will need to go down this path also.

=====

Also, has anyone tried the firmware that VMware lists on the HCL?

Model: NC523SFP 10Gb 2-port Server Adapter
Device Type: Network
Partner Name: HP
Firmware Version: 4.6.31 (firmware); 4.0.702 (driver)
Number of Ports: 2
VID: 1077
DID: 8020
SVID: 103c
SSID: 3733

ESXi 5.0: qlcnic version 5.0.727 async
Footnotes: Download driver from http://www.vmware.com/download/vsphere/drivers_tools.html
ESX / ESXi 4.1 U2: qlcnic version 4.0.727

damicall
Contributor

Hi Guys,

Any updates?

Thx

JonesytheGreate
Contributor

It has been a week and a half and we have had no issues with the Intel NICs. Today I am replacing the remaining NC523SFPs and shipping them back.

Best of all, HP decided to close my ticket with them this weekend, without contacting me.

I edited this post because I had previously mentioned turning on VMDq. I have tested on two systems, and performance seems worse when you explicitly configure it than when you leave it at the default setting. I recommend not messing with the VMDq setting.

JonesytheGreate
Contributor

One last update. We have replaced the cards in both of our datacenters with the Intel X520-DA2, and after updating the drivers to the most current version, I have had no more issues. Ditching the QLogic cards was the solution.

david2009
Contributor

ManFriday,

Your comment:

"They are saying that you HAVE to use their SFP's. I am not. I am using cisco SFP's, which I figured would work fine.

Their support page is pretty clear though. They dont say things like "it's not supported" or "not certified".

They flatly declare it WILL NOT WORK."

Can you please send me the link that says this? I would like to check this out.

Just want to update everyone on SR 11057191404, which was opened by ManFriday. It is still open and under investigation by both Cisco and VMware.

So this is a big issue.

markzz
Enthusiast

We have also, unfortunately, purchased the NC523SFP cards.

We have been running these cards for about a year, and they have been trouble from the start.

Although there have been various firmware and driver updates, these cards have intermittently suffered link loss issues. Generally the cards recover within a few seconds.

A week or so ago we experienced the same link loss, but this time on both cards at the same time. Of course, this means a production outage.

I took the plunge and upgraded one host to ESXi 5, applied the new firmware and drivers

http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?lang=en&cc=us&taskId=110&prodSeriesId=...

I'd be lying if I said this had improved the situation. It's in fact much worse.

We don't suffer the link loss issues anymore. The cards appear fine; they just don't transmit packets. Oh, and somehow the host's CPU utilisation also flatlines during this issue. At times the host recovers; sometimes I have to reboot the host to get it back.

We are using the NC522SFP cards in our G6 hosts. They have been stable for the past 2 years but did not start out that way.

I'm also trialing the Emulex rebranded card, the NC552SFP. So far so good.

We will need to make some hasty decisions on this issue this week; it's no longer a workable solution. The NC523SFPs need to go.

I'll get hold of an Intel X520-DA2 and trial it alongside the NC552SFP.

markzz
Enthusiast

An update.

There is a later driver for the NC523SFP (or QLogic QLE3242) from QLogic; the driver is available from VMware.

This obviously means HP don't support the driver, but QLogic and VMware do.

I'll do some testing and report back.

The driver

http://downloads.vmware.com/d/details/dt_esxi50_qlcnic_5_0_741/dHRAYndlaCpiZHAlJQ==
