VMware Cloud Community
manfriday
Enthusiast

Major issues with HP DL580 G5 and Intel X520-DA2

Hi,

We are experiencing major issues with our HP DL580 G5 servers and Intel X520-DA2 NICs. You might want to grab a cup of coffee. This could take a while...

We currently have five DL580 G5s running ESXi 4.1 with all of the latest patches. All of these hosts are running the latest firmware revisions, and all of them are exhibiting the problematic behavior.

We HAD been using the HP-branded NetXen cards (NC522SFP) but had a lot of issues with those cards. If you search the message board here, you should be able to find plenty of information on the trouble these cards can cause.

So, in order to save myself some aggravation, I decided to go with Intel X520-DA2 NICs. At first, everything seemed OK. However, we have been experiencing strange issues since switching over to these cards.

We have two standard vSwitches set up. vSwitch0 has a pair of 1Gb copper ports for uplinks (vmnic0, vmnic1) and handles the management traffic as well as vMotion.

Everything else is trunked in on a pair of 10Gb fiber links plugged into the Intel X520s. These serve as uplinks for vSwitch1 (vmnic2, vmnic4), which handles all of the VM data as well as iSCSI traffic to a pair of EqualLogic arrays. We are using the EqualLogic multipathing plugin.
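
In case it helps anyone comparing notes, here's roughly how the layout can be sanity-checked from the ESXi shell (just a sketch; the vSwitch and vmnic names are the ones from my setup and will differ elsewhere):

~ # esxcfg-nics -l          # lists every physical NIC with its driver, link state and speed
~ # esxcfg-vswitch -l       # shows vSwitch0/vSwitch1 and which vmnics are attached as uplinks
~ # ethtool -i vmnic2       # confirms the driver behind the 10Gb uplink (ixgbe for the X520)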

Now for the problem... Every so often, vmnic2 freaks out. It still appears to be in a "connected" state, but it no longer passes any traffic. VMs that were using that NIC for an uplink lose network connectivity. They cannot ping out, nor do they respond to pings. Removing vmnic2 from the vSwitch uplinks restores network connectivity, as they fail over to vmnic4.
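
If it helps anyone, the uplink can be dropped and re-added from the shell as well (a sketch, using the vSwitch1/vmnic2 names above):

~ # esxcfg-vswitch -U vmnic2 vSwitch1    # unlink the hung uplink so traffic fails over to vmnic4
~ # esxcfg-vswitch -L vmnic2 vSwitch1    # re-link it once the host has been rebooted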

Shortly after this happens, the host will PSOD, as requested by the HP NMI driver. For grins, I tried uninstalling the HP NMI driver from some of those hosts.

When this occurs on a host without the NMI driver, I just get a message saying:

"cpu0:4120) NMI: 2540: LINT1 motherboard interrupt (1 forwarded so far). This is a hardware problem; please contact your hardware vendor."

My incredible deductive reasoning skills led me to believe this was a hardware problem, so I contacted my vendor.

They have been unable to find the issue.

I ran hardware diagnostics on several servers. On one server, I went so far as to run over 3,000 iterations of the hardware diagnostics over two weeks, and no problem was ever discovered.

When the NMI driver is not installed, the host will not PSOD. However, it will not behave properly again until it is rebooted.

We are, of course, plugged into two switches. One is a Cisco 6509 and the other is a Nexus 5000. I thought perhaps there was a problem with one of the switches, so I swapped all of the network cables (so what was plugged into the 6509 is now plugged into the 5000, and vice versa).

However, the problem occurred again, and it was still vmnic2 that freaked out. It did not follow the switch.

I have logged a support ticket with VMware. It has been open since about Dec. 13th, I think.

Also, I logged a support ticket with HP around the same time. Nobody seems to know what to do.

If anyone has an idea, I'd be quite grateful to hear it. Thanks!

Jason

84 Replies
damicall
Contributor

Back to the topic, which was HP NICs and instability.

Found this in my error logs on ESXi after I noticed an NC523 link go up and down this morning:

Apr 11 18:59:46 vmkernel: 88:15:28:15.794 cpu11:4409)<6>qlcnic 0000:07:00.0: Firmware Hang Detected
Apr 11 18:59:46 vmkernel: 88:15:28:15.794 cpu11:4409)<6>qlcnic 0000:07:00.0: Disabled bus mastering.
Apr 11 18:59:46 vmkernel: 88:15:28:15.795 cpu11:4409)IDT: 1565: 0x82
Apr 11 18:59:46 vmkernel: 88:15:28:15.795 cpu11:4409)IDT: 1634: <vmnic12[0]>
Apr 11 18:59:46 vmkernel: 88:15:28:15.871 cpu0:4412)<6>qlcnic 0000:07:00.1: Firmware reset request received.
Apr 11 18:59:46 vmkernel: 88:15:28:15.871 cpu0:4412)<6>qlcnic 0000:07:00.1: Disabled bus mastering.
Apr 11 18:59:46 vmkernel: 88:15:28:15.871 cpu0:4412)IDT: 1565: 0xc2
Apr 11 18:59:46 vmkernel: 88:15:28:15.871 cpu0:4412)IDT: 1634: <vmnic13[0]>

and we are using the VMware ESX/ESXi 4.x Networking Driver (qlcnic), version 4.0.727:

http://downloads.vmware.com/d/details/dt_esxi40_qlogic_qlcnic_40727/ZHcqYnQqQGhiZEBlZA

And some additional info, as other people are seeing the same thing:

http://wahlnetwork.com/2011/08/16/identifying-and-resolving-netxen-nx_nic-qlogic-nic-failures/

http://kb.vmware.com/selfservice/search.do?cmd=displayKC&docType=kc&externalId=2012455&sliceId=1&doc...
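
A quick way to check whether a host has been hitting these hangs (a sketch only; the vmkernel messages land in /var/log/messages on ESXi 4.x and /var/log/vmkernel.log on ESXi 5.x):

~ # grep -i "firmware hang" /var/log/messages        # ESXi 4.x
~ # grep -i "firmware hang" /var/log/vmkernel.log    # ESXi 5.x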

JonesytheGreate
Contributor

damicall,

We were seeing the exact error you describe: every time there was a failure, you could see that a firmware hang had been detected. I passed log after log to HP and didn't get anywhere. We finally replaced all of our dual-port 523SFPs with dual-port Intel 520 cards and we have not had an issue since. If you want to save yourself a headache, I would change the cards out if you can.

I have upgraded two servers so far to ESXi 5 and am not having any issues. We are using Cisco cables and SFPs, and there has been no issue with them and the Intel cards.

For networking I am using 2 onboard 1Gb Broadcom NICs for management and vMotion, and I have 2 10Gb Intel 520s that handle my VM traffic. The host does see 2 extra 10Gb NICs out there because I cannot disable them in the BIOS. So far this configuration has been very stable.

Jonesy

david2009
Contributor

"For networking I am using 2 onboard 1GB Broadcom nics for management and vmotion and I have 2 10GB Intel 520 that handle my VM traffic.  The host does see 2 extra 10GB nics out there because I cannot disable them in the bios.  So far this configuration has been very stable."

The reason you have a stable system is because you are using ESXi 5 and NOT ESX 4/4i.

This configuration is UNSTABLE in ESX 4/4i because the host see 4x10GB NIC and 2x1GB on-board NIC, thus violating the maximum network configuration in ESX 4/4i

markzz
Enthusiast

Well David2009, I'm not entirely sure I agree with the idea that ESXi 5 is more stable than ESX 4.

I can produce the same result in both versions.

I was looking at two servers with identical hardware; one was running ESXi 5 and the other ESX 4.

My ESX 4 server was stable except for the odd link-loss issue, but I never entirely lost the path, thanks to NIC redundancy.

The ESXi 5 server was the one suffering path loss, and it would require a reboot to get it back online.

I noticed I'd made a mistake when I was setting up the ESX 4 server.

I'd enabled jumbo frames on the vSwitch used for IP storage traffic, but I'd forgotten to enable jumbos on the associated vNIC. Of course I thought bugger, so I corrected the configuration. About two hours later the ESX 4 server lost access to the NFS stores.

By correcting my mistake, I'd discovered how to break the ESX 4 host.

I've checked the physical switches and the endpoint device; jumbos are enabled all the way down the line. This is not a configuration issue, but it may be related to another (ANOTHER) issue with these QLogic NICs, or possibly the port count.
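
For anyone checking the same thing, the places the MTU has to agree on ESX/ESXi 4 (vSwitch, VMkernel NIC, physical path) can be verified from the shell. A sketch only, assuming the storage vSwitch is called vSwitch1 and the NFS filer address is a placeholder:

~ # esxcfg-vswitch -l                    # MTU column should show 9000 on the storage vSwitch
~ # esxcfg-vmknic -l                     # the VMkernel NIC carrying NFS should also show MTU 9000
~ # esxcfg-vswitch -m 9000 vSwitch1      # sets the vSwitch MTU if it isn't already 9000
~ # vmkping -d -s 8972 <NFS filer IP>    # don't-fragment jumbo ping end to end (drop -d if your build's vmkping lacks it)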

Our Hong Kong associates have advised that a similar issue has been occurring for them. HP has admitted the NC375T is likely the actual cause and is advising that the NICs be replaced with the Intel-based version, the NC365T.

I know for a fact these NC365Ts utilise 128MB of system RAM, which I guess is no big deal. We have these NICs in a few of the DL585 G6 servers. Oddly, those DL585 G6s have never had any NIC issues.

JonesytheGreate
Contributor

David2009,

I am upgrading from ESXi 4.1 Update 2 to ESXi 5. This configuration, which is now on 5, was also stable for me on 4.1. If I could have disabled the extra 10Gb NICs, I would have. The other part of the discussion is that some people are having problems with cables that are not Intel cables. That is not the case with us; we are using Cisco cables for our 10Gb SFP connections.

JonesytheGreate
Contributor

Markzz,

Are you using iSCSI? Is it possible that something got messed up with the binding of NICs to the iSCSI HBAs?

caledunn
Contributor

I have this same issue with 10 ESXi hosts on version 4.1 Update 2. They run on DL380 G7s and have two NC523SFPs each. Only two 10Gb links are being used, and we are using Cisco cables. About every 24 hours or so I get the "Firmware hang detected" message on at least one ESXi host.

I've tried disabling the 4 onboard 1Gb NICs so ESXi only sees the 4 10Gb NICs. Again, I'm only using 2 10Gb NICs for all traffic, and I had the same failures. I opened a ticket with HP; they sent me the same exact card, so I'm not sure it's going to fix anything. I find it hard to believe I have 20 bad cards.

I also opened a ticket with VMware, and they seemed to think it may be something to do with the firmware/driver or maybe the network configuration. Since I have the latest firmware and driver loaded, I started to simplify the network. I'm currently testing an explicit failover setup instead of using "IP hash" and a port channel with our Nexus switches. I have a few tasks set up to vMotion VMs back and forth four times an hour.

markzz
Enthusiast

Hi JonesytheGreate,

No iSCSI used here. All IP storage is NFS.

markzz
Enthusiast

Hi caledunn

Your testing sounds complex and comprehensive.

I agree with your point regarding the NC523SFP cards.

You may have 1 or 2 faulty cards, but not all 20. I therefore wonder whether HP has updated the hardware revision of the cards and changed something.

I've got 2 NC523SFP cards here currently which HP has sent me. These are hardware revision 0D. I've not compared these cards to the current NC523s in the prod servers; I'll get to that tomorrow.

After upgrading the firmware and driver on the NC523 I have not seen another link loss (firmware hang). I now seem to suffer an issue where the cards won't transmit packets. Great job, QLogic.

damicall
Contributor

All our cards are:

HP P/N 593715-001 REV 0B (white sticker on the SFP+ slot)

markzz
Enthusiast

I'll check mine tomorrow evening.

caledunn
Contributor

My test made it past the 24-hour mark on the ESXi hosts in the test cluster, BUT I just got the "Firmware hang detected" message on an ESXi host in our Exchange cluster. I had removed IP hash and the port channel on that host as well, so it doesn't look like the config is the issue. I think the next thing I'm going to do is get an HP NC550SFP and an X520-DA2 and test with them.

Interesting about the rev on the NC523SFP, because the ones I have are 593715-001 Rev 0C and HP sent me Rev 0C. I'm going to double-check some of my servers and see if they are Rev 0C. I still have 13 unboxed that I believe are all Rev 0C as well, so I'm not optimistic.

caledunn
Contributor

I double-checked the 10Gb cards in the server that failed this morning, and they are at Rev 0C, which is the same rev that HP sent me.

caledunn
Contributor

Last night one of the ESXi hosts in the vSphere cluster I'm using for testing finally failed with the "Firmware hang detected" message. So the changes helped, but instead of failing every 24 hours it lasted almost 5 days. I guess the next step is to try a new card. We are getting 10 NC552SFP cards this week; we sent back what we hadn't unboxed yet to swap for the NC552SFP cards.

markzz
Enthusiast

Friday evening I installed the new NC523SFPs that HP support sent me.

The original cards are Rev 0A; the new cards are Rev 0D.

Unfortunately, one of these cards appeared to fail promptly. Regardless, I continued the test with both cards installed but only one functioning.

After about 20 hours the same issue occurred and I could not transmit packets over the 10Gb interface.

caledunn
Contributor

Mark,

I was thinking about adding a second riser and moving the second card over, so each card could sit in slot 1 on its own riser. Currently I have both NC523SFPs on the same riser, in slot 2 and slot 3, but I'm not sure this would make a difference. How do you have yours set up?

markzz
Enthusiast

caledunn

We use the NC522SFP in a number of DL585 G6s.

They were quite troublesome in the early days but have been stable since their last firmware and driver update.

If I recall correctly, the firmware released late last year is stable.

Driver and firmware information:

~ # ethtool -i vmnic2
driver: nx_nic
version: 5.0.601
firmware-version: 4.0.579
bus-info: 0000:08:00.0
~ # ethtool -i vmnic0
driver: nx_nic
version: 5.0.601
firmware-version: 4.0.579
bus-info: 0000:02:00.0

Oh, another thought.

I've had reports that the NC375T cards are part of the problem and that these should also be replaced with NC365Ts.

The DL585 G6s mentioned use NC364T add-in NICs (the NC364T is the earlier version of the NC365T; both use an Intel chipset).

markzz
Enthusiast

caledunn, we do use some DL385 G7s; generally they are purchased with the 2nd riser for expansion.

I would put the 2nd 10Gb NIC on the 2nd riser, for redundancy and to distribute the load across the buses.

In our DL585s, the expansion board that adds 3 more PCIe interfaces is in fact called a riser. I do the same there, with one of the 10Gb cards installed in it.

To be honest, it doesn't seem to make any difference. The NC523s still fail.

Your plan with the NC522 is sensible, BUT they run HOT HOT HOT.

I'd suggest you set your system BIOS to maximum cooling, and be sure the servers are getting plenty of nice cold air on them.

caledunn
Contributor

Is the latest ESXi driver for the NC523SFP card still 4.0.727, or is it now 4.0.739? In the hardware compatibility guide I noticed it's still listed as 4.0.727, but on the ESXi driver CD I only see 4.0.739. If I go to the HP advisory page and click the download link, it takes me to the page for 4.0.727.

Hardware compatibility guide:

http://www.vmware.com/resources/compatibility/detail.php?deviceCategory=io&productid=19311&deviceCat...

HP advisory:

http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?objectID=c02964542&lang=en&cc=us&taskI...

VMware 4.0.727 page:

https://my.vmware.com/web/vmware/details/dt_esxi40_qlogic_qlcnic_40727/ZHcqYnQqQGhiZEBlZA

VMware driver CD:

https://my.vmware.com/web/vmware/details?downloadGroup=DT-ESX4x-QLOGIC-QLCNIC-40739&productId=230#dt...
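
To see which qlcnic driver a host is actually running (as opposed to what the download pages list), a quick sketch; the vihostupdate line assumes the vSphere CLI is installed on a management station, and <host> is a placeholder:

~ # ethtool -i vmnic2                      # the "version:" line reports the loaded qlcnic driver (e.g. 4.0.727); substitute your 10Gb vmnic

vihostupdate.pl --server <host> --query    # from the vCLI station: lists the bulletins/driver bundles installed on an ESXi 4.1 host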

markzz
Enthusiast

If I remember correctly:

The recommended driver for ESXi 5 is 5.0.727.

The recommended driver for ESX/ESXi 4 is 4.0.727.

But there are later driver versions available:

For ESXi 5: qlcnic-esx50-5.0.741-635278.zip (version 5.0.741)

I'm not sure if there is a later version than 4.0.727 for ESX 4.
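
For anyone applying these, a rough sketch of how the async qlcnic drivers usually get installed (paths and bundle names below are placeholders; put the host in maintenance mode first and reboot afterwards):

ESXi 5, from the host shell:
~ # esxcli software vib install -d /vmfs/volumes/<datastore>/<qlcnic offline bundle>.zip

ESX/ESXi 4.x, from a vSphere CLI station:
vihostupdate.pl --server <host> --install --bundle <qlcnic 4.0.7xx offline bundle>.zip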
