We are experiencing major issues with our HP DL580 G5s and Intel X520-DA2 NICs. You might want to grab a cup of coffee. This could take a while...
We currently have five DL580 G5s running ESXi 4.1 with all of the latest patches. All of these hosts are running the latest firmware revisions, and all of them exhibit the problematic behavior.
We HAD been using the HP-branded NetXen cards (NC522SFP) but had a lot of issues with those. If you search the message board here, you should be able to find plenty of information on the trouble these cards can cause.
So, in order to save myself some aggravation, I decided to go with Intel X520-DA2 NICs. At first, everything seemed OK. However, we have been experiencing strange issues since switching over to these cards.
We have two standard vSwitches set up. vSwitch0 has a pair of 1Gb copper uplinks (vmnic0, vmnic1). It handles the management traffic as well as vMotion.
Everything else is trunked in on a pair of 10Gb fiber links plugged into the Intel X520s. These serve as uplinks for vSwitch1 (vmnic2, vmnic4), which handles all of the VM data as well as iSCSI traffic to a pair of EqualLogic arrays. We are using the EqualLogic Multipathing Plugin.
Now for the problem: every so often, vmnic2 freaks out. It still appears to be in a "connected" state, but it no longer passes any traffic. VMs that were using that NIC for an uplink lose network connectivity. They cannot ping out, nor do they respond to pings. Removing vmnic2 from the vSwitch uplinks restores network connectivity, as they fail over to vmnic4.
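For anyone hitting the same symptom, the manual workaround can be done from the console. This is only a sketch using the ESXi 4.x `esxcfg-vswitch` commands; the vSwitch and vmnic names are from our setup, so adjust for yours:

```shell
# List vSwitches and their uplinks to confirm which vmnic is wedged
esxcfg-vswitch -l

# Pull the stuck uplink (vmnic2) out of vSwitch1 so VM traffic
# fails over to the healthy uplink (vmnic4)
esxcfg-vswitch -U vmnic2 vSwitch1

# After a reboot clears the card, re-add the uplink
esxcfg-vswitch -L vmnic2 vSwitch1
```

These commands only reshuffle the uplinks; they don't fix the card, which for us still needed a reboot.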
Shortly after this happens, the host will PSOD, as requested by the HP NMI driver. For grins, I tried uninstalling the HP NMI driver from some of those hosts.
When this occurs on a host without the NMI driver, I just get a message saying:
"cpu0:4120) NMI: 2540: LINT1 motherboard interrupt (1 forwarded so far). This is a hardware problem; please contact your hardware vendor."
My incredible deductive reasoning skills led me to believe this was a hardware problem, so I contacted my vendor.
They have been unable to find the issue.
I ran hardware diagnostics on several servers. On one server, I went so far as to run over 3000 iterations of the hardware diagnostics over two weeks, and no problem was ever discovered.
When the NMI driver is not installed, the host will not PSOD. However, it will not behave properly again until it is rebooted.
We are, of course, plugged into two switches. One is a Cisco 6509, and the other is a Nexus 5000. I thought perhaps there was a problem with one of the switches, so I swapped all of the network cables (so what was plugged into the 6509 is now plugged into the 5000, and vice versa).
However, the problem occurred again, and it was still vmnic2 that freaked out. It did not follow the switch.
I have logged a support ticket with VMware. It has been open since about Dec. 13th, I think.
I also logged a support ticket with HP around the same time. Nobody seems to know what to do.
If anyone has an idea, I'd be quite grateful to hear it. Thanks!
You're a real black cloud Rumple 🙂 I'll sound the alarm and see what the decision makers say... not sure our timeline will allow for a replacement. Looking forward to a very uneasy datacenter migration. At least I'll know what to monitor for and maybe get some warning. Appreciate all the information and insight though.
Trust me…I wasn’t happy about it either…when we hit it, we had just migrated from one datacenter to a new datacenter with all new network gear, new ESX environment on 10G…then things started falling over…
/me was not the popular boy in town let me tell you…
What also bit us: when it was set up by the other consultant, they forgot that you can have at most 4x 10G ports…or 2x 10G ports and 1G together…on the NC522 you cannot disable any of the ports on the 10G cards, so even though they only plugged in 2x 10G ports…VMware would see 4 ports…so while the 1G would work, it was an unsupported configuration, and upon reboot there is always the possibility that, depending on memory load order, your 10G ports could get knocked out…
Wow... I hadn't heard about the 4x 10GbE maximum, but I just found it: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=102080....
I was only planning to connect one port on each card to start, but also planned to utilize the four 1Gb onboard NICs.
I worked with HP and QLogic, and in the DL380 the only thing that showed up in the BIOS was port 1 in the device list. I could disable the entire card easily enough, but I could not disable port 2 on each card and use port 1 for connectivity.
If you can get the 2 unused ports on the NICs to disable, then perfect...
QLogic and HP both indicated it could not be done... and in the device section I only saw port 1.
My suspicion was that with port 2 unplugged it never showed in the BIOS, but I worked with VMware and they showed all 4 ports enumerating...
Sorry to hear you're having so much trouble with your systems. I'm the author of longwhiteclouds.com. I'm running the Intel X520-T2 and I'm not having any problems at all. The cards have been rock solid. I understand that the SFP version of the same card type is also pretty rock solid. The customer that I had with the NC522SFP's is also now stable after the last driver and firmware updates.
Have you considered switching to vSphere 5? The maximums for NIC ports are much higher than on 4.x. On vSphere 5 you can have up to 6x 10Gb/s ports AND 4x 1Gb/s ports. Just in case you decide to go down this path, the config maximums document is at this location: http://www.vmware.com/pdf/vsphere5/r50/vsphere-50-configuration-maximums.pdf
I hope you get a new driver that works, or have some success with vSphere 5. IMHO vSphere 5 is well worth the upgrade.
We have 20 DL380 G7s in two separate datacenters. Each server has two NC523SFP dual-port cards. We are connecting one port on each NIC to two Nexus 5548s. We are EtherChanneling, and are using one vmkernel with active/active NICs. Randomly we are seeing one NIC drop for about 2 seconds, which triggers a redundancy-lost alarm. We have been working with HP because this http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?objectID=c02964542&aoid=35252 didn't solve the problem. We are using the 4.0.727 driver with firmware 4.8.22. When the problem happens, we see "firmware hang detected" in the message logs.
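If it helps anyone compare notes, this is roughly how the events can be spotted on an affected host. Log locations differ between ESX and ESXi versions, so treat the paths as assumptions and check where your build writes vmkernel messages:

```shell
# Look for the qlcnic firmware hang messages in the host logs
grep -i "firmware hang" /var/log/messages /var/log/vmkernel* 2>/dev/null

# Count how many times a given vmnic has flapped, to match against
# the switch-side UPDOWN log entries
grep -i "link state" /var/log/vmkernel* 2>/dev/null | grep -c vmnic2
```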
We then ordered two NC522SFPs to put into one of the servers, and that just ended up worse. When the NIC flapped on this one, the network connection would not come back up until I bounced the server.
We have involved HP, VMware, and Cisco, and all fingers seem to point to HP firmware. Please tell me that I am not the only one out here having this issue. Unless I can come up with some other ideas, we are now looking into the Intel® Ethernet Server Adapter X520-DA2.
Any help would be appreciated,
We were experiencing the issues you describe when we were running the NC522SFP NICs in EtherChannel mode with the same Nexus line (maybe the smaller 5520 series), and ended up replacing all 14 NICs with the single-port Intel X520-SR1 (non-HP-branded). We have not had a single issue since we did that over 2 months ago…previous to that, we'd have a server fall over every day or three.
We had a single port on each HP NetXen SFP card connected, and when one failed it would take out the entire server. The switch guys were seeing a massive amount of port flooding before and during the outage. As you found, only a reboot of the server brought it back.
Check out this thread as well
same problem here Matt..
1x NC523, latest firmware & VMware driver, both ports connected to a Cisco 3750-X with the latest IOS, DL380 G6, vSphere 4.1 build 348481, a few VMs, lightly loaded host.
006248: Dec 30 19:57:23.911: %LINEPROTO-5-UPDOWN: Line protocol on Interface TenGigabitEthernet1/1/2, changed state to down
006249: Dec 30 19:57:23.945: %LINEPROTO-5-UPDOWN: Line protocol on Interface TenGigabitEthernet3/1/1, changed state to down
006250: Dec 30 19:57:24.918: %LINK-3-UPDOWN: Interface TenGigabitEthernet1/1/2, changed state to down
006251: Dec 30 19:57:25.086: %LINK-3-UPDOWN: Interface TenGigabitEthernet3/1/1, changed state to down
006252: Dec 30 19:57:36.628: %LINK-3-UPDOWN: Interface TenGigabitEthernet3/1/1, changed state to up
006253: Dec 30 19:57:36.628: %LINK-3-UPDOWN: Interface TenGigabitEthernet1/1/2, changed state to up
006254: Dec 30 19:57:38.725: %LINEPROTO-5-UPDOWN: Line protocol on Interface TenGigabitEthernet1/1/2, changed state to up
006255: Dec 30 19:57:38.742: %LINEPROTO-5-UPDOWN: Line protocol on Interface TenGigabitEthernet3/1/1, changed state to up
We have had problems with the NC522SFP for about 18 months now. Each time we upgrade the firmware and/or drivers the problems morph but never go away. We continue to see transmit timeouts, excessive Xoff pause frames, port resets, and PSOD.
Even our new ESXi 5.0 hosts with the most current NC522SFP firmware and drivers still have the problems.
We still have about 60 hosts with NC522SFP adapters.
We have open and active cases with HP and VMware. Both have acknowledged a problem, but as of today we still don't have a fix. I have lost all confidence in the NC522SFP.
Time to move on...
Yeah, we started with the 523 and then tried out the 522 (made things worse). Just yesterday I replaced 4 NC523SFP with Intel X520-DA2 cards in two of our servers. I will post in about a week if the cards are stable.
That would be great; I hope it goes well. I think we will need to go down this path also.
Also, has anyone tried the firmware that VMware states on the HCL?
Model: NC523SFP 10Gb 2-port Server Adapter (VID: 1077, SSID: 3733)
Firmware Version: 4.6.31 (firmware); 4.0.702 (driver)
Number of Ports: 2
ESXi 5.0: qlcnic version 5.0.727 (async)
ESX / ESXi 4.1 U2: qlcnic version 4.0.727
It has been a week and a half and we have had no issues with the Intel NICs. Today I am replacing the remaining NC523SFPs and shipping them back.
Best of all, HP decided to close my ticket with them this weekend, without contacting me.
I edited this post because I previously mentioned turning on VMDq. I have tested on two systems, and performance seems worse when you explicitly configure it instead of leaving the default setting. I recommend not messing with the VMDq setting.
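For reference, driver module options like VMDq are set per driver with `esxcfg-module`. A sketch, assuming the Intel X520's `ixgbe` driver; the exact parameter names vary between driver releases, so check what your module actually accepts first:

```shell
# Show the parameters this driver module accepts (names vary by release)
vmkload_mod -s ixgbe

# Show any options currently set for the driver
esxcfg-module -g ixgbe

# Clear custom options and fall back to the driver defaults
# (takes effect after a reboot)
esxcfg-module -s "" ixgbe
```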
One last update. We have replaced the cards in both of our datacenters with the Intel X520-DA2, and after updating the drivers to the most current version, I have had no more issues. Ditching the QLogic cards was the solution.
"They are saying that you HAVE to use their SFPs. I am not. I am using Cisco SFPs, which I figured would work fine.
Their support page is pretty clear, though. They don't say things like 'it's not supported' or 'not certified'.
They flatly declare it WILL NOT WORK."
Can you please send me the link that says this? I would like to check it out.
Just want to update everyone on SR 11057191404, which was opened by ManFriday. It is still open and under investigation by both Cisco and VMware.
So this is a big issue.
We have also, unfortunately, purchased the NC523SFP cards.
We have been running these cards for about a year, and they have been trouble from the start.
Although there have been various firmware and driver updates, these cards have intermittently suffered link-loss issues. Generally the cards recover within a few seconds.
A week or so ago we experienced the same link loss, but this time on both cards at the same time. Of course this means a production outage.
I took the plunge and upgraded one host to ESXi 5 and applied the new firmware and drivers.
I'd be lying if I said this had improved the situation. It's in fact much worse.
We don't suffer the link-loss issues anymore; the cards appear fine, they just don't transmit packets. And somehow CPU utilisation of the host flatlines during this issue. At times the host recovers; sometimes I have to reboot the host to get it back.
We are using the NC522SFP cards in our G6 hosts; they have been stable for the past 2 years, but they did not start out that way.
I'm also trialling the Emulex-rebranded card, the NC552SFP; so far, so good.
We will need to make some hasty decisions on this issue this week, it's no longer a workable solution. The NC523SFP's need to go.
I'll get hold of an Intel X520-DA2 and trial it alongside the NC552SFP.
There is a later driver for the NC523SFP (or QLogic QLE3242) available from QLogic; the driver is distributed through VMware.
This obviously means HP doesn't support the driver, but QLogic and VMware do.
I'll do some testing and report back.