We are experiencing major issues with our HP DL580 G5 servers and Intel X520-DA2 NICs. You might want to grab a cup of coffee. This could take a while...
We currently have 5 DL580 G5's running ESXi 4.1 with all of the latest patches. All of these hosts are running the latest firmware revisions. All of these hosts are exhibiting the problematic behavior.
We HAD been using the HP branded NetXen cards (NC522SFP) but had a lot of issues with those cards. If you do a search on the message board here, you should be able to find plenty of information on the troubles these cards can cause.
SO, in order to save myself some aggravation, I decided to go with Intel X520-DA2 nics. At first, everything seemed OK. However, we have been experiencing strange issues since switching over to these cards.
We have two standard vswitches set up. vSwitch 0 has a pair of 1gb copper for uplinks (vmnic0,vmnic1). It handles the management traffic, as well as vMotion.
Everything else is trunked in on a pair of 10gb fiber links, plugged into the Intel X520s. These serve as uplinks for vSwitch1 (vmnic2, vmnic4), which handles all of the VM data, as well as iSCSI traffic to a pair of EqualLogic arrays. We are using the EqualLogic Multipathing Plugin.
Now for the problem. Every so often, VMNIC2 freaks out. It still appears to be in a "connected" state, but it no longer passes any traffic. VMs that were using that nic for an uplink lose network connectivity. They cannot ping out, nor do they respond to pings. Removing VMNIC2 from the vSwitch uplinks restores network connectivity, as they fail over to VMNIC4.
Shortly after this happens, the host will PSOD, as requested by the HP NMI driver. For grins, I tried uninstalling the HP NMI driver from some of those hosts.
When this occurs on a host without the NMI driver, I just get a message saying:
"cpu0:4120) NMI: 2540: LINT1 motherboard interrupt (1 forwarded so far). This is a hardware problem; please contact your hardware vendor."
My incredible deductive reasoning skills led me to believe this was a hardware problem, so I contacted my vendor.
They have been unable to find the issue.
I ran hardware diagnostics on several servers. On one server, I went so far as to run over 3000 iterations of the hardware diagnostics over two weeks, and no problem was ever discovered.
When the NMI driver is not installed, the host will not PSOD. However, it will not behave properly again until it is rebooted.
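For anyone who wants to check a saved vmkernel log for the LINT1 message quoted above, here is a minimal Python sketch. The regular expression is modeled only on the one line quoted in this post, so other builds may word the message differently. The function name find_nmi_events and the field names are made up for illustration, and I am assuming the number after "cpu0:" is the world ID.

```python
import re

# Pattern modeled on the single message quoted in this post.
# Other ESXi builds may format the line differently.
NMI_PATTERN = re.compile(
    r"cpu(?P<cpu>\d+):(?P<world>\d+)\)\s*NMI:\s*\d+:\s*"
    r"LINT1 motherboard interrupt\s*\((?P<count>\d+) forwarded so far\)"
)


def find_nmi_events(lines):
    """Return one dict per line that contains the LINT1 NMI message."""
    events = []
    for line in lines:
        match = NMI_PATTERN.search(line)
        if match:
            events.append({
                "cpu": int(match.group("cpu")),
                # Assumption: the second number is the vmkernel world ID.
                "world": int(match.group("world")),
                "forwarded": int(match.group("count")),
            })
    return events


if __name__ == "__main__":
    sample = [
        "cpu0:4120) NMI: 2540: LINT1 motherboard interrupt (1 forwarded so far). "
        "This is a hardware problem; please contact your hardware vendor.",
        "cpu3:4200) some unrelated message",
    ]
    for event in find_nmi_events(sample):
        print(event)
```

Feeding it the lines of a log file (for example via open(path) on a copy pulled off the host) would give a quick count of how often the interrupt fired before a hang.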
We are, of course, plugged into two switches. One is a Cisco 6509, and the other is a nexus 5000. I thought perhaps there was a problem with one of the switches, so I swapped all of the network cables (so what was plugged into the 6509 is now plugged into the 5000, and vice versa).
However, the problem occurred again, and it was still VMNIC2 that freaked out. It did not follow the switch.
I have logged a support ticket with vmware. It has been open since about Dec. 13th I think.
Also, I logged a support ticket with HP around the same time. Nobody seems to know what to do.
If anyone has an idea, I'd be quite grateful to hear it. Thanks!
You mention HP and VMware support but did you also try Intel for support??
Did you install the HP OEM version of ESXi 4.1? The HP version has the HP Management agents already installed. There were problems reported on the non-HP versions of ESXi 4.1 for some models. I understand the G6 and G7 had some issues with the non-HP versions. If you don't have the HP version you can still install the HP Management agents via CLI. There is a special package for ESXi so don't use the ones for ESX.
I have not. Thinking this was a pretty good idea, I went to their website to get their support number.
I came across a rather disturbing little blurb on their support page for the X520's.
They are saying that you HAVE to use their SFP's. I am not. I am using cisco SFP's, which I figured would work fine.
Their support page is pretty clear though. They don't say things like "it's not supported" or "not certified".
They flatly declare it WILL NOT WORK.
Doesn't make much sense to me, but at this phase I am willing to try just about anything.
I ordered a few intel SFP's for testing. I will let folks know how that goes.
I am using the standard vmware distribution of ESXi 4.1, with the HP management agents installed via the CLI.
That should be fine, right?
That should be fine. The HP Agents have been known to do bad things when installed or configured incorrectly.
1. Try playing around with the ixgbe parameters, e.g. run without MSI-X support (esxcfg-module -s "InterruptType=0" ixgbe && esxcfg-boot -b). vmkload_mod -s ixgbe shows enough buttons to press.
2. Try other slots - the DL580 G5 has a strange PCIe sub-bus layout with too few lanes. Also try to get a dedicated IRQ assigned.
Here the PCIe device layout: (assuming the 580 G5 has the PCIe sub IO board)
The 580G5 is limited to 28 PCIe lanes out of the North Bridge. Of these, only 24 lanes (3ea x8) go to the slots:
1ea x8 PCIe shared through a switch to slots 1, 2 & 3 (sub IO board)
1ea x8 PCIe shared through a switch to slots 4, 5 & 6
1ea x8 PCIe shared through a switch to slots 7, 8, 9, 10 & 11
(slots 8-11 are x4 PCIe slots. The rest are x8.)
To maximize system IO bandwidth you need to equally load all three PCIe switches.
3. There is a patched version of the IPMI driver out - do you use it ?
4. I wonder since when HP supports building in 3rd party NICs - I would move to an NC55x card (Emulex CNA).
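The slot layout from point 2 above can be written down as a small lookup, which makes it easy to check whether a planned card placement piles everything behind one PCIe switch. This is only a sketch of the layout as described in this post (assuming the PCIe sub IO board is fitted). The switch labels A, B and C and the function names are made up for illustration, and "oversubscribed" here only compares nominal slot widths against the x8 uplink, not real traffic.

```python
from collections import Counter

# Slot-to-switch mapping as described in the post above.
# The labels "A", "B", "C" are hypothetical names for the three PCIe switches.
SLOT_TO_SWITCH = {
    1: "A", 2: "A", 3: "A",
    4: "B", 5: "B", 6: "B",
    7: "C", 8: "C", 9: "C", 10: "C", 11: "C",
}
UPLINK_LANES = 8  # each switch has a single x8 link to the north bridge


def slot_width(slot):
    """Slots 8-11 are x4, all other slots are x8 (per the post)."""
    return 4 if 8 <= slot <= 11 else 8


def switch_load(populated_slots):
    """Count how many populated slots sit behind each PCIe switch."""
    counts = Counter({"A": 0, "B": 0, "C": 0})
    for slot in populated_slots:
        counts[SLOT_TO_SWITCH[slot]] += 1
    return dict(counts)


def oversubscribed(populated_slots):
    """List switches whose populated slot widths add up to more than x8."""
    lanes = Counter()
    for slot in populated_slots:
        lanes[SLOT_TO_SWITCH[slot]] += slot_width(slot)
    return sorted(sw for sw, total in lanes.items() if total > UPLINK_LANES)


if __name__ == "__main__":
    print(switch_load([1, 2]), oversubscribed([1, 2]))
    print(switch_load([1, 4, 7]), oversubscribed([1, 4, 7]))
```

Two x8 cards in slots 1 and 2 share one uplink, while slots 1, 4 and 7 put one card behind each switch, which is the "equally load all three" placement.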
Thanks for the input guys. I have not abandoned this thread, I have just been waiting to see if my implementation of some of your suggestions had helped me or not.
I took your advice, Saturnous, and changed slots for my 10gb cards. I had been hopeful that the situation was resolved, but alas, it was not.
It did seem to help the situation, as I went for about 20 days on several of the servers without a problem.
However, it did occur again.
I finally got an answer from VMware that may make a bit of sense.
We have 4 10gb nics in the system, even though only 2 are being used.
In addition, we have two 1gb copper nics, being used for management.
This violates the config maximums. Apparently when you have 4 10gb nics you cannot have ANY 1gb nics in use.
The VMware rep said that he has seen situations where this config maximum was violated and nics would occasionally just stop forwarding traffic.
Which sounds like what we are seeing.
I have gone into the BIOS of the HP servers, and disabled the unused nic ports, so ESXi only sees two 10gb nics and two 1gb nics.
VMware says this should be a good config.
We'll see. I'll report back after a week or so and let people know if this was the solution or not.
Thanks for your help
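For what it's worth, the rule as the VMware rep relayed it can be written as a one-line check. This is only a sketch of what was said in this post, not the official configuration maximums table. The function name is made up, and treating "4 or more" 10gb ports the same as exactly 4 is my own assumption.

```python
def violates_rule_as_relayed(ten_gb_ports, one_gb_ports):
    """True if the port mix matches the case the rep described as a problem.

    Encodes only what is stated in this thread: four 10gb ports visible to
    the host together with any 1gb ports in use. Using >= 4 rather than
    == 4 is an assumption, and this is not the official maximums table.
    """
    return ten_gb_ports >= 4 and one_gb_ports > 0


if __name__ == "__main__":
    # Original setup: four 10gb ports visible, two 1gb ports in use.
    print(violates_rule_as_relayed(4, 2))
    # After disabling the unused ports in the BIOS: two 10gb, two 1gb.
    print(violates_rule_as_relayed(2, 2))
```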
We've recently hit these same types of issues with the nextgen cards and were wondering if disabling the unused 10g nics resolved the issue for you?
This issue has been plaguing us for over a year. Every time I thought I had it licked, it would rear its ugly head again a few weeks later.
These intermittent problems are always the worst.
I worked with VMware support extensively and FINALLY they got enough information from us that they believed it was a problem with the Intel driver.
Intel apparently provided a new debug driver for version 4.0 and 4.1 just this week, which I am supposed to install and test.
I am looking at doing that the first part of next week.
I will update the thread when I get more info.
If you think you are having the same issue, you might want to contact VMware support. You can reference my SR# 11057191404
Hope that helps
I've opened a ticket and also found this latest advisory for the NC522SFP that also talks about it
Thanks for the update. I think 10G is going to be a pita for a while, no matter what brand we go with...
We're building out a datacenter with 20 DL380 G7s, each with 2 NC522SFP 10GbE cards (attaching to a pair of Nexus 5548s). Since it was on the HCL with no footnote, I assumed it was stable but have since found various KB articles and posts indicating the contrary.
Can someone verify if this is resolved? Anyone using this who has had no issues at all? Trying to determine if I need to do anything drastic prior to build.
We will be using ESXi 4.1 U1 (latest build). I'll be applying the latest firmware and drivers for the card.
We are in the process of replacing all the NC522 cards with the Intel X520 cards…we are still experiencing pause framing taking down the hosts with the latest firmware from HP and the latest vmware driver.
So far the X520’s have been solid….
That's not great. I originally had ordered NC523SFPs for these but they were backordered, and a decision was made to replace them with available NC522SFPs. Comparing the feature sets, there is very little difference between the 2, so I expected to see similar issue posts for it. Perhaps there are physical differences that make it more stable.
I don't envy what you've gone through but, considering the issue takes down hosts, I'm a little surprised that I don't see more posts about it. I'm curious how widespread it is and if it might be a result of specific environmental circumstances (i.e. DL580 G5, 522 and downstream switch combination, etc).
Not doubting any of this.....just not loving the idea of replacing 40-cards and would like to make sure before starting that ball rolling.
this was the last advisory…the exact issues we see are pause frames on the cisco side…we lose all connectivity…and even bouncing physical ports on nexus 5000 switches or unplugging cables to cause etherchannel to failover doesn’t work.
Only recourse is a reboot of the host.
I tried turning fans to maximum, making sure I wasn’t in slots 1/3 and ensuring I had latest firmware of everything.
The network guys see massive traffic on one of the nic’s just before it all shuts down…
PS – when doing the firmware update as per the HP advisory, make sure you pick RedHat Enterprise 5 x64…the x86 is a different firmware filename and isn’t recognized.
We are also on the latest and greatest vmware patch levels…
Thanks for the advisory link and the firmware update note. Irritating but appreciated. There is a vague note about workload...
Note: There is a low probability of this occurring when operating under a normal network workload.
...however I'm sure an ESX host would generally be considered "above normal workload".
BTW...Had you applied the updates recommended in this most recent advisory prior to the replacement work? I assumed by your post history that you had.
We have only about 100 vm’s across 7 hosts…cpu utilization…3-5%...
Glad I wasn’t the one to build this out…explaining that ROI would be a tough sell 🐵
Yup..we had put that newest firmware on there about 3 days before one of the damn servers dropped…
I have 2/5 servers changed over…feel like I am sitting on a timebomb…
Usually it takes 1-4 weeks before the problem shows up on average. So far I have had 3 different hosts across 2 different pairs of nexus 5000’s take a dive….