We are experiencing major issues with our HP DL580 G5 servers and Intel X520-DA2 NICs. You might want to grab a cup of coffee. This could take a while...
We currently have 5 DL580 G5's running ESXi 4.1 with all of the latest patches. All of these hosts are running the latest firmware revisions. All of these hosts are exhibiting the problematic behavior.
We HAD been using the HP-branded NetXen cards (NC522SFP) but had a lot of issues with them. If you search the message board here, you should be able to find plenty of information on the trouble these cards can cause.
So, in order to save myself some aggravation, I decided to go with Intel X520-DA2 NICs. At first, everything seemed OK. However, we have been experiencing strange issues since switching over to these cards.
We have two standard vswitches set up. vSwitch 0 has a pair of 1gb copper for uplinks (vmnic0,vmnic1). It handles the management traffic, as well as vMotion.
Everything else is trunked in on a pair of 10gb fiber links, plugged into the Intel X520s. These serve as uplinks for vSwitch1 (vmnic2, vmnic4), which handles all of the VM data, as well as iSCSI traffic to a pair of EqualLogic arrays. We are using the EqualLogic Multipathing Plugin.
Now for the problem. Every so often, VMNIC2 freaks out. It still appears to be in a "connected" state, but it no longer passes any traffic. VMs that were using that NIC as an uplink lose network connectivity. They cannot ping out, nor do they respond to pings. Removing VMNIC2 from the vSwitch uplinks restores network connectivity, as the VMs fail over to VMNIC4.
Shortly after this happens, the host will PSOD, as requested by the HP NMI driver. For grins, I tried uninstalling the HP NMI driver from some of those hosts.
When this occurs on a host without the NMI driver, I just get a message saying:
"cpu0:4120) NMI: 2540: LINT1 motherboard interrupt (1 forwarded so far). This is a hardware problem; please contact your hardware vendor."
My incredible deductive reasoning skills led me to believe this was a hardware problem, so I contacted my vendor.
They have been unable to find the issue.
I ran hardware diagnostics on several servers. On one server, I went so far as to run over 3000 iterations of the hardware diagnostics over two weeks, and no problem was ever discovered.
When the NMI driver is not installed, the host will not PSOD. However, it will not behave properly again until it is rebooted.
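For anyone who wants to count how often this shows up in their own logs, here is a rough sketch that pulls the fields out of the NMI message quoted above. The field names are my own labels, and treating the number after "cpu0:" as a world ID is an assumption on my part; the pattern is only matched against the one line I have seen.

```python
import re

# Pattern built from the single NMI line quoted in this post.
# Group names are illustrative labels, not official field names.
NMI_RE = re.compile(
    r"cpu(?P<cpu>\d+):(?P<world>\d+)\)\s*NMI:\s*\d+:\s*"
    r"(?P<source>LINT\d+) motherboard interrupt "
    r"\((?P<forwarded>\d+) forwarded so far\)"
)

def parse_nmi(line):
    """Return the parsed fields of an NMI log line, or None if it does not match."""
    m = NMI_RE.search(line)
    if m is None:
        return None
    return {
        "cpu": int(m.group("cpu")),
        "world": int(m.group("world")),  # assumed to be the world ID
        "source": m.group("source"),
        "forwarded": int(m.group("forwarded")),
    }

sample = (
    "cpu0:4120) NMI: 2540: LINT1 motherboard interrupt (1 forwarded so far). "
    "This is a hardware problem; please contact your hardware vendor."
)
print(parse_nmi(sample))
```

Feeding each log line through parse_nmi and keeping the non-None results would give a quick tally per host.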
We are, of course, plugged into two switches. One is a Cisco 6509, and the other is a Nexus 5000. I thought perhaps there was a problem with one of the switches, so I swapped all of the network cables (so what was plugged into the 6509 is now plugged into the 5000, and vice versa).
However, the problem occurred again, and it was still VMNIC2 that freaked out. It did not follow the switch.
I have logged a support ticket with VMware. It has been open since about Dec. 13th, I think.
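Since the failing NIC still shows as "connected", link state alone will not catch this. Here is a minimal sketch of one way to flag it: compare two snapshots of per-NIC packet counters and report any uplink that is link-up in both but whose counters have not moved. The NicSample type and the numbers below are hypothetical, and how you collect the counters on the host is left open. An uplink that is genuinely idle would also be flagged, so this only means something on a NIC that should be carrying traffic.

```python
from dataclasses import dataclass

@dataclass
class NicSample:
    """One counter snapshot for an uplink (hypothetical structure)."""
    name: str
    link_up: bool
    rx_packets: int
    tx_packets: int

def stalled_uplinks(before, after):
    """Names of NICs that are link-up in both snapshots with unchanged counters."""
    previous = {s.name: s for s in before}
    stalled = []
    for cur in after:
        old = previous.get(cur.name)
        if old is None:
            continue  # NIC was not present in the earlier snapshot
        counters_frozen = (
            cur.rx_packets == old.rx_packets
            and cur.tx_packets == old.tx_packets
        )
        if old.link_up and cur.link_up and counters_frozen:
            stalled.append(cur.name)
    return stalled

# Made-up numbers purely for illustration.
before = [
    NicSample("vmnic2", True, 1000, 900),
    NicSample("vmnic4", True, 2000, 1800),
]
after = [
    NicSample("vmnic2", True, 1000, 900),   # link up, counters frozen
    NicSample("vmnic4", True, 2600, 2300),  # still passing traffic
]
print(stalled_uplinks(before, after))
```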
Also, I logged a support ticket with HP around the same time. Nobody seems to know what to do.
If anyone has an idea, I'd be quite grateful to hear it. Thanks!