VMware Cloud Community
manfriday
Enthusiast

Major issues with HP DL580 G5 and Intel X520-DA2

Hi,

We are experiencing major issues with our HP DL580 G5 servers and Intel X520-DA2 NICs. You might want to grab a cup of coffee. This could take a while...

We currently have 5 DL580 G5's running ESXi 4.1 with all of the latest patches. All of these hosts are running the latest firmware revisions. All of these hosts are exhibiting the problematic behavior.

We HAD been using the HP-branded NetXen cards (NC522SFP) but had a lot of issues with them. If you search the message board here, you should be able to find plenty of information on the trouble these cards can cause.

So, in order to save myself some aggravation, I decided to go with Intel X520-DA2 NICs. At first everything seemed OK, but we have been experiencing strange issues since switching over to these cards.

We have two standard vSwitches set up. vSwitch0 has a pair of 1Gb copper ports for uplinks (vmnic0, vmnic1). It handles the management traffic as well as vMotion.

Everything else is trunked in on a pair of 10Gb fiber ports plugged into the Intel X520s. These serve as uplinks for vSwitch1 (vmnic2, vmnic4), which handles all of the VM data as well as iSCSI traffic to a pair of EqualLogic arrays. We are using the EqualLogic multipathing plugin.

Now for the problem. Every so often, vmnic2 freaks out. It still appears to be in a "connected" state, but it no longer passes any traffic. VMs that were using that NIC as an uplink lose network connectivity: they cannot ping out, nor do they respond to pings. Removing vmnic2 from the vSwitch uplinks restores network connectivity, as they fail over to vmnic4.
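In case it's useful to anyone following along, this is roughly what I do from the ESXi shell (Tech Support Mode) instead of the vSphere Client when vmnic2 wedges; the vmnic and vSwitch names are just what they happen to be on my hosts:

# confirm what the host thinks the link state is
esxcfg-nics -l
# see which uplinks each vSwitch currently has
esxcfg-vswitch -l
# pull the bad uplink so traffic fails over to vmnic4
esxcfg-vswitch -U vmnic2 vSwitch1
# and put it back after the reboot
esxcfg-vswitch -L vmnic2 vSwitch1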

Shortly after this happens, the host will PSOD, as requested by the HP NMI driver. For grins, I tried uninstalling the HP NMI driver from some of those hosts.

When this occurs on a host without the NMI driver, I just get a message saying:

"cpu0:4120) NMI: 2540: LINT1 motherboard interrupt (1 forwarded so far). This is a hardware problem; please contact your hardware vendor."

My incredible deductive reasoning skills led me to believe this was a hardware problem, so I contacted my vendor.

They have been unable to find the issue.

I ran hardware diagnostics on several servers. On one server, I went so far as to run over 3000 iterations of the hardware diagnostics over two weeks, and no problem was ever discovered.

When the NMI driver is not installed, the host will not PSOD. However, it will not behave properly again until it is rebooted.

We are, of course, plugged into two switches: one is a Cisco 6509 and the other is a Nexus 5000. I thought perhaps there was a problem with one of the switches, so I swapped all of the network cables (what was plugged into the 6509 is now plugged into the 5000, and vice versa).

However, the problem occurred again, and it was still vmnic2 that freaked out. It did not follow the switch.

I have logged a support ticket with VMware. It has been open since about Dec. 13th, I think.

Also, I logged a support ticket with HP around the same time. Nobody seems to know what to do.

If anyone has an idea, I'd be quite grateful to hear it. Thanks!

Jason

84 Replies
milton123
Hot Shot

Have you tested this?

Please let us know.

milton123 
markzz
Enthusiast

I've effectively been testing the driver for about 12 hours. Unfortunately, the result is the same.

The NFS stores are periodically going offline, but they do appear to recover after a few seconds. You could say it's improved, but it's not at all workable.

An extract from the logs:

Lost connection to server fasdc01nfs10gb mount point /vol/esx_aggr3_file_01/esx_aggr3_file_01_qtree mounted as 7327dc8f-d2c7c3a1-0000-000000000000 (sannfssata01).
error    10/04/2012 10:16:57 AM    ServerName

Restored connection to server fasdc01nfs10gb mount point /vol/esx_aggr3_file_01/esx_aggr3_file_01_qtree mounted as 7327dc8f-d2c7c3a1-0000-000000000000 (sannfssata01).
info    10/04/2012 10:17:12 AM    ServerName

damicall
Contributor

Just a thought: your server hasn't exceeded the configuration maximums, has it? i.e. how many NICs in total do you have in this system?

markzz
Enthusiast

Good thought, damicall, I'd not considered that one.

The server has 16 NICs (as in ports). I'm not sure what the supported number of NICs is with ESXi 5, but it was 20 under ESX 4, so I assume we are within a supported configuration.
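In case it helps, a quick way to double-check is simply to count what the host actually enumerates (the first command is ESXi 5 syntax; esxcfg-nics works on 4.x as well):

# list every physical NIC the host sees (ESXi 5)
esxcli network nic list
# same information on both 4.x and 5.x, with driver and MTU columns
esxcfg-nics -l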

manfriday
Enthusiast

10Gb NICs complicate the config maximums a little bit.

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=102080...

Looks like the limit in 5.0 is "six 10Gb and four 1Gb ports".

MichaelW007
Enthusiast

You need to be very careful about the number of 1G vs. 10G NICs. 2x 10G and 8x 1G (which is what I have in my hosts) can be fine, but normally only at 1500 MTU. I run my lab environment at jumbo MTU 9000 all the time; however, I wouldn't recommend that for a production environment.
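If you want to confirm what MTU a host is actually running, something along these lines from the shell shows it per vSwitch, per physical NIC and per vmkernel port (names will obviously differ on your hosts):

# MTU is shown per vSwitch and per physical NIC
esxcfg-vswitch -l
esxcfg-nics -l
# vmkernel ports (e.g. the NFS/iSCSI interfaces) show their MTU here
esxcfg-vmknic -l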


markzz
Enthusiast

We have 16 NICs in our DL585 G6's:

12x 1Gb NICs and 4x 10Gb NICs.

The 1Gb ports are the onboard NICs plus 2x NC365T (Intel chipset).

The 10Gb ports are NC522SFPs (the older QLogic); two of those ports run with jumbo frames.

No problems with these servers.

MichaelW007
Enthusiast

That is currently an unsupported host NIC configuration, as you are exceeding the maximums. Just because something works doesn't mean it's supported. The maximums are simply what has been tested, and the main driver for the number of NICs is CPU cores and memory buffers. Provided you have lots of cores and memory, configurations exceeding the maximums can work, even if they aren't officially tested and supported. From what I hear, VMware is going to be doing more testing of different NIC combinations in future releases.


markzz
Enthusiast

Hi Michael

The DL585 G7's in question have 48 cores and 320GB of memory.

Although I've not read the entire document on supported NIC configurations, I would have thought this NIC configuration was OK given the cores and memory.

Your advice on this would be appreciated.

MichaelW007
Enthusiast

You would think so, but unfortunately that's not the way the supported maximums work. I recently had a customer with 1TB RAM and 160 cores per host. They had 4x 10G NIC ports and 8x 1G NIC ports. This is also not a supported configuration. Unless it's listed as a supported, tested combination, it is not supported. You may still get lucky and it might still work, but you may have difficulty if you log a call and support determines the root cause could be related to running too many NICs per host. In my experience VMware support will still try to help on a best-efforts basis, but may end up asking you to remove some NICs from the host. Hopefully the limits and combinations are changed in the next release.

--

Michael Webster, VCDX

Director

IT Solutions 2000 Ltd

Mob: 021 500 432 | longwhiteclouds.com | twitter.com/vcdxnz001


david2009
Contributor

"4 x 10G NIC Ports and 8 x 1G NIC ports" IS NOT A SUPPORTED CONFIGURATION and I can confirm with you on this.  Michael is ABSOLUTELY correct on this.  This partly explains a lot of issues you are seeing so far.

I've had many conversations with VMware TAC on this issue, and they have officially confirmed it.

Say your system has 4x 10G and 8x 1G NIC ports, and you use only 2x 10G and 2x 1G ports with nothing plugged into the rest. That is NOT good enough. You must disable the remaining ports in the system BIOS so that ESX cannot see them at boot time. If ESX sees those unused ports, you have an unsupported configuration and an unstable system, simple as that. With this configuration, the more load you put on the ESX hosts, the more unstable they become.
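One way to verify that the BIOS change actually took is to check what the vmkernel enumerates on the next boot; roughly (output format varies a bit between builds):

# list PCI devices the host sees; the ports disabled in BIOS should no longer be listed
lspci
# the vmnic count should shrink accordingly
esxcfg-nics -l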

There was a similar discussion on another post:  http://communities.vmware.com/message/2005863#2005863

markzz
Enthusiast

Hi David

Thanks for your response.

I must say, "4x 10G NIC ports and 8x 1G NIC ports" is itself a very limited configuration.

The article  https://www.vmware.com/pdf/vsphere5/r50/vsphere-50-configuration-maximums.pdf contradicts this.

As you can see, the document only gives one example, in which 6x 10Gb NICs and 4x 1Gb NICs is called out as a maximum configuration.

I'm aware there's no set formula to calculate this maximum, as each NIC type appears to use differing levels of resources, but as a rough guide I've always thought of it as "each 10Gb port = four 1Gb ports".

Obviously there are no hard and fast rules here.
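Applying that rough guide as a back-of-the-envelope check:

documented max:  6x 10Gb + 4x 1Gb   ->  6*4 + 4  = 28 "1Gb equivalents"
this host:       4x 10Gb + 12x 1Gb  ->  4*4 + 12 = 28 "1Gb equivalents"

So on that (admittedly rough) basis we sit right at the same level as the published example.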

There are also some inaccuracies in the configuration-maximums document, e.g. the nx_nic driver listed as a 10Gb QLogic is in fact a 1Gb QLogic NetXen driver; the QLogic 10Gb driver is qlcnic.
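For what it's worth, a quick way to see which driver a given port is actually bound to (vmnic2 below is just an example name):

# driver column shows nx_nic, qlcnic, ixgbe, etc. per vmnic
esxcfg-nics -l
# driver, version and firmware for a specific port
ethtool -i vmnic2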

We have

2x NC523SFP NICs

2x NC375T NICs

Onboard 375i ports (4 ports)

This configuration functions in ESX 4.1 but is unstable with ESXi 5.

Initially, with ESX 4.1, the NICs would suffer link loss. Although I only updated the firmware and driver about a week ago, I've not seen any link-loss issues since the updates were applied. This may be a positive step forward by QLogic.

I'm in the process of reducing the NIC count.

As a test I have removed one of the NC523 NICs.

Although it's only been a few hours, so far it's stable.

If this continues to be stable, I can alter our servers by removing one of the NC375T cards. If it isn't stable, I'll have to look at redesigning the solution.

Also, I have requested a trial of two NC552SFPs. These should arrive tomorrow; I'll update the thread after testing.

david2009
Contributor

I am just telling you what VMware TAC gave as an official response. The TAC case number is 12150432902. My TAC case is for ESXi 4.1, not 5.x.

If ESXi 4.1 sees 4x 10Gig plus additional 1Gig ports at boot time, then you will have an unstable system.

manfriday
Enthusiast

Hi David,

Here is a link to the Intel web-page:

http://www.intel.com/support/network/adapter/pro100/sb/CS-030612.htm

What are the SFP+ optical module requirements for the Intel® Ethernet Server Adapter Series?

  • Intel® Ethernet SFP+ SR Optics and Intel® Ethernet SFP+ LR Optics
  • Other SFP+ modules are not allowed and cannot be used with these adapters.

I just realized you asked me this like a month ago. Sorry, somehow it slipped by unnoticed in my inbox until just now.

Embarrassing.

manfriday
Enthusiast

Oh, but while I am talking about Intel SFPs, I noticed a new behavior with the Intel X520-DA2 NICs with non-Intel SFPs under version 5.

In ESXi 4, the Cisco and Advantage Optics SFPs did actually seem to work, despite not being supported by Intel.

In ESXi 5, the port is actually DISABLED if it has a non-Intel SFP plugged in.

Neat!
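For what it's worth, some builds of the ixgbe driver expose an allow_unsupported_sfp module parameter that relaxes this check. Whether your driver build has it (and whether you want to run unsupported optics at all) is another matter, so treat this as something to investigate rather than a recommendation:

# see whether the installed ixgbe driver exposes the parameter (ESXi 5 syntax)
esxcli system module parameters list -m ixgbe
# if it does, something like this enables it for all ixgbe ports, followed by a reboot
esxcli system module parameters set -m ixgbe -p "allow_unsupported_sfp=1"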

MichaelW007
Enthusiast

It's not just the number of NICs, it's also the type. Different driver versions have different overheads on the hypervisor. The best way to either find out if it's a supported configuration or get it supported is to log a support request with VMware and have them bless the configuration.



markzz
Enthusiast

David

I don't mean to sound argumentative at all, and I appreciate the input.

Currently the server is running with one 10Gb NC523SFP removed.

It's not lost the plot yet, so we may have a winner.

Another odd thing I saw:

I had 7 NFS targets connected on the server; when I tried to add another NFS target, it failed, complaining that the number of available NFS connections had been exceeded.

Odd, since ESXi 5 supports 256 NFS targets!

Rumple
Virtuoso

By default it's set to 8; you need to increase that in the advanced options.

That would suggest none of the other recommended settings you usually change with NFS are set either.

If it's a NetApp, get VSC 2.1.1 installed and it will help you configure your NFS settings.
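For reference, the knob in question is NFS.MaxVolumes, and the NFS best-practice guides usually bump the TCP/IP heap along with it. On ESXi 5 it looks something like the following; the values here are just the ones commonly recommended by the array vendors, so check your own vendor's docs, and note the heap settings need a reboot to take effect:

# allow more than the default 8 NFS mounts
esxcli system settings advanced set -o /NFS/MaxVolumes -i 64
# grow the TCP/IP heap so the extra mounts have memory to work with
esxcli system settings advanced set -o /Net/TcpipHeapSize -i 32
esxcli system settings advanced set -o /Net/TcpipHeapMax -i 128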

markzz
Enthusiast

Thanks Rumple. I was reading the maximums document when I saw that NFS target number.

The bad news:

With one NC523SFP removed from the host server, it worked fine for about 20 hours before falling over.

I've had to reboot the host to resolve the issue.

damicall
Contributor

So what is your NIC count now?

You may still need to remove more to get a supported and stable system.
