VMware Cloud Community
jftwp
Enthusiast

Dell R810's with 'too many' 10gig NICs possibly causing PSOD?

We are running ESXi Embedded 4.1 U1 on Dell R810 servers with 128 GB of RAM and dual 8-core Intel 7560 processors.  They have four 10Gb NICs (2 dual-port cards) and four onboard 1Gb NICs.  The 10Gb NICs are all Intel X520-2 (dual port) using the ixgbe driver.

They have been PSOD'ing randomly over the past few weeks (but not more than once per week, and only 3 of the 18 servers with this identical configuration have PSOD'd), so we contacted Dell.  After all due analysis on their side (checking logs, BIOS versions, firmware, and other settings), they point the finger back at VMware (KB 1020808 in particular).

Dell claims their hardware configuration itself is absolutely fine; they initially suspected a wattage issue (from the four 10Gb NICs) but have since thrown that out.  So we're at a loss, short of removing the dual-port 10Gb NICs and replacing them with single-port 10Gb NICs (per this article), since we currently have only 2 of the 4 ports in use; we just wanted more for the future.

Based on Dell's feedback/logic, these same PSODs could happen on ANY server vendor hardware platform running 4.1 U1, if those servers were equipped with 4 10gig NICs and 4 1gig NICs.

We're all non-Jumbo-frame (for now) / 1500 MTU.  That said, the article specifies a 'recommended' maximum of:

For 1500 MTU configurations:

  • 4x 10G ports OR
  • 16x 1G ports OR
  • 2x 10G ports + 8x 1G ports

The latter is the closest our builds/configs come to (ours being 4x 10G ports + 4x 1G ports).  We are going to remove 2 of the 10G ports by pulling both dual-port 10G cards that are in there now, replace them with single-port 10G cards, and see if the PSODs continue.  If they do, Dell is in for a world of hurt.  If they don't, well, I guess ESXi 4.1 U1 can only 'handle' 2x 10G ports with up to 8x 1G ports, as the article implies.  (There's a quick sanity-check sketch below for anyone who wants to compare their own port mix against these numbers.)
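Just for that sanity check, here is a minimal sketch (my own illustration; only the combination list itself is transcribed from the KB quote above, everything else is assumption):

    # Recommended 1500 MTU port combinations as quoted from KB 1020808 above.
    # Illustration only -- the KB, not this snippet, is authoritative.
    RECOMMENDED_1500_MTU = [
        {"10g": 4, "1g": 0},   # 4x 10G ports
        {"10g": 0, "1g": 16},  # 16x 1G ports
        {"10g": 2, "1g": 8},   # 2x 10G ports + 8x 1G ports
    ]

    def within_recommendation(ports_10g, ports_1g):
        """True if the port mix fits entirely under at least one recommended combination."""
        return any(ports_10g <= combo["10g"] and ports_1g <= combo["1g"]
                   for combo in RECOMMENDED_1500_MTU)

    print(within_recommendation(4, 4))   # our current R810 build: False (over the limit)
    print(within_recommendation(2, 4))   # planned build with single-port cards: True

Note this treats the three combinations as hard alternatives, which may be stricter than the KB intends.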

Any comments/debate is welcome!  Anyone out there running ESXi builds (any server manufacturer) with FOUR 10GB cards/ports in them?  Thanks.

13 Replies
idle-jam
Immortal

The fastest way is to open a support call and let them analyze the logs.  Alternatively, if you have the time, you can use the isolation method of removing devices one at a time until the server no longer crashes (e.g. remove a NIC, then memory, and so on).

mcowger
Immortal

Can you post the PSOD?

--Matt VCDX #52 blog.cowger.us
ewilts
Enthusiast

> Anyone out there running ESXi builds (any server manufacturer) with FOUR 10GB cards/ports in them?

Although we don't have any yet, this is a standard HP BL680 blade.  I think the BL620s might have 4 10g ports as well but I'm too lazy to look it up right now.

We run our blades with two 10G connections but pump them all through four 10G ports on a Virtual Connect module; ESXi doesn't see the VC modules.

Is there a reason you can't remove the 1G connections instead of the 10G?  Why not configure with just the 4x 10Gbps?

c33jbeckwith
Contributor

To add a little additional information that was requested: the PSOD was

LINT1 motherboard interrupt.  This is a hardware problem.  Contact your hardware vendor.

Additionally, the front LED on the Dell R810s in each instance was amber with an E171F error.

Removing one card at a time and retesting would not be a realistic test, as the PSODs were extremely intermittent (3 instances of the exact same issue, all on different hosts, over a 5-week period).

Removing the 4 x 1Gb NICs is not an option as they are onboard.  We could potentially disable them on the board, but we are using 2 x 1Gb NICs for management-only traffic through a standard vSwitch.  All VM traffic as well as vMotion is run through a dvSwitch connected to 2 x 10Gb NICs.
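For anyone who wants to verify a similar uplink layout across a number of hosts without clicking through the client, a rough sketch along these lines should work, assuming pyVmomi and vCenter access (the vCenter name and credentials below are placeholders, not anything from our environment):

    # Rough sketch: list each host's standard vSwitch and dvSwitch uplinks with link speeds.
    # Assumes pyVmomi and vCenter access; connection details are placeholders.
    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim

    ctx = ssl._create_unverified_context()
    si = SmartConnect(host="vcenter.example.com", user="admin", pwd="secret", sslContext=ctx)
    try:
        content = si.RetrieveContent()
        view = content.viewManager.CreateContainerView(content.rootFolder, [vim.HostSystem], True)
        for host in view.view:
            net = host.config.network
            # Map vmnic device names to negotiated link speed in Mb/s (None if the link is down).
            speeds = {p.device: (p.linkSpeed.speedMb if p.linkSpeed else None) for p in net.pnic}
            print(host.name)
            for vsw in net.vswitch:                      # standard vSwitches (e.g. management)
                nics = [k.split("-")[-1] for k in (vsw.pnic or [])]
                print("  vSwitch  %-20s %s" % (vsw.name, [(n, speeds.get(n)) for n in nics]))
            for psw in (net.proxySwitch or []):          # dvSwitch proxies (VM traffic + vMotion)
                nics = [k.split("-")[-1] for k in (psw.pnic or [])]
                print("  dvSwitch %-20s %s" % (psw.dvsName, [(n, speeds.get(n)) for n in nics]))
        view.Destroy()
    finally:
        Disconnect(si)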

mcowger
Immortal

I gotta disagree with Dell here.  Given that error, and that it is one that's directly generated by the hardware (the PSOD is only reporting the error, not causing it), this is a hardware failure.  Specifically, it's a form of NMI, or non-maskable interrupt, generated directly by the CPU itself.

I'd push harder on Dell to determine the cause and/or replace a card or two, but it's likely a bad PCI card.  Dell's own website says that error E171F is: "The system BIOS has reported a PCIe fatal error on a component that resides in PCIe configuration space at bus ##, device ##, function ##."

--Matt VCDX #52 blog.cowger.us
ats0401
Enthusiast
(Accepted solution)

Your problem is the four 10Gb NICs.

Replace them with single-port cards.

If you *only* had the four 10Gb NICs then it would work, but anything over that, including 1Gb NICs, will not work.

I've seen a client with this issue before, and VMware support claimed that with four 10Gb NICs, adding any extra NIC, even just one additional 1Gb NIC, will cause issues.

jftwp
Enthusiast

Thanks ATS.  I gave you 'correct' since your reply is very close to what Dell is recommending (based on VMware's maximum configuration spec for 4.x).  You also indicated that 4x 10G is OK as long as there are NO ADDITIONAL NICs in the server; I had asked whether anyone out there is successfully running a 4x 10G config, and you've seen situations in which 4x 10G plus additional NICs caused problems per VMware support.  We will be removing our two dual-port 10Gb cards and replacing them with two single-port 10Gb cards in all servers, and we anticipate this will resolve the periodic PSODs, which always show the same message c33jbeckwith posted earlier in the thread.

All, if I post nothing further (after we do our swaps), then this would validate the solution.  Thanks.

SteveEsx
Contributor

That's interesting.

I have 5 x Dell PowerEdge R710 servers where I use the 4 embedded 1Gb ports plus an additional Intel quad-port 1Gb card plus 2 x dual-port 10Gb cards (but only 1 of the 4 10Gb ports in active use).  These servers have been stable for the last 6+ months with ESXi 4.1.  The only reason I am using so many ports is that I have not had time to make the new VLAN network design yet, where I will only use the 10Gb ports + a 1Gb mgmt/DRAC port.

Maybe I should remove the redundant Intel dual-port 10Gb cards then...

Or has this been solved in VMware ESXi 5, or is it a motherboard design issue?

PGITDept
Contributor

Do you know if this is resolved in vSphere 5 or with a firmware update?

jftwp
Enthusiast

We actually haven't experienced the problem since reseating the memory (this was key, since memory 'may have shifted overhead during flight', ha) and upgrading the BIOS to at least v2.4.  On hosts where those two things were done/checked, we haven't seen PSODs since.

c33jbeckwith
Contributor

Another item to note is listed in this article:

http://enterpriseadmins.org/blog/virtualization/vmware-esxi-psod-on-dell-server/

Specifically, disable the C-state settings in the BIOS.  In particular, the vSphere 4.0 Performance Best Practices guide (http://www.vmware.com/pdf/Perf_Best_Practices_vSphere4.0.pdf) specifically states on page 15 to "Disable C1E halt state in the BIOS."

I have a hard time putting my finger on a specific fix which ended up correcting our issue, mostly because it took so long for the issue to recur on another machine.  As jftwp states, we have not experienced the issue after reseating our DIMMs, flashing the BIOS, and disabling C-states.
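If you want to confirm from the vSphere side what the hosts are reporting for power management after a BIOS change like that, a rough sketch along these lines should work (assuming pyVmomi and vCenter access, with placeholder connection details; verify against your own environment):

    # Rough sketch: report each host's CPU power management policy and what the
    # BIOS exposes to ESXi (e.g. "ACPI C-states, ACPI P-states").
    # Assumes pyVmomi and vCenter access; connection details are placeholders.
    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim

    ctx = ssl._create_unverified_context()
    si = SmartConnect(host="vcenter.example.com", user="admin", pwd="secret", sslContext=ctx)
    try:
        content = si.RetrieveContent()
        view = content.viewManager.CreateContainerView(content.rootFolder, [vim.HostSystem], True)
        for host in view.view:
            pm = host.hardware.cpuPowerManagementInfo
            if pm:
                print("%s: policy=%s, hardware support=%s" % (host.name, pm.currentPolicy, pm.hardwareSupport))
            else:
                print("%s: no CPU power management info reported" % host.name)
        view.Destroy()
    finally:
        Disconnect(si)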

PGITDept
Contributor

Thanks guys. 

So you are running 2 x Intel X520-2s with all 4 ports in use, plus all 4 onboard ports and an extra 4-port NIC, without any issues now?

We need to get onto 10Gb, and as we already use 2 ports for our SAN, with this PSOD issue we were holding off on moving 🙂

c33jbeckwith
Contributor

We are currently running 2 x single-port 10Gb cards in our R810s.  I have been told that this limitation has been "fixed" in vSphere 5, but I would suggest independently verifying that 😉
