VMware Cloud Community
Glenn7
Contributor

ESXi 5.0 & DELL R720 Network Connectivity Loss

OK, there have been a lot of threads about trying to get the Dell R720 working with the Broadcom 5720 daughter card.  You can inject the drivers post-build, or you can use the DELL recovery CD http://ftp.dell.com/FOLDER00609866M/1/ to build the server.

The problem I am highlighting in this post occurs once the servers are commissioned.  We put 4 DELL R720 servers into production, configured with 4 ports on the 5720 in an EtherChannel and using a vDS.  Within 2-3 weeks, at different times, all 4 hosts experienced a network failure that caused production outages.  All VMs were unresponsive and offline; rebooting each physical host resolved the issue temporarily, but it would recur at a later date.  Calls to both VMware and Dell were not very productive, and it took 2 months to finally work out the problem, involving a lot of time and effort on my part.  Basically, specific to the DELL R720 server, the below criteria must be enforced to ensure the hosts do not experience random network loss.

The above was implemented on all 4 hosts, and they have been stable for the last 8 weeks.

79 Replies
JurijC
Contributor

I've used that exact method. The 5.1 bundle was under ESXi downloads: the ISO is first on the list and the offline bundle is a little further down, IIRC. I don't remember exactly where the drivers bundle was, but I suppose it is somewhere in the Drivers and Tools section. If I managed to do this despite having heard of PowerCLI and custom ISOs for the first time that same day, it shouldn't be a problem for a more experienced VMware user.

Edit: The drivers ZIP offline bundle was inside the ZIP downloaded from the site.

SCMHenry
Enthusiast

Ooops.. not sure how I missed that!   Thanks!

Hairyman
Enthusiast

http://en.community.dell.com/support-forums/servers/f/1466/t/19466431.aspx?PageIndex=2

Dell Custom 5.1 ISO, no network card issue yet, installing to dual SD cards

no issues so far

edit:

Custom ISO finished installing, no issues so far, using DHCP to start with

SCMHenry
Enthusiast

NOTE:  The recently released DELL customized/recovery ISO for ESXi 5.1 does not contain the most up-to-date net-tg3 drivers!

Unfortunately, at this time, Dell does not appear to offer an offline bundle depot package that would make it easy to custom-build an ISO with the latest net-tg3 drivers embedded.

=========== >% clipped from Dell support site %<==================

The drivers included in this ESXi image by Dell as part of customization are:

Broadcom Network Adapter Drivers & its Versions (Available at vmware.com)
================================================================
tg3 - 3.123b.v50.1
bnx2 - 2.2.1l.v50.1
bnx2x - 1.72.54.v50.2
cnic - 1.72.50.v50.1
bnx2fc - 1.72.51.v50.1
bnx2i - 2.72.10.v50.2
misc-cnic-register - 1.72.1.v50.1

Storage Controller Driver & its Version (Available at vmware.com)
=====================================================
mpt2sas - 14.00.00.00.1vmw

Brocade CNA Drivers & its Versions (Available at vmware.com)
====================================================
bfa - 3.1.0.0

Intel Network Adapter Drivers & its Versions (Available at vmware.com)
===========================================================
igb - 3.4.7.3

Qlogic HBA Drivers & its versions (Available at vmware.com)
==================================================
qla2xxx - 911.k1.1-26vmw
qla4xxx - 624.01.43-1vmw
ima-qla4xxx - 500.2.01.31-1vmw
qlcnic - 5.0.746
qlge - 2.0.0.54

Emulex HBA Drivers & its versions (Available at vmware.com)
===================================================
be2iscsi - 4.1.334.3
ima-be2iscsi - 4.1.334.3

ga_navtech
Contributor

I downloaded the Dell HV 5.1 from the DropBox link. I installed this over the top of a customised version I had made from the standard VMware 5.1 HV ISO. This appears to have fixed our network problem!

I had updated the original VMware ISO with the latest Broadcom drivers and used this initially. Everything seemed to go OK: the adapters were recognised and the first virtual server installed without incident. However, we then noticed that when we accessed certain shared folders on the VM (specifically folders containing a lot of files and subfolders, around 1500) the whole host network would fall over, including the management network and all VMs. About 90 seconds later the network would come back up again. We could repeat this issue every time we tried to access the same folders. Interestingly, we could also reproduce the same problem by pasting commands (taken from this thread!) into an SSH client session.

We spoke to Dell, who recommended going back to 5.01 as this was the only official version 5.x Dell hypervisor. I wasn't keen to go backwards, so I tried the 5.1 version I got via the link on this thread.

I performed an upgrade and kept the datastore intact. After a reboot all is good.

One point to mention - the recommended fix:

tg3 enabled = 1 options = 'force_netq=0,0,0,0'

survived the upgrade and is still active. This fix alone did not work with my own VMWare ISO + drivers.
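
A minimal sketch for confirming that the option is still in place after an upgrade. It assumes the status line quoted above is what `esxcfg-module -g tg3` prints on the host; the helper name `has_netq_fix` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if a module status line carries the
# force_netq=0,0,0,0 option quoted in this thread.
has_netq_fix() {
    case "$1" in
        *"force_netq=0,0,0,0"*) return 0 ;;
        *) return 1 ;;
    esac
}

# On the host (assumption: esxcfg-module -g prints the status line):
#   has_netq_fix "$(esxcfg-module -g tg3)" && echo "netqueue fix present"

# Offline check against the line posted above:
line="tg3 enabled = 1 options = 'force_netq=0,0,0,0'"
if has_netq_fix "$line"; then
    echo "netqueue fix present"
else
    echo "netqueue fix missing"
fi
```

This only checks that the option string is configured; it says nothing about whether the host has been rebooted since it was set.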

My server is a G12 R720 with the quad 5720. Firmware on the NIC card is 2.2.20. Driver details:


driver: tg3
version: 3.123b.v50.1
firmware-version: FFV7.2.20 bc 5720-v1.25
bus-info: 0000:01:00.0

Note that these drivers are more up to date than the one I found to include in my own customised ISO.

I hope this info is useful to someone!

SCMHenry
Enthusiast

I'm a little confused by the last post.....

I understand that the latest tg3 driver from Broadcom is tg3-3.124c.v50.1, and it is available from VMware's download site.

The version included in the Dell customized ESXi 5.1 ISO is tg3-3.123b.v50.1. 

It seems obvious to me that 3.124c is the most up to date.... am I missing something?

Which version is recommended/preferred?
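
For the narrow question of which version string is higher, here is a small sketch. It assumes GNU `sort` with `-V` (version sort), which is typically available on an admin workstation but may not be in the ESXi shell; `newer_of` is a hypothetical helper name. It only orders the strings and does not say which driver is recommended.

```shell
#!/bin/sh
# Hypothetical helper: print the higher of two driver version strings.
# Assumes GNU sort with -V (version sort).
newer_of() {
    printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n 1
}

# The two versions being compared in this post:
newer_of 3.123b.v50.1 3.124c.v50.1
```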

ga_navtech
Contributor

Sorry - just to clarify - Glenn7 near the beginning of this thread was talking about driver version 3.120h.v50.2. This was the version that I found on the Dell site last week and used to create a modified ISO with the standard VMware ESXi 5.1 Hypervisor.

Even with the net queue fix - this driver did NOT work on our latest Dell R720 server. So we simply resorted to the Dell 5.1 ISO, as available from this thread, and installed this over the top. This appeared to fix the issue.

Having checked the driver version as installed by the Dell ISO - it is more up to date: 3.123b.v50.1 - BUT it is not the latest - as you correctly point out.

I can't comment on what the latest driver offers over the version on the Dell ISO but thought it worthwhile letting people know that we had resolved our issue using the Dell ISO.

I'm still not clear what the status of this Dell 5.1 ISO is - the Dell support engineer we spoke to today did not know it existed. So either it's very hot off the press or we've been lucky enough to get a link prior to it being formally published.

I followed the link from Hairyman to the Dell forum and then tried downloading (using standard HTTP - not download manager) and got a page telling me the file could NOT be found. I've never experienced this before on the Dell site. Thankfully the same Dell forum thread includes the DropBox link - this did work. Could be just us - but something suggests things are being rushed at the moment to fill the obvious gap.

Hope that helps clear it up.

Glenn7
Contributor

Below is the driver we have (Dell 5.0 Update 1 ISO build) with the netqueue fix applied.  We have not seen the issue recur so far.

So there is no confusion, run the command below to check which driver you have installed and compare it to the VMware HCL (see the previous post in this thread on how to use the vmkchdev command to get the ID settings).

esxcli software vib list | grep Broadcom

net-tg3                3.123b.v50.1-1OEM.500.0.0.472560      Broadcom  VMwareCertified  2012-09-13
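
If you want just the package name and version out of that listing, a minimal sketch follows. It assumes the columns are name, version, vendor, acceptance level and install date, as in the line above; `vibs_by_vendor` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print "name version" for every VIB from one vendor,
# reading `esxcli software vib list` output on stdin.
# Assumes the vendor is the third whitespace-separated column.
vibs_by_vendor() {
    awk -v vendor="$1" '$3 == vendor { print $1, $2 }'
}

# On the host:  esxcli software vib list | vibs_by_vendor Broadcom
# Offline, using the line pasted above:
line='net-tg3                3.123b.v50.1-1OEM.500.0.0.472560      Broadcom  VMwareCertified  2012-09-13'
printf '%s\n' "$line" | vibs_by_vendor Broadcom
```

Matching on the vendor column rather than grepping the whole line avoids false hits when "Broadcom" appears elsewhere in a row.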

Below are the supported driver versions for the BCM 5720 Quad Port 1Gb card on the VMWARE I/O HCL

http://www.vmware.com/resources/compatibility/detail.php?deviceCategory=io&productid=20918&deviceCat...

alainrussell
Enthusiast

We appear to be having the same issue with 5720 and 5719 cards in new Dell R620s - we've had 3 occasions over the past week where this has stalled production web servers. I've applied the command line fix and also the latest driver from VMware. I'm currently going to file a support query with both Dell and VMware. Any luck with other fixes that could stop this happening? It's starting to become a major issue for us.

Thanks

Alain

ga_navtech
Contributor

Thanks - don't worry I fully understand the situation with the drivers. The only reason I highlighted the Dell 5.1 ISO is because it potentially saves someone having to create their own custom ISO with the latest drivers. And although the driver that ships with this Dell ISO is slightly older it works fine for us.

Do you know whether the net queue fix is still required with the latest drivers (as supported by VMware), i.e. tg3 version 3.124c.v50.1?

Thanks.

alainrussell
Enthusiast

Unsure, but the 3 servers we've had hang this week have done so with the old driver, the new driver, and the new driver with the netqueue fix applied.

JProos
Contributor

Glenn7,

Did you need to move from 5.0 to 5.0 U1 to have the netqueue fix work?  I'm on 5.0, as provided from the factory, with just the netqueue fix, and I haven't seen a recurrence since implementing the fix, which I did on or about 13SEP12 on all 3 of my R820 hosts.

Jason

Glenn7
Contributor

No, we moved to 5.0 U1 as part of a previously proposed fix for the issue. I don't think the OS version is relevant. We also have not yet seen a recurrence, although it was 8 weeks between the last outages. 5 weeks and counting now....

JProos
Contributor

Glenn,

Thanks for the quick reply.  I’m not sure what to make of the other people who are reporting continuing incidents.  I guess it is still too early to conclude that the netqueue fix actually prevents the problem but, for me at least, this is pretty much as long as I EVER went between incidents so I’m getting pretty confident in it.

Jason

alainrussell
Enthusiast

I ended up disabling Netqueue as per the KB article here http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=203570... (this was what was recommended to me by Dell support). When I saw the issue recur after first applying the fix, it turned out I had actually omitted restarting the hosts.

Since disabling Netqueue we've not seen the issue come back in over 12 days - previously we'd seen the issue 3 times within the previous week. We're on 5.0U1 for reference.
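
For reference, a hedged sketch of a host-wide NetQueue disable on ESXi 5.x. The `esxcli system settings kernel set` form with the `netNetqueueEnabled` setting is an assumption based on how VMware generally documents this, so check it against the KB before running anything. The `APPLY` switch and the function name are hypothetical. By default the script only prints what it would do, and it ends with a reboot reminder, since skipping the restart is exactly what went wrong above.

```shell
#!/bin/sh
# Sketch: disable NetQueue host-wide on ESXi 5.x.
# Dry-run by default; set APPLY=1 on the host to execute for real.
disable_netqueue() {
    cmd='esxcli system settings kernel set --setting=netNetqueueEnabled --value=FALSE'
    if [ "${APPLY:-0}" = "1" ]; then
        $cmd && echo "Done. Reboot the host for the change to take effect."
    else
        echo "DRY-RUN: $cmd"
        echo "A host reboot is required afterwards."
    fi
}

disable_netqueue
```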

Confiture
Contributor

We ran into the same problem with R720 servers on the ESXi 4.1 U2 release.

The latest drivers have been installed, and we disabled netqueue as recommended in the KB. Dell and VMware support told us it was a rare issue and that they had not encountered this kind of problem with these servers... After long exchanges, Dell told us that it was netq, but they weren't sure about that. Then I found out that some people had run into the same problem here :)

I hope the issue is resolved...

ChrisGurley
Enthusiast

Just so you guys know, I've disabled Netqueue, updated the BIOS to 1.2.6 and am running ESXi 5.0 U1, and my R820 just crashed after less than 2wks. Pretty sure it's due to the BCM5720s. I'll verify on Monday. Crashed about 10 production SQL servers...not cool. At least HA kicked in...

--Chris

Glenn7
Contributor

Hi Chris

Can you please confirm the below to ensure we have the exact same scenario:

- What driver version are you using for the BCM5720?

- Did you reboot the host after applying the Netqueue fix?

- Have you applied the Interrupt Remapping Disable option on the host? (The command is in this post.)

Thanks

Glenn

ChrisGurley
Enthusiast

Hey Glenn,

I hadn't disabled the interrupt remapping code, as it seemed to slip off of the recommendations as this thread progressed. That said, here's the other info:

1. Yes, I rebooted after disabling NetQueue

2. Driver version:

~ # ethtool -i vmnic0
driver: tg3
version: 3.123b.v50.1
firmware-version: FFV7.2.20 bc 5720-v1.25
bus-info: 0000:01:00.0
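
To pull a single field out of that output when comparing hosts, something like the following sketch works. `ethtool_field` is a hypothetical helper name, and it assumes each line has the `key: value` shape shown above.

```shell
#!/bin/sh
# Hypothetical helper: print the value of one field from `ethtool -i`
# output read on stdin (lines look like "version: 3.123b.v50.1").
ethtool_field() {
    awk -F': ' -v key="$1" '$1 == key { print $2; exit }'
}

# On the host:  ethtool -i vmnic0 | ethtool_field version
# Offline, using the output pasted above:
sample='driver: tg3
version: 3.123b.v50.1
firmware-version: FFV7.2.20 bc 5720-v1.25
bus-info: 0000:01:00.0'

printf '%s\n' "$sample" | ethtool_field version
printf '%s\n' "$sample" | ethtool_field firmware-version
```

Splitting on colon-plus-space rather than a bare colon keeps the bus-info value, which itself contains colons, in one piece.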

I can't verify if the ALERT: APIC: 1823... message was in my vmkernel logs because I rebooted before realizing I needed to check it. And unfortunately, since this was a new host, I hadn't checked to realize that while I had set up syslog to send to my log host, I hadn't opened the outbound firewall ports :(. So no historical logs...

I'll disable IOV and go from there. It sounds like that could have been the issue as I did briefly examine the logs before shutting it down (I was going to remove the BCM5720 quad-port daughterboard altogether), and it seems that my CNA ports went down first. I'd avoided putting any networks (including mgmt) on the BCM's, thinking that would safeguard me, but apparently IOV caught me instead.

Thanks and I'll update this if we crash again after disabling IOV and rebooting.

--Chris

Glenn7
Contributor

Hi Chris

Thanks for the info. I am using the same driver version (different firmware: FFV7.0.48 bc 5720-v1.17).

We didn’t have the APIC alert in our logs but DELL advised to disable Interrupt Remapping anyway – it may be part of the resolution.
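
For anyone wanting to check their own logs for the APIC alert before rebooting, a minimal sketch follows. It assumes the ESXi 5.x log location `/var/log/vmkernel.log` and matches only on the `ALERT: APIC:` prefix mentioned earlier in the thread; `count_apic_alerts` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: count "ALERT: APIC:" lines in a vmkernel log.
# Default path assumes ESXi 5.x (/var/log/vmkernel.log).
count_apic_alerts() {
    log="${1:-/var/log/vmkernel.log}"
    if [ ! -r "$log" ]; then
        echo "cannot read $log" >&2
        return 1
    fi
    # grep -c prints 0 but exits non-zero when nothing matches; ignore that.
    grep -c 'ALERT: APIC:' "$log" || true
}

# Usage on the host, before any reboot:
#   count_apic_alerts
#   count_apic_alerts /path/to/a/saved/vmkernel.log
```

A count of 0 only means the alert is not in that particular file, so it is worth running this before the host is restarted or the log rotates.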

Let’s monitor and see – if the issue reoccurs for anyone please post the results

Cheers

Glenn
