VMware Cloud Community
Glenn7
Contributor

ESXi 5.0 & DELL R720 Network Connectivity Loss

OK, there have been a lot of threads about getting the Dell R720 working with the Broadcom 5720 daughter card.  You can inject the drivers post-build, or you can use the Dell recovery CD http://ftp.dell.com/FOLDER00609866M/1/ to build the server.

The problem I am highlighting in this post occurs once the servers are commissioned.  We put 4 Dell R720 servers into production.  They are configured with 4 ports on the 5720 in an EtherChannel and use a vDS.  Within 2-3 weeks, at different times, all 4 hosts experienced a network failure that caused production outages.  All VMs were unresponsive and offline; a reboot of each physical host resolved the issue temporarily, but it would recur at a later date.  Calls to both VMware and Dell were not very productive, and it took 2 months to finally work out the problem, involving a lot of time and effort on my part.  Basically, specific to the Dell R720 server, the criteria below must be enforced to ensure the hosts do not experience random network loss.

The above was implemented on all 4 hosts, and they have been stable for the last 8 weeks.

0 Kudos
79 Replies
Glenn7
Contributor

One of the Dell R720 hosts just died again last night, so the issue is still not resolved.

0 Kudos
JProos
Contributor

Did you ever find a solution to this?  I'm experiencing what I believe to be the same problem with my 3 R820 servers.

0 Kudos
Glenn7
Contributor

Hi

I may have found the root cause.  I am currently testing and will advise shortly.  Can you tell me what ISO you used to build the servers?  Can you also confirm what HBA cards you are using for storage?  If QLogic, can you tell me the model and driver version please?

0 Kudos
JProos
Contributor

Thanks for the reply.  I'm using Dell's ESXi image for 5.0.0 (not U1).  The NICs are Broadcom 5720s.  The servers are R820s.  It's happened to me about 5 times since early June.  I don't have any idea how to see the driver version of the NICs in ESXi.  I guess there's some way to see it via SSH.

Jason Proos

0 Kudos
Glenn7
Contributor

Although the host drops off the network, I believe it's caused by the QLogic driver that Dell uses for the storage HBA card (QLE2560 in our case).  To check the driver version, SSH to the host and run the commands below.  I have confirmed that the Dell 5.0 and Dell 5.0 U1 ISO images both have a newer driver than is listed on the VMware HCL.  The latest supported version on the VMware HCL is 901.k1.1-14vmw.  We have other Dell server models with the same HBA but the correct driver version, and they do not experience the outages we notice with servers running 911.k1.1-19vmw or 911.k1.1-26vmw.

cd /proc/scsi/qla2xxx

ls  [this should return numeric filenames, e.g. 8 and 9]

cat 9 [output below will confirm HBA version used]

QLogic PCI to Fibre Channel Host Adapter for QLE2560:
        FC Firmware version 5.06.02 (90d5), Driver version 911.k1.1-19vmw

Host Device Name vmhba3

BIOS version 3.00
FCODE version 0.00
EFI version 2.15
Flash FW version 5.04.01
ISP: ISP2532
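
The manual check above can be wrapped in a small helper. This is only a sketch: qla_driver_version is a hypothetical function name, and it assumes the proc file contains a "Driver version" line in the same format as the output shown above.

```shell
#!/bin/sh
# Hypothetical helper (not a VMware or QLogic tool): reads the contents of one
# /proc/scsi/qla2xxx/<N> file on stdin and prints only the driver version.
qla_driver_version() {
    sed -n 's/.*Driver version \([^ ,]*\).*/\1/p'
}

# Demonstration using the line from the output quoted in this post.
printf '        FC Firmware version 5.06.02 (90d5), Driver version 911.k1.1-19vmw\n' \
    | qla_driver_version

# On a host, something like the loop below would cover every HBA
# (assumes the /proc/scsi/qla2xxx directory exists on that host):
#   for f in /proc/scsi/qla2xxx/*; do
#       echo "$f: $(qla_driver_version < "$f")"
#   done
```

The printed value can then be compared by eye against the version listed on the VMware HCL.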

0 Kudos
JProos
Contributor

Glenn,

Unfortunately, I have no QLogic HBAs in any of my affected servers.  All my storage is either local RAID or NFS (via Broadcom 5720 nics).  Either we’re experiencing different problems or the root cause isn’t related to QLogic FC HBAs, it seems.

In my situation what happens is that all the NFS datastores accessed by the affected host randomly go offline on just that host.  A reboot of the host resolves the issue until the next time.

A Dell tech support agent mentioned to me that he’s seen this same problem happen with a R620 and the issue was traced to the Broadcom drivers.  He mentioned that in the end Broadcom provided what he called a ‘test’ driver to the customer and the issue resolved.

Jason

0 Kudos
Glenn7
Contributor

Thanks for the info.  From what we are both seeing, I think it is definitely the BCM5720 card and/or its drivers causing this issue.

We have Dell R710 and Dell R910 servers with other Broadcom NICs (BCM5709) and have never experienced this issue.  Are you using vSwitches or a vSphere Distributed Switch?  Ours is a vDS.  Also, to check your driver version for the BCM5720, SSH to the host and run the 2 commands below.

lspci | grep BCM5720

000:001:00.0 Network controller: Broadcom Corporation NetXtreme BCM5720 Gigabit Ethernet [vmnic0]

000:001:00.1 Network controller: Broadcom Corporation NetXtreme BCM5720 Gigabit Ethernet [vmnic1]

000:002:00.0 Network controller: Broadcom Corporation NetXtreme BCM5720 Gigabit Ethernet [vmnic2]

000:002:00.1 Network controller: Broadcom Corporation NetXtreme BCM5720 Gigabit Ethernet [vmnic3]

ethtool -i vmnic0

driver: tg3

version: 3.120h.v50.2

firmware-version: FFV7.0.48 bc 5720-v1.17

bus-info: 0000:01:00.0
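
To compare several ports or hosts quickly, the ethtool output can be condensed to one line per NIC. A rough sketch follows; summarize_ethtool is a hypothetical helper name, and it assumes awk is available in the host shell and that ethtool -i prints "key: value" lines as shown above.

```shell
#!/bin/sh
# Hypothetical helper: reads `ethtool -i <nic>` output on stdin and prints
# "<driver> <version> (<firmware-version>)" on a single line.
summarize_ethtool() {
    awk -F': ' '
        $1 == "driver"           { d = $2 }
        $1 == "version"          { v = $2 }
        $1 == "firmware-version" { f = $2 }
        END { print d " " v " (" f ")" }
    '
}

# Demonstration with the values posted above.
printf 'driver: tg3\nversion: 3.120h.v50.2\nfirmware-version: FFV7.0.48 bc 5720-v1.17\nbus-info: 0000:01:00.0\n' \
    | summarize_ethtool

# On a host, a loop such as this would cover all four ports
# (the vmnic names are taken from the lspci output above):
#   for nic in vmnic0 vmnic1 vmnic2 vmnic3; do
#       echo "$nic: $(ethtool -i "$nic" | summarize_ethtool)"
#   done
```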

0 Kudos
JProos
Contributor

Glenn,

Thanks.  I see from this information that we have the same Broadcom drivers installed and that they have the same firmware version.  That’s not too surprising, it seems.

Jason

0 Kudos
JProos
Contributor

Glenn,

I forgot to answer your question.  I’m using a vSwitch.

Jason

0 Kudos
JProos
Contributor

Glenn,

VMware tech support just notified me that this is a known issue with the tg3 driver for the Broadcom NICs and that the workaround is exactly what you suggested.  Further, they confirmed that the number of 0s in the command does need to match the number of Broadcom NICs in the host.  They also confirmed that Broadcom told VMware that they're working on it.  The VMware PR was created on 18JUL12.  They report that the issue has been observed on Dell PE R620, R720 and now R820 servers.

Ultimately, we should be able to access an updated Broadcom driver that has a real fix, I guess.

Thanks for your help.  Hopefully, this resolves the issue for both of us and anyone else affected by the same problem.

Jason

0 Kudos
fabiolbj
Contributor

Hi guys,

We are experiencing the same issue with our new Dell R720 server.
VMWare ESXi, 5.0.0, 504890
Image Profile: Dell Customized ESXi-5.0.0 Standard(A02)
~ # lspci | grep BCM5720
000:001:00.0 Network controller: Broadcom Corporation NetXtreme BCM5720 Gigabit Ethernet [vmnic0]
000:001:00.1 Network controller: Broadcom Corporation NetXtreme BCM5720 Gigabit Ethernet [vmnic1]
000:002:00.0 Network controller: Broadcom Corporation NetXtreme BCM5720 Gigabit Ethernet [vmnic2]
000:002:00.1 Network controller: Broadcom Corporation NetXtreme BCM5720 Gigabit Ethernet [vmnic3]
~ # ethtool -i vmnic0
driver: tg3
version: 3.120h.v50.2
firmware-version: FFV7.2.14 bc 5720-v1.25
bus-info: 0000:01:00.0
0 Kudos
JProos
Contributor

The command below is the one I believe both Glenn and I were given by VMware tech support for this issue.  I suggest opening a case with them just to make sure you have the same problem.  They have a record of the issue, as does Broadcom (according to VMware tech support).  I'm not sure that Dell knows much about the situation.  VMware tech support told me that the issue was opened in their issue tracking system on 18JUL12.

esxcfg-module -s force_netq=0,0,0,0 tg3

The number of 0's in that command needs to match the number of Broadcom nics in the host.

In my case I issued esxcfg-module -s force_netq=0,0,0,0,0,0 tg3 since I have 6 Broadcom 5720 NICs in each host.

After issuing the command you have to reboot the host.

The issue is, I believe, known to be affecting Dell R620, R720 and R820 servers.
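
Since the number of zeros has to match the number of NIC ports, it may help to generate the option string rather than type it by hand. A minimal sketch: build_force_netq is a hypothetical helper, and the port count is passed in by hand here. The script only prints the command; it does not run it.

```shell
#!/bin/sh
# Hypothetical helper: prints "force_netq=0,0,...,0" with one zero per NIC port.
build_force_netq() {
    count=$1
    opts=""
    i=0
    while [ "$i" -lt "$count" ]; do
        if [ -z "$opts" ]; then
            opts="0"
        else
            opts="$opts,0"
        fi
        i=$((i + 1))
    done
    echo "force_netq=$opts"
}

# Example for a host with 6 ports: print the command for review.
echo "esxcfg-module -s $(build_force_netq 6) tg3"

# The count could probably be taken from lspci on the host, assuming every
# BCM5720 port listed there is driven by tg3:
#   count=$(lspci | grep -c BCM5720)
```

For 6 ports this prints the same command quoted above; the host still has to be rebooted after the command is actually run.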

Jason

0 Kudos
Glenn7
Contributor

All

Both VMware and Dell are now advising that the fix for the network loss issue on the Dell R620, R720 and R820 is as below.  As far as we are aware, the issue is limited to the BCM5720.  Jason and I have both applied this fix to our environments and will be monitoring closely.  If anyone is experiencing this issue, can you please log it with Dell and/or VMware so they have a record of it and can see how widespread it is.  If anyone applies the fix and the error recurs, please let us all know immediately.

Put the host in maintenance mode / SSH to the host / run "esxcfg-module -s force_netq=0,0,0,0 tg3" (without quotes; number of zeros = number of 5720 NIC ports) / reboot the host

0 Kudos
Glenn7
Contributor

I also forgot to mention: after the host reboots, run the command below to confirm NetQueue is disabled.  It should return the force_netq values shown below.

esxcfg-module -g tg3
tg3 enabled = 1 options = 'force_netq=0,0,0,0'
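
If you want to script this check across several hosts, something like the following could work. It is only a sketch: check_netq_disabled is a hypothetical helper, and it assumes esxcfg-module -g tg3 prints the options in the single-quoted format shown above.

```shell
#!/bin/sh
# Hypothetical helper: reads `esxcfg-module -g tg3` output on stdin and reports
# whether force_netq is set to all zeros.
check_netq_disabled() {
    if grep -Eq "force_netq=0(,0)*'"; then
        echo "NetQueue disabled"
    else
        echo "NetQueue option not set"
    fi
}

# Demonstration with the expected output quoted above.
echo "tg3 enabled = 1 options = 'force_netq=0,0,0,0'" | check_netq_disabled

# On a host:
#   esxcfg-module -g tg3 | check_netq_disabled
```

This only confirms that the option string is set; it does not check that the number of zeros matches the number of ports.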

0 Kudos
vcocaud
Contributor

Has no one tried the latest driver (included in the Dell Customized ESXi 5.0 Update 1 ISO)?

[Attached image: 2116582.png]

Currently installing 4x R620 with 5720 NICs ...

0 Kudos
JProos
Contributor

Last I had checked, which was during the history of this discussion, there was no 5.0 U1 Dell ISO that listed the R820 in its compatibility list.  I see now that there is a 5.0 U1 update available for the R820.  I’ll probably be giving that a try shortly.  So far the driver reconfiguration regarding netqueue has been working successfully for me.

I haven't tried replacing the driver directly, either.

Jason

0 Kudos
vcocaud
Contributor

Found only this about NetQueue in Release Notes :

v3.122h (February 17, 2012)
============================
    Fixes
    -----    

        8) Problem: (CQ60632) NetQueue cannot be enabled on capable
                    devices.
           Cause  : A recent IRQ allocation change causes the driver to
                    fail to allocate enough interrupts for the NetQueue
                    case.
           Change : Change the code so that more interrupts can be
                    allocated for NetQueue devices.
           Impact : This bug affects all NetQueue capable devices.

0 Kudos
JProos
Contributor

I’m not aware that anyone participating in this discussion knows what the actual bug is, but I suppose it’s possible that 5.0 U1 may fix it, given that the release notes for 5.0 U1 reference something to do with very low-level NetQueue functionality in the driver.  The big question is who wants to be the guinea pig?  I think it might be a good idea to get back in touch with VMware tech support to find out if this is still an issue with 5.0 U1.

Jason

I have since talked to VMware tech support and they say that the issue will exist until Broadcom gives them a new driver, which they haven't.  I don't know how the driver on the Dell site plays into this as I haven't tried it.  Nor have I yet tried the 5.1 ISO Dell just released the other day.

Jason

Message was edited by: JProos

0 Kudos
Hampuslind
Contributor

Hi, sorry for hijacking this thread, but what you say about QLogic causing issues is interesting.  We have a bunch of HP servers that freeze/hang now and then for no good reason (please see the thread below for more details).  It makes me wonder if it's the QLogic driver/firmware that causes our problem.

What is your latest status on QLogic drivers and firmware?  What versions do you run that are stable in your environment?

http://communities.vmware.com/message/2116878#2116878

Again, sorry for hijacking!

Br,

Hampus

0 Kudos