iambrucelee
Contributor

HP NC532i (Broadcom 57711E) network adapter with Flex-10 caused a hard crash; which bnx2x driver should we use?

Is anyone else having this issue? We just had three servers crash with a bnx2x_panic_dump. Once the network cards crashed, each ESX server had to be rebooted to come back. Even though only a few vmnics died, the entire server and its VMs became unreachable, even when the failed vmnic wasn't bound to the vSwitch the VM was on.

After some research, it appears that VMware supports three different drivers:

1. bnx2x version 1.45.20

2. bnx2x version 1.48.107.v40.2

3. bnx2x version 1.52.12.v40.3

On 6/10/2010 VMware released a patch for 1.45.20, but esxupdate marked it obsolete, since our installed version (1.52.12.v40.3) was newer. Should I downgrade my driver?
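For reference, here is how I'm checking which driver version a host is actually running (this assumes classic ESX 4.x with a service console; vmnic1 is just one of our NICs):

# Show the driver name and version bound to a given vmnic
ethtool -i vmnic1

# List all physical NICs and the driver each one uses
esxcfg-nics -l

# List installed bulletins, to see which bnx2x package esxupdate considers current
esxupdate query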

Also the VMware HCL has conflicting information. According to this:

http://www.vmware.com/resources/compatibility/search.php?action=search&deviceCategory=io&productId=1...

1.52.12.v40.3 is supported on vSphere 4 Update 2 but not Update 1, yet the U2 release only includes an update for the 1.45.20 driver.

Yet according to this:

http://www.vmware.com/resources/compatibility/search.php?action=search&deviceCategory=io&productId=1...

1.52.12.v40.3 is supported on both vSphere 4 Update 2 and Update 1.

Here are the details of my environment:

HP BL460c G6 blade servers with Flex-10 modules.

The individual blades use the HP NC532i Dual Port 10GbE Multifunction BL-c Adapter, firmware bc 5.0.11.

The chassis Onboard Administrator (OA) is on firmware v3.0.

The Flex-10 module is on firmware v2.33.

Crash Dump:

Jun 16 17:03:54 esx-2-6 vmkernel: 0:01:03:09.131 cpu1:4426)VMotionRecv: 1080: 1276732954553852 D: Estimated network bandwidth 75.588 MB/s during page-in

Jun 16 17:03:54 esx-2-6 vmkernel: 0:01:03:09.131 cpu7:4420)VMotion: 3381: 1276732954553852 D: Received all changed pages.

Jun 16 17:03:54 esx-2-6 vmkernel: 0:01:03:09.245 cpu7:4420)Alloc: vm 4420: 12651: Regular swap file bitmap checks out.

Jun 16 17:03:54 esx-2-6 vmkernel: 0:01:03:09.246 cpu7:4420)VMotion: 3218: 1276732954553852 D: Resume handshake successful

Jun 16 17:03:54 esx-2-6 vmkernel: 0:01:03:09.246 cpu3:4460)Swap: vm 4420: 9289: Starting prefault for the migration swap file

Jun 16 17:03:54 esx-2-6 vmkernel: 0:01:03:09.259 cpu0:4460)Swap: vm 4420: 9406: Finish swapping in migration swap file. (faulted 0 pages, pshared 0 pages). Success.

Jun 16 17:09:42 esx-2-6 vmkernel: 0:01:08:57.229 cpu1:4280)<3>[bnx2x_stats_update:4639(vmnic1)]storm stats were not updated for 3 times
Jun 16 17:09:42 esx-2-6 vmkernel: 0:01:08:57.229 cpu1:4280)<3>[bnx2x_stats_update:4640(vmnic1)]driver assert
Jun 16 17:09:42 esx-2-6 vmkernel: 0:01:08:57.229 cpu1:4280)<3>[bnx2x_panic_dump:658(vmnic1)]begin crash dump -


Jun 16 17:09:42 esx-2-6 vmkernel: 0:01:08:57.229 cpu1:4280)<3>[bnx2x_panic_dump:666(vmnic1)]def_c_idx(0xff5) def_u_idx(0x0) def_x_idx(0x0) def_t_idx(0x0) def_att_idx(0xc) attn_state(0x0) spq_prod_idx(0xf8)
Jun 16 17:09:42 esx-2-6 vmkernel: 0:01:08:57.229 cpu1:4280)<3>[bnx2x_panic_dump:677(vmnic1)]fp0: rx_bd_prod(0x6fe7) rx_bd_cons(0x3e9) *rx_bd_cons_sb(0x0) rx_comp_prod(0x7059) rx_comp_cons(0x6c59) *rx_cons_sb(0x6c59)
Jun 16 17:09:42 esx-2-6 vmkernel: 0:01:08:57.229 cpu1:4280)<3>[bnx2x_panic_dump:682(vmnic1)] rx_sge_prod(0x0) last_max_sge(0x0) fp_u_idx(0x6afb) *sb_u_idx(0x6afb)
Jun 16 17:09:42 esx-2-6 vmkernel: 0:01:08:57.229 cpu1:4280)<3>[bnx2x_panic_dump:693(vmnic1)]fp0: tx_pkt_prod(0x0) tx_pkt_cons(0x0) tx_bd_prod(0x0) tx_bd_cons(0x0) *tx_cons_sb(0x0)
Jun 16 17:09:42 esx-2-6 vmkernel: 0:01:08:57.229 cpu1:4280)<3>[bnx2x_panic_dump:697(vmnic1)] fp_c_idx(0x0) *sb_c_idx(0x0) tx_db_prod(0x0)
Jun 16 17:09:42 esx-2-6 vmkernel: 0:01:08:57.229 cpu1:4280)<3>[bnx2x_panic_dump:712(vmnic1)]fp0: rx_bd[4f]=[0:deda0310] sw_bd=[0x4100b462c940]
Jun 16 17:09:42 esx-2-6 vmkernel: 0:01:08:57.229 cpu1:4280)<3>[bnx2x_panic_dump:712(vmnic1)]fp0: rx_bd[50]=[0:de706590] sw_bd=[0x4100b4697b80]
Jun 16 17:09:42 esx-2-6 vmkernel: 0:01:08:57.229 cpu1:4280)<3>[bnx2x_panic_dump:712(vmnic1)]fp0: rx_bd[51]=[0:deac2810] sw_bd=[0x4100baad8e80]
Jun 16 17:09:42 esx-2-6 vmkernel: 0:01:08:57.229 cpu1:4280)<3>[bnx2x_panic_dump:712(vmnic1)]fp0: rx_bd[52]=[0:de9ae390] sw_bd=[0x4100bda03f40]
Jun 16 17:09:42 esx-2-6 vmkernel: 0:01:08:57.229 cpu1:4280)<3>[bnx2x_panic_dump:712(vmnic1)]fp0: rx_bd[53]=[0:de3e9a90] sw_bd=[0x4100b463ecc0]
Jun 16 17:09:42 esx-2-6 vmkernel: 0:01:08:57.229 cpu1:4280)<3>[bnx2x_panic_dump:712(vmnic1)]fp0: rx_bd[54]=[0:3ea48730] sw_bd=[0x4100bab19100]
Jun 16 17:09:42 esx-2-6 vmkernel: 0:01:08:57.229 cpu1:4280)<3>[bnx2x_panic_dump:712(vmnic1)]fp0: rx_bd[55]=[0:de5b1190] sw_bd=[0x4100bda83980]
Jun 16 17:09:42 esx-2-6 vmkernel: 0:01:08:57.229 cpu1:4280)<3>[bnx2x_panic_dump:712(vmnic1)]fp0: rx_bd[56]=[0:ded48410] sw_bd=[0x4100bdb06080]
Jun 16 17:09:42 esx-2-6 vmkernel: 0:01:08:57.229 cpu1:4280)<3>[bnx2x_panic_dump:712(vmnic1)]fp0: rx_bd[57]=[0:3e3f0d10] sw_bd=[0x4100bca0f480]
Jun 16 17:09:42 esx-2-6 vmkernel: 0:01:08:57.229 cpu1:4280)<3>[bnx2x_panic_dump:712(vmnic1)]fp0: rx_bd[58]=[0:de742110] sw_bd=[0x4100bda35d40]
Jun 16 17:09:42 esx-2-6 vmkernel: 0:01:08:57.230 cpu1:4280)<3>[bnx2x_panic_dump:712(vmnic1)]fp0: rx_bd[59]=[0:de6ffc90] sw_bd=[0x4100bcab3800]
Jun 16 17:09:42 esx-2-6 vmkernel: 0:01:08:57.230 cpu1:4280)<3>[bnx2x_panic_dump:712(vmnic1)]fp0: rx_bd[5a]=[0:de619710] sw_bd=[0x4100b4640c40]
Jun 16 17:09:42 esx-2-6 vmkernel: 0:01:08:57.230 cpu1:4280)<3>[bnx2x_panic_dump:712(vmnic1)]fp0: rx_bd[5b]=[0:de627e10] sw_bd=[0x4100bcaad440]
Jun 16 17:09:42 esx-2-6 vmkernel: 0:01:08:57.230 cpu1:4280)<3>[bnx2x_panic_dump:712(vmnic1)]fp0: rx_bd[5c]=[0:3e455e10] sw_bd=[0x4100b462a9c0]
Jun 16 17:09:42 esx-2-6 vmkernel: 0:01:08:57.230 cpu1:4280)<3>[bnx2x_panic_dump:712(vmnic1)]fp0: rx_bd[5d]=[0:de3a6110] sw_bd=[0x4100bdaf1d80]
Jun 16 17:09:42 esx-2-6 vmkernel: 0:01:08:57.230 cpu1:4280)<3>[bnx2x_panic_dump:712(vmnic1)]fp0: rx_bd[5e]=[0:3e37df90] sw_bd=[0x4100b470d580]

Any thoughts or suggestions?

sayahdo
Contributor

Yep, the NC532i and NC532m (i = integrated/on-board, m = mezzanine) are Broadcom NICs and both have the issue.

The NC364m is an HP/Intel NIC and uses the e1000e version 0.4.1.7.1 driver, so it should not have this issue.

If you want to use Flex-10, consider the QLogic card (NC522m), which is a workaround recommended by HP, but it may require additional Virtual Connect modules.

Cheers

Mike

NAz0GuL
Contributor

Thanks for your kind response!

Another question: we currently run HP BL680c blades with NC373i or NC326i adapters. Are the NC373i or NC326i affected on ESX/ESXi 4.1?

KFM
Enthusiast

As far as everyone in this thread knows, the problem mentioned exists only when using the HP NC532i in combination with the Flex-10 interconnects.

Any problems with non-Flex-10 adapters such as the two you mentioned would most likely be unrelated to this case and to the symptoms we're seeing with the NC532i cards in a Flex-10 environment.

sayahdo
Contributor

You might be right... but because both the on-board and mezzanine cards use the same driver, you may still see this problem.

If it were specifically the on-board card, the NC532m would be listed as a workaround, and it is not.

Food for thought for anyone wanting to configure an ESX 4.1 environment with Flex-10.

Cheers

Mike

abaack
Contributor

The VMware article states that beacon probing does not support EtherChannel connections. How are you using EtherChannel AND beacon probing?

I'm trying to work out how to set up my vSwitch. Right now we have it set to Link Status with both vmnic0 and vmnic1 active. This caused a problem this morning when we did a Flex-10 firmware update. Since we're using EtherChannel, what is the best layout for the vSwitch?
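For reference, here's how I'm dumping our current layout before changing anything (standard esxcfg commands from the ESX 4.x service console; the names are from our setup):

# List vSwitches with their uplinks, port groups and active vmnics
esxcfg-vswitch -l

# List physical NICs with link state, speed, duplex and driver
esxcfg-nics -l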

sayahdo
Contributor

Sorry, we are not. A previous post of mine was wrong; I've fixed it.

wzumbrun
Contributor

I have experienced this issue and just got word from HP Support that there is a tentative release date of November 5th for a new driver. It will be a 1.60 version according to HP. HP said once it is ready, it will be available on the VMware download site.

abaack
Contributor

Hopefully this fixes the Smart Link issues too.

beovax
Enthusiast

MattG
Expert

Have you tried this?

-MattG

If you find this information useful, please award points for "correct" or "helpful".

beovax
Enthusiast

Not yet. You need to download the ISO, extract just BCM-bnx2x-1.60.50.v41.2-offline_bundle-320302.zip, and install it using vihostupdate.
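If you haven't used vihostupdate before, the install from the vMA or vSphere CLI looks roughly like this (a sketch; substitute your own host name, and put the host in maintenance mode and reboot afterwards):

# Optional: scan the host against the bundle first to see what would be applied
vihostupdate --server <esx-host> --scan --bundle BCM-bnx2x-1.60.50.v41.2-offline_bundle-320302.zip

# Install the offline bundle
vihostupdate --server <esx-host> --install --bundle BCM-bnx2x-1.60.50.v41.2-offline_bundle-320302.zip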

I will be testing tomorrow in our demo environment. Is there a way to replicate the PSOD? Does it happen fairly frequently with Smart Link detection turned on?

Mackopes
Enthusiast

Just having Smart Link on will not cause the PSODs...

We saw it only when we had around 30-40 VMs and were trying to back them up during our backup window. So it appears to be a combination of 'many servers' with high network I/O.

We tried to repro with just high network I/O between a few VMs but couldn't get them to fail (with 1.52).
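If anyone wants to try generating similar load, iperf between a couple of guests is a simple way (a hypothetical sketch, assuming Linux VMs with iperf installed; we haven't confirmed this alone triggers the failure):

# On the receiving VM
iperf -s

# On each sending VM: 8 parallel streams for an hour against the receiver
iperf -c <receiver-ip> -P 8 -t 3600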

We will be doing the same tests in the next few days.

Aaron

Rabie
Contributor

We ran perfectly fine on 1.52 on one cluster, even with simulated network load tests thrashing the network for a couple of hours.

This same cluster also had dev VMs on it and ran for a month or two without issues. Our other cluster runs the 1.48 driver and holds the bulk of our production VMs, including a couple of MSSQL servers; it has also been stable for months after downgrading to 1.48.

Then we had to move the production VMs across to the second cluster while we did maintenance on the first cluster; within 3 hours, two nodes locked up, and those seemed to hold the bulk of our MSSQL servers.

From what I have gathered from the forums and some support conversations, checksum offloading in conjunction with the makeup of MSSQL packets aggravates the problem and causes either network failures or PSODs, which is why not everyone is seeing the problem.

I do see that VMware has officially released the 1.60 driver for ESX 4.1; however, that driver is not currently available for 4.0, at least not via the website. Neither is listed on HP's site, which is very annoying, as you would hope they would be driving this issue since blades and 10Gb are among their flagship products.

R

beovax
Enthusiast

We were told by our HP contact that the 4.0 driver is a few weeks away.

Rabie
Contributor

We were told the 4.0 driver would be available at the end of October, with the 4.1 driver following the next week.

Now the 4.1 driver has arrived on time as promised, so where is my 4.0 driver, VMware/HP?

MattG
Expert

I am trying to apply the new driver and Update Manager is telling me it can't read the metadata from the zip.

Any thoughts?

-MattG

If you find this information useful, please award points for "correct" or "helpful".

beovax
Enthusiast

Installed OK for me using the vMA and vihostupdate.

MattG
Expert

From a remote SSH session into Tech Support Mode (TSM) I ran:

esxupdate --bundle=BCM-bundle.zip update

And it worked.
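For completeness, the sequence around that command looked like this (a sketch assuming ESXi 4.1 Tech Support Mode with the bundle already copied to the host; on classic ESX you would enter maintenance mode from the vSphere Client instead):

# Put the host into maintenance mode first
vim-cmd hostsvc/maintenance_mode_enter

# Apply the offline bundle (file name shortened here)
esxupdate --bundle=BCM-bundle.zip update

# Reboot so the new bnx2x module is loaded
reboot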

-MattG

If you find this information useful, please award points for "correct" or "helpful".

Mackopes
Enthusiast

Yeah, the Update Manager route is 'broken': the vmware.xml file in the bundle is improperly formatted.

I was able to edit the file and re-zip it to make it work, although I'm not sure that is supported.
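Roughly what I did, for anyone hitting the same metadata error (an unsupported hack; adjust the file names and the XML's path inside the archive to match your bundle):

# Unpack the offline bundle
mkdir bundle && cd bundle
unzip ../BCM-bnx2x-1.60.50.v41.2-offline_bundle-320302.zip

# Fix the malformed metadata, then repackage
vi vmware.xml
zip -r ../BCM-bnx2x-fixed-offline_bundle.zip .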

Aaron

Rabie
Contributor

I just got an email from my HP support representative: yes, the 4.1 driver has been released, but the 4.0 driver has been delayed a little and will hopefully be available sometime next week.
