Is anyone else having this issue? We just had 3 servers crash due to a bnx2x_panic_dump. Once the network cards crashed, the ESX server had to be rebooted to come back. Even though only a few vmnics died, the entire server became unreachable, and so did its VMs, even when the vmnic wasn't bound to the vSwitch the VM was on.
After researching, it appears that VMware supports three different drivers:
1. bnx2x version 1.45.20
2. bnx2x version 1.48.107.v40.2
3. bnx2x version 1.52.12.v40.3
On 6/10/2010 VMware came out with a patch for 1.45.20, but esxupdate marked it obsolete, since our version (1.52.12.v40.3) was newer. Should I downgrade my driver?
Also the VMware HCL has conflicting information. According to this:
1.52.12.v40.3 is supported by vSphere 4 Update 2 but not vSphere 4 Update 1, yet the U2 release only has an update for the 1.45.20 driver.
Yet according to this:
1.52.12.v40.3 is supported by both vSphere 4 Update 2 and vSphere 4 Update 1.
Here are the details of my environment:
HP BL460G6 blade servers, with flex-10 modules.
The individual blades are using HP NC532i Dual Port 10GbE Multifunction BL-c Adapter, firmware bc 5.0.11.
The chassis OA itself is using firmware v3.0.
The Flex-10 module is using firmware v. 2.33.
Crash Dump:
Jun 16 17:03:54 esx-2-6 vmkernel: 0:01:03:09.131 cpu1:4426)VMotionRecv: 1080: 1276732954553852 D: Estimated network bandwidth 75.588 MB/s during page-in
Jun 16 17:03:54 esx-2-6 vmkernel: 0:01:03:09.131 cpu7:4420)VMotion: 3381: 1276732954553852 D: Received all changed pages.
Jun 16 17:03:54 esx-2-6 vmkernel: 0:01:03:09.245 cpu7:4420)Alloc: vm 4420: 12651: Regular swap file bitmap checks out.
Jun 16 17:03:54 esx-2-6 vmkernel: 0:01:03:09.246 cpu7:4420)VMotion: 3218: 1276732954553852 D: Resume handshake successful
Jun 16 17:03:54 esx-2-6 vmkernel: 0:01:03:09.246 cpu3:4460)Swap: vm 4420: 9289: Starting prefault for the migration swap file
Jun 16 17:03:54 esx-2-6 vmkernel: 0:01:03:09.259 cpu0:4460)Swap: vm 4420: 9406: Finish swapping in migration swap file. (faulted 0 pages, pshared 0 pages). Success.
Jun 16 17:09:42 esx-2-6 vmkernel: 0:01:08:57.229 cpu1:4280)<3>[bnx2x_stats_update:4639(vmnic1)]storm stats were not updated for 3 times
Jun 16 17:09:42 esx-2-6 vmkernel: 0:01:08:57.229 cpu1:4280)<3>[bnx2x_stats_update:4640(vmnic1)]driver assert
Jun 16 17:09:42 esx-2-6 vmkernel: 0:01:08:57.229 cpu1:4280)<3>[bnx2x_panic_dump:658(vmnic1)]begin crash dump -
Jun 16 17:09:42 esx-2-6 vmkernel: 0:01:08:57.229 cpu1:4280)<3>[bnx2x_panic_dump:666(vmnic1)]def_c_idx(0xff5) def_u_idx(0x0) def_x_idx(0x0) def_t_idx(0x0) def_att_idx(0xc) attn_state(0x0) spq_prod_idx(0xf8)
Jun 16 17:09:42 esx-2-6 vmkernel: 0:01:08:57.229 cpu1:4280)<3>[bnx2x_panic_dump:677(vmnic1)]fp0: rx_bd_prod(0x6fe7) rx_bd_cons(0x3e9) *rx_bd_cons_sb(0x0) rx_comp_prod(0x7059) rx_comp_cons(0x6c59) *rx_cons_sb(0x6c59)
Jun 16 17:09:42 esx-2-6 vmkernel: 0:01:08:57.229 cpu1:4280)<3>[bnx2x_panic_dump:682(vmnic1)] rx_sge_prod(0x0) last_max_sge(0x0) fp_u_idx(0x6afb) *sb_u_idx(0x6afb)
Jun 16 17:09:42 esx-2-6 vmkernel: 0:01:08:57.229 cpu1:4280)<3>[bnx2x_panic_dump:693(vmnic1)]fp0: tx_pkt_prod(0x0) tx_pkt_cons(0x0) tx_bd_prod(0x0) tx_bd_cons(0x0) *tx_cons_sb(0x0)
Jun 16 17:09:42 esx-2-6 vmkernel: 0:01:08:57.229 cpu1:4280)<3>[bnx2x_panic_dump:697(vmnic1)] fp_c_idx(0x0) *sb_c_idx(0x0) tx_db_prod(0x0)
Jun 16 17:09:42 esx-2-6 vmkernel: 0:01:08:57.229 cpu1:4280)<3>[bnx2x_panic_dump:712(vmnic1)]fp0: rx_bd[4f]=[0:deda0310] sw_bd=[0x4100b462c940]
Jun 16 17:09:42 esx-2-6 vmkernel: 0:01:08:57.229 cpu1:4280)<3>[bnx2x_panic_dump:712(vmnic1)]fp0: rx_bd[50]=[0:de706590] sw_bd=[0x4100b4697b80]
Jun 16 17:09:42 esx-2-6 vmkernel: 0:01:08:57.229 cpu1:4280)<3>[bnx2x_panic_dump:712(vmnic1)]fp0: rx_bd[51]=[0:deac2810] sw_bd=[0x4100baad8e80]
Jun 16 17:09:42 esx-2-6 vmkernel: 0:01:08:57.229 cpu1:4280)<3>[bnx2x_panic_dump:712(vmnic1)]fp0: rx_bd[52]=[0:de9ae390] sw_bd=[0x4100bda03f40]
Jun 16 17:09:42 esx-2-6 vmkernel: 0:01:08:57.229 cpu1:4280)<3>[bnx2x_panic_dump:712(vmnic1)]fp0: rx_bd[53]=[0:de3e9a90] sw_bd=[0x4100b463ecc0]
Jun 16 17:09:42 esx-2-6 vmkernel: 0:01:08:57.229 cpu1:4280)<3>[bnx2x_panic_dump:712(vmnic1)]fp0: rx_bd[54]=[0:3ea48730] sw_bd=[0x4100bab19100]
Jun 16 17:09:42 esx-2-6 vmkernel: 0:01:08:57.229 cpu1:4280)<3>[bnx2x_panic_dump:712(vmnic1)]fp0: rx_bd[55]=[0:de5b1190] sw_bd=[0x4100bda83980]
Jun 16 17:09:42 esx-2-6 vmkernel: 0:01:08:57.229 cpu1:4280)<3>[bnx2x_panic_dump:712(vmnic1)]fp0: rx_bd[56]=[0:ded48410] sw_bd=[0x4100bdb06080]
Jun 16 17:09:42 esx-2-6 vmkernel: 0:01:08:57.229 cpu1:4280)<3>[bnx2x_panic_dump:712(vmnic1)]fp0: rx_bd[57]=[0:3e3f0d10] sw_bd=[0x4100bca0f480]
Jun 16 17:09:42 esx-2-6 vmkernel: 0:01:08:57.229 cpu1:4280)<3>[bnx2x_panic_dump:712(vmnic1)]fp0: rx_bd[58]=[0:de742110] sw_bd=[0x4100bda35d40]
Jun 16 17:09:42 esx-2-6 vmkernel: 0:01:08:57.230 cpu1:4280)<3>[bnx2x_panic_dump:712(vmnic1)]fp0: rx_bd[59]=[0:de6ffc90] sw_bd=[0x4100bcab3800]
Jun 16 17:09:42 esx-2-6 vmkernel: 0:01:08:57.230 cpu1:4280)<3>[bnx2x_panic_dump:712(vmnic1)]fp0: rx_bd[5a]=[0:de619710] sw_bd=[0x4100b4640c40]
Jun 16 17:09:42 esx-2-6 vmkernel: 0:01:08:57.230 cpu1:4280)<3>[bnx2x_panic_dump:712(vmnic1)]fp0: rx_bd[5b]=[0:de627e10] sw_bd=[0x4100bcaad440]
Jun 16 17:09:42 esx-2-6 vmkernel: 0:01:08:57.230 cpu1:4280)<3>[bnx2x_panic_dump:712(vmnic1)]fp0: rx_bd[5c]=[0:3e455e10] sw_bd=[0x4100b462a9c0]
Jun 16 17:09:42 esx-2-6 vmkernel: 0:01:08:57.230 cpu1:4280)<3>[bnx2x_panic_dump:712(vmnic1)]fp0: rx_bd[5d]=[0:de3a6110] sw_bd=[0x4100bdaf1d80]
Jun 16 17:09:42 esx-2-6 vmkernel: 0:01:08:57.230 cpu1:4280)<3>[bnx2x_panic_dump:712(vmnic1)]fp0: rx_bd[5e]=[0:3e37df90] sw_bd=[0x4100b470d580]
Any thoughts or suggestions?
Yep, the NC532i and NC532m (i = integrated/on-board, m = mezzanine) are Broadcom NICs and have the issue.
The NC364m is an HP/Intel NIC and uses the e1000e version 0.4.1.7.1 driver, so it should not have issues.
If you want to use Flex-10, consider the QLogic card (NC522m), which is a recommended workaround from HP, but it may require additional VCs.
Cheers
Mike
Thanks for your kind response!
Another question: we currently use HP BL680c blades with the NC373i or NC326i. Are the NC373i or NC326i affected on ESX (ESXi) 4.1?
As far as everyone in this thread knows, the problem mentioned exists only when using the HP NC532i in combination with the Flex-10 interconnects.
Any problems when using non-Flex-10 adapters such as the two you mentioned would most likely not be related to this case nor the symptoms we're seeing when using the NC532i cards in a Flex-10 environment.
You might be right... but because both the on-board and mezzanine cards use the same driver, you may still see this problem.
If it were specifically the on-board card, the NC532m would be listed as a workaround, and it is not.
Food for thought for anyone wanting to configure an ESX 4.1 environment with Flex-10.
Cheers
Mike
The VMware article states that beacon probing does not support EtherChannel connections. How are you using EtherChannel AND beacon probing?
I'm trying to work out how to set up my vSwitch... right now it's set to Link Status with both vmnic0 and vmnic1 active. This caused a problem this morning when we did a Flex-10 firmware update. Since we're using EtherChannel, what is the best layout for the vSwitch?
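Before changing the teaming policy, it can help to dump the current vSwitch layout. A small sketch using the classic ESX 4 service-console commands (the host and vSwitch names in your output will of course differ):

```shell
# List all vSwitches, their uplink vmnics, port groups, and teaming config
esxcfg-vswitch -l

# List physical NICs and their link state, to confirm which vmnics are up
esxcfg-nics -l
```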
Sorry, we are not. My diagram was wrong in a previous post; I've fixed it.
I have experienced this issue and just got word from HP Support that there is a tentative release date of November 5th for a new driver. It will be a 1.60 version, according to HP. HP said that once it is ready, it will be available on the VMware download site.
Hopefully this fixes the Smart Link issues too.
I think I have the URL for 1.60.
Have you tried this?
-MattG
If you find this information useful, please award points for "correct" or "helpful".
Not yet. You need to download the ISO, extract just BCM-bnx2x-1.60.50.v41.2-offline_bundle-320302.zip, and install it using vihostupdate.
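Roughly, the steps look like this. This is only a sketch: the ISO filename, mount point, and host name below are assumptions (only the offline-bundle zip name comes from this thread), and it assumes the vMA or vCLI is already set up.

```shell
# Sketch only -- ISO name, mount point, and host are placeholders.
# 1. Loop-mount the downloaded driver CD ISO and copy out the offline bundle.
mkdir -p /mnt/driver-iso
mount -o loop bnx2x-1.60-driver-cd.iso /mnt/driver-iso
cp /mnt/driver-iso/offline-bundle/BCM-bnx2x-1.60.50.v41.2-offline_bundle-320302.zip .
umount /mnt/driver-iso

# 2. From the vMA (or a vCLI machine), install the bundle on the host,
#    then reboot the host so the new driver loads.
vihostupdate --server esx-2-6 --username root \
    --install --bundle BCM-bnx2x-1.60.50.v41.2-offline_bundle-320302.zip
```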
I will be testing tomorrow in our demo environment. Is there a way to replicate the PSOD, or does it happen fairly frequently with smart link detection turned on?
Just having smartlink on will not cause the PSODs...
We saw it only when we had around 30-40 VMs and were trying to back them up during our backup window, so it appears to be a combination of many servers with high network I/O.
We tried to repro with just high network I/O between a few VMs but couldn't get them to fail (on 1.52).
We will be doing the same tests in the next few days.
Aaron
We ran perfectly fine on 1.52, even with simulated network load tests running for a couple of hours on one cluster and thrashing the network.
That same cluster also had dev VMs on it and ran for a month or two without issues. Our other cluster runs the 1.48 driver, has the bulk of our production VMs including a couple of MSSQL servers, and has also been stable for months after downgrading to 1.48.
Then we had to move the production VMs across to the second cluster while we did maintenance on the first cluster; within 3 hours, 2 nodes locked up, and they seemed to hold the bulk of our MSSQL servers.
From what I have gathered from the forums and some support conversations, checksum offloading in conjunction with the makeup of MSSQL packets aggravates the problem and causes either network failure or PSODs, which is why not everyone is seeing the problem.
I do see that VMware has released the 1.60 driver for ESX 4.1 officially; however, that driver is not currently available for 4.0, at least not via the website. Neither driver is listed on HP's site, which is very annoying, as you would hope they would be driving this issue, since blades and 10Gb are among their flagship products.
R
We were told by our HP contact that the 4.0 driver is a few weeks away.
We were told the 4.0 driver would be available at the end of October, with the 4.1 driver following the next week.
Now the 4.1 driver has arrived on time as promised, so where is my 4.0 driver, VMware/HP?
I am trying to apply the new driver, and Update Manager is telling me it can't read the metadata from the zip.
Any thoughts?
-MattG
Installed OK for me using the vMA and vihostupdate.
From a remote SSH session into the TSM, I ran:
esxupdate --bundle=BCM-bundle.zip update
And it worked.
-MattG
Yeah, the Update Manager way is "broken": the vmware.xml file in the bundle is improperly formatted.
I was able to edit the file and re-zip it to make it work, although I'm not sure if that is supported.
Aaron
I just got an email from my HP support representative: yes, the 4.1 driver has been released, but 4.0 has been delayed a little and will hopefully be available sometime next week.