Is anyone else having this issue? We just had 3 servers crash due to a bnx2x_panic_dump. Once the network cards crashed the ESX server had to be rebooted to come back. Even though only a few vmNICs died, the entire server became unreachable, and the VMs became unreachable, even if the vmnic wasn’t bound to the vSwitch that the VM was on.
After researching it appears that VMware supports 3 different drivers:
1. bnx2x version 1.45.20
2. bnx2x version 1.48.107.v40.2
3. bnx2x version 1.52.12.v40.3
On 6/10/2010 VMware came out with a patch for 1.45.20, but esxupdate marked it obsolete, since our version (1.52.12.v40.3) was newer. Should I downgrade my driver?
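For what it's worth, here is a minimal sketch of why a patch for 1.45.20 could be flagged as obsolete on a host already running 1.52.12.v40.3. It assumes a plain numeric, component-by-component comparison of the version strings; the function names are hypothetical and esxupdate's real logic may well differ.

```python
import re


def version_key(version: str) -> tuple:
    """Turn '1.52.12.v40.3' into (1, 52, 12, 40, 3) so versions compare numerically."""
    return tuple(int(n) for n in re.findall(r"\d+", version))


def is_obsoleted_by(candidate: str, installed: str) -> bool:
    """True if the candidate driver is older than the installed one (assumed rule)."""
    return version_key(candidate) < version_key(installed)


installed = "1.52.12.v40.3"
for candidate in ("1.45.20", "1.48.107.v40.2", "1.52.12.v40.3"):
    print(candidate, "obsolete:", is_obsoleted_by(candidate, installed))
```

Under that assumption both 1.45.20 and 1.48.107.v40.2 sort below the installed driver, which would match the behaviour described above.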
Also the VMware HCL has conflicting information. According to this:
1.52.12.v40.3 is supported by vSphere4 Update2, and not vSphere Update1, yet the U2 release only has an update for the 1.45.20 driver.
Yet according to this:
1.52.12.v40.3 is supported by both vSphere4 Update2 and vSphere Update1.
Here are the details of my environment:
HP BL460G6 blade servers, with flex-10 modules.
The individual blades are using HP NC532i Dual Port 10GbE Multifunction BL-c Adapter, firmware bc 5.0.11.
The chassis OA itself is using firmware v3.0.
The Flex-10 module is using firmware v. 2.33.
Crash Dump:
Jun 16 17:03:54 esx-2-6 vmkernel: 0:01:03:09.131 cpu1:4426)VMotionRecv: 1080: 1276732954553852 D: Estimated network bandwidth 75.588 MB/s during page-in
Jun 16 17:03:54 esx-2-6 vmkernel: 0:01:03:09.131 cpu7:4420)VMotion: 3381: 1276732954553852 D: Received all changed pages.
Jun 16 17:03:54 esx-2-6 vmkernel: 0:01:03:09.245 cpu7:4420)Alloc: vm 4420: 12651: Regular swap file bitmap checks out.
Jun 16 17:03:54 esx-2-6 vmkernel: 0:01:03:09.246 cpu7:4420)VMotion: 3218: 1276732954553852 D: Resume handshake successful
Jun 16 17:03:54 esx-2-6 vmkernel: 0:01:03:09.246 cpu3:4460)Swap: vm 4420: 9289: Starting prefault for the migration swap file
Jun 16 17:03:54 esx-2-6 vmkernel: 0:01:03:09.259 cpu0:4460)Swap: vm 4420: 9406: Finish swapping in migration swap file. (faulted 0 pages, pshared 0 pages). Success.
Jun 16 17:09:42 esx-2-6 vmkernel: 0:01:08:57.229 cpu1:4280)<3>[bnx2x_stats_update:4639(vmnic1)]storm stats were not updated for 3 times
Jun 16 17:09:42 esx-2-6 vmkernel: 0:01:08:57.229 cpu1:4280)<3>[bnx2x_stats_update:4640(vmnic1)]driver assert
Jun 16 17:09:42 esx-2-6 vmkernel: 0:01:08:57.229 cpu1:4280)<3>[bnx2x_panic_dump:658(vmnic1)]begin crash dump -
Jun 16 17:09:42 esx-2-6 vmkernel: 0:01:08:57.229 cpu1:4280)<3>[bnx2x_panic_dump:666(vmnic1)]def_c_idx(0xff5) def_u_idx(0x0) def_x_idx(0x0) def_t_idx(0x0) def_att_idx(0xc) attn_state(0x0) spq_prod_idx(0xf8)
Jun 16 17:09:42 esx-2-6 vmkernel: 0:01:08:57.229 cpu1:4280)<3>[bnx2x_panic_dump:677(vmnic1)]fp0: rx_bd_prod(0x6fe7) rx_bd_cons(0x3e9) *rx_bd_cons_sb(0x0) rx_comp_prod(0x7059) rx_comp_cons(0x6c59) *rx_cons_sb(0x6c59)
Jun 16 17:09:42 esx-2-6 vmkernel: 0:01:08:57.229 cpu1:4280)<3>[bnx2x_panic_dump:682(vmnic1)] rx_sge_prod(0x0) last_max_sge(0x0) fp_u_idx(0x6afb) *sb_u_idx(0x6afb)
Jun 16 17:09:42 esx-2-6 vmkernel: 0:01:08:57.229 cpu1:4280)<3>[bnx2x_panic_dump:693(vmnic1)]fp0: tx_pkt_prod(0x0) tx_pkt_cons(0x0) tx_bd_prod(0x0) tx_bd_cons(0x0) *tx_cons_sb(0x0)
Jun 16 17:09:42 esx-2-6 vmkernel: 0:01:08:57.229 cpu1:4280)<3>[bnx2x_panic_dump:697(vmnic1)] fp_c_idx(0x0) *sb_c_idx(0x0) tx_db_prod(0x0)
Jun 16 17:09:42 esx-2-6 vmkernel: 0:01:08:57.229 cpu1:4280)<3>[bnx2x_panic_dump:712(vmnic1)]fp0: rx_bd[4f]=[0:deda0310] sw_bd=[0x4100b462c940]
Jun 16 17:09:42 esx-2-6 vmkernel: 0:01:08:57.229 cpu1:4280)<3>[bnx2x_panic_dump:712(vmnic1)]fp0: rx_bd[50]=[0:de706590] sw_bd=[0x4100b4697b80]
Jun 16 17:09:42 esx-2-6 vmkernel: 0:01:08:57.229 cpu1:4280)<3>[bnx2x_panic_dump:712(vmnic1)]fp0: rx_bd[51]=[0:deac2810] sw_bd=[0x4100baad8e80]
Jun 16 17:09:42 esx-2-6 vmkernel: 0:01:08:57.229 cpu1:4280)<3>[bnx2x_panic_dump:712(vmnic1)]fp0: rx_bd[52]=[0:de9ae390] sw_bd=[0x4100bda03f40]
Jun 16 17:09:42 esx-2-6 vmkernel: 0:01:08:57.229 cpu1:4280)<3>[bnx2x_panic_dump:712(vmnic1)]fp0: rx_bd[53]=[0:de3e9a90] sw_bd=[0x4100b463ecc0]
Jun 16 17:09:42 esx-2-6 vmkernel: 0:01:08:57.229 cpu1:4280)<3>[bnx2x_panic_dump:712(vmnic1)]fp0: rx_bd[54]=[0:3ea48730] sw_bd=[0x4100bab19100]
Jun 16 17:09:42 esx-2-6 vmkernel: 0:01:08:57.229 cpu1:4280)<3>[bnx2x_panic_dump:712(vmnic1)]fp0: rx_bd[55]=[0:de5b1190] sw_bd=[0x4100bda83980]
Jun 16 17:09:42 esx-2-6 vmkernel: 0:01:08:57.229 cpu1:4280)<3>[bnx2x_panic_dump:712(vmnic1)]fp0: rx_bd[56]=[0:ded48410] sw_bd=[0x4100bdb06080]
Jun 16 17:09:42 esx-2-6 vmkernel: 0:01:08:57.229 cpu1:4280)<3>[bnx2x_panic_dump:712(vmnic1)]fp0: rx_bd[57]=[0:3e3f0d10] sw_bd=[0x4100bca0f480]
Jun 16 17:09:42 esx-2-6 vmkernel: 0:01:08:57.229 cpu1:4280)<3>[bnx2x_panic_dump:712(vmnic1)]fp0: rx_bd[58]=[0:de742110] sw_bd=[0x4100bda35d40]
Jun 16 17:09:42 esx-2-6 vmkernel: 0:01:08:57.230 cpu1:4280)<3>[bnx2x_panic_dump:712(vmnic1)]fp0: rx_bd[59]=[0:de6ffc90] sw_bd=[0x4100bcab3800]
Jun 16 17:09:42 esx-2-6 vmkernel: 0:01:08:57.230 cpu1:4280)<3>[bnx2x_panic_dump:712(vmnic1)]fp0: rx_bd[5a]=[0:de619710] sw_bd=[0x4100b4640c40]
Jun 16 17:09:42 esx-2-6 vmkernel: 0:01:08:57.230 cpu1:4280)<3>[bnx2x_panic_dump:712(vmnic1)]fp0: rx_bd[5b]=[0:de627e10] sw_bd=[0x4100bcaad440]
Jun 16 17:09:42 esx-2-6 vmkernel: 0:01:08:57.230 cpu1:4280)<3>[bnx2x_panic_dump:712(vmnic1)]fp0: rx_bd[5c]=[0:3e455e10] sw_bd=[0x4100b462a9c0]
Jun 16 17:09:42 esx-2-6 vmkernel: 0:01:08:57.230 cpu1:4280)<3>[bnx2x_panic_dump:712(vmnic1)]fp0: rx_bd[5d]=[0:de3a6110] sw_bd=[0x4100bdaf1d80]
Jun 16 17:09:42 esx-2-6 vmkernel: 0:01:08:57.230 cpu1:4280)<3>[bnx2x_panic_dump:712(vmnic1)]fp0: rx_bd[5e]=[0:3e37df90] sw_bd=[0x4100b470d580]
Any thoughts or suggestions?
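To check other hosts for the same signature, something like the following rough Python sketch could scan a vmkernel log for the three events shown in the dump above (stalled storm stats, driver assert, start of the panic dump). The regular expression is inferred only from the lines posted here, and the event names are made up for illustration.

```python
import re
from collections import defaultdict

# Matches the driver-tagged part of a vmkernel line, e.g.
# "<3>[bnx2x_panic_dump:658(vmnic1)]begin crash dump -"
DRIVER_LINE = re.compile(
    r"\[(?P<func>bnx2x_\w+):(?P<line>\d+)\((?P<nic>vmnic\d+)\)\](?P<msg>.*)"
)


def summarize_bnx2x_events(lines):
    """Return {vmnic: set of event names} for an iterable of log lines."""
    events = defaultdict(set)
    for text in lines:
        match = DRIVER_LINE.search(text)
        if not match:
            continue  # not a bnx2x driver message
        nic = match.group("nic")
        msg = match.group("msg").strip()
        if "storm stats were not updated" in msg:
            events[nic].add("stats_stalled")
        elif msg == "driver assert":
            events[nic].add("driver_assert")
        elif match.group("func") == "bnx2x_panic_dump" and msg.startswith("begin crash dump"):
            events[nic].add("panic_dump")
    return dict(events)


sample = [
    "Jun 16 17:09:42 esx-2-6 vmkernel: 0:01:08:57.229 cpu1:4280)<3>[bnx2x_stats_update:4639(vmnic1)]storm stats were not updated for 3 times",
    "Jun 16 17:09:42 esx-2-6 vmkernel: 0:01:08:57.229 cpu1:4280)<3>[bnx2x_stats_update:4640(vmnic1)]driver assert",
    "Jun 16 17:09:42 esx-2-6 vmkernel: 0:01:08:57.229 cpu1:4280)<3>[bnx2x_panic_dump:658(vmnic1)]begin crash dump -",
    "Jun 16 17:03:54 esx-2-6 vmkernel: 0:01:03:09.245 cpu7:4420)Alloc: vm 4420: 12651: Regular swap file bitmap checks out.",
]
print(summarize_bnx2x_events(sample))
```

In practice the lines would come from the host's vmkernel log file rather than a hard-coded list; the path is left out here because it differs between ESX and ESXi.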
HP Customer Advisory has gone public:
Virtually,
VMware Certified Professional
NOTE: If your problem or question has been resolved, please mark this thread as answered and award points accordingly.
We are having the exact same issues with vSphere 4.1 on a brand new HP Blade Chassis system with Flex 10 virtual connect ethernet modules (rev 3.01).
It's been running fine for 8 weeks, and just this week decided to die. We now get a PSOD, which seems to occur when any load is added (yet we can't easily replicate the issue). We have 4 vSphere BL460c G6 blades and 3 non-virtual blades in the enclosure running Win 2008. Of the physical OS blades, 1 or all may lose network connectivity. It's a bit random, so it seems it is the entire Flex connection that dies. iLO was lost on the first outage we had, but has not died in the subsequent outages since upgrading the OA firmware to 3.11 on the Flex 10.
We've replaced the Chassis OA controller and Midplane, but it's made no difference.
Our bnx2x driver version is 1.54.
Did anyone else with these issues find a satisfactory fix?
We still have an open ticket with VMware; I suggest you open one as
well. We are able to reproduce the problem by simply connecting to a SQL
server through the network connection as opposed to connecting locally.
When we do, all ESXi 4.1 blades in the chassis lose Flex-10 connectivity.
Curiously, the ESXi 4.0u2 blades that also reside in the same chassis
continue operating fine.
On Tue, Sep 14, 2010 at 9:44 PM, fishface
I ended up with the following configuration:
Onboard Admin FW = 3.11
Virtual Connect FW = 3.01
iLO FW = 2.00
BL460G6 BIOS = 2010.05.20 (A) (24 Jun 2010)
ESXi version = 4.0 Update2 - clean install
The ESXi version means that we are using the 1.52 Broadcom NIC driver.
Items I am curious about:
Any of you that are still having problems, did you do a clean install of 4.1 or an upgrade?
Is anyone running 4.1 with VC FW of 2.32 or 2.33? And what version of OA?
Items to note:
I am not confident in HP at this point when it comes to VC Flex-10 and their VMware design and support decisions. Even their latest VC Cookbook says that Beacon Probing can be used with 2 NICs. Also, the cookbook makes no reference to 4.1, which was released at least 2 months before the current version of the cookbook.
Re: fishface's environment, our details are:
Onboard Admin FW = 3.11
Virtual Connect FW = 3.01
iLO 2 Firmware = 1.82
BL460G6 BIOS = 2010.05.20 (A) (24 Jun 2010)
ESXi version = 4.1 - clean install
1.54 Broadcom NIC driver.
When the ESX server hits a PSoD, approximately 4 minutes later all the other ESX servers are disconnected, along with their associated VMs.
1 of the 3 dedicated Windows servers is also disconnected (Windows Server 2008 x64).
We have a case logged with VMware and HP. What's the bet they point fingers at each other?
Cheers
Mike
#############
Sorry, I was logged into VMware with my colleague's account and posted this under his name. I've removed and updated my post.
#############
VMware support is now telling us that we need to install this patch to make the HP BL460c G6 with Intel 56xx processors supported.
They think it will remove the Exception 6 PSoD.
It's an ESXi 4.0 patch for ESXi 4.1????
Has anyone done this?
I've imported it into vCenter 4.1 Update Manager, but it's telling me I'm compliant... grrrrrr. Getting VMware support to call me back.
Cheers
Mike
Message was edited by: sayahdo
In my environment we are running 55xx procs, so this patch is not a concern for me, but I also found that HP has released an updated BIOS for the 56xx processors that has to do with memory errors. You can read about that here: http://h20000.www2.hp.com/bizsupport/TechSupport/SoftwareDescription.jsp?lang=en&cc=us&prodTypeId=37...
Hi,
As noted in my post (a couple above),
I was logged into VMware with my colleague's account to access the VMware Service Request and made a post under his name. I've removed and updated my post.
VMware support has come back to me. During a WebEx session they told me that we didn't need the patch,
because we were running VMware ESXi 4.1.0 GA, which already included it.
Will check out the BIOS firmware.
Cheers
Mike
So HP Level 2 Support pointed the finger at VMware.
VMware support pointed the finger at HP.
Wicked.
Anyhow, we have an inside HP blade specialist who has said the following config will work:
Onboard Admin FW = 3.11
Virtual Connect FW = 3.01
iLO 2 Firmware = 1.82
BL460G6 BIOS = 2010.08.16 (9 Sep 2010) (NEW)
NC532i = 2.2.4 (LOWER)
ESXi version = 4.1 - clean install
bnx2x-1.48
BUT.....
We are getting an error trying to get the driver into ESXi 4.1. It tells us, very politely, that it is obsolete.
Has anyone injected the 1.48 or 1.52 driver into ESXi 4.1??
Cheers
Mike
Again HP is just not getting it. Not surprising after reading this:
We have had the same problem for 4 weeks.
ESXi 4.1 installed on HP BL460c G6 with VC.
The server crashes to a PSOD every time somebody in a virtual Windows machine does some MS SQL work, such as running the configuration wizard or using SQL Management Studio to import/export data to an MS SQL Server running on a VM.
The problem has been reported to VMware and HP. VMware first told us that this is a new bug and an unknown problem. We opened an SR on 19 August. The first VMware and HP forum entries are from the beginning of August.
We have ESX 4.0 servers in the same blade enclosure; there we have no problem with the same VMs and the same actions.
Yesterday we started loading the 1.45 and 1.48 bnx2x drivers from ESX 4.0 U2 onto some test ESXi 4.1 servers, and the first tests were fine: no PSOD or crash.
We have to do some more tests with the MS SQL tools that were used to crash an ESXi 4.1 host with the 1.54 bnx2x driver.
This can be a first workaround until VMware gets a new working Broadcom driver for the 57711 chips and HP Virtual Connect.
It lets us continue our project work, which had to be interrupted because the server crashed every time we tried to configure our application database.
VMware support has officially told us that Broadcom has told them not to downgrade the driver or firmware with ESX 4.1.
It didn't help that ESXi 4.1 GA won't allow you to load the downgraded driver. Even the VMware tech couldn't do it!
This is a massive problem and has been escalated within VMware, Broadcom and HP.
I've heard from a few people that the 1.6 driver is due out soon, as well as a new release of 4.1, any day now.
We are expecting a call from the VMware escalation point of contact soon.
Cheers
Mike
Hey Robert,
What versions are you running for your:
O/A
VC
If you use ESX 4.0, it states that it has only been tested with driver v1.52.12.v40.3, firmware v2.2.6, and VC version 2.33.
Cheers
Hi all,
Any news or progress on this problem? We are about to upgrade to vSphere 4.1 and are running VC and HP BL490 G6 blades, but won't do that until I have heard that this has been fixed. I have installed vSphere 4.1 in our test environment and haven't seen this problem there, but we are only running a couple of test servers in that environment, and they are pretty much idle all the time. We are not running the latest VC firmware (we are still on 2.31), but that was one of the things we would upgrade during this move to vSphere 4.1. Any new input on this matter is appreciated.
Regards, Stefan
We have been running the following configuration, stable, for nearly 2 weeks:
BL460c G6 Blade BIOS – 2010.08.16 (9 Sep 2010)
Flex 10 NIC NC532i Driver – 1.54 (inbox)
Flex 10 NIC NC532i Firmware – 2.2.6
VC Flex 10 Eth Firmware – 3.01
OA Firmware – 3.10
Smart Link – TURNED OFF
Beacon Probing - on
These are the recommendations from VMware and HP Support
There are 2 changes from the original deployment:
Blade BIOS has been updated
Smart Link has been turned off
It would seem that the combination of these 2 has fixed the problem. Touch wood!
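A small sketch to make the difference between the two deployments explicit. It only encodes values stated in this thread; the key names are hypothetical, and the original Smart Link state of "on" is an assumption inferred from it having been turned off.

```python
# Values as posted in this thread; key names are illustrative only.
original = {
    "blade_bios": "2010.05.20",
    "nic_driver": "1.54",
    "vc_firmware": "3.01",
    "smart_link": "on",  # assumption: inferred from "has been turned off"
}
stable = {
    "blade_bios": "2010.08.16",
    "nic_driver": "1.54",
    "vc_firmware": "3.01",
    "smart_link": "off",
}


def changed_settings(before: dict, after: dict) -> dict:
    """Return {setting: (old, new)} for every setting whose value differs."""
    keys = sorted(set(before) | set(after))
    return {k: (before.get(k), after.get(k)) for k in keys if before.get(k) != after.get(k)}


for setting, (old, new) in changed_settings(original, stable).items():
    print(f"{setting}: {old} -> {new}")
```

Since both settings changed at the same time, a comparison like this cannot say which of the two actually removed the PSoD.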
Cheers
Mike
Hi Mike,
Thanks for the fast response.
When you say Smart Link is TURNED OFF, are you referring to unchecking the Smart Link box in VC? And have you then changed the network failover detection method in the vSwitch from "Link status only" to "Beacon Probing"?
I also hope that these changes are a workaround that helps. I would have preferred to run with "Link status only", but that obviously isn't working, so....
Regards, Stefan
Hi,
Yes, this is a workaround.
Yes, you must not use Smart Link. It must be turned off/disabled/unchecked.
See the following HP advisory for more info. While it states that Smart Link doesn't work, it doesn't state that it must be turned off/disabled/unchecked.
http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?objectID=c02476622
And yes, when disabling Smart Link you need to change the vSwitch failover detection from Link Status to Beacon Probing.
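The two settings go together, so a simple sanity check along these lines might help when auditing several hosts. This is only a sketch of the rule as described in this thread; the function name, the parameter names and the "beacon_probing" / "link_status" strings are hypothetical and not taken from any VMware or HP tool.

```python
def failover_warnings(smart_link_enabled: bool, failover_detection: str) -> list:
    """Check a host's settings against the workaround described in this thread."""
    warnings = []
    if smart_link_enabled:
        # The workaround requires Smart Link to be off in Virtual Connect.
        warnings.append("Smart Link is enabled; the workaround requires it to be off.")
    elif failover_detection != "beacon_probing":
        # With Smart Link off, the thread says to switch the vSwitch
        # failover detection from link status to beacon probing.
        warnings.append("Smart Link is off but failover detection is not beacon probing.")
    return warnings


print(failover_warnings(smart_link_enabled=False, failover_detection="beacon_probing"))
print(failover_warnings(smart_link_enabled=False, failover_detection="link_status"))
```

An empty list means the host matches the workaround; anything else is worth a second look before putting load on it.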
We found that the combination of the BIOS update and turning off Smart Link removed our PSoD. And I'm not going to revert either change to see which one it was.
There is a new driver being worked on at the moment, 1.6. It's not due out for about another 5-6 weeks, so I've been told.
This should remedy the problem.
There are other workarounds:
Use QLogic Flex 10 mezz cards instead of the onboards. Bit of an expensive workaround! Unless it's a greenfield deployment.
HP came back to me with an alternative to Beacon Probing.
If your VC Flex 10 Eth module had 2 10 GbE uplinks to 2 switches, 4 uplinks in total, you could create a channel across each of the 2 connections for diversity. See below.
In our case we only had 1 uplink from each VC, 2 in total, so HP suggested this config using 1 Gb SFPs as a workaround.
We asked HP for QLogic cards and 2 more VCs at no cost to us, as the equipment is not fit for purpose :). They said no, and stated that if we deviated from the HP advisory, we would fall out of support. Bit humbug if you ask me, they just didn't want to stump up for new kit....
So... we have decided to stay with Beacon Probing at this stage. NO Smart Link: turned off/disabled/unchecked.
Hope this helps you all.
Cheers
Mike
Hi Mike ,
Thanks for the detailed input on this, really appreciated.
Cheers
Stefan
Hi:
We have a BL680c with the HP NC364m Quad Port 1GbE BL-c Adapter. I'm not sure whether this issue will have an impact on our ESX 4.1?
Thank you!
As far as I know, the problems only stem from using the NC532i Flex-10 adapter. As the NC364m is NOT a Flex-10 device, I doubt it exhibits the same problems.