VMware Cloud Community
Brock
Contributor

ESX host crashed, possibly caused by a failed NIC

The server crashed this morning but came back up after we rebooted it. It has two Broadcom cards and a separate Intel card. The kernel log from just before the crash is full of the errors below; vmnic0 is one of the Broadcom cards:

Nov 12 09:01:06 TABESX01 vmkernel: 210:21:41:44.920 cpu2:1026)<3>bnx2: vmnic0: BUG! Tx ring full when queue awake!

Nov 12 09:01:10 TABESX01 vmkernel: 210:21:41:48.919 cpu1:1078)WARNING: LinNet: 4288: Watchdog timeout for device vmnic0

Nov 12 09:01:12 TABESX01 vmkernel: 210:21:41:50.919 cpu2:1063)<3>bnx2: fw sync timeout, reset code = 1022a51
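To see how often each of these errors occurs, and against which device, the log can be tallied with a short script. This is only a rough sketch: the function name and the pattern labels are made up for illustration, the patterns are copied from the three lines above, and they are not a complete list of bnx2 failure messages.

```python
import re
from collections import Counter

# Patterns copied from the log lines quoted in this thread (not exhaustive).
# The labels on the left are illustrative names, not VMware terminology.
PATTERNS = {
    "tx_ring_full": re.compile(r"bnx2: (\w+): BUG! Tx ring full"),
    "watchdog_timeout": re.compile(r"Watchdog timeout for device (\w+)"),
    "fw_sync_timeout": re.compile(r"bnx2: fw sync timeout"),
}


def count_nic_errors(lines):
    """Count matching error lines, keyed by (error label, device name)."""
    counts = Counter()
    for line in lines:
        for label, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                # The fw sync timeout message does not name a device.
                device = match.group(1) if pattern.groups else "unknown"
                counts[(label, device)] += 1
    return counts
```

Feeding it the lines of the vmkernel log (for example `count_nic_errors(open(path))`, where `path` is wherever the log was saved) shows whether the errors are confined to vmnic0 or also hit the second Broadcom port.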

Here is part of the kernel log from after the ESX server came back up. I can't quite decipher it, but I assume it shows that the driver was only able to load one of the cards. vmnic0 is currently not active: it is not displayed in the GUI or by esxcfg-nics -l.

Nov 12 10:27:40 TABESX01 vmkernel: 0:00:00:03.444 cpu3:1036)PCI: driver bnx2 is looking for devices

Nov 12 10:27:40 TABESX01 vmkernel: 0:00:00:03.444 cpu3:1036)PCI: Trying 00:08.0

Nov 12 10:27:40 TABESX01 vmkernel: 0:00:00:03.444 cpu3:1036)PCI: Trying 00:1f.2

Nov 12 10:27:40 TABESX01 vmkernel: 0:00:00:03.444 cpu3:1036)PCI: Announcing 00:1f.2

Nov 12 10:27:40 TABESX01 vmkernel: 0:00:00:03.444 cpu3:1036)PCI: Trying 03:00.0

Nov 12 10:27:40 TABESX01 vmkernel: 0:00:00:03.444 cpu3:1036)PCI: Announcing 03:00.0

Nov 12 10:27:40 TABESX01 vmkernel: <6>Broadcom NetXtreme II Gigabit Ethernet Driver bnx2 v1.5.10b (May 1, 2007)

Nov 12 10:27:40 TABESX01 vmkernel: <6>vmnic0: Broadcom NetXtreme II BCM5708 1000Base-T (B2) PCI-X 64-bit 133MHz found at mem ce000000, IRQ 113, node addr ea5e 00 0cea5e

Nov 12 10:27:40 TABESX01 vmkernel: 0:00:00:03.633 cpu3:1036)PCI: driver bnx2 claimed device 03:00.0

Nov 12 10:27:40 TABESX01 vmkernel: 0:00:00:03.633 cpu3:1036)PCI: Registering network device 03:00.0

Nov 12 10:27:40 TABESX01 vmkernel: 0:00:00:03.633 cpu3:1036)Uplink: 2086: Couldn't find vmnic0. Creating a new node

Nov 12 10:27:40 TABESX01 vmkernel: 0:00:00:03.633 cpu3:1036)Uplink: 3481: Connecting device vmnic0 to pps

Nov 12 10:27:40 TABESX01 vmkernel: 0:00:00:03.633 cpu3:1036)Uplink: 3626: Device vmnic0 yet to come up

Nov 12 10:27:40 TABESX01 vmkernel: 0:00:00:03.633 cpu3:1036)LinPCI: 202: Device 3:0 claimed.

Nov 12 10:27:40 TABESX01 vmkernel: 0:00:00:03.633 cpu3:1036)Mod: 2529: called already for this device.

Nov 12 10:27:40 TABESX01 vmkernel: 0:00:00:03.633 cpu3:1036)PCI: Trying 04:00.0

Nov 12 10:27:40 TABESX01 vmkernel: 0:00:00:03.633 cpu3:1036)PCI: Announcing 04:00.0

Nov 12 10:27:40 TABESX01 vmkernel: 0:00:00:03.633 cpu3:1036)PCI: Trying 06:00.0

Nov 12 10:27:40 TABESX01 vmkernel: 0:00:00:03.633 cpu3:1036)PCI: Announcing 06:00.0

Nov 12 10:27:40 TABESX01 vmkernel: <6>vmnic1: Broadcom NetXtreme II BCM5708 1000Base-T (B2) PCI-X 64-bit 133MHz found at mem ca000000, IRQ 153, node addr 001a64 09b15a

Nov 12 10:27:40 TABESX01 vmkernel: 0:00:00:03.633 cpu3:1036)PCI: driver bnx2 claimed device 06:00.0

Nov 12 10:27:40 TABESX01 vmkernel: 0:00:00:03.633 cpu3:1036)PCI: Registering network device 06:00.0

Nov 12 10:27:40 TABESX01 vmkernel: 0:00:00:03.633 cpu3:1036)Uplink: 2086: Couldn't find vmnic1. Creating a new node

Nov 12 10:27:40 TABESX01 vmkernel: 0:00:00:03.633 cpu3:1036)Uplink: 3481: Connecting device vmnic1 to pps

Nov 12 10:27:40 TABESX01 vmkernel: 0:00:00:03.633 cpu3:1036)Uplink: 3626: Device vmnic1 yet to come up

Nov 12 10:27:40 TABESX01 vmkernel: 0:00:00:03.633 cpu3:1036)LinPCI: 202: Device 6:0 claimed.

Nov 12 10:27:40 TABESX01 vmkernel: 0:00:00:03.633 cpu3:1036)Mod: 2529: called already for this device.

Nov 12 10:27:40 TABESX01 vmkernel: 0:00:00:03.633 cpu3:1036)PCI: Trying 07:00.0

Nov 12 10:27:40 TABESX01 vmkernel: 0:00:00:03.633 cpu3:1036)PCI: Announcing 07:00.0

Nov 12 10:27:40 TABESX01 vmkernel: 0:00:00:03.633 cpu3:1036)PCI: Trying 07:00.1

Nov 12 10:27:40 TABESX01 vmkernel: 0:00:00:03.633 cpu3:1036)PCI: Announcing 07:00.1

Nov 12 10:27:40 TABESX01 vmkernel: 0:00:00:03.633 cpu3:1036)PCI: Trying 11:0e.0

Nov 12 10:27:40 TABESX01 vmkernel: 0:00:00:03.633 cpu3:1036)PCI: Announcing 11:0e.0

Nov 12 10:27:40 TABESX01 vmkernel: 0:00:00:03.633 cpu3:1036)PCI: driver bnx2 claimed 2 devices

Nov 12 10:27:40 TABESX01 vmkernel: 0:00:00:03.633 cpu3:1036)IDT: 1336: 0x71 <vmnic0> sharable (entropy source), flags 0x10

Nov 12 10:27:40 TABESX01 vmkernel: 0:00:00:05.634 cpu3:1036)<3>bnx2: fw sync timeout, reset code = 1020002

Nov 12 10:27:40 TABESX01 vmkernel: 0:00:00:05.634 cpu3:1036)IDT: 1801: 0x71

Nov 12 10:27:40 TABESX01 vmkernel: 0:00:00:05.634 cpu3:1036)IDT: 1868: <vmnic0>

Nov 12 10:27:41 TABESX01 vmkernel: 0:00:00:05.634 cpu3:1036)IDT: 1336: 0x99 <vmnic1> sharable (entropy source), flags 0x10

Nov 12 10:27:41 TABESX01 vmkernel: 0:00:00:05.734 cpu3:1036)Uplink: 2495: Setting capabilities 0x0 for device vmnic1

Nov 12 10:27:41 TABESX01 vmkernel: 0:00:00:05.734 cpu3:1036)NetNCP: 1818: Opening discovery port

Nov 12 10:27:41 TABESX01 vmkernel: 0:00:00:05.734 cpu3:1036)NetDiscover: 946: Using port 0x3's output chain

Nov 12 10:27:41 TABESX01 vmkernel: 0:00:00:05.734 cpu3:1036)Mod: 1436: Initialization for bnx2 succeeded with module ID 4.

Nov 12 10:27:41 TABESX01 vmkernel: 0:00:00:05.734 cpu3:1036)bnx2 loaded successfully.
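One way to read the boot log above mechanically is to compare the devices the driver registered with the devices that later got a "Setting capabilities" line. This is a sketch under a loud assumption: treating "Setting capabilities" as a sign that a device finished initializing is only a heuristic drawn from this one excerpt, not documented vmkernel behavior, and the function name is made up.

```python
import re

# Both patterns are taken from the Uplink lines in the excerpt above.
REGISTERED = re.compile(r"Uplink: \d+: Connecting device (\w+) to")
# ASSUMPTION: this line is treated as "device finished initializing".
INITIALIZED = re.compile(r"Uplink: \d+: Setting capabilities \S+ for device (\w+)")


def devices_not_initialized(lines):
    """Return registered devices that never got a 'Setting capabilities' line."""
    registered, initialized = set(), set()
    for line in lines:
        match = REGISTERED.search(line)
        if match:
            registered.add(match.group(1))
        match = INITIALIZED.search(line)
        if match:
            initialized.add(match.group(1))
    return sorted(registered - initialized)
```

Run against the excerpt above, this flags vmnic0 and not vmnic1, which matches vmnic0 being the card missing from the GUI.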

My guess is that the card is dead. Does anyone have any other thoughts on it?

Thanks

3 Replies
Texiwill
Leadership

Hello,

I would open a case with your VMware Support Representative. You will also need what was displayed on the Purple Screen, since it shows the exact code that failed, and we can usually trace that back to the failing hardware or software component. Without it, locating the failure is a guessing game.

However, given those errors, you may want to verify that the firmware for your network adapters is at the proper level for ESX, and that the BIOS is set properly, before proceeding with a support call. If everything is set properly it could be the card, but that is not definitive.


Best regards,

Edward L. Haletky

VMware Communities User Moderator

====

Author of the book 'VMware ESX Server in the Enterprise: Planning and Securing Virtualization Servers', Copyright 2008 Pearson Education.

SearchVMware Blog: http://itknowledgeexchange.techtarget.com/virtualization-pro/

Blue Gears Blogs - http://www.itworld.com/ and http://www.networkworld.com/community/haletky

As well as the Virtualization Wiki at http://www.astroarch.com/wiki/index.php/Virtualization

--
Edward L. Haletky
vExpert XIV: 2009-2023,
VMTN Community Moderator
vSphere Upgrade Saga: https://www.astroarch.com/blogs
GitHub Repo: https://github.com/Texiwill
coco26
Contributor

Hi,

Did anyone find a resolution to this? I have the same problem on my DL380 G5.

azn2kew
Champion

While waiting on the VMware SR, I would migrate the VMs to different hosts, do a fresh install of the host, update the HBA, NIC, and other firmware to the latest versions, and stress test it. I would use a different set of NICs for the SC and virtual machine port groups, preferably Intel quad-port cards. If that is not possible, spread the SC/VMotion/virtual machine load across the built-in Broadcom ports and the Intel ports, so that if a Broadcom port fails there is still an Intel port available to run the show.
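The idea of spreading each port group across both NIC vendors can be checked with a few lines of code once the layout is written down. This is a minimal sketch: the port group names, the vmnic2 and vmnic3 uplinks, and the vendor mapping in the example are hypothetical, and the data would have to be filled in by hand from the host's actual configuration.

```python
def single_vendor_port_groups(uplinks_by_group, vendor_by_nic):
    """Return port groups whose uplinks all come from the same NIC vendor."""
    flagged = []
    for group, nics in uplinks_by_group.items():
        vendors = {vendor_by_nic[nic] for nic in nics}
        # Fewer than two vendors means one bad driver or card can take
        # down every uplink for this port group.
        if len(vendors) < 2:
            flagged.append(group)
    return sorted(flagged)


# Hypothetical layout, for illustration only.
vendor_by_nic = {
    "vmnic0": "Broadcom",
    "vmnic1": "Broadcom",
    "vmnic2": "Intel",
    "vmnic3": "Intel",
}
uplinks_by_group = {
    "Service Console": ["vmnic0", "vmnic2"],
    "VMotion": ["vmnic1", "vmnic3"],
    "Virtual Machines": ["vmnic0", "vmnic1"],
}
print(single_vendor_port_groups(uplinks_by_group, vendor_by_nic))
```

In this made-up layout only the "Virtual Machines" port group is flagged, because both of its uplinks are Broadcom ports.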

If you found this information useful, please consider awarding points for "Correct" or "Helpful". Thanks!!!

Regards,

Stefan Nguyen

VMware vExpert 2009

iGeek Systems Inc.

VMware, Citrix, Microsoft Consultant
