VMware Cloud Community
rb51
Enthusiast

ESXi5.5 host failure - HA Cluster - PSOD

Hi all,

We had an issue with one of our ESXi5.5 hosts in our HA Cluster this weekend.

Hardware: Dell Blade M620, Broadcom BCM57810 10Gb

It threw a PSOD reporting a PCPU heartbeat failure ("no heartbeat") and showing Broadcom errors (see attached picture).

PSOD-CPU failure 051214-forum.JPG

Going through the vmkernel logs, it seems that one of the HBAs failed.

Extract from vmkernel.log

2014-12-05T23:13:58.377Z cpu8:10592052)WARNING: LinScsi: SCSILinuxAbortCommands:1837: Failed, Driver bnx2i, for vmhba36

...

2014-12-05T23:14:11.151Z cpu0:33497)<1>bnx2i::0x4109c61eab40: ####CID leaked bnx2i_tear_down_conn: sess 0x4109c75f4738 ep 0x4109dc859690 {0x5a, 0x1a}

...

2014-12-05T23:14:11.401Z cpu6:33497)bnx2i::0x4109c61eab40: bnx2i_conn_stop::vmnic3 - sess 0x4109c75f9948 conn 0x4109c75f9d20, icid 41, cmd stats={p=0,a=1,ts=1950037,tc=1950036}, ofld_conns 9
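
For anyone sifting through a longer vmkernel.log, here is a minimal Python sketch that counts the lines mentioning a given driver and groups them by the vmhba/vmnic named in each line. The line layout is assumed from the three lines quoted above only; the function name `summarize` and the regular expressions are illustrative, not part of any VMware tool.

```python
import re
from collections import Counter

# Assumed layout, taken from the extract above: "<timestamp>Z cpuN:WORLD)<message>"
LINE_RE = re.compile(r"^(?P<ts>\S+Z) cpu(?P<cpu>\d+):(?P<world>\d+)\)(?P<msg>.*)$")
# Device names as they appear in the extract (vmhba36, vmnic3)
DEVICE_RE = re.compile(r"\b(vmhba\d+|vmnic\d+)\b")


def summarize(lines, driver="bnx2i"):
    """Count log lines mentioning `driver`, grouped by the device named in them."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match or driver not in match.group("msg"):
            continue  # not a recognised log line, or unrelated to this driver
        device = DEVICE_RE.search(match.group("msg"))
        counts[device.group(1) if device else "(no device named)"] += 1
    return counts


# The three lines quoted in this post
SAMPLE = [
    "2014-12-05T23:13:58.377Z cpu8:10592052)WARNING: LinScsi: SCSILinuxAbortCommands:1837: Failed, Driver bnx2i, for vmhba36",
    "2014-12-05T23:14:11.151Z cpu0:33497)<1>bnx2i::0x4109c61eab40: ####CID leaked bnx2i_tear_down_conn: sess 0x4109c75f4738 ep 0x4109dc859690 {0x5a, 0x1a}",
    "2014-12-05T23:14:11.401Z cpu6:33497)bnx2i::0x4109c61eab40: bnx2i_conn_stop::vmnic3 - sess 0x4109c75f9948 conn 0x4109c75f9d20, icid 41, cmd stats={p=0,a=1,ts=1950037,tc=1950036}, ofld_conns 9",
]

if __name__ == "__main__":
    for device, count in summarize(SAMPLE).items():
        print(device, count)
```

Running the same function with `driver="bnx2x"` over the full log would be one way to check whether the second 10Gb card logged anything at all around the time of the failure.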

I will check whether newer drivers are available; I am also including a picture of the current driver details.

ethtool051214-forum.JPG

Has anyone come across this before?

Am I right in assuming that the issue is with the physical NIC?

What is strange is that the vmkernel log shows no problems from the other 10Gb card (same model; I know it should be from a different vendor), which should have kept working rather than letting the host fail (heartbeat).

Comments are appreciated.

4 Replies
rcporto
Leadership

The problem seems to be related to the Broadcom drivers, and I recommend upgrading them to a newer version: https://my.vmware.com/web/vmware/details?downloadGroup=DT-ESXI55-BROADCOM-BNX2-225FV558&productId=35...

Another option is to upgrade the Broadcom firmware as well, but you should look for that on the Dell support page.

---

Richardson Porto
Senior Infrastructure Specialist
LinkedIn: http://linkedin.com/in/richardsonporto
abhilashhb
VMware Employee

I agree with Richard.

PSODs usually happen because of a firmware mismatch. If you can confirm with the server vendor and get the latest firmware, that should also help prevent future failures.

For now, you can also go ahead and upgrade the Broadcom firmware, as that is the component causing the issue. I also suggest opening a ticket with VMware to get a better opinion.

Abhilash B
LinkedIn : https://www.linkedin.com/in/abhilashhb/

vNEX
Expert

Hi rb51,

As others mentioned, you should upgrade the bnx2x driver to the latest version listed for the Dell Customized Image (from 04 Dec 2014):

VMware ESXi 5.5 Update 2 Driver Details | Dell US

The right driver version is bnx2x - 2.710.39.v55.2

According to the current VMware async driver download list for the BCM57810:

VMware Compatibility Guide: I/O Device Search

there is also a newer build: bnx2x - 2.710.52.v55.2

    

...it's up to you which one you use, but as abhilashhb pointed out, it's a good idea to contact VMware support first.

Below is a step-by-step guide on how to install async drivers on an existing ESXi installation:
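
To see at a glance which of the two builds mentioned above is the newer one, here is a small Python sketch. It assumes the dotted fields can be compared numerically after dropping letters such as the "v" in "v55"; that is a guess at the naming scheme, not an official VMware or Dell versioning rule, and `version_key` is just an illustrative helper name.

```python
def version_key(version):
    """Turn a string such as '2.710.39.v55.2' into a tuple of ints for comparison."""
    parts = []
    for field in version.split("."):
        # Keep only the digits of each field (assumption: letters carry no ordering)
        digits = "".join(ch for ch in field if ch.isdigit())
        parts.append(int(digits) if digits else 0)
    return tuple(parts)


dell_image_build = "2.710.39.v55.2"    # version listed for the Dell Customized Image
vmware_async_build = "2.710.52.v55.2"  # version listed on the VMware async driver list

newer = max(dell_image_build, vmware_async_build, key=version_key)
print(newer)  # prints 2.710.52.v55.2
```

The same helper could be fed the version currently installed on the host to check whether either build is actually an upgrade.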

VMware KB: Installing async drivers on VMware ESXi 5.0, 5.1, and 5.5

Here is the latest firmware from Dell (7.10.18) for BCM 57810:

Broadcom NetXtreme I and II Network Device Firmware 7.10.18 Driver Details | Dell US

Message was edited by: vNEX

rb51
Enthusiast

Thank you, guys, for the replies so far; much appreciated.

I have logged a ticket with VMware support so they are aware of the issue, which may impact other customers.

I am going on holiday from tomorrow afternoon (GMT), so there is not much time for troubleshooting/debugging. Work colleagues will monitor the host, and we have decided not to upgrade the Broadcom drivers/firmware until we hear from VMware and I am back.

I hope the VMware support team can provide a few answers/clues about this issue.

regards,

rb51
