wilsonlopes00
Contributor

vSphere 5.5 and Emulex OneConnect 10Gb NIC trouble

I have installed ESXi 5.5 on a server with Emulex OneConnect 10Gb NICs.

I have installed the latest driver for this NIC: elxnet-10.0.575.9-1OEM.550.0.0.1331820.x86_64.vib.

After some network activity from the virtual machines, the interfaces go down, even though the switch ports are up.

vmnic4  0000:05:00.00 elxnet      Down 0Mbps     Half   00:00:c9:e4:13:16 9000   Emulex Corporation OneConnect 10Gb NIC

vmnic5  0000:05:00.01 elxnet      Down 0Mbps     Half   00:00:c9:e4:13:18 9000   Emulex Corporation OneConnect 10Gb NIC

Here are the logs:

2013-11-19T15:49:12.395Z cpu2:33376)WARNING: elxnet: elxnet_detectDumpUe:238: 0000:005:00.0: UE Detected!!

2013-11-19T15:49:12.396Z cpu2:33376)elxnet: elxnet_detectDumpUe:249: 0000:005:00.0: Forcing Link Down as Unrecoverable Error detected in chip/fw.

2013-11-19T15:49:12.396Z cpu2:33376)WARNING: elxnet: elxnet_detectDumpUe:257: 0000:005:00.0: UE lo: MPU bit set

2013-11-19T15:49:12.892Z cpu5:33377)WARNING: elxnet: elxnet_detectDumpUe:238: 0000:005:00.1: UE Detected!!

2013-11-19T15:49:12.892Z cpu5:33377)elxnet: elxnet_detectDumpUe:249: 0000:005:00.1: Forcing Link Down as Unrecoverable Error detected in chip/fw.

2013-11-19T15:49:12.892Z cpu5:33377)WARNING: elxnet: elxnet_detectDumpUe:257: 0000:005:00.1: UE lo: MPU bit set

Has anyone had similar trouble?

122 Replies
MartynThomas
Contributor

I reverted to the 10.2.298.5 elxnet driver on a single blade and ran it for an hour in production; it failed about 30 minutes later with the same "UE Detected" fault. Before bringing it into production I installed the OneConnect vCenter plug-in and OCE CIM provider so that I could pull the dumps from the NICs.

I've supplied the dumps to HP and VMware to see if they can see anything strange!

Considering how common the Emulex OC11xx / HP NC553 NICs are, I'm really shocked this has been dragging on so long without a proper resolution.

wilber822
Enthusiast

Hi MartynThomas,

May I know your NIC model and error message? I want to see whether it is exactly the same as mine.

https://www.zhengwu.org
MartynThomas
Contributor

Hi Wilber,

I'm using HP BL460c and BL490c G7s with the HP NC553i onboard and HP NC553m Mezzanine cards.

Each device is reported by esxcfg-info (snipped) as the following:


VMNIC0

|----Vendor Id.......................................0x19a2

|----Device Id.......................................0x0710

|----Sub-Vendor Id...................................0x103c

|----Sub-Device Id...................................0x3315

|----Vendor Name.....................................Emulex Corporation

|----Device Name.....................................HP NC553i Dual Port FlexFabric 10Gb Converged Network Adapter

|----Device Class....................................512

|----Device Class Name...............................Ethernet controller

|----VmKernel Device Name............................vmnic0

VMNIC1

|----Vendor Id.......................................0x19a2

|----Device Id.......................................0x0710

|----Sub-Vendor Id...................................0x103c

|----Sub-Device Id...................................0x3315

|----Vendor Name.....................................Emulex Corporation

|----Device Name.....................................HP NC553i Dual Port FlexFabric 10Gb Converged Network Adapter

|----Device Class....................................512

|----Device Class Name...............................Ethernet controller

|----VmKernel Device Name............................vmnic1

Reported in the HCL as:

Model: (HP NC553i) Emulex OneConnect OCe11102 10GbE NIC CNA for HP ProLiant Intel G7 BladeSystems
Device Type: Network
DID: 0710
Brand Name: HP
SVID: 103c
Number of Ports: 2
SSID: 3315
VID: 19a2


VMNIC2

|----Vendor Id.......................................0x19a2

|----Device Id.......................................0x0710

|----Sub-Vendor Id...................................0x103c

|----Sub-Device Id...................................0x3341

|----Vendor Name.....................................Emulex Corporation

|----Device Name.....................................HP NC552m Dual Port Flex-10 10Gbe BL-c Adapter

|----Device Class....................................512

|----Device Class Name...............................Ethernet controller

|----VmKernel Device Name............................vmnic2

VMNIC3

|----Vendor Id.......................................0x19a2

|----Device Id.......................................0x0710

|----Sub-Vendor Id...................................0x103c

|----Sub-Device Id...................................0x3341

|----Vendor Name.....................................Emulex Corporation

|----Device Name.....................................HP NC552m Dual Port Flex-10 10Gbe BL-c Adapter

|----Device Class....................................512

|----Device Class Name...............................Ethernet controller

|----VmKernel Device Name............................vmnic3

Model: HP NC552m
Device Type: Network
DID: 0710
Brand Name: HP
SVID: 103c
Number of Ports: 2
SSID: 3341
VID: 19a2

My cards are running firmware 10.2.340.19 and driver 10.2.298.5.

The error logged in the VMkernel log is below; it in turn causes host isolation:

2015-01-15T13:36:07.669Z cpu15:33448)WARNING: elxnet: elxnet_detectDumpUe:357: 0000:002:00.0: UE Detected!!

2015-01-15T13:36:07.669Z cpu15:33448)elxnet: elxnet_detectDumpUe:368: 0000:002:00.0: Forcing Link Down as Unrecoverable Error detected in chip/fw.

2015-01-15T13:36:07.669Z cpu15:33448)WARNING: elxnet: elxnet_detectDumpUe:385: 0000:002:00.0: UE lo: MPU bit set

2015-01-15T13:36:07.669Z cpu15:33448)WARNING: elxnet: elxnet_detectDumpUe:395: 0000:002:00.0: UE hi: PMEM bit set

2015-01-15T13:36:07.669Z cpu15:33448)WARNING: elxnet: elxnet_detectDumpUe:395: 0000:002:00.0: UE hi: NETCUnknown bit set

2015-01-15T13:36:07.932Z cpu18:33450)WARNING: elxnet: elxnet_detectDumpUe:357: 0000:002:00.1: UE Detected!!

2015-01-15T13:36:07.932Z cpu18:33450)elxnet: elxnet_detectDumpUe:368: 0000:002:00.1: Forcing Link Down as Unrecoverable Error detected in chip/fw.

2015-01-15T13:36:07.932Z cpu18:33450)WARNING: elxnet: elxnet_detectDumpUe:385: 0000:002:00.1: UE lo: MPU bit set

2015-01-15T13:36:07.932Z cpu18:33450)WARNING: elxnet: elxnet_detectDumpUe:395: 0000:002:00.1: UE hi: PMEM bit set

2015-01-15T13:36:07.932Z cpu18:33450)WARNING: elxnet: elxnet_detectDumpUe:395: 0000:002:00.1: UE hi: NETCUnknown bit set

2015-01-15T13:36:12.072Z cpu0:32852)WARNING: elxnet: elxnet_asyncWorldWait:3592: 0000:002:00.0: GetStats Checkpoint 1 (12 sec) No resp for MCC cmd opcode: 0x4, subsystem:0x3, timeout:0, req_len:4080

2015-01-15T13:36:22.309Z cpu13:33454)WARNING: elxnet: elxnet_detectDumpUe:357: 0000:006:00.1: UE Detected!!

2015-01-15T13:36:22.310Z cpu13:33454)elxnet: elxnet_detectDumpUe:368: 0000:006:00.1: Forcing Link Down as Unrecoverable Error detected in chip/fw.

2015-01-15T13:36:22.310Z cpu13:33454)WARNING: elxnet: elxnet_detectDumpUe:385: 0000:006:00.1: UE lo: MPU bit set

2015-01-15T13:36:22.310Z cpu13:33454)WARNING: elxnet: elxnet_detectDumpUe:395: 0000:006:00.1: UE hi: NETCUnknown bit set

2015-01-15T13:36:23.221Z cpu0:33452)WARNING: elxnet: elxnet_detectDumpUe:357: 0000:006:00.0: UE Detected!!

2015-01-15T13:36:23.221Z cpu0:33452)elxnet: elxnet_detectDumpUe:368: 0000:006:00.0: Forcing Link Down as Unrecoverable Error detected in chip/fw.

2015-01-15T13:36:23.221Z cpu0:33452)WARNING: elxnet: elxnet_detectDumpUe:385: 0000:006:00.0: UE lo: MPU bit set

2015-01-15T13:36:23.221Z cpu0:33452)WARNING: elxnet: elxnet_detectDumpUe:395: 0000:006:00.0: UE hi: PMEM bit set

2015-01-15T13:36:23.221Z cpu0:33452)WARNING: elxnet: elxnet_detectDumpUe:395: 0000:006:00.0: UE hi: NETCUnknown bit set

2015-01-15T13:36:24.074Z cpu0:32852)WARNING: elxnet: elxnet_asyncWorldWait:3592: 0000:002:00.0: GetStats Checkpoint 2 (24 sec) No resp for MCC cmd opcode: 0x4, subsystem:0x3, timeout:0, req_len:4080

2015-01-15T13:36:36.076Z cpu0:32852)WARNING: elxnet: elxnet_asyncWorldWait:3592: 0000:002:00.0: GetStats Checkpoint 3 (36 sec) No resp for MCC cmd opcode: 0x4, subsystem:0x3, timeout:0, req_len:4080

2015-01-15T13:36:36.076Z cpu0:32852)WARNING: elxnet: elxnet_asyncWorldWait:3611: 0000:002:00.0: GetStats MCC cmd timed out. opcode: 0x4, subsystem:0x3, timeout:0, req_len:4080

2015-01-15T13:36:36.076Z cpu0:32852)WARNING: elxnet: elxnet_generateUE:55: 0000:002:00.0: Injecting fatal error for post-mortem dump
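For anyone comparing their own logs against the ones above, here is a minimal Python sketch that groups the "UE lo" / "UE hi" bits by PCI address. The regular expressions are inferred only from the log lines quoted in this thread, and the function name `summarize_ue` is made up for illustration; it is not an Emulex or VMware tool.

```python
import re
from collections import defaultdict

# Patterns inferred from the elxnet log lines quoted in this thread, e.g.
#   ...elxnet_detectDumpUe:385: 0000:002:00.0: UE lo: MPU bit set
UE_DETECTED = re.compile(
    r"elxnet_detectDumpUe:\d+: (?P<pci>[0-9a-f:.]+): UE Detected"
)
UE_BIT = re.compile(
    r"elxnet_detectDumpUe:\d+: (?P<pci>[0-9a-f:.]+): "
    r"UE (?P<half>lo|hi): (?P<bit>\S+) bit set"
)


def summarize_ue(lines):
    """Return {pci_address: sorted list of 'lo:BIT' / 'hi:BIT' strings}."""
    bits = defaultdict(set)
    for line in lines:
        match = UE_DETECTED.search(line)
        if match:
            # Touch the key so the port is listed even without bit lines.
            bits[match.group("pci")]
            continue
        match = UE_BIT.search(line)
        if match:
            bits[match.group("pci")].add(
                f"{match.group('half')}:{match.group('bit')}"
            )
    return {pci: sorted(found) for pci, found in bits.items()}


# A few lines copied from the log above.
sample = [
    "2015-01-15T13:36:07.669Z cpu15:33448)WARNING: elxnet: elxnet_detectDumpUe:357: 0000:002:00.0: UE Detected!!",
    "2015-01-15T13:36:07.669Z cpu15:33448)WARNING: elxnet: elxnet_detectDumpUe:385: 0000:002:00.0: UE lo: MPU bit set",
    "2015-01-15T13:36:07.669Z cpu15:33448)WARNING: elxnet: elxnet_detectDumpUe:395: 0000:002:00.0: UE hi: PMEM bit set",
    "2015-01-15T13:36:07.669Z cpu15:33448)WARNING: elxnet: elxnet_detectDumpUe:395: 0000:002:00.0: UE hi: NETCUnknown bit set",
    "2015-01-15T13:36:22.309Z cpu13:33454)WARNING: elxnet: elxnet_detectDumpUe:357: 0000:006:00.1: UE Detected!!",
    "2015-01-15T13:36:22.310Z cpu13:33454)WARNING: elxnet: elxnet_detectDumpUe:385: 0000:006:00.1: UE lo: MPU bit set",
    "2015-01-15T13:36:22.310Z cpu13:33454)WARNING: elxnet: elxnet_detectDumpUe:395: 0000:006:00.1: UE hi: NETCUnknown bit set",
]

summary = summarize_ue(sample)
for pci, found in summary.items():
    print(pci, found)
```

Running the same function over a full vmkernel.log (one line per list element) makes it easier to see whether two hosts really report the same bit pattern.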

Cheers,

Martyn

wilber822
Enthusiast

We have the same NIC model and about 90% similar error logs.

Emulex asked me to install OneCapture to dump some logs when the issue was reproduced; HP then gave me a debug driver after one month.

It looks like you also have cases open with HP and VMware, is that right?

Would you be willing to share the case numbers with me so I can ask the HP and VMware BCS teams to check whether we can help each other?

https://www.zhengwu.org
MartynThomas
Contributor

I can't send PMs yet as I don't have enough points but I'm more than happy to share my SR numbers.

Cheers,

Martyn

wilber822
Enthusiast

Hi MartynThomas,

Could you check whether jumbo frames are enabled on the virtual machines on the problem host?

I cannot reproduce the issue with the beta driver, but I do see some errors:

2015-01-28T06:35:00.490Z cpu10:4273238)WARNING: elxnet: elxnet_dumpPkt:4892: P0 :: vmnic2-q0 Failure reason: "9k without TSO"

2015-01-28T06:35:00.490Z cpu10:4273238)WARNING: elxnet: elxnet_dumpPkt:4895: P0 ::  pkt_len:11241, must_tso:0x0, tso_mss:0, num_frags: 4
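A small Python sketch for pulling the fields out of the `elxnet_dumpPkt` line above. Reading "9k without TSO" as "a packet larger than roughly 9000 bytes was handed to the driver without a TSO request" is only my interpretation of the message text, and the 9000-byte default is an assumption rather than a documented driver limit.

```python
import re

# Field layout taken from the elxnet_dumpPkt line quoted above.
PKT_FIELDS = re.compile(
    r"pkt_len:(\d+), must_tso:(0x[0-9a-fA-F]+), tso_mss:(\d+), num_frags: (\d+)"
)


def parse_dump_pkt(line):
    """Return the packet fields as integers, or None if the line does not match."""
    match = PKT_FIELDS.search(line)
    if match is None:
        return None
    return {
        "pkt_len": int(match.group(1)),
        "must_tso": int(match.group(2), 16),
        "tso_mss": int(match.group(3)),
        "num_frags": int(match.group(4)),
    }


def oversized_without_tso(pkt, limit=9000):
    """True if the packet exceeds `limit` bytes and carries no TSO request.

    `limit` is an assumed value based on the "9k" wording in the log.
    """
    return pkt["pkt_len"] > limit and pkt["must_tso"] == 0 and pkt["tso_mss"] == 0


line = (
    "2015-01-28T06:35:00.490Z cpu10:4273238)WARNING: elxnet: "
    "elxnet_dumpPkt:4895: P0 ::  pkt_len:11241, must_tso:0x0, tso_mss:0, num_frags: 4"
)
pkt = parse_dump_pkt(line)
print(pkt, oversized_without_tso(pkt))
```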

https://www.zhengwu.org
MartynThomas
Contributor

I don't have jumbo frames enabled within any VMs, nor do I have jumbo frames enabled on any of my 4500x switches, VMNICs or dvswitches.

MartynThomas
Contributor

HP suggested I try driver version 10.2.445.0 from the Emulex site. Needless to say, shortly after installation I encountered the usual host isolation issue, albeit with a very slightly different error:

2015-02-03T15:06:30.634Z cpu14:33448)WARNING: elxnet: elxnet_detectDumpUe:357: 0000:002:00.0: UE Detected!!

2015-02-03T15:06:30.634Z cpu14:33448)elxnet: elxnet_detectDumpUe:368: 0000:002:00.0: Forcing Link Down as Unrecoverable Error detected in chip/fw.

2015-02-03T15:06:30.634Z cpu14:33448)WARNING: elxnet: elxnet_detectDumpUe:385: 0000:002:00.0: UE lo: MPU bit set

2015-02-03T15:06:30.634Z cpu14:33448)WARNING: elxnet: elxnet_detectDumpUe:395: 0000:002:00.0: UE hi: PMEM bit set

2015-02-03T15:06:30.634Z cpu14:33448)WARNING: elxnet: elxnet_detectDumpUe:395: 0000:002:00.0: UE hi: NETCUnknown bit set

2015-02-03T15:06:30.651Z cpu1:33450)WARNING: elxnet: elxnet_detectDumpUe:357: 0000:002:00.1: UE Detected!!

2015-02-03T15:06:30.652Z cpu1:33450)elxnet: elxnet_detectDumpUe:368: 0000:002:00.1: Forcing Link Down as Unrecoverable Error detected in chip/fw.

2015-02-03T15:06:30.652Z cpu2:34113)World: 14296: VC opID hostd-6611 maps to vmkernel opID 8e89d881

2015-02-03T15:06:30.652Z cpu1:33450)WARNING: elxnet: elxnet_detectDumpUe:385: 0000:002:00.1: UE lo: MPU bit set

2015-02-03T15:06:30.652Z cpu1:33450)WARNING: elxnet: elxnet_detectDumpUe:395: 0000:002:00.1: UE hi: PMEM bit set

2015-02-03T15:06:30.652Z cpu1:33450)WARNING: elxnet: elxnet_detectDumpUe:395: 0000:002:00.1: UE hi: NETCUnknown bit set

2015-02-03T15:06:32.078Z cpu11:32852)WARNING: elxnet: elxnet_asyncWorldWait:3586: 0000:002:00.0: GetDieTemperature Checkpoint 1 (12 sec) No resp for MCC cmd opcode: 0x79, subsystem:0x1, timeout:0, req_len:8

2015-02-03T15:06:40.427Z cpu5:39026)NetSched: 626: 0x2000004: received a force quiesce for port 0x200000c, dropped 46 pkts

2015-02-03T15:06:44.081Z cpu11:32852)WARNING: elxnet: elxnet_asyncWorldWait:3586: 0000:002:00.0: GetDieTemperature Checkpoint 2 (24 sec) No resp for MCC cmd opcode: 0x79, subsystem:0x1, timeout:0, req_len:8

2015-02-03T15:06:46.464Z cpu5:39026)NetSched: 626: 0x2000004: received a force quiesce for port 0x200000c, dropped 3 pkts

2015-02-03T15:06:56.083Z cpu11:32852)WARNING: elxnet: elxnet_asyncWorldWait:3586: 0000:002:00.0: GetDieTemperature Checkpoint 3 (36 sec) No resp for MCC cmd opcode: 0x79, subsystem:0x1, timeout:0, req_len:8

2015-02-03T15:06:56.083Z cpu11:32852)WARNING: elxnet: elxnet_asyncWorldWait:3605: 0000:002:00.0: GetDieTemperature MCC cmd timed out. opcode: 0x79, subsystem:0x1, timeout:0, req_len:8

2015-02-03T15:06:56.083Z cpu11:32852)WARNING: elxnet: elxnet_generateUE:55: 0000:002:00.0: Injecting fatal error for post-mortem dump

2015-02-03T15:06:56.286Z cpu11:32852)WARNING: elxnet: elxnet_txComplClean:3799: 0000:002:00.0: 2018 pending tx-completions

2015-02-03T15:06:56.286Z cpu11:32852)WARNING: elxnet: elxnet_rxQueuesDestroy:2289: elxnet_cmdRxqDestroy failed for 0000:002:00.0

2015-02-03T15:06:56.287Z cpu11:32852)WARNING: elxnet: elxnet_rxCQClean:2230: 0000:002:00.0 rxcq-3: did not receive flush compl

2015-02-03T15:06:56.287Z cpu11:32852)WARNING: elxnet: elxnet_rxQueuesDestroy:2289: elxnet_cmdRxqDestroy failed for 0000:002:00.0

2015-02-03T15:06:56.288Z cpu11:32852)WARNING: elxnet: elxnet_rxCQClean:2230: 0000:002:00.0 rxcq-2: did not receive flush compl

2015-02-03T15:06:56.288Z cpu11:32852)WARNING: elxnet: elxnet_rxQueuesDestroy:2289: elxnet_cmdRxqDestroy failed for 0000:002:00.0

2015-02-03T15:06:56.289Z cpu11:32852)WARNING: elxnet: elxnet_rxCQClean:2230: 0000:002:00.0 rxcq-1: did not receive flush compl

2015-02-03T15:06:56.289Z cpu11:32852)WARNING: elxnet: elxnet_rxQueuesDestroy:2289: elxnet_cmdRxqDestroy failed for 0000:002:00.0

2015-02-03T15:06:56.290Z cpu11:32852)WARNING: elxnet: elxnet_rxCQClean:2230: 0000:002:00.0 rxcq-0: did not receive flush compl

2015-02-03T15:06:56.291Z cpu11:32852)elxnet: elxnet_quiesceIO:2122: Unarming EQ

2015-02-03T15:06:56.291Z cpu11:32852)elxnet: elxnet_quiesceIO:2122: Unarming EQ

2015-02-03T15:06:56.291Z cpu11:32852)elxnet: elxnet_quiesceIO:2122: Unarming EQ

2015-02-03T15:06:56.291Z cpu11:32852)elxnet: elxnet_quiesceIO:2122: Unarming EQ

2015-02-03T15:06:56.299Z cpu11:32852)WARNING: elxnet: elxnet_wrbFromMbox:2206: 0000:002:00.0: Error in Card Detected! Cannot allocate WRB from Mail box

2015-02-03T15:06:56.299Z cpu11:32852)WARNING: elxnet: elxnet_wrbFromMbox:2206: 0000:002:00.0: Error in Card Detected! Cannot allocate WRB from Mail box

2015-02-03T15:06:56.299Z cpu11:32852)WARNING: elxnet: elxnet_wrbFromMbox:2206: 0000:002:00.0: Error in Card Detected! Cannot allocate WRB from Mail box

2015-02-03T15:06:56.299Z cpu11:32852)WARNING: elxnet: elxnet_uplinkReset:2332: 0000:002:00.0: f/w init failed

2015-02-03T15:06:56.299Z cpu11:32852)lacp: LACPDisableDVPort:4275: LACP is not enabled on portset DvsPortset-0

2015-02-03T15:06:56.299Z cpu11:32852)NetPort: 1632: disabled port 0x2000004

2015-02-03T15:06:56.300Z cpu11:32852)NetPort: 2903: resuming traffic on DV port 2022

2015-02-03T15:06:56.300Z cpu11:32852)Uplink: 6530: enabled port 0x2000004 with mac b4:99:ba:fb:f4:d0

2015-02-03T15:06:56.504Z cpu11:32852)WARNING: elxnet: elxnet_txComplClean:3799: 0000:006:00.0: 2018 pending tx-completions

2015-02-03T15:07:02.094Z cpu22:33452)WARNING: elxnet: elxnet_detectDumpUe:357: 0000:006:00.0: UE Detected!!

2015-02-03T15:07:02.094Z cpu11:32852)WARNING: elxnet: elxnet_rxQueuesDestroy:2289: elxnet_cmdRxqDestroy failed for 0000:006:00.0

2015-02-03T15:07:02.094Z cpu22:33452)elxnet: elxnet_detectDumpUe:368: 0000:006:00.0: Forcing Link Down as Unrecoverable Error detected in chip/fw.

2015-02-03T15:07:02.094Z cpu22:33452)WARNING: elxnet: elxnet_detectDumpUe:385: 0000:006:00.0: UE lo: MPU bit set

2015-02-03T15:07:02.094Z cpu22:33452)WARNING: elxnet: elxnet_detectDumpUe:395: 0000:006:00.0: UE hi: PMEM bit set

2015-02-03T15:07:02.094Z cpu22:33452)WARNING: elxnet: elxnet_detectDumpUe:395: 0000:006:00.0: UE hi: NETCUnknown bit set

2015-02-03T15:07:02.095Z cpu11:32852)WARNING: elxnet: elxnet_rxCQClean:2230: 0000:006:00.0 rxcq-3: did not receive flush compl

2015-02-03T15:07:02.095Z cpu11:32852)WARNING: elxnet: elxnet_rxQueuesDestroy:2289: elxnet_cmdRxqDestroy failed for 0000:006:00.0

2015-02-03T15:07:02.096Z cpu11:32852)WARNING: elxnet: elxnet_rxCQClean:2230: 0000:006:00.0 rxcq-2: did not receive flush compl

2015-02-03T15:07:02.096Z cpu11:32852)WARNING: elxnet: elxnet_rxQueuesDestroy:2289: elxnet_cmdRxqDestroy failed for 0000:006:00.0

2015-02-03T15:07:02.097Z cpu11:32852)WARNING: elxnet: elxnet_rxCQClean:2230: 0000:006:00.0 rxcq-1: did not receive flush compl

2015-02-03T15:07:02.097Z cpu11:32852)WARNING: elxnet: elxnet_rxQueuesDestroy:2289: elxnet_cmdRxqDestroy failed for 0000:006:00.0

2015-02-03T15:07:02.098Z cpu11:32852)WARNING: elxnet: elxnet_rxCQClean:2230: 0000:006:00.0 rxcq-0: did not receive flush compl

2015-02-03T15:07:02.098Z cpu11:32852)elxnet: elxnet_quiesceIO:2122: Unarming EQ

2015-02-03T15:07:02.098Z cpu11:32852)elxnet: elxnet_quiesceIO:2122: Unarming EQ

2015-02-03T15:07:02.098Z cpu11:32852)elxnet: elxnet_quiesceIO:2122: Unarming EQ

2015-02-03T15:07:02.098Z cpu11:32852)elxnet: elxnet_quiesceIO:2122: Unarming EQ

2015-02-03T15:07:02.106Z cpu11:32852)WARNING: elxnet: elxnet_wrbFromMbox:2206: 0000:006:00.0: Error in Card Detected! Cannot allocate WRB from Mail box

2015-02-03T15:07:02.106Z cpu11:32852)WARNING: elxnet: elxnet_wrbFromMbox:2206: 0000:006:00.0: Error in Card Detected! Cannot allocate WRB from Mail box

2015-02-03T15:07:02.106Z cpu11:32852)WARNING: elxnet: elxnet_wrbFromMbox:2206: 0000:006:00.0: Error in Card Detected! Cannot allocate WRB from Mail box

2015-02-03T15:07:02.106Z cpu11:32852)WARNING: elxnet: elxnet_uplinkReset:2332: 0000:006:00.0: f/w init failed

2015-02-03T15:07:02.188Z cpu2:33454)WARNING: elxnet: elxnet_detectDumpUe:357: 0000:006:00.1: UE Detected!!

2015-02-03T15:07:02.188Z cpu2:33454)elxnet: elxnet_detectDumpUe:368: 0000:006:00.1: Forcing Link Down as Unrecoverable Error detected in chip/fw.

2015-02-03T15:07:02.188Z cpu2:33454)WARNING: elxnet: elxnet_detectDumpUe:385: 0000:006:00.1: UE lo: MPU bit set

2015-02-03T15:07:02.188Z cpu2:33454)WARNING: elxnet: elxnet_detectDumpUe:395: 0000:006:00.1: UE hi: PMEM bit set

2015-02-03T15:07:02.188Z cpu2:33454)WARNING: elxnet: elxnet_detectDumpUe:395: 0000:006:00.1: UE hi: NETCUnknown bit set

2015-02-03T15:07:02.188Z cpu6:39030)World: 14296: VC opID hostd-7e68 maps to vmkernel opID 65326fa

2015-02-03T15:07:14.108Z cpu11:32852)WARNING: elxnet: elxnet_asyncWorldWait:3586: 0000:006:00.1: GetDieTemperature Checkpoint 1 (12 sec) No resp for MCC cmd opcode: 0x79, subsystem:0x1, timeout:0, req_len:8

2015-02-03T15:07:20.269Z cpu15:35856)World: 14296: VC opID hostd-c4a9 maps to vmkernel opID 1750f6e4

2015-02-03T15:07:24.655Z cpu14:39945)World: 14296: VC opID hostd-afb3 maps to vmkernel opID 17159c6b

2015-02-03T15:07:26.109Z cpu2:32852)WARNING: elxnet: elxnet_asyncWorldWait:3586: 0000:006:00.1: GetDieTemperature Checkpoint 2 (24 sec) No resp for MCC cmd opcode: 0x79, subsystem:0x1, timeout:0, req_len:8

2015-02-03T15:07:38.110Z cpu2:32852)WARNING: elxnet: elxnet_asyncWorldWait:3586: 0000:006:00.1: GetDieTemperature Checkpoint 3 (36 sec) No resp for MCC cmd opcode: 0x79, subsystem:0x1, timeout:0, req_len:8

2015-02-03T15:07:38.110Z cpu2:32852)WARNING: elxnet: elxnet_asyncWorldWait:3605: 0000:006:00.1: GetDieTemperature MCC cmd timed out. opcode: 0x79, subsystem:0x1, timeout:0, req_len:8

2015-02-03T15:07:38.110Z cpu2:32852)WARNING: elxnet: elxnet_generateUE:55: 0000:006:00.1: Injecting fatal error for post-mortem dump

Wilber, any chance you could share the debug driver you have? I can crash my host *almost* on demand so it would be interesting to see what it reveals.

Cheers,

Martyn

wilber822
Enthusiast

I sent you a PM

https://www.zhengwu.org
MartynThomas
Contributor

Thanks, I've dropped you a mail with my contact details :)

Cheers,

Martyn

MartynThomas
Contributor

Just pushed the latest Emulex firmware (10.2.470.14) to one of my test hosts and within about 15 mins it's failed again. This isn't looking promising!

new_firmware_fail.JPG

wilber822
Enthusiast

Hi Martyn,

I've sent you an email.

https://www.zhengwu.org
MartynThomas
Contributor

Just to keep everyone else in the loop, I've installed and tested the development driver (10.2.261.6251) and the host has remained stable so far.

However I am seeing the following logged in the VMKernel.log, the same as Wilber822:

tso_error.JPG

TSO (TCP Segmentation Offload) is enabled by default in ESXi if a supported NIC is used; this can be confirmed by running the following:

esxcli system settings advanced list -o /Net/UseHwTSO


hw_tso.JPG


If the above returns 1, hardware TSO is enabled; 0 means it is disabled.
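If you want to check this on several hosts, the captured output can be parsed with a few lines of Python. This is a sketch only: the sample text is my assumption of what the command typically prints (a list of "Key: value" lines including "Int Value"), and `hw_tso_enabled` is a made-up helper name.

```python
def hw_tso_enabled(esxcli_output):
    """Return True if the 'Int Value' field in the captured output is 1."""
    for line in esxcli_output.splitlines():
        key, _, value = line.strip().partition(":")
        # Exact match, so 'Default Int Value' is not picked up by mistake.
        if key == "Int Value":
            return int(value.strip()) == 1
    raise ValueError("no 'Int Value' field found in the output")


# Assumed shape of the command output; replace with text captured from a host.
sample_output = """\
   Path: /Net/UseHwTSO
   Type: integer
   Int Value: 1
   Default Int Value: 1
   Min Value: 0
   Max Value: 1
"""

print("HW TSO enabled:", hw_tso_enabled(sample_output))
```

Matching on the exact key keeps the current setting separate from the default, which matters once someone has changed the value on a host.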

I'm tempted to turn off TSO to see if the errors are still logged.

Cheers,

Martyn

MartynThomas
Contributor

Disabling TSO makes no difference; errors relating to '9k without TSO' are still logged in the VMkernel log.

Cheers,

Martyn

markzz
Enthusiast

Hey guys and gals

I'm somewhat hesitant to chime in here as I've not seen precisely the issue you have.

We do have the 554FLB in our Gen8 Blades.

When this environment was first commissioned, we experienced an issue where we could not see all 8 available paths in OneView or via the OA.

Without giving a very lengthy explanation the solution was quite obscure yet simple.

The 554FLB was installed with a firmware version of something like 4.9.006 (if I recall).

When we saw this odd behaviour we updated the firmware to 4.10.xxx, which did not resolve the issue.

After about 2 weeks on this, an HP engineer who was also working on it found that if he downgraded the firmware to a much earlier version, rebooted, then upgraded to version 4.9.416, the issue was resolved.

He was correct.

I'm sorry to say I cannot confirm the firmware version the 554FLB was supplied with or the version we downgraded to, but I can confirm we are still running firmware version 4.9.416 on the 554FLB and our 552SFPs.

These have remained stable with this version.

Oh, another tidbit: HP made reference to the HP VMware FW and Software Recipe. This is apparently a list of proven firmware/software versions that HP engineers follow when on site performing customer installs. It does appear to have some credibility, in that people who do the job have formulated the list (rather than someone who sits on the phone and is not sure what a blade actually looks like).

I've found this URL: http://vibsdepot.hp.com/hpq/recipes/HP-VMware-Recipe.pdf

MartynThomas
Contributor

Hi Mark,

Thanks for taking the time to respond.

My problem is that I have tried all of the 'supported' and 'unsupported' combinations from Emulex, HP and VMware, but I'm still without a stable platform.

Have a look at post number 36 on this thread :)

Cheers,

Martyn

Bleeder
Hot Shot

I wonder why the HP VMware FW and Software Recipe hasn't been updated with the 4.9.416.4 firmware in place of 4.9.416.2.

http://h20564.www2.hp.com/hpsc/doc/public/display?docId=emr_na-c04326096

MartynThomas
Contributor

Just to keep everyone updated, my test host has been rock solid on version 10.2.261.6251 of the elxnet driver. I rolled the driver back to 10.2.298.5 this morning and, again, the host failed within 15 minutes.

This is definitely a badly written firmware/driver issue!

wilber822
Enthusiast

Hi Martyn,

You are correct. HP has confirmed that it's a problem in the driver. I'm glad to know the driver fixed the problem.

I have given feedback to HP. Hopefully they will release a new driver soon.

In my environment the issue is hard to reproduce even after reverting to the original driver, but I have confirmed with VMware and HP that your case is similar to mine.

Thanks a lot for your help.

https://www.zhengwu.org
PeteSu
Contributor

We've been having issues with our Emulex NC553i (OCe11102) based NICs and ESXi 5.5. Since our datastores are mounted via NFS, we've been getting NFS APDs lasting long enough to cause VMs to go offline. We were having this issue with HP's published recipe of firmware and driver combo for 5.5 U1 and U2. The esxcli network nic stats output shows tons of receive packet drops on the vmnics. What's interesting is that in the same host with both Broadcom and Emulex based NICs, only the Emulex vmnics recorded any packet drops.
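To watch whether the receive drops are still growing, two captures of the per-NIC statistics can be compared. The Python sketch below assumes the output of `esxcli network nic stats get -n <vmnic>` is a list of "Name: number" lines that includes "Receive packets dropped"; the sample text and its numbers are invented for illustration and are not from our hosts.

```python
def parse_nic_stats(text):
    """Parse 'Name: number' lines into a dict, skipping anything else."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        value = value.strip()
        if sep and value.isdigit():
            stats[key] = int(value)
    return stats


def rx_drop_delta(before_text, after_text):
    """Receive drops added between two captures of the same vmnic."""
    before = parse_nic_stats(before_text)
    after = parse_nic_stats(after_text)
    return after["Receive packets dropped"] - before["Receive packets dropped"]


# Invented sample captures in the assumed output format.
capture_before = """\
NIC statistics for vmnic4
   Packets received: 1000000
   Packets sent: 900000
   Receive packets dropped: 1200
   Transmit packets dropped: 0
"""
capture_after = """\
NIC statistics for vmnic4
   Packets received: 1050000
   Packets sent: 940000
   Receive packets dropped: 1550
   Transmit packets dropped: 0
"""

print("new receive drops:", rx_drop_delta(capture_before, capture_after))
```

Comparing deltas rather than absolute counters avoids being misled by drops that accumulated before a driver or firmware change.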

Anyway, here are a few links of interest to anyone suffering from similar issues.

Packet drops and connectivity issues when using Emulex elxnet Driver version 10.2.298.5 or earlier on OCe10102 and OCe11102 adapters or OEM equivalents

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=209119...

HP revised their advisory as well - Advisory: (Revision) HP NC550x and NC551x Network Adapters - Only 4 of 8 Flex Ports May Be Functional When Using the Flex NIC Option in Virtual Connect

http://h20564.www2.hp.com/hpsc/doc/public/display?docId=emr_na-c04321064

Our symptoms differ from what's published in the advisory (but match the VMware KB perfectly). HP informed us that we should downgrade both firmware and driver, and switch from native to legacy mode (due to the downrev driver). HP just published the February recipe, but it still had the same U2 driver and firmware combo for Emulex. Hopefully they are working on certifying the new Emulex drivers that are supposed to fix these issues.

We are on driver 4.9.288.0 and firmware 4.9.416.0 now.  The packet drops have stopped, but we are still monitoring for issues.  Hopefully they do not return...