VMware Cloud Community
Enter123
Enthusiast

vSAN PSOD - emulex NIC ?

Hi,

I am reaching out to you guys because I haven't seen any progress with the VMware support investigation. We have the following environment:

        - vSAN ready nodes: DL360 gen9

          - Emulex HP FlexFabric 10Gb 2-port 556FLR-SFP+ Adapter

Recurring issue: PSOD with the message: Unrecoverable error in elxnet at partners/samples/elxnet/elxnet_main.c:elxnet_getRxPfragInfo - 2668

Support suggested a driver and firmware upgrade; we did it. They suggested updating from ESXi 6.0 U2 to U3; we did that too. But it still happens. They haven't seen it before and don't know the root cause. HPE hasn't seen it before either; they don't know the cause.

This happens in different locations: data centers in different cities, in Europe and the US. What they have in common: vSAN 6.2, ESXi 6.0 U2 and U3, the same vSAN ready nodes, and the same NIC cards.

Has anyone seen or heard about something similar?

Thanks,

Enter

9 Replies
SureshKumarMuth
Commander

If the issue persists even after updating the firmware and driver, VMware and HP should work with the Emulex driver team via the partner channel to fix the driver issue. From the PSOD message, it is clear that elxnet is causing the problem.

Did the VMware support team open a PR with the engineering team to check the issue? Please insist that they open a PR if the issue persists.

Regards,
Suresh
https://vconnectit.wordpress.com/
0 Kudos
Enter123
Enthusiast

Thanks for responding. We have had many SR cases opened. Both VMware and HPE got logs; they observed live while the issue was happening, saw all the symptoms, etc. But they cannot find the root cause. So they suggested installing a debug driver on our systems. We have done that in some clusters, but not everywhere. This has been going on for a while: the first case was opened in August last year, and still no results. A couple of times we upgraded the NIC firmware and driver and updated ESXi from 6.0 U2 to U3. Nothing helped.

That is why I am asking openly: has anyone seen or heard about a similar issue in their environment? If so, how was it solved?

0 Kudos
GreatWhiteTec
VMware Employee

Hi Enter,

I see your support case has been escalated, and VMW is working with HPE/Emulex on this, so hopefully you'll get some traction soon.

I noticed in the vSphere VCG that two driver types are listed, async and inbox. I am assuming you are using the async drivers, and I'm wondering whether GSS has instructed you to test the inbox drivers, or vice versa. Just a thought, since it looks like it may be a driver issue.

eode
Enthusiast

Hi.

We experienced the same PSOD (elxnet_getRxPfragInfo - 2668) on some ProLiant DL380 Gen9 servers (the 556FLR-SFP+ adapter being the common factor here), running the 11.1.145.0 / 11.1.183.62 combination (it's on the VMware VCG, but not the latest version). We originally started our "investigation" based on a latency issue, but the PSOD occurred at a later stage, when we put hosts in maintenance mode. As the vmkernel logs were not reporting any specific errors, we decided to upgrade the driver/firmware combination, pending verification from both VMware GSS and HPE. We have not seen the issue since upgrading the driver and firmware (this was in late December 2017).

Both VMware and HPE finished their analysis and returned with the very same action plan we had already performed (upgrade driver and firmware). HPE also recommended a specific ESXi build, but we were already on a newer version.

In short:

  • ESXi-build: 6921384 (aka 6.0 U3 + Patch 6)
  • Combination when issue was active: 11.1.145.0 / 11.1.183.62
  • Combination when issue was resolved: 11.2.1149.0 / 11.2.1226.20

So, quick questions:

  • Which version of ESXi are you running, both before & after upgrade?
  • Which version of driver (elxnet-module), both before & after upgrade?
  • Which version of firmware (before & after upgrade)?

Just for reference, the SVID:SDID on the 556FLR-SFP+-device is 103c:220a.

Regards,

Espen Ødegaard

0 Kudos
Enter123
Enthusiast

Hi Espen,

thanks for the response. We have had these PSODs on different ESXi builds (ESXi 6.0 U2 and U3) and with different NIC firmware and driver versions. None of the upgrades helped.

The latest combination we had:

     - ESXi 6 U3 build 6856897

     - NIC Firmware 11.2.1226.20

     - NIC Driver 11.2.1149.0

And we got a PSOD there as well. So we installed the debug driver, and we are waiting to see whether the PSOD recurs with it, so we can gather more info on the issue.

Regards,

Enter

0 Kudos
mcabeer
Contributor

Same problem, DL360 GEN9, same SFP+ adapter.

driver:

esxcli software vib list |grep elx

elxnet                         11.2.1149.0-1OEM.600.0.0.2768847     EMU              VMwareCertified   2018-01-23
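For scripting version checks across hosts, the driver version can be pulled out of that VIB line with awk. The sample line below is hypothetical and just mirrors the output format above; on a live host you would pipe the real `esxcli software vib list | grep elx` output instead:

```shell
# Hypothetical one-line excerpt of `esxcli software vib list | grep elx`
# (column layout assumed from the post above; replace with the real command on the host)
vib='elxnet  11.2.1149.0-1OEM.600.0.0.2768847  EMU  VMwareCertified  2018-01-23'
# The second whitespace-separated field is the driver version string
printf '%s\n' "$vib" | awk '{print $2}'
```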

Product: VMware ESXi

Version: 6.0.0

Build: Releasebuild-6765062

Update: 3

Patch: 72

IMG_20180308_233051~2.jpg

0 Kudos
eode
Enthusiast

Hi.

Not the exact same error (2682 vs 2688), but yeah, it's the elxnet_getRxPfragInfo, for sure.

Just to "verify": I'm guessing you're running the same (updated) firmware as well (the latest on the VCG)? You can check with this command (look for "Firmware Version"):

esxcli network nic get -n vmnic5 | grep Version

Note: "vmnic5" is your vmnic number. Also, grep is case-sensitive on "Version" (or you could just use grep -i version).
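A minimal sketch of extracting just the firmware version field from that command's output. The canned sample below is an assumption about the output format; on the ESXi host you would pipe the real `esxcli network nic get -n vmnicX` output instead:

```shell
# Hypothetical excerpt of `esxcli network nic get -n vmnic5` output
# (field names assumed; substitute the real command on a live ESXi host)
sample='   Driver Info:
         Driver: elxnet
         Firmware Version: 11.2.1226.20
         Version: 11.2.1149.0'
# Case-insensitive match on the field name, then take everything after ": "
printf '%s\n' "$sample" | grep -i 'firmware version' | awk -F': ' '{print $2}'
```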

Also, I notice a different ESXi build. As mentioned, I've not seen this on 6.0 U3 with Patch 6 (ESXi build 6921384) after applying the firmware and driver update. I did, however, see the issue before upgrading firmware and driver (on that very same ESXi build), and also on ESXi build 5572656 (before upgrading firmware and driver).

Curious: does this happen "out of the blue", or while triggering something? (In our case, we did see network latency, but the PSOD first occurred while putting the host into maintenance mode.)

0 Kudos
kabanossi
Enthusiast

I would start with the below:

- Update server BIOS to the latest version.

- Update SA controller firmware to the latest version.

- Install the latest hpsa driver. You can check the current hpsa driver version by running:  vmkload_mod -s hpsa | grep Version
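To illustrate the last check, here is a sketch that pulls the version field out of that command's output. The sample text (including the version string) is a hypothetical guess at the `vmkload_mod -s` output format, so treat it as illustrative only and run the real command on the host:

```shell
# Hypothetical `vmkload_mod -s hpsa` output (format and version string assumed;
# run the real command on the ESXi host instead of using this sample)
sample='vmkload_mod module information
 input file: /usr/lib/vmware/vmkmod/hpsa
 Version: 6.0.0.116-1OEM.600.0.0.2768847
 License: GPL'
# Keep the line containing "Version" and print its second field
printf '%s\n' "$sample" | grep Version | awk '{print $2}'
```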

0 Kudos
Enter123
Enthusiast

Thanks, everyone, for the responses. In the end we replaced all the Emulex NICs with Intel NICs. No issues after that. And we never received any info about the root cause.

0 Kudos