VMware Cloud Community
STI69
Contributor

SCSI Device Sense Codes: ScsiDeviceIO (...) failed & nmp_ThrottleLogForDevice messages in vmkernel.log for all Dell EMC Unity 650F VMFS volumes

An excerpt from my vmkernel.log

We had storage incidents on a Unity 650F: controller crashes caused by two unresolved bugs, a read-cache merge problem and a back-end driver issue.

Dell is working on a custom fix for both issues. In the meantime, we were asked to mitigate the controller auto-resets and to upgrade to OE 5.0.3, which they agree will not resolve the current controller resets. After complaints from our side, they dug into every component of our infrastructure to reduce the impact on their storage. They found the following SCSI sense codes in the vmkernel log and referred us to host support to suppress 'illegal SCSI commands'.

According to them, these need to be addressed because they contribute to the crashes of a controller node: seen from the storage side, the hosts are attempting target resets.

2020-08-17T01:49:15.677Z cpu108:65805)ScsiDeviceIO: 3015: Cmd(0x439e4341dfc0) 0xfe, CmdSN 0xbbc2e5 from world 65687 to dev "naa.60060160e8004b00e2ca985c1400127d" failed H:0x5 D:0x0 P:0x0 Invalid sense data: 0x0 0x0 0x0.

2020-08-17T01:49:15.677Z cpu65:3876714)NMP: nmp_ThrottleLogForDevice:3630: Cmd 0xf1 (0x439e4368a5c0, 65687) to dev "naa.60060160e8004b007e9d9a5cf732ff8e" on path "vmhba2:C0:T11:L47" Failed: H:0x8 D:0x0 P:0x0 Invalid sense data: 0x0 0x0 0x0. Act:EVAL

2020-08-17T01:49:15.677Z cpu65:3876714)ScsiDeviceIO: 3015: Cmd(0x439e43720dc0) 0xfe, CmdSN 0x719b30 from world 65687 to dev "naa.60060160e8004b007e9d9a5cf732ff8e" failed H:0x8 D:0x0 P:0x0 Invalid sense data: 0x80 0x41 0x0.

2020-08-17T01:49:15.677Z cpu65:3876714)NMP: nmp_ThrottleLogForDevice:3630: Cmd 0xf1 (0x439e435e02c0, 65687) to dev "naa.60060160e8004b0058fbdc5d4a63c4ba" on path "vmhba3:C0:T10:L102" Failed: H:0x8 D:0x0 P:0x0 Invalid sense data: 0x0 0x0 0x0. Act:EVAL

2020-08-17T01:49:15.677Z cpu65:3876714)ScsiDeviceIO: 3015: Cmd(0x439e43502940) 0xfe, CmdSN 0x382ef0 from world 65687 to dev "naa.60060160e8004b0058fbdc5d4a63c4ba" failed H:0x8 D:0x0 P:0x0 Invalid sense data: 0x80 0x41 0x0.

2020-08-17T01:49:15.877Z cpu65:3876714)NMP: nmp_ThrottleLogForDevice:3630: Cmd 0xf1 (0x439e435079c0, 65687) to dev "naa.60060160e8004b00de1a995cd70e3c6a" on path "vmhba3:C0:T12:L30" Failed: H:0x8 D:0x0 P:0x0 Invalid sense data: 0x0 0x0 0x0. Act:EVAL

I entered those values into

https://www.virten.net/vmware/esxi-scsi-sense-code-decoder/?host=8&device=0&plugin=0&sensekey=80&asc...=

So according to this, the HBA resets the target, which in this case is a Dell EMC Unity FC port.

And now what?
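The decoding step above can also be done offline. The following is a minimal sketch that pulls the H/D/P status codes out of vmkernel.log lines like the ones quoted earlier and counts them per device. The host status names are written from memory of VMware's KB article on SCSI host-side errors, so treat that table as an assumption and verify it; the function names are made up for illustration.

```python
import re
from collections import Counter

# Host (H:) status names as recalled from VMware's KB on SCSI host-side
# errors. This table is an assumption; verify it before relying on it.
HOST_STATUS = {
    0x0: "OK",
    0x1: "NO_CONNECT",
    0x2: "BUS_BUSY",
    0x3: "TIMEOUT",
    0x5: "ABORT",
    0x7: "ERROR",
    0x8: "RESET",
}

# Matches both ScsiDeviceIO and nmp_ThrottleLogForDevice lines.
LINE_RE = re.compile(
    r'dev "(?P<dev>naa\.[0-9a-f]+)"'
    r'.*?H:(?P<h>0x[0-9a-f]+) D:(?P<d>0x[0-9a-f]+) P:(?P<p>0x[0-9a-f]+)',
    re.IGNORECASE,
)

def parse_line(line):
    """Return the device and status codes of one log line, or None."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    host = int(m.group("h"), 16)
    return {
        "device": m.group("dev"),
        "host": host,
        "device_status": int(m.group("d"), 16),
        "plugin": int(m.group("p"), 16),
        "host_name": HOST_STATUS.get(host, "UNKNOWN"),
    }

def summarize(lines):
    """Count (device, host status name) pairs over many log lines."""
    counts = Counter()
    for line in lines:
        rec = parse_line(line)
        if rec is not None:
            counts[(rec["device"], rec["host_name"])] += 1
    return counts

# Two lines copied from the log excerpt in this post.
SAMPLE = [
    '2020-08-17T01:49:15.677Z cpu108:65805)ScsiDeviceIO: 3015: Cmd(0x439e4341dfc0) 0xfe, CmdSN 0xbbc2e5 from world 65687 to dev "naa.60060160e8004b00e2ca985c1400127d" failed H:0x5 D:0x0 P:0x0 Invalid sense data: 0x0 0x0 0x0.',
    '2020-08-17T01:49:15.677Z cpu65:3876714)NMP: nmp_ThrottleLogForDevice:3630: Cmd 0xf1 (0x439e4368a5c0, 65687) to dev "naa.60060160e8004b007e9d9a5cf732ff8e" on path "vmhba2:C0:T11:L47" Failed: H:0x8 D:0x0 P:0x0 Invalid sense data: 0x0 0x0 0x0. Act:EVAL',
]

for (device, status), count in summarize(SAMPLE).items():
    print(device, status, count)
```

Run against a full vmkernel.log, the same counter would show which devices and which host status values dominate, which may help when discussing the findings with storage support.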

dev "naa.6006" is DELL EMC Storage, in my case Dell EMC Unity 650f running OE 4.5.1 (UWDC01) & OE 5.0.3 (UWDC02)

Current Dell EMC Unity Target code is OE 5.0.3

[root@esx070:~] esxcfg-scsidevs -m | grep "naa.60060160de004b005f5d2a5fbbcad438"

naa.60060160de004b005f5d2a5fbbcad438:1                                     /vmfs/devices/disks/naa.60060160de004b005f5d2a5fbbcad438:1                                     5f2a5df1-f99059c6-eed8-20040ff4978e  0  UWDC01_IT-PROD-WDC_V005

[root@esx070:~] esxcfg-scsidevs -m | grep "naa.60060160e8004b00e595255ec02cf074"

naa.60060160e8004b00e595255ec02cf074:1                                     /vmfs/devices/disks/naa.60060160e8004b00e595255ec02cf074:1                                     5e2596ab-2bec8188-f141-20040ff4978e  0  UWDC02_IT-PROD-WDC_V103

[root@esx070:~] vmkchdev -l | grep vmhba

0000:00:11.5 8086:a1d2 1734:1230 vmkernel vmhba0

0000:00:17.0 8086:a182 1734:1230 vmkernel vmhba1

0000:17:00.0 1077:2261 1077:029b vmkernel vmhba2 ----------------> FC HBA

0000:6d:00.0 1077:2261 1077:029b vmkernel vmhba3 ----------------> FC HBA

[root@esx070:~] /usr/lib/vmware/vmkmgmt_keyval/vmkmgmt_keyval -d

Dumping all key-value instance names:

Key Value Instance:  vmhba3/qlogic

Key Value Instance:  vmhba2/qlogic

Key Value Instance:  vmhba1/vmw_ahci

Key Value Instance:  vmhba0/vmw_ahci

Key Value Instance:  MOD_PARM/qlogic

[root@esx070:~] /usr/lib/vmware/vmkmgmt_keyval/vmkmgmt_keyval -l -i vmhba2/qlogic
Listing keys:
Name:   ADAPTER
Type:   string
value:
QLogic 16Gb 1-port FC to PCIe Gen3 x8 Adapter for QLE2690:
        FC Firmware Version: 8.05.61 (d0d5), Driver version 2.1.73.0

Host Device Name vmhba2

BIOS version 3.61
FCODE version 4.11
EFI version 6.11
Flash FW version 8.05.61
ISP: ISP2261, Serial# RFD1722T35676
MSI-X enabled
Request Queue = 0x4309f6548000, Response Queue = 0x4309f6569000
Request Queue count = 2048, Response Queue count = 512
Number of response queues for CPU affinity operation: 4
CPU Affinity mode enabled
Total number of MSI-X interrupts on vector 0 (handler = 23) = 26676
Total number of MSI-X interrupts on vector 1 (handler = 24) = 2186
Total number of MSI-X interrupts on vector 2 (handler = 25) = 1090148738
Total number of MSI-X interrupts on vector 3 (handler = 26) = 583007145
Total number of MSI-X interrupts on vector 4 (handler = 27) = 2055128386
Total number of MSI-X interrupts on vector 5 (handler = 28) = 1406005796
Device queue depth = 0x8
Number of free request entries = 1271
FAWWN support: disabled
FEC support: Disabled
Total number of outstanding commands: 0
Number of mailbox timeouts = 0
Number of ISP aborts = 0
Number of loop resyncs = 29
Host adapter:Loop State = [READY], flags = 0x20ae200
Link speed = [16 Gbps]

Dpc flags = 0x0
Link down Timeout =  010
Port down retry =  010
Login retry count =  010
Execution throttle = 2048
ZIO mode = 0x6, ZIO timer = 1
Commands retried with dropped frame(s) = 297

Product ID = 4953 5020 2261 0001

NPIV Supported : Yes
Max Virtual Ports = 254

SCSI Device Information:
scsi-qla0-adapter-node=20000024ff149042:160a00:0;
scsi-qla0-adapter-port=21000024ff149042:160a00:0;

Name:   TARGET
Type:   string
value:
Driver version 2.1.73.0

Host Device Name vmhba2

FC Target-Port List:
scsi-qla0-target-0=500000e0da81df29:122300:0:Online;
scsi-qla0-target-1=500000e0da81df39:142300:1:Online;
scsi-qla0-target-2=5006016249e4121e:140000:2:Online;
scsi-qla0-target-3=5006016349e0121e:120000:3:Online;
scsi-qla0-target-4=5006016849e4121e:140100:4:Online;
scsi-qla0-target-5=5006016a49e4121e:120200:5:Online;
scsi-qla0-target-6=5006016249e415ff:0e0000:6:Online;
scsi-qla0-target-7=5006016349e015ff:100000:7:Online;
scsi-qla0-target-8=5006016849e415ff:100100:8:Online;
scsi-qla0-target-9=5006016a49e415ff:0e0100:9:Online;
scsi-qla0-target-10=5006016249e41688:0e0500:a:Online;
scsi-qla0-target-11=5006016349e01688:100200:b:Online;
scsi-qla0-target-12=5006016849e41688:100300:c:Online;
scsi-qla0-target-13=5006016a49e41688:0e0300:d:Online;

Name:   NPIV
Type:   string
value:
Driver version 2.1.73.0

Host Device Name vmhba2

NPIV Supported : Yes

Looking at the QLogic site (Marvell nowadays) for the QLE2690, we are one version behind the latest:

QLogic / Marvell Driver Download

README

                   Read1st for Cavium Flash Image Package

                     --------------------------------------

                   **** ONLY FOR 268x/269x/27xx Series Adapters ****

1. Contents Of Flash Package

--------------------------------

The files contained in this Flash image package are zipped into a file that

will expand to provide the following versions for the 268x/269x/276x Series Adapters.

*  Flash Image Version 01.01.91

   BK010191.BIN contains:

   ----------------------

     Bootcode FC

       FC BIOS       v3.62

       FC FCode      v4.11  (Initiator)

       FC FCode      v4.10  (Target)

       FC EFI        v7.00  (Signed)

     FC Firmware   v8.08.231

     MPI Firmware  v1.00.19

     PEP Firmware(Quad-port)        v1.0.27

     PEP Firmware(Single/Dual port) v2.0.12

     PEP SoftROM(Quad port)         v1.0.16

     PEP SoftROM(Single/Dual port)  v2.0.11

     EFlash tool  v1.18

nachogonzalez
Commander

Some quick questions:

- Is your firmware up to date?
- Have you tried upgrading all drivers?
- Is it possible that there is an issue on an SFP?
- Does this issue spread across multiple hosts?

Would it be possible to disable ATS heartbeating? See: Disabling ATS Heartbeat - Huawei SAN Storage Host Connectivity Guide for VMware ESXi

Let me know if you found this helpful

STI69
Contributor

SAN

SANSW23:xxxx> porterrshow 4
           frames      enc    crc    crc    too    too    bad    enc   disc   link   loss   loss   frjt   fbsy  c3timeout    pcs    uncor
        tx     rx      in    err    g_eof  shrt   long   eof     out   c3    fail    sync   sig                  tx    rx     err    err
  4:    3.9g   2.2g   0      0      0      0      0      0      0      8      0      0      0      0      0      0      0      0      0

SANSW22:xxxx> porterrshow 10
           frames      enc    crc    crc    too    too    bad    enc   disc   link   loss   loss   frjt   fbsy  c3timeout    pcs    uncor
        tx     rx      in    err    g_eof  shrt   long   eof     out   c3    fail    sync   sig                  tx    rx     err    err
10:    3.9g   2.1g   0      0      0      0      0      0      0     16      0      0      0      0      0      0      0      0      0

Very few errors. C3 discards are frames that were queued for the destination, then expired and were dropped.

Likely cause: the buffer credits were exhausted, which is a flow-control issue in the fabric. This may be caused by HBA speed mismatches on the path from esx070 to the Unity 650F SP port, although the esx070 link speed of 16 Gbit/s matches that of the Unity 650F (16 Gbit/s as well).

Given the high tx/rx frame counts, this is a very low figure.
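To put a number on "very low", here is a rough sketch that reads one porterrshow data row and divides the C3 discards by the total frame count. It assumes the column order shown in the header above (port, frames tx, frames rx, then the error counters, with "disc c3" as the tenth counter column after the port) and that the k/m/g suffixes are decimal multipliers; both are assumptions, and the function names are illustrative only.

```python
# Assumed decimal multipliers for the abbreviated porterrshow counters.
SUFFIX = {"k": 1e3, "m": 1e6, "g": 1e9}

def parse_count(text):
    """Convert a counter such as '3.9g' or '8' to a float."""
    text = text.strip().lower()
    if text and text[-1] in SUFFIX:
        return float(text[:-1]) * SUFFIX[text[-1]]
    return float(text)

def c3_discard_ratio(row):
    """Return disc_c3 / (frames tx + frames rx) for one porterrshow row."""
    # Assumed layout: port, tx, rx, enc in, crc err, crc g_eof,
    # too shrt, too long, bad eof, enc out, disc c3, ...
    fields = row.replace(":", " ").split()
    frames = parse_count(fields[1]) + parse_count(fields[2])
    disc_c3 = parse_count(fields[10])
    return disc_c3 / frames

# Data rows copied from the two switch outputs above.
ROW_PORT4 = "  4:    3.9g   2.2g   0      0      0      0      0      0      0      8      0      0      0      0      0      0      0      0      0"
ROW_PORT10 = " 10:    3.9g   2.1g   0      0      0      0      0      0      0     16      0      0      0      0      0      0      0      0      0"

for name, row in (("SANSW23 port 4", ROW_PORT4), ("SANSW22 port 10", ROW_PORT10)):
    print(f"{name}: {c3_discard_ratio(row):.2e} discards per frame")
```

With the counters above, this works out to roughly one discard per several hundred million frames on each port. Because the counters are cumulative since the last reset, the ratio says nothing about whether the discards are clustered in the nightly backup window.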

We do indeed see a lot of

"Lost access to volume xxxx due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly"

vRealize Log Insight regularly reports these messages on many hosts!

Masses of them are sent during the night, when the Veeam backup proxies attach the VMFS volumes for backup transfers.

I presume that when the VMFS is unmounted, some ESXi hosts (some more than others) report that the temporary device naa.xxxxxx is inaccessible.

When I look up the VMFS by issuing esxcfg-scsidevs -m | grep naa.xxxxx

these devices no longer exist once they have been declared inaccessible. In fact, vSphere recognises these volumes as snap<hex value (?)>-<VMFS label>
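The manual lookup described above can be scripted. This is a small sketch, assuming you have saved the vmkernel.log lines and the output of esxcfg-scsidevs -m to text; it lists the naa IDs that appear in the log but have no VMFS mapping at the time of the lookup. The helper names are hypothetical, and the five-column layout is taken from the esxcfg-scsidevs -m output shown earlier in this thread.

```python
import re

NAA_RE = re.compile(r"naa\.[0-9a-f]+")

def devices_in_log(log_lines):
    """Collect every naa ID mentioned in the given log lines."""
    found = set()
    for line in log_lines:
        found.update(NAA_RE.findall(line))
    return found

def mounted_volumes(scsidevs_output):
    """Map naa ID -> VMFS label from saved `esxcfg-scsidevs -m` output."""
    volumes = {}
    for line in scsidevs_output.splitlines():
        fields = line.split()
        # Expected columns: device:partition, device path, UUID, index, label
        if len(fields) < 5 or not fields[0].startswith("naa."):
            continue
        naa = fields[0].split(":")[0]
        volumes[naa] = " ".join(fields[4:])
    return volumes

def unmounted(log_lines, scsidevs_output):
    """Return naa IDs seen in the log that have no VMFS mapping."""
    return sorted(devices_in_log(log_lines) - set(mounted_volumes(scsidevs_output)))

# Sample data copied from this thread.
SCSIDEVS = (
    "naa.60060160de004b005f5d2a5fbbcad438:1  "
    "/vmfs/devices/disks/naa.60060160de004b005f5d2a5fbbcad438:1  "
    "5f2a5df1-f99059c6-eed8-20040ff4978e  0  UWDC01_IT-PROD-WDC_V005\n"
)
LOG = [
    'to dev "naa.60060160de004b005f5d2a5fbbcad438" failed H:0x8 D:0x0 P:0x0',
    'to dev "naa.60060160e8004b00e2ca985c1400127d" failed H:0x5 D:0x0 P:0x0',
]

print(unmounted(LOG, SCSIDEVS))
```

A device listed here is not necessarily a detached backup snapshot; it only means the host had no VMFS mapping for it when the esxcfg-scsidevs output was captured.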

[root@esx070:~] vmware -v
VMware ESXi 6.5.0 build-15256549

Imageprofile ESXi-6.5.0-20191204001-standard

https://esxi-patches.v-front.de/ESXi-6.5.0.html

According to the Marvell site, our firmware is one version behind the latest.

I see an important Emulex (lpfc) update in a build newer than ours.

However, we have a QLE2690.

2020-07-30

Imageprofile ESXi-6.5.0-20200704001-standard (Build 16576891)

lpfc | 11.4.33.26-14vmw.650.3.138.16576891 | VMW | Updates the ESX 6.5.0 lpfc | bugfix | important | ESXi650-202007403-BG

The last image profile that updated the QLogic driver, which is older than our build:

2019-07-02 (Update 3)

Imageprofile ESXi-6.5.0-20190702001-standard (Build 13932383) includes the following updated VIBs:

The relevant entry from the full list below:

qlnativefc | 2.1.73.0-5vmw.650.3.96.13932383 | VMW | Updates the ESX 6.5.0 qlnativefc | bugfix | important | ESXi650-201907205-UG

We have this exact driver!

Name | Version | Vendor | Summary | Category | Severity | Bulletin
bnxtnet | 20.6.101.7-23vmw.650.3.96.13932383 | VMW | Updates the ESX 6.5.0 bnxtnet | enhancement | important | ESXi650-201907216-UG
brcmfcoe | 11.4.1078.25-14vmw.650.3.96.13932383 | VMW | Updates the ESX 6.5.0 brcmfcoe | bugfix | important | ESXi650-201907218-UG
esx-base | 6.5.0-3.96.13932383 | VMware | Updates the ESX 6.5.0 esx-base | bugfix | critical | ESXi650-201907201-UG
esx-tboot | 6.5.0-3.96.13932383 | VMware | Updates the ESX 6.5.0 esx-tboot | bugfix | critical | ESXi650-201907201-UG
esx-ui | 1.33.4-13786312 | VMware | VMware Host Client | security | important | ESXi650-201907103-SG
i40en | 1.8.1.9-2vmw.650.3.96.13932383 | VMW | Updates the ESX 6.5.0 i40en | enhancement | important | ESXi650-201907214-UG
igbn | 0.1.1.0-4vmw.650.3.96.13932383 | VMW | Updates the ESX 6.5.0 igbn | bugfix | important | ESXi650-201907206-UG
ixgben | 1.7.1.15-1vmw.650.3.96.13932383 | VMW | Updates the ESX 6.5.0 ixgben | enhancement | important | ESXi650-201907204-UG
lpfc | 11.4.33.25-14vmw.650.3.96.13932383 | VMW | Updates the ESX 6.5.0 lpfc | bugfix | important | ESXi650-201907217-UG
lsi-mr3 | 7.708.07.00-3vmw.650.3.96.13932383 | VMW | Updates the ESX 6.5.0 lsi-mr3 | enhancement | important | ESXi650-201907209-UG
lsi-msgpt2 | 20.00.06.00-2vmw.650.3.96.13932383 | VMW | Updates the ESX 6.5.0 lsi-msgpt2 | bugfix | moderate | ESXi650-201907212-UG
lsi-msgpt3 | 17.00.02.00-1vmw.650.3.96.13932383 | VMW | Updates the ESX 6.5.0 lsi-msgpt3 | bugfix | important | ESXi650-201907210-UG
lsi-msgpt35 | 09.00.00.00-5vmw.650.3.96.13932383 | VMW | Updates the ESX 6.5.0 lsi-msgpt35 | bugfix | important | ESXi650-201907211-UG
lsu-hp-hpsa-plugin | 2.0.0-16vmw.650.3.96.13932383 | VMware | Updates the ESX 6.5.0 lsu-hp-hpsa-plugin | bugfix | important | ESXi650-201907215-UG
misc-drivers | 6.5.0-3.96.13932383 | VMW | Updates the ESX 6.5.0 misc-drivers | bugfix | important | ESXi650-201907203-UG
nenic | 1.0.29.0-1vmw.650.3.96.13932383 | VMW | Updates the ESX 6.5.0 nenic | enhancement | important | ESXi650-201907219-UG
nvme | 1.2.2.28-1vmw.650.3.96.13932383 | VMW | Updates the ESX 6.5.0 nvme | enhancement | important | ESXi650-201907207-UG
qlnativefc | 2.1.73.0-5vmw.650.3.96.13932383 | VMW | Updates the ESX 6.5.0 qlnativefc | bugfix | important | ESXi650-201907205-UG
smartpqi | 1.0.1.553-28vmw.650.3.96.13932383 | VMW | Updates the ESX 6.5.0 smartpqi | enhancement | important | ESXi650-201907213-UG
tools-light | 6.5.0-2.92.13873656 | VMware | Updates the ESX 6.5.0 tools-light | security | important | ESXi650-201907102-SG
vmkusb | 0.1-1vmw.650.3.96.13932383 | VMW | Updates the ESX 6.5.0 vmkusb | bugfix | important | ESXi650-201907202-UG
vmw-ahci | 1.1.6-1vmw.650.3.96.13932383 | VMW | Updates the ESX 6.5.0 vmw-ahci | bugfix | important | ESXi650-201907220-UG
vmware-esx-esxcli-nvme-plugin | 1.2.0.36-3.96.13932383 | VMware | Updates the ESX 6.5.0 vmware-esx-esxcli-nvme-plugin | enhancement | important | ESXi650-201907208-UG
vsan | 6.5.0-3.96.13371499 | VMware | Updates the ESX 6.5.0 vsan | bugfix | critical | ESXi650-201907201-UG
vsanhealth | 6.5.0-3.96.13530496 | VMware | ESXi VSAN Health Service | bugfix | critical | ESXi650-201907201-UG
nachogonzalez
Commander

Please try this:

Upgrade the firmware and drivers. If that doesn't work, disable ATS heartbeating.

STI69
Contributor

It's ESXi 6.5, and disabling ATS heartbeating is only valid for 5.5 and 6.0.

We have the latest driver on the ESXi side.

MARVELL :

Flash Image Version 01.01.91

However, this requires the QConvergeConsole CLI, which is not available on ESX.

   BK010191.BIN contains:

   ----------------------

     Bootcode FC

       FC BIOS       v3.62

       FC FCode      v4.11  (Initiator)

      FC EFI        v7.00  (Signed)

     FC Firmware   v8.08.231

Our current version:

BIOS version 3.61

FCODE version 4.11

EFI version 6.11

Flash FW v8.05.61
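A quick way to see which components are actually behind is to compare the dotted version strings numerically. The sketch below uses the values quoted above; it assumes each dot-separated component can be compared as an integer, which may not hold for every vendor versioning scheme.

```python
def version_tuple(text):
    """Turn '8.05.61' into (8, 5, 61) for numeric comparison."""
    return tuple(int(part) for part in text.split("."))

def outdated(installed, latest):
    """Return the component names whose installed version is older."""
    return [
        name
        for name in installed
        if version_tuple(installed[name]) < version_tuple(latest[name])
    ]

# Installed values from the vmkmgmt_keyval output for vmhba2.
INSTALLED = {"BIOS": "3.61", "FCode": "4.11", "EFI": "6.11", "FC Firmware": "8.05.61"}
# Values from the Marvell flash image 01.01.91 README.
LATEST = {"BIOS": "3.62", "FCode": "4.11", "EFI": "7.00", "FC Firmware": "8.08.231"}

print(outdated(INSTALLED, LATEST))
```

Only the FCode level matches; the BIOS, EFI, and FC firmware levels are all older than those in the Marvell package.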

Let me see what I can do to schedule this update.

I checked the ServerView Update DVD for the Fujitsu PRIMERGY RX4770 M4, and they have released:

COMMENT_PUBLIC

--------------

bk016042.BIN contains:

----------------------

Bootcode FC

FC BIOS       v3.61

FC FCode      v4.11 b2

FC EFI        v6.14 (Fujitsu) Signed

FC Firmware            v8.08.231

MPI Firmware           v1.03.17

PEP Firmware (Baker)   v1.0.24

PEP Firmware (Qlipper) v2.0.14

PEP SoftROM  (Baker)   v1.0.14

PEP SoftROM  (Qlipper) v2.0.09

So I will schedule an intervention on this ESX host next week and keep you posted.
