An excerpt from my vmkernel.log
We had storage incidents on a Unity 650F: controller crashes caused by two unresolved bugs, a read-cache merge problem and a backend driver issue.
Dell is working on a 'custom' fix for both issues. In the meantime we were asked to mitigate the controller auto-resets and to upgrade to OE 5.0.3, which they agree will not resolve the current controller resets. After complaints from our side, they dug into every component of our infrastructure to reduce the impact on their storage. The following SCSI sense codes were found in the vmkernel log, and we were referred to host support to suppress 'illegal SCSI commands'.
According to them, these need to be addressed because they contribute to the crash of a controller node: from the storage side's perspective, the hosts are attempting target resets.
2020-08-17T01:49:15.677Z cpu108:65805)ScsiDeviceIO: 3015: Cmd(0x439e4341dfc0) 0xfe, CmdSN 0xbbc2e5 from world 65687 to dev "naa.60060160e8004b00e2ca985c1400127d" failed H:0x5 D:0x0 P:0x0 Invalid sense data: 0x0 0x0 0x0.
2020-08-17T01:49:15.677Z cpu65:3876714)NMP: nmp_ThrottleLogForDevice:3630: Cmd 0xf1 (0x439e4368a5c0, 65687) to dev "naa.60060160e8004b007e9d9a5cf732ff8e" on path "vmhba2:C0:T11:L47" Failed: H:0x8 D:0x0 P:0x0 Invalid sense data: 0x0 0x0 0x0. Act:EVAL
2020-08-17T01:49:15.677Z cpu65:3876714)ScsiDeviceIO: 3015: Cmd(0x439e43720dc0) 0xfe, CmdSN 0x719b30 from world 65687 to dev "naa.60060160e8004b007e9d9a5cf732ff8e" failed H:0x8 D:0x0 P:0x0 Invalid sense data: 0x80 0x41 0x0.
2020-08-17T01:49:15.677Z cpu65:3876714)NMP: nmp_ThrottleLogForDevice:3630: Cmd 0xf1 (0x439e435e02c0, 65687) to dev "naa.60060160e8004b0058fbdc5d4a63c4ba" on path "vmhba3:C0:T10:L102" Failed: H:0x8 D:0x0 P:0x0 Invalid sense data: 0x0 0x0 0x0. Act:EVAL
2020-08-17T01:49:15.677Z cpu65:3876714)ScsiDeviceIO: 3015: Cmd(0x439e43502940) 0xfe, CmdSN 0x382ef0 from world 65687 to dev "naa.60060160e8004b0058fbdc5d4a63c4ba" failed H:0x8 D:0x0 P:0x0 Invalid sense data: 0x80 0x41 0x0.
2020-08-17T01:49:15.877Z cpu65:3876714)NMP: nmp_ThrottleLogForDevice:3630: Cmd 0xf1 (0x439e435079c0, 65687) to dev "naa.60060160e8004b00de1a995cd70e3c6a" on path "vmhba3:C0:T12:L30" Failed: H:0x8 D:0x0 P:0x0 Invalid sense data: 0x0 0x0 0x0. Act:EVAL
I entered those values to
so according to this, the HBA does a reset of the target, which in this case is a DELL EMC Unity FC port.
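For anyone who wants to decode these lines in bulk rather than one at a time, here is a minimal sketch in Python. It assumes the log lines have the two shapes shown above (ScsiDeviceIO and NMP). The status names are a subset of VMware's published SCSI host status codes, and the function name `parse_scsi_failure` is made up for this example.

```python
import re

# Subset of VMware's documented SCSI host status codes (the H: field).
HOST_STATUS = {
    0x0: "OK",
    0x1: "NO_CONNECT",
    0x2: "BUS_BUSY",
    0x3: "TIMEOUT",
    0x5: "ABORT",
    0x7: "ERROR",
    0x8: "RESET",
}

# Matches "Cmd(0x...) 0xfe, ..." (ScsiDeviceIO) and "Cmd 0xf1 (0x..., ...)" (NMP).
PATTERN = re.compile(
    r'Cmd\s*(?:\(0x[0-9a-fA-F]+\)\s*)?(?P<op>0x[0-9a-fA-F]+).*?'
    r'dev "(?P<dev>naa\.[0-9a-fA-F]+)".*?'
    r'H:(?P<h>0x[0-9a-fA-F]+) D:(?P<d>0x[0-9a-fA-F]+) P:(?P<p>0x[0-9a-fA-F]+)'
)


def parse_scsi_failure(line):
    """Return the decoded fields of one failed-command log line, or None."""
    m = PATTERN.search(line)
    if m is None:
        return None
    opcode = int(m.group("op"), 16)
    host = int(m.group("h"), 16)
    return {
        "device": m.group("dev"),
        "opcode": opcode,
        # SCSI opcodes 0xC0-0xFF are reserved for vendor-specific commands.
        "vendor_specific": opcode >= 0xC0,
        "host_status": HOST_STATUS.get(host, "UNKNOWN"),
        "device_status": int(m.group("d"), 16),
        "plugin_status": int(m.group("p"), 16),
    }


sample = ('2020-08-17T01:49:15.677Z cpu65:3876714)NMP: nmp_ThrottleLogForDevice:3630: '
          'Cmd 0xf1 (0x439e4368a5c0, 65687) to dev "naa.60060160e8004b007e9d9a5cf732ff8e" '
          'on path "vmhba2:C0:T11:L47" Failed: H:0x8 D:0x0 P:0x0 '
          'Invalid sense data: 0x0 0x0 0x0. Act:EVAL')
print(parse_scsi_failure(sample))
```

Both opcodes in my excerpt (0xfe and 0xf1) fall in the vendor-specific range, which may be what the storage side means by 'illegal' commands, but that is my reading and not something Dell has confirmed.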
And now what?
dev "naa.6006" is DELL EMC Storage, in my case Dell EMC Unity 650f running OE 4.5.1 (UWDC01) & OE 5.0.3 (UWDC02)
Current Dell EMC Unity Target code is OE 5.0.3
[root@esx070:~] esxcfg-scsidevs -m | grep "naa.60060160de004b005f5d2a5fbbcad438"
naa.60060160de004b005f5d2a5fbbcad438:1 /vmfs/devices/disks/naa.60060160de004b005f5d2a5fbbcad438:1 5f2a5df1-f99059c6-eed8-20040ff4978e 0 UWDC01_IT-PROD-WDC_V005
[root@esx070:~] esxcfg-scsidevs -m | grep "naa.60060160e8004b00e595255ec02cf074"
naa.60060160e8004b00e595255ec02cf074:1 /vmfs/devices/disks/naa.60060160e8004b00e595255ec02cf074:1 5e2596ab-2bec8188-f141-20040ff4978e 0 UWDC02_IT-PROD-WDC_V103
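To avoid grepping one device at a time, the full `esxcfg-scsidevs -m` output can be turned into a lookup table. This is only a sketch: it assumes every line has the five whitespace-separated fields seen above (device:partition, device path, VMFS UUID, a numeric field, volume label), and `parse_vmfs_mappings` is a hypothetical helper name.

```python
def parse_vmfs_mappings(output):
    """Map NAA device IDs to VMFS labels from `esxcfg-scsidevs -m` text."""
    mapping = {}
    for line in output.splitlines():
        # maxsplit=4 keeps a label containing spaces in one piece.
        parts = line.split(None, 4)
        if len(parts) < 5 or not parts[0].startswith("naa."):
            continue
        device = parts[0].rsplit(":", 1)[0]  # drop the ":<partition>" suffix
        mapping[device] = parts[4].strip()
    return mapping


output = """\
naa.60060160de004b005f5d2a5fbbcad438:1 /vmfs/devices/disks/naa.60060160de004b005f5d2a5fbbcad438:1 5f2a5df1-f99059c6-eed8-20040ff4978e 0 UWDC01_IT-PROD-WDC_V005
naa.60060160e8004b00e595255ec02cf074:1 /vmfs/devices/disks/naa.60060160e8004b00e595255ec02cf074:1 5e2596ab-2bec8188-f141-20040ff4978e 0 UWDC02_IT-PROD-WDC_V103
"""
for device, label in parse_vmfs_mappings(output).items():
    print(device, "->", label)
```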
[root@esx070:~] vmkchdev -l | grep vmhba
0000:00:11.5 8086:a1d2 1734:1230 vmkernel vmhba0
0000:00:17.0 8086:a182 1734:1230 vmkernel vmhba1
0000:17:00.0 1077:2261 1077:029b vmkernel vmhba2 ----------------> FC HBA
0000:6d:00.0 1077:2261 1077:029b vmkernel vmhba3 ----------------> FC HBA
[root@esx070:~] /usr/lib/vmware/vmkmgmt_keyval/vmkmgmt_keyval -d
Dumping all key-value instance names:
Key Value Instance: vmhba3/qlogic
Key Value Instance: vmhba2/qlogic
Key Value Instance: vmhba1/vmw_ahci
Key Value Instance: vmhba0/vmw_ahci
Key Value Instance: MOD_PARM/qlogic
[root@esx070:~] /usr/lib/vmware/vmkmgmt_keyval/vmkmgmt_keyval -l -i vmhba2/qlogic
Listing keys:
Name: ADAPTER
Type: string
value:
QLogic 16Gb 1-port FC to PCIe Gen3 x8 Adapter for QLE2690:
FC Firmware Version: 8.05.61 (d0d5), Driver version 2.1.73.0
Host Device Name vmhba2
BIOS version 3.61
FCODE version 4.11
EFI version 6.11
Flash FW version 8.05.61
ISP: ISP2261, Serial# RFD1722T35676
MSI-X enabled
Request Queue = 0x4309f6548000, Response Queue = 0x4309f6569000
Request Queue count = 2048, Response Queue count = 512
Number of response queues for CPU affinity operation: 4
CPU Affinity mode enabled
Total number of MSI-X interrupts on vector 0 (handler = 23) = 26676
Total number of MSI-X interrupts on vector 1 (handler = 24) = 2186
Total number of MSI-X interrupts on vector 2 (handler = 25) = 1090148738
Total number of MSI-X interrupts on vector 3 (handler = 26) = 583007145
Total number of MSI-X interrupts on vector 4 (handler = 27) = 2055128386
Total number of MSI-X interrupts on vector 5 (handler = 28) = 1406005796
Device queue depth = 0x8
Number of free request entries = 1271
FAWWN support: disabled
FEC support: Disabled
Total number of outstanding commands: 0
Number of mailbox timeouts = 0
Number of ISP aborts = 0
Number of loop resyncs = 29
Host adapter:Loop State = [READY], flags = 0x20ae200
Link speed = [16 Gbps]
Dpc flags = 0x0
Link down Timeout = 010
Port down retry = 010
Login retry count = 010
Execution throttle = 2048
ZIO mode = 0x6, ZIO timer = 1
Commands retried with dropped frame(s) = 297
Product ID = 4953 5020 2261 0001
NPIV Supported : Yes
Max Virtual Ports = 254
SCSI Device Information:
scsi-qla0-adapter-node=20000024ff149042:160a00:0;
scsi-qla0-adapter-port=21000024ff149042:160a00:0;
Name: TARGET
Type: string
value:
Driver version 2.1.73.0
Host Device Name vmhba2
FC Target-Port List:
scsi-qla0-target-0=500000e0da81df29:122300:0:Online;
scsi-qla0-target-1=500000e0da81df39:142300:1:Online;
scsi-qla0-target-2=5006016249e4121e:140000:2:Online;
scsi-qla0-target-3=5006016349e0121e:120000:3:Online;
scsi-qla0-target-4=5006016849e4121e:140100:4:Online;
scsi-qla0-target-5=5006016a49e4121e:120200:5:Online;
scsi-qla0-target-6=5006016249e415ff:0e0000:6:Online;
scsi-qla0-target-7=5006016349e015ff:100000:7:Online;
scsi-qla0-target-8=5006016849e415ff:100100:8:Online;
scsi-qla0-target-9=5006016a49e415ff:0e0100:9:Online;
scsi-qla0-target-10=5006016249e41688:0e0500:a:Online;
scsi-qla0-target-11=5006016349e01688:100200:b:Online;
scsi-qla0-target-12=5006016849e41688:100300:c:Online;
scsi-qla0-target-13=5006016a49e41688:0e0300:d:Online;
Name: NPIV
Type: string
value:
Driver version 2.1.73.0
Host Device Name vmhba2
NPIV Supported : Yes
Looking at the QLogic site (Marvell nowadays) for the QLE2690, we are one version behind the latest:
QLogic / Marvell Driver Download
README
Read1st for Cavium Flash Image Package
--------------------------------------
**** ONLY FOR 268x/269x/27xx Series Adapters ****
1. Contents Of Flash Package
--------------------------------
The files contained in this Flash image package are zipped into a file that
will expand to provide the following versions for the 268x/269x/276x Series Adapters.
* Flash Image Version 01.01.91
BK010191.BIN contains:
----------------------
Bootcode FC
FC BIOS v3.62
FC FCode v4.11 (Initiator)
FC FCode v4.10 (Target)
FC EFI v7.00 (Signed)
FC Firmware v8.08.231
MPI Firmware v1.00.19
PEP Firmware(Quad-port) v1.0.27
PEP Firmware(Single/Dual port) v2.0.12
PEP SoftROM(Quad port) v1.0.16
PEP SoftROM(Single/Dual port) v2.0.11
EFlash tool v1.18
Some quick questions:
- Is your firmware up to date?
- Have you tried upgrading all drivers?
- Is it possible that there is an issue on an SFP?
- Does this issue spread across multiple hosts?
Would it be possible for you to disable ATS heartbeating? See: Disabling ATS Heartbeat - Huawei SAN Storage Host Connectivity Guide for VMware ESXi - Huawei
Let me know if you found this helpful
SAN
SANSW23:xxxx> porterrshow 4
         frames       enc    crc    crc    too    too    bad    enc   disc   link   loss   loss   frjt   fbsy     c3timeout    pcs  uncor
         tx     rx     in    err  g_eof   shrt   long    eof    out     c3   fail   sync    sig                   tx     rx    err    err
  4:   3.9g   2.2g      0      0      0      0      0      0      0      8      0      0      0      0      0      0      0      0      0
SANSW22:xxxx> porterrshow 10
         frames       enc    crc    crc    too    too    bad    enc   disc   link   loss   loss   frjt   fbsy     c3timeout    pcs  uncor
         tx     rx     in    err  g_eof   shrt   long    eof    out     c3   fail   sync    sig                   tx     rx    err    err
 10:   3.9g   2.1g      0      0      0      0      0      0      0     16      0      0      0      0      0      0      0      0      0
Very few errors. C3 discards are frames that were queued for the destination, then expired and were dropped.
Likely cause: buffer credits were exhausted, i.e. a flow-control issue in the fabric. Speed mismatches between HBAs on the same path can cause this, but that looks unlikely here, since the esx070 link speed of 16 Gbit matches the Unity 650F SP port (16 Gbit as well).
Given the high tx/rx frame counts, this is a very low figure.
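To put a number on "very low", the discards can be related to the frame counters. A rough sketch, assuming the k/m/g suffixes in porterrshow mean thousand/million/billion and that "disc c3" is the tenth counter column as in the output above. The counters are rounded, so the result is only an order of magnitude. The helper names are made up for this example.

```python
SUFFIX = {"k": 1e3, "m": 1e6, "g": 1e9}


def to_count(field):
    """Convert a porterrshow counter such as '3.9g' or '8' to a number."""
    field = field.strip().lower()
    if field and field[-1] in SUFFIX:
        return float(field[:-1]) * SUFFIX[field[-1]]
    return float(field)


def c3_discard_rate(row):
    """Return (port, c3 discards, discards per billion frames) for one data row."""
    port, *fields = row.split()
    tx, rx = to_count(fields[0]), to_count(fields[1])
    disc_c3 = to_count(fields[9])  # tenth counter column is "disc c3"
    return port.rstrip(":"), int(disc_c3), disc_c3 / (tx + rx) * 1e9


for row in ("4: 3.9g 2.2g 0 0 0 0 0 0 0 8 0 0 0 0 0 0 0 0 0",
            "10: 3.9g 2.1g 0 0 0 0 0 0 0 16 0 0 0 0 0 0 0 0 0"):
    port, discards, per_billion = c3_discard_rate(row)
    print(f"port {port}: {discards} c3 discards, ~{per_billion:.1f} per billion frames")
```

That works out to a handful of discarded frames per billion on both ports.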
Indeed, we see a lot of
"Lost access to volume xxxx due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly"
vRealize Log Insight regularly reports these messages on many hosts.
Masses of them are sent during the night, when the Veeam backup proxies attach the VMFS volumes for backup transfer.
I presume that when a VMFS volume gets unmounted, some ESXi hosts (some more than others) report that the temporary device naa.xxxxxx is inaccessible.
When I look up the VMFS volume with esxcfg-scsidevs -m | grep naa.xxxxx,
these devices no longer exist after they have been declared inaccessible. In fact, vSphere recognises these volumes as snap<hex value (?)>-<VMFS label>.
[root@esx070:~] vmware -v
VMware ESXi 6.5.0 build-15256549
Imageprofile ESXi-6.5.0-20191204001-standard
https://esxi-patches.v-front.de/ESXi-6.5.0.html
According to the Marvell site, our firmware is one version behind the latest.
I see an important Emulex update in a build higher than ours.
However, we have a QLE2690.
Imageprofile ESXi-6.5.0-20200704001-standard (Build 16576891)
lpfc | 11.4.33.26-14vmw.650.3.138.16576891 | VMW | Updates the ESX 6.5.0 lpfc | bugfix | important | ESXi650-202007403-BG |
The last image profile that updated the QLogic driver, which is below our build version:
Imageprofile ESXi-6.5.0-20190702001-standard (Build 13932383) includes the following updated VIBs:
Relevant excerpt from the full list below:
qlnativefc | 2.1.73.0-5vmw.650.3.96.13932383 | VMW | Updates the ESX 6.5.0 qlnativefc | bugfix | important | ESXi650-201907205-UG |
We have this exact driver!
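The same check can be scripted against any row of the patch table. A small sketch, assuming the pipe-delimited row layout used on the patch-tracker page and that the trailing number of the VIB version is the ESXi build it shipped with. `parse_vib_row` and `driver_matches_profile` are hypothetical names.

```python
import re


def parse_vib_row(row):
    """Split one pipe-delimited VIB row into (name, driver_version, build)."""
    cells = [c.strip() for c in row.strip().strip("|").split("|")]
    name, version = cells[0], cells[1]
    driver_version = version.split("-", 1)[0]
    # Assumption: the trailing digits of the version string are the build number.
    build = int(re.search(r"(\d+)$", version).group(1))
    return name, driver_version, build


def driver_matches_profile(row, host_build, loaded_driver):
    """True if the VIB shipped at or before the host build and versions agree."""
    _, driver_version, build = parse_vib_row(row)
    return build <= host_build and driver_version == loaded_driver


row = ("qlnativefc | 2.1.73.0-5vmw.650.3.96.13932383 | VMW | "
       "Updates the ESX 6.5.0 qlnativefc | bugfix | important | ESXi650-201907205-UG |")
# Host build from `vmware -v`, loaded driver from the vmkmgmt_keyval output.
print(driver_matches_profile(row, host_build=15256549, loaded_driver="2.1.73.0"))
```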
Name | Version | Vendor | Summary | Category | Severity | Bulletin |
---|---|---|---|---|---|---|
bnxtnet | 20.6.101.7-23vmw.650.3.96.13932383 | VMW | Updates the ESX 6.5.0 bnxtnet | enhancement | important | ESXi650-201907216-UG |
brcmfcoe | 11.4.1078.25-14vmw.650.3.96.13932383 | VMW | Updates the ESX 6.5.0 brcmfcoe | bugfix | important | ESXi650-201907218-UG |
esx-base | 6.5.0-3.96.13932383 | VMware | Updates the ESX 6.5.0 esx-base | bugfix | critical | ESXi650-201907201-UG |
esx-tboot | 6.5.0-3.96.13932383 | VMware | Updates the ESX 6.5.0 esx-tboot | bugfix | critical | ESXi650-201907201-UG |
esx-ui | 1.33.4-13786312 | VMware | VMware Host Client | security | important | ESXi650-201907103-SG |
i40en | 1.8.1.9-2vmw.650.3.96.13932383 | VMW | Updates the ESX 6.5.0 i40en | enhancement | important | ESXi650-201907214-UG |
igbn | 0.1.1.0-4vmw.650.3.96.13932383 | VMW | Updates the ESX 6.5.0 igbn | bugfix | important | ESXi650-201907206-UG |
ixgben | 1.7.1.15-1vmw.650.3.96.13932383 | VMW | Updates the ESX 6.5.0 ixgben | enhancement | important | ESXi650-201907204-UG |
lpfc | 11.4.33.25-14vmw.650.3.96.13932383 | VMW | Updates the ESX 6.5.0 lpfc | bugfix | important | ESXi650-201907217-UG |
lsi-mr3 | 7.708.07.00-3vmw.650.3.96.13932383 | VMW | Updates the ESX 6.5.0 lsi-mr3 | enhancement | important | ESXi650-201907209-UG |
lsi-msgpt2 | 20.00.06.00-2vmw.650.3.96.13932383 | VMW | Updates the ESX 6.5.0 lsi-msgpt2 | bugfix | moderate | ESXi650-201907212-UG |
lsi-msgpt3 | 17.00.02.00-1vmw.650.3.96.13932383 | VMW | Updates the ESX 6.5.0 lsi-msgpt3 | bugfix | important | ESXi650-201907210-UG |
lsi-msgpt35 | 09.00.00.00-5vmw.650.3.96.13932383 | VMW | Updates the ESX 6.5.0 lsi-msgpt35 | bugfix | important | ESXi650-201907211-UG |
lsu-hp-hpsa-plugin | 2.0.0-16vmw.650.3.96.13932383 | VMware | Updates the ESX 6.5.0 lsu-hp-hpsa-plugin | bugfix | important | ESXi650-201907215-UG |
misc-drivers | 6.5.0-3.96.13932383 | VMW | Updates the ESX 6.5.0 misc-drivers | bugfix | important | ESXi650-201907203-UG |
nenic | 1.0.29.0-1vmw.650.3.96.13932383 | VMW | Updates the ESX 6.5.0 nenic | enhancement | important | ESXi650-201907219-UG |
nvme | 1.2.2.28-1vmw.650.3.96.13932383 | VMW | Updates the ESX 6.5.0 nvme | enhancement | important | ESXi650-201907207-UG |
qlnativefc | 2.1.73.0-5vmw.650.3.96.13932383 | VMW | Updates the ESX 6.5.0 qlnativefc | bugfix | important | ESXi650-201907205-UG |
smartpqi | 1.0.1.553-28vmw.650.3.96.13932383 | VMW | Updates the ESX 6.5.0 smartpqi | enhancement | important | ESXi650-201907213-UG |
tools-light | 6.5.0-2.92.13873656 | VMware | Updates the ESX 6.5.0 tools-light | security | important | ESXi650-201907102-SG |
vmkusb | 0.1-1vmw.650.3.96.13932383 | VMW | Updates the ESX 6.5.0 vmkusb | bugfix | important | ESXi650-201907202-UG |
vmw-ahci | 1.1.6-1vmw.650.3.96.13932383 | VMW | Updates the ESX 6.5.0 vmw-ahci | bugfix | important | ESXi650-201907220-UG |
vmware-esx-esxcli-nvme-plugin | 1.2.0.36-3.96.13932383 | VMware | Updates the ESX 6.5.0 vmware-esx-esxcli-nvme-plugin | enhancement | important | ESXi650-201907208-UG |
vsan | 6.5.0-3.96.13371499 | VMware | Updates the ESX 6.5.0 vsan | bugfix | critical | ESXi650-201907201-UG |
vsanhealth | 6.5.0-3.96.13530496 | VMware | ESXi VSAN Health Service | bugfix | critical | ESXi650-201907201-UG |
Please try this:
Upgrade firmware and drivers. If that doesn't work, disable ATS heartbeating.
It's ESXi 6.5, and disabling ATS heartbeat is only valid for 5.5 and 6.0.
We have the latest driver on the ESXi side.
Marvell:
Flash Image Version 01.01.91
However, this requires the QConvergeConsole CLI, which is not available on ESX.
BK010191.BIN contains:
----------------------
Bootcode FC
FC BIOS v3.62
FC FCode v4.11 (Initiator)
FC EFI v7.00 (Signed)
FC Firmware v8.08.231
Our current version:
BIOS version 3.61
FCODE version 4.11
EFI version 6.11
Flash FW v8.05.61
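Side by side, the gap between the two lists looks like this. A minimal sketch that compares dotted version strings numerically. The dictionaries simply restate the values from the Marvell README and our adapter output above, and the helper names are invented for illustration.

```python
def version_tuple(text):
    """Turn '8.05.61' into (8, 5, 61) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def outdated(installed, available):
    """Return the component names whose installed version is older."""
    return sorted(
        name for name, ver in installed.items()
        if name in available and version_tuple(ver) < version_tuple(available[name])
    )


installed = {"BIOS": "3.61", "FCode": "4.11", "EFI": "6.11", "Firmware": "8.05.61"}
marvell = {"BIOS": "3.62", "FCode": "4.11", "EFI": "7.00", "Firmware": "8.08.231"}

print(outdated(installed, marvell))
```

So only the FCode level is already current. BIOS, EFI and the FC firmware are behind the Marvell package.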
Let me see what I can do to schedule this update.
I checked the ServerView Update DVD for the Fujitsu PRIMERGY RX4770 M4, and they have released:
COMMENT_PUBLIC
--------------
bk016042.BIN contains:
----------------------
Bootcode FC
FC BIOS v3.61
FC FCode v4.11 b2
FC EFI v6.14 (Fujitsu) Signed
FC Firmware v8.08.231
MPI Firmware v1.03.17
PEP Firmware (Baker) v1.0.24
PEP Firmware (Qlipper) v2.0.14
PEP SoftROM (Baker) v1.0.14
PEP SoftROM (Qlipper) v2.0.09
So I will schedule an intervention on this ESX host next week and keep you posted.