VMware Cloud Community
bfrericks
Contributor
Contributor

NMP: nmp_ResetDeviceLogThrottling in vmkernel log

Hello all....  After updating to the latest 5.5 patches we started seeing some messages repeating at different times (no time pattern) in the vmkernel log on all of our hosts.  The message is:

cpu45:33681)NMP: nmp_ResetDeviceLogThrottling:3311: Error status H:0x0 D:0x2 P:0x0 Sense Data: 0x5 0x20 0x0 from dev "mpx.vmhba32:C0:T0:L0" occurred 2 times(of 0 commands)

These messages repeat all day every day.  We boot from a supported SD card and the message indicates the USB device is to blame.  I opened a ticket with VMware, they stated that the messages are benign and nothing to worry about and are related to an updated SCSI driver in the patches we applied.  Basically the message is indicating that vmware is sending sense data to the USB controller which it does not understand and is sending back an illegal request.  Their suggestion was to reach out to HP for a firmware update. (we are at the latest FW on all hosts and I have tried backing down the FW to see if that would help) I have done that however I'm in the midst of a pissing match between VMware and HP both blaming each other for the fix so I'm turning to the community for help.

We have a mix of HPE DL380G8 & G9, and a spattering of ML350s at remote offices.  All of the updated servers are logging this message.

While I appreciate that the messages are benign they are causing noise in the logs and I'd like to figure out a way to 1) fix the issue (ideally) or 2) suppress the messages and 3) validate that the messages are indeed benign.

Are others seeing these same messages in their logs?  Given we have seen this on multiple hosts across multiple generations of HPE servers I'm inclined to believe others are seeing this as well?

To add to it, we recently purchased several DL360 Gen10 servers running 6.5.  All of those servers boot from SD card as well and log a similar message:

cpu40:65665)NMP: nmp_resetDeviceLogThrottling:3348 last error status from device "evi.00e04c20202020" repeated 2 times

cpu43:65665_NMP: nmp_ThrottleLogForDevice:3647 Cmd 0x9e (0x439d423a9200, 0) to dev "evi.00e04c20202020" on path "vmhba32"C0:T0:Lo" Failed: H0x7 D0x0 P:0X0

Thanks in advance for any advice on resolving this.

11 Replies
MMosaadObaid
Contributor
Contributor

cpu45:33681)NMP: nmp_ResetDeviceLogThrottling:3311: Error status H:0x0 D:0x2 P:0x0 Sense Data: 0x5 0x20 0x0 from dev "mpx.vmhba32:C0:T0:L0" occurred 2 times(of 0 commands)

indicate that the storage did not response to the scanning done by the NMP hence NMP will issue reset to re-read the storage.

this could get ugly if you are using dual SD option as troubleshooting it from the hardware prospective is a bit complicated.

if you have a local RAID controller, i suggest you try booting from it and contact the hardware vendor if you can.

0 Kudos
STI69
Contributor
Contributor

020-08-14T10:15:46.923Z esx070 vmkernel: cpu4:66636)NMP: nmp_ResetDeviceLogThrottling:3463: Error status H:0x0 D:0x2 P:0x0 Sense Data: 0x5 0x20 0x0 from dev "naa.60060160e8004b008d70a85ea9b3e8dd" occurred 1 times(of 1 commands)

the volume is from a Dell EMC Unity 650f all Flash array.

It is raining such events, and for some reason this SCSI sense code is logged on our Loginsigt at the speed of light nearly.

I did find information from the SCSI sense conversion tool on https://www.virten.net/vmware/esxi-scsi-sense-code-decoder/

Entering the SCSI sense code yields underneath. What is the cause of an ILLEGAL REQUEST ? INVALID COMMAND OPERATION CODE ?

pastedImage_1.png

0 Kudos
nachogonzalez
Commander
Commander

That error indicates that the storage did not respond to the NMP scan. Which makes sense since, if you are using local SD it should have only one path

I've seen that error many times in the past and you can take the following approaches:

1. Ignore it, if it is not repeating every second
2. Engage your hardware vendor so they can health check the hardware
3. Engage VMware but I see that you've already did.

Another thing to keep in mind, thoes this pattern repeat in all your hosts? are all the esxi hosts configured exactly the same?

0 Kudos
STI69
Contributor
Contributor

Dell EMC Unity Storage support does not like the errors and referred us to check the servers, more in specic the firmware of the HBA, but there are all running at target codes

We run ESXi 6.5 build 5256549

We use following setting on all our volumes as per DELL EMC (Unity) recommendation, thoughout

esxcli storage nmp satp rule add --satp "VMW_SATP_ALUA_CX" --vendor "DGC" --psp "VMW_PSP_RR" --psp-option "iops=1"

0 Kudos
mviel
Enthusiast
Enthusiast

We encounter the Same Error Type on an

DELL R740 Server with an Software ISCSI / Intel X520 Card on an EMC Unity 550F ( ver. 5.0.2.0.5.009 ) ( All other 36 Host work fine with the same setting )
VMWARE vSphere ESXi 6.7 U3 ( Patchlevel  VMware ESXi 6.7, Patch Release ESXi670-202004002 / But we did a Test on the Same Machine with different Boot Device / fresh Install latest DELL vSPhere Image 6.7 u3 - Version 4  and Patchlevel VMWARE June and we dont see no difference )

The Problem occures just after we relocated the Server . The Difference there is that the ISCSI Ports and the VDS  PORTS are now connect via an SFP+ Adapter rather then a DAC ( Direct Attached Copper ) .

We see no Log Warning in the Unity 550F .

Of Course we opend a production Ticket with VMWARE , but it was hard to give them Information as the ESXTOP show high Waits on several VMKERNEL Processes.  Finally the Support told us, that the ISCSI Interfaces are not longer shown in the Support Bundle. So wie Setup the whol Machine new , now different Firmware or different Software Driver were used. On first Glimse everthing looked good.

No Problem at all for example with the  Bootup Time  ( while we saw the Error we hat Bootup Time up to 4 hours ! , with 150 Luns attached ) , it was also possible again to do a VM-Support on the CLI again.

After the LUNs were permissioned on the UNITY 550F ( ALL FLASH ) the Problem came back .

An obversation is, that when you look at the ESXTOP Info , that many Process are having strange Numbers.

system  %RUN 3800 %Wait 56000 %RDY 60
vpxa                              %WAIT 3000 
helper                          %WAIT 16000

hostd                               %WAIT 3800

therefore have the impression that all those Logs flood the VMKERNEL , because of the Software ISCSI Connection and raising the the Communication to such an level so that all Process are prolonged. The Funny thing is , that even as we see a Problem on the Host site a TEST VM on the given Server looks "FINE" .

Also it takes a long time to loginto the WebServer of the Host and the connection to the VCSA is lost.

We still have no glue neither has VMWARE so far, we havent open up a Support Case with EMC .


0 Kudos
mviel
Enthusiast
Enthusiast

Could you maybe Provide information on the Nics ? Which Typ of Adapter do you use DAC or SFP+ ?

Which PatchVersion has your Unity 550F ALL FLASH.

We have Version X520 Intel ( Intel(R) Ethernet 10G 4P X520/I350 rNDC ) running on the latest Firmware Regarding DELL 19.5.12 , we have 36 othere Hosts on the same Hardware Firmware Version and connected to the same Unity 550F with no Problems )

0 Kudos
MMosaadObaid
Contributor
Contributor

please past the output of this command:

grep sense /var/run/log/vmkernel.log

0 Kudos
mviel
Enthusiast
Enthusiast

The Log is Full from those Logs , I changed the naa GUUID .

2020-08-17T06:49:33.656Z cpu2:2098268)ScsiDeviceIO: 3469: Cmd(0x459bce3f6a00) 0x85, CmdSN 0xcda from world 2100373 to dev "naa.60060160c9f149" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.

2020-08-17T06:49:33.660Z cpu2:2098268)ScsiDeviceIO: 3469: Cmd(0x459bce3f6a00) 0x85, CmdSN 0xcdc from world 2100373 to dev "naa.60060160ee2048" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.

2020-08-17T06:49:33.664Z cpu10:2098270)ScsiDeviceIO: 3469: Cmd(0x459bce3f6a00) 0x85, CmdSN 0xcde from world 2100373 to dev "naa.60060160c9f14" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.

2020-08-17T06:49:33.667Z cpu2:2098268)ScsiDeviceIO: 3469: Cmd(0x459bce3f6a00) 0x85, CmdSN 0xce0 from world 2100373 to dev "naa.60060160c9f149" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.

2020-08-17T06:49:33.671Z cpu2:2098268)ScsiDeviceIO: 3469: Cmd(0x459bce3f6a00) 0x85, CmdSN 0xce2 from world 2100373 to dev "naa.60060160ee2048" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.

2020-08-17T06:49:33.676Z cpu10:2098270)ScsiDeviceIO: 3469: Cmd(0x459bce3f6a00) 0x85, CmdSN 0xce4 from world 2100373 to dev "naa.60060160c9f14" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.

2020-08-17T06:49:33.680Z cpu10:2098270)ScsiDeviceIO: 3469: Cmd(0x459bce3f6a00) 0x85, CmdSN 0xce6 from world 2100373 to dev "naa.60060160c9f14" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.

2020-08-17T06:49:33.686Z cpu2:2098268)ScsiDeviceIO: 3469: Cmd(0x459bce3f6a00) 0x85, CmdSN 0xce8 from world 2100373 to dev "naa.60060160c9f149" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.

2020-08-17T06:49:33.693Z cpu2:2098268)ScsiDeviceIO: 3469: Cmd(0x459bce3f6a00) 0x85, CmdSN 0xcea from world 2100373 to dev "naa.60060160c9f149" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.

2020-08-17T06:49:33.697Z cpu2:2098268)ScsiDeviceIO: 3469: Cmd(0x459bce3f6a00) 0x85, CmdSN 0xcec from world 2100373 to dev "naa.60060160ee2048" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.

2020-08-17T06:49:33.708Z cpu10:2098270)ScsiDeviceIO: 3469: Cmd(0x459bce3f6a00) 0x85, CmdSN 0xcee from world 2100373 to dev "naa.60060160c9f14" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.

2020-08-17T06:49:33.712Z cpu10:2098270)ScsiDeviceIO: 3469: Cmd(0x459bce3f6a00) 0x85, CmdSN 0xcf0 from world 2100373 to dev "naa.60060160ee204" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.

2020-08-17T06:49:33.722Z cpu10:2098270)ScsiDeviceIO: 3469: Cmd(0x459bce3f6a00) 0x85, CmdSN 0xcf2 from world 2100373 to dev "naa.60060160c9f14" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.

0 Kudos
MMosaadObaid
Contributor
Contributor

Disable ATS for heartbeat on all hosts connected to the same storage using the below command, verify you are using the correct SATP:

esxcli system settings advanced set -i 0 -o /VMFS3/UseATSForHBOnVMFS5

VMware Knowledge Base

0 Kudos
mviel
Enthusiast
Enthusiast

Well the Format is VMFS6 and maybe that will cure the Symptomes , but as written above I have 30 other Server which work fine . I have placed the Question up here to get maybe a new approach to the Problem , at the Moment I put my money on maybe defectiv SFP+ . a

Thanks MMosaadObaid for an Idea but turning the Heartbeat of seems to me a very dirty approach and it wouldnot explain that the other 30th Server are running stable and fine . with given firmware and Hardware . Its a really tricky question .

If we really should have a SFP+ Adapter Hardware Problem is shouldnt affect whole Server as it is seen with an ISCSI Software Adapter, to bad that I cant do the Tests with ISCSI HBA .

We are still in Contact with the Support .


Below you find the Articel Regarding Heartbeat and ATS :

VMware Knowledge Base

0 Kudos
nachogonzalez
Commander
Commander

I think ATS might do the trick.
Good one MMosaadObaid