VMware Cloud Community
supahted
Enthusiast

WARNING: NMP: nmp_DeviceRequestFastDeviceProbe

I am currently testing ESXi 4 by adding one ESXi 4 host to our VMware production cluster. The host is an HP BL460c G1 blade running ESXi 4 build 175625, connected to an HP EVA 6000 storage array. The ESXi 4 host seems to run fine, but I noticed the following kernel warnings in the system log:

Jul 18 17:00:27 vmkernel: 2:07:08:24.308 cpu7:40478)NMP: nmp_CompleteCommandForPath: Command 0x2a (0x4100021b8480) to NMP device "naa.600508b4000554df00007000034a0000" failed on physical path "vmhba1:C0:T0:L11" H:0x2 D:0x0 P:0x0 Possible sense data: Jul 18 17:00:27 0x0 0x0 0x0.

Jul 18 17:00:27 vmkernel: 2:07:08:24.308 cpu7:40478)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe: NMP device "naa.600508b4000554df00007000034a0000" state in doubt; requested fast path state update...

Jul 18 17:00:27 vmkernel: 2:07:08:24.308 cpu7:40478)ScsiDeviceIO: 747: Command 0x2a to device "naa.600508b4000554df00007000034a0000" failed H:0x2 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.

These warnings don't appear on our ESXi 3 hosts. They seem to have something to do with the multipathing policies, but I don't understand the warning message. The warnings are reported frequently on multiple LUNs. Does anybody know what they mean?
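In case it helps anyone look at the same thing on their own host, this is roughly how I check the path state and the active multipathing policy for the affected device from the ESXi 4 console (the naa ID is just the one from my logs):

# List every path to the device and its current state (active/standby/dead)
esxcfg-mpath -b -d naa.600508b4000554df00007000034a0000

# Show the NMP view of the device, including the SATP and the Path Selection Policy in use
esxcli nmp device list -d naa.600508b4000554df00007000034a0000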

blog: http://vknowledge.wordpress.com/
59 Replies
iceman76
Enthusiast

Dear Sir or Madam,

I am out of the office from 14 to 23 March 2010. I will receive your email, but I am unable to process it. In urgent cases, please contact our technical hotline, which can be reached on 0611 780 3003.

Kind regards

Carsten Buchberger

Iridium130m
Contributor

I am doing a fresh install of ESXi 4, have the latest ESXi firmware, and am still experiencing this issue after completing an SRM test (so I am unable to operationally remove the datastore).

Was this issue not patched / resolved with ESXi 4?

Iridium130m
Contributor

I opened a ticket with VMware support. Evidently this issue is still open on certain arrays (the IBM DS series is apparently one of them, as the DS4700 is what we are affected by). Setting the APD advanced flag (esxcfg-advcfg -s 1 /VMFS3/FailVolumeOpenIfAPD), as listed at http://virtualgeek.typepad.com/virtual_geek/2009/12/an-important-vsphere-4-storage-bug-and-workaroun... and recommended by our support agent, did not resolve the issue on this array. According to VMware support, they are still working with certain array vendors to fix this issue.
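In case anyone else wants to check whether the flag is set on their hosts, this is roughly how it is read, applied, and reverted from the console (same flag as above; nothing else changed):

# Show the current value of the flag (default is 0)
esxcfg-advcfg -g /VMFS3/FailVolumeOpenIfAPD

# Apply the workaround recommended by support
esxcfg-advcfg -s 1 /VMFS3/FailVolumeOpenIfAPD

# Put it back to the default if it makes no difference
esxcfg-advcfg -s 0 /VMFS3/FailVolumeOpenIfAPD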

ugh.

RanCyyD
Contributor

The only solution for us was to divide the ESX boxes into two clusters, one with the "old" 3.5 hosts, the other with 4.0 U1.

I tried some test boxes after several updates, but the errors still exist.

squebel
Contributor

We're having the same issue running 4.0 U2 on HP blades using Emulex cards attached to Hitachi storage. Did anyone ever come up with a proper solution to this issue? Anyone else on Hitachi storage experience the same problem?

BulletByte
Contributor

Same issue here, exactly the same errors in /var/log/vmkernel (0x28 errors)

ESX 4.1 fresh install

HP Blade 460 G6

QLogic HBA

EVA 4400 Controller

No solutions yet?

tvdh
Contributor

Hello, I have the same issues.

ESX 4.0 U2 with HP blades and Emulex cards. As storage we have an IBM SVC.

Dec 8 05:24:24 BRUS220 vmkernel: 8:18:25:40.432 cpu6:4102)NMP: nmp_CompleteCommandForPath: Command 0x2a (0x410006028a80) to NMP device "naa.6005076801900303e800000000000018" failed on physical path "vmhba1:C0:T4:L68" H:0x7 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.

Dec 8 05:24:24 BRUS220 vmkernel: 8:18:25:40.432 cpu6:4102)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe: NMP device "naa.6005076801900303e800000000000018" state in doubt; requested fast path state update...

Dec 8 05:24:24 BRUS220 vmkernel: 8:18:25:40.432 cpu6:4102)ScsiDeviceIO: 747: Command 0x2a to device "naa.6005076801900303e800000000000018" failed H:0x7 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.

Is any solution available yet?

squebel
Contributor

Our issue ended up being a bad piece of hardware in the HP c7000 blade chassis, specifically the Virtual Connect Fibre Channel module. It took us a lot of different troubleshooting steps to finally get to where we could single out the specific module. The module was replaced and the errors went away. We're seeing some more errors in some other chassis, so we're starting the same process on them today.

mitchellm3
Enthusiast

tvdh,

I have been following this issue in our environment for some time. Your error codes, like mine, are a little different from the others. I'm getting: "NMP: nmp_CompleteCommandForPath: Command 0x2a". This happens on different LUNs on different hosts. Could you tell me a little more about your environment to help me troubleshoot, or tell me how you fixed your environment if you have already done so?

We have an IBM SVC running 5.1.0.4 code.

ESX hosts having the issue are vSphere 4.0 U1 and U2.

We have noticed the problem happens on datastores running on DS4800/5300 storage behind the SVC, more so than on XIV storage behind the SVC, but we have noticed the errors on both.

We use RR multipathing rather than Fixed. I have not changed back to Fixed for testing yet.

Our errors usually happen around 2 AM, a busy backup time, but we have not been able to correlate the issue with any particular VCB backup or any one system causing the problem. It happens when no VCB backups run and it happens when some do. Also, it was not until recently that any VM saw issues caused by this, so the issue has obviously moved up in priority.

We have a PMR open with IBM and a case open with VMware.  I will post the solution when one comes.

We do have a 4.1 cluster running on the same SVC without these errors.  It is lightly loaded as our Desktop group is still going through their VDI build/testing phase.  I'm not sure 4.1 is the answer but that is the only thing that runs clean.

squebel
Contributor

I wanted to post a follow-up to my previous post, where I said that we were still seeing this problem on some hosts after we had ruled out hardware issues. Our problem ended up being the use of LUSE (LUN Size Expansion) devices on a Hitachi USP-V array, specifically during replication of the LUN to the DR site. A few hours into replication, we would see all kinds of SCSI reservation warnings and the disk latency would go through the roof. We found a white paper from HDS that recommended not using LUSE devices for VMFS datastores. We followed that advice and have not seen any of these errors since.

I know that not everyone having this problem is on Hitachi disk, but in our case it was the disk array that was the problem. So, hopefully this info at least helps one person still having issues.

tadsmith
Contributor

mitchellm3 - I've had the same issue with a DS4800 and ESXi 4.1. We're also seeing this early in the morning and have an open PMR and VMware ticket. Were you able to make any progress?

mitchellm3
Enthusiast

Progress is slow on this issue. We have been working daily with IBM and VMware, more so with IBM. We still aren't sure what is causing the problem, but we have significantly alleviated it.

First and foremost: anyone using HP Insight agents on their HP hosts should take a look at this KB article. It states that if your storage isn't HP, you should disable two of the IMA services or you could have storage issues. We have done this across the board, and now we aren't seeing as many errors.

The other thing to check on your DS4800s is what version of Storage Manager you have. The newer versions of Storage Manager run a storage profiler every night at 2 AM. This basically takes an inventory of your config so that the next time your DS4800 crashes and IBM support needs to recreate it, they'll have all the info they need. This info is also found in your "collect all support data" dumps. Anyway, we set that to run monthly, and we haven't seen the big destage errors, which correlated with the VMware errors, on our SAN Volume Controller.

We're running much, much better, but I'm not sold on this problem being completely gone. We are looking to upgrade the SVC to v5.1.0.8 and eventually to 6.1.

tadsmith
Contributor

Thanks for the quick follow-up. I'll have to check and make sure nobody else has the profiler installed and running. I am running ESXi 4.1 going directly to a DS4800 (no SVC). It also seems to happen early in the morning, between 2:00 AM and 5:00 AM.

What type of SAN switches are you using? Are you doing any Metro/Global Mirroring? We are running Brocade switches and do have asynchronous Global Mirroring enabled. Also, are you running Trend OfficeScan by any chance? I have yet to be able to rule out Trend as the cause, although IBM continues to say that we aren't reaching any performance limits on the DS4800.

barracuda_1
Contributor

Same problem here; the strange thing is it's only on one LUN (out of 15).

I'm connecting from 4+2 ESX hosts (2 clusters):

4x DL585 G5

2x DL380 G5

connecting through a McDATA 4700 to

Hitachi (HDS) AMS500

Hitachi (HDS) AMS2100

I'm also only seeing the problem on one host (a DL585 G5).

I've checked my SAN paths and there's no problem on that end.

Jun 15 13:54:16 bumblebee vmkernel: 36:03:58:27.856 cpu4:8927)ScsiDeviceIO: 1672: Command 0xfe to device "t10.HITACHI_999999990004" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0. 
Jun 15 13:54:26 bumblebee vmkernel: 36:03:58:37.544 cpu4:10255)ScsiDeviceIO: 1672: Command 0xfe to device "t10.HITACHI_999999990004" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
Jun 15 13:54:36 bumblebee vmkernel: 36:03:58:48.310 cpu15:5255)ScsiDeviceIO: 1672: Command 0xfe to device "t10.HITACHI_999999990004" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
Jun 15 13:54:47 bumblebee vmkernel: 36:03:58:58.429 cpu3:8022)ScsiDeviceIO: 1672: Command 0xfe to device "t10.HITACHI_999999990004" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
Jun 15 13:54:56 bumblebee vmkernel: 36:03:59:07.834 cpu3:8020)ScsiDeviceIO: 1672: Command 0xfe to device "t10.HITACHI_999999990004" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
Jun 15 13:55:06 bumblebee vmkernel: 36:03:59:17.601 cpu3:4099)ScsiDeviceIO: 1672: Command 0xfe to device "t10.HITACHI_999999990004" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
Jun 15 13:55:17 bumblebee vmkernel: 36:03:59:28.618 cpu3:4099)ScsiDeviceIO: 1672: Command 0xfe to device "t10.HITACHI_999999990004" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
Jun 15 13:55:28 bumblebee vmkernel: 36:03:59:39.702 cpu3:4099)ScsiDeviceIO: 1672: Command 0xfe to device "t10.HITACHI_999999990004" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
Jun 15 13:55:36 bumblebee vmkernel: 36:03:59:48.209 cpu3:8746)ScsiDeviceIO: 1672: Command 0xfe to device "t10.HITACHI_999999990004" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
Jun 15 13:55:45 bumblebee vmkernel: 36:03:59:57.114 cpu3:8020)ScsiDeviceIO: 1672: Command 0xfe to device "t10.HITACHI_999999990004" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
squebel
Contributor

I would make certain you aren't using LUSE devices on that HDS array. Are you replicating that LUN by any chance? You could very well have a bad HBA and not really know it without doing some deep troubleshooting.

barracuda_1
Contributor

No, I don't use LUSE, but the problem went away after the clone that was running completed.

I think it was a problem with my cache filling up due to slow SATA disks.

CsNoc
Contributor

We're seeing yet another hex code for this log message:

Jun 21 00:56:22 pn003 vmkernel: 152:07:45:21.178 cpu4:4100)NMP: nmp_CompleteCommandForPath: Command 0x2a (0x410008036180) to NMP device "naa.600c0ff000da7b197d2d794b01000000" failed on physical path "vmhba1:C0:T6:L0" H:0x8 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.

so the code would be:

H:0x8 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.

The host = 8 status translates to:

SG_ERR_DID_RESET [0x08] The SCSI bus (or this device) has been reset. Any SCSI device on a SCSI bus is capable of instigating a reset.
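For what it's worth, the other H: codes quoted in this thread come from the same host status list, so as I read them they decode as:

SG_ERR_DID_BUS_BUSY [0x02] The SCSI bus stayed busy through the timeout period (the H:0x2 in the original post).
SG_ERR_DID_ABORT [0x05] Told to abort for some other reason, i.e. the command was aborted.
SG_ERR_DID_ERROR [0x07] Internal error detected in the host adapter (the H:0x7 reported against the SVC above).
SG_ERR_DID_RESET [0x08] The SCSI bus (or this device) has been reset, as above.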

Our setup is:

HP DL380 (4x) G5, (3x) G6 and (2x) G7.

Storage is MSA2312fc (with SAS disks)

Switches are Cisco MDS 9124

ESX 4.0.0 build 332073 (the reason we're not on ESX 4.1 yet is the 64-bit requirement; still working on that one).

This message is repeated across multiple hosts, but not on all of them.

We also have other MSA datastores, none of which are showing these messages.

I am having a bit of trouble finding usable counters on the MSA. The web GUI isn't helpful at all, and the command line is not really self-explanatory, I'm afraid. Is there a simple command for checking the performance counters from the MDS CLI?

The Fibre Channel switches are showing no errors or congestion.

We do, however, perform nightly incremental backups of our VMware guests using the legacy method with Tivoli.

Generious
Enthusiast

I have seen this on several EVA 8100 arrays, on ESX 4.0/4.1 with HP G4/G5/G6 blades.

sense codes: H:0x0 D:0x28 P:0x0 Possible sense data: 0x0 0x0 0x0

What we did was set the following values:

Disk.QFullSampleSize => 32

Disk.QFullThreshold => 8

Disk.DiskMaxIOSize => 128 KB

We also changed the path selection policy from MRU to RR.

Source : http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?lang=en&cc=us&taskId=110&prodSeriesId=...

We also upgraded the Virtual Connect Fibre Channel modules and the blades' HBAs to the latest firmware versions; since then it hasn't reported the sense codes anymore.
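In case it saves someone the lookup, this is roughly how those values are applied from the ESX 4.x console (the option names are the full advanced-setting names as I know them, and the naa ID is just a placeholder, not a real device):

# Queue-full handling (adaptive queue-depth throttling)
esxcfg-advcfg -s 32 /Disk/QFullSampleSize
esxcfg-advcfg -s 8 /Disk/QFullThreshold

# Cap the largest I/O the host sends to the array at 128 KB
esxcfg-advcfg -s 128 /Disk/DiskMaxIOSize

# Change the path selection policy for one device from MRU to Round Robin
# (4.x esxcli syntax; repeat per device, or change the default PSP for the SATP)
esxcli nmp device setpolicy --device naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx --psp VMW_PSP_RR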

branox
Contributor

Hi,

It has been a long time since this thread was updated, but recently we had the same problem with ESX hosts and an IBM storage array. We use ESXi 5 Update 1 in our environment. There are multiple hosts that access shared datastores across two sites.

All VMs in a cluster became unresponsive, the ESX hosts became disconnected, and we saw some errors in the vmkernel and other logs:


2012-11-14T13:21:58.148Z cpu12:4619516)ScsiDeviceIO: 2322: Cmd(0x412440da3800) 0x9e, CmdSN 0x55510f from world 0 to dev "" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.

2012-12-05T16:40:15.020Z cpu3:8195)NMP: nmp_PathDetermineFailure:2084: SCSI cmd RESERVE failed on path vmhba2:C0:T0:L22, reservation state on device XXX is unknown.
2012-12-05T16:40:15.023Z cpu12:8204)NMP: nmp_PathDetermineFailure:2084: SCSI cmd RESERVE failed on path vmhba6:C0:T0:L12, reservation state on device XXX is unknown.

To resolve this problem we have to reboot the ESXi host; we have tried restarting the management agents without success.
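For anyone comparing notes, these are the kinds of checks and restarts involved on an ESXi 5.x host (just a sketch; the device ID below is a placeholder, not one of our LUNs):

# Check the device state reported by the storage stack
esxcli storage core device list -d naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

# Check the state of every path to that device
esxcli storage core path list -d naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

# Restart the management agents (this is what we tried, without success)
/etc/init.d/hostd restart
/etc/init.d/vpxa restart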

Has anyone experienced this problem? Solved it?

We also opened an SR with VMware.

Thanks for your help !

nicmac
Contributor

I might have some helpful information here. I have a lab setup where I have broken various things many times during the learning curve and through carelessness. I ran into this particular issue today after I moved my top-level Openfiler VM's IP storage vNIC (I use VT-d to present pass-through volume sets from my Areca controller to hosts) onto another vSwitch port group in the same VLAN on its host. I was getting all kinds of errors related to this thread. The hypervisor for the host was locking up, the iSCSI devices and datastores were flapping, the VMs were in unknown status, and when I could get info from some of the datastores, or when I tried to re-add them, the wizard said they were empty. Multiple reboots of the host and filer did nothing.

I realized that the vSwitch I moved the vNIC onto did not have jumbo frames enabled, but I gather that if jumbo frames are suddenly disabled anywhere in the network path, this can happen. I have no clue whether an update would affect the jumbo frames setting on vSwitches. In any case, it seems feasible that an upgrade/update might do something to muck up the VMkernel ports or port groups related to the initiator or IP storage virtual network. Here is what I did to fix my scenario:

=========================

First, stabilize the host(s):

1. Stop iSCSI target service on filer.

2. Remove "unknown" guests from the affected host(s).  The logs should stop going nuts, but for me the vSphere client  was still very slow, so...
3. Reboot ESX host(s).

4. Unmap the LUNs from the target(s) on the filer. (I had to create entirely new targets as part of the process)
5. Make sure jumbo frames are turned on in the vSS/vDS at the switch level, port group, and/or VMkernel port for the initiator or filer. Of course this is only relevant if you have jumbo frames enabled on the filer and physical switch(es), which is what I assume (see the MTU checks at the end of this post).

6. Create a NEW target on the filer and map a LUN to it, allow one ESX host in the ACL, and start the iSCSI target service.
7. Rescan the HBA on the host. If this ultimately doesn't work, then I would start over and nuke/pave the switch, PG, VMK, etc. if not done already.

###I've performed several types of screw-ups with the iSCSI HBAs where the entire HBA/switch setup needed to be nuked and paved. If this process doesn't work, try removing the VM Kernel port(s) from the initiator(s) and removing the switches and creating them again with the relevant port group(s)/VM Kernel Port(s). Make sure jumbo frames are enabled everywhere relevant. Switch level, PG level, VMK level. Then add the new VM Kernel port(s) back to the initiator(s). All I can gather is that when something goes really bad the OS doesn't know how to deal with the existing devices or targets anymore.###

8. If the HBA devices show up normally again, check the datastores. All of mine but one out of six were not present and had to be re-added. That one showed up as "unknown (unmounted)". I tried to mount it and got an error, but then it mounted. It was probably already mounting, I guess. For the ones that I added back, I chose "Keep existing signature" in the wizard. I don't know what creating a new signature could ultimately affect, but it didn't seem like the right choice, because I think you only need to resignature a copied datastore.

I added one LUN at a time to the target and brought all six datastores back online successfully without any data loss, ending my streak of a half-dozen irreparable catastrophes. I hope this helps.
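As a footnote to step 5, this is roughly how I verify the MTU end to end from the host side (esxcfg-vswitch covers a standard vSwitch, the esxcli form is the 5.x way of setting the VMkernel port MTU, and the vSwitch/vmk names are just examples from my lab):

# Show the MTU configured on each standard vSwitch and its port groups
esxcfg-vswitch -l

# Set the vSwitch MTU to 9000
esxcfg-vswitch -m 9000 vSwitch1

# On ESXi 5.x, set the VMkernel port MTU directly
esxcli network ip interface set -i vmk1 -m 9000

# Confirm jumbo frames actually pass end to end to the filer
# (-d = do not fragment, -s = payload size just under the 9000-byte MTU)
vmkping -d -s 8972 <filer IP>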
