Same issue here, exactly the same errors in /var/log/vmkernel (0x28 errors)
ESX 4.1 fresh install
HP Blade 460 G6
EVA 4400 Controller
No solutions yet?
Hello, I have the same issues.
ESX 4.0 U2 with HP blades and Emulex cards. Our storage is an IBM SVC.
Dec 8 05:24:24 BRUS220 vmkernel: 8:18:25:40.432 cpu6:4102)NMP: nmp_CompleteCommandForPath: Command 0x2a (0x410006028a80) to NMP device "naa.6005076801900303e800000000000018" failed on physical path "vmhba1:C0:T4:L68" H:0x7 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
Dec 8 05:24:24 BRUS220 vmkernel: 8:18:25:40.432 cpu6:4102)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe: NMP device "naa.6005076801900303e800000000000018" state in doubt; requested fast path state update...
Dec 8 05:24:24 BRUS220 vmkernel: 8:18:25:40.432 cpu6:4102)ScsiDeviceIO: 747: Command 0x2a to device "naa.6005076801900303e800000000000018" failed H:0x7 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
Is any solution available yet?
Our issue ended up being a bad piece of hardware in the HP c7000 blade chassis, specifically the Virtual Connect Fibre Channel module. It took a lot of different troubleshooting steps to finally single out the specific module. The module was replaced and the errors went away. We're seeing more errors in some other chassis, so we're starting the same process on them today.
I have been following this issue in our environment for some time. Your error codes, like mine, are a little different from the others'. I'm getting: "NMP: nmp_CompleteCommandForPath: Command 0x2a". This happens on different LUNs on different hosts. Could you tell me a little more about your environment to help me troubleshoot, or tell me how you fixed your environment if you have already done so?
We have an IBM SVC running 18.104.22.168 code.
ESX hosts having the issue are vSphere 4.0 U1 and U2.
We have noticed the problem happens more on datastores on DS4800/5300 storage behind the SVC than on XIV storage behind the SVC, but we have seen the errors on both.
We use RR multipathing rather than fixed. I have not changed back to fixed for testing yet.
Our errors usually happen around 2 AM, a busy backup time, but we have not been able to correlate the issue with any particular VCB backup or any one system causing the problem. It happens when no VCB backups run and it happens when some do. Also, until recently no VM was seeing issues caused by this. Obviously this issue has moved up in priority.
We have a PMR open with IBM and a case open with VMware. I will post the solution when one comes.
We do have a 4.1 cluster running on the same SVC without these errors. It is lightly loaded as our Desktop group is still going through their VDI build/testing phase. I'm not sure 4.1 is the answer but that is the only thing that runs clean.
I wanted to post a follow-up to my previous post, where I said that we were still seeing this problem on some hosts after ruling out hardware issues. Our problem ended up being the use of LUSE (LUN Size Expansion) devices on a Hitachi USP-V array, specifically during replication of the LUN to the DR site. A few hours into replication, we would see all kinds of SCSI reservation warnings and the disk latency would go through the roof. We found a white paper from HDS that recommended not using LUSE devices for VMFS datastores. We followed that advice and have not seen any of these errors since.
I know that not everyone having this problem is on Hitachi disk, but in our case it was the disk array that was the problem. So, hopefully this info at least helps one person still having issues.
Progress is slow on this issue. We have been working daily with IBM and VMware, more so with IBM. We still aren't sure what is causing the problem, but we have significantly alleviated it.
First and foremost: anyone using HP Insight agents on their HP VM hosts should take a look at this KB article. It states that if your storage isn't HP, you should disable two of the IMA services or you could have storage issues. We have done this across the board and now we aren't seeing as many errors.
The other thing to check on your DS4800s is which version of Storage Manager you have. The newer versions of Storage Manager run a storage profiler every night at 2 AM. This basically takes inventory of your config so that the next time your DS4800 crashes and IBM support needs to recreate it, they'll have all the info they need. This info is also found in your "collect all support data" dumps. Anyway, we set that to run monthly, and we haven't seen the big destage errors (correlated to the VMware errors) on our SAN Volume Controller since.
We're running much, much better, but I'm not sold on this problem being completely gone. We are looking to upgrade the SVC to v22.214.171.124 and eventually to 6.1.
Thanks for the quick follow up. I'll have to check and make sure nobody else has the profiler installed and running. I am running ESXi 4.1 going directly to a DS4800 (no SVC). It also seems to happen early in the morning, between 2:00 AM - 5:00 AM.
What type of SAN switches are you using? Are you doing any Metro/Global Mirroring? We are running Brocade switches and do have asynchronous Global Mirror enabled. Also, are you running Trend OfficeScan by any chance? I have yet to rule out Trend as the cause, although IBM continues to say that we aren't reaching any performance limits on the DS4800.
Same problem here; the strange thing is it's only on one LUN (out of 15).
I'm connecting from 4+2 ESX hosts (2 clusters):
4x DL585 G5
2x DL380 G5
connecting to MCdata 4700
Hitachi (HDS) AMS500
Hitachi (HDS) AMS2100
I'm also only seeing the problem on one host (DL585 G5).
I've checked my SAN paths and there's no problem on that end.
Jun 15 13:54:16 bumblebee vmkernel: 36:03:58:27.856 cpu4:8927)ScsiDeviceIO: 1672: Command 0xfe to device "t10.HITACHI_999999990004" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
Jun 15 13:54:26 bumblebee vmkernel: 36:03:58:37.544 cpu4:10255)ScsiDeviceIO: 1672: Command 0xfe to device "t10.HITACHI_999999990004" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
Jun 15 13:54:36 bumblebee vmkernel: 36:03:58:48.310 cpu15:5255)ScsiDeviceIO: 1672: Command 0xfe to device "t10.HITACHI_999999990004" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
Jun 15 13:54:47 bumblebee vmkernel: 36:03:58:58.429 cpu3:8022)ScsiDeviceIO: 1672: Command 0xfe to device "t10.HITACHI_999999990004" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
Jun 15 13:54:56 bumblebee vmkernel: 36:03:59:07.834 cpu3:8020)ScsiDeviceIO: 1672: Command 0xfe to device "t10.HITACHI_999999990004" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
Jun 15 13:55:06 bumblebee vmkernel: 36:03:59:17.601 cpu3:4099)ScsiDeviceIO: 1672: Command 0xfe to device "t10.HITACHI_999999990004" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
Jun 15 13:55:17 bumblebee vmkernel: 36:03:59:28.618 cpu3:4099)ScsiDeviceIO: 1672: Command 0xfe to device "t10.HITACHI_999999990004" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
Jun 15 13:55:28 bumblebee vmkernel: 36:03:59:39.702 cpu3:4099)ScsiDeviceIO: 1672: Command 0xfe to device "t10.HITACHI_999999990004" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
Jun 15 13:55:36 bumblebee vmkernel: 36:03:59:48.209 cpu3:8746)ScsiDeviceIO: 1672: Command 0xfe to device "t10.HITACHI_999999990004" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
Jun 15 13:55:45 bumblebee vmkernel: 36:03:59:57.114 cpu3:8020)ScsiDeviceIO: 1672: Command 0xfe to device "t10.HITACHI_999999990004" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
I would make certain you aren't using LUSE devices on that HDS array. Are you replicating that LUN by any chance? You could very well have a bad HBA and not know it without doing some deep troubleshooting.
No, I don't use LUSE, but the problem went away after my running clone completed.
I think it was a problem with my cache filling up due to slow SATA disks.
We're seeing yet another hex code for this log message:
Jun 21 00:56:22 pn003 vmkernel: 152:07:45:21.178 cpu4:4100)NMP: nmp_CompleteCommandForPath: Command 0x2a (0x410008036180) to NMP device "naa.600c0ff000da7b197d2d794b01000000" failed on physical path "vmhba1:C0:T6:L0" H:0x8 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
so the code would be:
H:0x8 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
The host = 8 status translates to:
SG_ERR_DID_RESET [0x08] The SCSI bus (or this device) has been reset. Any SCSI device on a SCSI bus is capable of instigating a reset.
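For quick triage of these log lines, the status bytes can be pulled out with a short shell snippet. This is just a sketch; the lookup table below covers only the standard SCSI midlayer host codes mentioned in this thread (the sample line is hypothetical):

```shell
# Extract the H: (host status) byte from a vmkernel NMP/ScsiDeviceIO line
# and map the handful of codes seen in this thread.
line='Jun 21 00:56:22 pn003 vmkernel: ... failed on physical path "vmhba1:C0:T6:L0" H:0x8 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.'

# sed keeps only the hex value following "H:".
hostcode=$(printf '%s\n' "$line" | sed -n 's/.*H:\(0x[0-9a-f]*\).*/\1/p')

case "$hostcode" in
  0x0) echo "DID_OK - no host-side error" ;;
  0x1) echo "DID_NO_CONNECT - could not connect before timeout" ;;
  0x5) echo "DID_ABORT - the driver aborted the command" ;;
  0x7) echo "DID_ERROR - internal driver/transport error" ;;
  0x8) echo "DID_RESET - the SCSI bus or device was reset" ;;
  *)   echo "host status $hostcode - check the full SCSI midlayer host-code list" ;;
esac
# prints: DID_RESET - the SCSI bus or device was reset
```

The same approach works for the D: (device) and P: (plugin) bytes by changing the sed pattern.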
Our setup is:
HP DL380 (4x) G5, (3x) G6 and (2x) G7.
Storage is MSA2312fc (with SAS disks)
Switches are Cisco MDS 9124
ESX 4.0.0 build 332073 (the reason we're not on ESX 4.1 is the 64-bit requirement of vSphere; still working on that one).
This message is repeated over multiple hosts, but not on all hosts.
We also have other MSA datastores, none of which are showing these messages.
I am having a bit of trouble finding usable counters on the MSA. The web GUI isn't helpful at all, and the command line is not really self-explanatory, I'm afraid. Is there a simple command for checking the performance counters on the MDS CLI?
The Fibre Channel switches are showing no errors or congestion.
We do, however, perform nightly incremental backups of our VMware guests using the legacy method with Tivoli.
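On the MDS CLI question: the 9124 runs SAN-OS/NX-OS, and per-port counters can be checked from the switch CLI, roughly like this (`fc1/1` is a placeholder port; substitute the ports your HBAs and the MSA are plugged into, and note that exact output varies by software release):

```shell
# Run on the Cisco MDS 9124 CLI (SSH to the switch, not the ESX host).
show interface fc1/1 counters          # full per-port counters: CRC errors, discards, etc.
show interface fc1/1 counters brief    # condensed view, easier to scan across ports
```

Non-zero and climbing CRC/discard counters on one port would point at a cable, SFP, or HBA rather than the array.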
I have seen this on several EVA8100 arrays and on ESX4.0/4.1 HP G4/G5/G6 blades.
sense codes: H:0x0 D:0x28 P:0x0 Possible sense data: 0x0 0x0 0x0
What we did was set the following values:
Disk.QFullSampleSize => 32
Disk.QFullThreshold => 8
Disk.DiskMaxIOSize => 128 KB
We also changed the path selection policy from MRU to RR.
We upgraded the VC Fibre Channel modules and the blades' HBAs to the latest firmware versions; since then, the sense codes haven't been reported anymore.
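For anyone wanting to apply the same changes on ESX 4.x, they map to advanced disk settings and the NMP path selection policy, roughly like this. This is a service-console sketch from memory of the 4.x syntax; the option units and `naa.xxx` device ID are placeholders/assumptions, so verify against your build before applying:

```shell
# Advanced disk settings (queue-full throttling and max I/O size).
esxcfg-advcfg -s 32 /Disk/QFullSampleSize
esxcfg-advcfg -s 8  /Disk/QFullThreshold
esxcfg-advcfg -s 128 /Disk/DiskMaxIOSize    # value is in KB on 4.x

# Switch a device from MRU to Round Robin (repeat per device;
# naa.xxx is a placeholder for your device identifier).
esxcli nmp device setpolicy --device naa.xxx --psp VMW_PSP_RR

# Verify the current policy per device.
esxcli nmp device list
```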
This thread hasn't been updated in a long time, but recently we had the same problem with ESX hosts and an IBM storage array. We use ESXi 5 Update 1 in our environment. Multiple hosts access shared datastores between two sites.
All VMs in a cluster became unresponsive, the ESXi hosts became disconnected, and we saw errors in vmkernel and other logs:
2012-11-14T13:21:58.148Z cpu12:4619516)ScsiDeviceIO: 2322: Cmd(0x412440da3800) 0x9e, CmdSN 0x55510f from world 0 to dev "" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
2012-12-05T16:40:15.020Z cpu3:8195)NMP: nmp_PathDetermineFailure:2084: SCSI cmd RESERVE failed on path vmhba2:C0:T0:L22, reservation state on device XXX is unknown.
2012-12-05T16:40:15.023Z cpu12:8204)NMP: nmp_PathDetermineFailure:2084: SCSI cmd RESERVE failed on path vmhba6:C0:T0:L12, reservation state on device XXX is unknown.
To resolve this problem we had to reboot the ESXi host; we tried restarting the management agents without success.
Has anyone experienced this problem? Solved it?
We also opened an SR with VMware.
Thanks for your help!
I might have some helpful information here. I have a lab setup where I have completely broken various things a lot during the learning curve and through carelessness. I ran into this particular issue today after I moved my top-level openfiler VM's IP Storage vNIC (using VT-d to present pass-through volume sets from my areca controller to hosts) onto another vSwitch Port Group in the same VLAN on its host. I was getting all kinds of errors related to this thread. The Hypervisor for the host was locking up, the iSCSI devices and datastores were flapping, the VMs were in unknown status, and when I could get info from some of the datastores, or when I tried to re-add them, the wizard said they were empty. Multiple reboots of the host and filer did nothing.
I realized that the switch I moved the vNIC into did not have jumbo frames enabled, but I'm gathering that if jumbo frames are suddenly disabled anywhere in the network loop that this might happen. I have no clue whether an update would affect the jumbo frames setting on vSwitches. In any case, it seems feasible that an upgrade/update might do something to muck up the VM Kernel ports or Port Groups related to the initiator or IP storage virtual network. Here is what I did to fix my scenario...
First, stabilize the host(s):
1. Stop iSCSI target service on filer.
2. Remove "unknown" guests from the affected host(s). The logs should stop going nuts, but for me the vSphere client was still very slow, so...
3. Reboot ESX host(s).
4. Unmap the LUNs from the target(s) on the filer. (I had to create entirely new targets as part of the process)
5. Make sure jumbo frames are turned on in the vSS/vDS at the switch level, port group, and/or VM Kernel port for the initiator or filer. Of course this is only relevant if you have jumbo frames enabled on the filer and physical switch(es), which I assume you do.
6. Create a NEW target on the filer and map a LUN to it, allow one ESX host in the ACL, and start the iSCSI target service.
7. Rescan the HBA on the host. If this ultimately doesn't work then I would start over and nuke/pave the switch, PG, VMK, etc. if not done already.
Note: I've performed several types of screw-ups with the iSCSI HBAs where the entire HBA/switch setup needed to be nuked and paved. If this process doesn't work, try removing the VM Kernel port(s) from the initiator(s), removing the switches, and creating them again with the relevant port group(s)/VM Kernel port(s). Make sure jumbo frames are enabled everywhere relevant: switch level, PG level, VMK level. Then add the new VM Kernel port(s) back to the initiator(s). All I can gather is that when something goes really bad, the OS doesn't know how to deal with the existing devices or targets anymore.
8. If the HBA devices show up normally again, check the datastores. All but one of my 6 were not present and had to be re-added. That one showed up as "unknown (unmounted)". I tried to mount it and got an error, but then it mounted; it was probably already mounting, I guess. For the ones I added back, I chose "Keep existing signature" in the wizard. I don't know what creating a new signature could ultimately affect, but it didn't seem like the right choice, because I think you only need to resignature a copied datastore.
I added one LUN at a time to the target and brought all 6 datastores back online successfully without any data loss, ending my streak of a half-dozen irreparable catastrophes. I hope this helps.
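To check the jumbo-frame settings described above from the ESX side, something like the following works as a rough sketch for standard vSwitches (`vSwitch1` and the filer IP are placeholders for your IP-storage switch and target):

```shell
# List vSwitches and VMkernel ports with their current MTUs.
esxcfg-vswitch -l
esxcfg-vmknic -l

# Set a 9000-byte MTU on the IP-storage vSwitch (placeholder name).
esxcfg-vswitch -m 9000 vSwitch1

# End-to-end test: 8972-byte payload + 28 bytes of ICMP/IP headers = 9000,
# with -d (don't fragment) so the packet must fit the whole path.
vmkping -d -s 8972 192.168.1.10
```

If the vmkping fails while a smaller size (e.g. `-s 1472`) succeeds, something in the path is still dropping jumbo frames.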