To solve our problem (see my post from the end of July), FalconStor had to fix its IPStor storage server. There was indeed a problem with a LUN 0 that was not presented in the correct way. After applying the patch we didn't see the error again.
I wrote Virtualgeek blog posts about a vSphere 4 (and vSphere 4u1) condition that can create this state, along with two workarounds.
Not saying it is the root cause of the above noted cases, but to me it looks like it.
VMs (obviously those NOT on the lost datastore) becoming intermittently inaccessible when an APD (All Paths Dead) state is detected is a known issue. Commonly this can be triggered by yanking LUNs before removing the datastores and ESX devices, or by storage or FC/FCoE/iSCSI network issues.
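For reference, the flag-based workaround from that post (also quoted further down in this thread) is an advanced VMFS setting; a minimal sketch, assuming classic ESX 4/4u1 with esxcfg-advcfg available from the service console:

# Check the current value of the APD workaround flag
esxcfg-advcfg -g /VMFS3/FailVolumeOpenIfAPD
# Enable it so volume opens fail fast instead of hanging while paths are dead
esxcfg-advcfg -s 1 /VMFS3/FailVolumeOpenIfAPD
# Set it back to 0 (the default) once the dead paths have been cleaned up
esxcfg-advcfg -s 0 /VMFS3/FailVolumeOpenIfAPD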
You can see the post and workarounds here:
Chad Sakac, P.Eng. vExpert
VP, VMware Technology Alliance
This issue has been driving me crazy. How do you cleanly 'remove' a datastore? I would do it the 'correct' way if I knew how!
This post has gotten a bit muddled with people replying with different SCSI codes (although they all start with NMP). I'm seeing the same SCSI codes as Ted (H:0x2 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0), and I've got a VM running Windows 2003 R2 / SQL Server 2005 SP3 (all fully patched) logging messages in Event Viewer at the same time:
"SQL Server has encountered ** occurrence(s) of I/O requests taking longer than 15 seconds to complete on file in database (5). The OS file handle is 0x00000648. The offset of the latest long I/O is: 0x000000cb0f2000"
So I'm not seeing anything as horrible as some customers are having (via Chad's blog post), but it is noticeable to guests that something's not quite right.
Running ESX 4u1 and have had some shoddy support dealings with tier-1 VMware support in the past. This post is mainly to link the SQL I/O message back to a VMware storage bug. I'm hesitant to make any advanced config changes on the ESX box over a few "busy" messages, but I'd love to hear other opinions.
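For tying the guest-side SQL warnings to the host-side errors, one simple check (a sketch, assuming classic ESX where the vmkernel log is at /var/log/vmkernel) is to grep the host log for the same failure signature around the time of the Event Viewer entry:

# Look for the NMP/SCSI failures around the time of the SQL Server
# "longer than 15 seconds" warning; adjust the month/day/hour to match
# the timestamp shown in Event Viewer
grep "H:0x2 D:0x0 P:0x0" /var/log/vmkernel | grep "Mar 20 07:"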
Is there a resolution or a hotfix from VMware available so far?
We're getting the errors mentioned above when I'm trying to save a full VM via VCB. (VMs on a NetApp.)
Saving the same VM on a self-built Openfiler causes no problems. (Also connected via iSCSI.)
If you haven't already, patch with the most recent set of ESX 4 patches. I'm still verifying whether the KB below is the fix, but so far I haven't seen the errors in my vmkernel logs this morning.
It would be awesome if someone else could reply as well.
Scratch that. I'm still getting the same NMP messages. I've now learned that not every datastore available to the ESX host is setting off the error.
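One quick way to see which devices are actually throwing the errors (a sketch, assuming classic ESX and log lines like the ones quoted later in this thread):

# Count the failing commands per NMP device to see which LUNs are affected
grep "nmp_CompleteCommandForPath" /var/log/vmkernel | \
  sed 's/.*device "\(naa\.[0-9a-f]*\)".*/\1/' | sort | uniq -c | sort -rn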
The hosts are all on the same patch level (latest patches applied), but we are still getting the error (only on the NetApp storage); on the Openfiler storage everything works fine.
As Morten Dalgaard wrote:
"The error, at least for me, also seems to be load related, as it
happens more often when VCB backup is running. Actually it almost only
occurs when VCB is running."
Seems the same here: the higher the load, the more the errors. (For example while restarting VMs)
Fascinating that open-source software works fine and really expensive storage doesn't...
Help and answers from VMware really appreciated!
I also experienced severe issues with a NetApp FAS. Have you upgraded to ONTAP 7.3 or higher? That's required.
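A quick way to confirm what the filer is actually running (assuming SSH access to the NetApp and a 7-mode system where the version command exists; the hostname is just a placeholder):

# Prints the Data ONTAP release string of the filer
ssh root@filer01 version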
Thanks for your answer, I will ask our storage admin about this.
At the moment I'm also trying the suggested fix of removing the MPIO driver from the VCB proxy...
Edit: Unfortunately no improvement...
Has anyone found or gotten a solution to this issue from VMware yet? I am seeing this on my hosts too:
state in doubt, requested fast path state update
A lot of these errors:
Mar 20 07:07:55 LIC-VM16 vmkernel: 0:18:41:26.931 cpu16:4312)NMP: nmp_CompleteCommandForPath: Command 0x2a (0x4100bb1ccd40) to NMP device "naa.600601600de11a00fa6e3b607a38dd11" failed on physical path "vmhba2:C0:T0:L103" H:0x2 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
Mar 20 07:07:55 LIC-VM16 vmkernel: 0:18:41:26.931 cpu16:4312)ScsiDeviceIO: 747: Command 0x2a to device "naa.600601600de11a00fa6e3b607a38dd11" failed H:0x2 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
Though mostly on one LUN at a time, so it seems the whole storage bus is not out, even though the error is host busy.
The virtual machines sometimes become unresponsive for a brief period of time.
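One rough way to check whether these failures cluster around the VCB backup window or VM restart storms mentioned earlier (a sketch, assuming classic ESX with the vmkernel log at /var/log/vmkernel):

# Bucket the NMP failures per hour and count them
grep "nmp_CompleteCommandForPath" /var/log/vmkernel | \
  awk '{print $1, $2, substr($3,1,2)":00"}' | sort | uniq -c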
Doing a fresh install of ESXi 4, have the latest ESXi firmware, and am still experiencing this issue after completing an SRM test (so I am unable to operationally remove the datastore).
Was this issue not patched / resolved with ESXi 4?
Opened a ticket with VMware support. Evidently this issue is still open on certain arrays (IBM DS is evidently one of them; the DS4700 is what we are hitting it on). Setting the APD advanced flag (esxcfg-advcfg -s 1 /VMFS3/FailVolumeOpenIfAPD), as listed on http://virtualgeek.typepad.com/virtual_geek/2009/12/an-important-vsphere-4-storage-bug-and-workaround.html and recommended by our support agent, did not resolve the issue on this array. According to VMware support, they are still working with certain array vendors to fix this issue.
The only solution for us was to divide the ESX boxes into two clusters, one with the "old" 3.5, the other with 4.0 U1.
I tried some test boxes after several updates, but the errors still exist.
We're having the same issue running 4.0 U2 on HP blades using Emulex cards attached to Hitachi storage. Did anyone ever come up with a proper solution to this issue? Anyone else on Hitachi storage experience the same problem?