VMware Cloud Community
tdubb123
Expert

Lost access to volume, restored access to volume

I am constantly getting these messages on all my hosts, and it looks like there is some kind of storage problem. Any ideas?

DavoudTeimouri
Virtuoso

Hi,

Please share your vmkernel.log.

Also please check these KBs:

  1. VMware KB: Understanding SCSI host-side NMP errors/conditions in ESX 4.x and ESXi 5.x
  2. VMware KB: Understanding SCSI device/target NMP errors/conditions in ESX/ESXi 4.x and ESXi 5.x

BR

-------------------------------------------------------------------------------------
Davoud Teimouri - https://www.teimouri.net - Twitter: @davoud_teimouri Facebook: https://www.facebook.com/teimouri.net/
tomtom901
Commander

The vmkernel.log might help, so yes, please share it. Is it only LUN 50, or do other LUNs on the same storage array report problems as well?
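In the meantime, a quick way to map the naa.* device IDs from those messages back to datastore names and LUN numbers is to list the VMFS extents from an ESXi shell (just a suggestion; the command should be available on ESXi 5.x):

# shows each datastore name with its backing device ID and partition
esxcli storage vmfs extent list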

tdubb123
Expert

Other LUNs too.

In vmkernel.log I get this constantly:

2014-03-16T12:44:02.508Z cpu17:13470)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "naa.600601605a9128002ab13a8fccfae111" state in doubt; requested fast path state update...

2014-03-16T12:44:02.508Z cpu17:13470)ScsiDeviceIO: 2300: Cmd(0x412446aaa2c0) 0x2a, CmdSN 0x8000002d from world 7939633 to dev "naa.600601605a9128002ab13a8fccfae111" failed H:0x8 D:0x0 P:0x0

2014-03-16T12:44:25.649Z cpu8:58454)NMP: nmp_ThrottleLogForDevice:2319: Cmd 0x2a (0x4124003e86c0, 58454) to dev "naa.60060160de051b0096317d9c4a3fde11" on path "vmhba2:C0:T0:L6" Failed: H:0x8 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0. Act:EVAL

2014-03-16T12:44:25.649Z cpu8:58454)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "naa.60060160de051b0096317d9c4a3fde11" state in doubt; requested fast path state update...

2014-03-16T12:44:25.649Z cpu8:58454)ScsiDeviceIO: 2300: Cmd(0x4124003e86c0) 0x2a, CmdSN 0x8000000a from world 58454 to dev "naa.60060160de051b0096317d9c4a3fde11" failed H:0x8 D:0x0 P:0x0

2014-03-16T12:44:25.649Z cpu8:58454)ScsiDeviceIO: 2300: Cmd(0x412404de90c0) 0x2a, CmdSN 0x80000035 from world 58454 to dev "naa.60060160de051b0096317d9c4a3fde11" failed H:0x8 D:0x0 P:0x0

2014-03-16T12:44:33.904Z cpu12:59174)NMP: nmp_ThrottleLogForDevice:2319: Cmd 0x2a (0x41244080b180, 8293) to dev "naa.600601605a91280048334c931e89e211" on path "vmhba2:C0:T2:L0" Failed: H:0x8 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0. Act:EVAL

2014-03-16T12:44:33.904Z cpu12:59174)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "naa.600601605a91280048334c931e89e211" state in doubt; requested fast path state update...

2014-03-16T12:44:33.904Z cpu12:59174)ScsiDeviceIO: 2318: Cmd(0x41244080b180) 0x2a, CmdSN 0x4b59d8 from world 8293 to dev "naa.600601605a91280048334c931e89e211" failed H:0x8 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.

2014-03-16T12:44:33.904Z cpu23:8293)BC: 4526: Failed to flush 28 buffers of size 8192 each for object 'vpxa.log' f530 28 3 51527d09 79d419b7 2500d8a0 fe0a11b5 7002b04 1f 0 0 0 0 0: IO was aborted

DavoudTeimouri
Virtuoso

Hi,

Based on your log and the VMware KB "Understanding SCSI host-side NMP errors/conditions in ESX 4.x and ESXi 5.x", the issue is related to your fabric or to an overloaded storage array. Check your FC connections and monitor the load on your storage array SPs.

VMK_SCSI_HOST_RESET = 0x08 or 0x8

vmkernel: 0:19:26:42.068 cpu0:4103)NMP: nmp_CompleteCommandForPath: Command 0x28 (0x4100070e8e80) to NMP device "naa.60060480000190101883533030323731" failed on physical path "vmhba2:C0:T1:L27" H:0x8 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.

This status is returned when the HBA driver has aborted the I/O. It can also occur if the HBA does a reset of the target.
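If it helps, a rough way to see how many devices are affected and how often is to count the H:0x8 failures per device in the current vmkernel log. This is just a sketch; it assumes the default log location and that the busybox tools on the host support these options:

# count host-status 0x8 (abort/reset) failures per device
grep "failed H:0x8" /var/log/vmkernel.log | grep -o 'naa\.[0-9a-f]*' | sort | uniq -c | sort -rn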


Also please read this blog from VMware: Storage Performance and Timeout Issues | VMware Support Insider - VMware Blogs


-------------------------------------------------------------------------------------
Davoud Teimouri - https://www.teimouri.net - Twitter: @davoud_teimouri Facebook: https://www.facebook.com/teimouri.net/
tomtom901
Commander

Doesn't look that good. I'm guessing that naa.600601605a9128002ab13a8fccfae111 is the device ID of a LUN on your ESXi host. "State in doubt" means that a SCSI command against the array timed out. This can have multiple causes, but the most common one is an overloaded storage array. Could that be the case here? If possible, you could use Storage vMotion to move some VMs off the array that is reporting these errors for its LUNs and see if that helps.
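It's also worth confirming that all paths to one of the affected devices are still active. For example (device ID taken from your log output, adjust as needed):

# path state for the affected device
esxcli storage core path list -d naa.600601605a9128002ab13a8fccfae111

# multipathing (SATP/PSP) view of the same device
esxcli storage nmp device list -d naa.600601605a9128002ab13a8fccfae111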

Did you see anything weird on the storage array hosting these LUNs? If you have an active support contract, this might be the moment to call the vendor. 🙂

a_p_
Leadership

I don't remember whether the error messages were exactly the same, but I had an issue like yours in a customer's environment some time ago, and it was caused by a bad fibre cable. You may want to start by checking the physical switch ports for CRC errors.
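The exact command depends on the switch vendor and firmware; as rough pointers rather than exact syntax for your model:

# Brocade FOS: per-port error counters, including crc_err
porterrshow

# Cisco MDS: interface error counters, including CRC
show interface counters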

André

lane0550
Contributor

Hi, were you able to find a resolution to this issue? We are experiencing the same errors in our VMware, Cisco UCS, and EMC VNX 7500 setup, and so far there has been no resolution; the vendors have just been pointing fingers at each other. I would really appreciate an update if anything you tried made a difference.

kastlr
Expert

Hi,

Starting with ESXi 5.5 Update 2, VMware changed the way it handles VMFS heartbeat checks.

Prior to that version, VMware used classic read and write I/Os.

From 5.5 U2 on, VMware uses VAAI ATS commands to perform the VMFS heartbeat checks.

You should disable ATS heartbeats on all nodes of your ESXi cluster; based on my personal experience, this will stabilize the storage environment. An example command is shown below.
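For reference, the relevant advanced option from the VMware KB can be changed per host from an ESXi shell. This is only a sketch of the documented procedure; please verify it against the KB for your exact ESXi build before applying it:

# disable ATS-based VMFS heartbeating on VMFS5 datastores (revert with -i 1)
esxcli system settings advanced set -i 0 -o /VMFS3/UseATSForHBOnVMFS5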

Both VMware and EMC (and several other storage vendors) have KB articles covering these problems.

ESXi host loses connectivity to a VMFS3 and VMFS5 datastore (2113956)

Random temporary loss of connection to single storage devices on ESXi hosts

Kind regards

Ralf


Hope this helps a bit.
Greetings from Germany. (CEST)
lane0550
Contributor

Hi,

We already have the VAAI ATS heartbeat disabled on the hosts, and we still continue to see these messages, together with huge latency spikes and frequent "lost access to volume" entries in the logs.
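For reference, the setting was checked per host along these lines (an Int Value of 0 should mean the ATS heartbeat is off):

esxcli system settings advanced list -o /VMFS3/UseATSForHBOnVMFS5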

2016-12-20T04:28:42.522Z cpu2:33209)<7>fnic : 1 :: Abort Cmd called FCID 0x412500, LUN 0x0 TAG ea flags 3

2016-12-20T04:28:42.522Z cpu2:33209)<7>fnic : 1 :: abts cmpl recd. id 234 status FCPIO_SUCCESS

2016-12-20T04:28:42.522Z cpu2:33209)<7>fnic : 1 :: Returning from abort cmd type 2 SUCCESS

cpu34:33350)NMP: nmp_ThrottleLogForDevice:3298: Cmd 0x2a (0x439e021bb500, 33006) to dev "naa.6006016009a03100dafcdb38315fe611" on path "vmhba1:C0:T0:L0" Failed: H:0x8 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0. Act:EVAL

2016-12-20T04:28:42.522Z cpu34:33350)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "naa.6006016009a03100dafcdb38315fe611" state in doubt; requested fast path state update...

2016-12-20T04:28:42.522Z cpu34:33350)ScsiDeviceIO: 2651: Cmd(0x439e021bb500) 0x2a, CmdSN 0xf356 from world 33006 to dev "naa.6006016009a03100dafcdb38315fe611" failed H:0x8 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.

2016-12-20T04:28:42.531Z cpu4:33006)BC: 4992: Failed to flush 2 buffers of size 8192 each for object 'vpxa.log' fa7 36 4 57b4d667 d3a81540 25004d53 6a011b5 8 38 abe34911 31 0 0 0: Timeout

Anything else we should be looking at?

kastlr
Expert

Hi,

AFAIK, VMware declares "lost access to volume" when a VMFS heartbeat check (performed by each host on each VMFS datastore at an interval of 3 seconds) isn't answered within 8 seconds.

The excerpt you added shows that write I/Os seem to be timing out, which causes the host to send aborts to the array.

Writes are usually absorbed by the array's cache, so have VCE check the VNX performance statistics to confirm the array is performing well.

If you're using any kind of replication (like MirrorView or RecoverPoint), this should also be checked.

VCE should also be able to see the SCSI aborts sent from the hosts and whether anything happens on the array within those timeframes.

On the ESXi side, you should check the vobd.log files from your servers for indications of high response times.
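For example, something along these lines should surface latency and heartbeat related events (assuming the default log location; adjust for rotated logs):

# latency / heartbeat / connectivity events reported by vobd
grep -iE "latency|heartbeat|lost access" /var/log/vobd.log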

You should also check the vmkernel.log files for SCSI failures, and whether those failures are reported by the host (H: status != 0) or by the array/LUN (D: status != 0).
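A rough way to separate the two from the ESXi shell (a sketch only; the patterns rely on the usual "H:0x? D:0x? P:0x?" log format):

# commands that failed with a non-zero host status (H:)
grep "failed H:" /var/log/vmkernel.log | grep -v "H:0x0 "

# commands that failed with a non-zero device status (D:), i.e. reported by the array/LUN
grep "failed H:" /var/log/vmkernel.log | grep -v "D:0x0 "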

It's also possible to figure out what kind of I/Os are affected; in your excerpt, Cmd 0x2a is a normal WRITE(10) I/O.

Depending on the type of HBA in use, you should also check the statistics of those HBAs.
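If those are FC HBAs, newer ESXi builds also expose basic link and error counters through esxcli; I believe this namespace exists from 5.5 on, but how much it returns depends on the driver (e.g. fnic on UCS):

# list FC adapters, then dump their statistics
esxcli storage san fc list
esxcli storage san fc stats get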

All of these tasks (and many more) can be performed by analyzing the vm-support output file.

Regards,

Ralf


Hope this helps a bit.
Greetings from Germany. (CEST)