VMware Cloud Community
tdubb123
Expert

Lost access to volume - successfully restored access to volume

On my ESXi hosts, I see these messages under Events.

The disconnect and recovery happen at exactly the same time.

I can access the LUN just fine. Any ideas?

12 Replies
SG1234
Enthusiast

Is it iSCSI or FC? Does it happen on all ESX hosts in the farm or only on specific hosts? Is it common to all the LUNs or just a few? What type of storage array is it?

DavoudTeimouri
Virtuoso

Check your vmkernel log; you should find SCSI sense codes there that point to the storage connection problem.

Check this article:

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=103038...

VMware KB: Understanding SCSI host-side NMP errors/conditions in ESX 4.x and ESXi 5.x
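To read those codes in bulk, here is a minimal Python sketch that tallies the host status (the H: field) across vmkernel log lines. The name table is an assumption written from memory of the KB above and is deliberately partial, so check it against the article before relying on it. The function name host_status_counts is hypothetical.

```python
import re
from collections import Counter

# Partial host-status table. ASSUMPTION: names recalled from the VMware KB on
# NMP host-side errors; verify each value against the article before use.
HOST_STATUS = {
    0x0: "OK",
    0x1: "NO_CONNECT",
    0x2: "BUS_BUSY",
    0x3: "TIMEOUT",
    0x5: "ABORT",
    0x7: "ERROR",
    0x8: "RESET",
}

# Matches the "H:0x3 D:0x0 P:0x0" triple that vmkernel prints on failed commands.
STATUS_RE = re.compile(r"H:(0x[0-9a-fA-F]+) D:(0x[0-9a-fA-F]+) P:(0x[0-9a-fA-F]+)")


def host_status_counts(lines):
    """Count how often each host status appears in the given log lines."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match is None:
            continue  # line carries no status triple
        code = int(match.group(1), 16)
        counts[HOST_STATUS.get(code, f"UNKNOWN({code:#x})")] += 1
    return counts
```

Feeding it the lines of /var/log/vmkernel.log (for example via open(...).readlines()) gives a quick picture of whether timeouts or aborts dominate.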

-------------------------------------------------------------------------------------
Davoud Teimouri - https://www.teimouri.net - Twitter: @davoud_teimouri Facebook: https://www.facebook.com/teimouri.net/
FritzBrause
Enthusiast

Follow VMware KB: Host Connectivity Degraded in ESX/ESXi.

Check your logs /var/log/vmkernel.log and /var/log/vmkwarning.log.

sgunelius
Hot Shot

What version of ESXi are you running?  I was seeing this a lot more often when I was still running some 4.1 hosts, but don't seem to see that behavior as often now that we're running 5.x.

tdubb123
Expert

This is running 5.1.

tdubb123
Expert

I am seeing a lot of these:

2014-07-18T05:27:07.782Z cpu6:8375)VMW_SATP_ALUA: satp_alua_issueCommandOnPath:647: Path "vmhba1:C0:T1:L24" (UP) command 0xa3 failed with status Timeout. H:0x3 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.

2014-07-18T05:27:11.217Z cpu1:8229)HBX: 255: Reclaimed heartbeat for volume 50fa3f70-d7deffa5-86e4-0025b5110aff (cx4960-fc-r1-lun83): [Timeout] [HB state abcdef02 offset 4128768 gen 15069 stampUS 8295912102496 uuid 534a1994-26e71eaf-c0c4-0025b5110a7f jr$

2014-07-18T05:27:11.219Z cpu1:8229)FS3Misc: 1465: Long VMFS rsv time on 'cx4960-fc-r1-lun83' (held for 5296 msecs). # R: 1, # W: 1 bytesXfer: 2 sectors

2014-07-18T10:39:52.782Z cpu14:11136)FS3Misc: 1465: Long VMFS rsv time on 'ESX LUN 19' (held for 284 msecs). # R: 1, # W: 1 bytesXfer: 9 sectors

2014-07-18T10:40:07.888Z cpu12:9295)FS3Misc: 1465: Long VMFS rsv time on 'ESX LUN 29' (held for 244 msecs). # R: 1, # W: 1 bytesXfer: 9 sectors

2014-07-18T12:10:56.719Z cpu14:1280799)FS3Misc: 1465: Long VMFS rsv time on 'ESX LUN 19' (held for 471 msecs). # R: 1, # W: 1 bytesXfer: 9 sectors

2014-07-18T17:45:49.864Z cpu19:8211)ScsiDeviceIO: 2331: Cmd(0x4124473dca00) 0x1a, CmdSN 0x8da9ad from world 0 to dev "naa.60060160de051b007e6f3f82048ce111" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.

2014-07-18T17:46:00.025Z cpu16:14908860)ScsiDeviceIO: 2331: Cmd(0x4124473dca00) 0x1a, CmdSN 0x8daa96 from world 0 to dev "naa.60060160de051b00269e16130d21dd11" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.

2014-07-18T17:46:03.336Z cpu2:1281301)Vol3: 705: Couldn't read volume header from control: Not supported

2014-07-18T17:46:03.336Z cpu2:1281301)Vol3: 705: Couldn't read volume header from control: Not supported

2014-07-18T17:46:03.336Z cpu2:1281301)FSS: 4972: No FS driver claimed device 'control': Not supported

2014-07-18T17:46:05.972Z cpu11:1281301)VC: 1547: Device rescan time 20780 msec (total number of devices 54)

2014-07-18T17:46:05.972Z cpu11:1281301)VC: 1550: Filesystem probe time 4704 msec (devices probed 35 of 54)

2014-07-18T17:46:16.043Z cpu9:12133)ScsiDeviceIO: 2331: Cmd(0x41240e82c940) 0x1a, CmdSN 0x8dad77 from world 0 to dev "naa.60060160de051b007e6f3f82048ce111" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
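To see which volumes held reservations longest, the "Long VMFS rsv time" lines above can be summarized with a short Python sketch. It assumes the message format shown in these log lines; the helper name longest_reservations is hypothetical.

```python
import re

# Matches e.g.: Long VMFS rsv time on 'ESX LUN 19' (held for 284 msecs)
RSV_RE = re.compile(r"Long VMFS rsv time on '([^']+)' \(held for (\d+) msecs\)")


def longest_reservations(lines):
    """Return {volume name: longest reservation hold in msecs}."""
    worst = {}
    for line in lines:
        match = RSV_RE.search(line)
        if match is None:
            continue  # not a reservation warning
        volume, held_ms = match.group(1), int(match.group(2))
        worst[volume] = max(held_ms, worst.get(volume, 0))
    return worst
```

Sorting the result by value shows at a glance whether one LUN stands out or the delays are spread across all of them.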

Then I found this

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=103340...

but I am already running an even later version of the fnic and enic drivers for the Cisco hardware.

Any ideas?

JPM300
Commander

Is the HBA on the Cisco blade up to the latest firmware? Often, when the Cisco UCS platform is updated and the blades get updated, some of the other firmware aside from the CMC and adapter is left at older versions. Could it be a driver/firmware mismatch issue? You could also open a TAC case with Cisco and bring this to their attention; they may even have something logged on it by now.


Let us know,

tdubb123
Expert

Do you think I need to play around with the queue depth and Disk.SchedNumReqOutstanding values?

SG1234
Enthusiast

It depends on whether your array is maxing out on queue length, so that the HBA has to queue up the I/O requests. Otherwise there is not much use in tweaking the queue depth, I think.

HTH,

~Sai Garimella

FritzBrause
Enthusiast

The screenshot in the first post does not show a time stamp.

What is the corresponding entry in the logs?

The log you supplied does not necessarily indicate any problem.

A temporary loss of access is not necessarily a problem.

If for example it is during the night when backup jobs are running or AV scans, I/O latency usually gets higher.
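One way to check that is to bucket the warnings by hour and compare against the backup or scan schedule. Below is a minimal Python sketch; it assumes each log line starts with an ISO timestamp in UTC (the trailing Z in vmkernel.log), and events_per_hour is a hypothetical helper name.

```python
import re
from collections import Counter

# Captures the hour from a leading timestamp such as 2014-07-18T05:27:11.219Z
TS_RE = re.compile(r"^\d{4}-\d{2}-\d{2}T(\d{2}):\d{2}:\d{2}")


def events_per_hour(lines, keyword):
    """Count lines containing `keyword`, grouped by UTC hour of day."""
    hours = Counter()
    for line in lines:
        match = TS_RE.match(line)
        if match and keyword in line:
            hours[int(match.group(1))] += 1
    return hours
```

Remember to convert the UTC hours to local time before comparing them with the job schedule.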

Normally it is recommended to upgrade to the latest drivers/FW (HBA, array).

Also distribute load between the LUNs.

Make sure the correct path policy is used.

Do not play around with the queue depth and other parameters; it is not recommended.

JPM300
Commander

Another possible cause: are you reaching the maximum number of paths on your hosts? I have seen larger UCS environments, with 8 paths per LUN and 150+ LUNs or so, max out their paths per host. This would cause a LUN to disappear or lose paths, or prevent new LUNs from being added.

http://www.vmware.com/pdf/vsphere5/r55/vsphere-55-configuration-maximums.pdf

How many paths per host do you have?

This maximum is also not per SAN; it is a cumulative total of every path to that host. So if you have 3 SANs connected to the host, it doesn't matter how many paths each one uses; only the total counts.
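As a rough illustration of that arithmetic, here is a small Python sketch. The limit of 1024 paths per host is what I recall from the vSphere 5.x configuration maximums and should be checked against the PDF linked above; the array names and LUN counts are made-up example values.

```python
# ASSUMPTION: 1024 total paths per host (vSphere 5.x configuration maximum);
# verify against the configuration maximums document for your version.
MAX_PATHS_PER_HOST = 1024


def total_paths(luns_by_array, paths_per_lun):
    """Sum paths across all arrays; the host limit applies to the grand total."""
    return sum(luns * paths_per_lun for luns in luns_by_array.values())


def exceeds_limit(luns_by_array, paths_per_lun, limit=MAX_PATHS_PER_HOST):
    """True if the host would need more paths than the limit allows."""
    return total_paths(luns_by_array, paths_per_lun) > limit


# Hypothetical host: three arrays, 150 LUNs in total, 8 paths per LUN.
example = {"array_a": 60, "array_b": 50, "array_c": 40}
print(total_paths(example, 8), exceeds_limit(example, 8))
```

With 150 LUNs at 8 paths each, the host needs 1200 paths, which is over the assumed limit even though no single array comes close to it.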

virtualworld199
Contributor

These are the reasons why this happens:

  • After a storage device has been unexpectedly unpresented from the storage array, you are unable to mount it again.
  • This issue occurs when a virtual machine was running when the storage device went offline.
  • An ESXi 5.x host cannot mount the storage after the LUN is online again.

Error codes -

cpu34:5590)VC: 1449: Device rescan time 165 msec (total number of devices 75)

cpu34:5590)VC: 1452: Filesystem probe time 504 msec (devices probed 48 of 75)

cpu38:5590)ScsiDevice: 4592: naa.6006016058201700354179be0c6fdf11 device :Open count > 0, cannot be brought online

cpu34:5590)Vol3: 647: Couldn't read volume header from control: Invalid handle

cpu34:5590)FSS: 4333: No FS driver claimed device 'control': Not supported

cpu38:5590)ScsiDeviceIO: 2316: Cmd(0x4124c0ea2e80) 0x28, CmdSN 0x70509 to dev "naa.6006016058201700354179be0c6fdf11" failed H:0x1 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.

Please follow the resolution below:

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=201415...

To resolve this issue:

  1. Run this command to see the world that has the device open for the LUN:

    #esxcli storage core device world list -d naa_id

    For example:

    #esxcli storage core device world list -d naa.6006016058201700354179be0c6fdf11

    You see output similar to:

    Device                                World ID  Open Count  World Name
    ------------------------------------  --------  ----------  ----------
    naa.6006016058201700354179be0c6fdf11      2060           1  idle0

    If a VMFS volume is using the device indirectly, the world name includes the string idle0. If a virtual machine uses the device as an RDM, the virtual machine World ID is displayed. If any other process is using the raw device, the corresponding information is displayed.

    Notes:
    • If the host is not responding, run the command esxcfg-scsidevs -m | grep naa.id to get the corresponding datastore name.
    • Ensure all virtual machines registered on the volume in a PDL state do not require any further steps. If you have a virtual machine in that state, attempting to Retry or Cancel an operation will not return the virtual machine world ID. Click Cancel as the Retry operation cannot succeed unless the volume is remounted.

  2. Run this command to list all virtual machines running on the ESXi 5.x host and identify the virtual machine registered on that LUN:

    #esxcli vm process list

  3. To kill the virtual machine World ID, run this command:

    #esxcli vm process kill --type=force --world-id=World ID

    For example:

    #esxcli vm process kill --type=force --world-id=12131

  4. Rescan the storage using this command:

    #esxcfg-rescan -u vmhba#

  5. Run this command to see the device state:

    #esxcli storage core device list -d naa-id

  6. If the issue persists, reboot the ESXi 5.x host where virtual machine was registered.
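If you need to check many devices, the output of step 1 can be filtered with a short Python sketch like the one below. It assumes the four-column layout shown in the sample output and that device identifiers start with "naa." (other prefixes exist and would need adding); open_worlds is a hypothetical helper name.

```python
def open_worlds(output):
    """Return (device, world_id, open_count, world_name) for rows with open_count > 0.

    `output` is the text printed by `esxcli storage core device world list`.
    """
    rows = []
    for line in output.splitlines():
        parts = line.split()
        # Data rows start with the device identifier; this skips the header,
        # the dashed rule, and blank lines. ASSUMPTION: identifiers start with "naa."
        if len(parts) < 4 or not parts[0].startswith("naa."):
            continue
        device, world_id, open_count = parts[0], int(parts[1]), int(parts[2])
        world_name = " ".join(parts[3:])
        if open_count > 0:
            rows.append((device, world_id, open_count, world_name))
    return rows
```

A world named idle0 points to a VMFS volume using the device indirectly, as described in step 1, so only the other world IDs are candidates for step 3.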

If you have any questions, Please let me know, I will try my best to answer it.

Thank you.