Is it iSCSI or FC? Does it happen on all ESXi hosts in the farm or only on specific hosts? Is it common to all the LUNs or just a few? What type of storage array is it?
Check your vmkernel log; you may find SCSI sense codes there pointing to a storage connectivity problem.
Check these articles:
VMware KB: Understanding SCSI host-side NMP errors/conditions in ESX 4.x and ESXi 5.x
Follow VMware KB: Host Connectivity Degraded in ESX/ESXi.
Check your logs /var/log/vmkernel.log and /var/log/vmkwarning.log.
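If it helps, a quick way to see which SCSI status codes dominate in those logs is to extract the H:/D:/P: triplets and count them (the on-host one-liner is in the comment; demonstrated here on a sample line in the 5.x format, since I can't run it against your host):

```shell
# On the host, summarize the status triplets seen in the vmkernel log:
#   grep -o 'H:0x[0-9a-f]* D:0x[0-9a-f]* P:0x[0-9a-f]*' /var/log/vmkernel.log | sort | uniq -c
# Demonstrated on a sample line in the ESXi 5.x log format:
line='2014-07-18T17:45:49.864Z cpu19:8211)ScsiDeviceIO: 2331: Cmd(0x4124473dca00) 0x1a failed H:0x5 D:0x0 P:0x0'
echo "$line" | grep -o 'H:0x[0-9a-f]* D:0x[0-9a-f]* P:0x[0-9a-f]*'
```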
What version of ESXi are you running? I saw this a lot more often when I was still running some 4.1 hosts, but I don't seem to see that behavior as often now that we're running 5.x.
This is running 5.1.
I am seeing a lot of these:
2014-07-18T05:27:07.782Z cpu6:8375)VMW_SATP_ALUA: satp_alua_issueCommandOnPath:647: Path "vmhba1:C0:T1:L24" (UP) command 0xa3 failed with status Timeout. H:0x3 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
2014-07-18T05:27:11.217Z cpu1:8229)HBX: 255: Reclaimed heartbeat for volume 50fa3f70-d7deffa5-86e4-0025b5110aff (cx4960-fc-r1-lun83): [Timeout] [HB state abcdef02 offset 4128768 gen 15069 stampUS 8295912102496 uuid 534a1994-26e71eaf-c0c4-0025b5110a7f jr$
2014-07-18T05:27:11.219Z cpu1:8229)FS3Misc: 1465: Long VMFS rsv time on 'cx4960-fc-r1-lun83' (held for 5296 msecs). # R: 1, # W: 1 bytesXfer: 2 sectors
2014-07-18T10:39:52.782Z cpu14:11136)FS3Misc: 1465: Long VMFS rsv time on 'ESX LUN 19' (held for 284 msecs). # R: 1, # W: 1 bytesXfer: 9 sectors
2014-07-18T10:40:07.888Z cpu12:9295)FS3Misc: 1465: Long VMFS rsv time on 'ESX LUN 29' (held for 244 msecs). # R: 1, # W: 1 bytesXfer: 9 sectors
2014-07-18T12:10:56.719Z cpu14:1280799)FS3Misc: 1465: Long VMFS rsv time on 'ESX LUN 19' (held for 471 msecs). # R: 1, # W: 1 bytesXfer: 9 sectors
2014-07-18T17:45:49.864Z cpu19:8211)ScsiDeviceIO: 2331: Cmd(0x4124473dca00) 0x1a, CmdSN 0x8da9ad from world 0 to dev "naa.60060160de051b007e6f3f82048ce111" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
2014-07-18T17:46:00.025Z cpu16:14908860)ScsiDeviceIO: 2331: Cmd(0x4124473dca00) 0x1a, CmdSN 0x8daa96 from world 0 to dev "naa.60060160de051b00269e16130d21dd11" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
2014-07-18T17:46:03.336Z cpu2:1281301)Vol3: 705: Couldn't read volume header from control: Not supported
2014-07-18T17:46:03.336Z cpu2:1281301)Vol3: 705: Couldn't read volume header from control: Not supported
2014-07-18T17:46:03.336Z cpu2:1281301)FSS: 4972: No FS driver claimed device 'control': Not supported
2014-07-18T17:46:05.972Z cpu11:1281301)VC: 1547: Device rescan time 20780 msec (total number of devices 54)
2014-07-18T17:46:05.972Z cpu11:1281301)VC: 1550: Filesystem probe time 4704 msec (devices probed 35 of 54)
2014-07-18T17:46:16.043Z cpu9:12133)ScsiDeviceIO: 2331: Cmd(0x41240e82c940) 0x1a, CmdSN 0x8dad77 from world 0 to dev "naa.60060160de051b007e6f3f82048ce111" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
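For anyone reading along: the H:0x.. field in those lines is the host-side (HBA driver) status, and it follows the standard SCSI host codes, so the H:0x3 entries are driver timeouts and the H:0x5 entries are aborted commands. A small lookup sketch (code-to-name mapping only; the phrasing of the descriptions is mine):

```shell
# Decode the host-side status byte (H:0x..) from vmkernel.log lines.
decode_host_status() {
  case "$1" in
    0x0) echo "OK" ;;
    0x1) echo "NO_CONNECT - no path to the device" ;;
    0x3) echo "TIME_OUT - command timed out at the driver" ;;
    0x5) echo "ABORT - command aborted (often after a timeout)" ;;
    0x7) echo "ERROR - internal driver error" ;;
    *)   echo "unknown" ;;
  esac
}
decode_host_status 0x3
decode_host_status 0x5
```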
Then I found this
but I am already running an even later version of the fnic and enic drivers for the Cisco hardware.
Any ideas?
Is the HBA on the Cisco blade on the latest firmware? Often when the Cisco UCS platform is updated, some of the blade firmwares other than the CIMC and adapter are left at older versions. Could it be a driver/firmware mismatch issue? You could also open a TAC case with Cisco and bring this to their attention; they may already have something logged on it by now.
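To compare what's loaded against Cisco's UCS interoperability matrix, you can pull the driver versions from the host (the on-host commands are in the comments; the version strings below are made-up samples for illustration, not recommendations):

```shell
# On a live host:
#   esxcli software vib list | grep -E '(fnic|enic)'
#   vmkload_mod -s fnic | grep -i version
# Sample lines in the shape of 'esxcli software vib list' output:
vibs='net-enic   2.1.2.38-1OEM.550.0.0   Cisco   VMwareCertified   2014-07-01
scsi-fnic  1.6.0.12-1OEM.550.0.0  Cisco  VMwareCertified  2014-07-01'
# Pull just the fnic driver version to check against the matrix row
# for your UCS firmware bundle:
fnic_ver=$(echo "$vibs" | awk '/fnic/ {print $2}')
echo "$fnic_ver"
```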
Let us know,
Do you think I need to play around with the queue depth and Disk.SchedNumReqOutstanding values?
That depends on whether your array is maxing out on queue length, so that the HBA has to queue up the I/O requests; otherwise there is not much use in tweaking the queue depth, I think.
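Worth knowing before tweaking anything: on a shared datastore the effective per-LUN limit is the smaller of the HBA LUN queue depth and Disk.SchedNumReqOutstanding, so raising one without the other does nothing. A sketch with assumed example values (the commands in the comments read the live values on the host):

```shell
# On the host:
#   esxcli storage core device list -d <naa.id> | grep -i "queue depth"
#   esxcli system settings advanced list -o /Disk/SchedNumReqOutstanding
hba_qdepth=64   # example HBA LUN queue depth (driver default varies; assumed)
dsnro=32        # ESXi 5.x default for Disk.SchedNumReqOutstanding
if [ "$hba_qdepth" -lt "$dsnro" ]; then
  effective=$hba_qdepth
else
  effective=$dsnro
fi
echo "$effective"   # the limit VMs sharing the LUN actually see
```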
HTH,
~Sai Garimella
The screenshot in first post does not show a time stamp.
What is the corresponding entry in the logs?
The log you supplied does not necessarily indicate any problem.
A temporary loss of access is not necessarily a problem.
If, for example, it happens at night when backup jobs or AV scans are running, I/O latency usually gets higher.
Normally it is recommended to upgrade to latest drivers/FW (HBA, array).
Also distribute load between the LUNs.
Make sure the correct path policy is used.
Do not play around with qdepth and other parameters; it is not recommended.
Another reason this could be happening: are you reaching the maximum path count on your hosts? I have seen larger UCS environments with 8 paths per LUN and 150+ LUNs max out the maximum paths per host. That can cause a LUN to disappear, paths to be lost, or new LUNs to fail to appear.
http://www.vmware.com/pdf/vsphere5/r55/vsphere-55-configuration-maximums.pdf
How many paths per host do you have ?
This max path number also isn't per SAN; it's a cumulative total of every path to that host. So if you have 3 SANs connected to the host, it doesn't matter how many paths each one uses; it's a single total.
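The arithmetic against the configuration-maximums document is quick to check: ESXi 5.x allows 1024 total paths per host, so at 8 paths per LUN the ceiling is reached well before 150 LUNs (the on-host count command is in the comment):

```shell
# Count what the host actually sees (each path record has a "Runtime Name" field):
#   esxcli storage core path list | grep -c "Runtime Name"
max_paths=1024      # ESXi 5.x configuration maximum: total paths per host
paths_per_lun=8
lun_ceiling=$(( max_paths / paths_per_lun ))
echo "$lun_ceiling"   # LUN count at which 8 paths each hits the host limit
```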
These are the reasons why it happens.
Error codes:
cpu34:5590)VC: 1449: Device rescan time 165 msec (total number of devices 75)
cpu34:5590)VC: 1452: Filesystem probe time 504 msec (devices probed 48 of 75)
cpu38:5590)ScsiDevice: 4592: naa.6006016058201700354179be0c6fdf11 device :Open count > 0, cannot be brought online
cpu34:5590)Vol3: 647: Couldn't read volume header from control: Invalid handle
cpu34:5590)FSS: 4333: No FS driver claimed device 'control': Not supported
cpu38:5590)ScsiDeviceIO: 2316: Cmd(0x4124c0ea2e80) 0x28, CmdSN 0x70509 to dev "naa.6006016058201700354179be0c6fdf11" failed H:0x1 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
Please follow the resolution below.
To resolve this issue:
If you have any questions, please let me know; I will try my best to answer them.
Thank you.