VMware Cloud Community
dr_robot
Enthusiast

Can't access datastore. Error on storage disk

Hi,

The datastore is not showing in the datastore list. Rescanning the HBAs and storage doesn't bring it back. The LUN is shown under devices, but its partition shows as unknown.

Getting the error messages below in the vmkernel.log file.

2021-02-03T15:52:31.684Z cpu16:2098279)ScsiDeviceIO: 3483: Cmd(0x459a96f86440) 0x28, CmdSN 0x3ccb from world 0 to dev "naa.6006016073c04200e0d3975bc7aa726d" failed H:0x0 D:0x2 P:0x2 Invalid sense data: 0x20 0x50 0x3a.
2021-02-03T15:52:31.684Z cpu3:2097773)Partition: 430: Failed read for "naa.6006016073c04200e0d3975bc7aa726d": I/O error
2021-02-03T15:52:31.684Z cpu3:2097773)Partition: 1108: Failed to read protective mbr on "naa.6006016073c04200e0d3975bc7aa726d" : I/O error
2021-02-03T15:52:31.684Z cpu3:2097773)WARNING: Partition: 1261: Partition table read from device naa.6006016073c04200e0d3975bc7aa726d failed: I/O error
2021-02-03T15:52:31.817Z cpu63:2104635)WARNING: NFS: 1226: Invalid volume UUID mpx.vmhba2:C0:T7:L0
2021-02-03T15:52:31.918Z cpu33:2104635)FSS: 6092: No FS driver claimed device 'mpx.vmhba2:C0:T7:L0': No filesystem on the device
2021-02-03T15:52:31.925Z cpu24:2100616 opID=9c9a5fb5)World: 11943: VC opID esxui-d809-e387 maps to vmkernel opID 9c9a5fb5
2021-02-03T15:52:31.925Z cpu24:2100616 opID=9c9a5fb5)VC: 4616: Device rescan time 69 msec (total number of devices 12)
2021-02-03T15:52:31.925Z cpu24:2100616 opID=9c9a5fb5)VC: 4619: Filesystem probe time 169 msec (devices probed 6 of 12)
2021-02-03T15:52:31.925Z cpu24:2100616 opID=9c9a5fb5)VC: 4621: Refresh open volume time 6 msec
2021-02-03T15:52:38.614Z cpu16:2098279)ScsiDeviceIO: 3483: Cmd(0x459a97912500) 0x28, CmdSN 0x3de7 from world 0 to dev "naa.6006016073c04200e0d3975bc7aa726d" failed H:0x0 D:0x2 P:0x2 Invalid sense data: 0xe0 0x7f 0x41.

 

What could be the issue?

6 Replies
paudieo
VMware Employee

Based on the log snippet you sent, I can only comment on the log extract itself.

The line

ScsiDeviceIO: 3483: Cmd(0x459a96f86440) 0x28, CmdSN 0x3ccb from world 0 to dev "naa.6006016073c04200e0d3975bc7aa726d" failed H:0x0 D:0x2 P:0x2 Invalid sense data: 0x20 0x50 0x3a.

This means ESXi issued a read I/O (0x28 is a SCSI read) to the device;

however, it failed on device naa.6006016073c04200e0d3975bc7aa726d (the underlying device the datastore is created on):

Failed H:0x0 D:0x2 P:0x2 Invalid sense data: 0x20 0x50 0x3a.

The D:0x2 means a device check condition (something is wrong with the device or LUN).

P:0x2 means a plugin error (at the PSA layer), and 0x20 0x50 0x3a is the SCSI sense data returned by the LUN or the array firmware.
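To make those fields easier to eyeball, here is a small, unofficial shell sketch that pulls the device status byte out of a failure line like yours and maps the common values to their SCSI names (the mapping is from the SCSI status codes, not a VMware tool):

```shell
# Pull the D (device) status byte out of a vmkernel SCSI failure line and
# map common values to their SCSI names. Illustrative only; the sample line
# is abbreviated from the log in this thread.
line='ScsiDeviceIO: ... failed H:0x0 D:0x2 P:0x2 Invalid sense data: 0x20 0x50 0x3a.'
dstatus=$(printf '%s\n' "$line" | grep -o 'D:0x[0-9a-fA-F]*' | cut -d: -f2)
case "$dstatus" in
  0x0)  echo "D:$dstatus = GOOD" ;;
  0x2)  echo "D:$dstatus = CHECK CONDITION (device/LUN is reporting an error)" ;;
  0x8)  echo "D:$dstatus = BUSY" ;;
  0x18) echo "D:$dstatus = RESERVATION CONFLICT" ;;
  0x28) echo "D:$dstatus = TASK SET FULL" ;;
  *)    echo "D:$dstatus = other" ;;
esac
```

For the line in your log this prints the CHECK CONDITION case, which is what points the finger at the device/array rather than the host path.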

The lines

"Failed to read protective mbr on "naa.6006016073c04200e0d3975bc7aa726d" : I/O error"

"WARNING: Partition: 1261: Partition table read from device naa.6006016073c04200e0d3975bc7aa726d failed: I/O error"

mean just that: after the rescan, ESXi tries to read the partition information to see whether it needs to mount the volume (it checks for a VMFS volume signature), but it can't read the partition table on the device, so it reports an I/O error.

Check on the array side what the health of the LUN is and whether there have been any changes,

or, if a separate storage team looks after the storage, give them the LUN ID and the time the errors were logged and have them check.

If it's all the LUNs, it's obviously going to be an array-wide issue; but if it's just one LUN affected, focus on that one only.
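One quick way to confirm it's a single LUN is to pull the distinct device IDs out of the failure lines. This is sketched against an inline sample of your log; on the host you would point the same pipeline at /var/log/vmkernel.log instead of the here-document:

```shell
# List the distinct NAA IDs appearing in ScsiDeviceIO failure lines.
# The sample below is two lines from this thread; on an ESXi host, feed
# the pipeline with: grep 'ScsiDeviceIO.*failed' /var/log/vmkernel.log
cat <<'EOF' > /tmp/vmk_sample.log
2021-02-03T15:52:31.684Z cpu16:2098279)ScsiDeviceIO: 3483: Cmd(0x459a96f86440) 0x28, CmdSN 0x3ccb from world 0 to dev "naa.6006016073c04200e0d3975bc7aa726d" failed H:0x0 D:0x2 P:0x2
2021-02-03T15:52:38.614Z cpu16:2098279)ScsiDeviceIO: 3483: Cmd(0x459a97912500) 0x28, CmdSN 0x3de7 from world 0 to dev "naa.6006016073c04200e0d3975bc7aa726d" failed H:0x0 D:0x2 P:0x2
EOF
grep 'failed' /tmp/vmk_sample.log | grep -o 'naa\.[0-9a-f]*' | sort -u
```

If that prints one ID, it's a single-LUN problem; several IDs would point toward the array or the fabric.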

 

Other things to keep in mind: is this a shared LUN/datastore? Are other hosts connected to this device seeing the same behaviour? Were any other changes made, like recent driver/firmware upgrades on the hosts?

But I would start at the storage end for the initial investigation.
dr_robot
Enthusiast

Hi, thanks for the response.

The LUN is from a shared storage array, a Dell EMC Unity 300, connected by Fibre Channel. There are no errors being reported on the array or the LUN; the LUN appears healthy.
The LUN was presented to other VMware hosts and the same behaviour was seen: still not able to access the datastore. It's only one LUN/datastore with the issue. Others provisioned from the same storage are working fine on this host and on other hosts.

paudieo
VMware Employee

What version of ESXi is installed?

Are the read errors reporting the same across all the hosts?

You should be able to grep for the NAA ID in the other hosts' vmkernel logs, i.e.

grep "naa.6006016073c04200e0d3975bc7aa726d" /var/log/vmkernel.log

If all the other LUNs are healthy and the problem follows the LUN from one vSphere host to another, it does seem likely that it is the LUN itself, regardless of whether it is reporting healthy from the array side.

Have there been any recent changes to the environment? For example:

Any management operations done on the volume from the array end, e.g. snapshotting the LUN or similar operations that could account for the check conditions reported against the device?

When did it first go offline? You will probably need to retrace and attempt to clarify what changed before it went offline.
If it's not obvious from the current logs, then syslog or a log aggregator may help you identify when it first went offline.

It may be quicker to file an SR with support to gather diagnostic bundles from all the connected hosts, get a deeper understanding of the configuration, and see if they can help narrow down the cause.
dr_robot
Enthusiast

Hi,

The ESXi version is VMware ESXi 6.7.0 Update 3. 

Support from the storage team was brought in, and they didn't find any errors on the storage side. We were referred back to VMware, as they suspect the issue is on the hosts:

"The MBR/partition table is corrupted, and the errors are related to the host side and not storage:

2021-02-04T08:58:37.451Z cpu2:2115589)Partition: 430: Failed read for "naa.6006016073c04200e0d3975bc7aa726d": I/O error

2021-02-04T08:58:37.451Z cpu2:2115589)Partition: 1108: Failed to read protective mbr on "naa.6006016073c04200e0d3975bc7aa726d" : I/O error

2021-02-04T08:58:37.451Z cpu2:2115589)WARNING: Partition: 1261: Partition table read from device naa.6006016073c04200e0d3975bc7aa726d failed: I/O error

"

If that's the case, what would be the procedure for recovering the corrupted partition table?

dr_robot
Enthusiast

Getting the output below:

partedUtil getptbl "vmfs/devices/disks/naa.6006016073c04200e0d3975bc7aa726d"
unknown
935722 255 63 15032385536

partedUtil getUsableSectors "vmfs/devices/disks/naa.6006016073c04200e0d3975bc7aa726d"
Unknown partition table on disk vmfs/devices/disks/naa.6006016073c04200e0d3975bc7aa726d
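If I'm reading the geometry line right (it should be cylinders, heads, sectors per track, total sectors), the host at least still sees the LUN at its full size:

```shell
# Last field of the partedUtil geometry line is the sector count;
# ESXi reports in 512-byte sectors, so capacity = sectors * 512.
geometry="935722 255 63 15032385536"
sectors=$(echo "$geometry" | awk '{print $4}')
bytes=$((sectors * 512))
echo "$bytes bytes"
echo "$((bytes / 1024 / 1024 / 1024)) GiB"
```

That works out to 7168 GiB (7 TiB), so the size reported to the host hasn't changed; it's the reads themselves that fail.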

Tried running this command as well,

offset="128 2048"; for dev in `esxcfg-scsidevs -l | grep "Console Device:" | awk {'print $3'}`; do disk=$dev; echo $disk; partedUtil getptbl $disk; { for i in `echo $offset`; do echo "Checking offset found at $i:"; hexdump -n4 -s $((0x100000+(512*$i))) $disk; hexdump -n4 -s $((0x1300000+(512*$i))) $disk; hexdump -C -n 128 -s $((0x130001d + (512*$i))) $disk; done; } | grep -B 1 -A 5 d00d; echo "---------------------"; done

But just got:

/vmfs/devices/disks/naa.6006016073c04200e0d3975bc7aa726d
unknown
935722 255 63 15032385536
hexdump: /vmfs/devices/disks/naa.6006016073c04200e0d3975bc7aa726d: Input/output error
hexdump: /vmfs/devices/disks/naa.6006016073c04200e0d3975bc7aa726d: Input/output error
hexdump: /vmfs/devices/disks/naa.6006016073c04200e0d3975bc7aa726d: Input/output error
hexdump: /vmfs/devices/disks/naa.6006016073c04200e0d3975bc7aa726d: Input/output error
hexdump: /vmfs/devices/disks/naa.6006016073c04200e0d3975bc7aa726d: Input/output error
hexdump: /vmfs/devices/disks/naa.6006016073c04200e0d3975bc7aa726d: Input/output error

With the error messages being seen, would it still be possible to recreate the partition table, as in http://vmwareinsight.com/Articles/2018/3/5802942/How-to-create-corrupt-or-missing-VMFS-Partition-tab...?

If so, would that fix the issue?

paudieo
VMware Employee

Hi

The hexdump command you quoted reads from a set of offsets on the disk to determine whether a valid VMFS file system is present.
From the output you posted, there are I/O (input/output) errors, which I suspect are read errors, and which I guess we already knew at this point.

With the evidence you have provided, I do not believe you will be able to re-create the partition table, since you can't read from the device at all.
It's also worth mentioning that re-writing the partition table without first understanding what is going on is potentially dangerous and could lead to data loss.

What is so unique about this single volume? I still stand by my suspicion that the volume's attributes have changed on the array, e.g. it has been marked read-only or is a de-activated snapshot.

The only other way to rule out an ESXi/vSphere issue is to boot another server from a Linux live CD, map the volume to that host, and see whether that server can read from the device without I/O errors, using tools such as GParted.
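On the Linux side, the check itself is just a raw read of the device while watching for I/O errors. A rough sketch, using a scratch file as a stand-in for the real device node (on the live CD it would be something like /dev/sdb; the path here is only a placeholder so the sketch runs anywhere):

```shell
# Raw-read sanity check: read the first 16 MiB of the device and report
# success or failure. /tmp/fake_lun.img stands in for the mapped LUN
# (e.g. /dev/sdb) so this example is runnable without real hardware.
dev=/tmp/fake_lun.img
dd if=/dev/zero of="$dev" bs=1M count=16 2>/dev/null   # scratch stand-in only
if dd if="$dev" of=/dev/null bs=1M count=16 2>/dev/null; then
    echo "read OK"
else
    echo "read FAILED (I/O error)"
fi
```

On the problem LUN you'd expect the failure branch, mirroring the hexdump I/O errors you saw; a clean read from Linux would instead point back at the ESXi side.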

At this stage I strongly suggest filing an SR with support and/or your storage support to dig deeper into why there are I/O errors against that single device.