VMware Cloud Community
jjccooxx
Contributor

ESXi 5.5 Another Datastore missing

I've been struggling to recover a missing local VMFS 5 datastore. The LUN is visible, but after refreshing and/or rescanning, the datastore still does not appear.

I am unable to mount it from the CLI either.
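For reference, the CLI attempts were along the lines of the usual commands for listing and mounting an unresolved VMFS volume (only a rough sketch of those commands; the volume simply never shows up as mountable):

esxcli storage vmfs snapshot list                     # list VMFS volumes detected as unresolved copies
esxcli storage vmfs snapshot mount -l <volume-label>  # mount by label, if it were listed
esxcfg-volume -l                                      # older equivalent listing
esxcfg-volume -M <volume-label>                       # persistent mount by label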

The HP Smart Array P400i controller shows the disks as OK, although the boot messages show:

1720 - S.M.A.R.T. Hard Drive(s) Detect Imminent Failure Port 2I: Box 1: Bay 4 - suggesting a SAS disk issue is imminent.

Is there anything I can do to recover a vm from this missing Datastore?

partedUtil shows:

# partedUtil getptbl /vmfs/devices/disks/mpx.vmhba1:C0:T0:L0

gpt

71380 255 63 1146734896

1 2048 1146734591 AA31E02A400F11DB9590000C2911D1B8 vmfs 0

/vmfs/volumes # esxcli storage core device list |grep -A27 ^mpx.vmhba1:C0:T0:L0

mpx.vmhba1:C0:T0:L0

   Display Name: Local VMware Disk (mpx.vmhba1:C0:T0:L0)

   Has Settable Display Name: false

   Size: 559929

   Device Type: Direct-Access

   Multipath Plugin: NMP

   Devfs Path: /vmfs/devices/disks/mpx.vmhba1:C0:T0:L0

   Vendor: VMware

   Model: Block device

   Revision: 1.0

   SCSI Level: 2

   Is Pseudo: false

   Status: on

   Is RDM Capable: false

   Is Local: true

   Is Removable: false

   Is SSD: false

   Is Offline: false

   Is Perennially Reserved: false

   Queue Full Sample Size: 0

   Queue Full Threshold: 0

   Thin Provisioning Status: unknown

   Attached Filters:

   VAAI Status: unsupported

   Other UIDs: vml.0000000000766d686261313a303a30

   Is Local SAS Device: false

   Is Boot USB Device: false

   No of outstanding IOs with competing worlds: 32

offset="128 2048"; for dev in `esxcfg-scsidevs -l | grep "Console Device:" | awk {'print $3'}`; do disk=$dev; echo $disk; partedUtil getptbl $disk; { for i in `echo $offset`; do echo "Checking offset found at $i:"; hexdump -n4 -s $((0x100000+(512*$i))) $disk; hexdump -n4 -s $((0x1300000+(512*$i))) $disk; hexdump -C -n 128 -s $((0x130001d + (512*$i))) $disk; done; } | grep -B 1 -A 5 d00d; echo "---------------------"; done

Result -

---------------------

/vmfs/devices/disks/mpx.vmhba1:C0:T0:L0

gpt

71380 255 63 1146734896

1 2048 1146734591 AA31E02A400F11DB9590000C2911D1B8 vmfs 0

Checking offset found at 2048:

0200000 d00d c001

0200004

1400000 f15e 2fab

1400004

0140001d  4c 43 4c 5f 52 41 49 44  30 00 00 00 00 00 00 00  |LCL_RAID0.......|

0140002d  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|

vmkernel.log output -

2017-06-07T17:40:21.582Z cpu2:32787)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "mpx.vmhba1:C0:T0:L0" state in doubt; requested fast path state update...

2017-06-07T17:40:21.582Z cpu2:32787)ScsiDeviceIO: 2337: Cmd(0x412e8087c140) 0x28, CmdSN 0x2c5 from world 33801 to dev "mpx.vmhba1:C0:T0:L0" failed H:0x3 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.

2017-06-07T17:40:24.695Z cpu2:32825)<4>cciss: cmd 0x4109904559c0 has CHECK CONDITION  byte 2 = 0x3

2017-06-07T17:40:24.695Z cpu2:32787)NMP: nmp_ThrottleLogForDevice:2321: Cmd 0x28 (0x412e80859ac0, 33801) to dev "mpx.vmhba1:C0:T0:L0" on path "vmhba1:C0:T0:L0" Failed: H:0x3 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0. Act:EVAL

2017-06-07T17:40:24.695Z cpu2:32787)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "mpx.vmhba1:C0:T0:L0" state in doubt; requested fast path state update...

2017-06-07T17:40:24.695Z cpu2:32787)ScsiDeviceIO: 2337: Cmd(0x412e80859ac0) 0x28, CmdSN 0x2c7 from world 33801 to dev "mpx.vmhba1:C0:T0:L0" failed H:0x3 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.

2017-06-07T17:40:27.807Z cpu2:32783)<4>cciss: cmd 0x4109904559c0 has CHECK CONDITION  byte 2 = 0x3

2017-06-07T17:40:27.807Z cpu2:32787)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "mpx.vmhba1:C0:T0:L0" state in doubt; requested fast path state update...

2017-06-07T17:40:27.807Z cpu2:32787)ScsiDeviceIO: 2337: Cmd(0x412e80859200) 0x28, CmdSN 0x2c9 from world 33801 to dev "mpx.vmhba1:C0:T0:L0" failed H:0x3 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.

2017-06-07T17:40:30.920Z cpu2:32825)<4>cciss: cmd 0x4109904559c0 has CHECK CONDITION  byte 2 = 0x3

2017-06-07T17:40:30.920Z cpu2:32787)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "mpx.vmhba1:C0:T0:L0" state in doubt; requested fast path state update...

2017-06-07T17:40:30.920Z cpu2:32787)ScsiDeviceIO: 2337: Cmd(0x412e80858a80) 0x28, CmdSN 0x2cb from world 33801 to dev "mpx.vmhba1:C0:T0:L0" failed H:0x3 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.

2017-06-07T17:40:34.032Z cpu2:32779)<4>cciss: cmd 0x4109904559c0 has CHECK CONDITION  byte 2 = 0x3

2017-06-07T17:40:34.032Z cpu2:32787)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "mpx.vmhba1:C0:T0:L0" state in doubt; requested fast path state update...

2017-06-07T17:40:34.032Z cpu2:32787)ScsiDeviceIO: 2337: Cmd(0x412e80858300) 0x28, CmdSN 0x2cd from world 33801 to dev "mpx.vmhba1:C0:T0:L0" failed H:0x3 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.

2017-06-07T17:40:37.144Z cpu2:32793)<4>cciss: cmd 0x4109904559c0 has CHECK CONDITION  byte 2 = 0x3

2017-06-07T17:40:37.145Z cpu2:32787)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "mpx.vmhba1:C0:T0:L0" state in doubt; requested fast path state update...

2017-06-07T17:40:37.145Z cpu2:32787)ScsiDeviceIO: 2337: Cmd(0x412e808569c0) 0x28, CmdSN 0x2db from world 33801 to dev "mpx.vmhba1:C0:T0:L0" failed H:0x3 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.

2017-06-07T17:40:40.255Z cpu2:32843)<4>cciss: cmd 0x4109904559c0 has CHECK CONDITION  byte 2 = 0x3

2017-06-07T17:40:40.255Z cpu2:32787)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "mpx.vmhba1:C0:T0:L0" state in doubt; requested fast path state update...

2017-06-07T17:40:40.255Z cpu2:32787)ScsiDeviceIO: 2337: Cmd(0x412e80856240) 0x28, CmdSN 0x2dd from world 33801 to dev "mpx.vmhba1:C0:T0:L0" failed H:0x3 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.

2017-06-07T17:40:43.367Z cpu2:32779)<4>cciss: cmd 0x4109904559c0 has CHECK CONDITION  byte 2 = 0x3

2017-06-07T17:40:43.367Z cpu2:32787)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "mpx.vmhba1:C0:T0:L0" state in doubt; requested fast path state update...

2017-06-07T17:40:43.367Z cpu2:32787)ScsiDeviceIO: 2337: Cmd(0x412e80855ac0) 0x28, CmdSN 0x2df from world 33801 to dev "mpx.vmhba1:C0:T0:L0" failed H:0x3 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.

2017-06-07T17:40:46.479Z cpu2:32779)<4>cciss: cmd 0x4109904559c0 has CHECK CONDITION  byte 2 = 0x3

2017-06-07T17:40:46.480Z cpu2:32787)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "mpx.vmhba1:C0:T0:L0" state in doubt; requested fast path state update...

2017-06-07T17:40:46.480Z cpu2:32787)ScsiDeviceIO: 2337: Cmd(0x412e80855340) 0x28, CmdSN 0x2e1 from world 33801 to dev "mpx.vmhba1:C0:T0:L0" failed H:0x3 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.

2017-06-07T17:40:46.480Z cpu3:33801)Fil3: 15338: Max timeout retries exceeded for caller Fil3_FileIO (status 'Timeout')

2017-06-07T17:40:48.804Z cpu1:33801)Config: 346: "SIOControlFlag2" = 0, Old Value: 0, (Status: 0x0)

2017-06-07T17:40:52.761Z cpu1:34271)WARNING: UserEpoll: 542: UNSUPPORTED events 0x40

2017-06-07T17:40:53.563Z cpu2:34271)WARNING: LinuxSocket: 1854: UNKNOWN/UNSUPPORTED socketcall op (whichCall=0x12, args@0xffd12d8c)

2017-06-07T17:40:54.445Z cpu3:33801)Config: 346: "VMOverheadGrowthLimit" = -1, Old Value: -1, (Status: 0x0)

2017-06-07T17:40:57.728Z cpu2:33989)Hardware: 3124: Assuming TPM is not present because trusted boot is not supported.

2017-06-07T17:41:00.176Z cpu2:34050)<4>cciss: cmd 0x4109904559c0 has CHECK CONDITION  byte 2 = 0x3

2017-06-07T17:41:00.177Z cpu2:32787)NMP: nmp_ThrottleLogForDevice:2321: Cmd 0x28 (0x412e8086a4c0, 33986) to dev "mpx.vmhba1:C0:T0:L0" on path "vmhba1:C0:T0:L0" Failed: H:0x3 D:0x0 P:0x0 Possible sense data: 0x5 0x20 0x0. Act:EVAL

2017-06-07T17:41:00.177Z cpu2:32787)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "mpx.vmhba1:C0:T0:L0" state in doubt; requested fast path state update...

2017-06-07T17:41:00.177Z cpu2:32787)ScsiDeviceIO: 2337: Cmd(0x412e8086a4c0) 0x28, CmdSN 0x38e from world 33986 to dev "mpx.vmhba1:C0:T0:L0" failed H:0x3 D:0x0 P:0x0 Possible sense data: 0x5 0x20 0x0.

2017-06-07T17:41:00.247Z cpu2:34933)Boot Successful

2017-06-07T17:41:01.007Z cpu3:33804)Config: 346: "SIOControlFlag2" = 1, Old Value: 0, (Status: 0x0)

2017-06-07T17:41:01.736Z cpu1:34988)MemSched: vm 34988: 8263: extended swap to 8192 pgs

dekoshal
Hot Shot

Below is the decode of the SCSI sense code from this line:

2017-06-07T17:41:00.177Z cpu2:32787)NMP: nmp_ThrottleLogForDevice:2321: Cmd 0x28 (0x412e8086a4c0, 33986) to dev "mpx.vmhba1:C0:T0:L0" on path "vmhba1:C0:T0:L0" Failed: H:0x3 D:0x0 P:0x0 Possible sense data: 0x5 0x20 0x0. Act:EVAL

Host Status [0x3] TIME_OUT: returned when a command in-flight to the array times out.
Device Status [0x0] GOOD: no error from the device or target side, so look at the Host and Plugin status instead.
Plugin Status [0x0] GOOD: no error. (ESXi 5.x / 6.x only)
Sense Key [0x5] ILLEGAL REQUEST
Additional Sense Data 20/00: INVALID COMMAND OPERATION CODE

The Host status bit is the one reporting the problem. Try restarting the host agents with services.sh restart, or reboot the ESXi host itself.
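A minimal sketch of restarting the management agents from an SSH session (the individual init scripts are the usual alternative):

services.sh restart          # restarts hostd, vpxa and the other management agents
/etc/init.d/hostd restart    # or restart hostd and vpxa individually
/etc/init.d/vpxa restart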

If you found this or any other answer helpful, please consider using the Helpful button to award points.

Best Regards,

Deepak Koshal

CNE|CLA|CWMA|VCP4|VCP5|CCAH

continuum
Immortal

Hi

> 1720 - S.M.A.R.T. Hard Drive(s) Detect Imminent Failure Port 2I: Box 1: Bay 4 - suggesting a SAS disk issue is imminent.

That does not sound good.
You should create a complete disk image ASAP.
I hope you have another datastore with enough free space to create a full disk clone.
Run
dd if="/dev//disks/mpx.vmhba1:C0:T0:L0" bs=1M conv=notrunc of=/vmfs/volumes/<OTHER-DATASTORE>/almost-dead.bin
Once that is done stop using the disk and store it away.
We can extract the VMs from almost-dead.bin later - see the rough sketch below.
At the moment, priority number ONE is to create a disk image before the disk dies.
If possible improve air flow to that disk - maybe even put a gel-pad fresh out of the fridge on top of it to keep the disk as cool as possible.
Do NOT try to mount the datastore again - do not rescan - do not edit the partition table - do not try any GUI operation against that disk !!!
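For later reference, one way to pull the VMs out of almost-dead.bin on a Linux machine is vmfs-tools (a rough sketch, assuming the partition still starts at sector 2048 as partedUtil reported; vmfs-tools has only partial VMFS5 support, so results may vary):

sudo apt-get install vmfs-tools                                  # provides vmfs-fuse on Debian/Ubuntu
sudo losetup -o $((2048*512)) /dev/loop0 almost-dead.bin         # expose partition 1 of the image
sudo mkdir -p /mnt/vmfs && sudo vmfs-fuse /dev/loop0 /mnt/vmfs   # mount the VMFS volume (read-only)
cp -r "/mnt/vmfs/<VM folder>" /some/recovery/target/             # copy the VM's files out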
Ulli
...
By the way - you posted in a strange section of the forum - I only found your post by accident because I saw a message that you followed me.
Next time, rather post in the regular ESXi section or here: VMware vSphere™ Storage


________________________________________________
Do you need support with a VMFS recovery problem ? - send a message via skype "sanbarrow"
I do not support Workstation 16 at this time ...

jjccooxx
Contributor

Hi dekoshal,

I've restarted the host, but the issue remains. Thanks for the sense data information!

jjccooxx
Contributor

Hi continuum,

I have another datastore on the same host, although it is not large enough to receive a full disk image as it is currently another 4-disk array, but with RAID 5. I could wipe it and prepare it as RAID 0, then use that to receive the cloned image. One issue: I would need to rescan the HBA one more time to mount the new datastore.

Do you believe the following is a reasonable approach?

1. Shut down the host & pull the failing disk

2. Wipe & prepare the RAID 0 array

3. Boot to ESXi, rescan & mount the new RAID 0 datastore (rescan commands sketched below)

4. Shut down the host & reattach the failing disk

5. Boot ESXi and run dd if="/dev//disks/mpx.vmhba1:C0:T0:L0" bs=1M conv=notrunc of=/vmfs/volumes/<OTHER-DATASTORE>/almost-dead.bin
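For step 3, the rescan from the ESXi shell would be something along these lines (a sketch; the adapter name is taken from the device listing above):

esxcfg-rescan vmhba1
esxcli storage core adapter rescan --adapter vmhba1    # esxcli equivalent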

Thanks for your advice.

Jason

continuum
Immortal

Hi Jason
In that case I would rather suggest booting the host into a Linux Live CD and then storing the disk image on a network share or a large external USB drive.
Creating a RAID 0 array formatted with VMFS just to store one file sounds like unnecessary work (and carries the risk that the RAID 0 will keep being used afterwards).
RAID 0 with VMFS on a standalone ESXi host is an absolute no-go.
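From the Live CD, putting the network share in place before imaging might look roughly like this (a sketch; server, share and credential names are placeholders, and cifs-utils may need installing first):

sudo mkdir -p /mnt/backup
sudo mount -t cifs //<fileserver>/<share> /mnt/backup -o username=<user>    # or: sudo mount -t nfs <fileserver>:/<export> /mnt/backup
# the image then goes to a path under the mount point, e.g. of=/mnt/backup/almost-dead.bin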


________________________________________________
Do you need support with a VMFS recovery problem ? - send a message via skype "sanbarrow"
I do not support Workstation 16 at this time ...

jjccooxx
Contributor

I completely agree with your statement regarding RAID 0; unfortunately, I inherited the system and its issues. After the recovery has been completed, the host will only be used as a lab environment, certainly not with RAID 0.

I will give a Live CD a try; however, I am quite inexperienced with Linux environments in general and will likely be out of my depth.

Presumably I will still use the dd command, replacing

dd if="/dev//disks/mpx.vmhba1:C0:T0:L0" bs=1M conv=notrunc of=/vmfs/volumes/<OTHER-DATASTORE>/almost-dead.bin

with

dd if="/dev//disks/mpx.vmhba1:C0:T0:L0" bs=1M conv=notrunc of=/<NetworkShare>/almost-dead.bin

continuum
Immortal

Jason
if you need assistance, feel free to call via Skype - see my signature.
Ulli


________________________________________________
Do you need support with a VMFS recovery problem ? - send a message via skype "sanbarrow"
I do not support Workstation 16 at this time ...

jjccooxx
Contributor

Based upon your initial feedback and advice, some rather clumsy trial and error today returned the following via an Ubuntu Live CD.

Unfortunately it does not appear that Ubuntu registered the Smart Array controller completely, perhaps related to HP drivers.

ubuntu@ubuntu:/dev/disk/by-id$ sudo dd if="/dev/disks/mpx.vmhba1:C0:T0:L0" bs=1M conv=notrunc of=\\W4200000825\temp\almost-dead.bin  returns

dd: failed to open '/dev/disk/mpx.vmhba1:C0:T0:L0': No such file or directory

ls /dev/disk/by-id shows:

pci-0000:00:1d.7-usb-0:4:1.0-scsi-0:0:0:0

pci-0000:00:1d.7-usb-0:4:1.0-scsi-0:0:0:0-part1

pci-0000:00:1d.7-usb-0:4:1.0-scsi-0:0:0:0-part5

pci-0000:00:1d.7-usb-0:4:1.0-scsi-0:0:0:0-part6

pci-0000:00:1d.7-usb-0:4:1.0-scsi-0:0:0:0-part7

pci-0000:00:1d.7-usb-0:4:1.0-scsi-0:0:0:0-part8

pci-0000:00:1d.7-usb-0:5:1.0-scsi-0:0:0:0

pci-0000:00:1d.7-usb-0:5:1.0-scsi-0:0:0:0-part1

pci-0000:00:1f.1-ata-1

pci-0000:06:00.0-cciss-disk0              -- I am merely guessing this is the RAID0

pci-0000:06:00.0-cciss-disk0-part1

pci-0000:06:00.0-cciss-disk1

pci-0000:06:00.0-cciss-disk1-part1

sudo lshw

description: RAID bus controller

product: Smart Array Controller

vendor: Hewlett-Packard Company

physical id: 0

bus info: pci@0000:06:00.0

A combination of sudo lshw and gparted suggests the following might be the location of the affected RAID 0:

sudo dd if="/dev/cciss/c0d0p1" bs=1M conv=notrunc of=\\W4200000825\temp\almost-dead.bin

dd: error reading '/dev/cciss/c0d0p1':

Input/output error

24+1 records in

24+1 records out

25223168 bytes (25 MB, 24 MiB) copied, 0.474883 s, 53.1 MB/s

I hope I have this wrong and there is something else I can try to return a more positive result.

"Jason

if you need assistance feel free to call via skype - see my signature.

Ulli"

continuum, I currently only have access to this host between 09:00 and 17:30 UTC+1.

Is there a suitable time for a Skype call? Daytime / evenings, etc.?

continuum
Immortal

Hi Jason
that does not sound good.
It looks like dd is not suitable - instead we should use ddrescue, as that can handle I/O errors.
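A minimal ddrescue run from the Ubuntu Live CD might look like this (a sketch; the device and target paths are assumptions based on the output above, and the mapfile lets an interrupted run resume where it left off):

sudo apt-get install gddrescue      # the Ubuntu package that provides GNU ddrescue
sudo ddrescue -d -r3 /dev/cciss/c0d0 /mnt/backup/almost-dead.bin /mnt/backup/almost-dead.map
# -d reads the disk with direct access, -r3 retries bad sectors three times;
# /dev/cciss/c0d0 is the whole logical drive, c0d0p1 would be just the VMFS partition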
I am located in Germany and my Skype is always on, so just send me a message and we should be able to arrange something ASAP.
Ulli


________________________________________________
Do you need support with a VMFS recovery problem ? - send a message via skype "sanbarrow"
I do not support Workstation 16 at this time ...
