I have a home lab server that I use for a mix of different things.
Tonight I was looking it over to see whether it could be upgraded to 6.7.0 and noticed that a datastore had vanished.
Logs from vmkernel.log:
2020-07-12T11:49:39.829Z cpu2:33215)<3>ata5.00: exception Emask 0x0 SAct 0x7 SErr 0x0 action 0x0
2020-07-12T11:49:39.829Z cpu2:33215)<3>ata5.00: irq_stat 0x40000008
2020-07-12T11:49:39.829Z cpu2:33215)<3>ata5.00: cmd 60/00:00:80:d8:00/04:00:00:00:00/40 tag 0 ncq 524288 in
res 41/40:00:d0:da:00/0e:00:00:00:00/40 Emask 0x409 (media error) <F>
2020-07-12T11:49:39.829Z cpu2:33215)<3>ata5.00: status: { DRDY ERR }
2020-07-12T11:49:39.829Z cpu2:33215)<3>ata5.00: error: { UNC }
2020-07-12T11:49:39.830Z cpu2:33215)<6>ata5.00: configured for UDMA/133
2020-07-12T11:49:39.830Z cpu2:33215)<6>ata5: EH complete
2020-07-12T11:49:39.830Z cpu7:35982)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "t10.ATA_____WDC_WD30EFRX2D68EUZN0_________________________WD2DWCC4N7PP74SZ" state in doubt; requested fast path state update...
2020-07-12T11:49:39.830Z cpu7:35982)ScsiDeviceIO: 2652: Cmd(0x43b580615f40) 0x88, CmdSN 0x8000007b from world 35978 to dev "t10.ATA_____WDC_WD30EFRX2D68EUZN0_________________________WD2DWCC4N7PP74SZ" failed H:0x3 D:0x0 P:0x0 Possible sense data: 0x0 0x0$
2020-07-12T11:49:39.830Z cpu7:32798)NMP: nmp_ThrottleLogForDevice:3248: last error status from device t10.ATA_____WDC_WD30EFRX2D68EUZN0_________________________WD2DWCC4N7PP74SZ repeated 1 times
2020-07-12T11:49:39.830Z cpu7:32798)NMP: nmp_ThrottleLogForDevice:3298: Cmd 0x28 (0x43b58060e7c0, 34607) to dev "t10.ATA_____WDC_WD30EFRX2D68EUZN0_________________________WD2DWCC4N7PP74SZ" on path "vmhba36:C0:T0:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense$
2020-07-12T11:49:39.830Z cpu7:32798)ScsiDeviceIO: 2652: Cmd(0x43b58060e7c0) 0x28, CmdSN 0x2db7 from world 34607 to dev "t10.ATA_____WDC_WD30EFRX2D68EUZN0_________________________WD2DWCC4N7PP74SZ" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x3 0x11 0x4.
2020-07-12T11:49:39.830Z cpu7:32798)NMP: nmp_ThrottleLogForDevice:3298: Cmd 0x2a (0x43b580671d80, 32782) to dev "t10.ATA_____WDC_WD30EFRX2D68EUZN0_________________________WD2DWCC4N7PP74SZ" on path "vmhba36:C0:T0:L0" Failed: H:0x3 D:0x0 P:0x0 Possible se$
2020-07-12T11:49:39.830Z cpu7:32798)ScsiDeviceIO: 2652: Cmd(0x43b580671d80) 0x2a, CmdSN 0x2db8 from world 32782 to dev "t10.ATA_____WDC_WD30EFRX2D68EUZN0_________________________WD2DWCC4N7PP74SZ" failed H:0x3 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
However, the volume is actually mounted and I can read files off the disk. I even have a running VM with a disk on this store, and it works fine.
[root@esxi:/vmfs/volumes] ls -al
lrwxr-xr-x 1 root root 35 Jul 12 12:08 3TB -> 5972c69e-68858108-8ea1-001e67b692d3
drwxr-xr-t 1 root root 3080 Jul 12 10:03 5972c69e-68858108-8ea1-001e67b692d3
[root@esxi:/vmfs/volumes/5972c69e-68858108-8ea1-001e67b692d3/win10] ls -al
total 824773640
drwxr-xr-x 1 root root 560 Jul 12 10:03 .
drwxr-xr-t 1 root root 3080 Jul 12 10:03 ..
-rw------- 1 root root 1099511627776 Jul 12 11:55 win10_1-flat.vmdk
-rw------- 1 root root 525 Jul 12 10:03 win10_1.vmdk
[root@esxi:/vmfs/volumes/5972c69e-68858108-8ea1-001e67b692d3/win10] date
Sun Jul 12 12:06:36 UTC 2020
I wrote some files back and forth to check that it was working; note the timestamp.
I found an article about missing datastores and how the partition table might be stuffed, but it looks fine to me?
https://kb.vmware.com/s/article/2046610
[root@esxi:/dev/disks] partedUtil getptbl t10.ATA_____WDC_WD30EFRX2D68EUZN0_________________________WD2DWCC4N7PP74SZ
gpt
364801 255 63 5860533168
1 2048 5860532223 AA31E02A400F11DB9590000C2911D1B8 vmfs 0
/vmfs/devices/disks/t10.ATA_____WDC_WD30EFRX2D68EUZN0_________________________WD2DWCC4N7PP74SZ
gpt
364801 255 63 5860533168
1 2048 5860532223 AA31E02A400F11DB9590000C2911D1B8 vmfs 0
Checking offset found at 2048:
0200000 d00d c001
0200004
1400000 f15e 2fab
1400004
0140001d 33 54 42 00 00 00 00 00 00 00 00 00 00 00 00 00 |3TB.............|
0140002d 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
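The offsets in that output can be reproduced from the partition table alone. Below is a minimal sketch, assuming 512-byte sectors and assuming that the two signatures sit 0x100000 and 0x1300000 bytes into the partition (values inferred from the hexdump offsets shown above, not taken from any documentation). The names signature_offsets, lvm_magic, vmfs_magic and volume_label are made up for illustration.

```python
SECTOR_BYTES = 512  # assumption: 512-byte logical sectors


def signature_offsets(start_sector, sector_bytes=SECTOR_BYTES):
    """Absolute byte offsets on the disk for a partition starting at start_sector.

    The relative offsets (0x100000, 0x1300000, +0x1d) are inferred from the
    hexdump output in the post and are an assumption, not a specification.
    """
    base = start_sector * sector_bytes
    return {
        "lvm_magic": base + 0x100000,
        "vmfs_magic": base + 0x1300000,
        "volume_label": base + 0x1300000 + 0x1D,
    }


# Partition 1 starts at sector 2048 according to partedUtil.
offsets = signature_offsets(2048)
for name, value in offsets.items():
    print(f"{name}: 0x{value:07x}")

# Total disk size from the sector count reported by partedUtil.
disk_bytes = 5860533168 * SECTOR_BYTES
print(f"disk size: {disk_bytes / 1e12:.2f} TB")
```

With a start sector of 2048 this gives 0x0200000, 0x1400000 and 0x140001d, which matches the positions where d00d c001, f15e 2fab and the "3TB" label were found, so the start of the volume is readable and sits where the partition table says it should.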
It's nothing critical, just a disk I use for ISO storage, some on-host backups, and as a scratch drive for things I don't want constantly reading and writing to the SSD datastore, which works fine.
They are just SATA drives on the mainboard SATA controller. The other disks work fine; only this one doesn't.
No guts, no glory, and it is just a lab: I upgraded it via the CLI from 6.0.0 to 6.7.3 U3 and still have the same problem.
I think I might manually copy everything off, format the drive, reinitialize it and re-attach it to my VMs.
This does not look like early warnings of a corrupted VMFS to me.
Partition looks fine as well.
Anyway, IMHO this SATA disk has lost a good part of its credibility.
I would replace it soon.
I suppose this is due to either a driver compatibility problem or a physical disk issue.
SCSI host code 0x3 means a timeout, as opposed to NO_CONNECT or BUS_BUSY. I think ESXi could not access the SATA storage device properly.
Interpreting SCSI sense codes in VMware ESXi and ESX (289902)
https://kb.vmware.com/s/article/289902
------------------------------------------------------------------
SG_ERR_DID_TIME_OUT
[0x03] TIMED OUT for other reason (often this an unexpected device selection timeout)
------------------------------------------------------------------
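To read the H:/D:/P: triples in the vmkernel.log lines without looking each one up, something like the following sketch can help. It is only an illustration: the lookup tables are deliberately partial (a few host codes, SCSI device statuses and sense keys), and the decode function and table names are hypothetical, not part of any VMware tool.

```python
import re

# Partial lookup tables; unknown codes are shown as raw hex.
HOST = {0x0: "OK", 0x1: "NO_CONNECT", 0x2: "BUS_BUSY", 0x3: "TIME_OUT"}
DEVICE = {0x0: "GOOD", 0x2: "CHECK CONDITION", 0x8: "BUSY"}
SENSE_KEY = {
    0x0: "NO SENSE",
    0x1: "RECOVERED ERROR",
    0x2: "NOT READY",
    0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR",
    0x5: "ILLEGAL REQUEST",
}

PATTERN = re.compile(
    r"H:0x([0-9a-f]+) D:0x([0-9a-f]+) P:0x([0-9a-f]+)"
    r"(?: (Valid|Possible) sense data: 0x([0-9a-f]+) 0x([0-9a-f]+) 0x([0-9a-f]+))?",
    re.IGNORECASE,
)


def decode(line):
    """Decode the host/device/plugin status of one vmkernel.log line."""
    m = PATTERN.search(line)
    if m is None:
        return None
    host, device, plugin = (int(m.group(i), 16) for i in (1, 2, 3))
    result = {
        "host": HOST.get(host, hex(host)),
        "device": DEVICE.get(device, hex(device)),
        "plugin": plugin,
    }
    # Only trust the sense bytes when the log marks them as valid.
    if m.group(4) and m.group(4).lower() == "valid":
        key, asc, ascq = (int(m.group(i), 16) for i in (5, 6, 7))
        result["sense_key"] = SENSE_KEY.get(key, hex(key))
        result["asc_ascq"] = (asc, ascq)
    return result


read_failure = decode("failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x3 0x11 0x4.")
write_failure = decode("failed H:0x3 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.")
print(read_failure)
print(write_failure)
```

Applied to the log above, the 0x28 (read) command decodes to CHECK CONDITION with sense key MEDIUM ERROR and ASC/ASCQ 0x11/0x04, an unrecovered read error, while the 0x2a (write) command shows the host-side timeout discussed here. The first of those points more towards the disk than the driver.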
Did you use the "vmw_ahci" driver for that SATA device? To isolate the cause, it may be effective to switch from the "vmw_ahci" native driver to the "sata-ahci" vmklinux driver.
Enabling and Disabling Native Drivers in ESXi 6.5 (2147565)
https://kb.vmware.com/s/article/2147565
A word of caution: if another SATA device is used as the ESXi boot disk, changing the SATA driver may cause disruptive problems and require reinstalling ESXi. So you should back up the host configuration first.
How to back up ESXi host configuration (2042141)
Looks like I'm using the vmw_ahci driver:
[root@esxi:~] esxcli system module list | grep ahci
vmw_ahci true true
[root@esxi:~] esxcli system module get -m vmw_ahci
Module: vmw_ahci
Module File: /usr/lib/vmware/vmkmod/vmw_ahci
License: BSD
Version: 1.2.8-1vmw.670.3.73.14320388
Build Type: release
Provided Namespaces:
Required Namespaces: com.vmware.vmkapi@v2_5_0_0
Containing VIB: vmw-ahci
VIB Acceptance Level: certified
So will just disabling it push things back to the legacy sata-ahci driver? Or do I need to load that driver explicitly?
I just used the out-of-the-box drivers for this install. I'm pretty sure I started out at ESXi 5 or 5.5, upgraded to 6, and upgraded to 6.7.3 yesterday.
The SATA controller is just the mainboard controller, on an Intel S1200RPL. I have 6 SATA drives: 1x SSD, which is the datastore; 1x 3TB, which is a scratch drive that stores ISOs, some backups and one hard disk for a VM that does a lot of writes that don't need to be fast; and 4 drives that are directly mapped using RDM. The SSD and the 4 RDM drives all work fine with the default driver and are running. Even this 3TB disk works fine: the host can read and write to the drive with no problems.
The boot drive is an 8GB USB stick.
Last night I copied everything off the drive with SCP onto another server and tried to delete the partition with partedUtil. That didn't work; something still has hooks into it even with the VMs shut down. Today I'm going to boot a Linux USB stick, wipe the partition, format the drive, run a check on it, and then reinitialize it.
I tried disabling the vmw_ahci driver to fall back to the legacy sata-ahci driver; it made no difference, so I reverted.
It booted fine without the disk, which is good, so everything is running except for the stuff that was on that disk. That covers most of my needs.
I loaded a Hiren's Boot CD and scanned the disk. The WD tool reports bad sectors; a random scan works for a while but stalls for 20-30 seconds on bad sectors. Given that most of it works, I think it went bad in the wrong place. Ordered a new drive anyway.
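For what it's worth, the ata5.00 lines in the original vmkernel.log already hint at where the bad sector is. The sketch below decodes the "cmd" and "res" taskfile dumps, assuming they follow the usual Linux libata field order (command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device) and that for NCQ commands the sector count is carried in the feature fields. Both are assumptions about the log format, and parse_taskfile is a made-up helper name.

```python
import re

# Twelve hex bytes separated by "/" or ":", preceded by "cmd" or "res".
TASKFILE = re.compile(r"(cmd|res) ((?:[0-9a-f]{2}[/:]){11}[0-9a-f]{2})")


def parse_taskfile(text):
    """Extract the 48-bit LBA (and NCQ sector count) from a libata cmd/res dump."""
    m = TASKFILE.search(text)
    if m is None:
        return None
    fields = [int(x, 16) for x in re.split(r"[/:]", m.group(2))]
    (command, feature, nsect, lbal, lbam, lbah,
     hob_feature, hob_nsect, hob_lbal, hob_lbam, hob_lbah, device) = fields
    lba = ((hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
           | (lbah << 16) | (lbam << 8) | lbal)
    return {
        "kind": m.group(1),
        "command_or_status": command,
        "lba": lba,
        # Assumption: only meaningful for NCQ commands such as 0x60.
        "ncq_sectors": (hob_feature << 8) | feature,
    }


cmd = parse_taskfile("ata5.00: cmd 60/00:00:80:d8:00/04:00:00:00:00/40 tag 0 ncq 524288 in")
res = parse_taskfile("res 41/40:00:d0:da:00/0e:00:00:00:00/40 Emask 0x409 (media error) <F>")

print("read starts at LBA", cmd["lba"], "for", cmd["ncq_sectors"], "sectors")
print("failing LBA", res["lba"])
print("byte offset on disk: 0x%x" % (res["lba"] * 512))
```

Under those assumptions the failed read starts at LBA 55424 and covers 1024 sectors (1024 x 512 = 524288 bytes, which agrees with the "ncq 524288 in" in the log), and the drive reports the error at LBA 56016. That is roughly 27 MiB from the start of the disk, not far past the offsets where the volume signatures were found, which would fit the "went bad in the wrong place" theory.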
With the disk removed, the storage/devices view crashes the browser, so I can't even try to look at the drive at the moment. When I get the new drive I'll see if it works again; otherwise I might back up the config and reinstall ESXi, since it looks like it wants something from that drive to work properly. That drive was the very original datastore drive before I replaced it with an SSD, so there might be some config it wants on there even though it doesn't need it.