Hello,
I have a very strange issue that I am looking for some guidance on. I have an ESXi 6 host with a 2TB local datastore on a single Samsung 850 Pro SSD. This datastore holds 3 virtual machines. A few weeks ago the ESXi box started physically locking up at night. The error in vSphere shows lost access to datastore.
It took me a few days to realize that backups on one of the VMs were crashing it. I disabled backups and it ran without issue for a few days. I attempted to switch from the backup software installed on the OS to Veeam, but running a Veeam backup causes it to crash with the same error. I added a RAID 10 array to this host and set up a vCenter Server to attempt a Storage vMotion; that failed as well. I tried setting up a new Windows Server on the RAID 10 array and copying files from the problematic server to the new one, but after the copy chugs along for a while it locks up the ESXi host with the same lost access to datastore error. I have reinstalled ESXi and re-added the VM, and everything comes out the same.
I checked the SSD and it shows as fine, and the other VMs on it are fine. I think there is some corruption in the VMDK file, but I am at a loss as to where to go. The biggest problem is that over the 3 weeks of troubleshooting the server has had many files modified, and I can't run a backup or copy them off without the datastore crashing. The server runs fine as long as I don't try to back it up or move the data. I know I made some mistakes along the way (not restoring from backup right away, ever configuring it as a single drive to begin with), but any help would be greatly appreciated.
Just an update: I installed VMware vSphere Data Protection 6.1 and tried to run a backup with the VM both powered on and powered off, and it won't run. I get an unspecified error.
Could you create a support bundle for the affected ESXi host? I'll take a look at the logs.
It would also not hurt to open a case with VMware if you have support.
Here is an upload of the logs:
https://drive.google.com/open?id=0B160EOK2iCKbRVQxeUZMU3ByUG8
Thanks for taking the time to lend a hand. They do not have support but I am considering paying to open a case with VMware.
Not a problem,
I have reviewed the logs and I see that there is an issue with disk: naa.600605b003feec301f6edd2e3b915f13
The log contains many lines like the one below:
2016-09-25T21:57:28.834Z cpu3:143203)ScsiDeviceIO: 2651: Cmd(0x439d806fa080) 0x1a, CmdSN 0x336 from world 34323 to dev "naa.600605b003feec301f6edd2e3b915f13" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.
There are also issues with Samsung_SSD_850_PRO_2TB. SCSI commands to this disk are timing out (host status H:0x3 in the first line below):
2016-09-25T21:50:44.060Z cpu3:32789)ScsiDeviceIO: 2651: Cmd(0x439d80699d00) 0x28, CmdSN 0x270d from world 34726 to dev "t10.ATA_____Samsung_SSD_850_PRO_2TB_________________S2KMNWAG803470T_____" failed H:0x3 D:0x0 P:0x0 Possible sense data: 0x5 0x24 0x0.
2016-09-25T21:57:36.906Z cpu1:33178)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "t10.ATA_____Samsung_SSD_850_PRO_2TB_________________S2KMNWAG803470T_____" state in doubt; requested fast path state update...
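If it helps when reading lines like these, here is a minimal Python sketch for decoding the H:/D:/P: status fields and the sense key. It is not a VMware tool; the function name `decode` and the lookup tables are my own, and the tables are deliberately partial (only a few common standard SCSI codes). It also assumes the sense bytes are only trustworthy when the line says "Valid sense data" rather than "Possible sense data".

```python
import re

# Matches vmkernel ScsiDeviceIO failure lines of the form quoted above.
LINE_RE = re.compile(
    r'Cmd\(0x[0-9a-fA-F]+\) (?P<op>0x[0-9a-fA-F]+), .*? to dev "(?P<dev>[^"]+)" failed '
    r'H:(?P<h>0x[0-9a-fA-F]+) D:(?P<d>0x[0-9a-fA-F]+) P:(?P<p>0x[0-9a-fA-F]+) '
    r'(?P<kind>Valid|Possible) sense data: '
    r'(?P<key>0x[0-9a-fA-F]+) (?P<asc>0x[0-9a-fA-F]+) (?P<ascq>0x[0-9a-fA-F]+)'
)

# Partial lookup tables (illustrative, not exhaustive).
HOST_STATUS = {0x0: "OK", 0x1: "NO_CONNECT", 0x2: "BUS_BUSY", 0x3: "TIMEOUT",
               0x5: "ABORT", 0x7: "ERROR", 0x8: "RESET"}
DEVICE_STATUS = {0x00: "GOOD", 0x02: "CHECK CONDITION", 0x08: "BUSY",
                 0x18: "RESERVATION CONFLICT"}
SENSE_KEY = {0x0: "NO SENSE", 0x1: "RECOVERED ERROR", 0x2: "NOT READY",
             0x3: "MEDIUM ERROR", 0x4: "HARDWARE ERROR", 0x5: "ILLEGAL REQUEST",
             0x6: "UNIT ATTENTION", 0xB: "ABORTED COMMAND"}
OPCODE = {0x1A: "MODE SENSE(6)", 0x28: "READ(10)", 0x2A: "WRITE(10)"}


def _name(table, code):
    """Return the symbolic name for a code, or a placeholder if not in the table."""
    return table.get(code, "unknown (0x%x)" % code)


def decode(line):
    """Decode one ScsiDeviceIO failure line; return None if it does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    sense_valid = m.group("kind") == "Valid"
    return {
        "device": m.group("dev"),
        "command": _name(OPCODE, int(m.group("op"), 16)),
        "host": _name(HOST_STATUS, int(m.group("h"), 16)),
        "device_status": _name(DEVICE_STATUS, int(m.group("d"), 16)),
        "sense_valid": sense_valid,
        # Assumption: only report the sense key when the log marks it as valid.
        "sense_key": _name(SENSE_KEY, int(m.group("key"), 16)) if sense_valid else None,
    }
```

Run against the two ScsiDeviceIO lines above, this reports the naa.600605b003feec301f6edd2e3b915f13 entry as a MODE SENSE(6) that returned CHECK CONDITION with sense key ILLEGAL REQUEST, and the Samsung SSD entry as a READ(10) with host status TIMEOUT and no valid sense data.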
The two device names likely refer to the same disk. Since the SSD is probably failing, I would suggest using Storage vMotion to move or clone the VMs to another disk.
I hope this helps.