VMware Cloud Community
nicksmoke
Contributor

VM Crashing ESXi on Certain Files

Hello,

I have a very strange issue that I am looking for some guidance on. I have an ESXi 6 host with a 2TB local datastore on a single Samsung 850 Pro SSD. This datastore has 3 virtual machines on it. A few weeks ago the ESXi box started locking up at night. The error in vSphere shows lost access to the datastore.

It took me a few days to realize that backups on one of the VMs were crashing it. I disabled backups and it ran without issue for a few days. I attempted to switch from the backup software installed in the guest OS to Veeam, but running a Veeam backup causes it to crash with the same error. I added a RAID 10 array to this host and set up a vCenter server to attempt a Storage vMotion, and that failed as well. I tried setting up a new Windows Server on the RAID 10 array and copying files from the problematic server to the new one, but after chugging along with the data copy for a while it locks up the ESXi host with the same lost access to datastore error. I have reinstalled ESXi and re-added the VM, and everything comes out the same.

I checked the SSD and it shows as fine, and the other VMs on it are fine. I think there is some corruption in the VMDK file, but I am at a loss as to where to go next. The biggest problem is that over the 3 weeks of troubleshooting the server has had many files modified, and I can't run a backup or copy them off without the datastore crashing. The server runs fine as long as I don't try to back it up or move the data. I know I made some mistakes along the way (not restoring from backup right away, configuring it as a single drive to begin with), but any help would be greatly appreciated.

4 Replies
nicksmoke
Contributor

Just an update: I installed VMware vSphere Data Protection 6.1 and tried to run a backup with the VM both powered on and powered off, and it won't run. I get an unspecified error.

virtualg_uk
Leadership

Could you create a support bundle for the affected ESXi host? I'll take a look at the logs.

It also would not hurt to open a case with VMware if you have support.


Graham | User Moderator | https://virtualg.uk
nicksmoke
Contributor

Here is an upload of the logs:

https://drive.google.com/open?id=0B160EOK2iCKbRVQxeUZMU3ByUG8

Thanks for taking the time to lend a hand.  They do not have support but I am considering paying to open a case with VMware.

virtualg_uk
Leadership

Not a problem.

I have reviewed the logs and I see that there is an issue with disk naa.600605b003feec301f6edd2e3b915f13.

The log contains many lines like the one below:

2016-09-25T21:57:28.834Z cpu3:143203)ScsiDeviceIO: 2651: Cmd(0x439d806fa080) 0x1a, CmdSN 0x336 from world 34323 to dev "naa.600605b003feec301f6edd2e3b915f13" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.

There are also issues with the Samsung_SSD_850_PRO_2TB device. The errors show that SCSI commands to the disk are timing out:

2016-09-25T21:50:44.060Z cpu3:32789)ScsiDeviceIO: 2651: Cmd(0x439d80699d00) 0x28, CmdSN 0x270d from world 34726 to dev "t10.ATA_____Samsung_SSD_850_PRO_2TB_________________S2KMNWAG803470T_____" failed H:0x3 D:0x0 P:0x0 Possible sense data: 0x5 0x24 0x0.

2016-09-25T21:57:36.906Z cpu1:33178)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237: NMP device "t10.ATA_____Samsung_SSD_850_PRO_2TB_________________S2KMNWAG803470T_____" state in doubt; requested fast path state update...
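For anyone who wants to pull the same fields out of their own vmkernel log, here is a minimal Python sketch that decodes ScsiDeviceIO failure lines like the two above. The function names (`decode`, `summarize`) are made up for illustration, and the lookup tables are deliberately partial: they cover only the codes seen in these excerpts (host status 0x3 as a timeout, device status 0x2 as a check condition, sense key 0x5 as illegal request, opcodes 0x1a and 0x28), so anything else is reported as "unknown". The regular expression assumes the line layout shown in this thread and may need adjusting for other ESXi builds.

```python
import re
from collections import Counter

# Partial lookup tables (assumption: only the codes seen in this thread).
HOST_STATUS = {0x0: "OK", 0x3: "TIMEOUT"}
DEVICE_STATUS = {0x0: "GOOD", 0x2: "CHECK CONDITION"}
OPCODES = {0x1A: "MODE SENSE(6)", 0x28: "READ(10)"}
SENSE_KEYS = {0x5: "ILLEGAL REQUEST"}

# Matches the "Cmd(...) <opcode>, ... to dev "<name>" failed H: D: P: ..." layout.
LINE_RE = re.compile(
    r'Cmd\(0x[0-9a-f]+\) (?P<op>0x[0-9a-f]+), .*?'
    r'to dev "(?P<dev>[^"]+)" failed '
    r'H:(?P<h>0x[0-9a-f]+) D:(?P<d>0x[0-9a-f]+) P:(?P<p>0x[0-9a-f]+) '
    r'(?P<kind>Valid|Possible) sense data: '
    r'(?P<key>0x[0-9a-f]+) (?P<asc>0x[0-9a-f]+) (?P<ascq>0x[0-9a-f]+)',
    re.IGNORECASE,
)


def decode(line):
    """Return a dict describing one ScsiDeviceIO failure line, or None."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    op, h, d, key = (int(m[name], 16) for name in ("op", "h", "d", "key"))
    return {
        "device": m["dev"],
        "opcode": OPCODES.get(op, "unknown"),
        "host": HOST_STATUS.get(h, "unknown"),
        "device_status": DEVICE_STATUS.get(d, "unknown"),
        # The sense bytes are only trusted when the log marks them "Valid".
        "sense_key": SENSE_KEYS.get(key, "unknown") if m["kind"] == "Valid" else None,
    }


def summarize(lines):
    """Count failures per (device, host status, device status)."""
    counts = Counter()
    for line in lines:
        info = decode(line)
        if info is not None:
            counts[(info["device"], info["host"], info["device_status"])] += 1
    return counts
```

Feeding the lines of a vmkernel log to `summarize` gives a quick per-device tally. Read this way, the naa device line is a check condition on a MODE SENSE command, while the Samsung line is a host-side timeout on a READ.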

The two device names likely refer to the same disk. I would suggest using Storage vMotion to move or clone the VMs to another disk, as it is likely that the SSD is failing.

I hope this helps.


Graham | User Moderator | https://virtualg.uk