We had a production SQL Server VM running on an ESX 4i host power off unexpectedly right after getting a failed snapshot error. The snapshot was taken by our NetApp storage device for its backup using SMVI. I understand that two issues can cause this. First, one of the snapshot files can get locked by the VM, the host, or a third-party application (in this case NetApp's SMVI). Second, if the operating state of the memory is not preserved at the time of a snapshot revert, the VM will power off.
The error we received was:
Cannot open the disk '/vmfs/volumes/UUID/VMName/VMName-000001.vmdk' or one of the snapshot disks it depends on. Reason: Failed to lock the file.
The volume name and .vmdk name have been changed to generic ones for this post, but otherwise that is the exact error message.
Here is the diagnosis and fix:
During a create or delete snapshot operation, the virtual machine was unexpectedly powered off. The symptoms: the create or delete snapshot task fails, and the virtual machine is powered off with an error.
A NetApp post stated: "This issue occurs when one of the files required by the virtual machine has been opened by another application. During a Create or Delete Snapshot operation while a Virtual machine is running, all the disk files are momentarily closed and reopened. During this window, the files could be opened by another Virtual machine, management process, or third-party utility. If that application creates and maintains a lock on the required disk files, the Virtual machine cannot reopen the file and resume running."
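To make the window described in that quote concrete, here is a toy model in Python. It is only a sketch of the idea and not how ESX actually implements file locking: the LockTable class, the snapshot_cycle function, and the "backup-tool" name are all made up for illustration.

```python
class LockTable:
    """Toy stand-in for exclusive per-file locks (hypothetical, not VMware code)."""

    def __init__(self):
        self._owner = {}  # file name -> current lock holder

    def acquire(self, name, who):
        # Succeeds only if the file is unlocked or already held by `who`.
        holder = self._owner.setdefault(name, who)
        return holder == who

    def release(self, name, who):
        # Only the current holder can drop the lock.
        if self._owner.get(name) == who:
            del self._owner[name]


def snapshot_cycle(locks, disk, intruder=None):
    """Close and reopen `disk` as the VM; return True if the VM can keep running."""
    locks.release(disk, "vm")            # disk files are momentarily closed
    if intruder is not None:
        locks.acquire(disk, intruder)    # another process locks the file in the window
    return locks.acquire(disk, "vm")     # the VM tries to reopen its disk


disk = "VMName-000001.vmdk"

quiet = LockTable()
quiet.acquire(disk, "vm")
print(snapshot_cycle(quiet, disk))                  # True: nobody interfered

busy = LockTable()
busy.acquire(disk, "vm")
print(snapshot_cycle(busy, disk, "backup-tool"))    # False: reopen fails, VM cannot resume
```

The second case is the one that matches the error above: the lock is taken in the gap between close and reopen, so the VM cannot get its disk back.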
Browsing the ESX logs, I found the "Failed to lock the file" error logged at the same time the snapshot failed, which is also when the VM was unexpectedly powered off. Once that is confirmed in the host logs, the next step is to look at the NetApp SnapManager settings for this particular VM and ensure that it is set to quiesce the disk prior to snapshots, since it is a SQL Server and is very write intensive.
This will ensure that all disk writes are consistent and will prevent a file lock issue mid-snapshot. Hope this helps.
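For the log check mentioned above, a small script can save some scrolling. This is a minimal sketch that only assumes the error text appears in the log exactly as quoted in this post; the find_lock_failures name and the sample lines are mine, and the location of the log file depends on your ESX version, so it is not hard-coded here.

```python
import re

# Pattern built from the exact error message quoted in this post.
LOCK_ERROR = re.compile(
    r"Cannot open the disk '(?P<disk>[^']+)' or one of the snapshot disks "
    r"it depends on\. Reason: Failed to lock the file\."
)


def find_lock_failures(lines):
    """Return (line_number, disk_path) for every line containing the lock error."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = LOCK_ERROR.search(line)
        if match:
            hits.append((number, match.group("disk")))
    return hits


# Made-up sample input; only the second line is the error from this post.
sample = [
    "some unrelated log line",
    "Cannot open the disk '/vmfs/volumes/UUID/VMName/VMName-000001.vmdk' "
    "or one of the snapshot disks it depends on. Reason: Failed to lock the file.",
]
print(find_lock_failures(sample))
# [(2, '/vmfs/volumes/UUID/VMName/VMName-000001.vmdk')]
```

To scan a real log, pass an open file object instead of the sample list, then compare the matching lines against the time of the failed snapshot task.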
Here is another post with similar issues, just not caused by NetApp