I have two virtual machines on two separate ESX 4 hosts. Each VM is running Server 2008 Enterprise with Failover Clustering installed. SQL 2008 Enterprise, MSDTC, and SAP are installed on the cluster. The cluster has been validated, and I used a VMware document to configure the VMs correctly. For some reason the cluster service fails and does not fail over. You can't even restart the service; the node has to be rebooted.
Event ID 1574
The failover cluster database could not be unloaded. If restarting the cluster service does not fix the problem, please restart the machine.
Before that I get these events leading up to it. They are not in order, but I left the timestamps on them. Any ideas would be greatly appreciated. Thank you.
Log Name: System
Source: Ntfs
Date: 6/2/2010 9:50:54 AM
Event ID: 137
Task Category: (2)
Level: Error
Keywords: Classic
User: N/A
Computer: DB1
Description:
The default transaction resource manager on volume J: encountered a non-retryable error and could not start. The data contains the error code.
---
Log Name: System
Source: volmgr
Date: 6/2/2010 9:51:03 AM
Event ID: 57
Task Category: (2)
Level: Warning
Keywords: Classic
User: N/A
Computer: DB1
Description:
The system failed to flush data to the transaction log. Corruption may occur.
---
Log Name: Application
Source: Application Error
Date: 6/2/2010 9:50:49 AM
Event ID: 1000
Task Category: (100)
Level: Error
Keywords: Classic
User: N/A
Computer: DB1
Description:
Faulting application clussvc.exe, version 6.0.6002.18005, time stamp 0x49e025d2, faulting module ntdll.dll, version 6.0.6002.18005, time stamp 0x49e0421d, exception code 0xc0000006, fault offset 0x000000000003347e, process id 0x7a4, application start time 0x01cb01a48c7cb13c.
---
Log Name: Application
Source: Application Error
Date: 6/2/2010 9:50:49 AM
Event ID: 1005
Task Category: (100)
Level: Error
Keywords: Classic
User: N/A
Computer: DB1
Description:
Windows cannot access the file C:\Windows\Cluster\clussvc.exe for one of the following reasons: there is a problem with the network connection, the disk that the file is stored on, or the storage drivers installed on this computer; or the disk is missing. Windows closed the program Microsoft Failover Cluster Service because of this error.
Program: Microsoft Failover Cluster Service
File: C:\Windows\Cluster\clussvc.exe
The error value is listed in the Additional Data section.
User Action
1. Open the file again. This situation might be a temporary problem that corrects itself when the program runs again.
2. If the file still cannot be accessed and
- It is on the network, your network administrator should verify that there is not a problem with the network and that the server can be contacted.
- It is on a removable disk, for example, a floppy disk or CD-ROM, verify that the disk is fully inserted into the computer.
3. Check and repair the file system by running CHKDSK. To run CHKDSK, click Start, click Run, type CMD, and then click OK. At the command prompt, type CHKDSK /F, and then press ENTER.
4. If the problem persists, restore the file from a backup copy.
5. Determine whether other files on the same disk can be opened. If not, the disk might be damaged. If it is a hard disk, contact your administrator or computer hardware vendor for further assistance.
Additional Data
Error value: 80000011
Disk type: 3
---
Log Name: System
Source: Microsoft-Windows-Kernel-General
Date: 6/2/2010 9:50:49 AM
Event ID: 6
Task Category: None
Level: Error
Keywords:
User: SYSTEM
Computer: DB1
Description:
An I/O operation initiated by the Registry failed unrecoverably. The Registry could not flush hive (file): '\??\C:\Windows\Cluster\CLUSDB'.
---
Log Name: System
Source: Service Control Manager
Date: 6/2/2010 9:50:53 AM
Event ID: 7031
Task Category: None
Level: Error
Keywords: Classic
User: N/A
Computer: DB1
Description:
The Cluster Service service terminated unexpectedly. It has done this 1 time(s). The following corrective action will be taken in 60000 milliseconds: Restart the service.
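Since the events above were pasted out of order, putting them on a timeline makes the sequence easier to read. The following is only a minimal sketch in Python: the tuples are copied by hand from the timestamps quoted above, and the helper names `parse_time` and `timeline` are made up for illustration.

```python
from datetime import datetime

# (timestamp, log, source, event ID) copied from the events quoted above.
EVENTS = [
    ("6/2/2010 9:50:54 AM", "System", "Ntfs", 137),
    ("6/2/2010 9:51:03 AM", "System", "volmgr", 57),
    ("6/2/2010 9:50:49 AM", "Application", "Application Error", 1000),
    ("6/2/2010 9:50:49 AM", "Application", "Application Error", 1005),
    ("6/2/2010 9:50:49 AM", "System", "Microsoft-Windows-Kernel-General", 6),
    ("6/2/2010 9:50:53 AM", "System", "Service Control Manager", 7031),
]


def parse_time(stamp):
    """Parse the month/day/year 12-hour format used in the pasted events."""
    return datetime.strptime(stamp, "%m/%d/%Y %I:%M:%S %p")


def timeline(events):
    """Return the events sorted by timestamp; ties keep their input order."""
    return sorted(events, key=lambda event: parse_time(event[0]))


for stamp, log, source, event_id in timeline(EVENTS):
    print(f"{stamp}  {log:<11}  {source} ({event_id})")
```

Sorted this way, the three 9:50:49 events (the clussvc.exe fault, the file access error, and the CLUSDB flush failure) come first, followed by the service termination at 9:50:53 and then the Ntfs and volmgr events. The timestamps only have one-second resolution, so the order within 9:50:49 is not established.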
I have a very similar setup: vSphere 4, MS cluster with SQL Standard.
I am getting one of the same errors that you have:
Windows cannot access the file for one of the following reasons: there is a problem with the network connection, the disk that the file is stored on, or the storage drivers installed on this computer; or the disk is missing. Windows closed the program SQL Server Windows NT - 64 Bit because of this error.
Program: SQL Server Windows NT - 64 Bit
File:
The error value is listed in the Additional Data section.
User Action
1. Open the file again. This situation might be a temporary problem that corrects itself when the program runs again.
2. If the file still cannot be accessed and
- It is on the network, your network administrator should verify that there is not a problem with the network and that the server can be contacted.
- It is on a removable disk, for example, a floppy disk or CD-ROM, verify that the disk is fully inserted into the computer.
3. Check and repair the file system by running CHKDSK. To run CHKDSK, click Start, click Run, type CMD, and then click OK. At the command prompt, type CHKDSK /F, and then press ENTER.
4. If the problem persists, restore the file from a backup copy.
5. Determine whether other files on the same disk can be opened. If not, the disk might be damaged. If it is a hard disk, contact your administrator or computer hardware vendor for further assistance.
Additional Data
Error value: 80000011
Disk type: 0
The 80000011 error is mentioned here and indicates something is contending for the disk.
We are having the exact same issues as the topic starter. (gmensching)
Did you ever find an answer?
We have a Microsoft cluster with 2 nodes, each running on a separate ESX 4.1 host.
Both nodes in our cluster use 2 shared cluster disks, which are LUNs attached as raw device mappings (RDMs).
The two cluster nodes are running Windows 2008 SP2 x86.
They are hosted by two different ESX hosts.
The LSI SAS controller is being used.
The RDMs are in Physical Compatibility Mode, and SCSI Bus Sharing is set to Physical.
I found out that the status code 0x80000011 maps to STATUS_DEVICE_BUSY and implies that the device is currently busy.
So this is definitely a disk problem. The question is what causes the device to be busy. A configuration error?
And is there any way to know which device is causing the problem (the local C: drive or the shared cluster drives)?
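For anyone who wants to check the mapping themselves, here is a rough sketch in Python that splits the error value into the standard NTSTATUS bit fields. It assumes the "Error value" in the event is an NTSTATUS code, as suggested above; `decode_ntstatus` and the one-entry lookup table are illustrative names, not part of any Windows tool.

```python
# NTSTATUS layout: bits 31-30 severity, bit 29 customer flag,
# bits 27-16 facility, bits 15-0 code.
SEVERITIES = {0: "success", 1: "informational", 2: "warning", 3: "error"}

# Only the value discussed in this thread; not a complete table.
KNOWN_CODES = {0x80000011: "STATUS_DEVICE_BUSY"}


def decode_ntstatus(value):
    """Split a 32-bit NTSTATUS value into its bit fields."""
    return {
        "severity": SEVERITIES[(value >> 30) & 0x3],
        "customer": bool((value >> 29) & 0x1),
        "facility": (value >> 16) & 0xFFF,
        "code": value & 0xFFFF,
        "name": KNOWN_CODES.get(value, "unknown"),
    }


# The event log prints the value as hex without a 0x prefix.
status = int("80000011", 16)
print(decode_ntstatus(status))
```

The decoded fields show a warning-severity status with facility 0 and code 0x11. That confirms the device reported busy, but it does not say which disk was involved.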
MSCS does not support SAN multipathing. Are you using PowerPath/VE or a round-robin policy for the Quorum/Witness LUNs?
If yes, create a claim rule to enable NMP for the Quorum/Witness LUNs and configure NMP to use the MRU or Fixed policy.
On the advice of VMware I moved the C drive of the virtual machines off of a datastore on the FC SAN and onto the VM host's local datastore, and I have not had the problem since. All of the shared disks work just fine. I still haven't figured out the problem though. I'm taking a look at the fiber switch and the disks.
We have no round-robin policy enabled in VMware (as seen in the attached screenshot).
It's a fixed path.
We are using a Hitachi AMS1000 SAN.
On our SAN we have an Active/Passive configuration. The active/passive state did not change during the cluster problems.
We are not able to move the C: drive VMDKs to a local disk of the ESX host, because our ESX hosts have no local disks.
Our ESX hosts are blade servers which boot from SAN.
I think I have found my problem. When I created my configuration I missed a step. I had my shared disks using the same SCSI controller as the other disks. After moving the shared disks to a new controller and setting bus sharing on the original controller to None, my problems seem to have gone away. The step I missed from the VMware document is below.
7. Select a new virtual device node (for example, select SCSI (1:0)), and click Next.
NOTE This must be a new SCSI controller. You cannot use SCSI 0.
Hi,
Thanks for the follow-up!
I just figured out the same solution and I am configuring it as we speak.
I had also used only 1 SCSI controller for all disks (the OS .vmdk file AND the shared mapped raw LUNs).
Once it has been tested, I'll let you know if it also fixed my problem.
I now have 2 SCSI controllers:
SCSI controller 0 is used for the OS .vmdk, and SCSI Bus Sharing is set to None.
SCSI controller 1 is used for all mapped raw LUNs, and SCSI Bus Sharing is set to Physical.
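A layout like this can be sanity-checked against the text of a VM's .vmx file. The Python below is only a sketch under assumptions: it assumes bus sharing appears in the .vmx as `scsiN.sharedBus` with values such as "none" or "physical" and that a missing key means no sharing, so verify that against your own files. The function names are made up for illustration.

```python
import re


def parse_vmx(text):
    """Parse 'key = "value"' lines into a dict with lowercased keys."""
    settings = {}
    for line in text.splitlines():
        match = re.match(r'\s*([^#=\s]+)\s*=\s*"(.*)"\s*$', line)
        if match:
            settings[match.group(1).lower()] = match.group(2)
    return settings


def shared_controller_problems(text):
    """List deviations from the two-controller layout described above."""
    settings = parse_vmx(text)
    problems = []
    # Assumption: an absent scsi0.sharedBus key means bus sharing is off.
    if settings.get("scsi0.sharedbus", "none").lower() != "none":
        problems.append("scsi0 has bus sharing enabled")
    shared = [
        key.split(".")[0]
        for key, value in settings.items()
        if re.fullmatch(r"scsi\d+\.sharedbus", key) and value.lower() != "none"
    ]
    if not [name for name in shared if name != "scsi0"]:
        problems.append("no additional SCSI controller with bus sharing")
    return problems


# Hypothetical fragments, not taken from a real VM.
single_controller = 'scsi0.present = "TRUE"\nscsi0.sharedBus = "physical"\n'
two_controllers = (
    'scsi0.present = "TRUE"\n'
    'scsi1.present = "TRUE"\n'
    'scsi1.sharedBus = "physical"\n'
)
print(shared_controller_problems(single_controller))
print(shared_controller_problems(two_controllers))
```

The first fragment mirrors the original mistake (everything on controller 0 with sharing on) and is flagged; the second mirrors the corrected layout and comes back clean. The check does not look at which disks sit on which controller.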
Hi,
Thanks for sharing your valuable info.
Currently I am looking for information or a document on configuring MSCS on my ESX 4.0/4.1 version.
Could you please help me with that? If you have any document or info, please share it.
My email ID is: villykaras@gmail.com
Thanks and regards,
Valerian Crasto.
Hi,
It looks like the question is marked as answered. Do you mind sharing how you resolved it? I'm keen to know. Thanks.