My Microsoft cluster service keeps failing

gmensching · ‎06-03-2010

I have two virtual machines on two separate ESX 4 hosts. Each VM is running Server 2008 Enterprise with Failover Clustering installed. SQL 2008 Enterprise is installed on the cluster. MSDTC and SAP are also installed on the cluster. The cluster has been validated and I used a VMware document to configure the VMs correctly. For some reason the cluster service will fail and will not fail over. You can't even restart the service. The node has to be rebooted.

Event ID 1574

The failover cluster database could not be unloaded. If restarting the cluster service does not fix the problem, please restart the machine.

Before that I get these events leading up to it. They are not in order but I left the timestamps on them. Any ideas would be greatly appreciated. Thank you.

Log Name: System
Source: Ntfs
Date: 6/2/2010 9:50:54 AM
Event ID: 137
Task Category: (2)
Level: Error
Keywords: Classic
User: N/A
Computer: DB1
Description:
The default transaction resource manager on volume J: encountered a non-retryable error and could not start. The data contains the error code.

---

Log Name: System
Source: volmgr
Date: 6/2/2010 9:51:03 AM
Event ID: 57
Task Category: (2)
Level: Warning
Keywords: Classic
User: N/A
Computer: DB1
Description:
The system failed to flush data to the transaction log. Corruption may occur.

---

Log Name: Application
Source: Application Error
Date: 6/2/2010 9:50:49 AM
Event ID: 1000
Task Category: (100)
Level: Error
Keywords: Classic
User: N/A
Computer: DB1
Description:
Faulting application clussvc.exe, version 6.0.6002.18005, time stamp 0x49e025d2, faulting module ntdll.dll, version 6.0.6002.18005, time stamp 0x49e0421d, exception code 0xc0000006, fault offset 0x000000000003347e, process id 0x7a4, application start time 0x01cb01a48c7cb13c.

---

Log Name: Application
Source: Application Error
Date: 6/2/2010 9:50:49 AM
Event ID: 1005
Task Category: (100)
Level: Error
Keywords: Classic
User: N/A
Computer: DB1
Description:
Windows cannot access the file C:\Windows\Cluster\clussvc.exe for one of the following reasons: there is a problem with the network connection, the disk that the file is stored on, or the storage drivers installed on this computer; or the disk is missing. Windows closed the program Microsoft Failover Cluster Service because of this error.

Program: Microsoft Failover Cluster Service

File: C:\Windows\Cluster\clussvc.exe

The error value is listed in the Additional Data section.

User Action

1. Open the file again. This situation might be a temporary problem that corrects itself when the program runs again.

2. If the file still cannot be accessed and

- It is on the network, your network administrator should verify that there is not a problem with the network and that the server can be contacted.

- It is on a removable disk, for example, a floppy disk or CD-ROM, verify that the disk is fully inserted into the computer.

3. Check and repair the file system by running CHKDSK. To run CHKDSK, click Start, click Run, type CMD, and then click OK. At the command prompt, type CHKDSK /F, and then press ENTER.

4. If the problem persists, restore the file from a backup copy.

5. Determine whether other files on the same disk can be opened. If not, the disk might be damaged. If it is a hard disk, contact your administrator or computer hardware vendor for further assistance.

Additional Data

Error value: 80000011

Disk type: 3

---

Log Name: System
Source: Microsoft-Windows-Kernel-General
Date: 6/2/2010 9:50:49 AM
Event ID: 6
Task Category: None
Level: Error
Keywords:
User: SYSTEM
Computer: DB1
Description:
An I/O operation initiated by the Registry failed unrecoverably. The Registry could not flush hive (file): '\??\C:\Windows\Cluster\CLUSDB'.

---

Log Name: System
Source: Service Control Manager
Date: 6/2/2010 9:50:53 AM
Event ID: 7031
Task Category: None
Level: Error
Keywords: Classic
User: N/A
Computer: DB1
Description:
The Cluster Service service terminated unexpectedly. It has done this 1 time(s). The following corrective action will be taken in 60000 milliseconds: Restart the service.
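Since the events above were pasted out of order, a minimal Python sketch like the following can put them on a single timeline. The timestamps, log names, sources, and event IDs are copied from the events quoted above; the tuple layout and the `parse_stamp` helper are only illustrative choices, not part of any Windows tooling.

```python
from datetime import datetime

# (timestamp, log, source, event ID), copied from the events quoted in this post
events = [
    ("6/2/2010 9:50:54 AM", "System", "Ntfs", 137),
    ("6/2/2010 9:51:03 AM", "System", "volmgr", 57),
    ("6/2/2010 9:50:49 AM", "Application", "Application Error", 1000),
    ("6/2/2010 9:50:49 AM", "Application", "Application Error", 1005),
    ("6/2/2010 9:50:49 AM", "System", "Microsoft-Windows-Kernel-General", 6),
    ("6/2/2010 9:50:53 AM", "System", "Service Control Manager", 7031),
]


def parse_stamp(text):
    """Parse the 'M/D/YYYY h:mm:ss AM' format used in the event text above."""
    return datetime.strptime(text, "%m/%d/%Y %I:%M:%S %p")


# sorted() is stable, so events with the same second keep their listed order
timeline = sorted(events, key=lambda e: parse_stamp(e[0]))

for stamp, log, source, event_id in timeline:
    print(f"{parse_stamp(stamp):%H:%M:%S}  {log:<12} {source:<34} {event_id}")
```

Sorted this way, the clussvc.exe crash (1000 and 1005) and the CLUSDB flush failure (6) at 9:50:49 come first, followed by the service termination (7031) at 9:50:53, the Ntfs error on J: (137) at 9:50:54, and the volmgr warning (57) at 9:51:03.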

tostao · ‎06-30-2010

I have a very similar setup. VSphere 4, MS cluster with SQL Standard.

Getting one of the errors that you have:

Windows cannot access the file for one of the following reasons: there is a problem with the network connection, the disk that the file is stored on, or the storage drivers installed on this computer; or the disk is missing. Windows closed the program SQL Server Windows NT - 64 Bit because of this error.

Program: SQL Server Windows NT - 64 Bit

File:

The error value is listed in the Additional Data section.

User Action

1. Open the file again. This situation might be a temporary problem that corrects itself when the program runs again.

2. If the file still cannot be accessed and

- It is on the network, your network administrator should verify that there is not a problem with the network and that the server can be contacted.

- It is on a removable disk, for example, a floppy disk or CD-ROM, verify that the disk is fully inserted into the computer.

3. Check and repair the file system by running CHKDSK. To run CHKDSK, click Start, click Run, type CMD, and then click OK. At the command prompt, type CHKDSK /F, and then press ENTER.

4. If the problem persists, restore the file from a backup copy.

5. Determine whether other files on the same disk can be opened. If not, the disk might be damaged. If it is a hard disk, contact your administrator or computer hardware vendor for further assistance.

Additional Data

Error value: 80000011

Disk type: 0

The 80000011 error is mentioned here and indicates something is contending for the disk.

bartbilliet · ‎08-31-2010

We are having the exact same issues as the topic starter (gmensching).

Did you ever find an answer?

We have a Microsoft cluster with 2 nodes, each running on a separate ESX 4.1 host.

Both nodes in our cluster use 2 shared cluster disks, which are LUNs connected through raw device mappings (RDMs).

The two cluster nodes are running Windows 2008 SP2 x86.

They are hosted by two different ESX hosts.

The LSI SAS controller is being used.

The RDMs are in physical compatibility mode, and SCSI Bus Sharing is set to physical.

I found out that the status code 0x80000011 maps to STATUS_DEVICE_BUSY, which means the device is currently busy.

So this is definitely a disk problem. The question is what causes the device to be busy. Is it a configuration error?

And is there any way to know which device is causing the problem (the local C drive or the shared cluster drives)?
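For reference, here is a small Python sketch of how the two codes in this thread can be read as NTSTATUS values. The lookup table only covers the two codes quoted above (0x80000011 from the 1005 event and 0xc0000006 from the clussvc.exe crash), and `describe_ntstatus` is a made-up helper name, not a Windows API. The severity comes from the top two bits of the value.

```python
# Only the two codes that appear in this thread; not a full NTSTATUS table.
KNOWN_STATUS = {
    0x80000011: "STATUS_DEVICE_BUSY",
    0xC0000006: "STATUS_IN_PAGE_ERROR",
}

# NTSTATUS severity is encoded in the two most significant bits.
SEVERITY = {0: "success", 1: "informational", 2: "warning", 3: "error"}


def describe_ntstatus(code):
    """Return a short description of a 32-bit NTSTATUS value (hypothetical helper)."""
    severity = SEVERITY[(code >> 30) & 0x3]
    name = KNOWN_STATUS.get(code, "unknown (not in this small table)")
    return f"0x{code:08X}: {name} ({severity})"


# The event text shows the value as bare hex, so parse it with base 16.
print(describe_ntstatus(int("80000011", 16)))
print(describe_ntstatus(0xC0000006))
```

This does not tell you which disk was busy. It only shows that the 1005 event reports a busy device, while the crash itself is an in-page error, meaning a page of clussvc.exe could not be read back in from disk.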

thakala · ‎08-31-2010

MSCS does not support SAN multipathing. Are you using PowerPath/VE or a round-robin policy for the Quorum/Witness LUNs?

If yes, create a claim rule to enable NMP for the Quorum/Witness LUNs and configure NMP to use the MRU or Fixed policy.


Tomi http://v-reality.info

gmensching · ‎08-31-2010

On the advice of VMware I moved the C drive of the virtual machines off of a datastore on the FC SAN and onto the VM host's local datastore, and I have not had the problem since. All of the shared disks work just fine. I still haven't figured out the problem though. I'm taking a look at the fiber switch and the disks.

bartbilliet · ‎09-01-2010

We do not have a round-robin policy enabled in VMware (as seen in the attached screenshot).

It's a fixed path.

We are using a Hitachi AMS1000 SAN.

On our SAN we have an Active/Passive configuration. The active/passive state did not change during the cluster problems.

We are not able to move the C-drive VMDKs to a local disk of the ESX host, because our ESX hosts have no local disks.

Our ESX hosts are blade servers which boot from SAN.

gmensching · ‎09-13-2010

I think I have found my problem. When I created my configuration I missed a step. I had my shared disks using the same SCSI controller as the other disks. After moving the shared disks to a new SCSI controller and setting bus sharing on the original controller to None, my problems seem to have gone away. The step I missed from the VMware document is below.

7. Select a new virtual device node (for example, select SCSI (1:0)), and click Next.

NOTE This must be a new SCSI controller. You cannot use SCSI 0.

bartbilliet · ‎09-13-2010

Hi,

Thanks for the follow-up!

I just figured out the same solution and I am configuring it as we speak.

I had also used only 1 SCSI controller for all disks (the OS .vmdk file AND the shared mapped raw LUNs).

Once it has been tested, I'll let you know if it also fixed my problem.

I now have 2 SCSI controllers:

SCSI controller 0 is used for the OS .vmdk, and SCSI Bus Sharing is set to None.

SCSI controller 1 is used for all mapped raw LUNs, and SCSI Bus Sharing is set to Physical.
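As a rough way to double-check a layout like this, the following Python sketch scans .vmx-style settings and flags the mistake described in this thread. The sample text, the file names in it, and the helper names are hypothetical. It assumes the controller's bus sharing mode is stored under a key such as `scsi0.sharedBus`, so compare against your own .vmx file before relying on it.

```python
import re

# Hypothetical .vmx excerpt for illustration only; file names are made up.
SAMPLE_VMX = '''
scsi0.present = "TRUE"
scsi0.sharedBus = "none"
scsi0:0.fileName = "db1.vmdk"
scsi1.present = "TRUE"
scsi1.sharedBus = "physical"
scsi1:0.fileName = "quorum-rdm.vmdk"
scsi1:1.fileName = "data-rdm.vmdk"
'''


def parse_vmx(text):
    """Parse key = "value" lines into a dict, skipping anything else."""
    settings = {}
    for line in text.splitlines():
        match = re.match(r'\s*([^=\s]+)\s*=\s*"(.*)"\s*$', line)
        if match:
            settings[match.group(1)] = match.group(2)
    return settings


def check_bus_sharing(settings):
    """Return a list of problems with the controller layout (empty if none)."""
    problems = []
    # Controller 0 carries the OS disk and should not share its bus.
    sharing = settings.get("scsi0.sharedBus", "none").lower()
    if sharing != "none":
        problems.append(f"scsi0 has bus sharing set to {sharing}; expected none")
    # The shared disks need a separate controller with physical bus sharing.
    shared = [
        key.split(".")[0]
        for key, value in settings.items()
        if re.fullmatch(r"scsi[1-9]\d*\.sharedBus", key) and value.lower() == "physical"
    ]
    if not shared:
        problems.append("no controller other than scsi0 has bus sharing set to physical")
    return problems


print(check_bus_sharing(parse_vmx(SAMPLE_VMX)) or "layout looks consistent")
```

The check only covers the two points from the posts above: controller 0 set to None, and at least one other controller set to Physical. It does not verify which disks are attached where.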

Valcra · ‎01-04-2011

Hi,

Thanks for sharing your valuable info.

I am currently looking for information or a document on configuring MSCS on ESX 4.0/4.1.

Could you please help me with that? If you have any document or info, please share it.

My email is villykaras@gmail.com

Thanks and regards,

Valerian Crasto.

idle-jam · ‎01-04-2011

Hi,

It looks like the question is marked as answered. Do you mind sharing how you resolved it? I'm keen to know. Thanks.