VMware Cloud Community
Jonas_B
Contributor

Disk problem with MSCS on ESX 3.5 under high I/O

Hello,

We have an MS cluster set up in VMware on one physical box. It consists of two VMs, one database server and one application server. The OS is W2K Server and the SQL version is SQL 2005 Standard Edition. We've had this cluster in production for over a year; last week, however, we updated the ESX host the VMs run on. We booted the VMs and the cluster came up fine, with no problems or error messages at all. We took an extra backup of the databases and there were no problems with it, so everything looked fine.

The next day, however, we noticed that when we get high I/O on the server, SQL Server logs error 170 (which is "Requested resource is in use") and SQL error 9001. The server also logs a "{Lost Delayed-Write Data}" error in the Event Viewer. When I google the errors, most of the answers say that it is possibly some kind of hardware error on the disk.

The disks in the MS cluster are set up as .vmdk files on our iSCSI VMFS. The SCSI controller used is BusLogic (since they are W2K servers).

Has anyone run into something similar? Or does anyone have any idea what we can do about this?

8 Replies
jhanekom
Virtuoso

What did you update on the ESX hosts?

Jonas_B
Contributor

oh, I forgot to mention that lol...

We updated our hosts from ESX 3.0.2 to ESX 3.5 Update 1.

kjb007
Immortal

Are you running active/active or active/passive? Do you see any SCSI errors in the /var/log/vmkernel logfile?
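If the log is large, a small script can make that check quicker. Below is a minimal Python sketch, assuming a plain case-insensitive match on "scsi" is good enough for a first pass. The helper names `find_scsi_lines` and `scan_log` are made up for illustration, and the exact format of vmkernel messages is not assumed here.

```python
import re

# Assumption: any line containing "scsi" (in any letter case) is worth a look.
# This does not parse or interpret the actual vmkernel message format.
SCSI_PATTERN = re.compile(r"scsi", re.IGNORECASE)


def find_scsi_lines(lines):
    """Return (line_number, text) pairs for every line that mentions SCSI."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if SCSI_PATTERN.search(line)
    ]


def scan_log(path="/var/log/vmkernel"):
    """Scan a log file and return the matching lines with their line numbers."""
    # errors="replace" keeps the scan going if the log holds odd bytes.
    with open(path, errors="replace") as handle:
        return find_scsi_lines(handle)
```

Running `scan_log()` on the service console and printing the returned pairs gives a short list to read through. An empty list only means no line mentioned SCSI, not that the storage path is healthy.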

-KjB

vExpert/VCP/VCAP vmwise.com / @vmwise -KjB
Jonas_B
Contributor

We run the SQL instance in active/passive.

The disks are also active/passive; however, they share the same disk controller.

In the /var/log/vmkernel I see no SCSI warnings at all.

kjb007
Immortal

I would run a couple of tests using iometer and/or sqlio tools. You may be running into some latency issues. What kind of SAN are you using, and have you checked your SAN for issues as well?
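To get a rough feel for write latency before setting up a full benchmark, something like the Python sketch below might help. It is not a substitute for iometer or sqlio: it only times small synchronous writes to a temporary file. It assumes a Python interpreter is available on the machine under test, and the function names, block size, and write count are arbitrary choices for illustration.

```python
import os
import tempfile
import time


def measure_write_latency(directory, block_size=8192, count=200):
    """Write `count` blocks, flushing each one to disk, and return latencies in ms."""
    payload = os.urandom(block_size)
    latencies = []
    # The temporary file is created inside the directory being tested.
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        for _ in range(count):
            start = time.perf_counter()
            os.write(fd, payload)
            os.fsync(fd)  # force the write through to the storage
            latencies.append((time.perf_counter() - start) * 1000.0)
    finally:
        os.close(fd)
        os.remove(path)
    return latencies


def summarize(latencies):
    """Return the average, an approximate 95th percentile, and the maximum latency."""
    ordered = sorted(latencies)
    return {
        "avg_ms": sum(ordered) / len(ordered),
        "p95_ms": ordered[int(0.95 * (len(ordered) - 1))],
        "max_ms": ordered[-1],
    }
```

Calling `summarize(measure_write_latency(some_directory))` on a directory that lives on the affected disk, once when the server is idle and once under load, gives two sets of numbers to compare. Large spikes in `max_ms` under load would point toward the latency theory, though they would not say where in the storage path the delay comes from.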

-KjB

vExpert/VCP/VCAP vmwise.com / @vmwise -KjB
racerzer0
Contributor

Jonas_B, did you ever find a solution for this?

Jonas_B
Contributor

Hello,

No, we did not find a solution. We ended up installing a new SQL server on Win2K3 without MSCS and moved all the databases there.

noahj
Hot Shot

I had a similar issue on a physical server that was clustered. I found that for some reason the storage group had moved to the opposite node and was no longer accessible to the primary cluster node. Once I ensured that both were on the correct node, the problems went away. Perhaps this issue is somehow similar?
