MSCS failovers - symmpi errors?

dtstephens · ‎08-08-2007

This one is driving us nuts.

Platform is ESX v2.5.3 build 22981, Windows 2003 Enterprise SP1. HP ProLiant BL20p servers, HP SAN switches and EMC CLARiiON CX700 storage (latest FLARE code).

We've got a 2-node MSCS cluster across physical blades. Everything worked fine when we were on physical servers, and the migration to virtual was (fairly) trouble-free. We moved one of the cluster nodes (call it Node 1) to a different physical server several months ago, and since then we have been having problems with the cluster losing its brain and failing the cluster resources over from Node 1 to Node 2. It's always Node 1 that fails over. We usually see these errors in the system event log:

Event ID 9, The device \Device\Scsi\symmpi1 did not respond within the timeout period.

followed by:

Event 1118, Cluster service was terminated as requested by node 2.

My research on these errors, especially the event 9, didn't shed much light on the problem. We are using physical disks, by the way, for the cluster shared resources.

Any ideas from the community would be MUCH appreciated!

Thanks,

ds

frankdenneman · ‎08-08-2007

Hi,

Have you altered the disk timeout registry key in the guest OS?

Set the disk I/O timeout to 60 seconds or more by setting

HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\Disk\TimeOutValue

See http://www.vmware.com/pdf/vi3_vm_and_mscs.pdf for more details
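For anyone who wants to apply the setting above from a file rather than through regedit, here is a minimal sketch that just renders the matching .reg text. The helper name render_disk_timeout_reg is made up for illustration, and it assumes TimeOutValue is a DWORD expressed in seconds. It only builds the text; importing it into each guest is still a manual step.

```python
def render_disk_timeout_reg(seconds: int = 60) -> str:
    """Return .reg file text that sets the Disk TimeOutValue to `seconds`.

    Hypothetical helper for illustration; assumes the value is a DWORD in seconds.
    """
    if not 0 < seconds <= 0xFFFFFFFF:
        raise ValueError("timeout must fit in a positive 32-bit DWORD")
    key = r"HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\Disk"
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{key}]",
        # .reg files write DWORD values as 8 hexadecimal digits.
        f'"TimeOutValue"=dword:{seconds:08x}',
        "",
    ]
    return "\r\n".join(lines)


if __name__ == "__main__":
    print(render_disk_timeout_reg(60))
```

With the default of 60 seconds the value line comes out as dword:0000003c.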

Are you using the local disk of the ESX host for your Windows C: drive, or are those VMDKs also stored on the SAN?


dtstephens · ‎08-08-2007

Thanks for the reply.

Yes, I had forgotten to mention that we did the reg hack quite a while ago. We set the timeout value to 60 on both nodes. It hasn't seemed to make any difference.

Thanks for posting the link to that document. Although we're not up to ESX 3 just yet, there could be something applicable in that guide.

ds

dtstephens · ‎08-08-2007

As to your second question... ah, well, the C: drives are also on the SAN. And in reading the doc you referenced, it seems like this is not supported. I can't seem to find the previous version of this guide, but I'm fairly certain I've seen it (for ESX v2.n.n), and I don't recall this being a restriction in that version of ESX. Think we've found the smoking gun?

frankdenneman · ‎08-08-2007

As far as I know, VMware introduced this requirement when ESX 3 was released. I haven't seen this restriction in the 2.x documentation.

Bjørn Anders Jørgensen from VMware explains in this thread why the C: drive has to be placed on the local disk of the ESX host:

http://www.vmware.com/community/thread.jspa?messageID=639571

>Simply put, MSCS is a bit "itchy" when it comes to timeouts.

>If there is a SAN fabric event like a fail-over, we have to put the VM to sleep until we can handle the IO (remember, no caching in the vmkernel).

>This can be longer than the failover time, causing the standby node to try to take control of the cluster resources.

>When the originating cluster owner is rescheduled to run again, it has no concept of time passing nor that its resources are now owned by the other server, if successful in grabbing them.

>We then have a split brain cluster.

>Now this is an unforeseen side effect of virtualization, but MS puts similar restrictions in regular HW.

>Not 100% sure about boot from SAN, but you do need a separate HBA for OS and data/quorum.

>MS is going away from SCSI reset for control mechanics for this exact reason in Longhorn.

>"No longer uses SCSI Bus Resets which can be disruptive on a SAN"

What do you see in the /var/log/vmkwarning and /var/log/vmkernel log files?

Do you see a reset of the HBA?


dtstephens · ‎08-09-2007

Thanks again for posting that thread link. I always get the best info from the forum conversations.

As to log messages, this is the only entry from the warnings log:

Aug 8 09:27:50 esxccas04 vmkernel: 331:19:13:15.069 cpu3)WARNING: SCSI: 3141: Retry 0 on handle 3182552 still in progress after 2 seconds

And here are the vmkernel entries from the failover time:

Aug 8 09:27:48 esxccas04 vmkernel: 331:19:13:12.744 cpu0)SCSI: 2791: Reset requested on handle 3182552 (1 outstanding commands)
Aug 8 09:27:48 esxccas04 vmkernel: 331:19:13:12.744 cpu0)SCSI: 2983: Reset on handle 3182552 [0/0]
Aug 8 09:27:48 esxccas04 vmkernel: 331:19:13:12.744 cpu0)SCSI: 2612: handle 3182552/3182552: 1 outstanding commands
Aug 8 09:27:48 esxccas04 vmkernel: 331:19:13:12.744 cpu0)SCSI: 2612: handle 3182551/3182552: 0 outstanding commands
Aug 8 09:27:48 esxccas04 vmkernel: 331:19:13:12.745 cpu0)WARNING: SCSI: 5657: vmhba1:1:2:1 status = 0/2 0x0 0x0 0x0
Aug 8 09:27:50 esxccas04 vmkernel: 331:19:13:15.069 cpu3)WARNING: SCSI: 3141: Retry 0 on handle 3182552 still in progress after 2 seconds
Aug 8 09:28:38 esxccas04 vmkernel: 331:19:14:03.667 cpu2)SCSI: 2840: Completing reset on handle 3182552 (0 outstanding commands)
Aug 8 09:32:29 esxccas04 vmkernel: 331:19:17:53.903 cpu2)SCSI: 8917: vmhba1:1:6:smileyshocked: Retry (unit attn)
Aug 8 09:32:29 esxccas04 vmkernel: 331:19:17:53.907 cpu2)SCSI: 8917: vmhba1:0:8:smileyshocked: Retry (unit attn)
Aug 8 09:32:29 esxccas04 vmkernel: 331:19:17:53.909 cpu2)SCSI: 8917: vmhba1:4:9:smileyshocked: Retry (unit attn)
Aug 8 09:32:29 esxccas04 vmkernel: 331:19:17:53.911 cpu2)SCSI: 8917: vmhba1:0:10:0 Retry (unit attn)

(how'd those smileys get in there?)
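To see how long a reset like the one above stayed outstanding, the log can be scanned for matching "Reset requested" and "Completing reset" lines per handle. This is a rough sketch, not a VMware tool: the function name reset_durations and the regular expression are my own, the sample lines are copied from the excerpt above, and the year 2007 is assumed because syslog stamps carry none.

```python
import re
from datetime import datetime

# Three lines copied from the vmkernel excerpt in this thread.
SAMPLE = """\
Aug 8 09:27:48 esxccas04 vmkernel: 331:19:13:12.744 cpu0)SCSI: 2791: Reset requested on handle 3182552 (1 outstanding commands)
Aug 8 09:27:50 esxccas04 vmkernel: 331:19:13:15.069 cpu3)WARNING: SCSI: 3141: Retry 0 on handle 3182552 still in progress after 2 seconds
Aug 8 09:28:38 esxccas04 vmkernel: 331:19:14:03.667 cpu2)SCSI: 2840: Completing reset on handle 3182552 (0 outstanding commands)
"""

# Assumed line shape: syslog stamp, then one of the two reset messages.
LINE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}) .*?"
    r"(?P<event>Reset requested|Completing reset) on handle (?P<handle>\d+)"
)


def reset_durations(log_text, year=2007):
    """Map each handle to the seconds between reset request and completion."""
    started, durations = {}, {}
    for line in log_text.splitlines():
        m = LINE.match(line)
        if not m:
            continue
        # The year is an assumption; it only matters for parsing, not the difference.
        when = datetime.strptime(f"{year} {m['stamp']}", "%Y %b %d %H:%M:%S")
        handle = m["handle"]
        if m["event"] == "Reset requested":
            started.setdefault(handle, when)
        elif handle in started:
            durations[handle] = (when - started.pop(handle)).total_seconds()
    return durations


if __name__ == "__main__":
    print(reset_durations(SAMPLE))
```

On the sample lines this reports 50 seconds for handle 3182552 (09:27:48 to 09:28:38), a figure worth comparing against the 60-second guest disk timeout and whatever failure detection interval the cluster uses.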

admin · ‎08-09-2007

Take a look at this KB article:

http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&docType=kc&externalId=2021&slice...

It explains about Win2k3 SP1 issues with MSCS on ESX. It's time to patch your ESX server.

Chris

admin · ‎08-09-2007

The smilies came from the colon and the zero next to each other.

Chris
