This one is driving us nuts.
Platform is ESX v2.5.3 build 22981, Windows 2003 Enterprise SP1. HP ProLiant BL20p servers, HP SAN switches and EMC Clariion CX700 storage (latest Flare code).
We've got a 2-node MSCS cluster across physical blades. Everything worked fine when we were on physical servers, and the migration to virtual was (fairly) trouble-free. We moved one of the cluster nodes (call it Node 1) to a different physical server several months ago, and since then we have been having problems with the cluster losing its brain and failing the cluster resources from Node 1 to Node 2. It's always Node 1 that fails over. We usually see these errors in the system event log:
Event ID 9, The device \Device\Scsi\symmpi1 did not respond within the timeout period.
followed by:
Event 1118, Cluster service was terminated as requested by node 2.
My research on these errors, especially the event 9, didn't shed much light on the problem. We are using physical disks, by the way, for the cluster shared resources.
Any ideas from the community would be MUCH appreciated!
Thanks,
ds
Hi,
Have you altered the registry key in the guest OS?
Set the disk I/O timeout to 60 seconds or more by setting
HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\Disk\TimeOutValue
See http://www.vmware.com/pdf/vi3_vm_and_mscs.pdf for more details
Are you using the local disk of the ESX host for your Windows C: drive, or do you also have those VMDKs stored on the SAN storage?
Message was edited by:
Frank_D
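For anyone who wants to apply the TimeOutValue change above on both nodes the same way, here is a minimal sketch that generates a .reg file for it. The function name build_disk_timeout_reg is made up for illustration; the key path and the value of 60 come from the post above, and the sketch assumes the value is a REG_DWORD holding a number of seconds. Check the generated file before importing it into a guest.

```python
def build_disk_timeout_reg(timeout_seconds: int = 60) -> str:
    """Return .reg file text that sets the guest disk I/O timeout.

    Hypothetical helper for illustration; TimeOutValue is assumed to be
    a REG_DWORD expressed in seconds.
    """
    if not 0 < timeout_seconds <= 0xFFFFFFFF:
        raise ValueError("timeout must fit in a positive REG_DWORD")

    key = r"HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\Disk"
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{key}]",
        # .reg files write DWORD values as eight hexadecimal digits.
        f'"TimeOutValue"=dword:{timeout_seconds:08x}',
        "",
    ]
    return "\r\n".join(lines)


if __name__ == "__main__":
    print(build_disk_timeout_reg(60))
```

Saving the output as a .reg file and importing it on each node should give both guests the same setting; a reboot of the guest is the safe way to make sure the disk driver picks it up.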
Thanks for the reply.
Yes, I had forgotten to mention that we did the reg hack quite a while ago. We set the timeout value to 60 on both nodes. It hasn't seemed to make any difference.
Thanks for posting the link to that document. Although we're not up to ESX 3 just yet, there could be something applicable in that guide.
ds
As to your second question... ah, well, the C: drives are also on the SAN. And in reading the doc you referenced, it seems like this is not supported. I can't seem to find the previous version of this guide, but I'm fairly certain I've seen it (for ESX v2.n.n), and I don't recall this being a restriction in that version of ESX. Think we've found the smoking gun?
As far as I know, VMware introduced this requirement when ESX 3 was released. I haven't seen this restriction in the 2.x documentation.
Bjørn Anders Jørgensen from VMware explains why the C: drive has to be placed on the local drive of the ESX host in this thread:
http://www.vmware.com/community/thread.jspa?messageID=639571
>Simply put MSCS is a bit "itchy" when it comes to timeouts.
>If there is a SAN fabric event like a fail-over,
>we have to put the VM to sleep until we can handle the IO.
>(remember, no caching in the vmkernel)
>This can be longer than the failover time, causing the standby node to try to
>take control of the cluster resources.
>When the originating cluster owner is rescheduled to run again,
>it has no concept of time passing nor that its resources are now owned
>by the other server, if successful in grabbing them.
>We then have a split brain cluster.
>Now this is an unforeseen side effect of virtualization,
>but MS puts similar restrictions in regular HW.
>Not 100% sure about boot from SAN,
>but you do need separate HBA for OS and data/quorum.
>MS is going away from SCSI reset for control mechanics for this exact
>reason in Longhorn.
>"No longer uses SCSI Bus Resets which can be disruptive on a SAN"
What do you see in the /var/log/vmkwarning and /var/log/vmkernel log files?
Do you see a reset of the HBA?
Thanks again for posting that thread link. I always get the best info from the forum conversations.
As to log messages, this is the only entry from the vmkwarning log:
Aug 8 09:27:50 esxccas04 vmkernel: 331:19:13:15.069 cpu3)WARNING: SCSI: 3141: Retry 0 on handle 3182552 still in progress after 2 seconds
And here are the vmkernel entries from the failover time:
Aug 8 09:27:48 esxccas04 vmkernel: 331:19:13:12.744 cpu0)SCSI: 2791: Reset requested on handle 3182552 (1 outstanding commands)
Aug 8 09:27:48 esxccas04 vmkernel: 331:19:13:12.744 cpu0)SCSI: 2983: Reset on handle 3182552 [0/0]
Aug 8 09:27:48 esxccas04 vmkernel: 331:19:13:12.744 cpu0)SCSI: 2612: handle 3182552/3182552: 1 outstanding commands
Aug 8 09:27:48 esxccas04 vmkernel: 331:19:13:12.744 cpu0)SCSI: 2612: handle 3182551/3182552: 0 outstanding commands
Aug 8 09:27:48 esxccas04 vmkernel: 331:19:13:12.745 cpu0)WARNING: SCSI: 5657: vmhba1:1:2:1 status = 0/2 0x0 0x0 0x0
Aug 8 09:27:50 esxccas04 vmkernel: 331:19:13:15.069 cpu3)WARNING: SCSI: 3141: Retry 0 on handle 3182552 still in progress after 2 seconds
Aug 8 09:28:38 esxccas04 vmkernel: 331:19:14:03.667 cpu2)SCSI: 2840: Completing reset on handle 3182552 (0 outstanding commands)
Aug 8 09:32:29 esxccas04 vmkernel: 331:19:17:53.903 cpu2)SCSI: 8917: vmhba1:1:6:smileyshocked: Retry (unit attn)
Aug 8 09:32:29 esxccas04 vmkernel: 331:19:17:53.907 cpu2)SCSI: 8917: vmhba1:0:8:smileyshocked: Retry (unit attn)
Aug 8 09:32:29 esxccas04 vmkernel: 331:19:17:53.909 cpu2)SCSI: 8917: vmhba1:4:9:smileyshocked: Retry (unit attn)
Aug 8 09:32:29 esxccas04 vmkernel: 331:19:17:53.911 cpu2)SCSI: 8917: vmhba1:0:10:0 Retry (unit attn)
(how'd those smileys get in there?)
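One thing worth pulling out of the vmkernel entries above is how long the reset on handle 3182552 stayed outstanding. Here is a rough sketch that works it out from the "Reset requested" and "Completing reset" lines. It assumes the second timestamp on each line is vmkernel uptime in days:hours:minutes:seconds form, which matches the wall-clock times next to it; the function names are made up for illustration, and the two sample lines are copied from the post with the wrapped halves rejoined.

```python
import re
from typing import Iterable, Optional

# Assumed layout: "<days>:<hours>:<minutes>:<seconds.millis> cpuN)"
STAMP = re.compile(r"(\d+):(\d+):(\d+):(\d+\.\d+) cpu\d+\)")


def uptime_seconds(line: str) -> float:
    """Convert the vmkernel uptime stamp on a log line to seconds."""
    match = STAMP.search(line)
    if match is None:
        raise ValueError(f"no uptime stamp found in: {line!r}")
    days, hours, minutes = (int(match.group(i)) for i in (1, 2, 3))
    seconds = float(match.group(4))
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def reset_duration(lines: Iterable[str], handle: str) -> Optional[float]:
    """Seconds between 'Reset requested' and 'Completing reset' for a handle."""
    start = end = None
    for line in lines:
        if f"Reset requested on handle {handle}" in line:
            start = uptime_seconds(line)
        elif f"Completing reset on handle {handle}" in line:
            end = uptime_seconds(line)
    if start is None or end is None:
        return None  # one of the two events is missing from the input
    return end - start


LOG = [
    "Aug 8 09:27:48 esxccas04 vmkernel: 331:19:13:12.744 cpu0)SCSI: 2791: "
    "Reset requested on handle 3182552 (1 outstanding commands)",
    "Aug 8 09:28:38 esxccas04 vmkernel: 331:19:14:03.667 cpu2)SCSI: 2840: "
    "Completing reset on handle 3182552 (0 outstanding commands)",
]

if __name__ == "__main__":
    print(f"{reset_duration(LOG, '3182552'):.3f} seconds")
```

For the two lines above this comes out at about 50.9 seconds. That is close to the 60-second TimeOutValue set in the guests, and it looks like the kind of stall described in the quoted VMware explanation, although the log alone does not prove that is what triggered the failover.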
Take a look at this KB article:
It explains about Win2k3 SP1 issues with MSCS on ESX. It's time to patch your ESX server.
Chris
The smilies came from the colon and the zero next to each other.
Chris