Everybody:
I've searched the forums and found no solutions to this one (just another similar question with no resolution) and I found nothing of any use on Oracle's Metalink.
I'm running ESX 3.0 on a couple of BL20p G3 blades. The storage is on an EVA5000 disk array (3.025 firmware), using two separate fabrics. The shared storage was created on a vDisk and zeroed, connected to each VM with the proper support for shared storage (i.e. - virtual bus, separate controller, etc.). I have two Windows 2003 Server VMs, an internal VM network for the Oracle RAC interconnect with private IP addressing, as well as a standard public network. I'm on Oracle 10g (10.2.0.3.0).
This problem ONLY happens with my RAC nodes so I'm 99% sure it's cluster related as I have no problems with standalone Oracle databases on VMs.
My W2K3 virtual machines crash at random. I can't find anything useful in the VMware logs (vmkernel, hostd.log, etc.). However, if I look at the vmware.log for one of the nodes, the crashes start off with a "vcpu-0| CPU reset| soft". That's it, and that's not enough to go on. The Oracle alert logs have nothing of any use beyond the usual message from the remaining node indicating that it has lost connection with a member.
I've had similar problems with Oracle 9i RAC on virtual machines using ESX 2.5.x, though they never resulted in node restarts, just lost communications with instances going down. Once I put that configuration on physical hardware - I had no problems whatsoever. I can only assume that once I put this 10g configuration on physical hardware, these particular problems will vanish also (I hope! :-).
Does ANYBODY have ANY ideas on what could cause (probable) Oracle clusterware problems on ESX Server? This isn't production, but it bugs the living heck out of me. It can't be pure coincidence that I've had problems on both 9i and 10g RAC on both ESX 2.5.x and 3.0.
Thanks...
Charles R. Whealton
Charles Whealton @ pleasedontspam.com
--The Windows eventlog error looks like this --
Event Type: Error
Event Source: System Error
Event Category: (102)
Event ID: 1003
Date: 2/12/2007
Time: 11:16:26 AM
User: N/A
Computer: SYSTEM NAME OMITTED BY POSTER
Description:
Error code 0000ffff, parameter1 00000000, parameter2 00000000, parameter3 00000000, parameter4 00000000.
For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
Data:
0000: 53 79 73 74 65 6d 20 45 System E
0008: 72 72 6f 72 20 20 45 72 rror Er
0010: 72 6f 72 20 63 6f 64 65 ror code
0018: 20 30 30 30 30 66 66 66 0000fff
0020: 66 20 20 50 61 72 61 6d f Param
0028: 65 74 65 72 73 20 30 30 eters 00
0030: 30 30 30 30 30 30 2c 20 000000,
0038: 30 30 30 30 30 30 30 30 00000000
0040: 2c 20 30 30 30 30 30 30 , 000000
0048: 30 30 2c 20 30 30 30 30 00, 0000
0050: 30 30 30 30 0000
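For anyone reading the Data section of the event above: the hex bytes appear to be nothing more than the ASCII text of the description. Here is a minimal Python sketch that decodes them. The function name `decode_event_data` and the assumption that each row carries at most eight byte values after the offset are mine, not anything taken from Windows tooling.

```python
import re

# Hex dump copied from the event above: offset, up to eight bytes, ASCII column.
EVENT_DATA = """\
0000: 53 79 73 74 65 6d 20 45 System E
0008: 72 72 6f 72 20 20 45 72 rror Er
0010: 72 6f 72 20 63 6f 64 65 ror code
0018: 20 30 30 30 30 66 66 66 0000fff
0020: 66 20 20 50 61 72 61 6d f Param
0028: 65 74 65 72 73 20 30 30 eters 00
0030: 30 30 30 30 30 30 2c 20 000000,
0038: 30 30 30 30 30 30 30 30 00000000
0040: 2c 20 30 30 30 30 30 30 , 000000
0048: 30 30 2c 20 30 30 30 30 00, 0000
0050: 30 30 30 30 0000
"""

# A single byte written as exactly two hex digits.
BYTE = re.compile(r"[0-9a-fA-F]{2}")


def decode_event_data(dump, bytes_per_row=8):
    """Turn the hex column of an event-log style dump back into text.

    Assumption: every row is 'offset: b0 b1 ... ascii', with at most
    bytes_per_row byte values before the ASCII column starts.
    """
    raw = bytearray()
    for line in dump.splitlines():
        if ":" not in line:
            continue  # not a dump row
        _, _, rest = line.partition(":")
        count = 0
        for token in rest.split():
            # Stop at the row limit or at the first token that is not a byte,
            # so the trailing ASCII column is never mistaken for data.
            if count == bytes_per_row or not BYTE.fullmatch(token):
                break
            raw.append(int(token, 16))
            count += 1
    return raw.decode("ascii")


if __name__ == "__main__":
    print(decode_event_data(EVENT_DATA))
```

The decoded string repeats the description (error code 0000ffff with four zero parameters), so the Data section adds no extra detail about the crash.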
Hey Chuck,
If your VMs are crashing regularly without any explanation, please check your RAM.
This is the most common cause.
Run a memory test tool on your server for several hours (VMware itself recommends about 72 hours).
If your server manufacturer has a testing tool, use that instead of third-party software.
Thanks very much for this information, but there's something important to keep in mind.
This ONLY happens with my RAC systems. It >NEVER< happens with non-RAC systems. And it's happened with different RAC versions and different ESX Server versions - only using Oracle RAC with Clusterware.
Charles R. Whealton
Charles Whealton @ pleasedontspam.com
We are using RAC 10g in two test environments, on ESX 2.5.0 and ESX 3.0.1, and have no problems like that. That's why I would check the memory first.
By the way, Oracle itself recommends installing RAC on physical hardware and not in a VM.
Hmmm... Now that's interesting.
Are you using Windows VMs for your RAC?
Could you give me some more specifics? Fibre versus SCSI storage, etc.? Here's one thing I will do for the time being. I'll vMotion them over to the other node in our farm and see if they continue doing this.
They crash even when basically NOTHING is being done with them.
Thanks!
Charles R. Whealton
Charles Whealton @ pleasedontspam.com
ITTheis,
I don't know if that's entirely true. Here is a page describing how to set up a RAC using VMware. (Note: the instructions are for VMware Server, but IMHO it's still a good reference for Windows + RAC + VMware.)
http://www.oracle-base.com/articles/10g/OracleDB10gR2RACInstallationOnWindows2003UsingVMware.php
Chuck_Whealton,
I think you should test your RAM first (possibly for 72 hours as recommended by VMware). There are conditions where RAM might be a little flaky but the fault is not detectable until you implement something like a RAC.
I want to thank you both for this information. I have vMotioned the two RAC systems over to another blade and we'll see how they work over there.
I also saw one or two items I missed in the Oracle RAC on VMware documentation and I've made those changes.
In the meantime, I'll see if the SmartStart has anything for memory exercising on it.
Thanks very much and I'll post my findings, though that may be a number of days.
Charles R. Whealton
Charles Whealton @ pleasedontspam.com
Guys, just an FYI that after moving my two Oracle 10g RAC virtual machines to another ESX server AND making a couple parameter changes that I had apparently missed, so far, I have no random crashes.
I finally dug up a SmartStart CD-ROM and I'm now performing memory tests on the ESX server in question (a BL20p G3). I have it set to run for two days. If it succeeds, I'll move both RAC virtual machines back and if I experience no more problems, I'll attribute it to the parameters I missed. If they fail, I'll run the rest of the diagnostics and take it from there.
I'll let you know. Thanks again for your help...
Charles R. Whealton
Charles Whealton @ pleasedontspam.com
I have similar problems. We have ESX 3 on Linux servers, and I am setting up an Oracle 10g RAC on VM. /var/log/messages will show reservation conflicts for the shared disks I've set up for RAC and CRS, and for ASM. When the reservation conflict hits the disk being used by the CRS, the node is reset.
How did you build/configure your shared disks (the ones that work)? What parameters had you missed?
Thanks,
Don
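To line up the reservation conflicts described above with the node resets, it may help to pull the matching lines out of /var/log/messages and count them per hour. Below is a minimal Python sketch. The helper names `find_conflicts` and `conflicts_per_hour` are hypothetical, the sample lines are made up for illustration (they are not real kernel output), and the hourly grouping assumes the classic syslog timestamp prefix such as "Feb 12 11:16:26".

```python
import re
from collections import Counter

# Case-insensitive match on the phrase quoted in the post above.
PATTERN = re.compile(r"reservation conflict", re.IGNORECASE)


def find_conflicts(lines):
    """Return (line_number, text) for every line mentioning a reservation conflict."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if PATTERN.search(line)
    ]


def conflicts_per_hour(hits):
    """Count hits per hour, keyed by the 'Mon DD HH' prefix of each line.

    Assumption: lines start with a classic syslog timestamp, so the first
    nine characters identify the month, day and hour.
    """
    counts = Counter()
    for _, text in hits:
        counts[text[:9]] += 1
    return counts


# Made-up sample lines, only to show the shape of the result.
SAMPLE = [
    "Feb 12 11:16:20 node1 kernel: example line with no problem",
    "Feb 12 11:16:26 node1 kernel: example: reservation conflict",
    "Feb 12 11:48:02 node1 kernel: example: Reservation Conflict",
    "Feb 12 12:03:10 node1 kernel: example: reservation conflict",
]

if __name__ == "__main__":
    hits = find_conflicts(SAMPLE)
    for hour, count in sorted(conflicts_per_hour(hits).items()):
        print(hour, count)
```

Against a real log, an open file handle can be passed straight to `find_conflicts`, since it only iterates over lines. Comparing the hourly counts with the reset times would show whether the conflicts really precede each reset.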
Hi...
In fact, after I posted my last message and made the mistake of opening my mouth to say I'd had no more crashes, I had one RAC node crash on the other ESX server.
My 2-day memory diagnostics test on the first ESX finished with no errors.
So it's not bad memory and it's not the couple of items I uncovered in the Oracle document that I had missed.
I'm out of ideas. This only happens to me with Oracle RAC on VMware ESX Server.
Charles R. Whealton
Charles Whealton @ pleasedontspam.com
Thanks for the reply.
Are you using ocfs or ocfs2 for your RAC? The reason I ask is that I had an OCFS RAC cluster running on VMWare. My current project is an OCFS2 RAC cluster. It's the one that is crashing.
Thanks, Don
Hi,
I am trying to build a lab setup as described in
http://www.oracle-base.com/articles/10g/OracleDB10gR2RACInstallationOnWindows2003UsingVMware.php
using RHEL4 as the host OS and the free VMware Server. What I see is that the nodes crash right around the moment when both Oracle nodes are almost done starting. Once, the 2003 server went into a blue screen with an IRQ-related complaint, but most often it is a VMware error message, or just an immediate termination of the VM.
If anyone knows how to fix this problem, it would help a lot.
Have you ever gotten this problem resolved?
I'm having the exact same issue, but running it on two HP DL380s with an MSA500.
I'm guessing it's neither VMware nor a memory hardware issue. There's nothing on the Oracle forums, but I'm leaning toward a memory leak in RAC.
Thanks.
Did you finally solve your problem?
I've got the same trouble.
CRS is installed, but I can't install the DB or configure the disks.
VMware Server 1.0.3 is installed on W2K3 SP1 US, and I am using W2K3 R2 French as the guest.
Things I had missed in the documentation but corrected afterward:
/etc/hosts: localhost.localdomain
node-priv without extension
Distributed Transaction Service not activated
Something I have seen somewhere on the net but cannot do:
Write cache on the disk in Windows: it is greyed out.
scsi1:1.deviceType = "plainDisk" ... at startup VMware Server changes it to scsi1:1.deviceType = "Disk"
Of course, everything else seems fine... cluvfy, for example.
I am out of ideas for the moment, but I guess I will try switching to Linux as guest and/or host.
Please post any help
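One way to confirm which settings get rewritten at startup (such as the scsi1:1.deviceType change mentioned above) is to keep a copy of the .vmx file before powering on and compare it with the file afterward. Here is a minimal Python sketch, assuming the usual `key = "value"` line layout of a .vmx file. The helper names and the sample file name `example-shared.vmdk` are hypothetical.

```python
import re

# One setting per line: key = "value". Lines starting with '#' are skipped.
VMX_LINE = re.compile(r'^\s*([^#=\s][^=]*?)\s*=\s*"(.*)"\s*$')

# Keys such as scsi1:1.deviceType.
DEVICE_TYPE_KEY = re.compile(r"scsi\d+:\d+\.deviceType", re.IGNORECASE)


def parse_vmx(text):
    """Parse .vmx-style text into a dict of key -> value strings."""
    settings = {}
    for line in text.splitlines():
        match = VMX_LINE.match(line)
        if match:
            settings[match.group(1)] = match.group(2)
    return settings


def device_types(settings):
    """Return only the scsiX:Y.deviceType entries."""
    return {k: v for k, v in settings.items() if DEVICE_TYPE_KEY.fullmatch(k)}


def changed_settings(before, after):
    """Return key -> (old, new) for every key whose value differs or is missing."""
    keys = set(before) | set(after)
    return {
        k: (before.get(k), after.get(k))
        for k in keys
        if before.get(k) != after.get(k)
    }


# Illustrative content only; the file name is a placeholder.
BEFORE = '''\
scsi1.present = "TRUE"
scsi1:1.present = "TRUE"
scsi1:1.fileName = "example-shared.vmdk"
scsi1:1.deviceType = "plainDisk"
'''
AFTER = BEFORE.replace('"plainDisk"', '"Disk"')

if __name__ == "__main__":
    print(device_types(parse_vmx(BEFORE)))
    print(changed_settings(parse_vmx(BEFORE), parse_vmx(AFTER)))
```

With real files, the two strings would come from reading the saved copy and the live .vmx. An empty result from `changed_settings` would mean nothing was rewritten.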