VMware Cloud Community
chuck_whealton
Contributor

Windows 2003 Server and Oracle 10g RAC causing random crashes

Everybody:

I've searched the forums and found no solutions to this one (just another similar question with no resolution) and I found nothing of any use on Oracle's Metalink.

I'm running ESX 3.0 on a couple of BL20p G3 blades. The storage is on an EVA5000 disk array (3.025 firmware), using two separate fabrics. The shared storage was created on a vDisk and zeroed, then connected to each VM with the proper support for shared storage (i.e., virtual bus, separate controller, etc.). I have two Windows 2003 Server VMs, an internal VM network for the Oracle RAC interconnect with private IP addressing, as well as a standard public network. I'm on Oracle 10g (10.2.0.3.0).

This problem ONLY happens with my RAC nodes so I'm 99% sure it's cluster related as I have no problems with standalone Oracle databases on VMs.

My W2K3 virtual machines crash at random. I can't find anything useful in the VMware logs (vmkernel, hostd.log, etc.). However, if I look at the vmware.log for one of the nodes, the crashes start off with a "vcpu-0| CPU reset| soft". That's it, and that's not enough. The Oracle alert logs have nothing useful either, beyond the usual message from the remaining node that it has lost connection with a member.
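One way to get a little more out of vmware.log is to pull every reset line together with the few lines logged just before it. What follows is a minimal Python sketch, not a VMware tool: the function name, the sample lines, and the assumption that the log is plain text with one entry per line are all illustrative. Only the "vcpu-0| CPU reset| soft" fragment comes from the post above.

```python
import re
from collections import deque
from typing import Iterable, List, Tuple

# Matches the fragment quoted in the post: "vcpu-0| CPU reset| soft".
# The rest of the vmware.log line layout is an assumption.
RESET_RE = re.compile(r"vcpu-(\d+)\|\s*CPU reset\|\s*(\w+)")


def find_cpu_resets(
    lines: Iterable[str], context: int = 3
) -> List[Tuple[int, str, List[str]]]:
    """Return (line_number, reset_kind, preceding_lines) for each CPU reset."""
    history = deque(maxlen=context)  # keeps only the last `context` lines
    hits = []
    for number, line in enumerate(lines, start=1):
        match = RESET_RE.search(line)
        if match:
            hits.append((number, match.group(2), list(history)))
        history.append(line.rstrip("\n"))
    return hits


# Hypothetical sample lines; a real run would iterate over open("vmware.log").
sample = [
    "Feb 12 11:15:58: vmx| hypothetical earlier line",
    "Feb 12 11:16:01: vcpu-0| CPU reset| soft",
]

for number, kind, before in find_cpu_resets(sample):
    print(f"line {number}: {kind} reset, preceded by {before}")
```

Comparing the preceding lines across several crashes may show whether the resets share a common lead-up or really do come out of nowhere.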

I've had similar problems with Oracle 9i RAC on virtual machines using ESX 2.5.x, though they never resulted in node restarts, just lost communications with instances going down. Once I put that configuration on physical hardware - I had no problems whatsoever. I can only assume that once I put this 10g configuration on physical hardware, these particular problems will vanish also (I hope! :-).

Does ANYBODY have ANY ideas on what could cause (probable) Oracle clusterware problems on ESX Server? This isn't production, but it bugs the living heck out of me. It can't be pure coincidence that I've had problems on both 9i and 10g RAC on both ESX 2.5.x and 3.0.

Thanks...

Charles R. Whealton

Charles Whealton @ pleasedontspam.com

-- The Windows event log error looks like this --

Event Type: Error
Event Source: System Error
Event Category: (102)
Event ID: 1003
Date: 2/12/2007
Time: 11:16:26 AM
User: N/A
Computer: SYSTEM NAME OMITTED BY POSTER
Description:
Error code 0000ffff, parameter1 00000000, parameter2 00000000, parameter3 00000000, parameter4 00000000.

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.

Data:
0000: 53 79 73 74 65 6d 20 45   System E
0008: 72 72 6f 72 20 20 45 72   rror  Er
0010: 72 6f 72 20 63 6f 64 65   ror code
0018: 20 30 30 30 30 66 66 66    0000fff
0020: 66 20 20 50 61 72 61 6d   f  Param
0028: 65 74 65 72 73 20 30 30   eters 00
0030: 30 30 30 30 30 30 2c 20   000000, 
0038: 30 30 30 30 30 30 30 30   00000000
0040: 2c 20 30 30 30 30 30 30   , 000000
0048: 30 30 2c 20 30 30 30 30   00, 0000
0050: 30 30 30 30               0000
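For anyone wondering whether the Data section hides extra detail: it does not. A small Python sketch (the function name is made up, and it assumes at most eight hex bytes per row, as in the dump above) joins the bytes back into text and shows that they only repeat the Description line.

```python
import re

# The Data section of the event, copied from the post.
DUMP = """
0000: 53 79 73 74 65 6d 20 45 System E
0008: 72 72 6f 72 20 20 45 72 rror Er
0010: 72 6f 72 20 63 6f 64 65 ror code
0018: 20 30 30 30 30 66 66 66 0000fff
0020: 66 20 20 50 61 72 61 6d f Param
0028: 65 74 65 72 73 20 30 30 eters 00
0030: 30 30 30 30 30 30 2c 20 000000,
0038: 30 30 30 30 30 30 30 30 00000000
0040: 2c 20 30 30 30 30 30 30 , 000000
0048: 30 30 2c 20 30 30 30 30 00, 0000
0050: 30 30 30 30 0000
"""

HEX_BYTE = re.compile(r"[0-9a-fA-F]{2}")


def decode_event_data(dump: str, bytes_per_row: int = 8) -> str:
    """Join the hex bytes of an 'offset: b0 b1 ...' dump into ASCII text."""
    data = bytearray()
    for line in dump.strip().splitlines():
        _, _, rest = line.partition(":")  # drop the offset column
        taken = 0
        for token in rest.split():
            # Stop at the row width, or at the first token that is not a
            # two-digit hex byte (that is where the ASCII column begins).
            if taken == bytes_per_row or not HEX_BYTE.fullmatch(token):
                break
            data += bytes.fromhex(token)
            taken += 1
    return data.decode("ascii")


print(decode_event_data(DUMP))
# System Error  Error code 0000ffff  Parameters 00000000, 00000000, 00000000, 00000000
```

So the event record carries only error code 0000ffff with all four parameters zero, nothing more specific.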

13 Replies
ITThies
Hot Shot

Hey Chuck,

If your VMs are crashing regularly without any explanation, please check your RAM.

This is the most common reason for that.

Run a memory test tool on your server for several hours (VMware itself wants this done for about 72 hours).

If your server manufacturer has a testing tool, use that instead of third-party software.

----- Please feel free to give some points for a correct / helpful answer! Thank you!
chuck_whealton
Contributor

Thanks very much for this information, but there's something important to keep in mind.

This ONLY happens with my RAC systems. It >NEVER< happens with non-RAC systems. And it has happened with different RAC versions and different ESX Server versions, but only when using Oracle RAC with Clusterware.

Charles R. Whealton

Charles Whealton @ pleasedontspam.com

ITThies
Hot Shot

We are using RAC 10g in two test environments on ESX 2.5.0 and ESX 3.0.1 and have no problems like that. That's why I would check the memory first.

Btw, Oracle itself recommends installing the RAC server on physical hardware and not in a VM.

chuck_whealton
Contributor

Hmmm... Now that's interesting.

Are you using Windows VMs for your RAC?

Could you give me some more specifics? Fibre versus SCSI storage, etc.? Here's one thing I will do for the time being. I'll vMotion them over to the other node in our farm and see if they continue doing this.

They crash even when basically NOTHING is being done with them.

Thanks!

Charles R. Whealton

Charles Whealton @ pleasedontspam.com

jamome
Enthusiast

ITThies,

I don't know if that's entirely true. Here is a page describing how to set up a RAC using VMware. (Note: the instructions are for VMware Server, but IMHO it's still a good reference for Windows + RAC + VMware.)

http://www.oracle-base.com/articles/10g/OracleDB10gR2RACInstallationOnWindows2003UsingVMware.php

Chuck_Whealton,

I think you should test your RAM first (possibly for 72 hours, as recommended by VMware). There are conditions where RAM might be a little flaky but not detectably so until you implement something like a RAC.

chuck_whealton
Contributor

I want to thank you both for this information. I have vMotioned the two RAC systems over to another blade and we'll see how they work over there.

I also saw one or two items I missed in the Oracle RAC on VMware documentation and I've made those changes.

In the meantime, I'll see if the SmartStart has anything for memory exercising on it.

Thanks very much and I'll post my findings, though that may be a number of days.

Charles R. Whealton

Charles Whealton @ pleasedontspam.com

chuck_whealton
Contributor

Guys, just an FYI that after moving my two Oracle 10g RAC virtual machines to another ESX server AND making a couple parameter changes that I had apparently missed, so far, I have no random crashes.

I finally dug up a SmartStart CD-ROM and I'm now performing memory tests on the ESX server in question (a BL20p G3). I have it set to run for two days. If it succeeds, I'll move both RAC virtual machines back and if I experience no more problems, I'll attribute it to the parameters I missed. If they fail, I'll run the rest of the diagnostics and take it from there.

I'll let you know. Thanks again for your help...

Charles R. Whealton

Charles Whealton @ pleasedontspam.com

mcclngn
Contributor

I have similar problems. We have ESX 3 on Linux servers, and I am setting up an Oracle 10g RAC on VM. /var/log/messages will show reservation conflicts for the shared disks I've set up for RAC and CRS, and for ASM. When the reservation conflict hits the disk being used by the CRS, the node is reset.
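To see whether the conflicts line up with the node resets, it can help to count them per hour. The following is only a rough Python sketch: the function name and the sample lines are hypothetical, and it assumes traditional syslog lines that begin with a "Mon DD HH:MM:SS" timestamp and contain the words "reservation conflict" somewhere in the message.

```python
from collections import Counter
from typing import Iterable


def conflicts_per_hour(lines: Iterable[str]) -> Counter:
    """Count lines mentioning a reservation conflict, grouped by hour."""
    counts = Counter()
    for line in lines:
        if "reservation conflict" in line.lower():
            # Assumed syslog prefix "Mon DD HH:MM:SS": the first nine
            # characters (e.g. "Feb 12 11") identify the day and hour.
            counts[line[:9]] += 1
    return counts


# Hypothetical sample; a real run would read /var/log/messages instead.
sample = [
    "Feb 12 11:16:20 node1 kernel: hypothetical text: reservation conflict",
    "Feb 12 11:16:24 node1 kernel: hypothetical text: reservation conflict",
    "Feb 12 12:01:02 node1 kernel: unrelated message",
]

for hour, count in sorted(conflicts_per_hour(sample).items()):
    print(hour, count)
```

Hours with a burst of conflicts can then be compared against the times the CRS node was reset.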

How did you build/configure your shared disks (the ones that work)? What parameters had you missed?

Thanks,

Don

chuck_whealton
Contributor

Hi...

In fact, after I posted my last message and made the mistake of opening my mouth to say I'd had no more crashes, I had one RAC node crash on the other ESX server.

My 2-day memory diagnostics test on the first ESX finished with no errors.

So it's not bad memory and it's not the couple of items I uncovered in the Oracle document that I had missed.

I'm out of ideas. This only happens to me with Oracle RAC on VMware ESX Server.

Charles R. Whealton

Charles Whealton @ pleasedontspam.com

mcclngn
Contributor

Thanks for the reply.

Are you using OCFS or OCFS2 for your RAC? The reason I ask is that I had an OCFS RAC cluster running on VMware. My current project is an OCFS2 RAC cluster. It's the one that is crashing.

Thanks, Don

utkinpol
Contributor

Hi,

I am trying to build a lab setup as described in

http://www.oracle-base.com/articles/10g/OracleDB10gR2RACInstallationOnWindows2003UsingVMware.php

using RHEL4 as the host OS and the free VMware Server. As far as I can see, the nodes crash at about the moment when both Oracle nodes are almost done starting. Once the 2003 server went into a blue screen with an IRQ-related complaint, but most often it is a VMware error message, or just an immediate termination of the VM.

If anyone knows how to fix this problem, it would help a lot.

eotrmv
Contributor

Did you ever get this problem resolved?

I'm having the exact same issue, but running it on two HP DL380s with an MSA500.

I'm guessing it's neither VMware nor a memory hardware issue. There's nothing on the Oracle forums, but I'm leaning toward a memory leak in RAC.

Thanks.

fbaviere
Contributor

Did you finally solve your problem?

I've got the same trouble.

CRS is installed, but I can't install the DB or configure the disk.

VMware Server 1.0.3 is installed on W2K3 SP1 US and I am using W2K3 R2 French as the guest.

Things I missed in the documentation at first but corrected afterward:

/etc/hosts: localhost.localdomain

node-priv without extension

Distributed Transaction Service not activated

Things I have seen somewhere on the net but cannot do:

Write cache on the disk in Windows: it is greyed out.

scsi1:1.deviceType = "plainDisk" ... at startup VMware Server changes it to scsi1:1.deviceType = "Disk"

Of course, everything else seems fine (cluvfy, for example).
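To confirm exactly which settings VMware Server rewrites at startup, one option is to save a copy of the .vmx file before powering on and compare it with the file afterward. This is a minimal Python sketch under a few assumptions: the .vmx file is treated as plain `key = "value"` lines, the function names are made up for illustration, and the `scsi1:1.present` line in the sample is added only to show that unchanged keys are left out of the result.

```python
from typing import Dict, Optional, Tuple


def parse_vmx(text: str) -> Dict[str, str]:
    """Parse 'key = "value"' lines into a dict, skipping blanks and comments."""
    settings: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip().strip('"')
    return settings


def changed_settings(
    before: Dict[str, str], after: Dict[str, str]
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return {key: (old, new)} for every key whose value differs."""
    keys = sorted(set(before) | set(after))
    return {
        key: (before.get(key), after.get(key))
        for key in keys
        if before.get(key) != after.get(key)
    }


# Sample contents; only the deviceType values come from the post above.
before_text = 'scsi1:1.present = "TRUE"\nscsi1:1.deviceType = "plainDisk"\n'
after_text = 'scsi1:1.present = "TRUE"\nscsi1:1.deviceType = "Disk"\n'

diff = changed_settings(parse_vmx(before_text), parse_vmx(after_text))
for key, (old, new) in diff.items():
    print(f"{key}: {old!r} -> {new!r}")
# scsi1:1.deviceType: 'plainDisk' -> 'Disk'
```

With real files, the two texts would come from reading the saved copy and the live .vmx file.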

I am out of ideas for the moment, but I guess I will try switching to Linux as guest and/or host.

Please post any help
