VMware Cloud Community
LanBeheer
Contributor

How is an Oracle RAC on VMware setup supposed to deal with storage failure?

Hi all,

I am currently testing Oracle RAC running on a VMware vSphere 6 platform. Our main reasons for using Oracle RAC are high availability and redundancy. We want the Oracle database to stay up and running with zero interruption even if any single piece of hardware fails or is accidentally misconfigured.

Our setup is similar to that described in VMware's own whitepaper, in that we use VMDK disks with the write-sharing parameter enabled in order to attach VMDK disks to two RAC nodes at the same time. The setup in more detail:

We have two geographically separated sites. Let's call them SiteA and SiteB. Each site contains a vSphere 6 cluster and an EMC VNX storage unit. Let's call them ClusterA and StorageA, and ClusterB and StorageB. There is a fast WAN connection between the two sites, making it possible to have inter-site storage traffic. StorageA presents a LUN called DatastoreA to both ClusterA and ClusterB. Likewise, StorageB presents a LUN called DatastoreB to both ClusterA and ClusterB. These LUNs are formatted within VMware as VMFS-5 datastores.

We have two virtual machines, let's call them RAC1 and RAC2, both running Windows Server 2012 R2. RAC1 is running on ClusterA, and RAC2 is running on ClusterB. The storage for the OS and application is presented by each VM's local storage unit.

Now we create the storage for the Oracle ASM disk groups. In the VMware settings of RAC1 we create two VMDK virtual disks of the same size, the first one on DatastoreA and the second on DatastoreB, making sure to enable multi-writer mode. Then in the VMware settings of RAC2, we connect the exact same disks, again enabling multi-writer mode. We repeat this entire process for all the other disks needed by Oracle RAC.

Now, our DBA can install Oracle RAC and Oracle ASM disk groups on the two servers. He creates disk groups which each contain two failure groups. One failure group contains all the disks on DatastoreA, while the other failure group contains all the disks on DatastoreB. Having finished the Oracle ASM and Oracle RAC configuration, he installs a database.
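To make the intended redundancy explicit, here is a minimal Python sketch of the layout described above. It only models the ASM side, assuming two-way mirroring across the two failure groups (the post implies this but does not state it), and it says nothing about how ESXi reacts to a lost datastore. The disk and failure group names (DATA_A1, FG_A, and so on) are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class AsmDisk:
    name: str            # hypothetical ASM disk name
    failure_group: str   # hypothetical failure group name
    datastore: str       # VMFS datastore holding the backing VMDK


def build_disk_group(pair_count: int) -> List[AsmDisk]:
    """Create pair_count mirrored pairs, one disk on each datastore."""
    disks = []
    for i in range(1, pair_count + 1):
        disks.append(AsmDisk(f"DATA_A{i}", "FG_A", "DatastoreA"))
        disks.append(AsmDisk(f"DATA_B{i}", "FG_B", "DatastoreB"))
    return disks


def surviving_failure_groups(disks: List[AsmDisk], lost_datastore: str) -> Set[str]:
    """Failure groups that have no disk on the lost datastore."""
    all_groups = {d.failure_group for d in disks}
    hit_groups = {d.failure_group for d in disks if d.datastore == lost_datastore}
    return all_groups - hit_groups


def mirror_copy_survives(disks: List[AsmDisk], lost_datastore: str) -> bool:
    """True if at least one complete failure group is untouched by the loss."""
    return len(surviving_failure_groups(disks, lost_datastore)) >= 1
```

With this layout, losing either datastore still leaves one complete failure group, which is why we expected the database to stay online.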

While testing this setup for resiliency against hardware failures, we wanted to know what happens in the event of a total loss of a single storage unit. To this end, we accessed the management console of StorageA and unpresented DatastoreA from both clusters, meaning both clusters suddenly lost connectivity with DatastoreA, creating a PDL (permanent device loss) situation.

What happens next is that both RAC1 and RAC2 completely freeze, and VMware generates a dialog box for each VM, like this:

[Image: pastedImage_0.png, the dialog box VMware shows for each VM]

After reading up on this, it seems to me that it is standard ESXi behaviour to freeze a virtual machine as soon as it tries to write to a VMDK file that is no longer available. Because both RAC1 and RAC2 are connected and try to write to a VMDK file that is no longer there, both VMs are automatically suspended by ESXi.

This is, of course, exactly what we DO NOT want to happen in an HA solution like Oracle RAC. The way it is now, a single storage failure or even a WAN failure would result in total loss of the database instance, even though one copy of all the database storage is still online. What am I missing here? What is the correct way to configure Oracle RAC on the VMware platform?

I would greatly appreciate any insights.

2 Replies
DanTMan63
Contributor

We had a similar problem, only we never got to the DB install part. ASM would not stay online, and the voting disks were throwing I/O errors in the alert log citing CSSNM00059 write failed. Basically, we don't have control over our VMware deployment; another group does. They set these servers up the same way as our other servers that use RDMs. With VMware 6 the Multi-Writer flag was added to the GUI, and our VMware group enabled it. The problem was that they used the RDM configuration for the SCSI Bus Sharing and set it to Physical. I finally found KB 1034165, which deals with my issue exactly.

About halfway down, the KB states that the SCSI Bus Sharing for the shared disks should not be changed from "None". We changed the setting to None and ASM is stable once again. Hope this helps. The KB article has all kinds of stuff not covered in the best practices for Oracle ASM on VMware. (BTW, we are running RHEL 6 for this instance.)
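In case it is useful, here is a rough Python sketch of how one might scan a VM's .vmx text for the combination described above: a multi-writer disk on a controller whose bus sharing is not "none". The key names (scsiN.sharedBus and scsiN:M.sharing) are written from memory of typical .vmx files and KB 1034165, so treat them as assumptions and check them against your own VMs. The sketch also assumes that a controller with no sharedBus entry defaults to "none".

```python
import re

# Assumed .vmx key shapes; verify against your own files before relying on them.
CONTROLLER_BUS_RE = re.compile(r'^(scsi\d+)\.sharedBus\s*=\s*"([^"]*)"', re.I)
DISK_SHARING_RE = re.compile(r'^(scsi\d+):\d+\.sharing\s*=\s*"([^"]*)"', re.I)


def find_bus_sharing_conflicts(vmx_text):
    """Return the SCSI controllers that carry at least one multi-writer
    disk while their bus sharing is set to something other than none."""
    bus_mode = {}
    multi_writer_controllers = set()
    for raw_line in vmx_text.splitlines():
        line = raw_line.strip()
        match = CONTROLLER_BUS_RE.match(line)
        if match:
            bus_mode[match.group(1).lower()] = match.group(2).lower()
            continue
        match = DISK_SHARING_RE.match(line)
        if match and match.group(2).lower() == "multi-writer":
            multi_writer_controllers.add(match.group(1).lower())
    # Assumption: a missing sharedBus entry means bus sharing is "none".
    return sorted(
        controller
        for controller in multi_writer_controllers
        if bus_mode.get(controller, "none") != "none"
    )
```

Calling find_bus_sharing_conflicts() on the contents of a .vmx file returns a list of controller names such as scsi1 that would need their bus sharing set back to None. An empty list means no conflict was found under the assumptions above.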

bguez
Contributor

Hello,

We have the same problem.

Have you found a solution?

Thanks,

b.guez
