VMware Cloud Community
KrishK
Contributor

Oracle RAC on vSphere 4.1 - Performance Problem

Dear Mates,

We are seeing performance issues on our Oracle RAC VMs. Oracle Support says that queries are waiting on I/O. The environment is described below.

2-node (VM) Oracle RAC setup.

1 VM on each ESXi server.

No other VMs share the physical servers as of now, but other DB VMs are planned.

Guest OS is RHEL 5.5.

CPU and memory utilization is below 25%; each VM has 16 GB RAM and 4 vCPUs.

Each VM has 10 raw LUNs, apart from the OS disk in a datastore.

ESXi - 4.1 Update 2

Storage - HP EVA

Physical Server - HP Blade BL460c G7

SAN connectivity - VC Flex, 4 x 8 Gb FC links with the speed set to 4 Gb to match the storage controller speed.

Attached are the performance reports from the vCenter server for the respective VMs.

Do you see any abnormalities in the values from the report? Any help on this case is highly appreciated.

18 Replies

KrishK
Contributor

Folks, I'm looking for some suggestions.

Gkeerthy
Expert

I am not able to see the performance report; it is not opening properly. Please attach screenshots or esxtop results.

- Can you confirm that the LUNs are spread across, and owned by, both controllers?

- Did you do the partition alignment? Whether you use VMFS or RDM, it is good to align the partition.

http://www.vmware.com/pdf/esx3_partition_align.pdf

Aligned partitions start at sector 128. If the Start value is 63 (the default), the partition is not aligned.

- Did you use RAID 10 for the redo logs?

- Try RAID 50 for the DB data disks; it is faster than RAID 5.

- What is the RAID stripe size? Check the EVA manual for the best size to use with Oracle databases.

- What multipathing policy are you using? It should be Round Robin.
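A minimal sketch of the alignment check mentioned above (start sector 63 versus 128). It assumes 512-byte sectors and a 64 KB alignment boundary; both are assumptions for illustration, so confirm the boundary your array actually needs with the storage team.

```python
SECTOR_BYTES = 512          # assumed classic sector size
BOUNDARY_BYTES = 64 * 1024  # assumed 64 KB alignment boundary


def partition_offset_bytes(start_sector, sector_bytes=SECTOR_BYTES):
    """Byte offset of a partition that begins at the given sector."""
    return start_sector * sector_bytes


def is_aligned(start_sector, boundary_bytes=BOUNDARY_BYTES,
               sector_bytes=SECTOR_BYTES):
    """True when the partition start falls exactly on the boundary."""
    offset = partition_offset_bytes(start_sector, sector_bytes)
    return offset % boundary_bytes == 0


if __name__ == "__main__":
    # 63 is the old default start, 128 is the aligned start quoted above.
    for start in (63, 128):
        print(start, partition_offset_bytes(start), is_aligned(start))
```

With these assumptions, a start of 63 lands at byte 32256, which is not a multiple of 64 KB, while 128 lands exactly on 65536.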

Then, what values do you get from esxtop for the counters below? Run esxtop in batch mode to monitor for 5 minutes.

GAVG (Guest Average Latency): total latency as seen by the guest, roughly DAVG + KAVG.

KAVG (Kernel Average Latency): time an I/O request spent waiting inside the vSphere storage stack.

QAVG (Queue Average Latency): time spent waiting in a queue inside the vSphere storage stack (part of KAVG).

DAVG (Device Average Latency): latency coming from the physical hardware, HBA and storage device.

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=100820...

http://blogs.vmware.com/vsphere/2012/05/troubleshooting-storage-performance-in-vsphere-part-1-the-ba...
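A rough sketch of how a batch capture could be summarised once you have it, for example from something like `esxtop -b -d 5 -n 60 > capture.csv` (about five minutes). The counter-name suffixes below are what I would expect to find in the CSV header, but treat them as assumptions and check them against your own capture. The sample data is made up purely to show the call.

```python
import csv
import io
from statistics import mean

# Assumed header suffixes for the four latency counters in esxtop batch output.
COUNTERS = {
    "DAVG": "Average Driver MilliSec/Command",
    "KAVG": "Average Kernel MilliSec/Command",
    "QAVG": "Average Queue MilliSec/Command",
    "GAVG": "Average Guest MilliSec/Command",
}


def summarize(csv_text, object_filter="Physical Disk"):
    """Return {(column, short_name): (average, maximum)} for latency columns."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    summary = {}
    for idx, name in enumerate(header):
        if object_filter not in name:
            continue
        for short, suffix in COUNTERS.items():
            if not name.endswith(suffix):
                continue
            # Skip short or empty rows rather than failing on them.
            values = [float(r[idx]) for r in data
                      if len(r) > idx and r[idx] != ""]
            if values:
                summary[(name, short)] = (mean(values), max(values))
    return summary


if __name__ == "__main__":
    # Made-up two-sample capture with a hypothetical host and device name.
    sample = "\n".join([
        "Time,"
        "\\\\esx01\\Physical Disk Device(naa.demo)\\Average Driver MilliSec/Command,"
        "\\\\esx01\\Physical Disk Device(naa.demo)\\Average Kernel MilliSec/Command",
        "10:00:00,8.0,0.25",
        "10:00:05,16.0,0.75",
    ])
    for (column, short), (avg, peak) in sorted(summarize(sample).items()):
        print(f"{short} avg={avg:.2f} ms max={peak:.2f} ms  {column}")
```

Reporting the average and the maximum per device keeps a single spike from hiding behind a healthy-looking mean.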

Please don't forget to award point for 'Correct' or 'Helpful', if you found the comment useful. (vExpert, VCP-Cloud. VCAP5-DCD, VCP4, VCP5, MCSE, MCITP)

Sreec
VMware Employee

Hi,

Unfortunately I'm not able to view the Excel sheet you uploaded; it's corrupted. Can you please let me know what sort of performance issue you are seeing on these RAC clustered VMs? Do you have similar problems on any other VMs that are part of the same port group (to isolate whether the issue is specific to these two VMs)? Once I'm clear on that, I will certainly let you know how we can move forward.

Note: Oracle RAC is unsupported clustering in VMware.

Cheers,
Sree | VCIX-5X| VCAP-5X| VExpert 7x|Cisco Certified Specialist
Please KUDO helpful posts and mark the thread as solved if answered

KrishK
Contributor

Thanks Keerthy. Below are my responses to your queries.

- Can you confirm that the LUNs are spread across, and owned by, both controllers?

     Verifying with the storage team

- Did you do the partition alignment? Whether you use VMFS or RDM, it is good to align the partition.

     Default value - 63

- Did you use RAID 10 for the redo logs?

     No, RAID 5

- Try RAID 50 for the DB data disks; it is faster than RAID 5.

     RAID 5

- What is the RAID stripe size? Check the EVA manual for the best size to use with Oracle databases.

     Verifying with the storage team

- What multipathing policy are you using? It should be Round Robin.

     VMW_PSP_FIXED_AP

Then, what values do you get from esxtop? Run esxtop in batch mode to monitor for 5 minutes.

Below are the values from data analysed over 24 hrs:

GAVG (Guest Average Latency), total latency as seen from vSphere: 28 ms, with a max latency of 166 ms

KAVG (Kernel Average Latency), time an I/O request spent waiting inside the vSphere storage stack: 0.08 ms, with a max of 3 ms

QAVG (Queue Average Latency), time spent waiting in a queue inside the vSphere storage stack: 0.05 ms, with a max of 8 ms

DAVG (Device Average Latency), latency coming from the physical hardware, HBA and storage device: 6 ms, with a max of 32 ms

What are the acceptable values for the above parameters?
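As a sanity check on figures like these: GAVG should come out at roughly DAVG + KAVG, so the three averages can be tested against each other before comparing them with any threshold. The sketch below does that. The 10 ms DAVG and 2 ms KAVG limits are commonly quoted rules of thumb and are assumptions here, not hard limits, as is the 1 ms tolerance.

```python
def check_latency(gavg_ms, kavg_ms, davg_ms,
                  davg_limit_ms=10.0, kavg_limit_ms=2.0, tolerance_ms=1.0):
    """Return a list of findings for one set of esxtop latency averages."""
    findings = []
    # GAVG is expected to be approximately DAVG + KAVG.
    gap = gavg_ms - (davg_ms + kavg_ms)
    if abs(gap) > tolerance_ms:
        findings.append(
            f"GAVG differs from DAVG + KAVG by {gap:.2f} ms; check that all "
            "three values come from the same device and interval"
        )
    if davg_ms > davg_limit_ms:
        findings.append("DAVG above limit: look at the array, HBA or fabric")
    if kavg_ms > kavg_limit_ms:
        findings.append("KAVG above limit: look at queuing inside the host")
    return findings


if __name__ == "__main__":
    # 24-hour averages as reported in the post above.
    for finding in check_latency(gavg_ms=28.0, kavg_ms=0.08, davg_ms=6.0):
        print(finding)
```

Run against the 24-hour averages above, the only finding is that the three numbers do not add up, which may simply mean they were averaged over different devices or time windows.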

KrishK
Contributor

Hi Sreec,

There are some queries in the Oracle DB waiting on I/O, hence I'm trying to figure out where exactly the problem resides.

2 ESXi servers each host 1 Oracle VM, and these VMs are in RAC. We see average physical disk latency above 20 ms, and average queue and kernel latency below 1 ms.

Max physical disk latency is 150 ms, kernel latency 3 ms and queue latency 6 ms.

To understand storage performance, do we need to consider only the average values for a time interval, or the max values?

What do these values signify? Where does the problem reside: storage, kernel or queue stack?

CPU and memory of these VMs are less than 25% utilized.

Sreec
VMware Employee

Hi Krish,

Thanks for your response. What is the DAVG value that you are seeing? Since the esxtop screen is updated every 5 seconds by default, I would suggest you take a close look at it for the LUN where the VM resides. If the value is above 10 ms on a constant basis, then you will have a performance issue. Please confirm the values once more.

Cheers,
Sree

KrishK
Contributor

DAVG - 12 ms (varying between 8 and 16 ms, with occasional peaks of 120 ms)

Sreec
VMware Employee

Hi,

If the DAVG value is going above 10 ms on a constant basis, we have storage latency that needs to be fixed.
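One way to make "on a constant basis" measurable is to look at how many samples sit above the limit, and at a high percentile, rather than at the average or the single worst value. A small sketch follows; the sample list is invented to resemble the 8 to 16 ms range with an occasional spike, and the 10 ms limit is the rule of thumb from this thread.

```python
import math


def fraction_above(samples_ms, limit_ms=10.0):
    """Share of samples whose latency exceeds the limit (0.0 to 1.0)."""
    if not samples_ms:
        raise ValueError("no samples")
    return sum(1 for s in samples_ms if s > limit_ms) / len(samples_ms)


def percentile(samples_ms, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


if __name__ == "__main__":
    # Hypothetical DAVG samples in milliseconds, not taken from the report.
    davg = [8, 9, 12, 14, 16, 11, 9, 13, 120, 12]
    print("share above 10 ms:", fraction_above(davg))
    print("median:", percentile(davg, 50))
    print("95th percentile:", percentile(davg, 95))
```

If most samples are above the limit, the latency is sustained; if only the top percentile is, you are more likely chasing isolated spikes.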

Cheers,
Sree

TomHowarth
Leadership

Your issue seems to be related to storage: your DAVG is high. It should be below 10 ms and you are averaging 12. Your partitions are also not aligned properly (they start at the default of 63).

What is the data profile of this RAC? Is it more heavily read or write biased?

If write biased, you really need to make sure that your RDMs are backed by RAID 10 sets, as this will get you the most performance from your disks.

Remember that, no matter what the size of a RAID 5 vDisk, every small random write carries a penalty: both the data and the parity have to be updated, which costs several back-end disk I/Os before the write is complete. A RAID 5 volume therefore delivers far fewer write IOPS than its spindle count suggests.
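To put rough numbers on the RAID 5 versus RAID 10 write cost, the commonly quoted write-penalty figures can be used: 2 back-end I/Os per write for RAID 10, 4 for RAID 5, 6 for RAID 6. This is a back-of-envelope sketch only. The spindle count, per-disk IOPS and read/write mix are hypothetical, and controller write cache and the way the EVA virtualises its disk groups will shift the real figures.

```python
# Commonly quoted back-end I/Os per front-end write for each RAID level.
WRITE_PENALTY = {"RAID10": 2, "RAID5": 4, "RAID6": 6}


def usable_iops(raw_iops, write_fraction, raid_level):
    """Front-end IOPS the disks can sustain for a given write share."""
    if not 0.0 <= write_fraction <= 1.0:
        raise ValueError("write_fraction must be between 0 and 1")
    penalty = WRITE_PENALTY[raid_level]
    read_fraction = 1.0 - write_fraction
    # Each read costs one back-end I/O, each write costs `penalty` of them.
    return raw_iops / (read_fraction + write_fraction * penalty)


if __name__ == "__main__":
    raw = 24 * 150  # hypothetical: 24 spindles at about 150 IOPS each
    for level in ("RAID5", "RAID10"):
        print(level, usable_iops(raw, write_fraction=0.5, raid_level=level))
```

The more write biased the workload, the wider the gap between the two levels becomes, which is why the read/write profile matters before choosing.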

How big is the write cache on the EVAs? What is the loading on that SAN? Also, is it running the latest firmware? There was an issue with earlier EVA controller firmware that caused heavy pathing issues against ESX 4 machines.

Change your pathing policy to Round Robin.

Also repost your Excel logs so that we can read them.

Tom Howarth VCP / VCAP / vExpert
VMware Communities User Moderator
Blog: http://www.planetvm.net
Contributing author on VMware vSphere and Virtual Infrastructure Security: Securing ESX and the Virtual Environment
Contributing author on VCP VMware Certified Professional on VSphere 4 Study Guide: Exam VCP-410

KrishK
Contributor

Thanks Tom. See if you can read the attached Excel file.

TomHowarth
Leadership

I am also going to move this post to the Virtualising Oracle section, as the folks that peruse that section are proper Oracle peeps.

Tom Howarth

Sreec
VMware Employee

Hi Krish,

I tried opening the attached Excel sheet. Looks like it's corrupted :(

Cheers,
Sree

TomHowarth
Leadership

Your file just seems to be an XML file, not an Excel spreadsheet. Please check the file format and upload again.

Tom Howarth

KrishK
Contributor

My bad...please see if the attached one works.

Simon_H
Enthusiast

Well, I thought it would be worth resizing the axes on the chart in that spreadsheet:

There are so many columns on there it's hard to see what's what, but apart from that really bad 174ms spike at 18:27:39, whilst quite a few numbers aren't great (10ms+) they're not catastrophic either (I'm assuming everything is in ms!). It might be worth you trying to group the same measurement from multiple devices together on their own separate charts (one for physical read, one physical write, etc) - it's a bit tedious but shouldn't take too long.
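The regrouping could be scripted rather than done by hand. The sketch below buckets column names by measurement so each bucket can go on its own chart. The `device|measurement` naming and the example columns are purely hypothetical; the real export will use its own layout, so adjust the separator to match.

```python
from collections import defaultdict


def group_by_measurement(column_names, separator="|"):
    """Map each measurement name to the list of devices that report it."""
    groups = defaultdict(list)
    for name in column_names:
        # Split on the last separator: left is the device, right the metric.
        device, _, measurement = name.rpartition(separator)
        groups[measurement.strip()].append(device.strip())
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical column headings, not taken from the attached report.
    columns = [
        "naa.demo1 | Physical device read latency",
        "naa.demo2 | Physical device read latency",
        "naa.demo1 | Physical device write latency",
    ]
    for measurement, devices in group_by_measurement(columns).items():
        print(measurement, "->", devices)
```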

Other than that you need more external information. What are normal response times for this time of the day/month/quarter/year, what it is that makes someone say that the performance is bad, statistics (like AWR) from the database about what its measured response times are, and so on. You need to build up a full picture of the problem and system interactions to try to see what area the problem lies in (it is easy for people to point at the database and say it's slow when it may be a result of something completely different).

Good luck!

Gkeerthy
Expert

From my experience, the disk latency is too high for a DB. I think you need to redesign a lot of sections for the VM.

1- Do a partition alignment for the logs/data disks - refer to my blog http://pibytes.wordpress.com/2013/01/19/partition-alignment-in-vmware-vsphere-5-a-deepdrive-part-1-2...

2- Use RAID 50 for DB data and RAID 10 for redo logs. If it is write intensive, use RAID 10 for DB data.

3- Use the PVSCSI controller for the VM.

4- Use a dedicated LUN for this VM, and check whether there is any bottleneck on the HBA side.

5- Spread the LUNs across the controllers. I believe the EVA is ALUA aware, so use Round Robin as the multipathing policy. Also, I have heard that on the EVA you need to reduce the multipathing threshold: by default it switches to the other path after 1000 I/Os. Check with the HP vendor.

Final questions: are there enough HDD spindles in the array? How many VMs are there, and what IOPS do they require? Was capacity planning done correctly?

KrishK
Contributor

Thanks Keerthy,

This is one helpful answer, but it looks like the issue has been sorted without infra making any changes.

The application team made some changes that resulted in better overall performance; it seems they added indexes to reduce the I/O requests Oracle sends to storage.

I am still validating all your helpful suggestions. While doing so, I changed the LSI Logic controller to the paravirtual controller on a test RHEL 5 VM and it didn't find the boot disk at all. This VM has 5 RDMs and 5 virtual disks.

When I tested the same way with another test RHEL 5 VM, it worked. This VM has 1 VMDK and 1 RDM. Does this make any difference?

Are there any best practices for changing the controller?

Simon_H
Enthusiast

I can't remember when the VMware PV controller driver was included in RHEL - maybe not until 6. If it's not there I assume it gets installed with VMware tools so make sure they're installed before attempting to switch.

Fundamentally things like the PV SCSI are tinkering around the edges though - if you're asking for more sustained IOPS than your storage can provide then its response times will lengthen. Changes that affect SQL execution plans (and there are many), whether application or database configuration inflicted, then have an impact on server resources (most typically storage).

Whilst I don't think you described the context of this performance problem too much, if it's a production system that has been running happily for months (or maybe a year or two given that you're on ESXi 4.1), the first question is probably "what's changed?". If it's a new application version but roughly the same amount of functionality and business, most likely the application/database configuration (like stats) is at fault. If there's a load of new business and processing, perhaps the storage is now undersized. Finally with shared storage I'd suggest making sure that something else outside your control hasn't changed, e.g. that there aren't some new VMs using the same physical disks as you (perhaps after a storage admin did some reorganisation).

Good luck!
