VMware Cloud Community
meistermn
Expert

Single Threaded Application on HP DL 360 G3 faster than on Netapp 6070

The application, Gupta SQL 9.0.1, now runs in a VM. The datastores are on a NetApp 6070 (RAID-DP, 41 FC disks of 300 GB each).

The VM uses 5 LUNs: C: Windows 2003, D: data, E: Gupta SQL, F: Gupta SQL logs, G: pagefile.

Before that, it ran on an HP DL360 G3 (Smart 5i RAID controller, 64 MB cache) with 6 x 36 GB disks and 3 logical volumes on RAID 5.

On the HP DL360 G3, the database update statistics job runs for 1 hour 3 minutes.

The VM on the NetApp needs 2 hours 14 minutes.

I analyzed the Gupta SQL application with perfmon: the average transfer size is 64 KB for reads and writes on the SQL volume (E:) and 4 KB for reads and writes on the SQL log.

Performance Tweaks:

1.) Datacore UpTempo: only 5 minutes faster. I think the DB, which is 30 GB, is too big.

2.) Partition alignment for the E: and F: partitions: 1 hour 45 minutes, an improvement of 29 minutes.

3.) OS filesystem block size (cluster size) changed for E: to 64 KB (default 4 KB): no improvement.

4.) OS filesystem block size changed for F: to 32 KB (default 4 KB): no improvement.

5.) Separate LUN for the pagefile (G:): no improvement.

6.) Queue depth changed to 128 in VMware VC: no improvement.
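
For what it's worth, the gain from step 2 is consistent with simple block arithmetic. The sketch below is only an illustration under assumptions that are not stated in the post: a 4 KB block size on the array, the usual Windows 2003 default partition start at sector 63, and 64 KB as the aligned start. The helper names are made up.

```python
# Sketch only: block size and example offsets are assumptions, not values from the post.
ARRAY_BLOCK = 4096   # assumed storage block size in bytes
SECTOR = 512         # classic disk sector size in bytes


def is_aligned(start_offset_bytes: int, block: int = ARRAY_BLOCK) -> bool:
    """True when the partition start falls on a storage block boundary."""
    return start_offset_bytes % block == 0


def blocks_touched(offset: int, io_size: int, block: int = ARRAY_BLOCK) -> int:
    """Number of storage blocks that one guest I/O of io_size bytes spans."""
    first = offset // block
    last = (offset + io_size - 1) // block
    return last - first + 1


default_start = 63 * SECTOR   # 32256 bytes, assumed Windows 2003 default
aligned_start = 64 * 1024     # 65536 bytes, assumed aligned offset

print(is_aligned(default_start))            # False
print(is_aligned(aligned_start))            # True
print(blocks_touched(default_start, 4096))  # 2
print(blocks_touched(aligned_start, 4096))  # 1
```

With the misaligned start, every 4 KB guest I/O straddles two storage blocks, which roughly doubles the back-end work for small I/O.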

Arrgh!!!! What the hell is the problem? Is it cache for random io?

mreferre
Champion

Interesting.

Is this an I/O bound workload? Or is it CPU bound? Or ...

If it's I/O bound ... what does the disk queue length look like in Perf Monitor?

Massimo.

Massimo Re Ferre' VMware vCloud Architect twitter.com/mreferre www.it20.info

Craig_Baltzer
Expert

In addition to Disk Queue Length as Massimo suggested, Avg. Disk sec/Read and Avg. Disk sec/Write would be interesting to see, esp if you still have the DL360 G3 around for comparison.

Craig

meistermn
Expert

It is CPU, I/O and memory bound.

CPU usage stays above 50%.

Memory usage is 70-80%.

I/O is dropping (see attachment).

The VM has 2 VCPU and 4 GB RAM.

Craig_Baltzer
Expert

The dropping IO is just a symptom; something is causing the IO rate to drop. The disk sec/read and disk sec/write counters will give you some insight into whether the storage response is getting slower at the same time (i.e. if disk sec/write goes from 20ms to 500ms, then disk usage (IO rate) is going to drop and you know to look at the IO path as a possible source). If disk sec/write stays constant, then the contention likely lies elsewhere...
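
The link between latency and IO rate described above follows from Little's law: the achievable IO rate is roughly the number of outstanding IOs divided by the per-IO latency. A minimal sketch, assuming a single-threaded workload that keeps one IO in flight (the function name is made up):

```python
def max_iops(outstanding_ios: int, latency_ms: float) -> float:
    """Upper bound on IO rate from Little's law: rate = concurrency / latency."""
    return outstanding_ios * 1000.0 / latency_ms


# One IO in flight is an assumption for a single-threaded application.
print(max_iops(1, 20))    # 50.0 IOPS at 20 ms per IO
print(max_iops(1, 500))   # 2.0 IOPS at 500 ms per IO
```

So a latency jump from 20 ms to 500 ms cuts the ceiling by a factor of 25 without any change in what the application is asking for.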

meistermn
Expert

The disk sec/read is 1182, often over 1000. This is bad.

The disk sec/write is 25.

Craig_Baltzer
Expert

Certainly "feels" like there are some IO challenges. Have you had a look at ? It would be interesting to see what ESX is seeing in terms of IO for the LUNs being used for the databases, with GAVG/cmd and DAVG/cmd for the VM being of particular interest. I'd assume that since "read" is the issue that you'd mainly want to look at the database LUN as the log LUN shouldn't be getting used on a read operation...
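
As a rough guide to reading those two counters: GAVG/cmd (latency as the guest sees it) is approximately DAVG/cmd (device latency) plus KAVG/cmd (time spent in the ESX kernel). The sketch below splits a reading that way. The thresholds and the sample readings are assumptions for illustration only, not values from this thread, and the function name is made up.

```python
# Assumed thresholds (not from the thread); tune them for your environment.
DAVG_WARN_MS = 25.0   # device latency: HBA, fabric and array
KAVG_WARN_MS = 2.0    # kernel latency: queuing inside the ESX storage stack


def locate_latency(gavg_ms: float, davg_ms: float) -> str:
    """Name the likely hot spot from guest-visible and device latency."""
    kavg_ms = gavg_ms - davg_ms   # GAVG/cmd is approximately DAVG/cmd + KAVG/cmd
    if davg_ms >= DAVG_WARN_MS:
        return "device"
    if kavg_ms >= KAVG_WARN_MS:
        return "kernel"
    return "ok"


# Hypothetical readings, purely to show the three outcomes.
print(locate_latency(gavg_ms=60.0, davg_ms=58.0))  # device
print(locate_latency(gavg_ms=30.0, davg_ms=5.0))   # kernel
print(locate_latency(gavg_ms=6.0, davg_ms=5.0))    # ok
```

A high device share points at the path to the array or the array itself; a high kernel share points at queuing on the host.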

Craig_Baltzer
Expert

Forgot to ask if this VM was P2V'd from the physical DL360 or was a fresh OS install? If it was a P2V have all the DL360/HP specific drivers and devices been cleaned out of the VM?

meistermn
Expert

It was a fresh install.

meistermn
Expert

Based on the following document, I want to be sure I understood the three memory statistics correctly:

http://communities.vmware.com/docs/DOC-5600

Level: 1
Counter name in API: mem.usage.average
Description: The percentage of memory used as a percent of all available machine memory. Available for host and VM.
Units: percent

This means that when I look at the performance statistics in VC for the VM SV100138, which has 4 GB of vRAM, I should see a percentage. So what is 3699 percent?

Look at the attachment. Unclear.

Level: 2
Counter name in API: mem.active.average
Description: The amount of memory used by the VM in the past small window of time. This is the "true" number of how much memory the VM currently has need of. Additional, unused memory may be swapped out or ballooned with no impact to the guest's performance.
Units: kiloBytes

Active means how much memory the VM has used, in KB. This means that it used 1551892 KB (~1.5 GB of memory). That seems to be clear.

Level: 2
Counter name in API: mem.consumed.average
Description: The amount of machine memory that is in use by the VM. While a VM may have been configured to use 4 GB of RAM, as an example, it might have only touched half of that. Of the 2 GB left, half of that might be saved from memory sharing. That would result in 1 GB of consumed memory.
Units: kiloBytes

Unclear. In the attached performance graph, consumed memory is constant at 3548044 KB (~3.5 GB of memory).

Okay, active memory in VC for a VM is equal to (Total - Available = Used) in Sysinternals Process Explorer and in Task Manager (Physical Memory section). See attachment.

meistermn
Expert

It seems to me that I cannot trust Windows perfmon or Sysinternals Process Monitor for memory.

I used 4 performance tools.

1.) VC (see attachment vc-active-memory.gif)

2.) Perfmon inside the VM (see attachment vmperfmon-activemory.gif)

3.) Process Monitor inside the VM (see attachment processmonitor-physical-memory.gif)

4.) Spotlight on Windows, run from outside on an XP client.

mcowger
Immortal

How does your Host connect to the NetApp? FC? NFS? iSCSI? Over what kind of link?

--Matt

--Matt VCDX #52 blog.cowger.us

meistermn
Expert

FC

meistermn
Expert

Look at the spotlight-overview. SQL-Gupta is at the moment running.

Read hit ratio is 0%

Pages found in RAM: 50%

Reading from pagefile to memory: 3582 pages/s to 5100 pages/s

From perfmon: queue depth is 29, disk writes/sec 245,178 and disk reads/sec 1695,234?

Craig_Baltzer
Expert

The "3699%" is definitely strange. It should be close to "Memory Active" / "Memory Granted". Do any of your other VMs show a strange % like this? Wonder if this is a bug in the localization of the UI...

Consumed memory looks reasonable. Most times when the OS and applications start they allocate a bunch of memory (a.k.a. "consume" it) by setting up buffers, caches, etc. They then may never actually write any data there (i.e. cache never fills, etc.). In this case the VM has 3.9GB allocated to it, the OS + apps have issued memory requests for 3.6GB, and it is busily using 1.5GB of that, so the proportions make sense, i.e. (granted > consumed > active).
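
As a rough check of the numbers quoted earlier in the thread, the sketch below computes active / configured memory and tests the ordering above. It assumes 4 GB configured (from the earlier post) in place of the granted value. One possible reading, offered as an assumption and not a confirmed diagnosis: the vSphere API is documented as returning percentage counters in hundredths of a percent, so "3699" may simply be 36.99% with the decimal separator lost.

```python
KB_PER_GB = 1024 * 1024

configured_kb = 4 * KB_PER_GB   # 4 GB of vRAM, per the earlier post (stand-in for "granted")
active_kb = 1551892             # mem.active.average quoted in the thread
consumed_kb = 3548044           # mem.consumed.average quoted in the thread


def usage_hundredths(active: int, configured: int) -> int:
    """Active/configured expressed in hundredths of a percent, truncated."""
    return int(active / configured * 10000)


def ordering_ok(granted: int, consumed: int, active: int) -> bool:
    """The sanity check from the post: granted > consumed > active."""
    return granted > consumed > active


print(usage_hundredths(active_kb, configured_kb))            # 3699
print(round(active_kb / KB_PER_GB, 2))                       # 1.48
print(round(consumed_kb / KB_PER_GB, 2))                     # 3.38
print(ordering_ok(configured_kb, consumed_kb, active_kb))    # True
```

The GB figures here use binary units (1 GB = 1024 x 1024 KB), which is why consumed memory shows as about 3.4 GB and not the ~3.5 GB obtained by dividing by one million.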

There may not be a direct relationship between "Available" in System Information and "Consumed" in VC. Page sharing is going on (ESX looking for pages that can be shared thus reducing the "consumed" value). When I add "Memory Shared" to my graphs it is almost a mirror image of "Memory Consumed" (i.e. if "Memory Shared" is going up then "Memory Consumed" tends to be going down at the same time). If there are no other VMs running on the ESX server then they should be very close.

From a performance perspective memory looks good. The counters you have show the "Memory Balloon" driver is not active, meaning that there is no memory overcommitment putting pressure on this VM to give up memory. You could also add "Memory Swapped" just to verify that ESX is not doing any swapping, however I'm almost 100% certain there is no swapping from looking at Memory Active and Memory Consumed.

Craig_Baltzer
Expert

From the OS the closest thing to an "active" memory counter would be in perfmon (Process, Working Set), not in System Information or Task Manager. There's not going to be a direct mapping between "active" and "working set", as the algorithms used to calculate them are very different, as is the calculation time period. So ESX may think memory is inactive while Windows still thinks it's part of the working set for a process.

I don't think there is anything of value trying to relate these two things together when both ESX and the OS are not showing any memory overcommitment/swapping/paging...

RParker
Immortal

3 Things

We have Dell 2950, SAS drives.

We have Netapp 3070 via 4GB Fiber. We just upgraded to 300G / 15K Drives on a brand new system (from 3050).

We have the SAME problem as you. It can't be what you and I are doing. You have HP, so the only common denominator is NetApp. Not to slam NetApp, but it seems to be a common issue. Maybe it's the Fiber drivers.

You didn't say if you were using Emulex or QLogic. I am moving toward Emulex for better performance, based on early testing I have done.

But I just wanted to say you can set your mind at ease: it's not you or the configuration, or some "slow down". Something inherent in this setup just doesn't jibe, and I don't know what it is. But the other common factor obviously is ESX, so perhaps there is something behind the scenes that isn't letting ESX be all it can be.

RParker
Immortal

You can dig around all you want. This has been bugging me for months. I give up.

I can take a new setup, install ESX on a 2950 with 32GB of RAM. Doesn't matter what we do: Linux, Windows 2003, Windows 2008, SQL Server, Oracle, 4GB or 8GB of RAM, 2 CPU, 4 CPU, and everything in between, every combination. Trust me, I have tried it. It makes NO difference.

Fiber seems to be "laggy", for lack of a better word, in ESX. But it's NOT from a physical perspective; it's VERY fast native on a stand-alone machine, doing the SAME operations.

It MUST be a flaw in ESX. It works fine, it's just NOT as good as it SHOULD be, and unless someone sits down with VMware to sort this out, this isn't going to be fixed.

Craig_Baltzer
Expert

Ok, pardon the dumb Canadian question but these "periods" and "commas" are confusing me in the localized UI screens. Does 3.583 mean "three thousand, five hundred and eighty three" and 3,75 mean "three and three quarters" (i.e. between 3 and 4)?
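
If the screens use German-style number formatting (an assumption; the thread does not confirm the locale), the period is the thousands separator and the comma is the decimal mark. A minimal sketch of that reading, with a made-up helper name:

```python
def parse_de_number(text: str) -> float:
    """Parse a number written with '.' for thousands and ',' for decimals."""
    return float(text.replace(".", "").replace(",", "."))


print(parse_de_number("3.583"))     # 3583.0
print(parse_de_number("3,75"))      # 3.75
print(parse_de_number("245,178"))   # 245.178
print(parse_de_number("1695,234"))  # 1695.234
```

Under that reading, the perfmon figures quoted earlier are about 245 writes/sec and about 1695 reads/sec, not hundreds of thousands.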

I think the Spotlight memory section is getting confused by the virtualization, as its numbers don't make any sense. On the one hand it says that only 50% of the pages are in RAM, so the rest are paged, yet there is only 31MB used in the page file. The paging rate implies that there are 3582 page operations per second going to the page file, yet the page file is on G: and there is no IO going against that drive showing in perfmon. What does perfmon say for Memory, Pages/sec? That's the counter that tells you about "hard" paging that caused something to be read from or written to disk (which, based on drive G: showing no IO, I don't think is happening).

The disk counters are interesting to see, particularly the IO rates (transfers/sec). That's showing 1700 IOPS going to the LUN for drive F, and 245 going to the LUN for drive E. Are these the same set of disks on the NetApp (i.e. are E: and F: just two VMDK files on the same SAN LUN, or are they 2 LUNs that share the same set of disks)? To me it looks like there is a significant issue going on with disk writing, based on the low IOPS and high queue length. The rough rule of thumb is that a 15K RPM FC disk can deliver 140 IOPS in a RAID1 configuration, and around 120 IOPS in a RAID5 configuration, give or take a bit. So you're seeing basically 2 physical drives' worth of IO happening for drive E: and a ton of queuing. Something is way not right.
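
The rule of thumb above can be turned into a quick "how many spindles is this" estimate. A minimal sketch using only the per-disk figures and IO rates quoted in the post; the function and dictionary names are made up:

```python
# Per-disk figures from the rule of thumb above (15K RPM FC disk).
IOPS_PER_DISK = {"raid1": 140, "raid5": 120}


def disks_worth(iops: float, raid: str = "raid5") -> float:
    """Observed IO rate expressed as an equivalent number of physical disks."""
    return iops / IOPS_PER_DISK[raid]


print(round(disks_worth(245), 1))    # 2.0
print(round(disks_worth(1700), 1))   # 14.2
```

Against an aggregate of 41 disks, both figures are well below what the spindle count alone would suggest, which is consistent with the suspicion that something other than raw disk capacity is the limit.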

So to take the database out of the picture, I'd suggest moving to IOMETER for another round of testing to see what the actual throughput is. That will let you create a few different LUN configurations on the SAN and see what works best. Whenever you're using IOMETER for testing, make the disks the same size as the ones you're using for the databases and let IOMETER use the whole disk; you want to make sure that you don't create little disks that the NetApp can easily cache, which would skew your results.

mcowger
Immortal

Interesting - we have no problem pushing hundreds of MB/sec at thousands of IOPs on our 32GB Dell 2950s against our fibrechannel SANs.

--Matt

--Matt VCDX #52 blog.cowger.us