meistermn
Expert

Single Threaded Application on HP DL 360 G3 faster than on Netapp 6070

The application, Gupta SQL 9.0.1, is now running in a VM. The datastores are on a NetApp 6070 (RAID-DP, 41 FC disks of 300 GB each).

The VM uses 5 LUNs: C: Windows 2003, D: data, E: Gupta SQL, F: Gupta SQL logs, G: pagefile.

Before that, it ran on an HP DL360 G3 (Smart Array 5i RAID controller, 64 MB cache) with 6 x 36 GB disks and 3 logical volumes on RAID 5.

On the HP DL360 G3 the database update statistics job runs 1 hour 3 minutes.

The VM on the NetApp needs 2 hours 14 minutes.

I analyzed the Gupta SQL application with perfmon: the average transfer size is 64 KB for reads and writes on the SQL database (E:) and 4 KB for reads and writes on the SQL log (F:).

Performance Tweaks:

1.) DataCore UpTempo: only 5 minutes faster. I think the DB, which is 30 GB, is too big.

2.) Partition alignment for the E: and F: partitions: 1 hour 45 minutes. Improvement: 29 minutes.

3.) OS filesystem block size (cluster size) changed for E: to 64 KB (default 4 KB): no improvement

4.) OS filesystem block size changed for F: to 32 KB (default 4 KB): no improvement

5.) Separate LUN for pagefile G: no improvement

6.) VMware VC, changed queue depth to 128: no improvement
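To put the runtimes quoted above side by side, here is a minimal Python sketch. The helper names are made up for illustration; the only inputs are the three runtimes from this post.

```python
def to_minutes(hours, minutes):
    """Convert an 'H hours M minutes' runtime into total minutes."""
    return hours * 60 + minutes


def slowdown(runtime, baseline):
    """How many times longer `runtime` is than `baseline`."""
    return runtime / baseline


physical = to_minutes(1, 3)     # HP DL360 G3, local RAID 5
vm_before = to_minutes(2, 14)   # VM on the NetApp, before any tweak
vm_aligned = to_minutes(1, 45)  # VM on the NetApp, after partition alignment

print("saved by alignment:", vm_before - vm_aligned, "minutes")
print("VM vs physical before tweaks: %.2fx" % slowdown(vm_before, physical))
print("VM vs physical after alignment: %.2fx" % slowdown(vm_aligned, physical))
```

So even after alignment, the VM still takes roughly two thirds longer than the old physical box.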

Arrgh!!!! What the hell is the problem? Is it cache for random io?

30 Replies
Craig_Baltzer
Expert

Two other quick things:

  1. General "rule of thumb" is not to increase queue depth above 64 on ESX (not that it is particularly relevant here since you had performance problems at the defaults, but there is always a concern in a shared SAN environment of overrunning the SAN storage controller)

  2. Have another look at the "storage analysis" section, particularly using esxtop to look at %USD (amount of queue depth being used), DAVG/cmd (latency in the SAN), KAVG/cmd (latency in the kernel) and ABRTS/s (the number of commands being aborted, basically a timeout)

You can import esxtop data into perfmon to make it more readable. You first set up esxtop to capture the data to a CSV file, and http://communities.vmware.com/docs/DOC-5100 goes through how to get it into perfmon. Looking at things in perfmon with all the features available to you (especially with the Vista/Windows 2008 perfmon) is much better than trying to figure anything out from the esxtop "table" display...
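If perfmon is not at hand, a captured CSV can also be summarised with a few lines of Python. This is only a rough sketch: the capture is assumed to come from esxtop batch mode (something like `esxtop -b -d 5 -n 60 > capture.csv`), the column headers in the sample below are hypothetical stand-ins (real esxtop headers are much longer), and `average_counters` is a made-up helper name.

```python
import csv
import io
from statistics import mean


def average_counters(csv_text, wanted):
    """Average every column whose header contains one of the `wanted` substrings."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    picked = [i for i, name in enumerate(header)
              if any(w in name for w in wanted)]
    samples = {header[i]: [] for i in picked}
    for row in reader:
        for i in picked:
            try:
                samples[header[i]].append(float(row[i]))
            except (ValueError, IndexError):
                continue  # skip empty or non-numeric cells
    return {name: mean(vals) for name, vals in samples.items() if vals}


# Hypothetical sample; header names are placeholders, not real esxtop output.
SAMPLE = """Time,disk0 DAVG/cmd,disk0 KAVG/cmd,disk0 ABRTS/s
10:00:00,12.0,0.5,0
10:00:05,18.0,1.5,0
"""

for name, value in average_counters(SAMPLE, ["DAVG", "KAVG"]).items():
    print(name, value)
```

For a real capture, read the file contents into a string and pass whatever substrings identify the latency columns in your header row.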

meistermn
Expert

There is a huge difference between single-threaded and multithreaded applications.

With multithreaded Iometer testing (25 outstanding I/Os) on the NetApp 6070 I can get 180 MB/s for 100% random read.

But for a single-threaded application (1 outstanding I/O) it is only 7 MB/s, or slower, for random read.
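One way to read these two numbers is through Little's law: throughput = outstanding I/Os x block size / latency. The sketch below turns that around to get the per-I/O latency implied by each result. The 4 KB block size is an assumption (the Iometer transfer size is not stated here), 1 MB is taken as 10^6 bytes, and the function name is made up.

```python
def implied_latency_ms(throughput_mb_s, outstanding_ios, block_bytes):
    """Per-I/O latency implied by Little's law, in milliseconds.

    throughput = outstanding_ios * block_bytes / latency
    =>  latency = outstanding_ios * block_bytes / throughput
    """
    throughput_bytes_s = throughput_mb_s * 1_000_000
    return outstanding_ios * block_bytes / throughput_bytes_s * 1000


BLOCK = 4096  # assumed transfer size; not given in the thread

single = implied_latency_ms(7, 1, BLOCK)     # 1 outstanding I/O, 7 MB/s
multi = implied_latency_ms(180, 25, BLOCK)   # 25 outstanding I/Os, 180 MB/s

print("single-threaded: %.3f ms per I/O" % single)
print("multithreaded:   %.3f ms per I/O" % multi)
```

If both tests used the same block size, the implied per-I/O latency comes out nearly identical in both cases. That would mean the array is not slower per request for the single-threaded case; the application simply never has more than one request in flight, so throughput is capped by latency.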

Read the following document. Page 20 and Page 21.

There are two paging files: one on C: and one on G:.

meistermn
Expert

We use QLogic cards (PCI-X). Look at the SNIA document for single-threaded applications.

I did the following tests, all running DataCore UpTempo:

HP DL360 with Smart Array 5i RAID controller without BBWC (64 MB cache, default 100% read-ahead), 6 x 36 GB disks, 3 logical volumes, all RAID 5.

Iometer result: 20 MB/s for 100% random read and 1 outstanding I/O

IBM DS 8000 with RAID 5 (7+1)

Iometer result: 16 MB/s for 100% random read and 1 outstanding I/O

NetApp 6070 with RAID-DP, 41 disks.

Iometer result: 7 MB/s for 100% random read and 1 outstanding I/O

I read about RAID levels 1, 10, 5 and 6 in SQL and Oracle documents, and they all say that RAID 10 should be used for SQL logs and also for the SQL DB. In addition, a Sun document says that RAID level 6 (RAID-DP) can only achieve 66% of RAID 5 performance.

I am also astonished by the new NetApp Performance Acceleration Module.

They say the following:

"For read intensive random I/O applications that are latency sensitive this requires the configuration of a high number disks in order to provide user and application acceptable latencies."

And if you look at the SNIA document on page 20, at the note "Single thread applications are extremely sensitive to latency ...", they point in the same direction.

For me, applications that need high random read/write I/O should be placed on solid state disks or use a lot of cache.

I am very interested in the new Intel solid state disks. Look at the iozone results (sorry, the text is German, but the random values speak for themselves).

iozone results (KByte/sec)

(1) Intel SSD 80 GB SATA II
(2) Seagate Cheetah 73.4 GB 15k rpm SAS
(3) 2 x WD Raptor 36 GB 10k rpm RAID0 SATA
(4) Maxtor DiamondMax 9 Plus 160 GB 7.2k rpm SATA

Test                           (1)       (2)       (3)       (4)
sequential write 4 KB       75'471    74'730    47'840    44'720
sequential write 16 KB      76'178    87'147    50'381    45'463
sequential write 32 KB      77'325    94'126    66'401    44'700
sequential write 64 KB      76'484   104'642    68'319    41'458
sequential write 128 KB     77'204    78'664    84'472    43'977
sequential write 256 KB     77'257    87'703    89'999    49'007

sequential read 4 KB       240'626    61'252    20'062    39'566
sequential read 16 KB      240'505    69'315    21'810    43'703
sequential read 32 KB      239'662    74'521    32'627    37'019
sequential read 64 KB      239'235    81'718    39'331    39'904
sequential read 128 KB     242'603    36'893    53'627    40'527
sequential read 256 KB     240'404    71'301    49'105    43'344

random write 4 KB           48'194     4'521     3'005     1'964
random write 16 KB          75'127     7'356     7'573     5'640
random write 32 KB          79'788    12'493    13'012     8'853
random write 64 KB          81'474    22'158    20'132    12'756
random write 128 KB         79'147    33'757    24'931    17'808
random write 256 KB         80'460    47'147    35'806    23'862

random read 4 KB            30'287     2'617     1'203       362
random read 16 KB           95'552     4'504     4'331     1'451
random read 32 KB          149'029     7'291     7'057     3'094
random read 64 KB          197'591    12'840    11'432     5'775
random read 128 KB         267'814    24'534    14'976    10'169
random read 256 KB         205'981    26'387    18'218    12'364

meistermn
Expert

I think 3.583 means "three thousand, five hundred and eighty three".

I don't think that Spotlight, perfmon and Sysinternals Process Monitor show false memory usage, but I am not sure.

What if only 2 GB of the 4 GB can be used?

4 GB in 32-bit Windows means that 2 GB of address space is used for user mode and 2 GB for kernel mode.

This means to me that in 32-bit Windows Standard Edition an application cannot use more than 2 GB in user mode.

Kernel mode can use user-mode memory, but user mode cannot use kernel memory!!!

The only way to use more user-mode memory is to use the /3GB switch in boot.ini. I do not know if this works for the Standard Edition of 32-bit Windows.
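As a small illustration of the split described above, the following Python sketch divides the 32-bit virtual address space (2^32 bytes) between user mode and kernel mode. The function name is made up, and it only models the default 2 GB / 2 GB split and the 3 GB / 1 GB split that the /3GB boot option is meant to give; whether a given application can actually use the extra user-mode space is a separate question.

```python
GIB = 2 ** 30
ADDRESS_SPACE = 2 ** 32  # a 32-bit pointer can address 4 GiB


def address_space_split(user_gib):
    """Return (user-mode GiB, kernel-mode GiB) for a 32-bit address space."""
    user_bytes = user_gib * GIB
    if not 0 < user_bytes < ADDRESS_SPACE:
        raise ValueError("user-mode share must be between 0 and 4 GiB")
    return user_bytes // GIB, (ADDRESS_SPACE - user_bytes) // GIB


print("default split:", address_space_split(2))  # user 2 GiB, kernel 2 GiB
print("with /3GB:    ", address_space_split(3))  # user 3 GiB, kernel 1 GiB
```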

Craig_Baltzer
Expert

The SAN controllers typically have a bunch of cache already on them, however "more is better" holds true when it comes to cache.

Single-threaded, sequential I/O is definitely latency sensitive, and direct-attached disk will typically outperform anything else just because of the physics (the signal has to travel less than 1 meter and go through 1 controller rather than travelling 10s of meters and going through multiple controllers). That being said, it is extremely rare to find any multi-user application (or any application at all, for that matter) that is single threaded nowadays.

The thing that sticks out for me looking at your perfmon disk results is that reads look reasonable, but the writes are horrible. That usually points back to some kind of write cache issue (cache disabled, not being used, or the application is requesting a synchronous "write through" that is being honoured).

Craig_Baltzer
Expert

I don't think that perfmon and sysinternals are "wrong" per se, I just think they're looking at counters that don't correspond exactly to the ESX/VC counters.

Spotlight's estimation of "hard" page rates just isn't supported by the perfmon disk numbers. If you're hard paging 3582 pages of memory per second, then it has to be going to the page file, which is either on C: or G:. The whole G: drive is showing no I/O at all, so it's not going there, and the C: drive is showing 4000 bytes/sec. The OS is certainly not paging 1 byte at a time, so the 3582 pages/sec number, as well as the "cache hit ratio" that Spotlight is coming up with, is suspect.
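The mismatch is easy to check with a little arithmetic. A minimal Python sketch, assuming the usual 4 KB page size on 32-bit x86 Windows (the function name is made up):

```python
PAGE_SIZE = 4096  # bytes per page on x86 (assumed standard 4 KB pages)


def paging_bytes_per_sec(pages_per_sec, page_size=PAGE_SIZE):
    """Disk traffic that a given hard page rate would have to generate."""
    return pages_per_sec * page_size


expected = paging_bytes_per_sec(3582)  # what Spotlight's number implies
observed = 4000                        # bytes/sec perfmon shows on C:

print("expected disk traffic: %.1f MB/s" % (expected / 1_000_000))
print("observed disk traffic: %d bytes/s" % observed)
print("observed bytes per reported page: %.2f" % (observed / 3582))
```

Hard paging at that rate would need roughly 14.7 MB/s of page file traffic, while the disk counters show about one byte per reported page.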

meistermn
Expert

I removed the pagefile that was on the G: partition and also removed the G: partition.

Now there is only one pagefile, on the C: drive.

Then I changed the registry value DisablePagingExecutive under HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management.

Over the day, only 10 MB of pagefile (before: 90 MB) was needed.

Then, when the Gupta DB job started at night, Spotlight showed that 2000-3000 pages/s were read from the 10 MB pagefile.

Then I asked myself what is using the pagefile, so I started Sysinternals Process Monitor while Gupta was running and filtered on the path c:\pagefile.sys.

Boom!!! Part of the dbntsrv process (Gupta SQL) was paged out to the pagefile. So is it badly programmed?

Only Physical Power will help: Fusion IO

mreferre
Champion

So assuming the same behaviour applies to the physical servers you used to have .... are you saying that this usage pattern (well designed or badly designed - whatever it is) is very much penalized when running on ESX ?

Massimo.

Massimo Re Ferre' VMware vCloud Architect twitter.com/mreferre www.it20.info

meistermn
Expert

We tried a new solution for the Gupta application, because the customer was not satisfied.

It was implemented on a physical server.

We used an HP DL360 G5 with 8 GB memory, a P400i with 256 MB battery-backed cache and 3 x 36 GB disks (one spare) in RAID 1, and a second RAID controller, a P800i with 512 MB cache, attached to an external DAS with 10 x 36 GB 15k disks in RAID 5.

Now we get 130 MB/s on the DAS!!! So at the moment DAS is faster than any SAN!!!

I think it definitely has to do with the cache on the P800i controller and the different RAID levels (10, 5, 6).

So in January 2009 we will test with the 4 x 16 GB NetApp Performance Acceleration cards, to see whether we can get the same or better performance on a SAN!!!

CWedge
Enthusiast

I've had problems with HP gear and ESX for years...

I have proven over and over again that if you take ESX running against a NetApp NFS store and create a VM, you get meager performance.

Remove ESX from said machine and install Ubuntu and you get blazing speed. Same hardware, same NFS store.

The answer: ESX binds the USB IRQs to the console, which only uses CPU 0. Because the IRQs are shared with your storage controller and NICs, it also restricts all of those to CPU 0.

We disabled USB and made sure the driver didn't start in ESX. We got a roughly 5x throughput increase.

We went from 70 MB/s on local VMFS storage to 333 MB/s and were able to saturate 1 GbE at 100 MB/s to the NetApp 3070.

This has been true on any HP servers from like 5 G3+ that I've used.

CWedge
Enthusiast

See this thread for an example of my suggestion being a success.

http://communities.vmware.com/message/1271112#1271112
