Tpgit
Contributor

SQL Server 2008 on ESXi 4.1 and EqualLogic SAN - tuning/performance issues

We’re testing our first soon-to-be production servers with vSphere, and running into some performance issues.

First off, the setup:

- Two-tier Sage CRM system, with one dedicated web server (IIS) and one dedicated DB server (SQL).

- Both servers are running Windows Server 2008 R2 64-bit, each with its own VMFS volume on the SAN. Each VM has two virtual disks: a thin 30GB OS partition and a larger thin data partition (50GB and 250GB respectively). Both partitions are 'VM aligned', since Server 2008 starts partitions at a 1024K offset and the VMFS volumes were created in vSphere (see the alignment check after this list).

- No snapshots currently on volumes.

- The DB server runs SQL Server 2008 R2 64-bit, with 2 vCPUs, 10GB RAM, and pvscsi/vmxnet3. The web server is set up with 1 vCPU, 8GB RAM, and pvscsi/vmxnet3.

- Both are running on Dell R610 servers with dual quad-core Intel Xeon E5540s at 2.53GHz and 32GB RAM.

- iSCSI SAN: Dell EqualLogic PS4000E with PowerConnect 6224 switches, 4x Broadcom 5709 NICs running as iSCSI HBAs at 1500 MTU (to utilize the Dell EqualLogic MEM).
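For anyone who wants to double-check the alignment from inside the guest, the built-in wmic tool shows the partition offsets; this is just a quick sketch:

    wmic partition get Name,StartingOffset

A StartingOffset of 1048576 bytes (1024K) confirms the Server 2008 default alignment.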

Performance of the SQL server is less than ideal. Executing large queries takes longer on this setup than on our old server (IIS and SQL on the same box with a local RAID 5 array, dual-core Xeon 5130 at 2.0GHz with 4GB RAM).

I would think it should be blazing fast. It's pretty much a canned install of SQL, but we currently have the SQL 'priority boost' option turned on to see if that helps.

What things should I be looking at in terms of finding bottlenecks? The SAN is fast, the procs are fast, and it’s got plenty of memory, so something seems off.

Thanks a lot for any suggestions!

Aaron

18 Replies
RParker
Immortal

What things should I be looking at in terms of finding bottlenecks? The SAN is fast, the procs are fast, and it’s got plenty of memory, so something seems off.

Run this: http://www.microsoft.com/downloads/details.aspx?familyid=9a8b005b-84e4-4f24-8d65-cb53442d9e19&displa...

I have had the same doubts myself, but using this tool I recently found out that VMs are NOT the problem. So something else is not quite working right. There is some layer between the hypervisor and the VM when SQL is running that I can't quite figure out. I can run very high IO on local disk (much like you said), and the SAN should then be faster; that's true.

I haven't been able to identify what's causing it either. It makes no sense: the IO is fine, the VM is fine, and your hardware is fully supported, so why is it a mystery? It would appear the commands are not translated properly, because the SAN isn't lagging (latency-wise everything seems perfect), yet SQL runs slower on the SAN, and I don't have an answer. At least with this tool you know it's not VMware, because the disk throughput is there.

It could be a timing issue: VMs are scheduled, and maybe in between time slices it "misses" an IO call. I am just guessing; I really don't know what it is, but it happens on many VMs where SQL/Oracle is involved. So we generally don't run SQL in a VM because of it.

vmroyale
Immortal

Hello.

What block size are you using on the VMFS volumes? There was a discussion not too long ago about bad performance when using the 8MB block size.

Good Luck!

Brian Atkinson | vExpert | VMTN Moderator | Author of "VCP5-DCV VMware Certified Professional-Data Center Virtualization on vSphere 5.5 Study Guide: VCP-550" | @vmroyale | http://vmroyale.com
Tpgit
Contributor

We're using a 2MB block size on the VMFS volumes.

Tpgit
Contributor

I'm running the SQLIO tool now and logging the results. I will post them here when it's done, and perhaps you could give me some feedback on the numbers? I'm honestly not sure what I should expect to see. I'd love to hear that the VM/SAN is not the issue and that it's a SQL thing that we can tune out.
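For reference, the runs look roughly like this (a sketch based on the SQLIO documentation; the file path, thread count, and 100MB size are just what the blog guide I'm following uses):

    sqlio -kR -s120 -frandom -o8 -b8 -LS -BN -Fparam.txt
    sqlio -kW -s120 -fsequential -o8 -b64 -LS -BN -Fparam.txt

where param.txt holds one line per test file (path, threads, mask, size in MB):

    c:\sqlio\testfile.dat 2 0x0 100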

Another forum reply suggested dropping the second vCPU; any thoughts on this?

RParker
Immortal

I'm honestly not sure what I should expect to see.

It varies, but anything over 100Mb should be great. We got around 400-500Mb of throughput, so your numbers should land in that range. Since the connection, the VM, and the SAN clearly weren't the issue for us, that's why I don't understand why the performance isn't great.

Tpgit
Contributor

400-500Mb or MB of throughput, and on which test? For instance, I'm seeing almost 14K IOPS on 8KB random reads @ 109MB/sec, versus 297 IOPS on 256KB sequential writes @ 74MB/sec...

This is an iSCSI SAN with 14 active SATA 7200RPM drives in a RAID 50 configuration.

The SQLIO test results are in:

Random Writes:
  8KB IOs:    6500 IOPS,  2ms average latency @  51MB/sec
  64KB IOs:   1150 IOPS, 12ms average latency @  74MB/sec
  128KB IOs:   647 IOPS, 24ms average latency @  81MB/sec
  256KB IOs:   339 IOPS, 46ms average latency @  85MB/sec

Sequential Writes:
  8KB IOs:    7300 IOPS,  1ms average latency @  57MB/sec
  64KB IOs:   1228 IOPS, 12ms average latency @  77MB/sec
  128KB IOs:   595 IOPS, 26ms average latency @  74MB/sec
  256KB IOs:   297 IOPS, 53ms average latency @  74MB/sec

Random Reads:
  8KB IOs:   13966 IOPS,  0ms average latency @ 109MB/sec
  64KB IOs:   1784 IOPS,  8ms average latency @ 111MB/sec
  128KB IOs:   896 IOPS, 17ms average latency @ 112MB/sec
  256KB IOs:   448 IOPS, 35ms average latency @ 112MB/sec

Sequential Reads:
  8KB IOs:   13832 IOPS,  0ms average latency @ 108MB/sec
  64KB IOs:   1781 IOPS,  8ms average latency @ 111MB/sec
  128KB IOs:   896 IOPS, 17ms average latency @ 112MB/sec
  256KB IOs:   448 IOPS, 35ms average latency @ 112MB/sec

This is running SQLIO from within a Server 2008 VM. What do you all think?

RParker
Immortal

versus 297 IOPS on 256KB sequential writes @ 74MBPS...

There you go, exactly what I expected. That's great performance. So we know it's not the VM, the iSCSI connection, the SAN, or the block size. So what is it?

That is the three-million-dollar question... If you can figure it out, you will be a genius. It's obviously not the machine, virtual or otherwise, because you have the numbers to prove the IO throughput is there.

RParker
Immortal

This is running SQLIO from within a Server 2008 VM. What do you all think?

Latency is a bit high; it should be more in the 12-15ms range, but that's not bad. Other than that, your SQL VM is capable of solid numbers.

J1mbo
Virtuoso

Some thoughts:

- What RAID level is the PS4000E running?

- What storage was in your old server?

- Are the PS4000 LUNs themselves thin-provisioned?

- What does SanHQ report for TCP retransmit for the period covering the test?

- Is flow-control enabled on all iSCSI switch ports?

- Is multipathing configured per the EqualLogic guide, with the IOPS=3 ESX iSCSI kernel port tweak (see here)? A command sketch follows this list.

- What does IOmeter show (see here)?

- Have you tried using the LSI SAS SCSI adapter instead of the paravirtual?
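For the IOPS=3 tweak mentioned above, the ESX 4.x commands look something like this; the naa device ID is a placeholder for your EQL volume, and note that with the EqualLogic MEM installed its own PSP normally takes over instead:

    esxcli nmp device setpolicy --device naa.6090a0xxxxxxxxxx --psp VMW_PSP_RR
    esxcli nmp roundrobin setconfig --device naa.6090a0xxxxxxxxxx --type iops --iops 3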

http://blog.peacon.co.uk

Please award points to any useful answer.

Tpgit
Contributor

RAID Level: RAID 50 w/ 14 disks, SATA II 7200 RPM

Old storage: RAID 5 with 4 disks, SAS 15K RPM

PS4000 LUNs: not thin-provisioned

TCP Retransmit: 0

Flow control is enabled switch-wide (these are two stacked PowerConnect 6224s).

ESXi iSCSI: not using the software initiator. We're using the Broadcom 5709 NICs as iSCSI HBAs, since they support iSCSI offload, with the Dell EqualLogic/VMware MEM.

SQLIO results are above; IOmeter showed similar results in earlier tests.

LSI SAS: we were using this and getting similar latency/response times; I moved to paravirtual in an effort to improve performance.

On a side note, I've got a second-tier case open with Dell and will be talking to VMware later today, so hopefully I'll be able to come back with some good answers.

I did try reducing the vCPUs to 1; performance was similar (a hair slower?), but I was seeing CPU peaks higher than I'd like, and I don't want this SQL server to be CPU-bound. Since the host currently has 8 cores available, I don't think there will be any scheduling conflicts with 2 vCPUs.

Does anyone have input on what I should see for average queue depth on the disks in the array? The average is 2, with peaks around 6 that match peaks in IOPS...

J1mbo
Virtuoso

What was the size of the SQLIO test file (the results look like fully cached IO)?

The thing about multi-spindle arrays is that performance only scales with concurrency. With Oracle I see results similar to yours: a local array will beat a PS4000XV. Any full table scan will obviously run much slower than it would from local storage, although hopefully your application won't be causing those.
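One way to see the concurrency effect on your own array is to repeat a run varying only the outstanding-IO depth, e.g. (same hypothetical paths and flags as your earlier runs):

    sqlio -kR -s120 -frandom -o1 -b8 -LS -BN -Fparam.txt
    sqlio -kR -s120 -frandom -o32 -b8 -LS -BN -Fparam.txt

With 14 spindles, the array only shows its full random IOPS once there are enough outstanding IOs to keep every disk busy.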

Not sure that helps though. :)

http://blog.peacon.co.uk

Please award points to any useful answer.

Tpgit
Contributor

Yeah, OK, so the size of the test file was 100MB =[ I was following a blog guide to using SQLIO for testing iSCSI performance. Guess neither of us was thinking about the cache! So I'm rerunning the tests with a 5GB file on a separate volume. I'm very curious to see the results.
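The only change for the rerun is the size field in the param file, so the test file dwarfs the array's cache (the path is just where I put the new volume):

    e:\sqlio\testfile.dat 2 0x0 5120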

I'm working with Dell to figure out whether the SAN is as fast as it could be. Will report back tomorrow.

Thanks for the help and ideas, everyone.

CraigD
Enthusiast

Wow! I read this thread with bated breath... couldn't wait to get to the answer. Any answer? Thanks!

gravesg
Enthusiast

In Google searches I'm starting to stumble upon more instances of EqualLogic array performance issues with large block sizes. What's going on here?

In my setup on a PS4100X, IOmeter tests perform extremely well when they are IO-bound. The moment I kick off a file copy with a >100KB block size, performance hits a wall. This only happens from Windows 2008 VMs. I've tried a multitude of troubleshooting steps with no resolution, and Dell hasn't been able to point me in the right direction.

Tried:

- Jumbo frames on and off
- Updated EQL firmware to 5.2.0
- Tried with the VM guest on different ESX hosts running 4.0 and 4.1
- Changed RAID type from RAID 50 to RAID 10
- Tried HW iSCSI over SW iSCSI
- Disabled/enabled the ESX delayed ACK setting
- Configured the Dell MEM PSP
- Updated ESX 4.1 to Update 2 and updated VMware Tools
- Updated the ESX NIC drivers
- Disabled LRO at the ESX level
- Changed the PVSCSI controller to LSI Logic and back

The one thing I notice in SANHQ is that when doing file copies from the Windows 2008 VMs, the EQL 1Gb network ports hit 80% utilization. The Windows 2003 VMs only generate enough traffic to push the link to 50%.

I need a smoking gun that says this issue can be fixed with a 10Gb array. In the end, why should a file copy cause an array bottleneck? *shrug*

DanRibar
Contributor

Aaron --

I am in the EXACT same place as you. I was reading and thought it was my own post :) as we seem to have the same environment.

Did you get to any resolution?

dwilliam62
Enthusiast

Good morning,

If you haven't already, please open a case with Dell/EqualLogic support. If you haven't installed SANHQ yet, please do; it's downloadable from the EqualLogic support site: www.equallogic.com

Since this original posting, support has found some common configuration changes that can improve performance.

1.) Update to the most current build of ESX (I've seen updated network drivers help).

2.) Disable Delayed ACK

3.) Disable Large Receive Offload

4.) If you are using VMDK/RDM disks, create a virtual SCSI adapter for each one.  (Up to 4 per VM)

5.) For SQL, Exchange, and SharePoint, consider using the MS iSCSI initiator connecting directly to an EQL volume vs. VMFS.

6.) Align the NTFS partitions (if not done by default), AND set the filesystem cluster size to 64K. This will match the 64K stripe size on the array (a format command sketch follows this list).

7.) Also, you can try the "Paravirtual" virtual SCSI adapter on the data VMDK disks or raw device mapped drives. On active disks, paravirtual can provide better performance.

8.) The switch infrastructure configuration should be reviewed as well.
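For item 6, a quick sketch of the relevant Windows commands, assuming E: is the SQL data drive (the format is destructive, so only run it on an empty volume):

    rem check the current cluster size ("Bytes Per Cluster") first:
    fsutil fsinfo ntfsinfo E:
    rem format with a 64K allocation unit size:
    format E: /FS:NTFS /A:64K /Q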

Hope this helps.

Regards,

Don

Nicbyte
Contributor

Hi, not sure if you have been sorted out yet; I only came across your post now. Based on your setup you only have about 280 raw IOPS available.

Let me explain: a 7200RPM disk gets around 80 IOPS. You have 14 of those (14 * 80 = 1120); divide that by 4 because of your RAID 50 write penalty and you get 280 usable write IOPS on your array. Of course cache helps, but only depending on how much your workload is requesting. Say your SQL environment requests 2000 IOPS: your disks won't keep up, your cache will fill, and you will start queuing on disk.

I would suggest looking at the Avg. Disk sec/Read and Avg. Disk sec/Write counters; these should be below 20ms (0.020). If you're above 20ms you have disk latency, so look at Disk Reads/sec and Disk Writes/sec to see what your read/write balance is, and spec your disks accordingly.
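If you want to watch those latency counters from the command line, typeperf works; this is a sketch with the counter names as they appear on Server 2008:

    typeperf "\PhysicalDisk(*)\Avg. Disk sec/Read" "\PhysicalDisk(*)\Avg. Disk sec/Write" -si 5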

dwilliam62
Enthusiast

FYI: there is now a document that covers best practices for ESX v4.x/5.x with EQL iSCSI storage.

It's available here:

EqualLogic Best Practices ESX

Regards,

Don
