VMware Cloud Community
jlaco
Enthusiast

New VMware, HP LeftHand SAN Environment SQL IO Slow

Environment:

ESXi 4.1 (HP version with mgmt agents for SIM), embedded (on flash)
Advanced License
ESXi hosts
3 x DL360 G7
No local storage
Gigabit NICs
Jumbo frames enabled on iSCSI vSwitches and kernel ports
Flow control enabled
Round Robin multipathing
IOPS = 1 instead of 1000 on all devices (recommended by HP engineer; see the verification sketch after this list)
Route based on IP hash
1 vSwitch with 2 teamed NICs for VM\Mgmt, 1000Mb full, one network
1 vSwitch with 2 teamed NICs for vMotion, 1000Mb full, one network
1 vSwitch with 4 teamed NICs for iSCSI, 1000Mb full, one network
All datastores are VMFS
4 datastores are 1TB
All others are below 1TB
There is at least 20% free in each datastore
Each datastore shows 4 paths
Switches
2 x Cisco 3750
Flow control and jumbo frames enabled
VLan for iSCSI
VLan for vMotion
VLan for VM\MGMT
SAN
4 HP LeftHand P4500 storage nodes
ALB bonding
Jumbo frames enabled
Not sure about queue size
All SQL data\log file and backup volumes are Network RAID 10, except for the OS drive volumes
VM
OS:  Windows 2003 Enterprise R2 x64
All drives except the system drive are 64 KB aligned
We are not using any iSCSI initiator within the VM
One four-core CPU (used an advanced option to achieve this)
20GB of memory
VMXNET3 NIC
Application on VM:   SQL Server 2005
Drives on VM:
System,                       C:\, 100GB,  resides on SAN RAID 5 volume via VMFS datastore
DB Transaction Log,           F:\, 100GB,  resides on SAN RAID 10 volume via VMFS datastore
DB Transaction Log,           G:\, 100GB,  resides on SAN RAID 10 volume via VMFS datastore
DB Transaction Log,           J:\, 130GB,  resides on SAN RAID 10 volume via VMFS datastore
DB Datafile,                  I:\, 1000GB, resides on SAN RAID 10 volume via VMFS datastore
DB Datafile,                  M:\, 1000GB, resides on SAN RAID 10 volume via VMFS datastore
DB Datafile,                  U:\, 1000GB, resides on SAN RAID 10 volume via VMFS datastore ------ Heaviest hit
Temp DB,                      T:\, 250GB,  resides on SAN RAID 10 volume via VMFS datastore
Backup to Disk (SQL backups), H:\, 1000GB, resides on SAN RAID 10 volume via VMFS datastore
Restore Drive,                N:\, 250GB,  resides on SAN RAID 10 volume via VMFS datastore
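For anyone who wants to double-check the same settings on their hosts, something along these lines from the ESXi console should confirm jumbo frames end to end and the per-device round robin IOPS value (the device ID is just one of ours as an example, and exact esxcli syntax can vary by build):

# confirm a 9000-byte MTU survives end to end (9000 minus 28 bytes of IP/ICMP headers)
vmkping -d -s 8972 <iSCSI target / cluster VIP address>

# list devices with their path selection policy
esxcli nmp device list

# show the round robin configuration (type and IOPS) for a single device
esxcli nmp roundrobin getconfig --device naa.6000eb3891805a5b00000000000000e6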
Issue:
We have a brand new VMware\SAN environment and we are experiencing slow IO. From an OS point of view the VM itself zips along, but some (not all) SQL queries that had low wait times in the physical environment now have very high wait times, and we see many PAGEIOLATCH_SH wait states in SQL. I also ran the PAL tool with the SQL threshold file against the performance logs for the new VM DB server, and it flags entries like "Greater than 25ms logical disk READ response times" and "Greater than 25ms logical disk WRITE response times". There are also a couple of "Greater than 900ms - Slower than a 3.5 inch floppy drive" alerts for the two heaviest-used SQL data file drives. We also see slow read/write times when using IOmeter for sequential and random reads/writes (I can get numbers if needed).

We are getting along, but barely, and I really think we should be getting more IO performance out of this setup. The other thing that concerns me is that the SQL VM is the only VM right now, and we will be adding more VMs (low resource consumption and basically no IO demand). I also have our DBAs looking at queries, stored procs, and blocking to squeeze out performance from that end.
Before moving into production, I engaged an HP consultant to analyze our environment, and the main thing he suggested was changing the IOPS setting from 1000 to 1. We saw an increase in performance, but not where we need to be. We tried changing the bond type on the SAN to 802.3ad, but it was worse. We also paid another "VMware expert IT services" company to analyze everything, and the IO is still below average. Let me know if you need more info... I welcome any and all suggestions\feedback! Thanks!
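In case it helps anyone reproduce the PAL analysis, the disk latency counters it flags can be collected inside the guest with something roughly like this (the counter set name and output path are just examples):

rem create a counter log with the logical disk latency counters, sampled every 15 seconds
logman create counter SQLDiskLatency -c "\LogicalDisk(*)\Avg. Disk sec/Read" "\LogicalDisk(*)\Avg. Disk sec/Write" -si 15 -o C:\PerfLogs\SQLDiskLatency

rem start collecting; stop it later with "logman stop SQLDiskLatency" and feed the resulting log to PAL
logman start SQLDiskLatency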
vmroyale
Immortal

Hello and welcome to the forums.

Note: This discussion was moved from the VMware ESXi 4 community to the VMware vSphere Storage community.

Good Luck!

Brian Atkinson | vExpert | VMTN Moderator | Author of "VCP5-DCV VMware Certified Professional-Data Center Virtualization on vSphere 5.5 Study Guide: VCP-550" | @vmroyale | http://vmroyale.com
jlaco
Enthusiast

I thought I would have more replies on this! But I guess it has been beaten to death, or maybe my posting style is no good... Anyway, I was looking at some different things to try to find an issue with my storage config, and I noticed something in the esx.conf that seemed weird to me.

/storage/plugin/NMP/device[naa.6000eb3891805a5b00000000000000e6]/psp = "VMW_PSP_RR"
/storage/plugin/NMP/device[naa.6000eb3891805a5b00000000000000e6]/rrIops = "1"
/storage/plugin/NMP/device[naa.6000eb3891805a5b00000000000000e8]/psp = "VMW_PSP_RR"
/storage/plugin/NMP/device[naa.6000eb3891805a5b00000000000000e8]/rrIops = "1"
/storage/plugin/NMP/device[naa.6000eb3891805a5b00000000000000ea]/psp = "VMW_PSP_RR"
/storage/plugin/NMP/device[naa.6000eb3891805a5b00000000000000ea]/rrIops = "1"
/storage/plugin/NMP/device[naa.6000eb3891805a5b00000000000000f0]/psp = "VMW_PSP_RR"
/storage/plugin/NMP/device[naa.6000eb3891805a5b00000000000000f0]/rrIops = "1"
/storage/plugin/NMP/device[naa.6000eb3891805a5b00000000000000f3]/psp = "VMW_PSP_RR"
/storage/plugin/NMP/device[naa.6000eb3891805a5b00000000000000f3]/rrIops = "1"
/storage/plugin/NMP/device[naa.6000eb3891805a5b00000000000000f6]/psp = "VMW_PSP_RR"
/storage/plugin/NMP/device[naa.6000eb3891805a5b00000000000000f6]/rrIops = "1"
/storage/plugin/NMP/device[naa.6000eb3891805a5b0000000000000126]/psp = "VMW_PSP_RR"
/storage/plugin/NMP/device[naa.6000eb3891805a5b0000000000000126]/rrIops = "1"
/storage/plugin/NMP/device[naa.6000eb3891805a5b0000000000000138]/psp = "VMW_PSP_RR"
/storage/plugin/NMP/device[naa.6000eb3891805a5b0000000000000138]/rrIops = "1"
/storage/plugin/NMP/device[naa.6000eb3891805a5b000000000000013b]/psp = "VMW_PSP_RR"
/storage/plugin/NMP/device[naa.6000eb3891805a5b000000000000013b]/rrIops = "1"
/storage/plugin/NMP/device[naa.6000eb3891805a5b000000000000013e]/preferred = "iqn.1998-01.com.vmware:localhost-5c1cc3b0-00023d000001,iqn.2003-10.com.lefthandnetworks:companynamehere:318:backuptodisk,t,1-naa.6000eb3891805a5b000000000000013e"
/storage/plugin/NMP/device[naa.6000eb3891805a5b000000000000013e]/psp = "VMW_PSP_RR"
/storage/plugin/NMP/device[naa.6000eb3891805a5b000000000000013e]/rrIops = "1"
/storage/plugin/NMP/device[naa.6000eb3891805a5b0000000000000141]/psp = "VMW_PSP_RR"
/storage/plugin/NMP/device[naa.6000eb3891805a5b0000000000000141]/rrIops = "1"
/storage/plugin/NMP/device[naa.6000eb3891805a5b0000000000000223]/psp = "VMW_PSP_RR"
/storage/plugin/NMP/device[naa.6000eb3891805a5b0000000000000223]/rrIops = "1"
/storage/plugin/NMP/device[naa.6000eb3891805a5b00000000000002d8]/preferred = "iqn.1998-01.com.vmware:localhost-5c1cc3b0-00023d000004,iqn.2003-10.com.lefthandnetworks:companynamehere:728:backuptodiskdb,t,1-naa.6000eb3891805a5b00000000000002d8"
/storage/plugin/NMP/device[naa.6000eb3891805a5b00000000000002d8]/psp = "VMW_PSP_RR"
/storage/plugin/NMP/device[naa.6000eb3891805a5b00000000000002db]/psp = "VMW_PSP_RR"
/storage/plugin/NMP/device[naa.6000eb3891805a5b00000000000002de]/psp = "VMW_PSP_RR"
/storage/plugin/NMP/device[naa.6000eb3891805a5b00000000000002e0]/psp = "VMW_PSP_RR"
/storage/plugin/NMP/device[naa.6000eb3891805a5b00000000000002e2]/psp = "VMW_PSP_RR"
/storage/plugin/NMP/device[naa.6000eb3891805a5b00000000000002e4]/psp = "VMW_PSP_RR"
/storage/plugin/NMP/device[naa.6000eb3891805a5b00000000000002e6]/psp = "VMW_PSP_RR"
/storage/plugin/NMP/device[naa.6000eb3891805a5b0000000000000772]/psp = "VMW_PSP_RR"
/storage/plugin/NMP/device[naa.6000eb3891805a5b0000000000000772]/rrIops = "1"
/storage/swIscsi/enabled = "true"

The lines in green represent devices that are in fact present on each host in my 3-host cluster. The one in red is not. Could it be possible that ESXi is trying to look at that device somehow, even though it doesn't exist when I look in vCenter? Also, should I have any "preferred" lines in the file at all, since I am using Round Robin as my PSP?
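For reference, the devices each host actually sees, along with their PSP and round robin settings, can be dumped from the console with something like this (4.1 syntax, as far as I know), which should make it easy to compare against the esx.conf entries above:

# all SCSI devices the host currently sees
esxcli corestorage device list

# NMP view: path selection policy and round robin settings per device
esxcli nmp device list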

Thanks in advance for any insight or feedback you can give!

heldstma
Enthusiast

We have recently purchased an 8 node Lefthand P4500 SAN.  We configured our environment similarly to yours with a few exceptions:

* We decided not to use jumbo frames; we would have if our SAN fabric were 10Gb Ethernet

* We left the IOPS path changeover setting at 1000 IOPS instead of changing it to 1 as you have (there is a sketch of how to flip it at the end of this post)

Duncan Epping (Yellow-Bricks blog) talked about the IOPS=1 setting on a blog entry about 17 months ago at

http://www.yellow-bricks.com/2010/03/30/whats-the-point-of-setting-iops1/

* We left the default iSCSI Port Group settings alone and did not set a load balancing policy exception for IP Hash as you did.

* We are thinking about moving our SQL server to the Lefthand (it is on a Proliant server using DAS right now), but haven't done it yet.

The HP "Assessing Performance in LeftHand SANs" whitepaper talks about identifying bottlenecks in MS SQL environments on page 6 using the Microsoft SQLIOStress application.

HP's White Paper is at:  http://h20195.www2.hp.com/v2/GetPDF.aspx/c01770507.pdf
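If you want to A/B the path changeover value yourself, I believe it can be flipped per device from the console with something like the following (ESX/ESXi 4.x syntax, device ID is just an example taken from your esx.conf output; I have not tried this against a P4500 myself):

# set the round robin changeover back to the default 1000 IOPS for one device
esxcli nmp roundrobin setconfig --device naa.6000eb3891805a5b00000000000000e6 --type "iops" --iops 1000

# verify the change
esxcli nmp roundrobin getconfig --device naa.6000eb3891805a5b00000000000000e6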

Josh26
Virtuoso

"4 teamed NICs for iSCSI" doesn't sound like Multipathing.

http://fojta.wordpress.com/2010/04/13/iscsi-and-esxi-multipathing-and-jumbo-frames/

Have a look there.

The "route by IP hash" should not be relevant after following that guide.

logiboy123
Expert

The following documents may be of some assistance to you.

I would have thought that 2 x 1Gb NICs would have been enough for your storage network per host.

Configuring LeftHand with vSphere implementation document:

http://www.scribd.com/doc/24586958/Configuring-Left-Hand-ISCSI-and-VSPHERE-MPIO

Multi-vendor iSCSI SAN with vSphere implementation document:

http://virtualgeek.typepad.com/virtual_geek/2009/09/a-multivendor-post-on-using-iscsi-with-vmware-vs...

heldstma
Enthusiast

I guess I assumed that when you said 4 teamed NICs for iSCSI, you meant 4 pNICs on 1 vSwitch, but with 4 VMkernel port groups, each running 1 active and 3 unused... alternating through all 4 port groups?

Is that correct?

Matt

jrush1
Contributor

Hi Matt.  Thanks for your response and sorry for my delay.  Yes, the iSCSI vSwitch has 4 vmnics and 4 vmks.  Each vmk has 1 vmnic listed as active in teaming, and all other NICs for that vmk are listed as unused.  If we look at the next vmk, another vmnic is listed as active and the others are unused, and so on, so it's a 1-to-1 relationship there.  We are using the Round Robin PSP with IOPS set to 1, so after each IO it moves on to the next path.  I have tried some different IOPS values to see what would yield the best results, and they all stay around 117MB/sec.  I had a Cisco guy in yesterday to confirm the networking is correct.  I have a support request open with VMware and this is what they said.

"Hello Jamey,

Thank you for your Support Request. In summary of our findings during the WebEx troubleshooting session: the 114 or so MBps you are getting with that test is to be expected, even with Round Robin load balancing. This is a multipathing technology to load balance multiple streams of IO, but it does not increase the throughput of any SINGLE stream of IO. Any single read/write stream is still only going to move as fast as the line it is on. In this case, that line is 1Gbps, which equals out to about what you are getting (~120MBps).
A true IO increase through multipathing cannot be achieved through the technology we have implemented in ESX, but there are third-party solutions. I do not know of any for the LeftHand SAN, but you could look into it. Another option, and most likely the more viable one, would be to check whether the SAN can handle 10Gb NIC connections and upgrade to 10Gb NICs and connections to the SAN.

Please let me know if you need any further clarification regarding this matter. Thank you for utilizing VMware Technical Support!"

Here is my response to their determination.......

"Hi and thanks for your response.  I have a couple questions.

First, I want to make sure I understand correctly... There is no way I can get more than 112MB/sec throughput when I am using iSCSI with 1Gb NICs in ESXi, no matter what version or configuration, right?

Questions:
- Is this a limitation of my version of ESXi 4.1 (Advanced)?

- What guest OS iSCSI initiators are supported by VMware? MS iSCSI initiator, HP DSM MPIO, etc.? I am using an HP P4500 G2 SAN.

- If I use an iSCSI initiator within the guest OS to get past the 112MB/sec, can I still use vMotion, HA and DRS?

- Using a third-party, guest OS based iSCSI initiator for local VM disks for a SQL server, will I be able to get more than the 112MB/sec throughput cap I get via native multipathing in ESXi?


Thanks in advance for your help."

Does anyone have any other info or input? If we can't get over the 117MB/sec throughput, we are going to look at a higher-powered solution (10Gb/Fibre Channel).
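For what it's worth, the cap VMware quoted does line up with simple line-rate math:

1Gb/sec = 1000Mb/sec ÷ 8 bits per byte = 125MB/sec theoretical per path
minus iSCSI/TCP/IP protocol overhead (roughly 5-10%) ≈ 112-118MB/sec
which matches the ~114-117MB/sec we see on any single stream/LUN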

Thanks again for your input and info!

jrush1
Contributor

I wanted to follow this up with a status post.  After reviewing the original resource utilization report that the vendor used to base their virtual infrastructure hardware recommendation on, we found that they had listed the peak IO throughput as 14MB/sec.  I have perfmon logs going back a year that show our IO profile as having spikes of up to 400MB/sec during regular hours and up to 900MB/sec during backups.  So all the hardware they recommended, 4 HP LeftHand P4500 nodes (12 x 600GB 15K SAS disks and two 1Gb NICs each), 1Gb switches, and 1Gb NICs in the hosts, was well under what we need for acceptable performance in a virtual environment.

A VMware engineer elaborated on how Round Robin works and told me that the IO throughput is limited to the 1Gb NIC PER LUN, even though we have 6 NICs in each host.

My Tests:

-I used SQLIO to do a sequential read against one disk sitting on one LUN that was presented to the VM through ESXi.  The throughput was around 130MB/sec.

-I then used SQLIO to do the same sequential read test on a disk that was composed of many volumes and many disks (a command-line equivalent is sketched below the results).  Here is what I did to test this:

     1.  Create ten RAID 10 volumes on the LeftHand SAN

     2.  Create the datastores

     3.  Create a virtual disk on each datastore

     4.  Go into the VM and use Windows Disk Manager to create a striped volume over all 10 disks

The most disks I added to the striped volume was 20, and I saw 490MB/sec.  I would never have production SQL Server data/logs sitting on a volume like this; it was strictly for proof of concept.
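For anyone who wants to repeat the POC, the command-line equivalent of the Disk Manager steps above, plus the SQLIO run, would look roughly like this (disk numbers, the drive letter, and the test file are illustrative, and the test file needs to exist before SQLIO runs):

rem inside diskpart: convert each of the ten SAN-backed disks to dynamic, then stripe them
select disk 1
convert dynamic
rem ...repeat "select disk N" / "convert dynamic" for disks 2 through 10...
create volume stripe disk=1,2,3,4,5,6,7,8,9,10
assign letter=S
exit

rem back at the command prompt: format the stripe with a 64KB allocation unit
format S: /FS:NTFS /A:64K /Q

rem SQLIO sequential read: 120 seconds, 64KB IOs, 8 outstanding per thread, 4 threads, report latency
sqlio -kR -fsequential -s120 -b64 -o8 -t4 -LS S:\testfile.dat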

Long story long, after the vendor saw our real numbers, they came back and said they do not recommend putting our SQL Server in a virtual environment.  They now would like to cluster it over two physical hosts and use the LeftHand DSM for MPIO to get the IO.  I believe it is possible for sure, but it would require a more powerful storage network across the SAN, switches and hosts.

It has been a journey and I have learned a TON!  All of this started from a little mistake on a little IO profile report!
