Hi all,
I have installed 2 virtual machines across 2 ESX 4.1 hosts, configured with Microsoft Cluster Services.
These 2 virtual machines run Microsoft Windows Server 2008 R2 and are configured as a cluster for SQL Server 2008 R2.
We are using an HP EVA4400 SAN for all storage.
These machines have 2 network adapters (VMXNET3): one for the LAN, the other for the cluster heartbeat. Both are configured to use the same physical ports.
Our problem at the moment is that we are having throughput issues with all disks that are clustered.
I've been using the SQLIO.exe tool to test. The command:
Sqlio.exe -kW -s10 -fsequential -o8 -b8 -LS -Fparam.txt timeout /T 10
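For reference, SQLIO's -F parameter file uses the format <file path> <threads> <affinity mask> <file size in MB>. Judging from the output below, our param.txt presumably contains something like this (a reconstruction from the run output, not a copy of the actual file):

```
l:\testfile.dat 2 0x0 100
```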
If I run an I/O test on a non-clustered, RDM-attached disk, I get the following results:
C:\Program Files (x86)\SQLIO>sqlio -kW -s10 -fsequential -o8 -b8 -LS -Fparam.txt
timeout /T 10
sqlio v1.5.SG
using system counter for latency timings, 3579545 counts per second
parameter file used: param.txt
file l:\testfile.dat with 2 threads (0-1) using mask 0x0 (0)
2 threads writing for 10 secs to file l:\testfile.dat
using 8KB sequential IOs
enabling multiple I/Os per thread with 8 outstanding
using specified size: 100 MB for file: l:\testfile.dat
initialization done
CUMULATIVE DATA:
throughput metrics:
IOs/sec: 15642.80
MBs/sec: 122.20
latency metrics:
Min_Latency(ms): 0
Avg_Latency(ms): 0
Max_Latency(ms): 52
histogram:
ms: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24+
%: 60 38 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
If I then add the disk to the cluster, I get the following result:
C:\Program Files (x86)\SQLIO>sqlio -kW -s10 -fsequential -o8 -b8 -LS -Fparam.txt
timeout /T 10
sqlio v1.5.SG
using system counter for latency timings, 3579545 counts per second
parameter file used: param.txt
file l:\testfile.dat with 2 threads (0-1) using mask 0x0 (0)
2 threads writing for 10 secs to file l:\testfile.dat
using 8KB sequential IOs
enabling multiple I/Os per thread with 8 outstanding
using specified size: 100 MB for file: l:\testfile.dat
initialization done
CUMULATIVE DATA:
throughput metrics:
IOs/sec: 943.03
MBs/sec: 7.36
latency metrics:
Min_Latency(ms): 0
Avg_Latency(ms): 16
Max_Latency(ms): 1056
histogram:
ms: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24+
%: 63 34 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2
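As a quick sanity check on the two runs above (my own arithmetic, not from the thread): the MBs/sec figure should simply be IOs/sec times the 8 KB block size, and both runs are internally consistent. The gap works out to roughly a 17x slowdown once the disk is clustered:

```python
# Sanity check: sqlio's MBs/sec should equal IOs/sec * block size.
# IOs/sec figures are taken from the two runs pasted above.

def mb_per_sec(ios_per_sec: float, block_kb: int = 8) -> float:
    """Convert an IOs/sec rate at a given block size (KB) to MB/s."""
    return ios_per_sec * block_kb / 1024

unclustered = mb_per_sec(15642.80)  # first run (non-clustered RDM)
clustered = mb_per_sec(943.03)      # second run (disk added to the cluster)

print(f"non-clustered: {unclustered:.2f} MB/s")  # ~122.21, matches 122.20
print(f"clustered:     {clustered:.2f} MB/s")    # ~7.37, matches 7.36
print(f"slowdown:      {unclustered / clustered:.0f}x")
```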
If we run this tool on all other servers we get adequate results; it only seems to happen when the disks are clustered.
Can anyone shed any light on this?
A couple of quick thoughts:
--> ensure you're not using RR (Round Robin) as the load balancing policy for your shared RDMs.
--> ensure you're using LSI Logic SAS as your SCSI adapter.
reference and some additional guidelines: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=103795...
Hi, thanks for your help.
When you say "ensure you're not using RR as the load balancing policy for your shared RDMs", where is this configured? On the SAN or the VM?
It can be configured from
vCenter --> select host --> Configuration --> Storage Adapters --> select HBA --> scroll down and right-click the RDM LUN --> Manage Paths --> Path Selection.
(this needs to be done for the LUNs used by MSCS VMs and on all the hosts in the cluster)
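If you prefer the command line, the same check and change can be made from each host's service console. A rough sketch for the ESX 4.x esxcli syntax (the naa ID below is a placeholder; find your actual EVA LUN's device ID with the list command first):

```shell
# List NMP devices with their current path selection policy (ESX 4.x syntax)
esxcli nmp device list

# Set the policy to Most Recently Used for one RDM LUN.
# The naa.* ID is a placeholder -- substitute your LUN's real device ID.
esxcli nmp device setpolicy --device naa.600508b4000971fa0000400001660000 --psp VMW_PSP_MRU
```

This needs to be repeated for each MSCS RDM LUN, on every host in the cluster, just like the vCenter method.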
There is also a better description of the different multipath policies than I could provide here: http://kb.vmware.com/kb/1011340
Just had a look. They are all set to Round Robin. What should they be set to? Will changing them affect anything else?
RR is not supported; use either MRU or Fixed.
EVAs are ALUA-aware, therefore you have the option to use RR, MRU or FIXED_AP.
Since RR is not supported with MSCS VMs, the next preferred policy is MRU.
Take note that FIXED in vSphere 4.x is not ALUA-aware and is not recommended for EVA arrays.
Update:
Regarding impact: you may want to change it only for the LUNs used by the MSCS VMs.
There should not be any impact to other VMs.
good luck and let us know how it goes...
Good news. I changed the policy to MRU and we instantly saw a difference in throughput. Looks like it's fixed our issue.
I have one more question. Can I change these policies on the RDMs that are attached to my live clustered servers without affecting them?
Thanks for your help.
Matt
Glad to hear that!
If you are using a cluster-across-boxes setup, it may be safer to change the RDM path policy first on the host running the passive cluster node.
Once done, you can fail the resources over onto that node and then modify the policy on the remaining host.
Thanks, it's all working.
Great!! And thanks for updating us with the results.