Hi all,
I have installed 2 virtual machines across 2 ESX 4.1 hosts, configured with Microsoft Cluster Services.
These 2 virtual machines run Microsoft Windows Server 2008 R2 and are configured as a cluster for SQL Server 2008 R2.
We are using an HP EVA4400 SAN for all storage.
These machines have 2 network adapters (VMXNET3): one for the LAN, the other for the cluster heartbeat. Both are configured to use the same physical ports.
Our problem at the moment is that we are having throughput issues with all disks that are clustered.
I've been using the SQLIO.exe tool to test. The command:
Sqlio.exe -kW -s10 -fsequential -o8 -b8 -LS -Fparam.txt timeout /T 10
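For reference, SQLIO's -F parameter file uses the format <file path> <threads> <affinity mask> <file size in MB>. Judging from the output below, our param.txt presumably contains something like this (a reconstruction from the run output, not a copy of the actual file):

```
l:\testfile.dat 2 0x0 100
```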
If I run an I/O test on a non-clustered, RDM-attached disk, I get the following results:
C:\Program Files (x86)\SQLIO>sqlio -kW -s10 -fsequential -o8 -b8 -LS -Fparam.txt
timeout /T 10
sqlio v1.5.SG
using system counter for latency timings, 3579545 counts per second
parameter file used: param.txt
file l:\testfile.dat with 2 threads (0-1) using mask 0x0 (0)
2 threads writing for 10 secs to file l:\testfile.dat
using 8KB sequential IOs
enabling multiple I/Os per thread with 8 outstanding
using specified size: 100 MB for file: l:\testfile.dat
initialization done
CUMULATIVE DATA:
throughput metrics:
IOs/sec: 15642.80
MBs/sec: 122.20
latency metrics:
Min_Latency(ms): 0
Avg_Latency(ms): 0
Max_Latency(ms): 52
histogram:
ms: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24+
%: 60 38 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
If I then add the disk to the cluster, I get the following result:
C:\Program Files (x86)\SQLIO>sqlio -kW -s10 -fsequential -o8 -b8 -LS -Fparam.txt
timeout /T 10
sqlio v1.5.SG
using system counter for latency timings, 3579545 counts per second
parameter file used: param.txt
file l:\testfile.dat with 2 threads (0-1) using mask 0x0 (0)
2 threads writing for 10 secs to file l:\testfile.dat
using 8KB sequential IOs
enabling multiple I/Os per thread with 8 outstanding
using specified size: 100 MB for file: l:\testfile.dat
initialization done
CUMULATIVE DATA:
throughput metrics:
IOs/sec: 943.03
MBs/sec: 7.36
latency metrics:
Min_Latency(ms): 0
Avg_Latency(ms): 16
Max_Latency(ms): 1056
histogram:
ms: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24+
%: 63 34 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2
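As a quick sanity check on the two runs above (my own arithmetic, not from the thread): the MBs/sec figure should simply be IOs/sec times the 8 KB block size, and both runs are internally consistent. The gap works out to roughly a 17x slowdown once the disk is clustered:

```python
# Sanity check: sqlio's MBs/sec should equal IOs/sec * block size.
# IOs/sec figures are taken from the two runs pasted above.

def mb_per_sec(ios_per_sec: float, block_kb: int = 8) -> float:
    """Convert an IOs/sec rate at a given block size (KB) to MB/s."""
    return ios_per_sec * block_kb / 1024

unclustered = mb_per_sec(15642.80)  # first run (non-clustered RDM)
clustered = mb_per_sec(943.03)      # second run (disk added to the cluster)

print(f"non-clustered: {unclustered:.2f} MB/s")  # ~122.21, matches 122.20
print(f"clustered:     {clustered:.2f} MB/s")    # ~7.37, matches 7.36
print(f"slowdown:      {unclustered / clustered:.0f}x")
```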
If we run this tool on all other servers we get adequate results; it only seems to happen when the disks are clustered.
Can anyone shed any light on this?
A couple of quick thoughts:
--> ensure you're not using RR (Round Robin) as the load balancing policy for your shared RDMs.
--> ensure you're using LSI Logic SAS as your SCSI adapter.
reference and some additional guidelines: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=103795...
Hi, thanks for your help.
When you say "ensure you're not using RR as the load balancing policy for your shared RDMs", where is this configured? On the SAN or the VM?
It can be configured from
vCenter --> select host --> Configuration --> Storage Adapters --> select HBA --> scroll down and right-click the RDM LUN --> Manage Paths --> Path Selection.
(this needs to be done for the LUNs used by MSCS VMs and on all the hosts in the cluster)
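If you prefer the command line, the same check and change can be made from each host's service console. A rough sketch for the ESX 4.x esxcli syntax (the naa ID below is a placeholder; find your actual EVA LUN's device ID with the list command first):

```shell
# List NMP devices with their current path selection policy (ESX 4.x syntax)
esxcli nmp device list

# Set the policy to Most Recently Used for one RDM LUN.
# The naa.* ID is a placeholder -- substitute your LUN's real device ID.
esxcli nmp device setpolicy --device naa.600508b4000971fa0000400001660000 --psp VMW_PSP_MRU
```

This needs to be repeated for each MSCS RDM LUN, on every host in the cluster, just like the vCenter method.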
There is also a better description of the different multipath policies than I could provide here: http://kb.vmware.com/kb/1011340
Just had a look. They are all set to Round Robin. What should they be set to? Will changing them affect anything else?
RR is not supported; use either MRU or Fixed.
EVAs are ALUA-aware, therefore you have the option to use RR, MRU or FIXED_AP.
Since RR is not supported with MSCS VMs, the next preferred policy is MRU.
Take note that FIXED in vSphere 4.x is not ALUA-aware and is not recommended for EVA arrays.
Update:
Regarding impact: you may want to change it only for the LUNs used by the MSCS VMs.
There should not be any impact to other VMs.
good luck and let us know how it goes...
Good news. I changed the policy to MRU and we instantly saw a difference in throughput. Looks like it's fixed our issue.
I have one more question. Can I change these policies on the RDMs that are attached to my live clustered servers without affecting them?
Thanks for your help.
Matt
Glad to hear that!
If you are using a cluster-across-boxes setup, it may be safer to change the RDM path policy first on the host running the passive cluster node.
Once done, you can fail the resources over onto that node and then modify the policy on the remaining host.
Thanks, it's all working.
Great!! And thanks for updating us with the results.