VMware Cloud Community
FunkyD
Contributor

Tracking down a performance issue.

I have the following setup:

3 x Dell PowerEdge 2950, 16GB memory

1 x Dell PowerVault AX150 - two storage groups; SG2 has 4 drives and SG1 has 7 drives plus an HSP (hot spare).

SG1 has 7 LUNs of 250-600 GB

SG2 has 2 LUNs of 500 and 300 GB (some more LUNs are there for storing ISO files and other static data)

2 x Brocade 200E fibre switches

My virtual servers have been sluggish lately and I'm having trouble identifying the issue.

CPU seems fine - none of the machines max out the hosts.

Memory seems fine - all virtual servers have enough memory and there is no ballooning.

Disk - this is where things are interesting.

I've logged on to both fibre switches to check the stats. One switch shows an aggregate throughput of 450 Kbps Tx/Rx and the other shows about double. This is when things are more or less idle, e.g. at the weekend when nobody is using the systems.

As a starting point I loaded Iometer onto a physical server that is used for backups and is connected to the SAN. I ran one test (to simulate SQL, which is the most demanding of the servers) for 900 seconds: 16K transfers, 67% read, 100% random, and got:

256 IOPS

2.7 MB/s read

15 ms average response time

334 ms maximum I/O response time

On the switch the graphs are showing me an aggregate throughput of 5.7 MB/s.

These figures seem really poor to me.

I can't see why I'm getting such poor results.
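
As a sanity check, the throughput figure is really just IOPS x block size, so a small-block random test will always look low in MB/s terms - the real question is whether 256 IOPS is reasonable for the spindles behind it. A rough check using only the figures quoted above:

```python
# Sanity check: do the Iometer throughput and IOPS figures agree?
# All inputs come from the test results above; nothing else is assumed.
block_size_kb = 16        # Iometer transfer size
total_iops = 256          # measured IOPS
read_fraction = 0.67      # access spec: 67% read, 100% random

total_mb_s = total_iops * block_size_kb / 1024   # total throughput in MB/s
read_mb_s = total_mb_s * read_fraction           # read portion only

print(f"total: {total_mb_s:.1f} MB/s")   # ~4.0 MB/s
print(f"read:  {read_mb_s:.1f} MB/s")    # ~2.7 MB/s, matches the Iometer result
```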

Virtual servers perform equally poorly. When running esxtop I don't see any queues or anything to indicate a problem - just low transfer rates. The SAN seems ok, no errors and the cache is enabled.

Is this the limit of my storage? I can't find any other explanation. If it were poor cabling I'd expect to see errors on the switches.

Why aren't any queues forming on the LUNs? My vCenter disk graphs are practically flat for all the hosts, with almost no latency. I can't understand why I can't get more than a few MB/s of throughput.

I'd appreciate some guidance to try and get to the bottom of the matter - I'd particularly like to know if there are any better tests to perform.

Cheers.

Edit: I ran the All-in-One Iometer test on my virtual SQL box and the switches max out at 32 MB/s Tx/Rx throughput whilst preparing the drives. Still seems a bit poor to me?

8 Replies
AndreTheGiant
Immortal

Can you give more info on the SG layout and RAID type?

Is SG2 composed of the first 4 disks (the ones that also contain the storage system OS)?

Which SP is each LUN assigned to? Have you assigned all LUNs from one SG to the same SP?

Andre

Andrew | http://about.me/amauro | http://vinfrastructure.it/ | @Andrea_Mauro
FunkyD
Contributor

Hi,

Thanks for getting back to me.

Both storage groups are RAID 5

Disk 7 is the hot spare.

SG1 has the following LUNs on disks 4, 5, 6, 8, 9, 10 and 11:

LUN1 - 500GB - SPA

LUN3 - 250GB - SPA

LUN4 - 600GB - Physical Backup server - SPA

LUN5 - 500GB - SPA

LUN6 - 250GB - SPA

VD1 - 250GB - SPA

SG2 has the following LUNs on disks 0, 1, 2 and 3, which also contain the OS:

LUN2 - 250GB - SPA

LUN7 - 300GB - SPA

Software - 100GB - SPA

As you can see, all LUNs are accessed through SPA. As I understand it the AX150 is active/passive, so you can't use more than one SP at a time?

Thanks!

Edit: I've just spotted that Disk Pool 1 is showing as fragmented. I've read that with virtualisation on a SAN you don't need to worry about this so much, and many people seem to advise against defragmenting.

AndreTheGiant
Immortal

RAID5 is not optimal for write I/O... so consider changing a group to RAID10 to get more performance.
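
As a rough illustration of the write penalty (the per-spindle IOPS and disk count below are assumptions for a 7.2k SATA group like SG1, not measurements from your array):

```python
# Rough effect of the RAID write penalty on a 67% read / 33% write mix.
# Assumption: ~80 random IOPS per 7.2k SATA spindle, 7 spindles in the group.
spindles = 7
iops_per_spindle = 80
read_fraction, write_fraction = 0.67, 0.33

raw_iops = spindles * iops_per_spindle   # ~560 back-end IOPS

def effective_iops(write_penalty):
    # Each host write costs `write_penalty` back-end I/Os:
    # 4 for RAID5 (read-modify-write), 2 for RAID10 (mirrored write).
    return raw_iops / (read_fraction + write_fraction * write_penalty)

print(f"RAID5  (penalty 4): {effective_iops(4):.0f} IOPS")   # ~280
print(f"RAID10 (penalty 2): {effective_iops(2):.0f} IOPS")   # ~420
```

The RAID5 figure is in the same region as the 256 IOPS you measured, which is why the array looks slow even though nothing is broken.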

The AX is active/passive, but only per LUN... different LUNs can be owned by different SPs.

Andre

Andrew | http://about.me/amauro | http://vinfrastructure.it/ | @Andrea_Mauro
FunkyD
Contributor

Ok, so perhaps if I balance LUNs across the SPs so each is doing roughly an equal amount of work, I might get an improvement?

I know about RAID5 - I couldn't afford to lose the capacity with RAID10, so I'm kind of stuck with it.

What sort of throughput should I be able to get from the AX150?

Many thanks.

AndreTheGiant
Immortal

If you give one RG to one SP and the other RG to the second SP, you give each group of physical disks its own read cache.

In this way you improve read speed (the write cache is shared across the SPs... so writes may also improve, but only because you are using different paths).

Andre

Andrew | http://about.me/amauro | http://vinfrastructure.it/ | @Andrea_Mauro
FunkyD
Contributor

Ok - I'm moving SG2 to SPB.
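
For reference, this is the ownership I'm aiming for after the change - each storage group's LUNs kept together on one SP (LUN names from my earlier post):

```python
# Proposed SP ownership: keep each RAID group's LUNs together on one SP,
# so each group of spindles gets its own SP (and read cache), per Andre's point.
proposed = {
    "SPA": ["LUN1", "LUN3", "LUN4", "LUN5", "LUN6", "VD1"],   # SG1 (disks 4-6, 8-11)
    "SPB": ["LUN2", "LUN7", "Software"],                      # SG2 (disks 0-3)
}

for sp, luns in proposed.items():
    print(sp, "->", ", ".join(luns))
```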

With regard to paths, should I be setting preferred paths on each ESX host so that traffic is balanced through my switches, i.e. one switch preferred for LUNs on SPA and the other for LUNs on SPB?

I was checking the SAN Config guide from VMware and noted that they recommend Disk.UseDeviceReset = 0 on all hosts. On mine it is set to 1 on all hosts. I figure I should change this as I'm not using SCSI?

Many thanks again.

AndreTheGiant
Immortal

Remember to leave the default multipath policy (MRU).

So you have iSCSI? Have you configured jumbo frames? (On 3.5 they are experimental, but they can give more performance.)

Andre

Andrew | http://about.me/amauro | http://vinfrastructure.it/ | @Andrea_Mauro
FunkyD
Contributor

No, I don't have iSCSI. Does that mean I can leave Disk.UseDeviceReset as it is? I guess I can, as I am on fibre, so it should have no effect.

I think things are as good as they are going to get - I can now get about 42 MB/s of throughput on my fibre switches. What I might do is change my LUN layout a bit:

Move SQL on LUN2 from SG2 to SG1 (more spindles).

Create a new LUN on SG2 and move low-load VMs from SG1 to SG2, e.g. virtual workstations that don't need much I/O. Perhaps also move one of the file servers to SG2, as these are low-load too.

Things should be slightly better balanced and I might get a bit of a performance increase.
