mbender71
Contributor

Super slow iSCSI performance vSphere 4.0 Update 1

Hi everyone,

We recently purchased an EqualLogic PS6000X for testing on our existing ESX 3.5 stack. Since I

am new to iSCSI configuration management (strictly fiber channel before) I decided to do all the

reading I could. My reading led me to the conclusion that now would be a good time to move to

vSphere 4.0 due to the much improved performance with iSCSI storage. I updated our virtual center

to vCenter 4.0 and converted one of our spare ESX 3.5 servers (Dell 2850) to vSphere 4.0 Update 1.

The software install and configuration went smoothly but I have one HUGE problem : the software

iSCSI initiator on vSphere 4.0 Update 1 is SUPER SLOW.

--> Dell PowerEdge 2850, Dell Equallogic PS6000X SAN, Dell PowerConnect 6224, vSphere 4.0 Update 1

From the ESX console (the 2850) I cannot get writes greater than 30MB/sec or reads greater than 10MB/sec.

For all those gurus out there that think I'm getting good results please don't reply since you're way off base.

I have a MacBook Pro hooked up to the same PowerConnect 6224 and it gets 110MB/sec read and write via

GlobalSan iSCSI initiator (software) via Cat5e. The problem is NOT the Equallogic SAN or the PowerConnect.

The problem is definitely with vSphere 4.0 Update 1. Initially I had a perfectly configured multi path IO (3 GigE's)

via Dell's vSphere + EqualLogic technical report (which is MUCH different than ESX 3.5). The strange behavior

there was I saw perfect load balancing via round robin but each GigE would max out at ~25MB/sec giving me

a maximum throughput of ~70MB/sec. I kept reducing the configuration until I decided to start from scratch and

kept it SUPER simple. The final configuration was a clean install of vSphere 4.0 Update 1 configured with three

virtual switches on three different NICs (2 for service consoles and one only for a VMKernel port for iSCSI).

A single Cat5e was connected directly to the PS6000X SAN so as to get the PowerConnect switch out of the loop.

The results?? THE SAME. AWFUL performance that prevents me from rolling this out to our other VMware stacks.

I'm testing the read/write via the console only (no VMs). I'm using 'time dd' for testing and using the PS6000X

performance monitors to check on the SAN side. No jumbo frames enabled (not even close to theoretical performance

limits of GigE over 1500MTU to worry about jumbos). Dell has pretty much agreed that I've done nothing wrong on

the SAN or the switches and that the only culprit could be vSphere. Oh, I'm using a quad port Intel PCI-X NIC

for the iSCSI traffic but I also tried using the onboard GigE on the 2850 and got the SAME results. One other

strange observation : when writing to the SAN the network traffic is twice as much as the actual throughput, that

is I'm getting 30MB/sec write from the console but the SAN sees 70MB/sec network traffic. This has to be a clue

somehow since the overhead should not be 100%. There are no packet errors of any type on the SAN and the

log files for the software iSCSI initiator on the console are super clean.
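
For reference, the two numbers in the paragraph above (the GigE ceiling at 1500 MTU and the gap between console throughput and SAN-side traffic) can be checked with a little arithmetic. This is only a rough sketch. It assumes plain TCP/IPv4 headers with no options, decimal megabytes, and it ignores iSCSI PDU headers, none of which is stated in the post.

```python
# Rough GigE payload ceiling at MTU 1500, plus the wire traffic one would
# expect for a given payload rate.
# Assumptions (not from the post): TCP/IPv4 headers without options,
# 1 MB = 1,000,000 bytes, iSCSI PDU headers ignored.

LINE_RATE_BYTES_PER_S = 1_000_000_000 / 8   # 1 Gbit/s expressed in bytes/s
MTU = 1500
ETH_OVERHEAD = 14 + 4 + 8 + 12              # header, FCS, preamble, inter-frame gap
IP_TCP_HEADERS = 20 + 20                    # IPv4 header + TCP header


def gige_payload_mb_s(mtu=MTU):
    """Best-case TCP payload rate on one GigE link, in MB/s."""
    payload = mtu - IP_TCP_HEADERS
    wire = mtu + ETH_OVERHEAD
    return LINE_RATE_BYTES_PER_S * payload / wire / 1e6


def expected_wire_mb_s(payload_mb_s, mtu=MTU):
    """Wire traffic implied by a payload rate if framing were the only overhead."""
    return payload_mb_s * (mtu + ETH_OVERHEAD) / (mtu - IP_TCP_HEADERS)


if __name__ == "__main__":
    print("GigE ceiling at MTU 1500: %.1f MB/s" % gige_payload_mb_s())
    print("Wire traffic expected for 30 MB/s of writes: %.1f MB/s" % expected_wire_mb_s(30))
    print("Observed wire/payload ratio (70 vs 30): %.2f" % (70 / 30))
```

Under those assumptions, framing alone accounts for only a few percent, so the 70MB/sec seen by the SAN for 30MB/sec of console writes is far more than header overhead would explain.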

I really do appreciate any help and/or suggestions!

Sincere Thanks,

Mike Bender

Orlando, FL

6 Replies
asp24
Enthusiast

Slow console performance in ESX 4.x is a known issue. The VMs will not be affected by this. Try testing from a VM, and I think you will be pleased with the performance.

This patch should also improve console performance

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=101629...

BenConrad
Expert

When using ESX you really should not try to perform SAN testing with dd or any tool that is trying to stress test the VMFS file system. Testing the VMFS file system inside the COS will never give you the same performance as a VM on that VMFS file system. Fire up a VM, set up IOMeter and do a run with a 4GB test file, 64 outstanding IOs, 32KB blocks and 100% reads. You should then start seeing 100+MB/s.

Your COS write speeds on VMFS are good, the read speeds are not good but I believe you can fix this with the most recent patch that came out 1-2 weeks ago (fixes slow COS read speeds on VMFS). This will help with COS based backup agents but it won't make a difference on the VMs.

Ben

mbender71
Contributor

Hi Ben,

Thank you for your help!! Last night I noticed that I indeed had some updates to install

that were from the beginning of January (I thought the vSphere 4U1 DVD would have that

already but it didn't). I find it interesting that the COS isn't expected to have good throughput.

This means that my migration from an older FC SAN is going to take a LONG time since I

can only copy VMFS's from the COS (this is QUITE disappointing). I've never had this speed

problem before in previous versions of ESX but then again I've only been using FC for the past

four years.

I installed IOmeter on a Win2k3 VM and set it up as you specified (well almost, I had to specify

the size in sectors and not in GB). My Total MBs per Second reached 71MB/sec. I was watching

the performance monitor from the PS6000X and noticed that the three GigE lines were nicely

balanced at ~28MB/sec which is right for 71MB/sec with packet overhead. I still don't understand

why this number is so low...... very disappointing....

I then played with the access specifications in IOMeter and changed to 32KB with 50% reads.

The results were definitely interesting now since the total throughput was 180MB/sec. This

is more like it. Performance monitor on the PS6000X shows three GigE lines balancing at ~65MB/sec

each which is right for the 180MB/sec throughput. I'd like to know why vSphere 4 is running better

with mixed read/write (if I'm interpreting the 32KB @ 50% correctly) than with straight read alone??

Obviously the GigE lines are approaching closer to 100MB/sec each but yet there is still a limit.

You might say "Well your SAN can't handle it" but you would be incorrect since I've had linear

speed reads from a single GigE from the PS6000X at 110MB/sec. I still think there is something

fishy in the vSwitch and iSCSI on vSphere 4.
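
To put the two IOmeter results above in terms of I/O rate, here is a minimal sketch that converts MB/sec into IOPS and, via Little's law, into the average per-I/O latency implied by 64 outstanding I/Os. It assumes IOmeter's MB means 1024 KB and that the queue stays full for the whole run. Both are assumptions on my part, not something confirmed in the thread.

```python
# Convert the reported IOmeter throughput into IOPS and implied latency.
# Assumptions (not from the thread): 1 MB = 1024 KB, and the 64 outstanding
# I/Os are kept in flight for the whole run, so Little's law applies:
#   outstanding = IOPS * mean_latency

def iops(throughput_mb_s, block_kb):
    """I/O operations per second for a given throughput and block size."""
    return throughput_mb_s * 1024 / block_kb


def mean_latency_ms(throughput_mb_s, block_kb, outstanding):
    """Average time each I/O spends in flight, in milliseconds."""
    return outstanding / iops(throughput_mb_s, block_kb) * 1000


if __name__ == "__main__":
    for label, mb_s in (("100% read", 71), ("50% read", 180)):
        print("%-10s %4d MB/s -> %6.0f IOPS, ~%.1f ms per I/O"
              % (label, mb_s, iops(mb_s, 32), mean_latency_ms(mb_s, 32, 64)))
```

Read this way, the 100% read run averages well over twice the per-I/O latency of the mixed run, which is another way of stating the gap described above.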

BenConrad
Expert

Regarding migrating from the FC SAN, you can take one (or more) of your hosts and put it on both the FC SAN and the iSCSI SAN then do sVmotions to the EQL. After you are all done you can disconnect the FC links. You will want patch ESX400-200912001 so you can increase the read speeds.

In IOMeter, what did you set # of Outstanding I/O's to? To get a good throughput result you should have it set to 32 or 64.

Ben

mbender71
Contributor

Hi again Ben,

Ok I feel guilty now for not thinking about sVmotion. I've only used it once or twice way back when

it was CLI only but now sVmotion in vSphere 4 is very stable and usable. My bad! It may take a while

but I guess I don't have to worry about downtime. I have much to learn about vSphere 4 (though right

off the bat I know my favorite new feature is the VMFS expand).

In IOmeter I set the number of IO's to 64 just as you requested. Since the last time I posted I have begun

to cable my iSCSI setup the way it is supposed to be with dual switches and more connects (all for failover).

I decided to keep IOmeter running to test the failover since I'd be plugging/unplugging connects between the

SAN and the switches plus connects between the ESX host and the switches. The failover worked perfectly

although the time to pick up seemed a little long (about a minute). Although this made me happy and confident

I learned something MUCH more interesting during this exercise: total throughput INCREASED while I was

rewiring! It seems that either the vSwitch or the MPIO algos react unpredictably when the paths come and go.

Let me explain a scenario that is repeatable:

(a) start IOmeter with three GigE connects to the iSCSI switch fabric. 32KB 100% read w/64IO = stable 72MB/sec

(b) remove two GigE connects and wait a minute, data continues and IOmeter now = stable 112MB/sec

(c) stop IOmeter and restart with same configuration (settings and connects) = stable 112MB/sec

(d) reconnect 2 GigE connects while IOmeter is still running (after waiting for vSphere to add lines) = 105MB/sec

(e) restart VM and start IOmeter with three GigE connects and previous config = stable 72MB/sec

For some reason a disruption is required in the MPIO to give the VM a 50% increase in throughput. Once the

disruption happens the MPIO/vSwitch remains happy with the high throughput until the VM is restarted or

perhaps when high throughput disappears. Thank you again for your help and suggestions!
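
The steps (a) through (e) above can be tabulated against the three-link baseline to see the size of the jump. This is just arithmetic on the figures already quoted. The short step descriptions are my own paraphrase.

```python
# Relative change of each step against the 72 MB/sec three-link baseline.
# Throughput figures are the ones quoted in steps (a)-(e); the short
# descriptions are paraphrased.

BASELINE_MB_S = 72

STEPS = [
    ("a", "three links, fresh start", 72),
    ("b", "two links pulled", 112),
    ("c", "IOmeter restarted, one link", 112),
    ("d", "two links reconnected", 105),
    ("e", "VM restarted, three links", 72),
]


def pct_change(base, value):
    """Percentage change of value relative to base."""
    return (value - base) / base * 100


if __name__ == "__main__":
    for step, description, mb_s in STEPS:
        print("(%s) %-28s %4d MB/s  %+6.1f%%"
              % (step, description, mb_s, pct_change(BASELINE_MB_S, mb_s)))
```

Worked out this way, the gain after the disruption is somewhat more than the rough 50% figure, and the single-link result in (b) and (c) sits close to what one GigE link can carry at all.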

J1mbo
Virtuoso

Hi

My testing echoes yours, actually I posted a query on this here. Something odd with the paths seems to happen with controller failover.

Re your performance, you should be able to go considerably higher by enabling multipath - configure two or more vmkernel ports for iSCSI on the same vSwitch but one physical NIC only each, then bind them all to sw iSCSI initiator (console job), set each LUN for vmware round-robin, then set IOPS to 3 (again console job). With the lower-spec PS4000 I see about 170MB/s like this, but the final part of that tweak needs to be reapplied whenever the host is restarted. Fortunately a decent chap has posted a bash script to do just that on this thread.
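
As a rough illustration of the console steps described above (bind each vmkernel port to the software initiator, set round-robin per LUN, then set IOPS to 3), here is a small sketch that only builds the command lines and prints them. It is not the bash script mentioned above. The esxcli syntax is written from memory of ESX 4.x-era documentation and should be treated as an assumption to verify on your own host. The adapter name, vmkernel port names and device ID are hypothetical placeholders.

```python
# Build (but do not run) the console commands for the multipath tweak.
# Assumptions: the esxcli syntax below is recalled from ESX 4.x-era docs and
# must be checked against your host. vmhba33, vmk1/vmk2 and the naa ID are
# hypothetical placeholders, not values from this thread.

def multipath_commands(hba, vmk_ports, devices, iops=3):
    """Return the command lines for binding ports and tuning round-robin."""
    commands = []
    # Bind every iSCSI vmkernel port to the software iSCSI adapter.
    for vmk in vmk_ports:
        commands.append("esxcli swiscsi nic add -n %s -d %s" % (vmk, hba))
    for device in devices:
        # Select the VMware round-robin path selection policy for the LUN.
        commands.append(
            "esxcli nmp device setpolicy --device %s --psp VMW_PSP_RR" % device)
        # Switch paths after `iops` I/Os instead of the default.
        commands.append(
            "esxcli nmp roundrobin setconfig --device %s --type iops --iops %d"
            % (device, iops))
    return commands


if __name__ == "__main__":
    example = multipath_commands(
        hba="vmhba33",                       # hypothetical adapter name
        vmk_ports=["vmk1", "vmk2"],          # hypothetical vmkernel ports
        devices=["naa.EXAMPLE_DEVICE_ID"],   # hypothetical LUN identifier
    )
    print("\n".join(example))
```

Printing the commands rather than executing them makes it easy to review them first, and the same output could be pasted into a startup script since the IOPS setting has to be reapplied after a host restart.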

HTH

Please award points to any useful answer.
