VMware Cloud Community
SBruggeman
Contributor

Disappointing 10gbe performance -- HELP!

So we recently implemented a new ESX/iSCSI environment and we're seeing worse performance than we saw in our multipathed 1gbe environment.

We have 3 x Dell R910 Servers with 2 x Intel Dual Port 82598EB 10gb CX4 NICs each.

Each of the servers is connected to a Force10 C300 switch. We are using 1 port on each NIC of each server for iSCSI traffic, and each one is connected to a separate Force10 C300. The C300s are then cross-connected.

We have 2 x Dell EqualLogic PS6510 arrays, also cross-connected to the two C300 chassis.

We have a separate vSwitch configured on each host and are using a VMkernel port for each physical iSCSI NIC. We are using jumbo frames and are able to vmkping a jumbo frame to the portal and each target address on the SANs with no problem.
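For anyone wanting to double-check the same thing, this is roughly the sanity check we mean, sketched with the ESX 4.x esxcfg tools; the vSwitch name, port group name, and IP addresses below are placeholders, so adjust them for your own setup:

# Give the iSCSI vSwitch and its VMkernel port a 9000-byte MTU
esxcfg-vswitch -m 9000 vSwitch1
esxcfg-vmknic -a -i 10.10.10.11 -n 255.255.255.0 -m 9000 "iSCSI-vmk1"

# Confirm the MTU actually took on both objects
esxcfg-vswitch -l
esxcfg-vmknic -l

# Ping the array portal with an 8972-byte payload so the resulting IP packet
# is ~9000 bytes (9000 minus 20 bytes IP header and 8 bytes ICMP header)
vmkping -s 8972 10.10.10.100

If the ping only succeeds at small payload sizes, something in the path (vmknic, vSwitch, switch port, or array port) is still running at 1500.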

When running performance tests in Red Hat using dbench, we are only seeing around 550mb/s of throughput, whether we write directly to the local VMDK disk (which sits on a round-robin multipathed datastore) or go through a software iSCSI initiator inside Red Hat that is also multipathing.
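For reference, the dbench runs look roughly like the sketch below; the mount point and client count are just illustrative, not our exact parameters:

# Run 8 dbench clients for 5 minutes against the filesystem under test;
# dbench prints an aggregate throughput figure in MB/sec at the end.
dbench -D /mnt/testlun -t 300 8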

We hooked up our old SANs for comparison, and in an otherwise identical configuration we're seeing a consistent 630mb/s out of our ESX hosts using 4 x 1gbe from the host to the switch and 3 x 1gbe to the PS5000.

So, with only 4 x 1gbe NICs on the host and 3 x 1gbe NICs on an array with 32 fewer drives, we are seeing an 80mb/s INCREASE in performance over our new 10gbe hardware... what gives?

When switching the path selection policy from Round Robin to Fixed on the datastore, we see speeds drop to around 130mb/s.
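For anyone who wants to flip the policy from the command line rather than the vSphere Client, the rough shape of it on ESX 4.x is below. The naa device ID is a placeholder, the esxcli namespace changed in later releases, and the round robin IO operation limit tweak is only something EqualLogic tuning guides of this era commonly suggest, so double-check the flags against your build:

# Find the device IDs backing the datastore
esxcli nmp device list

# Set the path selection policy to round robin for one device
esxcli nmp device setpolicy --device naa.xxxxxxxxxxxx --psp VMW_PSP_RR

# Optionally lower the number of IOs sent down a path before switching paths
# (the default is 1000; some EqualLogic guidance suggests a much smaller value)
esxcli nmp roundrobin setconfig --device naa.xxxxxxxxxxxx --type iops --iops 3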

I feel like we must be missing some huge configuration step to see such terrible performance.

Ideas?


Accepted Solutions
taylorb
Hot Shot

Have you considered that the bottleneck may not be the network media? What makes you sure your storage system can deliver more I/O than your Gigabit network was delivering?

12 Replies
depping
Leadership

I am not sure why you are seeing it. Have you tried completely disabling jumbo frames to see what that results in?

Duncan (VCDX)

Available now on Amazon: vSphere 4.1 HA and DRS technical deepdive

SBruggeman
Contributor

I've not tried disabling Jumbo frames yet. I'll give that a shot and see where it gets me.

SBruggeman
Contributor

I just reconfigured my VMkernels to use a standard 1500 MTU and, as I would expect, am seeing slightly lower performance than with jumbo frames in my 10gbe infrastructure.

Any other ideas?

-Steve

IRIX201110141
Champion

With a single IOmeter thread and a workload of 32KB, 100% read, 0% random I get around 710MB/s from a single PS6010X. I am preparing a new deployment with 2 x PS6010XV within the next week and can report my experience. Instead of CX4 we are using SFP+ together with Intel X520-DA2 NICs in the R710s and PC8024F switches.

Be sure that you have enabled flow control and disabled any kind of "iSCSI optimization" when you have dedicated switches for your storage network.

We also have 2 x PS5000E, but I can't see how you get 630MB/s out of those.

Regards

Joerg

depping
Leadership

Wondering if you are running identical tests against both arrays, and if there is a huge difference in cache or spindles backing those LUNs?

Duncan (VCDX)

Available now on Amazon: vSphere 4.1 HA and DRS technical deepdive

SBruggeman
Contributor

The tests that I'm running are identical on both arrays. The array that performs at around 630mb/s is a PS5000E with 16 x 1TB drives.

The slower 10gbe array is a PS6510E with 48 x 1TB drives and performs in the 550mb/s range.

I have been testing using dbench in Red Hat Enterprise Linux 5, but made a discovery this morning. Previously my VMs had 2 vCPUs each. After upping my test VM on the 10gb SAN to 8 vCPUs I am now seeing over 3x the performance! I am getting around 1600mb/s using round-robin iSCSI targets within the VM and around 1500mb/s on the local VMDK disk, which is also on the 10gb SAN.

However, after making the same changes to my VM on the PS5000 1gbe SAN I am seeing nearly identical results. While it is great that I'm seeing better numbers, I would still expect my 10gbe infrastructure to blow the PS5000 setup out of the water.

I've found a few other things I can try tuning but I'm sure I'm still missing something...

bmorbach
Enthusiast

I recommend that you question your measurement method.

A PS5000E has 2 controllers, only one of which is active, with a maximum of 3 x 1 GE interfaces.

A 1 GE interface can sustain a bandwidth of about 100 MB/s, maybe a little more, but you should not be able to get a lot more than 300 MB/s over these 3 interfaces. If you manage to get 1500 MB/s you are bypassing physics, which I doubt you are capable of.

That looks a lot like you are seeing caching effects somewhere in your setup.

SBruggeman
Contributor

I don't agree with this assessment at all. To say that a 1gbe interface is not capable of sustaining more than 1/10th of its potential bandwidth just doesn't make any sense. To achieve 1.5gb/s would only require the interfaces to perform at 50%, which should be all the more achievable in a round robin configuration. I am not seeing caching effects when running tests for 30 or 60 minutes at a time...

bmorbach
Enthusiast

We should not be mixing bits and bytes here.

A 1 gbe interface is capable of transmitting 1000 megabits/s, which theoretically works out to 1000/8 = 125 megabytes/s; I typically see ~110 megabytes/s.

Hence a PS5000E will saturate its 3 x 1 gbe interfaces at either 3000 megabits/s or ~330 megabytes/s.

A bandwidth of 1500 megabytes/s is physically impossible on a PS5000E; a bandwidth of 1500 megabits/s (~180 megabytes/s) is no problem with active/active multipathing on the ESX host.

However, if you are only able to achieve 1500 megabits/s on a 10gbe interface (capable of transmitting 10000 megabits/s), you do have a problem in your network. The 710 megabytes/s stated above is absolutely realistic with 10gbe.
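Putting the same unit conversion in one place (theoretical line rates, before any protocol overhead), here is a quick sketch you can paste into a shell:

# Divide megabits/s by 8 to get megabytes/s (ignores Ethernet/IP/iSCSI overhead)
awk 'BEGIN {
    print "1 x 1gbe  :", 1000 / 8,     "MB/s theoretical"
    print "3 x 1gbe  :", 3 * 1000 / 8, "MB/s theoretical"
    print "1 x 10gbe :", 10000 / 8,    "MB/s theoretical"
}'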

taylorb
Hot Shot

Have you considered that the bottleneck may not be the network media? What makes you sure your storage system can deliver more I/O than your Gigabit network was delivering?

SBruggeman
Contributor

BMorbach, I apologize; I was mixing up bits and bytes, and there is obviously a pretty huge difference there.

After some more extensive testing and comparison we've found that in fact the network is NOT the bottleneck. Our storage system can't even come close to saturating the media.

What I ended up finding, when generating heavy load on the SANs with IOMeter and monitoring statistics in SANHQ, was that once the load reached a certain point we would begin queuing IO operations on the SAN and latency climbed as well, which basically means we've hit the limit of our spindles. I guess we expected more performance from 48 spindles, but we take a decent performance hit, especially on random IO, by using RAID 6.
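As a rough illustration of why the spindles give out long before 10gbe does, here is a back-of-the-envelope estimate; the per-disk IOPS figure, read/write mix, and RAID write penalty are generic assumptions, not measured values from our arrays:

# Spindle-limited IOPS estimate: effective = raw / (read% + write% * penalty)
# Assumed: 7.2k SATA disk ~75 IOPS, 70/30 read/write mix, RAID 6 write penalty of 6
awk -v disks=48 -v per_disk=75 -v read_pct=0.70 -v penalty=6 'BEGIN {
    raw    = disks * per_disk
    usable = raw / (read_pct + (1 - read_pct) * penalty)
    printf "raw backend IOPS    : %.0f\n", raw
    printf "usable frontend IOPS: %.0f (RAID write penalty %d)\n", usable, penalty
}'

Changing the penalty to 2 approximates RAID 10, which shows why random write performance is where RAID 6 hurts the most.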

I reconfigured our two PS5000 arrays into a RAID 10 storage pool, and while the sequential performance is slower than the PS6510 10gb arrays, they are about 20% faster in random read/write operations, which means we were never network bound, even at 1gb speeds.

The real benefit we're going to see in our environment comes from utilizing two separate PS6510 storage pools (cross-replicating for HA) plus our PS5000 RAID 10 pool, giving us the ability to push more throughput / IOPS to a single host than we ever could in our 1gb environment.

In addition, our new Force10 switches offer considerably lower latency. We're in the process of testing whether our lower latency limit is going to be defined by our disks or our network.

I've got lots of interesting benchmark results from IOmeter that I'll post once they're complete.

We've also had an opportunity to run the same benchmark tests against a PS6000 storage array and MD3200i.

IRIX201110141
Champion

Just for the record,

our 2 x PS6010XV went into production. With an unrealistic load pattern of 8192 sectors / 1024KB / 100% read / sequential / 0% random we are pulling 1920MB/s out of the cache when running IOmeter in a VM on a single ESX host. Verified with esxtop and SANHQ.

We got very weird numbers and high latency when using a single vSS with 2 VMkernel ports and a 1:1 binding to the 10GbE DA2 ports. There was a huge decrease in performance after placing the 2nd EQL into the pool.

After changing the setup to two vSSes, performance went up and we saw what we expected. EQL suggests not using a vDS for iSCSI.
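For completeness, the 1:1 VMkernel-to-uplink binding for the software iSCSI initiator on ESX 4.x looks roughly like this; the vmk and vmhba numbers are placeholders for whatever your hosts actually use:

# Bind each iSCSI VMkernel port (one active uplink each, no standby NICs)
# to the software iSCSI adapter
esxcli swiscsi nic add -n vmk1 -d vmhba33
esxcli swiscsi nic add -n vmk2 -d vmhba33

# Verify which vmknics are bound
esxcli swiscsi nic list -d vmhba33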

Regards

Joerg
