Re: MSA2312i and ESX4.1 slow performance - Page 2

Syl20m · 02-17-2011

Hi all,

I have a big performance problem with my customer's configuration.

We are seeing very slow large-file transfers from Windows computers to Windows server VMs: the maximum transfer rate for a 2GB file is 15MB/s. I assume we should expect over 100MB/s.

We use an MSA2312i with 7 x 450GB 15k disks in RAID5, and 2 HP DL360 G7 servers with 12 NICs as ESX 4.1 hosts. For fault tolerance, we use 2 dedicated HP ProCurve 1810G switches to connect the 2 iSCSI NICs of the ESX hosts and the 2 controllers of the MSA.
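For reference, the figures above translate into copy times as follows. This is only a back-of-the-envelope sketch; it assumes 1GB = 1024MB and ignores all protocol overhead, and the function name is made up for illustration.

```python
def transfer_seconds(size_gb: float, rate_mb_s: float) -> float:
    """Seconds needed to move size_gb gigabytes at rate_mb_s megabytes per second.

    Assumes 1 GB = 1024 MB.
    """
    return size_gb * 1024 / rate_mb_s


# Raw line rate of a 1 Gbit/s link in MB/s, before any protocol overhead.
GIGABIT_RAW_MB_S = 1_000_000_000 / 8 / 1_000_000

if __name__ == "__main__":
    for rate in (15, 80, 100):
        print(f"2GB at {rate:>3} MB/s: {transfer_seconds(2, rate):6.1f} s")
    print(f"Raw 1 Gbit/s line rate: {GIGABIT_RAW_MB_S:.0f} MB/s")
```

At 15MB/s a 2GB file needs well over two minutes, while at 80MB/s or more it completes in under half a minute.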

On the MSA we have 2 vdisks (1 of 2TB and 1 of 700GB)

Our 6 VMs are running on the 2TB VMFS datastore.

I ran a lot of tests, and the results puzzle me...

_ configuring software iSCSI: no change

_ using another switch: dlink DGS-3100: no change

_ using Write-back or Write-through caching on the MSA volume: no change

_ changing multipathing to round robin: no change

_ changing the NICs used for software iSCSI: no change

_ installing another ESX server (different hardware) with VMware ESX 4.0 and migrating a VM to this host: no change

_ migrating a VM to the local ESX datastore: WHAOOOOO!!! transfer rate between 50MB/s and 80MB/s

_ configuring a physical Windows 2008 server with the software initiator against the MSA2312i, using the same switch as the ESX hosts. I reformatted the 700GB vdisk from VMFS to NTFS in order to present it to the server: WONDERFUL, I copied a 2GB file at 80MB/s in less than a minute!!!

From these tests, I know that the MSA is fine, because in a Windows iSCSI environment I have no problem. But I can't explain why I see such poor performance in the VMware environment?!
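To make test runs like the ones above comparable, the same timed write could be repeated inside a VM on the MSA datastore, inside a VM on the local datastore, and on the physical server. The following is a minimal sketch, assuming Python is available on the Windows machines; the function name and default sizes are illustrative, and a sequential write followed by a flush is only a rough stand-in for a real file copy over the network.

```python
import os
import tempfile
import time


def write_throughput_mb_s(directory: str, total_mb: int = 256, chunk_mb: int = 1) -> float:
    """Write about total_mb of random data to a temp file in directory; return MB/s."""
    chunk = os.urandom(chunk_mb * 1024 * 1024)  # random data, reused for every write
    written_mb = 0
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            while written_mb < total_mb:
                f.write(chunk)
                written_mb += chunk_mb
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push its cached data to the device
        elapsed = time.perf_counter() - start
    finally:
        os.remove(path)  # clean up the test file
    return written_mb / elapsed


if __name__ == "__main__":
    # Point this at a directory on the disk under test.
    print(f"{write_throughput_mb_s(tempfile.gettempdir(), total_mb=64):.1f} MB/s")
```

Use a test size well above the amount of cache in the path, otherwise the result mostly measures the cache.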

Any help or any idea on this case will be really appreciated!

For information, my VMware case has been open since 24/12/2010!!! and is still not closed.

An HP case was closed a month ago because of our test with the Windows Server 2008 machine!!!

I'll give you any further information if needed!

Thanks in advance,

Sylvain

binoche · 03-15-2011

The messages below look strange to me; I'm not sure whether they indicate something wrong?

Mar 15 04:43:23 ESX1 vmkernel: 47:05:55:37.685 cpu12:4269)WARNING: vmklinux26: __kmalloc: __kmalloc: size == 0 
Mar 15 04:43:23 ESX1 vmkernel: 47:05:55:37.733 cpu13:4271)WARNING: vmklinux26: __kmalloc: __kmalloc: size == 0 
Mar 15 04:43:23 ESX1 vmkernel: 47:05:55:37.737 cpu13:4267)WARNING: vmklinux26: __kmalloc: __kmalloc: size == 0 
Mar 15 04:43:23 ESX1 vmkernel: 47:05:55:37.740 cpu13:4272)WARNING: vmklinux26: __kmalloc: __kmalloc: size == 0 
Mar 15 04:43:25 ESX1 vmkernel: 47:05:55:39.729 cpu15:4264)WARNING: vmklinux26: __kmalloc: __kmalloc: size == 0 
Mar 15 04:43:25 ESX1 vmkernel: 47:05:55:39.753 cpu8:4262)WARNING: vmklinux26: __kmalloc: __kmalloc: size == 0 
Mar 15 04:43:25 ESX1 vmkernel: 47:05:55:39.770 cpu15:4274)WARNING: vmklinux26: __kmalloc: __kmalloc: size == 0 
Mar 15 04:43:25 ESX1 vmkernel: 47:05:55:39.781 cpu12:4269)WARNING: vmklinux26: __kmalloc: __kmalloc: size == 0 
Mar 15 04:43:25 ESX1 vmkernel: 47:05:55:39.786 cpu12:4273)WARNING: vmklinux26: __kmalloc: __kmalloc: size == 0 
Mar 15 04:43:25 ESX1 vmkernel: 47:05:55:39.791 cpu13:4271)WARNING: vmklinux26: __kmalloc: __kmalloc: size == 0 
Mar 15 04:43:26 ESX1 vmkernel: 47:05:55:39.849 cpu13:4272)WARNING: vmklinux26: __kmalloc: __kmalloc: size == 0 
Mar 15 04:43:26 ESX1 vmkernel: 47:05:55:39.910 cpu15:4264)WARNING: vmklinux26: __kmalloc: __kmalloc: size == 0 
Mar 15 04:43:26 ESX1 vmkernel: 47:05:55:39.918 cpu22:4270)WARNING: vmklinux26: __kmalloc: __kmalloc: size == 0 
Mar 15 04:43:26 ESX1 vmkernel: 47:05:55:39.921 cpu17:4268)WARNING: vmklinux26: __kmalloc: __kmalloc: size == 0 
Mar 15 04:43:26 ESX1 vmkernel: 47:05:55:39.926 cpu13:4277)WARNING: vmklinux26: __kmalloc: __kmalloc: size == 0 
Mar 15 04:43:26 ESX1 vmkernel: 47:05:55:39.929 cpu13:4275)WARNING: vmklinux26: __kmalloc: __kmalloc: size == 0
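When a log is flooded with a repeating warning like this, a quick tally per second and per message helps to judge how bursty it is (the excerpt above packs 16 warnings into about three seconds). A minimal sketch; the regular expression is an assumption derived only from the line layout shown above and may need adjusting for other vmkernel messages.

```python
import re
from collections import Counter

# Pattern inferred from the excerpt above: syslog stamp, host, "vmkernel:",
# uptime, "cpuN:world)WARNING:", then the message text.
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+vmkernel:\s+"
    r"\S+\s+cpu(?P<cpu>\d+):(?P<world>\d+)\)WARNING:\s+(?P<msg>.*\S)\s*$"
)


def summarize(lines):
    """Return (warnings per timestamp, warnings per message text)."""
    per_second = Counter()
    per_message = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m is None:
            continue  # not a vmkernel WARNING line in the expected layout
        per_second[m["stamp"]] += 1
        per_message[m["msg"]] += 1
    return per_second, per_message


SAMPLE = [
    "Mar 15 04:43:23 ESX1 vmkernel: 47:05:55:37.685 cpu12:4269)WARNING: vmklinux26: __kmalloc: __kmalloc: size == 0 ",
    "Mar 15 04:43:23 ESX1 vmkernel: 47:05:55:37.733 cpu13:4271)WARNING: vmklinux26: __kmalloc: __kmalloc: size == 0 ",
    "Mar 15 04:43:25 ESX1 vmkernel: 47:05:55:39.729 cpu15:4264)WARNING: vmklinux26: __kmalloc: __kmalloc: size == 0 ",
]

if __name__ == "__main__":
    per_second, per_message = summarize(SAMPLE)
    for stamp, count in sorted(per_second.items()):
        print(stamp, count)
    for msg, count in per_message.most_common():
        print(count, msg)
```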

opbz · 03-16-2011

Have you looked at the /var/log/vmkiscsid.log?

I have not used the MSA boxes, but I have seen similar issues with EqualLogic boxes, and in my case the issue was down to misconfiguration.

I would suggest the following:

Ensure you have the latest version of the iSCSI configuration guide for VMware for the MSA. I have seen cases where config details were copied over from different devices and that caused issues. Also check for firmware updates.

If it's active/passive storage, you will most likely need to have different subnets.

If it's active/active, you only need one.

Check with esxcli -l vmhbaXX to ensure all your vmknics are associated with the correct vmhba.

If you have jumbo frames, ensure they are enabled throughout your network for iSCSI. Also check for any particular settings you have on your network.

By the way, vmkiscsid.log throws out a lot of rubbish when it is connecting to iSCSI, but after proper connections are made it usually stops.

Hope this helps...

DaIceMan · 03-16-2011

Syl,

We are satisfied with our current setup, as it won't give more throughput. We get about 30-40MB/s copying from VM to VM on different hosts, which is acceptable. From a VM to a physical machine we get around 80MB/s. We will replace our MSA with a new-generation one with 12 x 2.5" 600GB SAS disks, on which I will run further tests on a RAID10 volume for best overall performance. Our present write limit on a slower RAID6 volume is about 60MB/s. For reads we get a little over 110MB/s, maxing out the Gbit connection.

Regarding your problem, I would suggest taking it one step at a time. I would first disable all jumbo frame support on the vSwitches, vmkernel ports, NICs and your 1810 switches, and instead enable Flow Control on the ports where your iSCSI NICs and your 4 storage ports are connected. Flow Control is more important than jumbo frames if the switch cannot support both (and the 1810, like our 2810, cannot). Then run some tests.
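As a rough illustration of how little is at stake with jumbo frames on a single Gbit link, the theoretical TCP payload rate can be estimated for both MTU sizes. This is a sketch under simplifying assumptions: IPv4 and TCP headers without options, standard Ethernet framing overhead, and no iSCSI header or retransmission cost. The function name is made up for the example.

```python
# Per-frame bytes on the wire beyond the MTU:
# preamble/SFD (8) + Ethernet header (14) + FCS (4) + inter-frame gap (12).
ETH_OVERHEAD = 8 + 14 + 4 + 12
# IPv4 header (20) + TCP header (20), assuming no options.
IP_TCP_HEADERS = 20 + 20


def tcp_payload_mb_s(mtu: int, link_bits_s: float = 1e9) -> float:
    """Upper bound on TCP payload throughput in MB/s for a given MTU."""
    payload = mtu - IP_TCP_HEADERS      # useful bytes per frame
    on_wire = mtu + ETH_OVERHEAD        # bytes the frame occupies on the link
    return link_bits_s / 8 / 1e6 * payload / on_wire


if __name__ == "__main__":
    for mtu in (1500, 9000):
        print(f"MTU {mtu}: {tcp_payload_mb_s(mtu):.1f} MB/s")
```

Under these assumptions the ceiling moves from roughly 118.7MB/s at MTU 1500 to roughly 123.9MB/s at MTU 9000, a gain of only a few percent, so a transfer stuck at 15MB/s is unlikely to be explained by the frame size alone.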

The MSA controllers are actually active/active, though they adhere to the ALUA specification. This means that the 2 SPs present all LUNs on all ports, but in reality each SP serves directly only the LUNs it owns: an SP that receives a request for a LUN it does not own simply hands it over internally to the SP that does (the 2 SPs are interconnected by a bus inside the MSA). This supposedly keeps the path active in case of a failure. Personally, I have not yet found any relevant performance data on how much this impacts the "slave" SP compared with having paths connected directly to the respective owning SPs, or on whether enabling Round Robin instead of MRU is really helpful in this situation, but this is another story.

If you have any doubts about your MSA's behaviour, and if you can, try this test, which works with a maximum of 4 ESX hosts:

One at a time, disconnect one SP port and one iSCSI NIC from the switch and connect the NIC directly to the SP port, bypassing the switch (be sure you have auto MDI-X capable ports or use a crossover cable; in our case they were all auto MDI-X through blade interconnects), and disconnect the second iSCSI NIC from the switch. Give the kernel time to fail over and re-establish another path before doing the next one (wait about 1 minute), and run "Rescan All" from the storage adapters. Do this for all 4 SP ports. In the end you will have only 4 iSCSI NICs connected directly to 4 SP ports (if you are using 4 hosts, of course). Given the MSA behaviour, all the hosts will be able to see the LUNs. Note that the MSA path detection apparently behaves differently if the second ports of both SPs are on different subnets than if they are on the same one, but this is another story. Now you should see each vmhba and its relevant direct path. If you don't while you are doing this, the SP may be malfunctioning, as happened to us, so go ahead and restart the relevant SP from the MSA web interface and run "Rescan All" afterwards. If everything looks OK, you can try I/O tests from physical to virtual and from virtual to virtual and see how it performs.

Another note regarding the MSA behaviour, which may or may not be relevant for your setup: if you are using a LUN 0, this LUN will be shown as "Enclosure" on all other hosts that do not have explicit access to it, so don't be alarmed. Access to the LUN is not allowed unless it is explicitly enabled on the MSA (or allowed by its default behaviour).
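The ALUA behaviour described here can be pictured with a toy model: paths that land on the SP owning the LUN are "optimized", the others still work but are forwarded internally. The sketch below is only an illustration of that idea; the class, function and path names are hypothetical, and the assumption that a round robin policy prefers optimized paths and falls back to the rest only when none is alive is a simplification, not a statement about how ESX or the MSA firmware is implemented.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass(frozen=True)
class Path:
    name: str           # hypothetical label for a host-to-SP-port path
    sp: str             # storage processor the path lands on: "A" or "B"
    alive: bool = True  # False once the link or port has failed


def usable_paths(paths, owner_sp):
    """Prefer live paths to the owning SP; otherwise use any live path."""
    alive = [p for p in paths if p.alive]
    optimized = [p for p in alive if p.sp == owner_sp]
    return optimized or alive


def round_robin(paths, owner_sp, n):
    """Return the names of the paths used for n consecutive I/Os."""
    candidates = usable_paths(paths, owner_sp)
    if not candidates:
        raise RuntimeError("no live path to the LUN")
    chooser = cycle(candidates)
    return [next(chooser).name for _ in range(n)]


if __name__ == "__main__":
    paths = [Path("A1", "A"), Path("A2", "A"), Path("B1", "B"), Path("B2", "B")]
    print(round_robin(paths, owner_sp="A", n=4))   # only paths to SP A are used
    failed = [Path("A1", "A", False), Path("A2", "A", False),
              Path("B1", "B"), Path("B2", "B")]
    print(round_robin(failed, owner_sp="A", n=3))  # falls back to SP B paths
```

In this model, I/O sent through SP B for a LUN owned by SP A would still complete, but it takes the extra hop across the internal bus, which is the overhead questioned above.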

You can monitor the vmkwarning file from the console (tail -f /var/log/vmkwarning) while running tests, disconnecting and reconnecting paths.

These tests will help you isolate whether the problem is MSA, switch or ESX related (configuration issue or malfunction).