VMware Cloud Community
Syl20m
Contributor

MSA2312i and ESX4.1 slow performance

Hi all,

I have a big problem of performance with my customer configuration.

In fact, we are experiencing very slow transfer rates for large files copied from Windows computers to the Windows Server VMs: the maximum transfer rate for a 2GB file is 15MB/s. I assume we should expect over 100MB/s.

We use an MSA2312i with 7 x 450GB 15k disks in RAID5 and 2 HP DL360 G7 servers with 12 NICs each as ESX 4.1 hosts. For fault tolerance, we use 2 dedicated HP ProCurve 1810G switches to connect the 2 iSCSI NICs of each ESX host and the 2 controllers of the MSA.

On the MSA we have 2 vdisks (1 of 2TB and 1 of 700GB)

Our 6 VMs are running on the 2TB VMFS datastore.

I ran a lot of tests, and the results puzzle me:

_ configuring software iSCSI: no change

_ using another switch (D-Link DGS-3100): no change

_ using Write-back or Write-through caching on the MSA volume: no change

_ changing multipathing to Round Robin: no change

_ changing the NICs used for software iSCSI: no change

_ installing another ESX server (different hardware) with VMware ESX 4.0 and migrating a VM to this host: no change

_ migrating a VM to the local ESX datastore: WHAOOOOO!!! transfer rate between 50MB/s and 80MB/s

_ configuring a physical Windows Server 2008 machine with the software initiator against the MSA2312i, using the same switch as the ESX hosts. I reformatted the 700GB vdisk from VMFS to NTFS in order to present it to the server: WONDERFUL, I copied a 2GB file at 80MB/s in less than a minute!!!

With those tests, I know that the MSA is fine, because in a Windows iSCSI environment I have no problem. But I can't explain why I experience such poor performance in the VMware environment?!

Any help or any idea on this case will be really appreciated!

For information, my VMware case has been open since 24/12/2010!!! and is still not closed.

An HP case was closed a month ago because of our test with the physical Windows Server 2008!!!

I'll give you any further information if needed!

Thanks in advance,

Sylvain

AndreTheGiant
Immortal

Welcome to the community.

Have you enabled jumbo frames?

Have you followed HP's recommended practices for configuring the environment?
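
For example, a quick sanity check from the ESX service console (a sketch; vSwitch1 and the target IP are placeholders for your own iSCSI vSwitch and one of the MSA iSCSI ports):

esxcfg-vswitch -l    # the MTU column of the iSCSI vSwitch should show 9000
esxcfg-vmknic -l     # the iSCSI VMkernel ports should also report MTU 9000
vmkping -d -s 8972 <MSA iSCSI port IP>   # don't-fragment ping: it fails if any hop in the path is not jumbo-enabled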

Andre

Andrew | http://about.me/amauro | http://vinfrastructure.it/ | @Andrea_Mauro
Syl20m
Contributor

Hello Andre,

Thanks for your quick response,

In fact, I forgot to mention this test in my post, but there is no change whether jumbo frames are enabled or not. I have heard that jumbo frames can increase performance by 5-10%, but my performance is so poor that I wouldn't see such a small improvement anyway.

To configure iSCSI, I used the VMware iSCSI SAN Configuration Guide: www.vmware.com/pdf/vsphere4/r40/vsp_40_iscsi_san_cfg.pdf

And the HP StorageWorks MSA best practices guide: http://h20195.www2.hp.com/V2/GetPDF.aspx/4AA2-5019ENW.pdf

AndreTheGiant
Immortal

Have you also tried using a guest iSCSI initiator inside a VM, just to see whether you reach performance similar to the physical Windows Server case?

Is the MSA firmware up to date?

Andre

Andrew | http://about.me/amauro | http://vinfrastructure.it/ | @Andrea_Mauro
depping
Leadership

Did you check with esxtop whether there are any latency issues? DAVG? KAVG? LAT/rd and LAT/wr? What about QUED?
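
For example (a quick sketch using the interactive esxtop keys in vSphere 4.x):

esxtop     # press 'd' for the disk adapter view or 'u' for the disk device view:
           #   DAVG/cmd = latency at the device/array, KAVG/cmd = latency in the VMkernel,
           #   GAVG/cmd = latency seen by the guest, QUED = commands held in the queue
           # press 'v' for the per-VM disk view, which shows LAT/rd and LAT/wr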

Duncan (VCDX)

Available now on Amazon: vSphere 4.1 HA and DRS technical deepdive

Syl20m
Contributor

No, I didn't try that! That's a good idea, I'll run this test this afternoon.

I have also added a new disk to the MSA in order to create another vdisk in RAID0. I'll see whether I get better performance in RAID0 than in RAID5.

Sylvain

Syl20m
Contributor

Yes, I can see with esxtop that the DAVG is high (between 10 and 100ms); GAVG and KAVG are always lower than 10ms (average = 1ms). The other values seem to be good, except MBWRTN/s, which is still very, very low!

Sylvain

DaIceMan
Enthusiast

Sylvain,

     did you find any solution to this? We are having similar latency problems with our ESX 4.1. We have an MSA2312i dual controller with 6 x 1TB SATA disks and one vdisk, split into 4 x 16GB LUNs to boot our 4 diskless blades with QMH4062 iSCSI HBAs, with the rest split into 2 datastores/LUNs. We have a total of 8 VMs on one datastore and 4 on the other.

While Storage vMotioning a VM between the two, we can get 100-300 DAVG/cmd on one host with about 60-100 GAVG, and similarly while copying any file. When this happens the whole VMware infrastructure becomes sluggish and sometimes unresponsive. It appears to be a SAN issue, but just like you, when we connected a physical client to a LUN, the copy speeds reached 80MB/s. We also followed the documentation for configuring the SAN with ESX. We have 2 separate subnets, one per port group (A1B1 / A2B2), cross-connected to 2 ProCurve 2810 switches. We tried setting Round Robin or going back to MRU, but we noticed no change. Typically the copy can start at up to 40MB/s for the first few seconds (especially after a copy restart, due to caching), but then drops drastically to a few hundred KB/s.

We also tried enabling jumbo frames everywhere (switches, vSwitches, vmknics, the VMs' vmxnet3 NICs, and the iSCSI HBAs, as they are independent hardware) without any change. There are no particular warnings in the vmkwarning log (we saw some MTU problems initially because we had forgotten to raise the MTU on one vSwitch, but that was quickly resolved). The latency problems actually seemed to worsen after this, so we reverted our VMs to 1500, but we left the iSCSI HBA MTU at 9000, with all the networking and the MSA host interfaces still at 9000.

I noticed that the documents state that to enable jumbo frames on the vmknic(s) it has to be removed and recreated, but I saw that after issuing

esxcfg-vmknic -m 9000 "VMkernel 2"

the command is accepted successfully and the setting is retained between reboots. Also, we have independent hardware iSCSI HBAs which do not use VMware networking (in fact, jumbo frames are enabled via their BIOS, and esxcfg-hwiscsi -l on vmhba0 and vmhba1 shows this correctly).
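
For reference, the remove-and-recreate procedure from the documentation looks roughly like this (a sketch; the port group name, vSwitch, IP and netmask below are placeholders for your own values):

esxcfg-vmknic -d "VMkernel 2"                                        # delete the existing VMkernel port
esxcfg-vswitch -m 9000 vSwitch2                                      # make sure the vSwitch itself is at MTU 9000
esxcfg-vmknic -a -i 10.0.0.11 -n 255.255.255.0 -m 9000 "VMkernel 2"  # recreate the port with MTU 9000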

Is there anything else we can try or analyse in the logs to see what could be slowing down SAN access so much?

Thanks for any feedback.

PimMolenbrugge
Contributor

Similar problem with LeftHand iSCSI nodes here. Any progress yet, Sylvain / DaIceMan?

depping
Leadership

Hi, a high DAVG usually indicates that the delay is on the array side. Not sure if you can monitor the array side to see what it is doing... but anything regularly above 20ms is suspicious, in my opinion.

Duncan (VCDX)

Available now on Amazon: vSphere 4.1 HA and DRS technical deepdive

DaIceMan
Enthusiast

As a follow-up: after some further testing we noticed that the transfer or write speed is about 25MB/s IF the source and target VMs are on different hosts. If they are on the same host, the speed starts at 20 or even 40MB/s, but after just a couple of seconds it drops to a couple of MB/s and sometimes to less than 1.

So performance is better with VMs between hosts than on the same host.

All hosts are identical hardware (BL490c), with the same QMH4062 HBAs, CPUs and RAM.

Our switches are ProCurve 2810-24s. I read that there can be an issue with jumbo frames and flow control enabled simultaneously, though we haven't enabled the latter. I will try disabling jumbo frames on all iSCSI HBAs, the MSA2312i and the 2810s first, then enable flow control on the relevant iSCSI ports and see what happens.
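
On the switch side this should be something along these lines from the 2810 CLI (from memory, so please treat the exact syntax as an assumption and check it against the ProCurve manual; VLAN 20 and ports 1-8 stand in for the iSCSI VLAN and ports):

configure
no vlan 20 jumbo             # disable jumbo frames on the iSCSI VLAN
interface 1-8 flow-control   # enable flow control on the iSCSI ports
write memory                 # save the configuration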

jordan57
Enthusiast

Have you looked into the network side of things? Check your duplex settings; maybe force the switch ports and your pNICs to 1000/full, or whatever you're using.
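
For example, on the ESX side this can be checked and set from the service console (vmnic2 is a placeholder for one of the iSCSI uplinks):

esxcfg-nics -l                       # list the pNICs with their current speed and duplex
esxcfg-nics -s 1000 -d full vmnic2   # force 1000/full on one uplink
esxcfg-nics -a vmnic2                # or put it back to auto-negotiation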

Blog: http://www.virtualizetips.com Twitter = @bsuhr
DaIceMan
Enthusiast

To rule out the switches for testing, we connected the 4 hosts directly to the 4 MSA2312i ports, using one QMH4062 port per host (2 on one SC and 2 on the other SC), since the MSA presents all LUNs on all ports. The latency and bandwidth problems still persist (no more than a sustained 12MB/s). We also see many of these entries in vmkwarning when copying from 2 hosts to the storage:

Mar  4 16:45:28 vh5 vmkernel: 0:01:08:31.345 cpu1:4378)WARNING: LinScsi: SCSILinuxQueueCommand: queuecommand failed with status = 0x1055 Host Busy vmhba1:0:3:8 (driver name: qla4xxx) - Message repeated 625 times

It can be repeated 50, 100, 400 and even over 600 times, which is definitely not a good thing. This appears if we simply copy a file or two between 2 VMs, or just install updates on a server, with nothing else going on.
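
For example, you can quantify it on each host with something like:

grep -c "queuecommand failed" /var/log/vmkwarning   # count the occurrences in the current log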

Could it be an interrupt-sharing problem? The BL490c has only 2 mezzanine slots and there is no word about any preference between the slots. We have also installed a quad-port NIC, though I must say that these latency problems appear with or without it installed.

Attached is a dump of the interrupts with a cat /proc/vmware/interrupts.

Thank you for any additional help.

ats0401
Enthusiast

What kind of speed do you get if you transfer the file from one of the Windows machines directly to the datastore (bypassing the VM)?

Is your VM using VMXNET3 driver?

*nevermind*, I see that when you move the VM to the host's local datastore the issue disappears. That seems to rule out any VM-level networking issues.

This KB may help you decode the SCSI errors in your log

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=289902

binoche
VMware Employee

No idea what is wrong. Could you please upload /var/log/vmkernel the next time you hit the issue?

DaIceMan
Enthusiast

  Status Update:

     after some further debugging, we narrowed the issue down to one of the 2 MSA controllers. Somehow, the second storage controller was presenting the LUNs normally but did not allow access to one of them. In ESX that LUN would appear as available but not connected, it wouldn't be owned by anybody, and connections would apparently simply time out while multipathing through the switch (the MSA presents all LUNs on all ports), not even after an HBA/VMFS rescan. After restarting the 2nd SC everything went back to normal. This must have severely messed up the multipathing.
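
For anyone hitting the same thing, the path state can be checked from the ESX 4.x service console, for example:

esxcfg-mpath -l          # list all paths with their state (active/dead) per LUN
esxcli nmp device list   # show the path selection policy and working paths per device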

   While we were at it, we ran some file copy tests from physical to virtual and from virtual to virtual. To a Z200 workstation, an 8GB copy from a virtual server (2008 R2) was a sustained 110MB/s, which is about the limit for a 1Gb connection (the host was directly connected to one port of one SC). The opposite direction, a write, would max out at around 55MB/s (it is a RAID6 vdisk of 6 x 1TB SATA disks, split up into 8 small boot LUNs and 2 larger ones; we are running at most 20 low-I/O VMs, so that was enough).

  We also swapped out one of the QMH4062s from one of our hosts and used the software iSCSI initiator instead (we have a quad-port Tigon3 NC325 mezzanine, so we used one of its ports through the switch, making it the only "switched" port) to test the difference. With this kind of storage, the write and read performance was no different, but naturally the CPU usage was higher. With the software initiator, the write DAVG bounced from 150ms up to 1500ms but was around 200ms on average, while on reads it was around 12ms. The write DAVG on the QLogic hosts during sustained writes was also around 150ms, indicating that our MSA (the SATA disks) was the bottleneck. On reads, the DAVG was likewise around 12ms at 110MB/s.

These tests were all done with direct host-to-SC connections bypassing the switch, with jumbo frames ON on the QMH4062 controller and the SC but not on the software iSCSI setup, so I was expecting more issues with the software initiator, yet we see there is actually no difference. I then enabled jumbo frames on the relevant vSwitch and vmk port and rebooted. The read tests showed a dramatic decrease in performance (less than 30MB/s instead of 110), so something is not working here, either on the physical switch (ProCurve 2810-24) or on the ESX side, so we decided to disable jumbo frames (we don't have such an intensive, large-file workload) and put an end to the pain for the moment.

Syl20m
Contributor

Hello,

Sorry for the very late feedback, but I was very busy these last few days!

I have run some further tests and still haven't identified the bottleneck in my infrastructure.

In fact, we realized that if we copy from a physical server on the VMware management network to a VM on the production LAN (with a router with a 1500 MTU in between), the copy is more stable but still slow.

When we copy from physical to VM inside the production LAN, the transfer rate starts above 50MB/s and sometimes decreases to 5-10MB/s.

So I decided to play with the MTU inside the VMs and changed all the vNICs to Flexible in order to have the choice of MTU in the Windows driver. If we use a 1300 MTU inside all VMs, the copy between physical and VM inside the production LAN becomes as stable as the copy from the management LAN to the production LAN (through the router).
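
For reference, the MTU inside a Windows 2008 VM can be set with something like this (the interface name is a placeholder for your own connection):

netsh interface ipv4 set subinterface "Local Area Connection" mtu=1300 store=persistent
netsh interface ipv4 show subinterfaces   # verify the new MTU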

I don't know what changes I can make to improve performance. Do you think migrating from RAID5 to RAID10 will help?

Syl20m
Contributor

Hi DaIceMan,

I haven't found any solution yet. Have you solved your problem since your last post? Or have you run some other tests? Have you called VMware support?

I analyzed my VM logs and don't see anything. I just regularly get, in the events of the 2 ESX hosts (but not at the same time): "Lost access to volume (Datastore LUN0) due to connectivity issue", and 5 to 20 seconds later "successfully restored access to (Datastore LUN0)".

Did you have those same messages?

Thanks in advance,

Syl20m

binoche
VMware Employee

Hi, Syl20m

Could you please upload /var/log/vmkernel.log? We can check what could be wrong. Thanks.

Syl20m
Contributor

Hi,

Here is the vmkernel log file of the first ESX server. Thanks in advance for your analysis!
