Solved: Re: Very slow performance on EVA4400

st3reo · ‎01-13-2011

Hello guys,

I need some help regarding an issue:

So I have a setup consisting of a few HP ProLiant DL380 G5 servers (2 x Quad-Core Xeon X5470 + 24GB RAM) with dual-port HP FC2242SR 4Gb/s HBAs (running vSphere 4.1) connected to an HP EVA4400 made up of 24 15K 300GB FC disks.

The servers each have 2 NICs connected to a Cisco 2960G switch and connected to a dvSwitch inside vSphere.

The disks are grouped in a single DiskGroup on the EVA, and I have a 1040GB Vdisk (LUN) created with RAID 5 level.

Alright, now I have a VM with Windows Server 2008 R2 acting as a file server. Its virtual hardware consists of 4GB RAM and 4 CPU cores. The LUN I mentioned above is mounted as a VMFS datastore on the host and divided up into a 40GB vmdk for the operating system (with the LSI SAS controller) and another vmdk of ~1TB on a separate Paravirtual controller, intended for actual storage of files. Both disks are thin provisioned.

It also has a single VMXNET3 NIC.

The problem is that the file server performs very badly: about 10-12 MB/s for writes and 20-30 MB/s for reads.

I tested this by transferring ~2, ~3, and ~4 GB .iso files to and from the server.

I tried it from my workstation, other VMs and physical servers...the performance is the same.

I also vMotioned the file server to other hosts....still same thing.

I figured maybe something was wrong with the VM itself, so since I have another Windows 2008 R2 VM residing on a smaller (150GB) LUN, I did the same test by copying an .iso to that, and the performance was the same. I also did this with a Linux VM....same thing.

Any idea what the problem is here? That just seems like extremely poor performance...it's a lot slower even than a normal PC.

Basically there's almost no traffic on the Fibre Channel network most of the time, so it's not congestion or anything.

The EVA and FC-switches are running the latest firmware.

Please help.

Thanks,

idle-jam · ‎01-13-2011

What about the cache batteries on the EVA, are they fully charged? Or is there any disk failure, so that a RAID rebuild is running in the background?

Josh26 · ‎01-13-2011

Hi,

What version of ESXi are you running? 4.1 has some new options around this.

If you select "manage paths", what does it tell you about the SATP and Path Selection Plugin?

st3reo · ‎01-13-2011

Well, I'm not sure how to check the battery status, but there aren't any warning lights on it or anything.

And there haven't been any disk failures. The CommandView interface reports everything as "green".

Oh and I'm running ESX (4.1).

I don't have access to it right now, but as I recall it shows as SATP_ALUA, with the path policy set to MRU by default. I tried setting it to RoundRobin, and that seemed to increase the speed, but only slightly...basically insignificant.

idle-jam · ‎01-13-2011

Hmm, do you have any local VMFS storage? I would try on that and see whether it's a VM-specific issue or an actual storage issue. From vCenter you could also look at the storage latency counters on the Performance tab; they may point you to the root cause.

J1mbo · ‎01-13-2011

As an aside, I would reduce the vCPU count for the file server VM considerably (like to 1, and then monitor it). Ensure VMware Tools is installed in the guest, and possibly increase its RAM if it is 64-bit and if it will be busy (again, you can monitor it for now). Ensure overall the RAM on the host is not oversubscribed.

As said already, move the VM to local disk to help with isolating the problem, but be sure the local disk has battery-backed write-cache and is set to write-back caching policy.

HTH

st3reo · ‎01-14-2011

Thanks for your answers,

I do also have local VMFS datastores on the hosts. They each have six 146GB 10K SAS disks configured into a single RAID5 volume, out of which 30GB is used for the ESX install and the rest (~680GB) is a datastore for VMs. The hosts have controllers with 512MB of battery-backed cache.

I can't actually move the FileServer VM to the local VMFS, because for one it would take forever, and the block size there only permits files of max 256GB, while the VM is about 1TB.

So I did the same test on another Win 2008 R2 guest that is stored on the local VMFS of a host...and transfer was about ~60 MB/s

The thing is...as I said above..It's not only that FileServer VM that has problems....I also tested on others (windows & linux) that also get storage from the EVA SAN...and it's exactly the same slow speed.

On the FileServer VM I also tried to copy a file from the C: drive to D: where the storage is...and the speed was the same...about 10 MB/s....so I suppose that rules out any network issue.

The resources aren't oversubscribed...I only have like 1-2 guests on the hosts I am testing. And inside the guests neither RAM nor CPU is overutilized.

st3reo · ‎01-14-2011

hm, I just noticed that the indicator light on one of the batteries on the controller enclosure keeps blinking, and the manual says that means:

Blinking green = Maintenance activity in progress, such as testing or charging

but the management interface does not report any kind of problem.

Could this be the source of that terrible performance?

Unfortunately I only recently created this setup, so I can't compare the performance to anything because I've never tested it before.

*** Edit: I removed the battery and re-inserted it..and it stopped blinking...but no performance improvement.....and after a while the battery started blinking again.

idle-jam · ‎01-14-2011

Just as I suspected. Sometimes if the battery is not charging you will need to replace it. A new battery will also take a few hours to charge. I would advise getting it replaced ASAP.

st3reo · ‎01-14-2011

I see.

Well, I'll see what I can do, but I might have a hard time getting HP to replace it under warranty since there isn't any actual warning or failure reported on the EVA.

Thanks a lot,

idle-jam · ‎01-14-2011

I'm not sure, but in my country, with a case like this (slow performance plus the blinking symptom), I would be able to get an engineer on site. You can also ask the HP rep who sold you the unit to assist in escalating the support case.

Good Luck

pinkerton · ‎01-17-2011

The blinking battery LED is most likely due to a firmware problem, see this for more info:

http://translate.google.de/translate?js=n&prev=_t&hl=de&ie=UTF-8&layout=2&eotf=1&sl=de&tl=en&u=http%...

winetou · ‎01-17-2011

How about your resource allocation? Maybe you have a CPU or memory resource limit set for this VM; that could be the reason for poor VM (and also disk) performance...

st3reo · ‎01-17-2011

Hey,

Interesting, seems similar to my problem...but the thing is I already have installed all the latest firmware on EVA and FC-switches.

XCS version:	09534000
XCS build:	CR18CBlep
Management firmware:	mmp-0001.4200-CR0670

This Monday when I got to work, the first thing I did was check the controller enclosure, and the blinking on the battery had stopped. I checked again a few times during the day but I didn't see it blinking again.....which is probably going to make it even harder for me to get it replaced under warranty....and makes me wonder if that even is the source of the problem.

Anyway....I rebooted the controllers.....and I even completely powered off the EVA and started it again...and with just 1 VM guest using it ...the performance was still low.

And no...there is no resource limitation or oversubscription...and as I said I tried from multiple VMs that have storage on the EVA and it's exactly the same....on the other hand to VMs hosted on local datastore of the hosts ..transfer speed is around 60-70MB/s

Paul11 · ‎01-18-2011

I don't think you have a problem with your cache batteries. The blinking is normal. The EVA checks the cache batteries daily, and while the LED is blinking you will see the status "Charging battery" instead of "Holding charge" when you look under the Enclosure tab of Controller A (or B) in EVA Command View. I think you will also see messages like this in the Controller Event Log: "The status of the battery assembly '1' has changed."

That's normal and no problem. I would try to look into the EVA with evaperf to analyse if the EVA is the bottleneck. Try "evaperf vdg -cont" and "evaperf hps -cont" and look at the "Average Read Hit Latency" and "Average Read Miss Latency" and "Average Write Latency". All these values should be below 10ms.

Be sure you have Read cache "On" and "Write-back" cache enabled on your Vdisk (it's the default).

You can also monitor the EVA with perfmon if you prefer a graphical display. If you have high latencies on your EVA, I would look at the SAN switch counters with the "porterrshow" command. Reset the counters with "statsclear" and see whether any error counters increase very fast. Maybe you have a problem with one of the fiber-optic cables or a GBIC. Analyzing performance problems is always a time-consuming job. Good luck.

Paul

J1mbo · ‎01-18-2011

Reading this thread back it seems the testing has been focused on client-to-server performance? I would suggest running local performance testing within the VMs with IOMeter (this may help) focusing in this case on 32K sequential read and write workloads with 32 outstanding IOs. Ensure there are no snapshots on the VM before testing and that the LUN providing the storage is not serving any other IO.
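
For the Linux guest mentioned earlier in the thread, a rough local check could look like the sketch below. It is not a substitute for IOMeter: dd issues one I/O at a time, so it does not reproduce the 32 outstanding IOs suggested above, only the 32K sequential write pattern. The function name `seq_write_test` and the target path in the usage comment are made up for illustration, and `conv=fsync` assumes GNU (or BusyBox) dd.

```shell
#!/bin/sh
# Rough single-stream sequential write check with 32K blocks.
# seq_write_test is a hypothetical helper name, not a standard tool.
seq_write_test() {
  target=$1            # file to write; put it on the disk under test
  size_mb=${2:-1024}   # amount of data to write, in MB (default 1024)
  count=$((size_mb * 1024 / 32))   # number of 32K blocks
  # conv=fsync makes dd flush to disk before it reports the rate,
  # so the guest page cache does not inflate the MB/s figure.
  dd if=/dev/zero of="$target" bs=32k count="$count" conv=fsync
}

# Example (assumed path): write 2 GB to a file on the EVA-backed disk
# seq_write_test /data/ddtest.bin 2048
```

dd prints the elapsed time and throughput on stderr when it finishes; remove the test file afterwards.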

HTH

janeks · ‎01-24-2011

In general we have good performance from our EVA 4400, but what really boosted it was this document:

Configuration best practices for HP StorageWorks Enterprise Virtual Array (EVA) family and VMware v...

In short

The default "Path Preference" when presenting a new LUN from an EVA, is "No Preference".
To make sure the load is evenly distributed between the EVA controllers, select "Path (A|B)-Failover/Failback" when presenting new LUNs.

Even-numbered LUNs go to "Path A-Failover/Failback"
Odd-numbered LUNs go to "Path B-Failover/Failback"
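
The even/odd rule above can be written down as a small shell helper, for example when planning a batch of new LUNs. This is only a sketch: `path_preference` is a hypothetical name, not an EVA or Command View command, and the setting itself is still chosen when presenting the LUN.

```shell
#!/bin/sh
# Map a LUN number to the preferred-path setting described above.
# path_preference is a hypothetical helper for planning purposes only.
path_preference() {
  lun=$1
  if [ $((lun % 2)) -eq 0 ]; then
    echo "Path A-Failover/Failback"   # even-numbered LUN
  else
    echo "Path B-Failover/Failback"   # odd-numbered LUN
  fi
}

# Print the planned setting for LUNs 1 to 4
for n in 1 2 3 4; do
  echo "LUN $n: $(path_preference "$n")"
done
```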

Round Robin

To set the default of a newly installed ESX, log on to the server and run the following commands:
esxcli nmp satp setdefaultpsp --satp VMW_SATP_ALUA --psp VMW_PSP_RR

In addition to using Round Robin, the Load Balancing Selection should be set to IOPS with a value of "1".
This has to be done for every LUN presented to the host. New as well as old.
The following command will configure the EVA LUN load balancing:
esxcli nmp roundrobin setconfig --type "iops" --iops 1 --device naa.xxxxxxxx
For the exact device name, you can look in the vSphere Client or under "/vmfs/devices/disks".

To make the change for all EVA LUNs at a time, you can use the following command:
for i in `ls /vmfs/devices/disks | grep naa.600` ; do esxcli nmp roundrobin setconfig --type "iops" --iops=1 --device $i ; done
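
Before changing every LUN at once, it may be worth printing the commands first and reviewing them. The sketch below does only that: it echoes the same esxcli command as above for each matching device and runs nothing. `print_rr_commands` is a made-up name, and skipping entries that contain a colon assumes partition entries show up in that directory as naa.xxx:1.

```shell
#!/bin/sh
# Dry run: print (do not execute) the round-robin command per EVA device.
# print_rr_commands is a hypothetical helper name.
print_rr_commands() {
  disk_dir=${1:-/vmfs/devices/disks}
  for path in "$disk_dir"/naa.600*; do
    [ -e "$path" ] || continue       # nothing matched the pattern
    dev=${path##*/}                  # strip the directory part
    case "$dev" in
      *:*) continue ;;               # assumed partition entry, skip it
    esac
    echo "esxcli nmp roundrobin setconfig --type \"iops\" --iops 1 --device $dev"
  done
}

# Prints nothing on a machine without /vmfs/devices/disks
print_rr_commands
```

Once the printed list looks right, the original one-liner above can be run as is.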

/Jan

alecprior · ‎01-25-2011

We made this change on our EVA6200 about 18 months ago. Performance improved, as observed above. This also will help stop the LUNs swapping ownership between controllers (which was the problem that made us aware of the need to change the pathing setup).

bolsen · ‎01-25-2011

Alec Prior wrote:
We made this change on our EVA6200 about 18 months ago. Performance improved, as observed above. This also will help stop the LUNs swapping ownership between controllers (which was the problem that made us aware of the need to change the pathing setup).

What made you aware of the problem? We have all our LUNs set for no preference and the controller CPUs are well balanced.

alecprior · ‎01-25-2011

We had the problem on ESX 3.5, where we had static paths defined on the hosts but no preference set at LUN level. In some cases different hosts had different paths specified to the same LUN. The EVA assigns LUNs to controllers based on which controller is using the particular LUN most. The problem we had was that the traffic wasn't consistent, so the LUNs would move between controllers quite often, and each time they did it would take down one or more hosts. It only really showed itself when we had a large vMotion event, such as maintenance mode, causing the traffic bias to change between controllers.

A lot of this is solved with the better pathing in vSphere, so you should be able to run OK with no preference on the LUNs so long as you have round robin set on the hosts to avoid them having a static path through what is essentially the passive controller for the LUN.