VMware Cloud Community
Macomar
Contributor

What is wrong with my infrastructure?

Hello everybody,

I am currently building the following infrastructure on VMware vSphere 5.1 Enterprise. The following is given:

Hardware:

  • 7x HP DL380 G7 servers, each with 2 CPU sockets and 192 GB RAM.
  • Every server is equipped with 12 network cards.
  • The storage is a NetApp FAS2240-2 with 2 controllers and 24 hard disks.
  • Each NetApp controller has a mezzanine card, providing 2x 10 GbE NICs per controller.
  • The 10 GbE NICs on the NetApp are bundled into a virtual interface on each controller.
  • The switches are Cisco WS-C3750X-24, also equipped with 10 GbE modules.
  • The Cisco switch model is certified by NetApp; the cabling from storage to switch was supplied directly by NetApp.

Software:

  • Each HP server runs VMware ESXi 5.1 build 799733.
  • The ESXi installation image is the HP-customized image with the server drivers included.
  • The NetApp runs Data ONTAP 8.1.1P1 in 7-Mode.
  • We use the software iSCSI HBA on the ESXi servers.

Now the big problem:

  • I converted some physical servers to the VM infrastructure with VMware Converter.
  • Some of these servers run MS SQL Server and Oracle.
  • The problem is that storage I/O over iSCSI is now slower than before.
  • I am posting Iometer screenshots in this thread.

Virtual networking in vCenter is built as follows:

  • Four NICs per server in the VM cluster send the storage traffic to the Cisco switch.
  • The Cisco forwards the traffic over its 10 GbE modules to the NetApp.
  • I followed the how-tos to set up iSCSI multipathing on the vSwitch that sends the storage traffic to the NetApp (a rough esxcli sketch of that binding follows this list).
  • The virtual machine VMDKs are stored in one big aggregate on the NetApp.
  • The aggregate is split into three volumes (one for the NetApp WAFL system and two for VM data).
  • Each of the two data volumes contains one LUN where the VMDKs are stored.
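For reference, this is roughly what that software iSCSI port binding looks like from the ESXi 5.1 shell; a minimal sketch, assuming the software iSCSI adapter is vmhba33 and the four iSCSI VMkernel ports are vmk1 to vmk4 (the adapter and vmk names are placeholders, not taken from this setup):

    # find the name of the software iSCSI adapter (e.g. vmhba33)
    esxcli iscsi adapter list

    # bind each iSCSI VMkernel port (one active uplink each) to the software iSCSI adapter
    esxcli iscsi networkportal add --adapter=vmhba33 --nic=vmk1
    esxcli iscsi networkportal add --adapter=vmhba33 --nic=vmk2
    esxcli iscsi networkportal add --adapter=vmhba33 --nic=vmk3
    esxcli iscsi networkportal add --adapter=vmhba33 --nic=vmk4

    # verify the bindings and rescan the adapter
    esxcli iscsi networkportal list --adapter=vmhba33
    esxcli storage core adapter rescan --adapter=vmhba33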

Now I am experimenting with the VMware best practice guide "Oracle Databases on VMware". What is interesting is that the performance is equally bad on every server and virtual machine.

It does not matter whether the virtual machine runs MS SQL, Oracle, or is just a plain Windows 2008 server with nothing on it: Iometer still shows bad results for 4K block reads and writes.

So the question here is: what is wrong?

Many thanks for a solution,

Marc

32 Replies
TomHowarth
Leadership

When you converted the servers, did you run post-migration cleanup processes?

Things like clearing out shadow (ghost) devices, etc.
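For example, one common cleanup on a converted Windows guest is removing the non-present (ghost) hardware left over from the physical box; a minimal sketch using the standard Windows technique, run from an elevated command prompt inside the VM:

    rem make Device Manager also show devices that are no longer present
    set devmgr_show_nonpresent_devices=1
    start devmgmt.msc

    rem in Device Manager: View -> Show hidden devices, then uninstall the
    rem greyed-out old NICs, storage controllers and other phantom hardware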

Tom Howarth VCP / VCAP / vExpert
VMware Communities User Moderator
Blog: http://www.planetvm.net
Contributing author on VMware vSphere and Virtual Infrastructure Security: Securing ESX and the Virtual Environment
Contributing author on VCP VMware Certified Professional on VSphere 4 Study Guide: Exam VCP-410
logiboy123
Expert

I wouldn't P2V a database server, personally. I know it can be done and can be successful, but I'd rather build a new server and migrate the databases to it manually. This typically gives a much better result.

TomHowarth
Leadership

Agreed, but an amalgam of both usually works.

Whenever I am P2V'ing a DB or Exchange server (read: any server with heavy data change), I get a migration window where I can stop all necessary services.

This leads to a clean migration. Never convert a DB hot. :)

This is the main reason that AD controllers cause so many issues: their services cannot be shut down, so they are never in a quiesced state.
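For a SQL Server or Oracle box, that final sync window usually just means stopping the database services first; a minimal sketch (Windows service names vary per install, the Oracle SID "ORCL" and home name are only placeholders):

    rem stop the default SQL Server instance before the final Converter sync
    net stop MSSQLSERVER

    rem stop an Oracle instance and its listener (names depend on SID and Oracle home)
    net stop OracleServiceORCL
    net stop OracleOraDb11g_home1TNSListener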

Tom Howarth VCP / VCAP / vExpert
VMware Communities User Moderator
Blog: http://www.planetvm.net
Contributing author on VMware vSphere and Virtual Infrastructure Security: Securing ESX and the Virtual Environment
Contributing author on VCP VMware Certified Professional on VSphere 4 Study Guide: Exam VCP-410
Macomar
Contributor

It is not only the P2V machines; we also created brand-new machines on the infrastructure to build a test environment.

We only converted the production servers with VMware Converter.

The performance is equally bad on the converted and on the newly created VMs in the cluster.

I have now redone the disk setup based on the VMware Oracle best practices how-to:

  • First I created a separate volume on the NetApp with a single LUN and mapped it to one ESX server in the test cluster.

separated_volume.jpg

separated_lun.jpg

  • The next step was to add a new hard disk to the virtual machine with virtual device node SCSI (1:0) and change the new SCSI controller to paravirtual.

parav_hdd.jpg

  • On the Windows machine I added the new hard disk and formatted it with an NTFS cluster size of 64K (64 kilobytes); the format command is sketched at the end of this post.

disk_alignment.jpg

  • As you can see in esxtop while running my benchmarking tool, the data is flowing through my 4 iSCSI multipaths.

esxtop_iscsi_vswitch.jpg

  • But the performance is miserable, especially for the 4K blocks relevant for databases.

crystal_disk_mark_2.jpg

crystal_disk_mark_3.jpg
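For completeness, the 64K formatting mentioned above was done with a command along these lines (the drive letter is just an example; Windows 2008 already aligns new partitions at 1 MB by default):

    rem format the new data disk with a 64 KB NTFS allocation unit size
    format F: /FS:NTFS /A:64K /Q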

jdptechnc
Expert

I think you have done everything right on the vSphere side.

What is the NetApp storage system telling you? Are you hitting any CPU spikes or back-to-back CPs? Are you overdriving your spindles? What type of disks are you using? Any errors showing up in the messages file? I think your storage system is the most likely bottleneck at this point. Have you tried reaching out to NetApp? Their performance support folks are pretty good.
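For example, on a 7-Mode controller you can watch this live from the console; a minimal sketch using standard Data ONTAP 7-Mode commands:

    # one-second interval: CPU, iSCSI ops, disk utilization and CP type
    # (a "B" in the CP type column means back-to-back CPs)
    sysstat -x 1

    # check the messages log for disk or network errors
    rdfile /etc/messages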

Please consider marking as "helpful", if you find this post useful. Thanks!... IT Guy since 12/2000... Virtual since 10/2006... VCAP-DCA #2222
Macomar
Contributor

Hello all,

I did a lot of testing over the last two days and found out the following:

On another infrastructure I have two physical HP DL380 G7 servers running Oracle RAC on Red Hat Enterprise Linux.

The servers are connected via iSCSI (network bonding on the Red Hat side with 2 NICs) to the same Cisco switches (3750) and the same NetApp storage (2240-2).

I found an I/O testing script on this page http://benjamin-schweizer.de/measuring-disk-io-performance.html and ran it on one Oracle server.
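(For anyone not following the link: the script basically times small random reads against a disk. A minimal Python sketch of the same idea, assuming you pass a large test file or a block device as the first argument; the original script is more rigorous and uses direct I/O:)

    import os, random, sys, time

    def random_read_iops(path, block_size=4096, runtime=10):
        # issue random block_size reads for runtime seconds, return reads per second
        fd = os.open(path, os.O_RDONLY)
        try:
            size = os.lseek(fd, 0, os.SEEK_END)
            blocks = size // block_size
            count = 0
            start = time.time()
            while time.time() - start < runtime:
                os.lseek(fd, random.randrange(blocks) * block_size, os.SEEK_SET)
                os.read(fd, block_size)
                count += 1
        finally:
            os.close(fd)
        return count / (time.time() - start)

    if __name__ == '__main__':
        # without O_DIRECT the OS page cache can inflate the numbers, so run it
        # against a raw device or a file that is much larger than the host's RAM
        print('%.0f IOPS (4 KiB random reads)' % random_read_iops(sys.argv[1]))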

The results were:

python_script_oracle.png

In real time, sysstat on the NetApp shows the following:

sysstat_netapp_iops.png

So this tells me that the iSCSI performance between a physical server and my NetApp is OK!

Now comes the thing I don't understand... I ran the same test on a Linux system in the virtual infrastructure:

python_script_linux_vm.png

Looks OK, but in real time on my NetApp this is what is going on:

sysstat_netapp_linux_iops.png

So this is weird... my Linux VM tells me I have good IOPS and throughput on the virtual system, but down on the NetApp there is actually almost nothing going on.

How can this be? When I run the test on a real hardware cluster with the same iSCSI connection via the Ciscos and the same NetApp storage the performance is fine, but in VMware it is SO BAD???????

Something in VMware must be killing my performance towards the storage system, and no, the switches are OK; we compared the configuration of the Ciscos in both environments.

To summarize:

Physical servers -> 2 NICs bonded and connected to the Ciscos -> the Cisco sends the data through a trunk of 4 ports to the NetApp -> on the NetApp all 4 NICs are bundled into a virtual interface.

Virtual infrastructure -> 1 ESX server -> 4 NICs go to the Cisco with iSCSI multipathing (see the config screenshot above) -> the connection between the Ciscos and the NetApp is 10 GbE, over the NetApp mezzanine card and the 10 GbE modules in the Cisco.

Physical servers -> IOPS and throughput OK

Virtual infrastructure -> IOPS and throughput miserable and not acceptable

PLEASE HELP!!

Many thanks in advance for any help... I'm going mad over this!

Greetings, Marc

zialex
Contributor

Was the multipathing policy changed to Round Robin?


Macomar
Contributor

Yes, all added datastores are set to Round Robin (multipathing) and all paths are set to Active (I/O).

vmware_round_robin.jpg

stainboy
Contributor

Hi. Was RR there by default or did you change it? How did you do it? On the CLI, so that every newly added LUN would be RR, or manually in vCenter?

Macomar
Contributor

Hi, first we created a vSwitch in vCenter. After that we configured the multipathing for iSCSI. Then we created the software iSCSI HBA and mapped a LUN from the NetApp. After that, we went into the properties of the mapped LUN (datastore) and changed "Manage Paths" from Most Recently Used (VMware) to Round Robin. All of this we did manually in vCenter.

Macomar
Contributor

News from HP Support: they think the NC364T quad-port NICs are not supported for the DL380 G7 server series... ha ha ha... On HP's product page there is a compatibility list, and look at this -> HP Product Bulletin -> HP ProLiant DL380 G7... oops!!!

From VMware Support we got the answer to our call that maybe the software iSCSI HBA is the problem, and that with it VMs are limited to about 4000 IOPS. They told us to use a real hardware iSCSI HBA. Dear VMware Support, the DL380 G7 has 4 Broadcom onboard NICs which show up as HW iSCSI HBAs in the vSphere Client, and nobody can tell us how to use them as hardware iSCSI... so now what???
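In case it helps, you can see how ESXi actually classifies those Broadcom ports from the shell; a minimal sketch (adapter names differ per host):

    # the Broadcom bnx2i dependent-hardware iSCSI ports show up as their own vmhba
    # entries next to the software iSCSI adapter; note that a dependent hardware
    # iSCSI adapter still needs a VMkernel port bound to it, like the software one
    esxcli iscsi adapter list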

stainboy
Contributor

OK, you could just run a command to apply it to all your LUNs, but that is OK (a rough sketch follows below). The next thing I would do is install the NetApp plugin in vCenter and apply all recommended settings. If I remember correctly it is MPIO, NFS and another one I don't remember, but it is important. Just apply ALL recommended settings; it will change the queue depth and some more things. Those are for sure the NetApp recommended settings, so start there. I had a similar issue with a FAS2240 with iSCSI and vmknic binding for multipathing: when I changed to RR I got huge latencies and a lot of problems. Apply the settings and reboot.
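The command mentioned above would look roughly like this from the ESXi 5.1 shell; a minimal sketch (the NAA ID is a placeholder, and check which SATP your NetApp LUNs are actually claimed by before changing the default):

    # show the current SATP and path selection policy per device
    esxcli storage nmp device list

    # set Round Robin on a single LUN (replace the naa ID with your own)
    esxcli storage nmp device set --device naa.xxxxxxxxxxxxxxxx --psp VMW_PSP_RR

    # or make Round Robin the default for every LUN claimed by that SATP,
    # so newly added LUNs get RR automatically
    esxcli storage nmp satp set --satp VMW_SATP_ALUA --default-psp VMW_PSP_RR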

stainboy
Contributor

No. It shows up as HW iSCSI, but it is not. Try to configure it in vCenter and you'll see. Just be sure to follow the installation of the NetApp plugin, the VSC, apply the settings, reboot and try again.

stainboy
Contributor

Oh, and another thing... everything is on the sabe network, right? No L3 in the middle...

Macomar
Contributor

Hello. We already installed the NetApp VSC plugin in vCenter. After rebooting the servers, all three settings (MPIO, NFS, and the other one) show status "green". What I am wondering about is that the next day, when I look into vCenter -> NetApp plugin, the MPIO setting of the server, for example, has status "red"... what is this??? It doesn't stay in the green state.

vsc_plugin.jpg


Macomar
Contributor

Sorry for my misunderstanding, but what do you mean by "sabe network" and "L3 in the middle"?

stainboy
Contributor

Sorry, same network... no routing in between.

So a red state tells you something is not correct. MPIO is the setting for multipathing. Makes no sense calling it MPIO in my opinion, but that's another discussion.

So go through the details and check for the red lines; they show you exactly which settings were not applied and could point you in the direction of your problem. You can check that by going to the same place where you apply the settings; there is something like "details". It is a LONG list of ALL the settings, so it can be tedious, but in many cases it is worth the time. It will show you in red which settings failed to apply. In the meantime, I'll try to remember what I did to mitigate that problem. But check it out.

stainboy
Contributor

Is your NetApp active/active or active/passive with ALUA?
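One rough way to see what the ESXi host thinks about that, from the shell:

    # the claimed Storage Array Type Plugin per LUN: VMW_SATP_ALUA means the array
    # presents ALUA target port groups, VMW_SATP_DEFAULT_AA means plain active/active
    esxcli storage nmp device list

    # per-path state (active, active unoptimized, standby)
    esxcli storage core path list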

stainboy
Contributor

Never mind that.
