2 vCPUs Hardware Interrupts

Wadebum · ‎10-15-2009

We have decided to move our Oracle Forms servers to VMs. We built a Windows 2003 SP2 VM with 2 vCPUs and 4GB of RAM on a new ESX 3.5 host, using its own LUN on an EVA8100. Our DBAs started to do some forms compiling on the VM and noticed that it was taking around 4 times longer than it does on physical servers. They wanted me to add more CPUs, so I added 2, bringing the server up to 4 total. I watched the performance as they reran the forms compile and noticed that the process doing the compiling was only using about 25% of the CPU (my thinking is that the process does not take advantage of multiple cores). I put the server back down to 2 vCPUs, and when the compile was rerun it used about 50% CPU (adding to my suspicion that it does not take advantage of multiple cores).

Since it was a new Oracle install, we thought there might be something different between the physical server and the VM, so I ran a test with a script that gathers all the email addresses in our Active Directory and puts them in a file. I added a function to the script to put a time stamp at the top and bottom of the file so I could compare the physical and virtual servers. The physical server finished the script in around 6.5 min, but the virtual server took around 16-17 min.
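For anyone wanting to repeat this kind of comparison, here is a minimal sketch of the time-stamping idea in Python. The thread does not say what language the original AD script was written in, so this is only an illustration; `timed_run` and the `workload` callable are hypothetical names, and the workload stands in for the real AD query.

```python
import time
from datetime import datetime


def timed_run(workload, out_path):
    """Run workload() and write a start stamp, its output lines, and an end stamp.

    `workload` is a hypothetical stand-in for the real job (for example an AD
    query); it must return an iterable of strings. Returns elapsed seconds.
    """
    started_at = datetime.now()
    t0 = time.perf_counter()          # monotonic clock, safe for intervals
    lines = list(workload())
    elapsed = time.perf_counter() - t0

    with open(out_path, "w", encoding="utf-8") as fh:
        fh.write(f"# started  {started_at.isoformat()}\n")
        for line in lines:
            fh.write(f"{line}\n")
        fh.write(f"# finished {datetime.now().isoformat()} ({elapsed:.1f} s)\n")
    return elapsed
```

Running the same file-producing job through a wrapper like this on both the physical and the virtual server gives directly comparable elapsed times.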

Since we were about to upgrade to ESX 4, we decided not to spend much more time figuring this out. Our upgrade to ESX 4 is now complete, and I have upgraded the VMware Tools and virtual hardware of the VM to the latest versions. It still takes around 16-17 min to run the AD script.

I then built a brand new VM with 1 vCPU and 1GB of RAM in the same LUN as the original server, installed Windows 2003 SP2 on it, and installed the VMware Tools. After that I ran the AD script 5 times and it finished on average in just under 5 minutes. I then installed all available Windows Updates and ran the AD script again with no significant increase in run time.

Next I shut down the VM and added a vCPU. After Windows found the new hardware, I ran the AD script 5 more times, with an average of around 17 min. I watched the server with Process Explorer, and the Hardware Interrupts run at around 20% the entire time the script is running. I used KernView to see what was going on in the Windows kernel, and I get processor, ntkrnlpa, hal, tcpip, win32k, e1000325, ndis, afd, fltmgr, and ntfs as the top 10 items.

Using Process Explorer on the physical server shows the Hardware Interrupts at around 2% at the highest.

I have gone so far as to make this the only VM running on an HP BL465 G5 with 2 quad-core AMD 2.7GHz processors and 32GB of RAM, and to set CPU and memory reservations equal to what the VM has, but it does not perform any better. Does anyone have any idea how to get better performance out of a VM with multiple vCPUs? Or why the Hardware Interrupts are so high?

Wadebum · ‎10-21-2009

Here are some screenshots from ESXTOP of the VM running the script, as well as the output from /proc/interrupts and /proc/vmware/interrupts. I have found a lot of things on the internet and VMware's web site about IRQ problems between USB and NICs, but I don't see that here. I don't know if I am missing something obvious... Any suggestions would be appreciated.

admin · ‎10-21-2009

Hi,

I'm a performance engineer at VMware; Scott Drummonds forwarded this community thread to me.

>>I have found a lot of things on the internet and VMware's web site about IRQ problems between USB and NICs but I don't see that here

You don't have to worry about interrupt sharing in ESX 4.0, since almost all devices (USB included) are now owned by the VMkernel and therefore interrupts never get shared with the service console. I also looked at the /proc/vmware/interrupts screenshot and it looks perfectly fine to me.

However, your esxtop screenshot piqued my interest. VM WBVMTest has 36% USED but its %RUN is 123%. Since your screenshot does not expand the VM group, I can't say for sure why your VM is accumulating more %RUN time than %USED. I have usually seen this happen with wrong CPU affinity settings, but that should also raise %Ready time, and I don't see that in the screenshot. I also see that the %PCPU and %UTIL values do not match for your system; this usually happens when a power management feature is enabled in the BIOS and CPU clock frequency scaling is in effect. Has this host run any FT-enabled VM since its last boot? I want this information to make sense of some of the counters.

Also, it would help if you could capture esxtop in batch mode and upload the CSV file for the case where you run the workload with 2 vCPUs.

Wadebum · ‎10-22-2009

Kichaonline,

Thank you for the reply; I think you may have found the problem already. I have attached two esxtop batch files. EsxtopOUT.csv is from the system with all of the power saving options turned on. EsxtopOUTpower.csv is with the power saving features turned off on the host. The time it takes to run the script is now in line with a physical server: 8.5 minutes.

Here are the options I turned off on the host:

Power Regulator for ProLiant: was HP Dynamic Power Savings Mode, now HP Static High Performance Mode

Ultra Low Power State: was Enabled, now Disabled

Low Power Halt State (AMD C1 Clock Ramping): was Enabled, now Disabled

As for FT, we do not have it set up in our environment.

I will get with the Oracle DBAs to test the forms compiles and report back how that works out.

jpdicicco · ‎10-22-2009

If you plan to take advantage of ESX 4's power management in the future, then you will need to set the power management to "OS Control Mode." I haven't tested the other features in ESX 4 yet, but I have them enabled under the assumption that in "OS Control Mode" only ESX can initiate their use.

JP


admin · ‎10-22-2009

>>I think you may have found the problem already.

I'm glad that the issue is pretty straightforward (though personally I wished it was a little bit more challenging).

>>As for FT we do not have it setup in our environment.

Thanks for confirming this. I had to ask because in esx40 the way we charge CPU cycles changes when you run FT VM(s) on the host.

Let me attempt to explain what you are seeing here.

Your workload (a custom script that pulls information from the AD server) seems to consume 80% of CPU on average. The "Dynamic Power Savings Mode" on HP systems controls the processor P-states based on PCPU utilization. This feature puts the processor into a low-frequency state by default and steps up the clock frequency only if PCPU utilization rises above 60%. When you use a 2 or 4 vCPU VM, you spread the load across multiple PCPUs such that the average utilization of any PCPU remains below 60%, so the processor always runs in the lower-frequency mode, hence the slow performance. On AMD systems the lowest processor frequency can be as low as 50% of the rated processor frequency (i.e. a 3GHz processor will run at 1.5GHz). Ideally this should result in only a 2x performance drop, but I guess the "Low Power Halt State" is possibly also impacting performance (see next paragraph).

"Low Power Halt State", also known as C1E halt, puts the processor into a deeper sleep when idle (for instance, the processor cache can be flushed to save power). This means the processor has to pay a penalty (a few wasted CPU cycles and cache misses) when it wakes up. This is usually fine if your system is mostly idle, but if your processor frequently goes in and out of the idle state (typical of bursty or I/O-bound workloads), it can affect performance quite a lot. Since the load spreads across the vCPUs when using a vSMP VM, the chances of the processor going in and out of the idle state also increase. So I suspect that this also contributes to the performance loss, especially since Windows ping-pongs a single-threaded application across multiple vCPUs to evenly distribute the load (even though this has undesirable effects both natively and in a VM).

In esx40 we now measure CPU utilization in two different ways: one with respect to the P-MAX frequency (i.e. the rated clock frequency) and the other with respect to the current clock frequency. %USED is based on the rated clock frequency; %UTIL is based on the varying clock frequency. So if %UTIL is 100, it means you cannot get any more juice out of the processor, but if the processor is running at half its rated max frequency then %UTIL would be 100 while %USED would be only 50. In your esxtop screenshot I spotted that %USED and %UTIL were not matching; that's why I suspected power management. Also, %RUN of 100 means that the VM was scheduled 100% of the time during the last refresh interval; if %RUN is 100 and %USED is 50, it means the VM used the CPU all the time but burnt only 50% of the processor's cycles with reference to its rated max.
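The arithmetic in this explanation can be put into a few lines of Python. This is a simplified model under stated assumptions, not how ESX or the HP firmware actually computes anything: the load is assumed to spread evenly across PCPUs, the governor is modelled as a single 60% step-up threshold with a 50% low-frequency state (the figures quoted in this reply), and all function names are made up for illustration.

```python
def per_pcpu_util(total_load_pct, n_pcpus):
    """Average utilization per PCPU, assuming the load spreads evenly."""
    return total_load_pct / n_pcpus


def clock_mhz(util_pct, rated_mhz, step_up_pct=60.0, low_fraction=0.5):
    """Toy P-state governor: full clock only above the step-up threshold."""
    return rated_mhz if util_pct > step_up_pct else rated_mhz * low_fraction


def pct_used(pct_util, current_mhz, rated_mhz):
    """%USED is relative to the rated clock, %UTIL to the current clock."""
    return pct_util * current_mhz / rated_mhz


if __name__ == "__main__":
    rated = 3000.0
    for vcpus in (1, 2, 4):
        util = per_pcpu_util(80.0, vcpus)   # ~80% load from the AD script
        mhz = clock_mhz(util, rated)
        print(f"{vcpus} vCPU: {util:.0f}% per PCPU, clock {mhz:.0f} MHz")
    # A fully busy PCPU at half clock: %UTIL 100 but %USED only 50.
    print(pct_used(100.0, 1500.0, rated))
```

In this model the 1 vCPU case stays above the threshold and runs at full clock, while the 2 and 4 vCPU cases never cross it, which matches the pattern seen in the tests above.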

I'm guessing your Oracle Forms app also has a similar CPU utilization pattern, so disabling power management should fix your problem. In general, for benchmarking we recommend that all BIOS power management features be disabled. For production it's a tradeoff, so we leave it to the customer's choice. Also, as you may already be aware, single-threaded apps are better run in a single-vCPU VM. Ping-ponging single-threaded apps across multiple CPUs has a performance impact both at the processor micro-architectural level and in the virtualization layer.

ESX 4.0 has a power management feature, but it is disabled by default. If you want to use it, you should set the BIOS option to "OS Controlled Mode", enable power management in the VMkernel (Advanced Settings), and reboot the host.

Hope this helps.

Wadebum · ‎10-23-2009

>>I think you may have found the problem already.

>>I'm glad that the issue is pretty straightforward (though personally wished it was little bit more challenging)

I think you may just get your wish...

When the DBAs ran the Oracle Forms compile it took 4 minutes on the physical server, but on the VM it took 9 minutes. This is using the "same" server: I P2Ved the physical server to take the Oracle configuration out of the variables. I removed all the "old hardware" and HP hardware programs from the VM. I also had them try running the Oracle compile on a VM that was built as a VM and not P2Ved, and we got a similar 9-minute time frame. The physical and VM servers are comparable in number of CPUs and amount of RAM. I did another esxtop batch collection while the Oracle process was running on the P2Ved Oracle server (ITBTAPP01) and have attached it.

Thank you for all of the information in your last post, it was very good information and helped me understand what those settings are really doing. The description in the BIOS of each makes it sound like the performance hit of each of them is very minor, but understanding how ESX relates to the BIOS settings is great.

admin · ‎10-23-2009

Can you give me details on the storage infrastructure that the native server was using? I'm presuming the storage infrastructure might have changed after the P2V. Also, could you point me to the LUNs that are being used by this VM so that I don't have to comb through all the data in esxtop?

Wadebum · ‎10-23-2009

Sure. The physical Oracle server is using 2 local 72GB SCSI drives in RAID 1 on an HP Smart Array 6i controller. The virtual server is hosted on an HP EVA 8100 in a LUN named EVA8100Lun62; the identifier for that LUN on vmhba0 is naa.600508b400069fab0001200001de0000 and on vmhba1 it is naa.600508b400069fab0001200001de0000.

I guess those are the same identifier now that I look at it... I thought they might be different. Also, I want to add that the LUN is 300GB. The EVA 8100 has 64 300GB 15K drives in it that are all in one disk group.


Wadebum · ‎10-28-2009

I just wanted to check on this and see if you have had time to look at the esxtop.

Thanks!

admin · ‎10-29-2009

Hi,

I had a cursory look at the esxtop data that you provided. There are many LUNs with the same ID prefix naa.600508b400069fab0001200001de0000 (you missed some bytes at the end), so I looked at all of them. There is not much I/O going to these LUNs (only a few hundred IOPS), but whatever I/O is happening has high latency. Some of it is in the range of 40 ms, which is excessive and will definitely have a big performance impact. I don't know how disk-intensive the Oracle Forms workload is, but you may want to monitor the disk latency in esxtop and see if "DAVG" is beyond 15 or 20 ms; if so, you need to look at your storage subsystem for performance problems. If DAVG is consistently less than 10 ms and you still see the performance problem, then please run "vm-support -s" while the workload is running and upload the resulting dump file.
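The decision rule in this post can be written down as a small helper. This is only a sketch: the thresholds are the ones quoted above (taking the lower 15 ms bound for "beyond 15 or 20 ms"), the function name and return strings are invented for illustration, and the samples are assumed to be DAVG values in milliseconds that you have already extracted from esxtop output by whatever means you prefer.

```python
def triage_davg(samples_ms, ok_below_ms=10.0, bad_above_ms=15.0):
    """Classify a series of DAVG latency samples (milliseconds).

    Returns one of three hypothetical labels:
      "check storage"       - at least one sample exceeds bad_above_ms
      "collect vm-support"  - every sample is below ok_below_ms
      "keep monitoring"     - anything in between
    """
    if not samples_ms:
        raise ValueError("need at least one DAVG sample")
    if any(s > bad_above_ms for s in samples_ms):
        return "check storage"
    if all(s < ok_below_ms for s in samples_ms):
        return "collect vm-support"
    return "keep monitoring"


if __name__ == "__main__":
    print(triage_davg([3.1, 4.0, 40.2]))  # one 40 ms spike, as seen in the data
    print(triage_davg([2.5, 3.0, 4.8]))
```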

Wadebum · ‎11-04-2009

Kichaonline,

I have attached an output from vm-support -s. After doing some testing on an EVA 8100 and an EVA 4100, I do see a few strange things happening infrequently, so I moved the VM to local disk. It still takes about the same amount of time even on local storage. I am going to work to see if I can figure out what is going on in the EVAs.

Wade

V_2 · ‎11-04-2009

Wade, are the vmdk(s) aligned?

www.vmware.com/pdf/esx3_partition_align.pdf

Wadebum · ‎11-05-2009

V.

Thanks for the link; I had no idea about that stuff. I set up a new LUN and created a new VMDK there, which I set up with partition primary align=64 and a 32K allocation size. I ran IOMeter with 32K, 50% read, 0% random and had an average read time of 8.34ms and an average write time of 2.84ms, but the maximum read time was 586.12ms and the maximum write time was 579.11ms.

While I was running IOMeter I also ran evaperf to gather data about what was going on in the EVA 4100 at that time. These numbers are for the LUN that the new drive was hosted on: Read Hit Latency max 8.6ms (no average is provided, but I would say the average looks to be about 5ms for the time IOMeter was running); Read Miss Latency max 11.9ms with an average of about 10ms; and Write Latency max 3.8ms with an average of about 2.4ms.

So it does not seem like there is a lot of latency on the EVA (almost 12ms is the highest), so I don't understand where the huge 586.12ms and 579.11ms come from in the IOMeter test.

Wade

admin · ‎11-05-2009

Hi

I haven't had a full look at the snapshot data yet, but I see there are two VMs registered on the host and both of them reside on the SAN LUN. Could you confirm that you uploaded vm-support data for the right host, and if so, please give me a pointer to your VM.

Wadebum · ‎11-06-2009

Sorry about that; the VM is ITBTINB01 and the LUN's identifier is naa.600508b400069eca0005400000ae0000. The other VM, ITEXDB01, was powered off during the testing.

admin · ‎11-08-2009

Wade,

I just got time to look at the performance snapshot. Here is a summary of my observations:

1) Your VM does very little I/O, so we don't have to worry about the EVA issues that you are having. (I do see a lot of SCSI aborts in the logs, so you surely seem to have some storage issues.)

2) It looks like your AMD processor is a Shanghai, which is capable of Hardware MMU, but your VM is not using HW MMU. I see that you are running a Windows 2003 32-bit guest OS; we choose the default virtual machine monitor type based on the guest OS type and processor type. For a Windows 2003 32-bit guest we default to Binary Translation and Software MMU due to an operating system issue that results in performance degradation when hardware assist is used. Microsoft addressed this issue in Windows 2003 SP1. So if you are running Windows 2003 SP1 or above, set the guest OS type accordingly so that the default monitor type is set to hardware assist with HW MMU. For an Oracle database workload you will see a noticeable performance increase with Hardware MMU.

3) Your VM seems to burn CPU evenly, such that each vCPU averages around 50% utilization, but at no point in time do I see the VM's cumulative CPU consumption exceeding 100% CPU, so I suspect you are running a single-threaded workload inside the guest. Windows ping-pongs a single process/thread across both CPUs. Could you check Task Manager inside the guest to confirm if this is happening? A single-threaded application that ping-pongs between both vCPUs will perform poorly with Software MMU; Hardware MMU (or moving to a single-vCPU configuration) will help there.
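The single-threaded heuristic in the last point can be sketched in Python. This is a rough illustration of the reasoning, not a VMware tool: the function names and the 5% tolerance are assumptions, and the per-vCPU samples are taken to be percentages gathered at the same instants (for example from esxtop batch output or the guest's performance counters).

```python
def single_thread_ceiling_pct(n_cpus):
    """Highest whole-machine CPU% one busy thread can show (e.g. 25% on 4 CPUs)."""
    return 100.0 / n_cpus


def looks_single_threaded(samples, tolerance_pct=5.0):
    """Heuristic: cumulative vCPU use never exceeds one CPU's worth.

    `samples` is a list of tuples, one tuple per sampling instant, holding the
    utilization percentage of each vCPU at that instant.
    """
    if not samples:
        raise ValueError("need at least one sample")
    peak_total = max(sum(per_vcpu) for per_vcpu in samples)
    return peak_total <= 100.0 + tolerance_pct


if __name__ == "__main__":
    # Each vCPU hovers near 50%, but the total never passes ~100%.
    ping_pong = [(48.0, 52.0), (55.0, 44.0), (50.0, 49.0)]
    print(looks_single_threaded(ping_pong))
    print(single_thread_ceiling_pct(2), single_thread_ceiling_pct(4))
```

The ceiling function reproduces the 50% on 2 vCPUs and 25% on 4 vCPUs figures reported at the start of the thread, which is what one fully busy thread would look like in each case.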

Wadebum · ‎11-17-2009

Kichaonline,

The VM is Windows 2003 32-bit with SP2, so I changed it to Hardware MMU, but I did not see any performance improvement. I think you are right about the forms building process only using 1 CPU. When we first started playing around with this, the process was using 50% of the total CPU with 2 CPUs; the DBAs wanted 4 CPUs, so I gave them 4, and it only used 25% of the 4 CPUs. Next I modified the VM so it has only 1 vCPU and changed the HAL to Uniprocessor, but that did not produce any performance improvement either.

I wanted to test the VM on local disk and also to try Intel processors, so I dug up an HP DL380 G5 with 2 Intel 2.2GHz quad-core CPUs and 8GB of RAM, with enough storage for all the virtual machine files. I made all the same power saving changes to this host as were done to the other hosts. I ran the Oracle forms process and it still took just as long. I have attached another vm-support -s output from this HP DL380.


Looks like there is a new rule about the size of attachments. Do you have another way I can get the file to you?
