VMs with High Processor\% Interrupt Time\_Total

XR2K · ‎03-08-2010

Please let me know if this is the wrong place for this post; I am new to the community, so please forgive me. Over the past couple of weeks, a growing number of VMs across 3 ESX hosts have been raising high processor interrupt time alerts. The issue started with just two VMs, which reported high processor interrupt times for about eight days with two alerts each day. Then those same two VMs started generating 4 alerts each day. Nine days later another VM started showing the same issue as the previous two. Finally, five days ago two more VMs began reporting high processor interrupt times, so we currently have five VMs with this issue. Here are some details about the VMs, ESX hosts and alerts:

1. ESX Hosts are Dell 2950 servers with 2 quad-core Xeon E5405 procs @ 2.0 GHz each, with 16GB memory and an Emulex LPE 1150 4Gb HBA.

- The Service Console memory is set at 400MB, though the recommended setting is 800MB; we are working on that.

- I have run through the ESX diagnostics data with Dell on each of these hosts to check whether the issue is ESX related, and they have told us that our hardware is fine.

- We have run the latest Dell SUU on each of these servers.

2. The VMs are running Windows Server 2003 R2 Service Pack 2 Standard Edition

- NOTE: The VMs are running various programs, which I cannot disclose.

- The VMs have 2GB or 4GB (3.75 utilized) of RAM and most have only 1 processor.

- The VMs have the ACPI Multiprocessor PC HAL on them versus the Uniprocessor HAL.

- The VMs are spread evenly across the three ESX hosts according to resource usage.

- The VMware Tools are out of date, but this is also the case for those VMs we are not seeing the interrupts on.

- In the VIC we have monitored these VMs

- The virtual memory in each VM's OS is set, by default, to an initial size of 2046MB with a max of 4092MB, and we realize the recommended size is 5758MB.

- Each VM's OS partition is no more than 72% utilized.

3. The alerts are generated from our SCOM (System Center Operations Manager 2007) agents that are installed on each VM.

- The alerts are all around the same time at night between 11:00PM and 12:00AM.

- The values are usually within the same range each day: 30% - 42%.

- I have run multiple Performance Monitors and each shows that there is a drop in Available Memory during that time.

- The Perfmon doesn't seem to pick up the instant when the interrupts occur, even though I set the sample interval to 5 seconds.

- There is one service that always seems to correlate with the interrupts. At the time of the alerts, the System Center DPMRA service is being stopped.

- The System Center Data Protection Manager Recovery Agent is a backup solution which is managed by a third party.

- There are other times through the day when we see the same DPMRA System Event occur, however, no alert is generated.

- The Alerts state the following: "3/2/2010 11:29:15 PM Alert description: The threshold for the Processor\% Interrupt Time\_Total performance counter has been exceeded. The value that exceeded the threshold is: 30.5042504310608."
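For anyone wanting to line the alerts up against Perfmon logs or DPMRA service events, here is a minimal sketch that pulls the timestamp, counter name and value out of an alert description in the format quoted above. The function names (`parse_alert`, `in_nightly_window`) are made up for illustration, and the pattern assumes every alert follows exactly the wording of the one sample shown; it is not taken from any SCOM API.

```python
import re
from datetime import datetime

# Pattern modelled on the single alert quoted in the post (an assumption:
# other alerts are expected to use the same wording).
ALERT_PATTERN = re.compile(
    r"^(?P<when>\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{2}:\d{2} [AP]M) "
    r"Alert description: The threshold for the (?P<counter>.+?) "
    r"performance counter has been exceeded\. "
    r"The value that exceeded the threshold is: (?P<value>\d+(?:\.\d+)?)\.?$"
)


def parse_alert(text):
    """Return a dict with 'when', 'counter' and 'value', or None if no match."""
    match = ALERT_PATTERN.match(text.strip())
    if match is None:
        return None
    return {
        "when": datetime.strptime(match.group("when"), "%m/%d/%Y %I:%M:%S %p"),
        "counter": match.group("counter"),
        "value": float(match.group("value")),
    }


def in_nightly_window(when):
    """True if the alert falls between 11:00PM and 12:00AM."""
    return when.hour == 23


if __name__ == "__main__":
    sample = (
        r"3/2/2010 11:29:15 PM Alert description: The threshold for the "
        r"Processor\% Interrupt Time\_Total performance counter has been "
        r"exceeded. The value that exceeded the threshold is: 30.5042504310608."
    )
    alert = parse_alert(sample)
    print(alert["when"], alert["counter"], alert["value"])
    print("in nightly window:", in_nightly_window(alert["when"]))
```

Sorting the parsed timestamps next to the System event log entries for the DPMRA service would make it easier to see whether the alerts really do coincide with the service stopping.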

Any help would be greatly appreciated. I am new to VMware, but I have had enough interaction with the software and hardware to do some troubleshooting.

FredPeterson · ‎03-08-2010

Processor Interrupts are IRQ interrupts basically - something is wanting serious access to a device that has an interrupt and that is usually hardware.

At first glance I'd say it's disk. Have you checked latency data during the period? What about other utilization in terms of throughput and commands per second?

>The VMs have 2GB or 4GB (3.75 utilized) of RAM and most have only 1 processor.

>The VMs have the ACPI Multiprocessor PC HAL on them versus the Uniprocessor HAL.

This could also be an issue. Windows thinks it can schedule a hardware device (CPU) that isn't actually there. In most cases this isn't an actual problem because CPU usage is generally very low for VMs, but if CPU usage spikes for an extended period the scheduler could get all confused as it attempts to ask for hardware access that is not there.
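To spot this kind of mismatch across a list of VMs, a minimal sketch like the one below could be used. It only compares a HAL name string against a vCPU count; how you collect those two values (Device Manager, a script, an inventory export) is left out, and the function name and the inventory entries are hypothetical.

```python
def hal_mismatch(hal_name, vcpu_count):
    """Return a short description of the mismatch, or None if HAL and vCPUs agree.

    hal_name is the HAL as Windows reports it, e.g. "ACPI Multiprocessor PC"
    or "ACPI Uniprocessor PC". Any other name is treated as "no opinion".
    """
    name = hal_name.lower()
    if "multiprocessor" in name and vcpu_count == 1:
        return "multiprocessor HAL on a single-vCPU VM"
    if "uniprocessor" in name and vcpu_count > 1:
        return "uniprocessor HAL on a multi-vCPU VM"
    return None


if __name__ == "__main__":
    # Hypothetical inventory: (VM name, HAL name, vCPU count).
    inventory = [
        ("vm-a", "ACPI Multiprocessor PC", 1),
        ("vm-b", "ACPI Multiprocessor PC", 2),
        ("vm-c", "ACPI Uniprocessor PC", 1),
    ]
    for vm, hal, vcpus in inventory:
        problem = hal_mismatch(hal, vcpus)
        print(vm, "->", problem or "HAL matches vCPU count")
```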

XR2K · ‎03-08-2010

Mr. Peterson:

I am going to run a series of fresh Performance Monitors tonight during the time when the interrupts normally occur, and I will then run the PAL on them tomorrow morning. I have seen many threads associating poor VM performance with the HAL that is in place; however, from those threads it seems the only way to correct the issue is to rebuild the VM from scratch, which is not an option for us at this point. I am also having our NOC run top commands on the ESX hosts to see if there are any other reasons for this issue. As far as disk latency goes, the following are statistics from reports that I have already run during the problem time span:

Physical Disk Read Latency: .004 max

Physical Disk Write Latency: .002 max

Logical Disk Read Latency: .004 max

Logical Disk Write Latency: .002 max

As far as Throughput goes I will post more details tomorrow after reviewing the results of the Performance monitors.

Thank you for your help. I will post again tomorrow.

-XR2K

XR2K · ‎03-09-2010

After running more Performance Monitors on the systems last night here are the results:

There were a total of 5 alerts during our testing period.

\Processor(*)\% Privileged Time

- Max 31% (Spike)

\Memory\Available MBytes

- 2 instances of a decreasing trend; one that was -51MB per hour and another that was -16MB per hour.

\Memory\Pages Input/Sec

- 2 instances where the number of reads exceeded 10 pages per second; one was 12 reads per second and the other was 75 reads per second.

We received, again, the same messages about Processor\% Interrupt Time\_Total from our SCOM alerts; however, the Performance Monitor never saw the spike that was indicated by SCOM. I have attached the performance analysis to this thread for review.
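A per-hour trend figure like the -51MB per hour above can be reproduced from raw \Memory\Available MBytes samples with a least-squares slope. This is only a sketch of that arithmetic, not necessarily how the analysis tool computes it; the function name and the sample data are made up for illustration.

```python
def trend_mb_per_hour(samples):
    """Least-squares slope of Available MBytes over time, in MB per hour.

    samples is a list of (seconds_since_start, available_mbytes) pairs.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    numerator = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    denominator = sum((t - mean_t) ** 2 for t, _ in samples)
    if denominator == 0:
        raise ValueError("all samples share the same timestamp")
    return numerator / denominator * 3600.0  # MB per second -> MB per hour


if __name__ == "__main__":
    # Made-up data: one sample every 5 seconds for an hour, leaking 51 MB/hour.
    samples = [(t, 1000.0 - 51.0 * t / 3600.0) for t in range(0, 3601, 5)]
    print(round(trend_mb_per_hour(samples), 2))
```

A steadily negative slope over several nights would point toward a leak, while a dip that recovers after the backup window would not.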

XR2K · ‎03-12-2010

Okay, so after further investigation and tireless rounds of performance monitoring I found the following article, which helped me in tracking down an issue: [Technet: User Perfmon to Diagnose Common Server Performance Problems|http://technet.microsoft.com/en-us/magazine/2008.08.pulse.aspx?pr=blog]

I checked the System: Processor Queue Length and found that during normal operations the servers are running a queue of 3-6, and at peak around 14. Based on the article I shouldn't be seeing more than 2 at any given time, and some threads I have seen state that "The Processor Queue Length shouldn't be two times the number of processors on the system." Since these servers have 1 virtual proc allocated to them, even an average queue of 3-6 is a bad thing; please correct me if I am mistaken. So at this point we are going to try adding another proc to one of the virtual servers to see if that takes care of the performance issue.
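The rule of thumb above can be written out as a small check. This sketch assumes the limit is read as 2 times the number of processors, and that a queue length above that limit counts as a problem; the function names are made up for illustration.

```python
def queue_limit(processors, per_processor_limit=2):
    """Rule-of-thumb ceiling for System\\Processor Queue Length."""
    return per_processor_limit * processors


def exceeds_limit(queue_length, processors, per_processor_limit=2):
    """True if the observed queue length is above the rule-of-thumb ceiling."""
    return queue_length > queue_limit(processors, per_processor_limit)


if __name__ == "__main__":
    # Figures from the post: 1 vCPU, queue of 3-6 normally, about 14 at peak.
    for observed in (3, 6, 14):
        print(observed, "on 1 vCPU exceeds limit:", exceeds_limit(observed, 1))
```

Under this reading the ceiling for a 1 vCPU VM is 2, so every figure quoted above is over it; with 2 vCPUs the ceiling becomes 4.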

I'll post back when I can get this implemented. Any other comments or suggestions would still be greatly appreciated. Thank you.

AlexPT · ‎05-13-2010

You state:

- The VMs have 2GB or 4GB (3.75 utilized) of RAM and most have only 1 processor.

- The VMs have the ACPI Multiprocessor PC HAL on them versus the Uniprocessor HAL.

I am guessing you don't really mean that you've presented 1 vCPU and have the Multi Proc' HAL installed in Windows, do you? That's a big deal if you ask me. I am sure this is not the case as you appear savvy. But if it is the case, this is probably your issue.

You fixed it yet?

PS I have exactly the same issue. My test VM runs with 2 vCPUs and hogs 25% of the CPU when sat idle. Hardware Interrupts are usually the cause when viewed in Proc' Explorer. No one seems to have fixed this challenge. There is a PS script which checks the HAL - see attached

XR2K · ‎05-13-2010

AlexPT:

I am savvy, depending on what day it is; the downside is that I didn't do the initial installation/configuration of these VMs, and the answer to your question is yes. The person who installed these VMs created them with the ACPI Multiprocessor PC HAL rather than the Uniprocessor HAL while only assigning 1 vCPU to each VM. However, at this point we have assigned 2 vCPUs per VM and we have not had any issue since. I do have a slight fear that this is simply a growing issue and that as time passes we will see these messages again. Another issue is that these are in our Prod environment and correcting the HAL would mean a complete rebuild of the VM, and at this point that is not a possibility. Now I find that our System Processor Queue Length on each of the machines is around 0 or 1 most of the time. There are occasional spikes of 2 or 3, but that is about it. I am not ready to call this issue solved, because it seems to me there is much more to it than meets the eye, so I am waiting to see if we have any further developments so that I can really run this issue down. On the other hand, if this situation returns I may be led to believe that there is a memory leak or a thread issue with an application running on the VMs, which obviously begs for further investigation. However, at this point I am carefully monitoring the VMs to see if we have an answer or just a Band-Aid covering a compound fracture. Thank you for your reply.

Sincerely,

XR2K

AlexPT · ‎05-13-2010

Yeah, I'd bet my bottom dollar that the HAL mismatch was the issue for you there.

On my side I am seeing average Proc' queueing figures of 8, with a max of 42. It's definitely HAL related, I feel. My VM was P2V'd I think.
