VMware Cloud Community
kwayley
Contributor

Agent java process using 100% CPU

Hi,

I have Hyperic 4.0.1 monitoring 11 platforms. On 10 of them it is working great, but on the remaining one the Java process is hogging all the CPU and I don't know why.

Basically there are 2 web servers that are both being monitored. They use the same hardware and were set up by the same people using the same binaries at the same time. All config files are identical. One machine works perfectly and the other is having this problem with Java.

I have tried restarting the agent, testing newer versions of the agent, and excluding unwanted plugins in the agent.properties file, but there is still no change.

The only time I was able to get the CPU back down was by turning off collection of all metrics and restarting the agent. I then added the metrics back one by one and the CPU usage stayed at a negligible level. It remained this way for about 2 weeks but has now started to creep back up again, and no change has been made to the machine that I am aware of.

Does anyone have any idea what could be causing Hyperic/Java to do this?

Thanks,

Andy

excowboy
Virtuoso

Hi,

What OS are you running on the server? Are you using an HQ Agent package with the bundled JRE?
Check whether there are special JAVA_HOME settings for the user that starts the Agent.
Have you already upgraded the HQ Agent on that host to the latest version, 4.0.3?

Cheers,
Mirko

kwayley
Contributor

Hi,

All the machines are running Red Hat Enterprise Linux 4.

Originally I was using the bundled JRE. I have since tried a different JRE, but that had no effect.

I have also tried 4.0.2 and 4.0.3 and still get the exact same symptoms.

There are no special JAVA_HOME settings for the hyperic user that I can see.

The thing I'm finding most odd is that the CPU usage is not constant at the moment; it seems to spike about every 3-4 minutes. At first I thought it had something to do with the agent talking to the server, but that should be happening more often, so I ruled that out.

Thanks,

Andy

Additional thought: I've changed the agent on the server I'm having problems with, so all the files have changed, and yet the problem remains. Is it possible that the problem lies at the server end in some way?

Message was edited by: kwayley

KPCasting
Contributor

Hi,

I can confirm what kwayley is seeing. Same problem here.

I have 2 servers monitored with Hyperic 4.0.3 with identical hardware and configuration features. One works perfectly and the other doesn't.

Even with agent.logLevel=DEBUG I can't detect the problem. The agent's Java process takes up 100% of the CPU for 2 minutes every 10-15 minutes.

Services running on this machine are:

SSH
PostgreSQL
MySQL
NTPD
Apache


As a temporary measure I have installed hyperic-hq-agent-3.2.6, and it now works correctly.


Regards,
KPCasting

Message was edited by: KPCasting

kwayley
Contributor

I just tried what KPCasting did and installed 3.2.6 on this machine, and likewise my CPU load has dropped dramatically. I can still see the CPU climbing to 100% just as before, but the time it stays at that level has fallen right down.

The average CPU usage on this machine is now about 15% but it is about 5% on its twin.

mcmesser
Hot Shot

Can you try starting the 4.0.3 agent without the Java Service Wrapper? Perhaps the Wrapper is configured in a specific way that is not playing nicely with your system?

./bundles/agent-4.0.3-EE/bin/hq-agent-nowrapper.sh start

kwayley
Contributor

I tried running hq-agent-nowrapper.sh, and although the performance was better it was still not as good as that of the other node. CPU on this one was still hitting 100%, whereas on the other it never went above 6%.

Running the 4.0.1 agent without the wrapper performs similarly to running the 3.2.6 agent as normal. Both leave average CPU usage approximately 10% above that of the other, identical machine.

jvalkeal_hyperi

Is this a Linux or a Unix system? If you are familiar with prstat and jstack, try to check which thread is taking most of the CPU cycles.

The Java stack could then tell a bit more about which part of the agent is causing the load.
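
Editor's sketch of the thread-hunting idea above, for Linux: prstat is a Solaris tool, but top -H -p <pid> lists per-thread CPU usage with decimal thread IDs, and a regular HotSpot thread dump (from jstack <pid> or kill -3 <pid>) labels each thread with its native ID in hex as nid=0x... The Python below is only a minimal illustration and not part of Hyperic; the function names and the dump file argument are hypothetical, and it assumes the usual dump layout in which thread blocks are separated by blank lines. It does not handle the "Thread NNN: (state = ...)" style of output.

```python
import re
import sys
from typing import Optional


def tid_to_nid(tid: int) -> str:
    """Convert a decimal thread ID (as shown by `top -H`) to the hex form used after nid=."""
    return hex(tid)


def find_thread(dump: str, tid: int) -> Optional[str]:
    """Return the thread-dump block whose header contains nid=<hex tid>, or None."""
    pattern = re.compile(r"\bnid=" + re.escape(tid_to_nid(tid)) + r"\b")
    # Assumption: thread blocks in the dump are separated by blank lines.
    for block in re.split(r"\n\s*\n", dump):
        if pattern.search(block):
            return block.strip()
    return None


# Hypothetical usage: python find_thread.py <decimal-tid> <thread-dump-file>
if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[2]) as handle:
        match = find_thread(handle.read(), int(sys.argv[1]))
    print(match if match else "no thread with that nid in the dump")
```

Taking the busiest thread ID from top -H while the CPU is pegged and looking it up this way should show which part of the agent is running at that moment, provided the dump was taken during the spike.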

kwayley
Contributor

It is a Red Hat Enterprise Linux 4 system (2.6.9-55.ELsmp).

I don't have prstat. I tried jstack, but all I get is a load of errors:

"Thread 28314: (state = BLOCKED)
Error occurred during stack walking:
sun.jvm.hotspot.debugger.DebuggerException: sun.jvm.hotspot.debugger.DebuggerException: get_thread_regs failed for a lwp
at sun.jvm.hotspot.debugger.linux.LinuxDebuggerLocal$LinuxDebuggerLocalWorkerThread.execute(LinuxDebuggerLocal.java:134)
at sun.jvm.hotspot.debugger.linux.LinuxDebuggerLocal.getThreadIntegerRegisterSet(LinuxDebuggerLocal.java:437)
at sun.jvm.hotspot.debugger.linux.LinuxThread.getContext(LinuxThread.java:48)
at sun.jvm.hotspot.runtime.linux_x86.LinuxX86JavaThreadPDAccess.getCurrentFrameGuess(LinuxX86JavaThreadPDAccess.java:75)
at sun.jvm.hotspot.runtime.JavaThread.getCurrentFrameGuess(JavaThread.java:252)
at sun.jvm.hotspot.runtime.JavaThread.getLastJavaVFrameDbg(JavaThread.java:211)
at sun.jvm.hotspot.tools.StackTrace.run(StackTrace.java:50)
at sun.jvm.hotspot.tools.JStack.run(JStack.java:41)
at sun.jvm.hotspot.tools.Tool.start(Tool.java:204)
at sun.jvm.hotspot.tools.JStack.main(JStack.java:58)
Caused by: sun.jvm.hotspot.debugger.DebuggerException: get_thread_regs failed for a lwp
at sun.jvm.hotspot.debugger.linux.LinuxDebuggerLocal.getThreadIntegerRegisterSet0(Native Method)
at sun.jvm.hotspot.debugger.linux.LinuxDebuggerLocal.access$800(LinuxDebuggerLocal.java:34)
at sun.jvm.hotspot.debugger.linux.LinuxDebuggerLocal$1GetThreadIntegerRegisterSetTask.doit(LinuxDebuggerLocal.java:431)
at sun.jvm.hotspot.debugger.linux.LinuxDebuggerLocal$LinuxDebuggerLocalWorkerThread.run(LinuxDebuggerLocal.java:109)"

There is more than that but you get the idea.

SLTB
Enthusiast

I am running an agent (4.0.3) on Windows with no wrapper.
I have seen the agent go up to 50% CPU when monitoring via HQ (1-minute polling).
If I monitor with perfmon I can see it go even higher (very short in duration, but with high peaks).

I would like to know:

1. What would be normal CPU usage for the agent? (I would expect no more than 2%.)
2. What parameters can be used to optimize it (debug level? excluding plugins? reduce interval of log tracking?...)

kwayley
Contributor

Yesterday morning, just after a scheduled backup task that runs every morning and uses 100% of both CPU cores, the agent's CPU usage dropped to the level I would expect given what the other node is doing.

Unfortunately I am still running the old agent and I don't really want to put the new agent back again now until I can feel confident that the CPU isn't just going to shoot up again.

Can anyone think why the CPU would drop down all by itself like this?

I have also noticed this when applying the exclude.plugin option to all of the servers that are being monitored. Initially the memory used by the agent increases and then after a day to a week it drops right down for no obvious reason.

kwayley
Contributor

Okay, this is really starting to bug me now. At the beginning of last week the agent ran perfectly, but since then the CPU usage has been creeping back up.

To give you an idea of what I've been dealing with please see the attached picture. The red line is purely because of Hyperic. The green line shows the equivalent CPU core from the other machine.

SLTB
Enthusiast

Can you run the agent in debug and upload the log file?
It would be interesting to see if it is related to log tracking.
For me the problem went away completely when I reduced how often log tracking runs (although that was on Windows).

kwayley
Contributor

Log file attached while running agent 3.2.6.

Would it be helpful to see the log from a 4.0.* agent as well?

kwayley
Contributor

Both versions of the agent are still causing me problems as they are both being far too CPU hungry. Does anyone have anything additional I could try in order to get it fixed?

SLTB
Enthusiast

Although I can't know for sure that log tracking is involved, as it was in my case, I would try running the agent without log tracking, or with a very high interval for it, and compare the CPU usage.

Try putting this in the agent.properties:
track.interval=99999

(the units are in seconds)

kwayley
Contributor

Thanks, I'll give that a try and see what happens. 🙂

kwayley
Contributor

Yup, it was log tracking that was causing the problem.

Basically one node was set up to track a HUGE PHP log and the other one wasn't. Now that I've removed tracking for that log everything has returned to normal.
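
Editor's sketch for anyone checking the same thing: a quick way to spot an oversized log is to list the largest files under the directories where tracked logs live. The Python below is only an illustration under assumptions. It knows nothing about Hyperic's configuration, so which of the listed files are actually being tracked still has to be checked in HQ; the function name and the directory argument are hypothetical.

```python
import os
import sys
from typing import List, Tuple


def largest_files(root: str, top_n: int = 10) -> List[Tuple[int, str]]:
    """Return up to top_n (size_in_bytes, path) pairs under root, biggest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                sizes.append((os.path.getsize(path), path))
            except OSError:
                # File vanished or is unreadable; skip it.
                continue
    sizes.sort(reverse=True)
    return sizes[:top_n]


# Hypothetical usage: python largest_files.py /var/log
if __name__ == "__main__" and len(sys.argv) == 2:
    for size, path in largest_files(sys.argv[1]):
        print(f"{size / (1024 * 1024):10.1f} MiB  {path}")
```

Running it on both nodes and comparing the output is one way to find a difference between two machines that are supposed to be identical.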

Thanks very much. 🙂

sanjivsingh
Contributor

Hi SLTB/kwayley,

I am facing the same issue.

Currently I am using CentOS 5.5 and Hyperic 4.4.0.

The CPU gets hogged up to 100% for a few seconds every 15 minutes. Currently no track.interval is defined in agent.properties, and the default interval is 5 minutes.

If the issue were related to log tracking, then it should happen every 5 minutes.
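
Editor's sketch: to compare the spike period against an interval like this with measurements rather than by eye, the agent's CPU usage can be sampled from /proc and the start of each spike timestamped. The Python below is a minimal, Linux-only illustration; the function names, the 50% threshold, the 5-second step, and the one-hour duration are arbitrary assumptions. It reads utime and stime (fields 14 and 15 of /proc/<pid>/stat), so 100% means one full core.

```python
import os
import sys
import time
from typing import List, Tuple


def cpu_ticks(stat_line: str) -> int:
    """Return utime + stime (in clock ticks) from the text of /proc/<pid>/stat."""
    # Field 2 (the command name) is in parentheses and may contain spaces,
    # so parse only what follows the last ')'. That part starts at field 3.
    fields = stat_line.rsplit(")", 1)[1].split()
    return int(fields[11]) + int(fields[12])  # fields 14 (utime) and 15 (stime)


def spike_starts(samples: List[Tuple[float, float]], threshold: float) -> List[float]:
    """Return the timestamps at which CPU% rises from below threshold to at or above it."""
    starts = []
    above = False
    for timestamp, percent in samples:
        if percent >= threshold and not above:
            starts.append(timestamp)
        above = percent >= threshold
    return starts


def sample(pid: int, seconds: float, step: float = 5.0) -> List[Tuple[float, float]]:
    """Poll /proc/<pid>/stat every `step` seconds and return (timestamp, cpu_percent) pairs."""
    hz = os.sysconf("SC_CLK_TCK")
    path = f"/proc/{pid}/stat"
    with open(path) as handle:
        prev_ticks, prev_time = cpu_ticks(handle.read()), time.time()
    deadline = prev_time + seconds
    samples = []
    while prev_time < deadline:
        time.sleep(step)
        with open(path) as handle:
            ticks, now = cpu_ticks(handle.read()), time.time()
        samples.append((now, 100.0 * (ticks - prev_ticks) / hz / (now - prev_time)))
        prev_ticks, prev_time = ticks, now
    return samples


# Hypothetical usage: python spikes.py <agent-java-pid>
if __name__ == "__main__" and len(sys.argv) == 2:
    starts = spike_starts(sample(int(sys.argv[1]), seconds=3600), threshold=50.0)
    gaps = [round(later - earlier) for earlier, later in zip(starts, starts[1:])]
    print("seconds between spike starts:", gaps)
```

If the gaps cluster around one value, that value can be compared with the intervals of whatever the agent does periodically, including log tracking and metric collection.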

Waiting for a reply. Thanks.

Regards,

Sanjiv Singh
