VMware Cloud Community
kishorerajput
Contributor

Multiple agents get started and the machine runs out of memory

Hi,
I have an agent running on the same box as the server. The agent starts successfully, but over time more agents are created automatically. Each one keeps consuming memory until the machine finally runs out of memory.

I have other agents running on other machines. They communicate with the server without problems, and each of those machines has only one agent running. Does anyone have an idea why multiple agents get created for the agent on the same machine as the Hyperic server?

Thanks,
Kishore.
10 Replies
excowboy
Virtuoso

Hi,

What OS is this? Which HQ Agent version are you running?

Cheers,
Mirko
kishorerajput
Contributor

Hi,
The OS is: Solaris 10 (Sun SPARC)
Hyperic Agent : hyperic-agent-3.2.3-EE

Thanks,
Kishore.
excowboy
Virtuoso

Hi,

Could you please upgrade your agent to the latest 3.x version (3.2.6) or to 4.0.3 and report whether the error still occurs?

Cheers,
Mirko
kishorerajput
Contributor

Can you please send me a link that describes, step by step, how to upgrade the Hyperic client?

I will give it a try.

Thanks.
excowboy
Virtuoso

Hi,

Documentation is available here: http://support.hyperic.com/display/DOC/Upgrade+HQ+Components

Cheers,
Mirko
jvalkeal_hyperi

Could this be related to a bug in the JRE that spawns extra HQ Java processes? For me it happened when the Solaris JRE did a fork to run external scripts. There are at least two support cases in JIRA for this issue, with workarounds.

Hard to say until there are stack dumps from the JRE and the OS, though.
excowboy
Virtuoso

Hi Janne,

OS (open source) users do not have access to JIRA support cases, so could you please post a workaround?

Cheers,
Mirko
jvalkeal_hyperi

This was the situation:
These are the processes shown by ps:
root 20175 1 0 Dec 18 ? 21:41 /opt/hyperic/hyperic-hq-agent-4.0.1-EE/wrapper/sbin/../../wrapper/sbin/wrapper-
root 20176 20175 0 Dec 18 ? 246:46 /usr/java/bin/java -Djava.compiler=NONE -Djava.security.auth.login.config=../..
root 1771 20176 0 Dec 25 ? 0:00 /usr/java/bin/java -Djava.compiler=NONE -Djava.security.auth.login.config=../..
root 11942 20176 0 Jan 01 ? 0:00 /usr/java/bin/java -Djava.compiler=NONE -Djava.security.auth.login.config=../..
root 15521 20176 0 Jan 03 ? 0:00 /usr/java/bin/java -Djava.compiler=NONE -Djava.security.auth.login.config=../..
root 18470 20176 0 Jan 05 ? 0:00 /usr/java/bin/java -Djava.compiler=NONE -Djava.security.auth.login.config=../..
root 20349 20176 0 Jan 06 ? 0:00 /usr/java/bin/java -Djava.compiler=NONE -Djava.security.auth.login.config=../..
root 24932 20176 0 06:20:16 ? 0:00 /usr/java/bin/java -Djava.compiler=NONE -Djava.security.auth.login.config=../..
root 24007 20176 0 17:30:16 ? 0:00 /usr/java/bin/java -Djava.compiler=NONE -Djava.security.auth.login.config=../..

As you can see, the original process (20176) was started by the wrapper. All the other stuck children were forked by this main process, which you can tell by comparing the parent IDs.
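
The parent-ID comparison can also be done mechanically. Below is a minimal Python sketch, assuming `ps -ef` style output where the first three columns are UID, PID and PPID. The function name `find_children` and the `SAMPLE` text (abbreviated from the listing above) are only illustrative and are not part of Hyperic.

```python
def find_children(ps_output, parent_pid):
    """Return the PIDs whose PPID equals parent_pid in `ps -ef` style output."""
    children = []
    for line in ps_output.splitlines():
        fields = line.split()
        # Only the first three columns (UID, PID, PPID) are used, because the
        # STIME column may be one token ("06:20:16") or two ("Dec 18").
        if len(fields) < 3 or not fields[1].isdigit() or not fields[2].isdigit():
            continue  # header line, blank line, or unexpected format
        if int(fields[2]) == parent_pid:
            children.append(int(fields[1]))
    return children


# Abbreviated sample: command lines are shortened for readability.
SAMPLE = """\
root 20175 1 0 Dec 18 ? 21:41 wrapper
root 20176 20175 0 Dec 18 ? 246:46 /usr/java/bin/java
root 1771 20176 0 Dec 25 ? 0:00 /usr/java/bin/java
root 24932 20176 0 06:20:16 ? 0:00 /usr/java/bin/java
"""

print(find_children(SAMPLE, 20176))
```

A growing list of children under the agent's main java process is the symptom described in this thread.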

Snippets from jstack:
#/usr/jdk/jdk1.5.0_14/bin/jstack 20176
Thread t@58166: (state = IN_NATIVE)
- java.lang.UNIXProcess.waitForProcessExit(int) @bci=0 (Interpreted frame)
- java.lang.UNIXProcess.access$900(java.lang.UNIXProcess, int) @bci=2, line=17 (Interpreted frame)
- java.lang.UNIXProcess$2$1.run() @bci=17, line=86 (Interpreted frame)

#/usr/jdk/jdk1.5.0_14/bin/jstack 1771
Thread t@109: (state = IN_NATIVE)
- java.lang.UNIXProcess.forkAndExec(byte[], byte[], int, byte[], int, byte[], boolean, java.io.FileDescriptor, java.io.FileDescriptor, java.io.FileDescriptor) @bci=0 (Interpreted frame)
- java.lang.UNIXProcess.<init>(byte[], byte[], int, byte[], int, byte[], boolean) @bci=62, line=53 (Interpreted frame)
- java.lang.ProcessImpl.start(java.lang.String[], java.util.Map, java.lang.String, boolean) @bci=182, line=65 (Interpreted frame)
- java.lang.ProcessBuilder.start() @bci=112, line=451 (Interpreted frame)
- java.lang.Runtime.exec(java.lang.String[], java.lang.String[], java.io.File) @bci=16, line=591 (Interpreted frame)
- org.hyperic.util.exec.Execute.execute() @bci=16, line=316 (Interpreted frame)
- org.hyperic.hq.product.ExecutableProcess.collect() @bci=98, line=202 (Interpreted frame)
- org.hyperic.hq.product.Collector.run() @bci=41, line=562 (Interpreted frame)
- edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor.runWorker(edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker) @bci=46, line=1061 (Interpreted frame)
- edu.emory.mathcs.backport.java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=575 (Interpreted frame)
- java.lang.Thread.run() @bci=11, line=595 (Interpreted frame)
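
To check several child processes quickly, the jstack output can be scanned for the `forkAndExec` frame. This is only a rough sketch, assuming the jstack text format shown above (a `Thread ...:` header followed by `- frame` lines). The function name `threads_in_fork` and the shortened `SAMPLE` are illustrative.

```python
FORK_FRAME = "java.lang.UNIXProcess.forkAndExec"


def threads_in_fork(jstack_output):
    """Return the thread headers whose stack contains a forkAndExec frame."""
    threads = []
    current = None
    for raw in jstack_output.splitlines():
        line = raw.strip()
        if line.startswith("Thread "):
            # "Thread t@109: (state = IN_NATIVE)" -> "Thread t@109"
            current = line.split(":", 1)[0]
        elif current and FORK_FRAME in line and current not in threads:
            threads.append(current)
    return threads


# Shortened sample: frame argument lists are elided for readability.
SAMPLE = """\
Thread t@58166: (state = IN_NATIVE)
- java.lang.UNIXProcess.waitForProcessExit(int) @bci=0 (Interpreted frame)

Thread t@109: (state = IN_NATIVE)
- java.lang.UNIXProcess.forkAndExec(...) @bci=0 (Interpreted frame)
- java.lang.ProcessBuilder.start() @bci=112, line=451 (Interpreted frame)
"""

print(threads_in_fork(SAMPLE))
```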

At that time, the following appeared in agent.log:
2008-12-25 20:17:55,342 INFO [pool-1-thread-12] [Execute] waitFor() interrupted
2008-12-25 20:17:57,359 ERROR [pool-1-thread-12] [ExecutableProcess] [../../bundles/agent-4.0.1-EE-905/pdk/work/scripts/sendmail/hq-sendmail-stat]: Timeout
running [../../bundles/agent-4.0.1-EE-905/pdk/work/scripts/sendmail/hq-sendmail-stat ]
-------------------------------------------------

Workarounds are:
- See http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6276483 and apply workaround #2 (editing jre/lib/security/java.security) to your JRE
- Add plugins.exclude=ntp,sendmail to your agent.properties (that is, exclude the plugins that run external scripts)

Modifying java.security resolved my problems.
-----------------------------------------------------

This specific issue was related to Solaris x86, but it may also happen on SPARC. I've seen these spawned processes on SPARC once as well. Unfortunately, I was too quick to restart the agent and forgot to save the jstack and pstack output from the processes, so I'm not sure whether it was the same issue.

It's a nasty issue with Java 1.5. It is only fixed in 1.6, and I believe Sun won't backport the fix to older JREs.
jvalkeal_hyperi

Also, simply removing 'security.provider.1=sun.security.pkcs11.SunPKCS11 ${java.home}/lib/security/sunpkcs11-solaris.cfg' from java.security will break the agent.

The JRE expects to find a default provider, which is the first one. This wasn't that clear in the workaround. So after removing security.provider.1, rename security.provider.2 to security.provider.1, security.provider.3 to security.provider.2, and so on.
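
The renumbering can be scripted rather than done by hand. The following is a minimal Python sketch, assuming the `security.provider.N=...` entries appear in ascending order in java.security. The function name `drop_provider` is hypothetical, and the second and third provider lines in `SAMPLE` are illustrative, so check them against your own file. Back up java.security before changing it.

```python
import re

PROVIDER_RE = re.compile(r"^security\.provider\.(\d+)=(.*)$")


def drop_provider(text, unwanted):
    """Drop providers whose value starts with `unwanted`; renumber the rest from 1."""
    out = []
    next_index = 1
    for line in text.splitlines():
        match = PROVIDER_RE.match(line.strip())
        if match is None:
            out.append(line)  # comments and unrelated settings pass through
            continue
        value = match.group(2).strip()
        if value.startswith(unwanted):
            continue  # remove this provider entry entirely
        out.append(f"security.provider.{next_index}={value}")
        next_index += 1
    return "\n".join(out)


# Illustrative sample; only the first entry is taken from this thread.
SAMPLE = """\
security.provider.1=sun.security.pkcs11.SunPKCS11 ${java.home}/lib/security/sunpkcs11-solaris.cfg
security.provider.2=sun.security.provider.Sun
security.provider.3=sun.security.rsa.SunRsaSign
"""

print(drop_provider(SAMPLE, "sun.security.pkcs11.SunPKCS11"))
```

The sketch returns the new text instead of writing the file, so the result can be reviewed before it replaces the original.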
jvalkeal_hyperi

I finally found that this bug also happens on Solaris SPARC. Process dumps and thread dumps from Solaris SPARC exactly match those from Solaris x86.

I've applied the same workaround by modifying java.security. We'll see within a few days whether this fix works.