VMware Cloud Community
blackstar_hyper
Contributor

Transaction timeout when registering new services

Hi all,

We are evaluating Hyperic 3.1.1 Enterprise to monitor our various WebLogic environments.

When I register a new WebLogic admin server, the agent running on the platform hosting that server tries to autodiscover the various services (JMS, EJBs, webapps, ...) configured within the WLS domain. When the number of services is high (in this case around 400), the Hyperic server fails to register the service set and throws the following exception:

2007-09-27 16:46:33,957 WARN [org.hyperic.lather.jboss.JBossLatherServlet] Execution of 'aiSendRuntimeReport' exceeded 300 seconds
2007-09-27 16:46:34,045 ERROR [org.jboss.ejb.plugins.LogInterceptor] TransactionRolledbackLocalException in method: public abstract org.hyperic.hq.authz.server.session.Resource org.hyperic.hq.authz.shared.ResourceManagerLocal.createResource(org.hyperic.hq.authz.shared.AuthzSubjectValue,org.hyperic.hq.authz.shared.ResourceTypeValue,java.lang.Integer,java.lang.String,boolean), causedBy:
java.lang.NullPointerException
at oracle.jdbc.driver.DatabaseError.throwSqlException(DatabaseError.java:237)
at oracle.jdbc.driver.T4CPreparedStatement.executeForRows(T4CPreparedStatement.java:977)
at oracle.jdbc.driver.OracleStatement.executeMaybeDescribe(OracleStatement.java:1062)
at oracle.jdbc.driver.T4CPreparedStatement.executeMaybeDescribe(T4CPreparedStatement.java:839)
at oracle.jdbc.driver.OracleStatement.doExecuteWithTimeout(OracleStateme


I've tried the advice found here: http://communities.vmware.com/message/1916085#1916085

without success: the registration process still stops after 300 seconds.

Any advice?
6 Replies
dwellman00
Contributor

Any update on this issue? I am running into the same problem. I'm surprised there are not more complaint threads about this since it would seem to be something rather prevalent. Or could there be something unique about our installations? FWIW, we are running the Solaris version.

I also noticed that the discovery process runs through every single service, whether it has been previously discovered or not. Is this the expected behavior?
admin
Immortal

If a single report back from an agent is exceeding 300 seconds it means that either you are trying to discover too many resources, or your HQ server needs to be tuned and/or run on better hardware. What are the specs for your HQ server and database?

To work around this problem for now you can up the timeout setting in:

server-3.1.1/hq-engine/server/default/deploy/lather-jboss.sar/jboss-lather.war/WEB-INF/web.xml

Look for the org.hyperic.lather.execTimeout property. I'd suggest first trying 600000 or 900000. For HQ to pick up the change a restart is required.
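For anyone who wants to script this change rather than edit the file by hand, here is a minimal Python sketch. It assumes the property is stored as a standard servlet `<param-name>`/`<param-value>` pair; the sample XML, the servlet name `lather`, and the starting value 300000 are hypothetical stand-ins, not copied from a real HQ install. The warning above reports 300 seconds and the suggested values are 600000 and 900000, which suggests the property is in milliseconds, but that is an inference and not something stated in this thread.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def set_param_value(xml_text, name, new_value):
    """Set the <param-value> that sits next to <param-name>name</param-name>.

    Returns (updated_xml, number_of_values_changed).
    """
    root = ET.fromstring(xml_text)
    changed = 0
    for parent in root.iter():
        children = list(parent)
        has_name = any(
            local_name(c.tag) == "param-name" and (c.text or "").strip() == name
            for c in children
        )
        if not has_name:
            continue
        for child in children:
            if local_name(child.tag) == "param-value":
                child.text = str(new_value)
                changed += 1
    return ET.tostring(root, encoding="unicode"), changed


# Hypothetical sample; the real web.xml layout may differ.
SAMPLE = """<web-app>
  <servlet>
    <servlet-name>lather</servlet-name>
    <init-param>
      <param-name>org.hyperic.lather.execTimeout</param-name>
      <param-value>300000</param-value>
    </init-param>
  </servlet>
</web-app>"""

updated, changed = set_param_value(
    SAMPLE, "org.hyperic.lather.execTimeout", 600000
)
print(changed)
print(updated)
```

Note that `ElementTree` does not preserve comments or a DOCTYPE declaration when it rewrites a document, so for the real file a backup plus a hand edit may be the safer route. Either way, the server still needs a restart to pick up the new value.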

-Ryan
blackstar_hyper
Contributor

Modifying that parameter and the transaction timeout value in ./conf/templates/jboss-service.xml solved my problem. As you mentioned, there is a performance problem. The agent discovers 2542 services in the WebLogic domain I use for this evaluation. A quick analysis of the HQ server log file shows that the first services discovered take around 0.2 sec each to be registered. However, the time taken to enroll each new service grows linearly, reaching around 5 sec/service at the end. It took more than one hour to register that single domain.
I'm using the Solaris version of HQ server 3.1.1-EE, running on a dual-CPU SPARC system (Solaris 10) with 4 GB of memory. The Oracle database is running on the same box.
During the registration process, the HQ server Java process takes 50% of the CPU load, i.e. a full CPU.
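As a rough sanity check on those figures, the sketch below sums a per-service cost that grows linearly from 0.2 sec to 5 sec over 2542 services. The linear model and the function name are assumptions for illustration; only the three input numbers come from the observations above.

```python
def estimated_total_seconds(n_services, first_cost, last_cost):
    """Total time when the per-service cost grows linearly from first to last."""
    if n_services <= 0:
        return 0.0
    if n_services == 1:
        return float(first_cost)
    step = (last_cost - first_cost) / (n_services - 1)
    return sum(first_cost + i * step for i in range(n_services))


total = estimated_total_seconds(2542, 0.2, 5.0)
print(f"{total:.0f} seconds, about {total / 3600:.1f} hours")
```

Under that model the total comes to roughly 1.8 hours, which is consistent with the "more than one hour" observed. Because the per-service cost grows with the number of services already registered, the total grows roughly with the square of the service count, so a domain twice this size would be expected to take about four times as long.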

Except for the parameters mentioned, it's a fresh installation.

Is there anything that could be tuned at the JVM or Oracle level, or is this expected behaviour given the number of services? I assume the agent does not need to re-register the whole domain if I make a single modification, like adding a new service?
dwellman00
Contributor

We are using a dual-CPU SPARC server as well, with 2 GB of RAM. The system is only around 4 years old. I've watched CPU and memory usage while adding an agent and the system is not taxed at all.

The issue is every time we add an agent, the server rescans every single "platform" we have configured. We are trying to monitor a bunch of Cisco switches. So to add one switch the server is scanning hundreds and hundreds of switch ports for every single switch we previously added to Hyperic. Even if the switch ports were previously removed from Hyperic - it still scans them all when adding a completely different device. Consequently, I can't keep the total number of "services" low because even if I remove switch ports from Hyperic, the next agent that we add starts this scan process which in turn re-adds all the switch ports we previously deleted. Hopefully I'm explaining that so it makes sense.

FWIW, it only seems to affect "services" and not "servers". I removed some JBoss and Tomcat "servers" and they always stay removed. Per the above, switch ports are always re-added after a new scan.

Any thoughts?
blackstar_hyper
Contributor

I reported earlier that I managed to register the 2000+ services from the single WLS domain (5 WLS servers) I'm using for this evaluation by considerably increasing the transaction timeout. All of them are hosted on the same platform, where one HQ agent is running.
I'm now experiencing a performance problem that I cannot explain. First, the HQ server process consumes between 50 and 100% of the CPU. Second, the agent log file shows messages like these:

2007-10-11 13:10:36,986 INFO [SenderThread] Agent measurements no longer backlogged
2007-10-11 13:15:59,785 WARN [ConfigPopulateThread] Unable to get entities for agent: IO error: java.net.SocketTimeoutException: Read timed out
2007-10-11 13:15:59,785 WARN [ConfigPopulateThread] Sleeping for 160 seconds to fetch entities
2007-10-11 13:20:41,508 WARN [SenderThread] The Agent is having a hard time keeping up with the frequency of metrics taken. Consider increasing your collection interval.
2007-10-11 13:20:54,559 INFO [SenderThread] Agent measurements no longer backlogged
2007-10-11 13:26:19,805 WARN [ConfigPopulateThread] Unable to get entities for agent: IO error: java.net.SocketTimeoutException: Read timed out
2007-10-11 13:26:19,805 WARN [ConfigPopulateThread] Sleeping for 320 seconds to fetch entities
2007-10-11 13:30:59,467 WARN [SenderThread] The Agent is having a hard time keeping up with the frequency of metrics taken. Consider increasing your collection interval.
2007-10-11 13:31:12,396 INFO [SenderThread] Agent measurements no longer backlogged
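To see how often the agent times out and how its retry sleep grows over a longer log, a small Python sketch like the following could tally the messages. It assumes one log entry per line (unwrapped) in the timestamp / level / [thread] / message layout shown above; the regular expressions and the `summarize` helper are illustrative and not part of HQ.

```python
import re

# timestamp, level, [thread], message -- the layout of the excerpt above.
LINE_RE = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}\s+(\w+)\s+\[(\w+)\]\s+(.*)$"
)
SLEEP_RE = re.compile(r"Sleeping for (\d+) seconds")


def summarize(log_text):
    """Count read timeouts and backlog warnings, and collect retry sleeps."""
    summary = {"sleeps": [], "read_timeouts": 0, "backlog_warnings": 0}
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip blank or continuation lines
        level, thread, message = match.groups()
        sleep = SLEEP_RE.search(message)
        if sleep:
            summary["sleeps"].append(int(sleep.group(1)))
        if "SocketTimeoutException" in message:
            summary["read_timeouts"] += 1
        if level == "WARN" and thread == "SenderThread":
            summary["backlog_warnings"] += 1
    return summary


SAMPLE = """\
2007-10-11 13:15:59,785 WARN [ConfigPopulateThread] Unable to get entities for agent: IO error: java.net.SocketTimeoutException: Read timed out
2007-10-11 13:15:59,785 WARN [ConfigPopulateThread] Sleeping for 160 seconds to fetch entities
2007-10-11 13:20:41,508 WARN [SenderThread] The Agent is having a hard time keeping up with the frequency of metrics taken. Consider increasing your collection interval.
2007-10-11 13:20:54,559 INFO [SenderThread] Agent measurements no longer backlogged
2007-10-11 13:26:19,805 WARN [ConfigPopulateThread] Unable to get entities for agent: IO error: java.net.SocketTimeoutException: Read timed out
2007-10-11 13:26:19,805 WARN [ConfigPopulateThread] Sleeping for 320 seconds to fetch entities
"""

result = summarize(SAMPLE)
sleeps = result["sleeps"]
doubling = all(b == 2 * a for a, b in zip(sleeps, sleeps[1:]))
print(result, doubling)
```

In the excerpt the sleep goes from 160 to 320 seconds, so the agent appears to double its wait after each failed fetch; with only two data points that is a guess at the pattern rather than confirmed behaviour.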


I guess that the HQ server is not able to keep up with the measurements sent by the agent. Since that number of WebLogic JMX components is typical of our various environments, I'm concerned about how the HQ server will behave when we register one of our production environments. We are mainly interested in JMX monitoring, WebLogic JMX in particular.
I suspect a tuning problem but I cannot put my finger on it. Some random thread dumps show mainly Hibernate activity, but the Oracle DB instance does not display abnormal activity.
Any ideas?
JohnMarkOrg
Hot Shot

I'm wondering if having the Oracle DB on the same machine contributes to the problem. You said that HQ specifically contributes to the machine load, but still, I would at least try with an offloaded DB.

Also, have you tried turning off some of the service metrics coming in from your WLS servers? While we pride ourselves on the breadth and depth of data compiled by HQ, we also make it possible to shoot yourself in the foot with too much data. Also, you may not need all of the 2000+ services that were auto-discovered.

-John Mark