Re: False Positives: Availability of Servers, Processes, Diskspace

rsteppac · ‎12-06-2010

Hi all,

a while ago I posted

http://communities.vmware.com/message/1943414#1943414
and
http://communities.vmware.com/message/1943415#1943415

Since then we migrated to Hyperic 4.5 server and agent, but the problems of sporadic false positives persist:

- Machines are reported as unavailable
- Windows processes are reported as unavailable
- Tomcat instances are reported as unavailable
- Disk space is reported below threshold

The false positive alerts tend to come up in little bursts, but not always.
They tend to be emitted from machines under heavy CPU load, but some of the affected machines are not busy at all.
Time drift between the Hyperic agent and server does not seem to be a general cause of the alerts. We get false positives from machines for which Hyperic reports a drift of < 20 ms, while it is perfectly happy with machines that have a drift of > 7000 ms. (We will fix the time synchronization as a next step.)

All servers are Windows 2003 and 2008 running on hardware virtualized through ESX server.

I would be grateful for any hint where to look for the cause of those false positives.

Thanks and regards
Ralf

rsteppac · ‎12-08-2010

Some more info:

On machines with heavy load, which tend to be the ones for which false positive alerts are raised, we see this a lot in the agent wrapper log:

INFO | jvm 2 | 2010/12/08 14:22:08 | - Invoking agent
INFO | jvm 2 | 2010/12/08 14:22:08 | - Agent thread running
INFO | jvm 2 | 2010/12/08 14:22:08 | - Verifying if agent is running...
INFO | jvm 2 | 2010/12/08 14:24:04 | - Agent is running
INFO | jvm 2 | 2010/12/08 14:24:04 | Agent successfully started
INFO | jvm 2 | 2010/12/08 14:25:04 | Wrapper Manager: The Wrapper code did not ping the JVM for 55 seconds. Quit and let the Wrapper resynch.
INFO | jvm 2 | 2010/12/08 14:25:07 | Stopping agent ...
STATUS | wrapper | 2010/12/08 14:25:07 | JVM requested a restart.
INFO | jvm 2 | 2010/12/08 14:25:07 | Success -- agent is stopped!
WARN | wrapper | 2010/12/08 14:25:17 | JVM exited unexpectedly while stopping the application.
STATUS | wrapper | 2010/12/08 14:26:31 | Reloading Wrapper configuration...
STATUS | wrapper | 2010/12/08 14:26:36 | Launching a JVM...
INFO | jvm 3 | 2010/12/08 14:27:14 | Wrapper (Version 3.2.3) http://wrapper.tanukisoftware.org
INFO | jvm 3 | 2010/12/08 14:27:14 | Copyright 1999-2006 Tanuki Software, Inc. All Rights Reserved.

If the agent is not alive because it is continuously restarted due to the lack of a ping, then that would explain why Hyperic Server thinks the machine is dead.
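One way to check this hypothesis is to measure how long the agent is actually down around each ping timeout. Below is a minimal Python sketch, assuming the pipe-separated wrapper log format shown above. The function names (`parse_line`, `summarize`, `restart_gaps`) are made up for illustration and are not part of Hyperic or the Wrapper.

```python
from datetime import datetime


def parse_line(line):
    """Split one wrapper log line into (level, source, timestamp, message).

    Returns None for lines that do not match the assumed
    'LEVEL | source | yyyy/mm/dd HH:MM:SS | message' layout.
    """
    parts = [p.strip() for p in line.split("|", 3)]
    if len(parts) != 4:
        return None
    level, source, stamp, message = parts
    try:
        when = datetime.strptime(stamp, "%Y/%m/%d %H:%M:%S")
    except ValueError:
        return None
    return level, source, when, message


def summarize(lines):
    """Collect timestamps of ping timeouts and of JVM launches."""
    ping_timeouts = []
    launches = []
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        _, _, when, message = parsed
        if "did not ping the JVM" in message:
            ping_timeouts.append(when)
        elif message.startswith("Launching a JVM"):
            launches.append(when)
    return ping_timeouts, launches


def restart_gaps(ping_timeouts, launches):
    """Seconds between each ping timeout and the next JVM launch."""
    gaps = []
    for timeout in ping_timeouts:
        later = [launch for launch in launches if launch >= timeout]
        if later:
            gaps.append((min(later) - timeout).total_seconds())
    return gaps
```

Feeding the lines of the wrapper log to `summarize` and comparing the resulting timestamps with the times of the availability alerts should show whether the two line up. For the excerpt above, the gap between the ping timeout (14:25:04) and the next JVM launch (14:26:36) is 92 seconds.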

Is there something we can do about this?

Thanks!
Ralf

rsteppac · ‎12-09-2010

Setting longer timeouts in wrapper.conf on the machines with high load seems to solve the ping timeout problem, but it does not solve the problem of the machine and/or processes running on it being reported as unavailable.
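For anyone wanting to try the same thing: the post above does not list the exact settings, so the fragment below is only a sketch. It assumes the standard Java Service Wrapper timeout properties, and the values are illustrative rather than tested recommendations.

```
# wrapper.conf (Java Service Wrapper); illustrative values only, all in seconds
# Time allowed for the JVM to answer a ping from the Wrapper
wrapper.ping.timeout=300
# Time allowed for the JVM to report that the application has started
wrapper.startup.timeout=300
# Time allowed for the application to shut down cleanly
wrapper.shutdown.timeout=120
# Time allowed for the JVM process to exit after shutdown
wrapper.jvm_exit.timeout=60
```

The agent has to be restarted for changes in wrapper.conf to take effect.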

vjavaly · ‎12-10-2010

Hello. We are having the same problem with false alerts. Here's some background info: we recently moved our servers into Amazon EC2, and in the process upgraded the HQ Agent from 4.3 to 4.5. I'm not sure if these false alerts are due to EC2 or the latest agent release. I'm going to try going back to 4.3 on one server to see if the release is the culprit.

jreid_hyperic · ‎12-21-2010

I am also having this issue. Machines, Windows processes, and services show as unavailable and send out an alert; then, at random, they all appear back up. Meanwhile the server itself was running fine.

We are also seeing the little bursts of unavailability reported by Ralf. If one RESOURCE for a machine is showing unavailable, usually all of them are unavailable for that machine. If one MACHINE shows unavailable, typically all machines are showing unavailable, but not always. As Ralf noted, it doesn't seem to be an issue with time synchronization; our machines are synchronized. This seems to happen regardless of the load or traffic on the client. In fact, it is also happening to our initial test client, which is currently doing nothing but running Windows and the Hyperic agent.

While the resource is showing unavailable, we can still execute queries against the resource in Live Exec, and there is no delay in the results it shows. I've been testing several scenarios and haven't seen any evidence of a communication problem. Obviously, if we can query with Live Exec, the server is capable of communicating with the client while HQ reports it as unavailable. It's very odd...

Our servers are Windows 2003, 2000, Fedora and some XP, with no virtualization.

We are using version 4.5, a fresh first-time install.

admin · ‎12-29-2010

I'm not 100% sure these are all exactly the same issue, since it sounds like for some of these posts the ping test is failing, and setting longer timeouts and/or adjusting for server/agent drift is helping. Under high load, a system may not be able to respond as quickly as it does without the load, so that would make sense.

Moving to EC2, for example, will likely require accounting for timeouts, lag, network congestion, etc., but I'd be curious to see if regressing the agent helped.

JReid, OTOH, seems to have accounted for this. Have you tried regressing the install to narrow this down to a 4.5 issue? I assume you were not previously using an earlier version, so that may be moot.

rsteppac · ‎12-30-2010

Jeremy,

We already had the issue with HQ 3.2.6, which is why we upgraded to 4.5. The upgrade has not affected the false positive issues we are experiencing, for better or worse.

Btw, we receive false availability alerts for the HQ server machine (Windows Server 2003 R2) itself. Network congestion, time drift, etc. should not be the cause here.

admin · ‎12-30-2010

So this is the agent on the HQ server itself, reporting that the HQ server machine is unavailable? And this happens when that machine is also under high load?

rsteppac · ‎12-31-2010

Exactly. The server is not busy at all. Its only purpose is to run the HQ Server. Nothing else apart from a small Tomcat instance is running on that machine.
The attached screenshot shows the last 24h; the bubble is a false positive 0% availability alert.
I cross-checked the ESX server performance charts for the last 24h for the virtual and physical server, and they are in line with the Hyperic graph. The only thing I can see is a small dip in average CPU usage for about 20 minutes, during which the false alert was raised. Network and disk I/O are normal during that time.
