Re: Recent Alerts portlet with "selected resources...

edan_hyperic · ‎05-13-2009

I've been running HQ 4.1.1 since April 29th. After lunch today, I noticed HQ was driving MySQL to 500% CPU, with about 65 threads (via mysqladmin processlist) stuck on some kind of alert query. It turned out that a coworker had, after we came back from lunch, set up his Recent Alerts portlet with this apparently nasty combination:

Last 5 ALL or higher priority alerts within the past MONTH for "selected resources."

My coworker also had a FF plugin doing an auto-refresh every minute, which led to HQ and MySQL spinning out of control while we tried to narrow down the problem. We had been troubleshooting something else at the time, so it was not immediately clear why it was spinning.

Doing some testing, it seems that the big problem is trying to filter on "selected resources." We have about 10K alerts according to HQ health and we are set to purge alerts older than 31 days. Performance is excellent if I say "Last 5 ALL within the past month for ALL resources." As soon as I try to filter on "selected," performance degrades. My hqdb data size is 2.1GB.

If I only choose the last day or week of alerts, I can still see MySQL shoot to 100% CPU for a few seconds. "The last month" takes 5-10 seconds to process, and oddly, the thread in MySQL is stuck at "0" seconds of time in the "Sending data" state. I've got lots of stack traces for findRecentAlerts() and findEscalatables() "taking a long time" 🙂

Has anyone else seen this? I had never tried to filter on alerts like this so I had never run into it. Performance is otherwise quite good since the upgrade.

Thanks!

edan_hyperic · ‎05-13-2009

I'm running MySQL 5.0.22 from a RHEL 5 package. It looks like the latest package available from our yum server is 5.0.45, so I'll at least try upgrading to that in case some bizarre MySQL bug is biting me. I see that 5.0.45 is the minimum for 4.1 according to http://support.hyperic.com/display/DOC/Installation+Requirements ... so I suppose you could say it 'behooves' me to upgrade. 😉

edan_hyperic · ‎05-14-2009

I upgraded to 5.0.45-log via yum this morning and found no change to the behavior. Here's a snippet from mysqladmin processlist:

| 7 | hqadmin | localhost:45478 | hqdb | Query | 0 | Sending data | select alert0_.ID as ID116_, alert0_.VERSION_COL as VERSION2_116_, alert0_.CTIME as CTIME116_, alert |
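For anyone who wants to watch for this kind of pile-up, here is a minimal Python sketch that tallies running queries by state from mysqladmin processlist output. It assumes the pipe-delimited column order of the row above (Id, User, Host, db, Command, Time, State, Info). The function names and the sample text are illustrative only and are not part of HQ or MySQL.

```python
from collections import Counter

# Column order assumed from the mysqladmin processlist row quoted above.
COLUMNS = ["id", "user", "host", "db", "command", "time", "state", "info"]

def parse_processlist(text):
    """Parse pipe-delimited processlist rows into a list of dicts."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip +----+ borders and blank lines
        # Split at most 7 times so a '|' inside the Info column is kept intact.
        cells = [c.strip() for c in line.strip("|").split("|", 7)]
        if len(cells) != len(COLUMNS) or not cells[0].isdigit():
            continue  # skip the header row and anything malformed
        rows.append(dict(zip(COLUMNS, cells)))
    return rows

def count_query_states(rows):
    """Count threads per State, looking only at threads running a Query."""
    return Counter(r["state"] for r in rows if r["command"] == "Query")

# Hypothetical sample: the first data row mirrors the one in the post,
# the second is an invented idle connection.
SAMPLE = """
+----+---------+-----------------+------+---------+------+--------------+-----------------------------+
| Id | User    | Host            | db   | Command | Time | State        | Info                        |
+----+---------+-----------------+------+---------+------+--------------+-----------------------------+
| 7  | hqadmin | localhost:45478 | hqdb | Query   | 0    | Sending data | select alert0_.ID as ID116_ |
| 8  | hqadmin | localhost:45479 | hqdb | Sleep   | 12   |              |                             |
+----+---------+-----------------+------+---------+------+--------------+-----------------------------+
"""

if __name__ == "__main__":
    for state, n in count_query_states(parse_processlist(SAMPLE)).items():
        print(f"{n:3d}  {state}")
```

Piping real mysqladmin processlist output into parse_processlist() every few seconds would show whether the number of threads in "Sending data" keeps growing between portlet refreshes.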

The full query from SHOW FULL PROCESSLIST was:

select alert0_.ID as ID116_, alert0_.VERSION_COL as VERSION2_116_, alert0_.CTIME as CTIME116_, alert0_.FIXED as FIXED116_, alert0_.ALERT_DEFINITION_ID as ALERT5_116_, (select e.id from EAM_ESCALATION_STATE e where e.alert_id = alert0_.id and e.alert_type = -559038737) as formula2_, (select e.acknowledged_by from EAM_ESCALATION_STATE e where e.alert_id = alert0_.id and e.alert_type = -559038737) as formula3_ from EAM_ALERT alert0_ inner join EAM_ALERT_DEFINITION alertdefin1_ on alert0_.ALERT_DEFINITION_ID=alertdefin1_.ID inner join EAM_RESOURCE resource2_ on alertdefin1_.RESOURCE_ID=resource2_.ID where (resource2_.RESOURCE_TYPE_ID is not null) and (alert0_.CTIME between 1239893640000 and 1242312840000) and alertdefin1_.PRIORITY>=0 order by alert0_.CTIME DESC limit 5955, 5
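Two details of that query are easier to read once decoded: the CTIME bounds are epoch milliseconds, and "limit 5955, 5" is MySQL's "LIMIT offset, row_count" form, so the server generally has to produce the 5955 skipped rows in sorted order before it can return the 5 that are wanted. A small Python sketch of the arithmetic follows. The helper names are made up for illustration, and reading the offset as "not the first page of results" is an inference from the query text, not something confirmed about how HQ pages through alerts.

```python
from datetime import datetime, timezone

# CTIME bounds copied from the query above (milliseconds since the Unix epoch).
CTIME_START_MS = 1239893640000
CTIME_END_MS = 1242312840000

def ms_to_utc(ms):
    """Convert epoch milliseconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)

def limit_rows_produced(offset, row_count):
    """Rows that must be produced for `LIMIT offset, row_count`."""
    return offset + row_count

start = ms_to_utc(CTIME_START_MS)
end = ms_to_utc(CTIME_END_MS)
window = end - start

print(start.isoformat())             # 2009-04-16T14:54:00+00:00
print(end.isoformat())               # 2009-05-14T14:54:00+00:00
print(window.days)                   # 28
print(limit_rows_produced(5955, 5))  # 5960
```

So this particular statement covers a 28-day window and asks for rows 5956 through 5960 of the sorted result, which is worth keeping in mind when comparing it against a query run by hand.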

What I find peculiar is that it seems to stick in the "Sending data" state and the query time doesn't update. mysqld in top jumps to 95% CPU while the query is running. I checked my mysql-slow log and it doesn't show up there. (More disturbing is that I've had 3 slow queries on "select measurement_id, value ..." since upgrading to 5.0.45 ... *sigh*)

If I run that query by hand, it returns in 0.03 seconds. I suppose that could be because the first query is slow and the 2nd is cached. But I've got a 4GB innodb buffer pool so presumably I have enough room to keep most of it cached...

I guess it's not clear to me whether it's a MySQL problem or an HQ problem... but I'm definitely going to try to make sure no one filters on "selected resources" for recent alerts until I figure it out 🙂

edan_hyperic · ‎05-18-2009

FWIW, I think that pulling alerts in the "Indicators View" seems slow, too. The alerts query is slow enough that I can catch it "in flight" using SHOW FULL PROCESSLIST, but I'm still not sure whether the query is slow only the first time and fast on subsequent runs.

I'm just not sure if it's appropriate to file a bug for this one, as it certainly *looks* like it could be a buggy/slow MySQL query on alerts. The last time I had a problem with HQ being slow, it was the EAM_EVENT_LOG table, and it was fixed by a revised query. http://communities.vmware.com/message/1930253#1930253

Any ideas?

Dans_hyperic · ‎05-21-2009

Hi.
First of all, I suggest you get a list of the tables in your database and optimize or compact them with OPTIMIZE TABLE.

We had trouble with SQL performance, but it was because we had the same machine in there twice. After erasing one of them, SQL sped up by over 100% (we are monitoring over 150 high-traffic machines).

Dans

edan_hyperic · ‎05-21-2009

> First of all, I suggest you get a list of the tables
> in your database and optimize or compact them with
> OPTIMIZE TABLE.

Are you suggesting this specifically for MySQL + InnoDB? I've seen in the logs that HQ is already performing "VACUUM ANALYZE" on a periodic basis. After googling a bit for OPTIMIZE TABLE on MySQL/InnoDB, it sounds like OPTIMIZE TABLE doesn't always help that much, and it locks the tables... sounds scary.
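For reference, on InnoDB tables MySQL maps OPTIMIZE TABLE to a table rebuild (an ALTER TABLE), which is where the locking concern comes from. If someone does want to try it, a cautious approach is to generate the statements first and run them against a restored copy rather than the live hqdb. The Python sketch below only builds the statement text; the table list is taken from the alert query earlier in the thread, and the function name is hypothetical. It does not connect to a database.

```python
import re

# Tables referenced by the alert query earlier in this thread.
ALERT_TABLES = [
    "EAM_ALERT",
    "EAM_ALERT_DEFINITION",
    "EAM_RESOURCE",
    "EAM_ESCALATION_STATE",
]

# Accept only plain identifiers so a bad name cannot inject extra SQL.
_IDENTIFIER = re.compile(r"^[A-Za-z0-9_]+$")

def optimize_statements(tables):
    """Return one OPTIMIZE TABLE statement per table name."""
    statements = []
    for name in tables:
        if not _IDENTIFIER.match(name):
            raise ValueError(f"unexpected table name: {name!r}")
        statements.append(f"OPTIMIZE TABLE `{name}`;")
    return statements

if __name__ == "__main__":
    # Print the statements for review; nothing is executed here.
    for stmt in optimize_statements(ALERT_TABLES):
        print(stmt)
```

Reviewing the printed statements and timing them on a test restore gives some idea of how long each table would stay locked before anything is run on production.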

> We had trouble with SQL performance, but it was because
> we had the same machine in there twice. After erasing one
> of them, SQL sped up by over 100% (we are monitoring over
> 150 high-traffic machines).

What do you mean? You had the same "platform" in HQ twice?

Thanks for answering. I've loaded a restore of my database into the latest 5.0.81-community-log version of MySQL (via RPM from mysql.com). It seems to be a little better, although it still runs up MySQL's CPU for a few seconds when the dashboard refreshes.

The real killer appears to be putting "Groups" into your selected resources. In fact, leaving out the Groups and putting in a few services (that I know had alerts) speeds up the portlet refresh considerably. It comes back almost instantaneously, actually. With just a set of Compatible Groups in my selected resources, the query takes a lot more time.

Of course, this test database isn't changing, but it does look like 5.0.81 helps a bit, and that putting Compatible Groups into the selected resources consumes a lot of CPU time and doesn't even show the alerts for the group members!

Or maybe re-importing all of the data into my test database did a real "OPTIMIZE TABLE" defrag and that's what helped ...

Recent Alerts portlet with "selected resources" tries to kill MySQL