drdox
Contributor

Anyone else interested in MTTR, MTBF and "business day" SLA?

I am working on a simple mechanism to track availability according to a time-based SLA.

The first prototype uses an RRULE to define a time window, and the availability metric is then recomputed against that window. That way, if our email service needs to be available between 8am and 6pm but can be taken down for maintenance in the evening, the SLA-based availability would still show as 100%.
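
A minimal sketch of the windowed calculation, in Python for illustration only (the actual script is Groovy and is not shown in this thread). It assumes the RRULE expands to a daily 08:00-18:00 window and that outages are non-overlapping (start, end) pairs; a real version would expand the rule with an iCalendar library. The names `overlap_seconds` and `sla_availability` are hypothetical.

```python
from datetime import datetime, time, timedelta

def overlap_seconds(a_start, a_end, b_start, b_end):
    """Seconds shared by the intervals [a_start, a_end) and [b_start, b_end)."""
    start, end = max(a_start, b_start), min(a_end, b_end)
    return max((end - start).total_seconds(), 0.0)

def sla_availability(outages, period_start, period_end,
                     window_open=time(8, 0), window_close=time(18, 0)):
    """Percent availability, counting only time inside the daily SLA window.

    outages: list of (down_from, down_until) datetime pairs, assumed
    not to overlap each other.
    """
    in_window = 0.0  # seconds the SLA window was open during the period
    down = 0.0       # outage seconds that fell inside the window
    day = period_start.date()
    while day <= period_end.date():
        w_start = datetime.combine(day, window_open)
        w_end = datetime.combine(day, window_close)
        in_window += overlap_seconds(w_start, w_end, period_start, period_end)
        for o_start, o_end in outages:
            # Clip the outage to the reporting period, then to the window.
            o_s, o_e = max(o_start, period_start), min(o_end, period_end)
            if o_s < o_e:
                down += overlap_seconds(w_start, w_end, o_s, o_e)
        day += timedelta(days=1)
    if in_window == 0:
        return 100.0  # no SLA time in the period, so nothing was missed
    return 100.0 * (1 - down / in_window)
```

With a one-day reporting period, an outage from 19:00 to 21:00 leaves the figure at 100%, while an outage from 09:00 to 10:00 drops it to about 90%.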

The groovy script also computes MTTR and MTBF figures whilst doing so.
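
For reference, a rough Python sketch of the MTTR and MTBF figures (again, not the Groovy script itself). The convention assumed here is MTTR = total downtime / number of failures and MTBF = total uptime / number of failures; other definitions exist. The function name `mttr_mtbf` is hypothetical, and outages are assumed to lie inside the period without overlapping.

```python
from datetime import datetime

def mttr_mtbf(outages, period_start, period_end):
    """Return (MTTR, MTBF) in seconds for a reporting period.

    outages: list of (down_from, down_until) datetime pairs.
    Returns (None, None) when there were no failures in the period.
    """
    if not outages:
        return None, None
    total = (period_end - period_start).total_seconds()
    downtime = sum((end - start).total_seconds() for start, end in outages)
    failures = len(outages)
    return downtime / failures, (total - downtime) / failures
```

To get "business day" figures, the outages and the period could first be clipped to the SLA window before being passed in.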

I'm not sure whether we should store the computed values in the database or just run the computation at runtime. I favour the database: running every hour, it would work RRD-style to "smoosh" the metrics.
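
As an illustration of the hourly "smoosh", here is a small Python sketch that averages raw samples into one value per hour. It assumes samples arrive as (timestamp, value) pairs with availability as 0 or 1; `hourly_rollup` is a hypothetical name, and a real RRD-style store would also keep coarser tiers.

```python
from collections import defaultdict
from datetime import datetime

def hourly_rollup(samples):
    """Average (timestamp, value) samples into one value per clock hour.

    Returns a dict mapping the start of each hour to the mean value.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        hour = ts.replace(minute=0, second=0, microsecond=0)
        buckets[hour].append(value)
    return {hour: sum(vals) / len(vals) for hour, vals in sorted(buckets.items())}
```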

To provide a richer SLA type feature, we'd need to account for components that operate in serial (inter-dependent) and parallel (disaster recovery) modes.
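
The standard way to combine component figures, sketched in Python under the assumption that component failures are independent: serial components multiply their availabilities, while parallel (redundant) components fail only when all of them fail. Availabilities are fractions between 0 and 1, and the function names are hypothetical.

```python
from math import prod

def serial(availabilities):
    """Availability of a chain in which every component must be up."""
    return prod(availabilities)

def parallel(availabilities):
    """Availability of a redundant set in which any one component suffices."""
    return 1 - prod(1 - a for a in availabilities)
```

For example, two 99% components in series give roughly 98.01%, whereas two 90% components in parallel give roughly 99%.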

If you're interested, I'd welcome any ideas, suggestions and hands-on help (review/test/improve script).
3 Replies
brian_mcdonald
Contributor

On Friday 25 April 2008 3:56 am, drdox wrote:
> I am working on a simple mechanism to track availability according to a
> time-based SLA.

We're working on something similar.

> The first prototype uses an RRULE to define a time window - then the
> availability metric is re-computed accordingly. That way if our email service
> needs to be available between 8am and 6pm, but can be downed for maintenance
> in the evening - the SLA-based availability would show as 100%.

For the first pass, we just experimented with custom reports that discount the
availability metrics outside the defined window and recomputed at that time.
Hyperic itself reflects the 24x7 availability in the interface. Your way
sounds more sophisticated. 🙂

> The groovy script also computes MTTR and MTBF figures whilst doing so.

Excellent.

> I'm not sure if we should store the computed values in the database, or just
> run the computation at runtime. I favour the database, running every 1hr it
> would work in RRD-style to "smoosh" the metrics.

Depends on your workload, I'd say. If the reports get pulled once a month,
doing the calculations on the fly is probably fine, and reduces the
complexity of the data store.

> To provide a richer SLA type feature, we'd need to account for components
> that operate in serial (inter-dependent) and parallel (disaster recovery)
> modes.

Hyperic is tantalizingly close to being a great tool for this sort of thing.
Nearly every other management system I've worked with (both commercial and
open-source) is obsessed with data collection and loves to show every little
minute detail and metric it's managed to come up with. Hyperic has an
ordered inventory hierarchy model, and the concept *GASP* of applications
which can be measured for availability as a whole. More often than not,
that's what our users really want. They don't need to be alerted when server
xyz.example.com is running low on memory - they want to be alerted when the
applications being hosted aren't available. The memory metric needs to be
there for when the brains of the outfit shows up, looks at the tool, and
starts sussing out what's wrong.

What you're suggesting is the buttery icing on top of that. IT departments are,
more and more, forming SLAs with other business units, and that's the real
measurement of service quality.

> If you're interested, I'd welcome any ideas, suggestions and hands-on help
> (review/test/improve script).

Very interested. Anyone else? Hyperic is supposed to be coming out with a
version 4.0 roadmap soon, as well.


Brian
--
Brian McDonald, Senior Consultant
The Occam Group of Professional Computers Services Organization
1919 Birchwood
Troy, MI 48083
Office: 248.528.3770 / Fax: 248.528.3573 / Mobile: 614.209.0260

drdox
Contributor

Great feedback Brian!

This is the great benefit of an open forum and open source. Those with similar goals can share their perspectives and perhaps save duplication of effort.

I've checked out the 4.0 roadmap, and I think some interesting developments will materialise, especially the run-length encoding story for metrics. That would keep the data store requirements down for storing SLA metrics.
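
To show why run-length encoding suits availability data, here is a generic Python sketch of the technique. It is not taken from the Hyperic roadmap; `rle_encode` and `rle_decode` are hypothetical names. Long stretches of identical "up" samples collapse into a single (value, count) pair.

```python
from itertools import groupby

def rle_encode(samples):
    """Collapse consecutive equal samples into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(samples)]

def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original samples."""
    out = []
    for value, count in pairs:
        out.extend([value] * count)
    return out
```

A day of per-minute samples with one short outage would shrink from 1440 values to three pairs.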

In our view, metrics need to be collected for much longer.

We agree that applications are what businesses are interested in. The servers and their sub-components are for the engineering team.

On that topic, we've identified that when an SLA (or similar) alert is fired, the engineers want information on the present alert and on similar situations in the past.

Obviously, the RRD approach reduces the number and granularity of the historical data points, so we'd like to copy out the minute-by-minute metrics from immediately before and after the alert time. That data would be preserved until it is culled or 12 months have passed.
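
A rough Python sketch of that copy-out step. The 30-minute margins on either side of the alert and the names `snapshot_around` and `expired` are assumptions for illustration; only the 12-month retention comes from the description above, approximated here as 365 days.

```python
from datetime import datetime, timedelta

def snapshot_around(samples, alert_time,
                    before=timedelta(minutes=30), after=timedelta(minutes=30)):
    """Copy the raw (timestamp, value) samples surrounding an alert.

    The margins are inclusive at both ends.
    """
    lo, hi = alert_time - before, alert_time + after
    return [(ts, value) for ts, value in samples if lo <= ts <= hi]

def expired(snapshot_taken, now, keep=timedelta(days=365)):
    """True once a snapshot is older than the retention period."""
    return now - snapshot_taken > keep
```

The snapshot would need to be taken before the hourly consolidation discards the raw samples.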

Please keep your suggestions coming, feel free to drop me a PM if you want to take it up in more detail.
djinn_fr
Contributor

Hi,

I am really interested in this topic.

I don't know if any of you contribute on the forge, but I would be happy to have a look if so.

I agree that Hyperic should mature with an SLA-based approach.
Hyperic has all the information in the repository to go this way.
So now, SLAs should be calculated on the fly for alerting or for reporting.
Best of all would be for these SLAs to be part of the framework like any other item, so they can be used to create management dashboards and escalation schemes.

I am dreaming... but I think Hyperic is not that far from being able to do that.

Thanks