On Friday 25 April 2008 3:56 am, drdox wrote:
> I am working on a simple mechanism to track availability according to a
> time-based SLA.
We're working on something similar.
> The first prototype uses an RRULE to define a time window - then the
> availability metric is re-computed accordingly. That way if our email service
> needs to be available between 8am and 6pm, but can be downed for maintenance
> in the evening - the SLA-based availability would show as 100%.
For the first pass, we just experimented with custom reports that discount the
availability metrics outside the defined window and recompute at report time.
Hyperic itself reflects the 24x7 availability in the interface. Your way
sounds more sophisticated. 🙂
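
To make the windowed calculation concrete, here is a minimal Python sketch of
the idea as I understand it. It is not your Groovy script and it does not use
RRULE or any Hyperic API: it assumes a fixed daily window (8am to 6pm by
default), outages given as non-overlapping (start, end) pairs, and the function
names are made up for illustration.

```python
from datetime import datetime, time, timedelta


def window_overlap(start, end, day, open_t, close_t):
    """Seconds of [start, end) that fall inside the SLA window on `day`."""
    win_start = datetime.combine(day, open_t)
    win_end = datetime.combine(day, close_t)
    lo = max(start, win_start)
    hi = min(end, win_end)
    return max((hi - lo).total_seconds(), 0.0)


def sla_availability(outages, period_start, period_end,
                     open_t=time(8), close_t=time(18)):
    """Fraction of in-window time the service was up.

    Assumes `outages` is a list of non-overlapping (start, end) datetimes.
    Downtime outside the daily window is ignored.
    """
    in_window = 0.0
    down = 0.0
    day = period_start.date()
    while day <= period_end.date():
        in_window += window_overlap(period_start, period_end,
                                    day, open_t, close_t)
        for o_start, o_end in outages:
            # Clip each outage to the reporting period before measuring it.
            down += window_overlap(max(o_start, period_start),
                                   min(o_end, period_end),
                                   day, open_t, close_t)
        day += timedelta(days=1)
    if in_window == 0:
        return 1.0  # no SLA time in the period, so nothing could be missed
    return 1.0 - down / in_window
```

With this, an evening maintenance outage from 8pm to 10pm leaves the figure at
100%, while an outage from 5pm to 7pm only counts the hour before 6pm.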
> The groovy script also computes MTTR and MTBF figures whilst doing so.
Excellent.
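
For anyone following along, a rough sketch of how those two figures are often
derived from a list of outages. The definitions here are an assumption on my
part (MTTR as total downtime divided by the number of failures, MTBF as total
uptime divided by the number of failures); your script may well define them
differently, and `mttr_mtbf` is a made-up name.

```python
from datetime import datetime, timedelta


def mttr_mtbf(outages, period_start, period_end):
    """Return (MTTR, MTBF) as timedeltas, or (None, None) with no outages.

    Assumes `outages` is a list of non-overlapping (start, end) datetimes
    that all lie inside the reporting period.
    """
    if not outages:
        return None, None
    total = period_end - period_start
    # Sum the outage durations, starting from a zero timedelta.
    down = sum((end - start for start, end in outages), timedelta())
    failures = len(outages)
    mttr = down / failures            # mean time to repair
    mtbf = (total - down) / failures  # mean time between failures
    return mttr, mtbf
```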
> I'm not sure if we should store the computed values in the database, or just
> run the computation at runtime. I favour the database, running every 1hr it
> would work in RRD-style to "smoosh" the metrics.
Depends on your workload, I'd say. If the reports get pulled once a month,
doing the calculations on the fly is probably fine, and reduces the
complexity of the data store.
> To provide a richer SLA type feature, we'd need to account for components
> that operate in serial (inter-dependent) and parallel (disaster recovery)
> modes.
Hyperic is tantalizingly close to being a great tool for this sort of thing.
Nearly every other management system I've worked with (both commercial and
open-source) is obsessed with data collection and loves to show every little
minute detail and metric it's managed to come up with. Hyperic has an
ordered inventory hierarchy model, and the concept *GASP* of applications
which can be measured for availability as a whole. More often than not,
that's what our users really want. They don't need to be alerted when server
xyz.example.com is running low on memory - they want to be alerted when the
applications being hosted aren't available. The memory metric needs to be
there for when the brains of the outfit shows up, looks at the tool, and
starts sussing out what's wrong.
What you're suggesting is the buttery icing on top of that. IT departments are,
more and more, forming SLAs with other business units, and that's the real
measurement of service quality.
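
On the serial/parallel point, the textbook way to combine availabilities is
sketched below. It assumes component failures are independent, which real
systems often violate, and the function names are hypothetical; treat it as a
starting point rather than how Hyperic or your script would do it.

```python
from math import prod


def serial(availabilities):
    """Availability of inter-dependent components: all of them must be up."""
    return prod(availabilities)


def parallel(availabilities):
    """Availability of redundant components: at least one must be up."""
    return 1.0 - prod(1.0 - a for a in availabilities)
```

A mail front end at 0.99 that depends on a pair of redundant back ends at 0.95
each would then be `serial([0.99, parallel([0.95, 0.95])])`, or roughly 0.9875.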
> If you're interested, I'd welcome any ideas, suggestions and hands-on help
> (review/test/improve script).
Very interested. Anyone else? Hyperic is supposed to be coming out with a
version 4.0 roadmap soon, as well.
Brian
--
Brian McDonald, Senior Consultant
The Occam Group of Professional Computers Services Organization
1919 Birchwood
Troy, MI 48083
Office: 248.528.3770 / Fax: 248.528.3573 / Mobile: 614.209.0260