I have been tasked with measuring and reporting availability for a number of core services provided by our organization (e.g. shared file, email, directory services), to be delivered to senior management as a monthly report against a 99.5% service target. We are reporting at the overall service level rather than the component level (i.e. server/host), because the technologies are now resilient enough that most component failures are insignificant.
Depending on the service, I am using Nagios to simulate a transaction, which could be anything from a read/write test to an LDAP query or a mail flow test, at regular 5-minute intervals. The results of those transactions are then rolled up daily into a service availability figure (i.e. how many seconds were spent in an OK state).
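To make the "seconds in an OK state" idea concrete, here is a rough sketch of how I roll the check results up into a percentage. It assumes each 5-minute check stands for the whole interval that follows it; the function names and the list-of-booleans input are only for illustration and are not anything Nagios provides.

```python
from datetime import timedelta

CHECK_INTERVAL = timedelta(minutes=5)  # polling interval used for the checks
TARGET_PERCENT = 99.5                  # service target from the monthly report


def availability_percent(check_results, interval=CHECK_INTERVAL):
    """Return availability as a percentage.

    check_results is an iterable of booleans (True = check returned OK).
    Assumption: each result represents one full polling interval.
    """
    results = list(check_results)
    if not results:
        raise ValueError("no check results supplied")
    interval_s = interval.total_seconds()
    ok_seconds = sum(interval_s for ok in results if ok)
    total_seconds = interval_s * len(results)
    return 100.0 * ok_seconds / total_seconds


def meets_target(percent, target=TARGET_PERCENT):
    """True when the measured availability reaches the target."""
    return percent >= target


# One day at 5-minute intervals is 288 checks; here two of them failed.
day = [True] * 286 + [False] * 2
pct = availability_percent(day)
print(f"{pct:.2f}% available, target met: {meets_target(pct)}")
```

At this polling rate a single failed check in a day still clears 99.5%, while two failed checks do not, so the polling interval effectively sets the resolution of the report.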
Server hosting is a service that we would also like to report on, specifically the core hosting infrastructure. We understand that within that infrastructure there could be guests or VM hosts that may be in trouble and significant to someone, but not to the overall health and availability of our virtual environment.
What metrics within vROps would best reflect the current health of the environment that I could poll at regular intervals? The Data Center Health Badge seemed like a likely metric, but the only definition I could find was:
Overall score for health. The final score is between 1-100. Where Green - 100, Yellow - 75, Orange - 50, Red - 25, Unknown: -1. The score is derived from the criticality of alerts in the Health category.
It would be nice to have a better definition to provide of what goes into calculating the health badge number.
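Going only by the definition quoted above, this is roughly how I would turn a polled health score into an OK / not-OK result for the availability report. The score-to-color table comes straight from that definition; treating green and yellow as "available" is purely my own assumption, and the function names are hypothetical.

```python
# Score-to-color mapping taken from the quoted badge definition.
BADGE_COLORS = {
    100: "green",
    75: "yellow",
    50: "orange",
    25: "red",
    -1: "unknown",
}


def badge_color(score):
    """Map a health badge score to its documented color.

    Any value not listed in the definition is reported as "undocumented"
    rather than guessed at.
    """
    return BADGE_COLORS.get(int(score), "undocumented")


def counts_as_available(score, ok_colors=("green", "yellow")):
    """Assumption: only the colors in ok_colors count as an OK state."""
    return badge_color(score) in ok_colors


for score in (100, 75, 50, 25, -1):
    print(score, badge_color(score), counts_as_available(score))
```

Which colors should count as "available" for the hosting infrastructure as a whole is exactly the part I cannot justify without a clearer definition of the badge.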
I'm already using the vROps API to poll other metrics, so that's not an issue.