VMware Cloud Community
OsburnM
Hot Shot

Cluster Availability % for Uptime SLAs

I've been looking into ways of providing a cluster uptime SLA and it's turning out to be a lot harder than you'd think.  The existing metric "Cluster Availability %" in vROps doesn't indicate IF the cluster is available (i.e., there are at least some hosts available to power on/run VMs); rather, it refers to the % of hosts available in the cluster (i.e., in a 4-node cluster, if one host is down, it's 75% available).

So I started looking at VMware's posted SLA doc for the VMC on AWS SDDC (here: vmw-cloud-aws-service-level-agreement.pdf (vmware.com)) and something stood out (the unavailability definition on page 2):

Unavailability and SLA Events
A service component will be considered “Unavailable”, subject to the Service Level Agreement
Limitations set forth below, if VMware’s monitoring tools determine that one of the following events
(each, an “SLA Event”) has occurred.
The total minutes that the service component is Unavailable for a particular SLA Event is measured
from the time that the SLA Event has occurred, as validated by VMware, until the time that the SLA
Event is resolved such that the service component is no longer Unavailable.
If two or more SLA Events occur simultaneously, the SLA Event with the longest duration will be
used to determine the total minutes Unavailable.
Each of the following will be considered an SLA Event for the VMware Cloud on AWS service:
SDDC Infrastructure:
a) All of your virtual machines ("VMs") running in a cluster do not have any connectivity for
four consecutive minutes.
b) None of your VMs can access storage for four consecutive minutes.
c) None of your VMs can be started for four consecutive minutes.

So VMware's own SLAs for their SDDC seem to indicate that as long as there are SOME hosts available for VMs to run, the cluster counts as available; that's their metric.  Makes sense to me.  That's the whole point of vSphere HA after all.
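
To make that reading concrete, here is a rough Python sketch (not anything VMware publishes) of how an availability % could be computed from per-minute samples. It assumes a hypothetical input list holding the number of usable hosts in the cluster for each minute, treats "zero usable hosts" as a stand-in for the SLA Events quoted above, and assumes an outage only counts once it lasts four consecutive minutes, at which point the whole run is counted. The function and parameter names are made up for illustration.

```python
def sla_availability_pct(hosts_up_per_minute, min_outage_minutes=4):
    """Return the percent of sampled minutes the cluster counts as available.

    hosts_up_per_minute: hypothetical list of usable-host counts, one per minute.
    A run of minutes with zero usable hosts only counts as unavailable once it
    reaches min_outage_minutes consecutive minutes (assumed interpretation).
    """
    total = len(hosts_up_per_minute)
    if total == 0:
        raise ValueError("no samples provided")

    unavailable = 0  # minutes counted against the SLA
    run = 0          # length of the current zero-host run

    for hosts_up in hosts_up_per_minute:
        if hosts_up == 0:
            run += 1
        else:
            # The run just ended; count it only if it was long enough.
            if run >= min_outage_minutes:
                unavailable += run
            run = 0

    # Handle a run that is still open at the end of the sample window.
    if run >= min_outage_minutes:
        unavailable += run

    return 100.0 * (total - unavailable) / total


# 30 minutes of samples: a 3-minute blip (ignored) and a 5-minute outage (counted).
samples = [4] * 10 + [0] * 3 + [3] * 5 + [0] * 5 + [4] * 7
print(round(sla_availability_pct(samples), 2))
```

Note that a host being down (the `3` samples) does not reduce the result at all, which is the difference from the per-host "Cluster Availability %" metric.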

Which brings me to my point...  Does anyone have any idea how VMware monitors this themselves or has anyone come up with any supermetrics or other means to provide a cluster availability % like described above (as opposed to the way vROps shows it currently)?

Thanks!  Looking forward to some thoughts on this one.

2 Replies
OsburnM
Hot Shot

After thinking about it over the weekend, I thought about the possibility of creating a supermetric that would look at the existing "cluster availability %" and use an IF in the SM.  The idea being, if the cluster availability % is greater than 0, then there should be at least one functioning host in the cluster capable of powering on VMs (even if it's massively oversubscribed).  So the supermetric would read: IF > zero, then set the metric to 100, else set it to 0.
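
Outside of vROps, the intended logic can be sketched in a few lines of Python. This is only an illustration of the rule described above, not supermetric syntax; the function names and the idea of feeding it a list of exported "cluster availability %" samples are assumptions.

```python
def binary_availability(host_availability_pct):
    """Map the per-host availability % to 100 (cluster up) or 0 (cluster down)."""
    return 100.0 if host_availability_pct > 0 else 0.0


def uptime_pct(samples):
    """Average the 100/0 values over a window to get an uptime percentage."""
    if not samples:
        raise ValueError("no samples provided")
    return sum(binary_availability(s) for s in samples) / len(samples)


# Four samples of "cluster availability %": only the 0 sample counts as down.
print(uptime_pct([100, 75, 0, 50]))  # 75.0
```

The threshold is applied to each sample first and the average is taken afterwards, so a degraded-but-running cluster (75%, 50%) still counts as fully up.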

I tried creating the supermetric and using the avg operator but I keep getting an error...

avg(${adaptertype=VMWARE, objecttype=ClusterComputeResource, attribute=summary|cluster_availability, depth=3}>0?100:0)

Error:  Cannot convert aggregated result to number.

Not sure why, because if I take the avg out, it seems to work.

Anyone have any thoughts/ideas?

 

sxnxr
Commander

It is an attribute, not a metric. Try using the powered-on hosts / connected hosts metric instead and see if that works.
