In vSphere 6.5, multiple PSCs can replicate among themselves. Although vCenter points to a single PSC, that PSC can have replication agreements with other peers. For example, if a vCenter points at one PSC that replicates to a separate PSC in the same site and SSO domain, both PSCs show up in the System Configuration section of the Flex client.
In this case, vc-01 points to psc-02, which replicates with psc-01, and the replication status is good. Both PSCs are in the same domain and the same site, as the partner status queried from psc-01 shows:
root@psc-01 [ /usr/lib/vmware-vmdir/bin ]# ./vdcrepadmin -f showpartnerstatus -h localhost -u administrator -w VMware1!
Partner: psc-02.domain.com
Host available: Yes
Status available: Yes
My last change number: 1525
Partner has seen my change number: 1525
Partner is 0 changes behind.
The question becomes: how does one gain health insight into the replica partners? If these PSCs were behind a load balancer, the pool would show one of the nodes as down whenever the entire appliance or some of its services failed. Without a load balancer, however, vCenter does not seem to alert on critical health changes of the replica. For example, if I stop vmafdd, replication between the PSCs fails, and I see the following when interrogating the current PSC:
root@psc-02 [ /usr/lib/vmware-vmdir/bin ]# ./vdcrepadmin -f showpartnerstatus -h localhost -u administrator -w VMware1!
Partner: psc-01.domain.com
Host available: No
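To make the difference between the two outputs above concrete, here is a minimal Python sketch that parses `showpartnerstatus` text and lists the partners that look unhealthy. It assumes the output format is exactly what is pasted above. The function names `parse_partner_status` and `unhealthy_partners` and the `max_lag` threshold are hypothetical names of my own, not part of any VMware tooling. How the text is collected from the PSC is left out on purpose.

```python
import re

# Sample outputs copied from the two vdcrepadmin runs shown above.
HEALTHY = """Partner: psc-02.domain.com
Host available: Yes
Status available: Yes
My last change number: 1525
Partner has seen my change number: 1525
Partner is 0 changes behind.
"""

FAILED = """Partner: psc-01.domain.com
Host available: No
"""


def parse_partner_status(text):
    """Turn showpartnerstatus output into one dict per replication partner."""
    partners = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Partner:"):
            # A "Partner:" line starts a new partner record.
            current = {
                "partner": line.split(":", 1)[1].strip(),
                "host_available": None,
                "changes_behind": None,
            }
            partners.append(current)
        elif current is None:
            continue  # ignore anything before the first partner
        elif line.startswith("Host available:"):
            value = line.split(":", 1)[1].strip().lower()
            current["host_available"] = value == "yes"
        else:
            match = re.match(r"Partner is (\d+) changes? behind", line)
            if match:
                current["changes_behind"] = int(match.group(1))
    return partners


def unhealthy_partners(partners, max_lag=0):
    """Return names of partners that are unreachable or lagging.

    Assumption: a partner with no reported lag figure is treated as
    unhealthy, since its replication state is unknown.
    """
    bad = []
    for p in partners:
        lag = p["changes_behind"]
        if not p["host_available"] or lag is None or lag > max_lag:
            bad.append(p["partner"])
    return bad


if __name__ == "__main__":
    for label, text in (("healthy", HEALTHY), ("failed", FAILED)):
        print(label, unhealthy_partners(parse_partner_status(text)))
```

This only shows what a health check would have to look at (host availability and the change lag). It is not a substitute for an alarm raised by vCenter itself.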
If one of the critical services is stopped, the health status also changes in the Flex client, which lists one or more services as critical, yet no alarm is triggered. The only alarm definition that seems to correspond is the PSC Service Health Alarm, which is defined by default as follows:
Even stopping pschealth on the replica partner does not seem to trigger this alarm. Adding OR conditions for other service names (I'm guessing the service name equals the component ID) does not trigger it either.
Has anyone been successful in gaining some sort of health status insight through vCenter (or any other tool that doesn't involve manual scripting) for the PSC replicas? The main use case is an architecture that doesn't allow for a load balancer, or an environment that simply doesn't have one, where the status of the PSC replica is unknown until it comes time to repoint. If the replica has been in a failed state, a repoint would fail and vCenter would potentially be stuck in an unavailable state. That obviously needs to be mitigated in a design, but there don't appear to be any mechanisms that let one understand this health relationship and be alerted when it falls into a degraded state.
Since there are really no solutions for monitoring replication, I decided to create my own using Log Insight and vROps. I wrote about the problem and showed how to create an alert from the relevant log entries, then forward it to vROps. For anyone interested, the post is titled "Detecting PSC Replication Failure with Log Insight", and it includes the vRLI alert, which can be easily imported into your own environment.
I should also note that I am familiar with William Lam's script, available in his blog post, but it is unsupported and not an ideal solution for a variety of reasons. My preference would be to make the aforementioned vCenter alarm function properly, or to design a new one that does, or, secondarily, to use vROps to understand the health state.