One thing I've noticed about VCSA 6.5 HA setups is that most of them stop working after a while. Some run fine for weeks before trouble starts, others for months.
I've set up a whole bunch of installations by now, using version 6.5 for the last 5 or so, and only 1 of them is still running in HA mode without problems today. All the others have withered away over time.
In some cases, it starts with Postgres not syncing anymore.
In other cases, Postgres sync works fine but everything rsync-based stops working (with a message that the appliance config, state and SQLite DB are out of sync).
In other cases, the peer node goes offline and never comes back, despite being rebooted.
Those are the 3 main categories of HA problems that I see in the field. At the moment I've stopped recommending VCSA 6.5 HA mode altogether.
I can find little information on actual troubleshooting, besides things like manually starting Postgres or rebooting, which rarely work. Can somebody point me to good sources of information? I cannot find blogs or relevant KB articles, and I cannot imagine that my customers and I are the only ones with these issues.
I want to learn how to troubleshoot HA issues myself, as VMware support is not very helpful. You get some first-level guy on the phone who only ever collects logs, the HA setup is down for weeks, and it ends with the advice to simply destroy the peer and witness nodes and start over. I've become used to doing "destroy-ha" on the active node and starting fresh...
It's always a replication issue. The active node is fine. No disk-full issues or anything. Replication just stops.
At the moment, I have two separate installations (one installed 3 months ago, the other about 7 weeks old) with the exact same problem: vCenter itself works fine and all nodes are up, but both say "PostgreSQL replication is not in progress. Verify if PostgreSQL server is running on the Passive node and that the Passive node is reachable on the vCenter HA network."
Also, the other 3 replication items (config, state and SQLite) are all out of sync. Rebooting the passive appliance or disabling and re-enabling HA has no effect at all. Disk space is not an issue.
If I go to the command line of the passive node, I see that only "vmware-statsmonitor", "vmware-vcha" and "vmware-vmon" are running. The service "vmware-vpostgres" is not running and will not start either ("service is masked" error).
Concrete questions: if, as in the example above, Postgres sync stops working, how do I get it up and running again? Same for the other 3 sync relationships (state, config and SQLite DB). Is there any documentation?
Thank you for reaching out to VMware support.
There are a couple of things I have observed with VCHA failures in my experience. Please find the two below:
1. When the vcha password expires.
2. When the time is out of sync between the VCSA and the ESXi hosts on which the active, passive and witness nodes reside. The network then flips, and the active and passive nodes go out of sync.
Hope this helps you.
I was wondering if you had any luck getting any more information?
I feel the same way. I was beginning to wonder if I am the only person using it and whether I am better off without it. This is my second failure in the last couple of months. Both times it's been after a failover (usually caused by a network blip), and it's a Postgres replication failure.
Failover works, which is good, but I don't want to have to destroy the HA cluster every time it happens.
Today I solved my "PostgreSQL replication is not in progress" issue.
1. Log in to the passive node console.
2. Enter the shell.
3. Try to get "vmware-vpostgres" and "vmware-vcha" running:
a. Use "service-control --status" to show the running services.
b. Use "service-control --start SERVICENAME" to start a service.
1). It can fail several times; try again several minutes later.
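In case it saves someone the manual retrying, here is a rough Python sketch of step 3b with the "try again a few minutes later" part built in. It is only a sketch under assumptions: the function name `start_with_retry`, the number of attempts and the 2-minute wait are my own choices, and I am assuming that "service-control --start" exits with a non-zero code when the start fails. It needs to run in the shell of the passive node.

```python
import subprocess
import time


def start_with_retry(service, attempts=5, delay_seconds=120,
                     run=subprocess.run, sleep=time.sleep):
    """Run `service-control --start <service>` until it succeeds.

    Returns the attempt number that succeeded, or None if every
    attempt failed. `run` and `sleep` are parameters only so the
    function can be tried out without touching a real appliance.
    """
    for attempt in range(1, attempts + 1):
        # Assumption: a non-zero exit code means the start failed.
        result = run(["service-control", "--start", service])
        if result.returncode == 0:
            return attempt
        if attempt < attempts:
            # Wait a few minutes before the next try, as in step 3b.1.
            sleep(delay_seconds)
    return None
```

Calling `start_with_retry("vmware-vpostgres")` and then `start_with_retry("vmware-vcha")` would mirror what I did by hand. A `None` result means the service never came up and you are back to manual troubleshooting.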
Hope this will help.