VMware Cloud Community
KeirL
Contributor

vCenter Server Appliance High Availability - passive DOWN

Hi

I'm currently testing VCSA HA in a lab and I am seeing the following situation.

When I initiate a manual failover everything works fine: the passive vCenter server becomes the active server and the old active server becomes the passive server. I can also fail back without any problems.

However, when I simulate an uncontrolled failure of the active vCenter instance (e.g. by powering off the VM), the failover works fine, but the failed vCenter instance fails to rejoin the cluster when powered back up. This has happened every time I have tried it.

Is it normal to have to destroy and recreate the VCHA cluster after such a failure? I see this is quite a common way to remediate issues.

I'd like to troubleshoot this scenario rather than just rebuild the cluster each time, but I'm not sure where to look.

If I log into the vCenter server that is failing to rejoin the cluster (the passive node) and run service-control --status --all, I notice that not many services are running. Is anyone able to tell me which services should be running on the passive node? In particular, should vmware-vcha be running? When I try to start it, the response says the service startup type is not automatic and it is skipped.

The VCHA monitoring screen shows the Active and Witness nodes as 'UP' and the Passive as 'DOWN', and suggests I check that the passive node is online and accessible over the heartbeat network. I can ping the passive node from the active node using the heartbeat IP address without problems. It then says to check that replication is OK. How do I do this? I can see the vmware-vpostgres service is running (I had to start it manually), but what more can I do to check that replication is in sync?

Any thoughts would be very much appreciated. I'm most keen to understand what services should be running on the passive as I feel this is going to be the issue.

kind regards

8 Replies
Vijay2027
Expert

From my experience I had very limited success with vCHA. I ended up destroying vCHA nodes each time there was a failover.

I would suggest using VAMI-based backup, which is more reliable.

AFAIK, on a passive node you will see the postgres and vcha services running.

sjesse
Leadership

Is your VCHA network on an isolated, unroutable network? If not, you should fix this; we missed this in our first attempt and saw something similar. Make sure that you have followed all of the steps in "vCenter HA Hardware and Software Requirements".

Also give it time: don't fail over and then fail back immediately. I think there is a replication that needs to finish even if it says it's up to date. I'd wait at least 15 minutes before failing back.

KeirL
Contributor

Thanks both for the replies

The problem I have is that the failed vCenter server never recovers, so it's not an issue of failing back too quickly. Sadly, I can't fail back at all.

Thanks for the info on the services. I think that's the key problem. The vmware-vpostgres service runs when I start it manually and is also fine following a reboot, but the problem is with the vmware-vcha service, which I can't get started, and I'm stumped at this point.

I run:

# service-control --start vmware-vcha

#Service vmware-vcha startup type is not automatic. Skip

In the vCenter web client the vcha service is already set to a startup type of automatic, but this is on the active vCenter server, and I can't find how to set the startup type of the vcha service on the passive node from the command line. If I could get the service-control --start vmware-vcha command to complete successfully, that might be what I need.

The other thing I notice is that eth0 doesn't have an IP address. I'm thinking this might be the correct condition, as this is the passive node and it shouldn't be accessible on the network until it becomes active, but it would be useful to know if this is correct.

I'm not sure if the HA network is routable - I'll need to check with the network team. Perhaps that's it.

thanks

Vijay2027
Expert

The eth0 interface will not have an IP address on the passive node.

MartinTillbrook
Contributor

I've got this exact same problem. When trying to start the HA service (vmware-vcha) I get an error saying "Service vcha startup type is not automatic. skip"

I'm guessing there must be a command to change the startup type of this service from the shell, but I can't find anything online about it.

Please help!

Vijay2027
Expert

cd to /usr/lib/vmware-vmon and use vmon-cli to check the start type of the vcha service and set it to AUTOMATIC.

Sample output from my lab:

root@vcsa1 [ /usr/lib/vmware-vmon ]# ./vmon-cli -s vcha
Name: vcha
Starttype: DISABLED
RunState: STOPPED
HealthState: UNHEALTHY

root@vcsa1 [ /usr/lib/vmware-vmon ]# ./vmon-cli -S AUTOMATIC -U vcha
Completed Service State Update request.

root@vcsa1 [ /usr/lib/vmware-vmon ]# ./vmon-cli -s vcha
Name: vcha
Starttype: AUTOMATIC
RunState: STOPPED
HealthState: UNHEALTHY

root@vcsa1 [ /usr/lib/vmware-vmon ]#
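
If you want to repeat this on several lab nodes, the same steps could be wrapped in a small helper. This is only a sketch: the function names vcha_starttype and enable_vcha are made up for illustration, and it assumes vmon-cli -s prints a "Starttype:" line exactly as in my sample output. It only changes the start type; whether the service then stays up is a separate question.

```shell
# Path taken from the directory used in the sample session.
VMON_CLI=/usr/lib/vmware-vmon/vmon-cli

# Extract the start type from `vmon-cli -s <service>` output read on stdin.
# Assumes a "Starttype: VALUE" line, as in the sample output.
vcha_starttype() {
    awk -F': *' '$1 == "Starttype" { print $2; exit }'
}

# Hypothetical wrapper: switch vcha to AUTOMATIC only when it is not
# already, then try to start it with the service-control command that
# was failing earlier in the thread.
enable_vcha() {
    current=$("$VMON_CLI" -s vcha | vcha_starttype)
    if [ "$current" != "AUTOMATIC" ]; then
        "$VMON_CLI" -S AUTOMATIC -U vcha || return 1
    fi
    service-control --start vmware-vcha
}
```

Run enable_vcha as root on the node where the service is skipped, then check the state again with ./vmon-cli -s vcha.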

Kahonu84
Hot Shot

FWIW I spent nearly a week trying to get VCHA working. In the end the VMware support person I ended up working with suggested I abandon VCHA as it's not reliable.

wrobertson1
VMware Employee

From what I've found, the following services need to be running on the passive node: vmware-statsmonitor, vmware-vmon, vmware-vpostgres, vmware-vcha.
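
A quick way to compare a node against that list is sketched below. The helper names running_section and missing_services are hypothetical. The sketch assumes service-control --status --all prints a "Running:" heading followed by the service names, and it ignores a leading "vmware-" on both sides because the names may be shown differently between versions.

```shell
# Print the lines listed under the "Running:" heading of
# `service-control --status --all` output read on stdin.
# Assumes each section starts with a "Heading:" line (assumption).
running_section() {
    awk '/^Running:/ { on = 1; next } /^[A-Za-z]+:/ { on = 0 } on'
}

# Print every expected service (given as arguments) that is not in the
# whitespace-separated list of running services read on stdin.
missing_services() {
    running=$(tr -s ' \t' '\n\n' | sed 's/^vmware-//')
    for svc in "$@"; do
        name=${svc#vmware-}
        printf '%s\n' "$running" | grep -qxF -- "$name" || printf '%s\n' "$svc"
    done
}

# Example, run on the passive node:
#   service-control --status --all | running_section |
#       missing_services vmware-statsmonitor vmware-vmon vmware-vpostgres vmware-vcha
```

Anything it prints is a service from the list that is not currently running.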

You can get all the above services running again by copying the /etc/vcha directory from the active vCenter to the previously active vCenter. However, there still remains a problem with database synchronization. If you look at the PostgreSQL logs on the passive node, they show a WAL entry that is required to sync the database. That's about as far as I've gotten so far.
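
For anyone wanting to look at replication from the database side, here is a rough sketch using standard PostgreSQL views. It is not a VCHA-specific tool: the psql location and the postgres user are assumptions about the appliance, and has_streaming_standby is a hypothetical helper name. pg_stat_replication (queried on the active node) and pg_is_in_recovery() (queried on the passive node) are stock PostgreSQL.

```shell
# Read `psql -At` rows of "client_addr|state" on stdin and succeed only
# if at least one standby is in the "streaming" state.
has_streaming_standby() {
    awk -F'|' '$2 == "streaming" { found = 1 } END { exit !found }'
}

# Example usage (paths and user are assumptions, adjust to your appliance):
#   PSQL=/opt/vmware/vpostgres/current/bin/psql
#
#   On the active node, list connected standbys and their state:
#   "$PSQL" -U postgres -At -c 'SELECT client_addr, state FROM pg_stat_replication' |
#       has_streaming_standby && echo "a standby is streaming"
#
#   On the passive node, "t" means PostgreSQL is running as a standby:
#   "$PSQL" -U postgres -At -c 'SELECT pg_is_in_recovery()'
```

If the active node shows no row in the streaming state, that would be consistent with the passive node still waiting for WAL it cannot get.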
