8 Replies Latest reply on Jun 19, 2019 6:48 AM by wrobertson1

    vCenter Server Appliance High Availability - passive DOWN

    KeirL Lurker

      Hi

      I'm currently testing VCSA HA in a lab and I am seeing the following situation.

      When I initiate a manual failover, everything works fine: the passive vCenter Server becomes the active server and the old active server becomes the passive server. I can also fail back without any problems.

      However, when I simulate an uncontrolled failure of the active vCenter instance (e.g. by powering off the VM), the failover itself works fine, but the failed vCenter instance does not rejoin the cluster when it is powered back up. This has happened every time I have tried it.

      Is it normal to have to destroy and recreate the VCHA cluster after such a failure? I see that suggested quite often as a way to remediate issues.

       

      I'd like to troubleshoot this scenario rather than just rebuild the cluster each time, but I'm not sure where to look.

       

      If I log into the vCenter Server that is failing to rejoin the cluster (the passive node) and run service-control --status --all, I notice that not many services are running. Can anyone tell me which services should be running on the passive node? In particular, should vmware-vcha be running? When I try to start it, the command reports that the service's startup type is not set to automatic and skips it.

      The VCHA monitoring screen shows the active and witness nodes as 'UP' and the passive node as 'DOWN'. It suggests checking that the passive node is online and reachable over the heartbeat network, and I can ping the passive node from the active node on its heartbeat IP address without problems. It then says to check that replication is OK. How do I do this? I can see that the vmware-vpostgres service is running (I had to start it manually), but what more can I do to check that replication is in sync?

       

      Any thoughts would be very much appreciated. I'm most keen to understand which services should be running on the passive node, as I suspect that is where the problem lies.

       

      Kind regards

        • 1. Re: vCenter Server Appliance High Availability - passive DOWN
          Vijay2027 Expert
          vExpert

          In my experience, I had very limited success with vCHA. I ended up destroying the vCHA nodes each time there was a failover.

          I would suggest using VAMI-based backup instead, which is more reliable.

           

          AFAIK, on a passive node you will see the postgres and vcha services running.

          • 2. Re: vCenter Server Appliance High Availability - passive DOWN
            sjesse Master
            vExpert

            Is your VCHA network on an isolated, non-routable network? If not, you should fix that. We missed this on our first attempt and saw something similar. Make sure you have completed all of the steps in:

             

            vCenter HA Hardware and Software Requirements

             

            Also, give it time: don't fail over and then fail back immediately. I think there is a replication process that needs to finish even if the status says it's up to date. I'd wait at least 15 minutes before failing back.
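One way to check whether that replication has actually caught up, rather than trusting the summary status, is PostgreSQL's standard pg_stat_replication view, queried on the active node. This is a hedged sketch: it assumes VCHA's database replication is native PostgreSQL streaming replication, that psql lives at /opt/vmware/vpostgres/current/bin/psql, and that you can connect as the postgres user; check_replication is a hypothetical helper name. A standby that is still catching up reports a state of catchup rather than streaming.

```sh
#!/bin/sh
# Sketch only: run on the ACTIVE node to see whether a standby is streaming.
# ASSUMPTIONS: the psql path below, connecting as the "postgres" user, and that
# VCHA uses native PostgreSQL streaming replication. check_replication is a
# hypothetical helper, not a VMware tool.
PSQL=${PSQL:-/opt/vmware/vpostgres/current/bin/psql}

# Reads "client_addr|state|sync_state" rows (psql -At output) on stdin, echoes
# each standby, then prints OK if any row is "streaming", else NOT_STREAMING.
check_replication() {
  awk -F'|' '
    { print "standby " $1 ": state=" $2 ", sync_state=" $3 }
    $2 == "streaming" { ok = 1 }
    END { print (ok ? "OK" : "NOT_STREAMING") }'
}

# Only query the database if the assumed psql binary exists (i.e. on the appliance).
if [ -x "$PSQL" ]; then
  "$PSQL" -U postgres -At -c \
    "SELECT client_addr, state, sync_state FROM pg_stat_replication;" |
    check_replication
fi
```

An empty result (NOT_STREAMING with no standby lines) would suggest the passive node is not connected to the database at all, which is a different problem from replication merely lagging.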

            • 3. Re: vCenter Server Appliance High Availability - passive DOWN
              KeirL Lurker

              Thanks to you both for the replies.

              The problem I have is that the failed vCenter Server never recovers, so it's not an issue of failing back too quickly; I can't fail back at all, sadly.

              Thanks for the info on the services; I think that's the key problem. The vmware-vpostgres service runs when I start it manually and is also fine after a reboot, but I can't get the vmware-vcha service started, and I'm stumped at this point.

              I run:

                  # service-control --start vmware-vcha

              and get:

                  Service vmware-vcha startup type is not automatic. Skip

               

              In the vCenter Web Client the vcha service is already set to a startup type of Automatic, but that is on the active vCenter Server, and I can't find how to set the startup type of the VCHA service on the passive node from the command line. If I could get service-control --start vmware-vcha to complete successfully, that might be all I need.

              The other thing I notice is that eth0 doesn't have an IP address. I think this might be expected, since this is the passive node and it shouldn't be reachable on the network until it becomes active, but it would be useful to know whether that is correct.

              I'm not sure whether the HA network is routable; I'll need to check with the network team. Perhaps that's it.

               

              Thanks

              • 4. Re: vCenter Server Appliance High Availability - passive DOWN
                Vijay2027 Expert
                vExpert

                The eth0 interface will not have an IP address on the passive node.

                • 5. Re: vCenter Server Appliance High Availability - passive DOWN
                  MartinTillbrook Lurker

                  I've got exactly the same problem. When I try to start the HA service (vmware-vcha), I get an error saying "Service vcha startup type is not automatic. skip"

                   

                  I'm guessing there must be a command to change the startup type of this service from the shell, but I can't find anything online about it.

                   

                  Please help!

                  • 6. Re: vCenter Server Appliance High Availability - passive DOWN
                    Vijay2027 Expert
                    vExpert

                    cd to /usr/lib/vmware-vmon and use vmon-cli to check the vcha start type and change it to AUTOMATIC.

                    Sample output from my lab:

                     

                    root@vcsa1 [ /usr/lib/vmware-vmon ]# ./vmon-cli -s vcha
                    Name: vcha
                    Starttype: DISABLED
                    RunState: STOPPED
                    HealthState: UNHEALTHY

                    root@vcsa1 [ /usr/lib/vmware-vmon ]# ./vmon-cli -S AUTOMATIC -U vcha
                    Completed Service State Update request.

                    root@vcsa1 [ /usr/lib/vmware-vmon ]# ./vmon-cli -s vcha
                    Name: vcha
                    Starttype: AUTOMATIC
                    RunState: STOPPED
                    HealthState: UNHEALTHY
                    root@vcsa1 [ /usr/lib/vmware-vmon ]#
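Note that in the sample the RunState is still STOPPED after the start type changes, so the service still has to be started afterwards. The sketch below strings the steps together for the passive node. It uses only the vmon-cli -s and -S AUTOMATIC -U invocations from the sample plus the service-control --start command from earlier in the thread; get_starttype is a hypothetical helper, and the script does nothing unless /usr/lib/vmware-vmon/vmon-cli exists. Whether vcha then stays healthy still depends on the underlying replication state.

```sh
#!/bin/sh
# Sketch: ensure vcha has an AUTOMATIC start type, then start it.
# Uses only commands shown in this thread; get_starttype is a hypothetical helper.
VMON_CLI=/usr/lib/vmware-vmon/vmon-cli

# Reads `vmon-cli -s <service>` output on stdin and prints the Starttype value.
get_starttype() {
  awk -F': *' '$1 == "Starttype" { print $2 }'
}

if [ -x "$VMON_CLI" ]; then
  current=$("$VMON_CLI" -s vcha | get_starttype)
  if [ "$current" != "AUTOMATIC" ]; then
    # Same call as in the sample output above.
    "$VMON_CLI" -S AUTOMATIC -U vcha
  fi
  # Changing the start type does not start the service (RunState stays STOPPED).
  service-control --start vmware-vcha
  "$VMON_CLI" -s vcha
fi
```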

                    • 7. Re: vCenter Server Appliance High Availability - passive DOWN
                      Kahonu84 Expert

                      FWIW, I spent nearly a week trying to get VCHA working. In the end, the VMware support person I ended up working with suggested I abandon VCHA, as it's not reliable.

                      • 8. Re: vCenter Server Appliance High Availability - passive DOWN
                        wrobertson1 Lurker

                        From what I've found, the following services need to be running on the passive node: vmware-statsmonitor, vmware-vmon, vmware-vpostgres and vmware-vcha.
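A quick way to compare a passive node against that list is to parse the output of service-control --status --all. This is a sketch under stated assumptions: it assumes the output groups service names under Running: and Stopped: headings, each followed by a space-separated line of names (verify against your own output first). The list itself is copied from the post above rather than from VMware documentation, and missing_services is a hypothetical helper.

```sh
#!/bin/sh
# Sketch: report which of the services listed above are not running.
# ASSUMPTION: `service-control --status --all` prints "Running:" / "Stopped:"
# headings, each followed by space-separated service names.
REQUIRED="vmware-statsmonitor vmware-vmon vmware-vpostgres vmware-vcha"

# Reads the status output on stdin; prints each required service not listed as running.
missing_services() {
  awk -v required="$REQUIRED" '
    /^[A-Za-z]+:[[:space:]]*$/ { section = $1; next }   # a heading line
    section == "Running:" { for (i = 1; i <= NF; i++) running[$i] = 1 }
    END {
      n = split(required, req, " ")
      for (i = 1; i <= n; i++) if (!(req[i] in running)) print req[i]
    }'
}

# Only run the real command where service-control exists (i.e. on the appliance).
if command -v service-control >/dev/null 2>&1; then
  missing=$(service-control --status --all | missing_services)
  if [ -n "$missing" ]; then
    echo "Not running: $missing"
  else
    echo "All listed passive-node services are running."
  fi
fi
```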

                         

                        You can get all of the above services running again by copying the /etc/vcha directory from the active vCenter to the previously active vCenter (now the passive node). However, a problem with database synchronization remains: the PostgreSQL logs on the passive node report a WAL entry that is required to sync the database. That's as far as I've gotten so far.
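To pull those WAL-related messages out of the passive node's PostgreSQL logs without paging through whole files, a small filter like the following may help. It is a sketch: the log directory /var/log/vmware/vpostgres and the *.log file naming are assumptions about the appliance layout (override LOG_DIR if yours differs), and wal_lines is a hypothetical helper. It only reads the logs and does not attempt to repair replication.

```sh
#!/bin/sh
# Sketch: show WAL-related lines from the passive node's PostgreSQL logs.
# ASSUMPTION: vPostgres logs live under /var/log/vmware/vpostgres as *.log
# files; override LOG_DIR if your build differs. wal_lines is hypothetical.
LOG_DIR=${LOG_DIR:-/var/log/vmware/vpostgres}

# Reads log text on stdin and prints lines mentioning WAL (case-sensitive),
# keeping only the last 20 matches in input order.
wal_lines() {
  grep 'WAL' | tail -n 20
}

# Only read logs if the assumed directory exists (i.e. on the appliance).
if [ -d "$LOG_DIR" ]; then
  cat "$LOG_DIR"/*.log 2>/dev/null | wal_lines
fi
```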