3 Replies Latest reply on Aug 15, 2019 6:45 AM by dyadin

    NSX Edge heartbeat

    chadc1979 Novice

      Has anyone run across this issue? When I reboot a SAN with the proper timeout of 60 seconds set, all the VMs recover just fine, including VMs that are part of an MS cluster. The Perimeter Gateway, however, fails to send a heartbeat to the standby member. Both appliances go into an unknown state, neither primary nor standby, and the file system becomes read-only until a reboot.

       

      I’ve got OneArm LBs using NSX Edge in an active/standby configuration, and they have no issue, nor does the DLR, but the Perimeter Gateway just seems to die. Can the heartbeat timeout be changed to 1 minute or more? That’s how long it takes to restart and fail over the controllers.

        • 1. Re: NSX Edge heartbeat
          HassanAlKak88 Expert
          vExpert

          Can you advise if you have an ESG appliance with HA?

          And are you asking how to decrease the dead time?

          Also, please share the NSX version used.

          • 2. Re: NSX Edge heartbeat
            chadc1979 Novice

            It's 6.4.5. The appliances are in HA mode, along with the DLR and 3 other ESG appliances that aren't having an issue.

             

            The OSPF settings are configured as recommended on the Perimeter-Gateway and DLR:

             

            Hello interval: 30

            Dead interval: 120
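
            As a rough sanity check on the timers mentioned in this thread, the sketch below compares a dead interval against the storage failover window. The OSPF values and the 60-second window come from the posts above; the function name and the HA dead time of 15 seconds are hypothetical, chosen only for illustration, so check the actual Edge HA settings before drawing conclusions.

```python
def survives_outage(dead_interval_s: int, outage_s: int) -> bool:
    """Return True if the dead interval is longer than the outage window."""
    return dead_interval_s > outage_s


# Values quoted in this thread.
ospf_hello_s = 30
ospf_dead_s = 120
san_failover_s = 60  # upper bound: "the failover takes less than a minute"

# Hypothetical HA heartbeat dead time, NOT taken from this thread.
ha_dead_s = 15

print(ospf_dead_s // ospf_hello_s)                  # hellos per dead interval: 4
print(survives_outage(ospf_dead_s, san_failover_s))  # True
print(survives_outage(ha_dead_s, san_failover_s))    # False
```

            Under these assumptions the OSPF adjacency would ride out the failover, while a shorter HA dead time would expire first. This only covers the timers; it says nothing about how the guest file system reacts to stalled storage I/O.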

             

            The OneArm-LoadBalancers are exactly the same apart from the OSPF configuration, with nothing changed from the defaults, and they do not experience the issue.

             

            The Perimeter-Gateway will say it didn't receive any heartbeats, both appliances will show nothing for active/standby, and the file system changes to read-only, so I have to reboot them.

             

            DLR and OneArm-ESGs keep chugging along though and never have an issue.

             

            I redeployed the appliances as well, thinking maybe they were messed up, but a controller failover on my SAN seems to always cause it. The failover takes less than a minute, and the recommended timeouts are configured on all the ESXi hosts.

            • 3. Re: NSX Edge heartbeat
              dyadin Novice

              What do you mean by rebooting the SAN?

              Are you rebooting the underlying storage that the Edges are using? In that case, you must either enable vSphere HA APD handling to restart the VMs or manually reboot the Edges after the storage is restored.

              The best practice is to turn off all the VMs, put the ESXi hosts in maintenance mode, and then reboot the SAN. What you're doing is quite dangerous; you might lose data or corrupt VMs.