NSX Edge heartbeat

chadc1979 · ‎08-07-2019

Has anyone run into an issue when rebooting a SAN with the proper 60-second timeout set? All the VMs recover just fine, including VMs that are part of an MS cluster, but the Perimeter Gateway fails to send a heartbeat to the standby member. Both appliances go into an unknown state, neither primary nor standby, and the file system becomes read-only until a reboot.

I’ve got OneArm LBs using NSX Edge in an active/standby configuration and they have no issue, nor does the DLR, but the Perimeter Gateway just seems to die. Can the heartbeat be changed to 1 minute or more? That’s the duration needed to restart and fail over the controllers.

HassanAlKak88 · ‎08-08-2019

Can you advise if you have an ESG appliance with HA?

Is your request how to decrease the dead time?

Also, please share the NSX version in use.

Regards,
Hassan Alkak

chadc1979 · ‎08-09-2019

It's 6.4.5. The appliances are in HA mode, along with the DLR and 3 other ESG appliances that aren't having an issue.

The OSPF settings are configured as recommended on the Perimeter-Gateway and DLR:

Hello interval: 30

Dead interval: 120

The OneArm-LoadBalancers are exactly the same apart from the OSPF configuration; nothing was changed from the defaults, and they do not experience the issue.

The Perimeter-Gateway reports that it didn't receive any heartbeats, both appliances show nothing for active/standby, the file system changes to read-only, and I have to reboot them.

DLR and OneArm-ESGs keep chugging along though and never have an issue.

I redeployed the appliances as well, thinking maybe they were messed up, but a controller failover on my SAN always seems to cause it. The failover takes less than a minute, and the recommended timeouts are configured on all the ESXi hosts.
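To make the timing question concrete, here is a rough sketch that compares each timer against the length of the storage outage. The OSPF dead interval (120 s) and the failover duration (just under 60 s) come from the posts above; the Edge HA dead time of 15 s is only an assumed placeholder, since the actual value was not posted, and the function and variable names are made up for illustration.

```python
def survives_outage(dead_interval_s: float, outage_s: float) -> bool:
    """Return True if a peer would NOT be declared dead during the outage.

    A timer rides out the outage only if its dead interval is longer
    than the time the appliance is unable to send heartbeats/hellos.
    """
    return dead_interval_s > outage_s


# Upper bound for the SAN controller failover ("less than a minute").
san_failover_s = 60

timers = {
    "ospf_dead_interval": 120,  # from the thread
    "edge_ha_dead_time": 15,    # ASSUMPTION: placeholder, not stated in the thread
}

results = {
    name: survives_outage(value, san_failover_s)
    for name, value in timers.items()
}

for name, ok in results.items():
    status = "rides out the failover" if ok else "expires during the failover"
    print(f"{name}: {status}")
```

This only models the timers. It says nothing about the guest file system going read-only, which is a separate symptom of the storage stall.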

dyadin · ‎08-15-2019

What do you mean by rebooting the SAN?

Are you rebooting the underlying storage the Edges are using? In that case, you must enable vSphere HA APD to restart VMs, or manually reboot the Edges after storage is restored.

The best practice is to turn off all the VMs, put ESXi in maintenance mode, and then reboot the SAN. What you're doing is quite dangerous; you might lose data or corrupt VMs.

Cheers, Matt Zhang VCIX-NV | VCP-NV-CMA-DTM | CCDA | CCIE R&S
