VMware Networking Community
vXav
Expert

Design concerns/questions: 2 sites - 2 clusters - single vCenter

Hi,

Since Cross-vCenter NSX was released, it seems to have become very popular, which would explain why I struggle to find single-vCenter multi-site designs.

We are looking into NSX for the clusters on our 2 sites, but as good as it is, the per-socket price of NSX Enterprise (needed for the Cross-vCenter feature) is a little bit scary.

So I am trying to see if I can make do with Standard or Advanced. I am aware of the license limitations involved (although I am not sure what "Remote Gateway" and "server activity" are).

See below the picture from the multi-site design guide. It is a very good document, but 95% focused on Cross-vCenter NSX -__-

Details:

  • Egress is active on only one site, controlled by default route metrics.
  • Ingress is managed by public DNS steering to site 1 or 2.

My questions about the below:

  • Concern:
    1. If Site 1 goes down, I assume the data plane in site 2 isn't impacted and everything keeps being forwarded to the ESG there (the one made passive by route metric).
    2. If Site 1 goes Kaboom and is not recoverable, what is the recovery process (in a nutshell)?
  • Question:
    1. Is there a way to move the "Management components" (vCenter, NSX mgr, controller, ...) over to site 2 in order to "Drain stop" site 1? e.g. DC physical maintenance.

pastedImage_1.png

Thanks in advance.

Accepted Solutions
bayupw
Leadership

Hi Xavier,

I believe "remote gateway" refers to NSX Edge VPN; see NSX Edge VPN Configuration Examples.

I'm not sure what "server activity monitoring" is. There is Activity Monitoring in NSX, but that feature is deprecated in NSX 6.3 and Endpoint Monitoring is the replacement.

Do you use local egress in this setup? Please note that in a single vCenter setup local egress only works with static routes, as per the NSX-V Multi-site Options and Cross-VC NSX Design Guide.

If you are not using local egress (which is common in a single VC multi-site setup), egress and ingress of North/South traffic will use only one site.

Don't forget to think about your stateful services, if you have any in your setup (North/South firewall, NAT, load balancer).

The ESGs in Site 2 are passive and are there only for HA purposes.

If you are using dynamic routing (such as BGP), you can give Site 1 a better weight, and dynamic routing should be able to handle the route updates during a site failure.

This should be achievable when you do a network/subnet level failover.

If you would like to do a VM-level failover (partial network/subnet), you would need to do a /32 route injection, but that would be tricky if you have stateful services.
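
To illustrate why a /32 injection pulls a single VM's traffic to the other site while the rest of the subnet stays put, here is a minimal Python sketch of longest-prefix matching. The subnet, the VM address and the next-hop labels are made up for the example and are not taken from this design.

```python
import ipaddress


def best_route(routes, destination):
    """Return the next hop of the most specific route containing destination."""
    dest = ipaddress.ip_address(destination)
    matches = [(net, hop) for net, hop in routes.items() if dest in net]
    if not matches:
        return None
    # The longest prefix wins, as in a normal routing table lookup.
    return max(matches, key=lambda item: item[0].prefixlen)[1]


# Hypothetical starting point: the whole subnet is reached via Site 1.
routes = {ipaddress.ip_network("10.10.20.0/24"): "site1-esg"}

# Hypothetical VM 10.10.20.15 fails over to Site 2: inject a host route for it.
routes[ipaddress.ip_network("10.10.20.15/32")] = "site2-esg"

print(best_route(routes, "10.10.20.15"))  # the failed-over VM
print(best_route(routes, "10.10.20.16"))  # a VM that stayed in Site 1
```

The sketch only covers route selection. It says nothing about where firewall, NAT or load balancer state lives, which is the part that makes this tricky with stateful services.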

Regarding the management components, you would need to recover them manually (if doable) or use a stretched management cluster and let vSphere HA restart the management VMs in Site 2.

Not that I recommend a stretched cluster/vSphere Metro Storage Cluster, but it is one of the options.

The components that need to be restarted in Site 2 are:

- NSX ESG (stateful services such as N/S firewall, NAT, Load Balancer)

- NSX DLR Control VM

- vSphere Management VM (such as vCenter)

- NSX Manager

- NSX Controllers

That's for a Site 1 failure. You may also want to think about a Site 1 WAN router failure, a DCI/inter-site link failure, an ESG failure, etc. In those cases you would most likely need to update the dynamic routing weight or reconfigure stateful services on Site 2, without failing over any management components.

Bayu Wibowo | VCIX6-DCV/NV
Author of VMware NSX Cookbook http://bit.ly/NSXCookbook
https://github.com/bayupw/PowerNSX-Scripts
https://nz.linkedin.com/in/bayupw | twitter @bayupw

7 Replies
vXav
Expert

Hi Bayu, thank you very much for your detailed answer.

Here are my comments and questions on your input. I'd really appreciate it if you could comment on them :) Sorry...

Do you use local egress in this setup?

No, we will only have one egress site, with egress managed by metrics. I guess that means one DLR connected to a transit LS with no default gateway configured, and one ESG (HA pair) on each site connected to the transit LS, with default originate enabled and a lower OSPF cost on the active egress site. The convergence time of OSPF is a bit of a bummer, but I guess it's better than manual fiddling.
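
A minimal Python sketch of the selection logic described above, assuming the DLR simply prefers the default route with the lowest OSPF cost among the ESGs that are still advertising one. The ESG names and cost values are hypothetical, and OSPF timers/convergence are not modelled.

```python
def active_egress(default_routes):
    """Return the ESG advertising the lowest-cost default route, or None."""
    alive = {esg: r["cost"] for esg, r in default_routes.items() if r["up"]}
    if not alive:
        return None
    # Lower OSPF cost is preferred.
    return min(alive, key=alive.get)


# Hypothetical costs: Site 1 is the active egress in normal operations.
default_routes = {
    "esg-site1": {"cost": 10, "up": True},
    "esg-site2": {"cost": 100, "up": True},
}

print(active_egress(default_routes))  # esg-site1

# Site 1 ESG pair stops advertising its default route (failure or maintenance).
default_routes["esg-site1"]["up"] = False
print(active_egress(default_routes))  # esg-site2
```

Swapping the two cost values would be the "manual metric change" way of draining egress from site 1 for maintenance.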

Don't forget you may also want to think about your stateful services if you have in your setup (North/South firewall, NAT, Load Balancer)

We do have a virtual load balancer that we will keep using. We haven't defined yet whether the ESG or the physical firewall will do the N/S firewalling, but in case of a site 1 failure we will point our public DNS to the public IP of site 2. We will probably have passive web and work nodes on site 2 and db replication of some sort from site 1 to a read-only copy in site 2. In case of failure, the customers' active connections will be dropped (we'll have bigger problems anyway :s )
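
The DNS steering decision can be sketched roughly as below, in Python. This is only an illustration: the site names, the health flags and the documentation-range public IPs are assumptions, and how the record is actually updated at the DNS provider is left out.

```python
def dns_target(site_up, public_ips, order=("site1", "site2")):
    """Return the public IP of the first healthy site in preference order."""
    for site in order:
        if site_up.get(site, False):
            return public_ips[site]
    return None


# Documentation-range addresses (RFC 5737), purely illustrative.
public_ips = {"site1": "203.0.113.10", "site2": "198.51.100.10"}

print(dns_target({"site1": True, "site2": True}, public_ips))
print(dns_target({"site1": False, "site2": True}, public_ips))
```

Clients keep using a cached record until its TTL expires, so the TTL on the public record bounds how quickly ingress actually moves to site 2.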

The ESGs in Site 2 are passive and offered only for HA purpose

Agreed, they are meant to take over in case of a failure in site 1, or for maintenance by manually altering the metrics. I have yet to figure out how to detect a failure of the firewall/WAN router; I guess I will need to stick them in the OSPF area somehow...

If you would like to do VM level failover (partial network/subnet) you would need to do a /32 route injection but would be tricky if you have stateful services.

Nah, I'm fine :)

Not that I recommend stretched cluster/vSphere Metro Storage Cluster, but this is one of the option.

The sites should have a good DCI but are too far apart, so no MSC.

The components that need to be restarted in Site 2 are:

So without HA (MSC) available, and a failure in site 1, how would we recover the components? Can we leverage async replication somehow? How would the network side of it be handled (No stretched VLAN in ideal scenario?)

WAN router Site 1 failure

Will need to be in the OSPF gang I guess.

DCI/inter-site link failure

All components in site 2 should be read-only in normal operations. When the DCI comes back, the db replication will catch up with the backlog.

ESG failure

HA pairs. If a whole HA pair fails, OSPF will redirect everyone to site 2 (customers' active connections might be lost).

bayupw
Leadership

Hi Xavier,

Regarding recovery of the management components, I would say you could replicate them (via storage replication or backup software) to a standby management cluster in Site 2 and use a manual recovery process.

But the IP addresses of the management VMs need to stay the same, as I believe changing them would break the underlying applications (vCenter, NSX Manager, etc.).

During disaster recovery, you could probably shut down the management VLAN at Site 1, bring the VLAN up at Site 2 and update the physical routing (WAN, etc.).

Dryv
Enthusiast

Hi Bayu

Now that Xavier has rated the question, I feel I won't be hijacking his thread with a quick question of my own...

Is there any particular reason you would stay away from stretched clusters? Just looking for your expert opinion.

Dryv

bayupw
Leadership

Hi Dryv,

Regarding the reasons, I would say I have to agree with these 2 blog posts:

1. Stretched clusters: almost as good as heptagonal wheels « ipSpace.net by @ioshints

2. Long-Distance vMotion, Stretched HA Clusters and Business Needs « ipSpace.net by @ioshints

Techstarts
Expert

I would totally agree on the stretched cluster part at that time (2 years back). But I was wondering whether it is now quite easy: with NSX Cross-vCenter combined with vSphere Replication, you can achieve it without using EMC's VPLEX solution. Just my thoughts.

With Great Regards,
Sreec
VMware Employee

For sure VR can also be used; however, VPLEX is a different beast, and if you have RP it is a good integration. Keeping NSX out of the equation: ideally, for an active-active DC scenario with a better RPO/RTO and full-fledged automation, an array-based solution is better, and VPLEX works very well in a stretched cluster situation. Remember that every host failure triggers a full sync for the VM (the VM has to power on on the next available host) when we use vSphere Replication.

Cheers,
Sree | VCIX-5X| VCAP-5X| VExpert 7x|Cisco Certified Specialist
Please KUDO helpful posts and mark the thread as solved if answered