Solved: Disaster Recovery Procedure with just one vCenter and one NSX Manager

WarlockArg · ‎04-07-2020

Hi guys,

Does anyone know of, or have experience with, a documented good practice for a Disaster Recovery plan when the infrastructure has one vCenter, one NSX Manager and two different sites?
My scenario has two sites: Site-A, the main site, and Site-B, the recovery site. The vCenter, NSX Manager and Controller Cluster are deployed in Site-A.

Site A:

One Cluster with 6 ESXi

One Distributed Switch

Site B:

One Cluster with 4 ESXi

One Distributed Switch

The Transport Zone spans across the two DVS.

I have one Logical Switch that extends over the two sites with one DLR connected to that switch.

In each site I have an ESG that connects to the DLR via a Transit Network (also an NSX Logical Switch). The ESG at each site forms a BGP neighborship with the DLR and advertises a default route to it. On the DLR I configured the advertisements from the Site-A ESG with a greater weight, so its default route takes precedence over the one advertised by the Site-B ESG.
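To make the intended egress behaviour concrete, here is a minimal sketch of the selection rule described above: among the default routes learned from reachable neighbors, the one with the highest weight wins, and the Site-B route only takes over when the Site-A ESG disappears. The ESG names and weight values are hypothetical, and this models only the weight comparison, not the full BGP best-path algorithm.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DefaultRoute:
    next_hop: str           # hypothetical ESG name, not taken from the thread
    weight: int             # weight configured on the DLR for this neighbor
    reachable: bool = True  # False once the BGP neighbor is down


def pick_default_route(routes: List[DefaultRoute]) -> Optional[DefaultRoute]:
    """Return the reachable default route with the highest weight, if any."""
    candidates = [r for r in routes if r.reachable]
    if not candidates:
        return None
    return max(candidates, key=lambda r: r.weight)


# Hypothetical weights: Site-A preferred over Site-B.
routes = [DefaultRoute("esg-site-a", 60), DefaultRoute("esg-site-b", 30)]
print(pick_default_route(routes).next_hop)
```

With both ESGs up, the sketch selects the Site-A ESG. Marking that entry as unreachable makes the Site-B ESG the only candidate.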

The DLR Control VMs (they are in HA) are in Site-A.

When I ran the first test of the recovery plan I ran into the following problems. I was able to solve them, but I don't know whether I did it the correct way:

1) I assumed that all of Site-A was down.

2) I started up a replica of the vCenter in Site-B with the last changes made in Site-A.

3) I started up an NSX Manager replica in Site-B with the last changes made in Site-A.

4) At this point I had the vCenter and NSX Manager exactly as I had them in Site-A, connected to each other and working OK.

5) When I tried to deploy a new controller cluster in Site-B, I ran into the old controller cluster from Site-A, because according to the NSX Manager database those controllers were still up and running. Since I had no connection to Site-A (I was simulating a total site failure), I couldn't deploy a new controller node: there was no live controller for it to join. I couldn't delete any of the old controllers either, because the vCenter had no connection to those VMs at Site-A.

Solution for this point:

I deleted all the controllers using API calls, and after that I could deploy the new controller cluster.

Is this the correct way to overcome the controller issue? Is the design I'm using correct? I have the 3 controllers in Site-A, but when that site goes down I lose the ability to deploy a new controller cluster until I delete the old one.
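For anyone wanting to script the controller cleanup from step 5, a rough sketch follows. It assumes the NSX-v REST endpoint /api/2.0/vdn/controller for listing controllers, a DELETE on /api/2.0/vdn/controller/{id} with forceRemoval=true for removing them, and a response shaped like the sample XML. The thread does not state which calls were used, so verify the paths and the XML layout against the API guide for your NSX version. The manager hostname and credentials are placeholders, and the actual send is left commented out.

```python
import base64
import urllib.request
import xml.etree.ElementTree as ET
from typing import List

# Assumed NSX-v endpoint; verify against the API guide for your version.
CONTROLLER_PATH = "/api/2.0/vdn/controller"


def parse_controller_ids(xml_text: str) -> List[str]:
    """Extract controller IDs from a controller listing (assumed XML layout)."""
    root = ET.fromstring(xml_text)
    return [c.findtext("id") for c in root.iter("controller") if c.findtext("id")]


def build_delete_request(manager: str, controller_id: str,
                         user: str, password: str) -> urllib.request.Request:
    """Build (but do not send) a forced DELETE for one stale controller."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    url = f"https://{manager}{CONTROLLER_PATH}/{controller_id}?forceRemoval=true"
    return urllib.request.Request(
        url, method="DELETE", headers={"Authorization": f"Basic {token}"}
    )


# Hypothetical listing standing in for the GET response.
sample = (
    "<controllers>"
    "<controller><id>controller-1</id></controller>"
    "<controller><id>controller-2</id></controller>"
    "</controllers>"
)

for cid in parse_controller_ids(sample):
    req = build_delete_request("nsxmgr.example.local", cid, "admin", "secret")
    print(req.get_method(), req.full_url)
    # urllib.request.urlopen(req)  # uncomment to actually send the request
```

Building the requests separately from sending them makes it possible to review exactly which controllers would be removed before touching the NSX Manager.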

6) After that, the 4 ESXi hosts at Site-B appeared as not configured and not prepared, with no communication channel. I had to prepare them again. I tried "Resolve All" to fix the problems, but the hosts stayed red with a "Not Ready" label in the NSX Version. However, NSX communication works OK. The same thing happened in another NSX infrastructure I had: the Host Preparation tab shows errors and the communication channel in red, but everything works OK. It seems to be a GUI issue that doesn't show the correct host preparation status.

7) When I wanted to redeploy the DLR Control VMs I found the real problem with this plan. The redeploy option for the Edges tries to redeploy them on the same resources they were running on, but in this case those resources (cluster, hosts and datastores) were not accessible. I couldn't edit the Control VM properties in the Configure tab of the edge either (the option you use when you want to migrate the Control VM to another host, for example): in that case it deployed to the hosts of Site-B, but its last step is to power off the old Control VM at Site-A. Since it couldn't find those original VMs, the whole process failed.

So, at this point I had no option for redeploying the Control VMs.

Solution for this point:

Again I had to use the API to redeploy the Control VMs, changing the cluster ID, datastore ID and host ID that the edge had configured in the NSX Manager database to new values pointing to the cluster, datastore and hosts in Site-B.
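A sketch of that placement rewrite is below. It assumes the edge appliance configuration is retrieved as XML (for example from /api/4.0/edges/{edgeId}/appliances), that it carries resourcePoolId, datastoreId and hostId elements, and that the cluster is referenced through resourcePoolId. The thread does not show the payload, so the element names and every ID here are assumptions. The sketch only rewrites the XML; sending it back with a PUT is left out.

```python
import xml.etree.ElementTree as ET

# Element names assumed from the NSX-v appliance configuration schema.
PLACEMENT_TAGS = ("resourcePoolId", "datastoreId", "hostId")


def retarget_appliances(xml_text: str, new_ids: dict) -> str:
    """Point every appliance in the payload at new placement IDs."""
    root = ET.fromstring(xml_text)
    for appliance in root.iter("appliance"):
        for tag in PLACEMENT_TAGS:
            if tag not in new_ids:
                continue
            node = appliance.find(tag)
            if node is None:
                # Add the element when the original payload lacks it.
                node = ET.SubElement(appliance, tag)
            node.text = new_ids[tag]
    return ET.tostring(root, encoding="unicode")


# Hypothetical payload still pointing at Site-A resources.
current = (
    "<appliances><appliance>"
    "<resourcePoolId>domain-c7</resourcePoolId>"
    "<datastoreId>datastore-11</datastoreId>"
    "<hostId>host-21</hostId>"
    "</appliance></appliances>"
)

# Hypothetical Site-B managed object IDs.
site_b = {
    "resourcePoolId": "domain-c42",
    "datastoreId": "datastore-88",
    "hostId": "host-90",
}

print(retarget_appliances(current, site_b))
```

Any element that is not a placement tag passes through untouched, so the rest of the appliance configuration goes back as it was retrieved.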

Bottom Line:

As you can see, I was able to overcome the issues I found, but it is a hard procedure. Is this the correct procedure recommended by VMware? Is there a simpler one?

Regards,

Guido

Sreec · ‎04-07-2020

The key constraint here is a single VC with a multi-site cluster that is not in a stretched configuration. That is why you end up restoring VC/NSX etc. when Site-A goes down, followed by controller redeployment to ensure controllers are available in the Site-B cluster. If you are really looking for a resilient model, you should seriously consider the options below.

1. Cross-VC NSX design (basically, each site has a vCenter paired with an NSX Manager, each with its respective NSX Manager role).

2. Single VC with a stretched cluster design (the benefit is that we can rely on native vSphere features like DRS/HA to move/restart workloads). You need to consider the VLAN requirement for the management cluster in Site-B as well during a DR event, since there is a high chance we would pin management components to one site while workloads float around, depending on the design.

Please do check https://www.vmware.com/content/dam/digitalmarketing/vmware/en/pdf/products/nsx/vmware-multi-site-sol...

Cheers,
Sree | VCIX-5X| VCAP-5X| VExpert 7x|Cisco Certified Specialist
Please KUDO helpful posts and mark the thread as solved if answered

WarlockArg · ‎04-08-2020

Sreec,

I based the design of my solution precisely on this document. Specifically, my design is depicted in chapter 3, scenario 2, "Multi-site with Single NSX/VC Instances and Separate vSphere Clusters".
Furthermore, there is a mistake in that scenario where it says "In this single vCenter design with local egress, only static routing is supported between the universal distributed logical router (UDLR) and equal cost multipath (ECMP) Edges" at the end of the scenario explanation. It is a conceptual mistake to say "UDLR" when analysing a one-vCenter scenario. With only one vCenter you can't talk about "Universal" objects (UDLR), because for them to exist you need a solution with two vCenters and two NSX Managers.
Anyway, that document doesn't mention that this scenario is not suitable for a Disaster Recovery design. There are almost no documents that explain how to recover in a scenario with only one vCenter and one NSX Manager. This document is focused on the Cross vCenter Multi Site deployment, but nowhere does it say that you shouldn't use this or that scenario for a Disaster Recovery design. It only says that a Cross vCenter Multi Site deployment gives you an enhanced DR solution.

Regards,

Guido.

Sreec · ‎04-10-2020

WarlockArg wrote: Furthermore, there is a mistake in that scenario where it says "In this single vCenter design with local egress, only static routing is supported between the universal distributed logical router (UDLR) and equal cost multipath (ECMP) Edges" at the end of the scenario explanation. It is a conceptual mistake to say "UDLR" when analysing a one-vCenter scenario. With only one vCenter you can't talk about "Universal" objects (UDLR), because for them to exist you need a solution with two vCenters and two NSX Managers.

You can still promote a standalone NSX Manager to the primary role and deploy universal objects. That is the clarification missing from that document. Technically it is possible, with the limitation of static routes between the UDLR and the Edges.

WarlockArg wrote: Anyway, that document doesn't mention that this scenario is not suitable for a Disaster Recovery design. There are almost no documents that explain how to recover in a scenario with only one vCenter and one NSX Manager. This document is focused on the Cross vCenter Multi Site deployment, but nowhere does it say that you shouldn't use this or that scenario for a Disaster Recovery design. It only says that a Cross vCenter Multi Site deployment gives you an enhanced DR solution.

To be precise, whatever we have to do in a Cross-VC NSX design (not stretched cluster), we have to do the same failover procedure for a Single VC design as well (unique site-specific clusters). But like I said, this is not the best design.

Cheers,
Sree | VCIX-5X| VCAP-5X| VExpert 7x|Cisco Certified Specialist
Please KUDO helpful posts and mark the thread as solved if answered
