VMware Cloud Community
VirtualredSE
Contributor

2 Site Cluster DR

Hi

I am creating a DR plan for a customer's VMware environment, which we currently support, and would appreciate any input on the scenario.

I do not have much scope from the customer yet, but that should come. As I have no history with or knowledge of the environment's design, I have been doing some fact-finding and analysis, looking at the different failure scenarios that could occur and what improvements are possible in its current state.

The environment is a single logical cluster stretched across 2 sites and essentially looks like this. The storage does not appear to be uniform across both sites.

Site 1                                     Site 2

11 x ESXi 6.5 hosts, 264 cores, 4TB Mem    2 x ESXi 6.5 hosts, 48 cores, 768 Mem

Primary NetApp storage (HA, 4 node)        Backup NetApp storage (single node)

NFS datastores                             NFS datastores (presented from primary)

Local storage snapshots, replicated        Snapshot vault

VCSA 6.5 with local host affinity

If I consider the worst case to be a full site outage at the primary site, I can see the following impact:

Impact: Cluster has only 2 hosts remaining for compute

        Primary HA NetApp storage unavailable

        Network unavailable

        No automatic vCenter actions

        Guest VMs on the 2 hosts continue to run

        Backup NetApp storage local to the site available

        Some network available

        Restore of most recent vCenter required

        Manual mount of datastores from latest snapshot required

My initial thoughts are to distribute cluster compute more evenly and/or provide more at Site 2, although DRS is obviously a consideration here.
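To put a rough number on the compute gap, here is a minimal sketch of the arithmetic using the host figures listed above. It assumes "4TB" means 4096 GB and that "768 Mem" means 768 GB (the unit is not stated in the post); the `Site` class and `surviving_fraction` function are just illustrative names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    name: str
    hosts: int
    cores: int
    mem_gb: int  # assumption: memory expressed in GB


def surviving_fraction(lost: Site, remaining: Site) -> dict:
    """Fraction of total cluster cores and memory left if `lost` goes down."""
    total_cores = lost.cores + remaining.cores
    total_mem = lost.mem_gb + remaining.mem_gb
    return {
        "cores": remaining.cores / total_cores,
        "mem": remaining.mem_gb / total_mem,
    }


# Figures from the post; 4TB taken as 4096 GB, 768 assumed to be GB.
site1 = Site("Site 1", hosts=11, cores=264, mem_gb=4096)
site2 = Site("Site 2", hosts=2, cores=48, mem_gb=768)

frac = surviving_fraction(lost=site1, remaining=site2)
print(f"Cores remaining:  {frac['cores']:.1%}")
print(f"Memory remaining: {frac['mem']:.1%}")
```

Under those assumptions, Site 2 holds roughly 15% of the cluster's cores and memory, so only a small subset of workloads could be restarted there after losing Site 1.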

VCSA in HA might then provide some benefit, depending on the DR scenario.

HA and DRS are both enabled but appear to have been left at their default configuration. So, whilst I appreciate it’s not intended to be a DR solution, it could make a difference for VCSA to be available.

My thinking is that, with VCSA available, more could be done with HA configuration for DR events where primary storage and network are still available. VCSA also hosts the NetApp plugin for storage management, which they currently use to mount and restore VMs from primary or backup storage.

I have been reading a vSphere Metro Storage Cluster practices document for recommendations on specific scenarios, but would welcome anyone else’s views or experiences of DR or DR avoidance.

I realise the scope of this kind of thing is very large but any input would be appreciated!!

Thanks 

3 Replies
pavelkovar
VMware Employee

Hi,

The best practice is to have two vCenter Servers (one per site, which requires two vCenter licences) or vCenter HA (one licence is enough, but you will need a third site for the vCenter HA witness).

You can also schedule automatic backups of the VCSA at the primary site and restore it at the secondary site in case of disaster (this is not automatic, and the restore takes time).

Regarding the other VMs, have you considered using vSphere Replication? You can replicate selected VMs from the primary site to the secondary. It works with one or two vCenter Servers and is included at no extra cost with all vSphere editions (except the Essentials Kit).

How vSphere Replication Works

daphnissov
Immortal

I think you're going about this design in a way that is untenable for a DR strategy. Your main problem is here:

"The single logical cluster environment stretched across 2 sites"

This implies you have a single vSphere cluster with 13 ESXi hosts distributed across two sites with asymmetric storage. This is not at all a good idea and not how clusters are supposed to be designed. A stretched cluster in the way you're probably envisioning entails having stretched storage available, which you don't have.

Secondly, given that you should have separate clusters at separate sites, vCHA may not work, depending on the distance between them. Even if it does, it's not designed to provide site-level fault tolerance. What you need here is two vCenters, one per site, with a replication and failover strategy between them. Lots of applications fit this bill: SRM, Veeam, Zerto, etc. They all involve that second vCenter being available to conduct the failover.

My biggest recommendation is to go and read some design papers on proper vSphere DR designs, because yours should be more-or-less textbook, and the basic strategy I've laid out above is congruent with most of them.

VirtualredSE
Contributor

Thanks for taking the time to respond.

I agree it does not make for any kind of DR strategy and that has not gone unnoticed. I have been trying to figure out the thought process behind the design myself as I do not see it.

This is my first involvement with this public sector customer's infrastructure, and our operational responsibilities cover only the VMware and storage layers. Unfortunately, I do not think anyone on our side has so far questioned or attempted to document how we would recover their VMware infrastructure if they lose Site 1. Hence where I am now...

What I was trying to do (perhaps somewhat in vain) was to provide some recommendations to improve the current situation in the short term: hence the thoughts around increased capacity at the secondary site and vCHA availability, which may help in some partial failure scenarios that do not involve full site loss or primary storage loss. That said, the processes and manual work involved even then would be pretty painful and would not amount to a clear, re-testable plan for multiple scenarios.
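As a rough illustration of what a re-testable plan could look like, here is a minimal sketch that maps failure scenarios to the manual steps they trigger. The first two steps come from the impact list in my original post; the third step, the scenario names and the `Scenario` and `manual_steps` identifiers are my own assumptions for illustration, not an established runbook.

```python
from typing import List, NamedTuple


class Scenario(NamedTuple):
    name: str
    site1_hosts_up: bool
    primary_storage_up: bool
    vcenter_up: bool


def manual_steps(s: Scenario) -> List[str]:
    """Return the manual recovery steps implied by what is unavailable."""
    steps = []
    if not s.vcenter_up:
        # From the impact list: vCenter must be restored first.
        steps.append("Restore most recent vCenter backup")
    if not s.primary_storage_up:
        # From the impact list: datastores come from the snapshot vault.
        steps.append("Manually mount datastores from latest snapshot")
    if not s.site1_hosts_up:
        # Assumption: priority VMs restarted on the 2 remaining hosts.
        steps.append("Restart priority VMs on Site 2 hosts")
    return steps


# Hypothetical scenarios to walk through in a test exercise.
scenarios = [
    Scenario("Full Site 1 outage", False, False, False),
    Scenario("Site 1 compute loss only", False, True, True),
    Scenario("Primary storage loss only", True, False, True),
]

for s in scenarios:
    print(s.name)
    for step in manual_steps(s):
        print("  -", step)
```

Writing the scenarios down as data like this makes it easier to see which ones share steps and which have no documented recovery path yet.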

It may be better to cut to the chase, report back all my findings and propose some investment in the right application and overall design to get them to a better place. That is not currently in the pipeline as a project, but it may well have to be. We are real advocates of Veeam with our customers, so I will certainly take a look at Veeam Availability Suite and how it could work for them.
