This is long, so I appreciate your attention.
So, we're in expansion mode and growing our VDI footprint, and I'm tossing around ideas on how to make us safer.
Currently running Horizon 7.12, AppVolumes 2.18.6, DEM 9.9. We have instant clone VDI pools for each Line of Business or major use case, as well as several AppStacks for certain facilities or LOB application groupings. Using an on-prem SQL Always-On cluster for databases.
Our infrastructure VMs run on a separate maintenance cluster, but the compute cluster is a 12-node VxRail cluster. We're running a little less than 70% with just over 1000 simultaneous sessions.
We have an offsite DR hosting facility today, but as a company we are adopting cloud and Azure specifically for DR moving forward.
We are extremely sensitive to IT spend. It took years to convince management that VDI was worth it because it is not significantly cheaper than PCs in terms of hard cost dollars.
Disaster Recovery: There is no DR footprint for any of the Horizon infrastructure.
Business Continuity: There is no infrastructure in place to accommodate planned or unplanned maintenance operations, other than the ability to lose a single node from the VxRail cluster.
Costs: The hardware we use is expensive. There is no way to convince upper management to purchase two clusters for every one cluster's worth of capacity. As we grow, this would quickly become cost-prohibitive. Each cluster is over $800K.
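To put the single-node tolerance in numbers, here is a rough back-of-the-envelope sketch. It assumes load is spread evenly across nodes and that utilization scales linearly when a node drops out, and it treats "a little less than 70%" as a flat 70%. It ignores vSAN rebuild overhead and HA admission control, so treat the output as optimistic. The function name is just something I made up for illustration.

```python
def utilization_after_node_loss(nodes: int, utilization: float, nodes_lost: int = 1) -> float:
    """Estimate cluster utilization after losing nodes.

    Assumes the same total load is redistributed evenly over the
    surviving nodes (a simplification, not a sizing tool).
    """
    surviving = nodes - nodes_lost
    if surviving <= 0:
        raise ValueError("no surviving nodes left in the cluster")
    return utilization * nodes / surviving


# 12-node cluster at roughly 70% utilization (figures from this post).
for lost in (1, 2, 3):
    estimate = utilization_after_node_loss(12, 0.70, lost)
    print(f"{lost} node(s) lost -> about {estimate:.0%} utilization")
```

Under those assumptions, one lost node is comfortable, but there is not much room beyond that, which is why anything larger than a single-node event has nowhere to go today.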
The standard model is Cloud Pod Architecture (CPA) with clusters in the primary site and the DR site. Assuming fast, low-latency connections to the field and the datacenter, this solution would cover both DR and BC needs. Sadly, and as noted, we cannot afford this.
So I'm wondering...
For Business Continuity: Can we implement a completely on-premises Cloud Pod Architecture that dedicates a single cluster/pod to maintenance, and then uses multiple other clusters/pods for capacity and growth? As long as the pods are sized the same, this would allow for the complete failover of any single capacity pod to the maintenance pod. Yes, we would only be protected from failure or corruption in one pod at a time, but two simultaneous capacity pod failures seem unlikely. This effectively adds a "hot spare" pod for the datacenter without duplicating every pod offsite. From the second capacity pod onward, this is much less expensive than buying a duplicate for every pod.
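Here is a minimal sketch of the hardware math behind that claim. It assumes every cluster costs the same, uses $800K as a floor (ours are "over" that), and counts hardware only: no licensing, no Azure DR spend, no networking. The function names are hypothetical, just for illustration.

```python
CLUSTER_COST = 800_000  # assumed floor per cluster, in dollars


def mirrored_cost(capacity_clusters: int, cluster_cost: int = CLUSTER_COST) -> int:
    """Standard model: every capacity cluster has a duplicate (2N)."""
    return 2 * capacity_clusters * cluster_cost


def hot_spare_cost(capacity_clusters: int, cluster_cost: int = CLUSTER_COST) -> int:
    """Proposed model: N capacity clusters plus one shared spare (N+1)."""
    return (capacity_clusters + 1) * cluster_cost


for n in range(1, 5):
    saved = mirrored_cost(n) - hot_spare_cost(n)
    print(f"{n} capacity cluster(s): mirrored ${mirrored_cost(n):,}, "
          f"hot spare ${hot_spare_cost(n):,}, difference ${saved:,}")
```

With one capacity cluster the two models cost the same, since both need two clusters. The gap then grows by one cluster's worth of hardware for each capacity cluster we add after that.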
Note that I understand this does not address DR.
For Disaster Recovery, the thought is maybe we add another single pod in Azure (likely AVS I think) running the minimum 3-node cluster size. That sits there until we have a DR event, at which time we expand the "DR Pod" cluster and perhaps add other clusters for capacity within it.
Does this make any kind of sense? Running one maintenance pod on-prem and one minimally provisioned DR cluster in Azure is significantly less expensive than buying 2X the hardware for every capacity cluster, and therefore easier to convince management, who still wants to compare cost to physical desktops.
If we focused on just BC, would it be better to simplify it as just three clusters' worth of hardware in the same single pod and forgo CPA altogether? My thinking with CPA is that it protects against corruption in the connection servers/UAGs etc. of a single pod, and that it sets us up to more easily add in the DR component later. But I'm new to CPA and don't even know if it can operate in this "maintenance pod" manner given home sites etc.
I'm open to thoughts here. Understand this is an unusual approach.
Yes, a VMware Professional Services engagement would be good. Again, money.
Thanks much for your thoughts!
PS: I'm referring to a "Maintenance Pod" as a dedicated hot spare, but it could just as well be dynamic. Day-to-day we could spread capacity across all clusters until we wanted to do something like upgrade the VxRail code on one of them, or until a vSAN fails or something.