Solved: Re: How to ensure enough resources for site redundancy?

daunce · ‎05-02-2013

Just wondering how to make sure there are enough resources between 2 sites to cover a site failure.

We have 2 active/active sites, each with its own 5.0 vCenter, and SAN replication between the 2 sites for particular VMs. All we need is SRM, but it always gets knocked back in the budget.

As we add VMs to one site, how can we ensure there are sufficient resources at the second site to run its existing VMs plus the ones that may be failed over during a DR scenario?

Something like admission control across vCenters would be great. I guess I could change the percentage of cluster resources reserved as failover capacity every now and then as I add VMs to the other site, or change the 'host failures cluster tolerates' setting to n + a guess of the number of hosts the replicated VMs at the other site use, but I'd like something with a bit more science behind it.

Does SRM solve/help with this problem?

What do others do in this situation?

Thanks.

SteveD03 · ‎05-02-2013

SRM does not help or solve the problem. It's best practice to use the percentage of cluster resources reserved option as opposed to the host failures option. This gives you much more flexibility, and it should be recalculated as you add new hosts and VMs. I usually recalculate it every couple of months. During your calculations, always leave enough vCPU and vRAM for the hypervisor. Most people forget this.

When doing your calculations, you will need enough room for both environments at each site. Consider growth (average growth) plus the hypervisor's needs.

One good suggestion is to get the book VMware vSphere 5 Clustering Technical Deepdive. It's written by Duncan Epping and Frank Denneman. I read it and it's great. I keep it with me as a reference when I design vSphere environments. A technical deepdive it truly is!

depping · ‎05-02-2013

thanks for the nice compliments 🙂

There is indeed no easy way; SRM doesn't help you with that. Well, unless you would, for instance, create resource pools and set a reservation on that pool as the minimum your VMs would get. Then map your production site to that resource pool at your DR site. But it does mean you will need to increase that reservation as the number of production VMs increases. It isn't truly a solution for what you are looking for, I guess... It is still all manual math work...

SteveD03 · ‎05-02-2013

No problem. I look forward to the next book, when and if it comes out. (Version 6? Soon, I'm sure.) I used to use the "host failures cluster tolerates" method with N+1 or N+2. Then, after reading the 4.1 and then the 5.0 books, I realized the disadvantages of doing that. I do the calculations for all my designs and then again when I visit a client site.

I too tend to shy away from resource reservations; I use them sparingly and only on critical VMs. If too many reservations are configured, it can cause issues with overcommitment if and when a failure occurs. I see some admins go out of control on reservations, and I see potential disasters. Talk about memory ballooning!

daunce · ‎05-03-2013

Thanks for the replies Duncan & Steve.

Can I ask how you do the calculations?

SteveD03 · ‎05-03-2013

Well, that can get pretty complicated, but I can give you a general idea of the calculations. If I were to elaborate, I would have to write a book, which Duncan has already done. Duncan, feel free to add.

The first and simplest thing I do is set the restart priority for critical servers (PDC, SQL, vCenter, Exchange, etc.) to high.

I set the lower-priority VMs to medium, and some I set to low or to not restart at all (backup servers, antivirus, etc.); I can start those on my own. I usually set the default restart priority to medium, so any unconfigured VMs will be medium. Doing this allows the most critical servers to start first. If you don't configure it, all the VMs will start at the same time and cause resource contention.
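To make the ordering concrete, here is a small Python sketch of the idea. It is only a model of the priority scheme described above, not how HA is implemented, and the VM names and the "disabled" label are hypothetical.

```python
# Lower number = restarted earlier. "disabled" VMs are left for manual start.
PRIORITY_ORDER = {"high": 0, "medium": 1, "low": 2}


def restart_plan(vm_priorities, default="medium"):
    """Return VM names in the order they would be restarted.

    vm_priorities maps VM name -> "high" / "medium" / "low" / "disabled",
    or None for a VM that was never configured (it gets the default).
    """
    plan = []
    for vm, priority in vm_priorities.items():
        priority = priority or default
        if priority == "disabled":
            continue  # started by hand after the failover
        plan.append((PRIORITY_ORDER[priority], vm))
    # Sort by priority first, then by name for a stable, readable order.
    return [vm for _, vm in sorted(plan)]


if __name__ == "__main__":
    print(restart_plan({
        "SQL": "high",
        "PDC": "high",
        "Web": None,          # unconfigured, falls back to medium
        "AV": "low",
        "Backup": "disabled",
    }))
```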

Here is an example of a calculation: 4 hosts, each with 2 x 8-core CPUs (32 vCPUs) and 386GB RAM. Total cluster size is 128 vCPUs and 1544GB RAM. With DRS enabled, it will do its own calculations and balance out the hosts (if the migration threshold is set aggressively). What you need to do is figure out how much "usable" capacity you need so that a host can fail and DRS can still function properly. You need to plan for N+1 or N+2, depending on the size of your environment and your tolerance for downtime.

So let's say that all of your VMs use a total of 90 vCPUs and 960GB of RAM. You figure that tolerating one host failure is sufficient (N+1). If a host fails, you will have only 96 vCPUs and 1158GB remaining in the cluster, so there are enough resources left over to allow all your VMs to run. With this you can set the percentage of cluster resources reserved to 25%. This leaves you with plenty of RAM and just enough vCPUs. Don't forget that ESXi needs 2GB of RAM and at least 1 vCPU to operate, so always consider that when calculating. If you add another host and the VMs remain the same, you can change the percentage to a lower amount. You should update the percentage often!
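For anyone who wants to script this check, here is a minimal Python sketch using the figures from the example. The function and key names are made up for illustration. It does not deduct the hypervisor's own needs, and the reserved percentage is simply the share of the cluster that the tolerated host failures represent, which assumes identical hosts.

```python
def n_plus_check(hosts, vcpus_per_host, ram_gb_per_host,
                 vm_vcpus, vm_ram_gb, host_failures=1):
    """Capacity left after losing `host_failures` identical hosts."""
    surviving = hosts - host_failures
    remaining_vcpus = surviving * vcpus_per_host
    remaining_ram_gb = surviving * ram_gb_per_host
    return {
        "remaining_vcpus": remaining_vcpus,
        "remaining_ram_gb": remaining_ram_gb,
        "headroom_vcpus": remaining_vcpus - vm_vcpus,
        "headroom_ram_gb": remaining_ram_gb - vm_ram_gb,
        # Share of cluster resources to reserve as failover capacity.
        "reserve_pct": 100 * host_failures / hosts,
        "fits": remaining_vcpus >= vm_vcpus and remaining_ram_gb >= vm_ram_gb,
    }


# 4 hosts x (32 vCPUs, 386GB); VMs use 90 vCPUs and 960GB; N+1.
result = n_plus_check(4, 32, 386, 90, 960)
for key, value in result.items():
    print(f"{key}: {value}")
```

Rerunning it with `hosts=5` shows how the percentage to reserve drops when a host is added and the VMs stay the same.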

One critical thing you need to do is plan for hardware changes. If your VM environment is going to grow, then you need to add hosts. In my example you are left with 198GB of RAM and 6 vCPUs if a host fails. Add a couple more VMs and you are left with very little processing power, though you are still OK on RAM. In this example I would add another host if the organization has anticipated growth. Additionally, if a host fails in this scenario, you may have a problem if a single VM with 4 vCPUs needs to run. Remember, DRS will assist here.

Set anti-affinity rules too. For example, keep the PDC and SDC from running on the same host. If you don't set that rule, DRS may put them both on the same host, and if that host fails you lose DNS, DHCP, AD and the global catalog (all h3ll breaks loose). In short, set anti-affinity rules for similar VMs (Exchange CAS, MBX, clustered SQL servers, etc.).
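As a quick way to sanity-check a placement against rules like these, here is a small Python sketch. It is plain Python over dictionaries you would fill in yourself and does not call any VMware API. The VM, host, and rule names are hypothetical.

```python
from collections import defaultdict


def anti_affinity_violations(placement, rule_groups):
    """Find rule groups with more than one VM on the same host.

    placement   -- dict: VM name -> host name
    rule_groups -- dict: rule name -> list of VMs that must be kept apart
    Returns a list of (rule, host, [VMs sharing that host]).
    """
    violations = []
    for rule, vms in rule_groups.items():
        by_host = defaultdict(list)
        for vm in vms:
            host = placement.get(vm)
            if host is not None:  # skip VMs that are not placed anywhere
                by_host[host].append(vm)
        for host, colocated in by_host.items():
            if len(colocated) > 1:
                violations.append((rule, host, sorted(colocated)))
    return violations


if __name__ == "__main__":
    placement = {"PDC": "esx01", "SDC": "esx01",
                 "SQL-A": "esx02", "SQL-B": "esx03"}
    rules = {"domain-controllers": ["PDC", "SDC"],
             "sql-cluster": ["SQL-A", "SQL-B"]}
    print(anti_affinity_violations(placement, rules))
```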

Again this is a simple example to break down how it works. Hope this helps.

Other tips:

Try not to buy dissimilar hosts (e.g. a host with 1 x 8-core CPU and 128GB RAM in the cluster above)

Configure the host isolation response

Don't overdo reservations and affinity rules

When you're done, go have a beer!
