Hello,
My organization is looking at ways to limit the impact of host failures on applications. The most obvious way to do this is to spread applications out across multiple VMs, and to use DRS anti-affinity rules to ensure that vSphere makes a reasonable attempt to keep them all on separate hosts.
We have reasonably large clusters, ranging from 16 to 28 hosts, and these will likely grow at least somewhat once we make the jump to vSphere 6 (currently on 5.5). So having enough hosts to accommodate this should not be too much of a concern.
However, our application count per cluster is well over 100, and I am somewhat concerned about making the DRS ruleset overly complicated, both for performance reasons and because it will increase our operational complexity (we would have to manage the rules and create new ones whenever new apps are created).
Is anyone aware of a hard or soft limit to the number of DRS rules a cluster can have? Does anyone have any formal or informal best practices that they have used in the past for these types of situations?
As far as I know, there is no hard limit on the number of DRS rules in a cluster. The recommendation is to use DRS rules sparingly, so it is better not to use them unless they are absolutely required. As the number of rules increases, DRS has fewer opportunities to balance the cluster. Managing the rules also becomes operationally challenging.
You can think about configuring vSphere HA to minimize the impact of host failures on your applications.
Thanks. HA is enabled across the board, but it doesn't really help when a host fails with all of the components of a tier residing on it (e.g., all web servers or all app servers). Yes, HA does start them back up, but the app is offline in the interim. Having them separated so that the app is merely degraded and not completely down is much more desirable.
I have always operated under the "conventional wisdom" you mentioned earlier: DRS rules should be used sparingly to avoid reducing the options that DRS has in selecting the hosts a VM can run on. Unfortunately, we have had several instances where hosts failed and people naturally are somewhat incredulous to find out that all of the "redundant" VMs in their multi-tier application were running on the same hardware, and did nothing to keep the app online during the hardware failure.
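One way to keep the ruleset small might be to audit placement first and only create anti-affinity rules for the tiers that actually end up co-located. Below is a minimal Python sketch of that check. It assumes you have already exported a list of (VM, group, host) tuples from your inventory by whatever means you prefer; the VM names, group labels, host names, and the `find_colocated_groups` helper are all hypothetical and only for illustration.

```python
from collections import defaultdict


def find_colocated_groups(placements):
    """Return groups whose VMs all sit on a single host.

    placements: iterable of (vm_name, group_name, host_name) tuples.
    A "group" is whatever should survive a host failure together,
    e.g. one tier of one application. Single-VM groups are ignored
    because there is nothing to separate.
    """
    hosts_by_group = defaultdict(set)
    vms_by_group = defaultdict(list)
    for vm, group, host in placements:
        hosts_by_group[group].add(host)
        vms_by_group[group].append(vm)

    return {
        group: sorted(vms)
        for group, vms in vms_by_group.items()
        if len(vms) > 1 and len(hosts_by_group[group]) == 1
    }


# Hypothetical inventory export: (VM, app/tier group, current host).
inventory = [
    ("shop-web-01", "shop/web", "esx01"),
    ("shop-web-02", "shop/web", "esx01"),  # same host as shop-web-01
    ("shop-app-01", "shop/app", "esx02"),
    ("shop-app-02", "shop/app", "esx03"),
    ("wiki-web-01", "wiki/web", "esx02"),  # single VM, nothing to separate
]

at_risk = find_colocated_groups(inventory)
print(at_risk)  # {'shop/web': ['shop-web-01', 'shop-web-02']}
```

Run on a schedule, a report like this would at least flag the "all web servers on one host" case before a failure does, and it gives you a short list of candidates for anti-affinity rules instead of one rule per application. It only checks for the worst case (every VM in the group on one host); a stricter check for any two VMs sharing a host would be a small change.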
