I'm currently planning a VMware-based infrastructure, which has been making good progress, thanks in large part to members of this community. I have come to designing the HA and want to make sure I understand the limitations as far as simultaneous host failures are concerned.
Imagine I have the following set up:
Two blade chassis each with 8 blades in them for a total of 16 hosts
All 16 hosts are in one cluster, with two primary nodes residing in one blade chassis and three in the other
All requirements for HA are met on each blade (they have access to the same shared storage, the virtual network configuration matches, and DNS is configured correctly)
Resource limitations don't prevent all VMs from running on 8 hosts
Now imagine one chassis and its 8 hosts fail simultaneously. Am I right in thinking that only the VMs from four of those failed hosts will successfully be restarted by HA on the remaining 8 hosts in the second chassis? I am basing this assumption on the fact that, as there can only be five primary nodes in a cluster, no more than four can fail. Or will it actually be even fewer than that, because I may only have two primary nodes on the second chassis?
If I alter my setup to consist of two clusters of 8 hosts each (4 in each chassis), am I right in thinking that I can now deal with the loss of a complete chassis, as each cluster will only have lost 4 hosts? Or will I still have problems, as again I may only have two primary nodes on the working chassis?
If the 8-host cluster isn't resilient, then it would seem that the most hosts a resilient two-chassis setup can have per cluster is 4, which is rather limiting.
I will of course test the failure of a whole chassis before putting anything into production on the environment, whatever design I go for. If I disable one blade chassis, would all 5 primary nodes remain on the second chassis after the first one returns? If so, then I would obviously need to manually move them back after any chassis loss (testing or otherwise).
Will, you are all good. As long as there is one primary node alive somewhere it will restart everything. So as long as all the 5 primaries are not in the same chassis you are cool.
Review your design to ensure the primaries don't all end up in the same chassis; see my blog post, as well as the links from it over to Duncan's details.
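The rule above (classic VMware HA: at most five primary nodes per cluster, and VMs are only restarted if at least one primary survives) can be sketched as a quick check. This is a minimal illustration, not anything VMware ships; the host numbering and chassis layout are assumptions for the example.

```python
# Sketch of the classic VMware HA restart rule: failover works only if
# at least one of the (up to five) primary nodes survives the failure.
# Host numbers and chassis groupings below are illustrative assumptions.

def ha_can_restart(primary_hosts, failed_hosts):
    """HA can restart the failed VMs iff >= 1 primary node is still up."""
    return bool(set(primary_hosts) - set(failed_hosts))

# 16-host cluster: hosts 0-7 in chassis A, hosts 8-15 in chassis B.
chassis_a = set(range(0, 8))
chassis_b = set(range(8, 16))

# Worst case: all 5 primaries happen to sit in chassis A, and A fails.
print(ha_can_restart({0, 1, 2, 3, 4}, chassis_a))   # False - no primary left

# Will's layout: 2 primaries in chassis A, 3 in chassis B; A fails.
print(ha_can_restart({0, 1, 8, 9, 10}, chassis_a))  # True - 3 primaries survive
```

The point of the check is that resilience depends on primary-node placement, not on how many hosts fail: losing 8 of 16 hosts is fine as long as the 5 primaries were split across the two chassis.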
Rodos
Consider the use of the helpful or correct buttons to award points. Blog: http://rodos.haywood.org/
Personally, I would never go above 8 hosts in a cluster. I had issues early on when I had as many as 14 hosts in a cluster. That's a lot of reservations....
Interesting - we currently have 32 host clusters with no performance issues.
--Matt
We didn't have performance issues (we did see some SAN issues, but minimal); our problems mainly revolved around VirtualCenter. HA and DRS became troublesome for the most part, and hosts would intermittently just drop off service. Admittedly, this was a few revisions back, but we noticed that when we dropped our number below 14, the issues were eliminated. Perhaps it's no longer an issue with 2.5... but it was a painful lesson.
Fantastic. I just misinterpreted the description. It basically said that as there are only 5 primary nodes you can only have four failures, but I'm now taking this to mean four failures of primary nodes, not of any hosts. So it seems my proposed two-chassis solution does provide resiliency against a chassis failure. Your blog posting is also very useful.
You are most welcome. Let us know how you get on with your project.
Rodos
Consider the use of the helpful or correct buttons to award points. Blog: http://rodos.haywood.org/