Best way to spec hardware for bulletproof availability - intellectual exercise

Hey all,

The new performance gains are very impressive. I couldn't believe my eyes when I saw the limit of 320 powered-on VMs per ESX/ESXi host. Now that you can have 64 cores and 1TB of RAM in a server, the consolidation possibilities are insane. I just wonder what the best way is to spec a server so that you don't have a single point of failure.

I had a Purple Screen of Death because of a bad DIMM last month. DRS/HA clusters are nice and all, but it was still an outage for the 25 VMs on the physical host. With FT's 1 vCPU limitation, I don't see a perfect solution yet. So the possibility of a 320:1 consolidation ratio on a single vSphere host is frightening.

How do you guys spec your hardware? Anyone using RAM in a RAID config yet?

Here's an example of a dream config with a Sun x4450:

4 x 6-core Intel Xeon x7460 CPUs (2.66GHz x 24 cores)

256GB RAM (32 DIMMs x 8GB) in RAID 1 (128GB usable)

2 x 146GB DP SAS in RAID 1

Dual PSUs

4 onboard GbE plus 1 quad port GbE NIC (redundant NICs for each vSwitch across redundant physical switches)

Two dual QLogic FC cards connected to the SAN with round-robin multipathing... any other ideas to prevent downtime?

I would expect to need at least 3 of these servers configured with DRS/HA and using FT where applicable.

I estimate I'd get about 100 to 150 average Windows Server VMs per host in this configuration. If VMs were light VDI workloads then possibly 200 to 250 per host.
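As a rough sanity check on those estimates, here is a minimal sketch of the memory budget. It uses only the figures from the config above (32 x 8GB DIMMs, mirrored) and deliberately ignores hypervisor overhead, page sharing, and memory overcommit, all of which would shift the real numbers. The function names are made up for illustration.

```python
# Back-of-envelope memory budget for the dream config.
# Assumption: mirrored ("RAID 1") RAM halves the installed capacity.
# Assumption: no hypervisor overhead, page sharing, or overcommit is modeled.

DIMMS = 32
DIMM_GB = 8


def usable_ram_gb(dimms, dimm_gb, mirrored=True):
    """Return usable RAM in GB, halved when memory mirroring is on."""
    installed = dimms * dimm_gb
    return installed / 2 if mirrored else installed


def ram_per_vm_gb(usable_gb, vm_count):
    """Return physical RAM available per VM if it is split evenly."""
    return usable_gb / vm_count


usable = usable_ram_gb(DIMMS, DIMM_GB)
for vms in (100, 150, 200, 250):
    print(f"{vms} VMs -> {ram_per_vm_gb(usable, vms):.2f} GB physical RAM per VM")
```

Without overcommit, 100 to 150 VMs on 128GB of usable RAM works out to roughly 0.85 to 1.28GB of physical memory per VM, so the estimate leans on small VMs or on memory sharing.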

If the goal is to achieve the highest consolidation, with a balance of performance and high availability, does anyone have any other suggestions or ideas?


3 Replies

Hi Jon,

You are basically looking at the scale-up approach rather than scale-out. No one can claim that one approach is better than the other, because the answer will always be: it depends!

It's all about your requirements and priorities.

In my case, I adopted scale-up; I have clusters similar to your dream config. At first they were 2 nodes, and then I added a third when I realized that flexible maintenance is almost impossible with only two. In terms of consolidation it's huge: you can keep adding VMs as much as you want, really, as long as you have the memory. But when it comes to maintenance and availability it's going to be a pain in the neck. Imagine updating the ESX host with the latest patches: you're going to vMotion, let's say, 100 VMs from one node to the other two, do your patching, and finally vMotion them back. Then you start on the next node, and so forth. And when you have a server failure it's going to be a severe situation: HA/DRS will need to power on 100 VMs and balance them across your other two nodes. Downtime for 100 VMs would be unpleasant, especially during work/peak hours, not to mention that they might not come back as healthy as they were, due to the unexpected power-off.
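To put numbers on the failure case, here is a small sketch of how load lands on the surviving nodes. It assumes identical hosts and an even spread of VMs, which is an idealization of what HA/DRS would actually do; the function names are hypothetical.

```python
import math


def vms_per_survivor(hosts, vms_per_host, failed=1):
    """VMs each surviving host carries after `failed` hosts go down.

    Assumes identical hosts and a perfectly even redistribution.
    """
    survivors = hosts - failed
    if survivors < 1:
        raise ValueError("no surviving hosts")
    total = hosts * vms_per_host
    return math.ceil(total / survivors)


def max_safe_utilization(hosts, failed=1):
    """Highest steady-state load fraction per host that still leaves
    room to absorb the VMs of `failed` hosts."""
    return (hosts - failed) / hosts


print(vms_per_survivor(3, 100))            # each of 2 survivors carries 150
print(vms_per_survivor(2, 100))            # the lone survivor carries 200
print(round(max_safe_utilization(3), 2))   # 0.67
```

With three nodes at 100 VMs each, every node has to be sized for 150 VMs, which means running at no more than about two thirds of capacity day to day.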

On the other hand, when you go with the scale-out approach, typically with blades, you have much more breathing room in these situations, because normally you wouldn't have more than 10 to 15 VMs per blade. At the same time, you are limited in I/O and expansion. You can overcome that by creating different sets of clusters, let's say 10 blades for internal production servers, 5 blades for your DMZ cluster, and so forth. This also gives you better security since you are not sharing all your VLANs on one box, although they could be physically separated on pNICs in the first approach.

Note also that refreshing your hardware will be very, very hard in the first scenario, while it's very easy and cost-effective in the second one.

I would say that in most cases blades will do fine, and in fact my new hardware coming for vSphere will be both: 4 high-end servers, plus two blade chassis for scaling out gradually.


Hany Michael

HyperViZor.com | Virtualization, and everything around it


Thanks for the reply,

You make a good point about maintenance scenarios. Beyond a certain point it becomes almost impractical to VMotion all the VMs off and back. Even if you could practically fit 320 VMs on each host, you wouldn't want to deal with evacuating the hardware for maintenance. At about 10 seconds per VM to VMotion, it would take almost an hour just to clear the host, and I don't even want to think about a single host failure with 320 VMs!
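The evacuation arithmetic can be sketched as below. The 10 seconds per VM is the rough figure from this post; the `concurrent` parameter is a hypothetical knob for running several migrations at once and is not a measured value.

```python
import math


def evacuation_minutes(vm_count, seconds_per_vm=10, concurrent=1):
    """Estimate minutes needed to migrate every VM off a host.

    Assumes each migration takes the same time and that `concurrent`
    migrations run in lockstep batches (a simplification).
    """
    batches = math.ceil(vm_count / concurrent)
    return batches * seconds_per_vm / 60


for vms in (25, 100, 320):
    one_way = evacuation_minutes(vms)
    # Maintenance means moving VMs off and then back again.
    print(f"{vms} VMs: {one_way:.1f} min to clear, {2 * one_way:.1f} min round trip")
```

One at a time, 320 VMs take about 53 minutes to clear, and that doubles once the VMs are moved back after patching.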

I guess what I realized from this exercise is that for all practical purposes VMware has hit the consolidation ratio ceiling. I regularly get 30:1 consolidation and I've read that some VDI setups have reached 100:1, but anything higher than that has to be for very unusual circumstances. So it doesn't make sense anymore to get the most powerful server with all the RAM you can cram into it. Intel-based servers have gotten so huge. Just 6 years ago we bought an IBM pSeries server with 12 CPUs and 64GB of RAM for 1.5M; now that can be had for about 25K.

So I think the ideal start for a mid-sized environment would be a blade chassis with 3 blade servers. Configure each blade with a single socket and 64GB of RAM to make the most of vSphere licensing. That should give you up to 100 VMs across the 3 servers. Then scale out to add more capacity. Even the largest Windows or Linux VM will probably top out between 16 and 32GB, so why spend more on bigger boxes?



How do you guys spec your hardware? Anyone using RAM in a RAID config yet?

Yes, I do this using HP hardware. RAID RAM is stock on many HP systems: ML350/ML370s and DL78x. I also think it is part of some of the DL58x configurations. This gives better uptime.

If the goal is to achieve the highest consolidation, with a balance of performance and high-availability, anyone have any other suggestions or ideas?

First I would go through all the failure scenarios within the hardware to determine how to increase uptime. For example:

DIMM bad (you covered that one)

PCI bus bad - do multiple buses exist within the host? Will a bad bus bring down your entire virtual network? How are I/O devices balanced across buses and cards?

Hard disk issues (RAID once more to help) - how many failed disks can you handle? Count on 2 going bad in short order. Can you recover from this?

Power supply issues? Other Power issues.

Network/FC cable, GBIC, and port issues.

etc. etc.

Once you make the list you will know what you need to do to keep the system running and what extra hardware is required, etc.
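One way to keep such a list honest is to write it down as data and flag the gaps. The sketch below maps the failure modes named above to the mitigations listed in the original dream config; `None` marks anything that config does not explicitly address. The structure and names are illustrative only.

```python
# Failure modes from the list above, mapped to the mitigation (if any)
# stated in the original dream config. None means "not addressed there".
MITIGATIONS = {
    "bad DIMM": "mirrored (RAID 1) RAM",
    "PCI bus failure": None,  # the config does not say how cards map to buses
    "hard disk failure": "2 x 146GB SAS in RAID 1",  # survives one disk, not two
    "power supply failure": "dual PSUs",
    "network cable/port failure": "redundant NICs per vSwitch across redundant switches",
    "FC cable/GBIC/port failure": "two dual QLogic FC cards with round-robin multipathing",
}


def uncovered(mitigations):
    """Return the failure modes that have no mitigation recorded."""
    return sorted(name for name, fix in mitigations.items() if fix is None)


print(uncovered(MITIGATIONS))
```

Here the only open item is the PCI bus question, since the posted config does not say whether the NICs and FC cards sit on separate buses.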

Best regards, Edward L. Haletky VMware Communities User Moderator, VMware vExpert 2009, DABCC Analyst
Now Available on Rough-Cuts: 'VMware vSphere(TM) and Virtual Infrastructure Security: Securing ESX and the Virtual Environment'
Also available 'VMWare ESX Server in the Enterprise'
SearchVMware Pro (http://www.astroarch.com/wiki/index.php/Blog_Roll) | Blue Gears | Top Virtualization Security Links | Virtualization Security Round Table Podcast

Edward L. Haletky
vExpert XII: 2009-2020,
VMTN Community Moderator
vSphere Upgrade Saga: https://www.astroarch.com/blogs
GitHub Repo: https://github.com/Texiwill