The environment:
Okay: 3 hosts in a cluster, 96GB RAM total, and each box has 2 quad-core CPUs. There are no resource pools; all VMs have normal resource allocation shares, no reservations, no limits. Two or three VMs have 2 vCPUs assigned, but none have affinity set. Assigned memory ranges from 512MB to 2.00GB.
HA & DRS are enabled on the cluster. HA allows 1 host failover; DRS is fully automated and moderately aggressive. All virtual machines use the HA & DRS defaults, no custom behaviors.
The question:
When each host in the cluster is at about 27% RAM utilization and 6-15% CPU utilization, I cannot power on any more virtual machines because "insufficient resources exist for HA." But if I shut down some of the running virtual machines and incrementally increase their RAM from 512MB to as much as 3GB, VI allows them to boot. I add another vCPU to a machine: it also boots. So if I simply want to power on another machine with only 128MB RAM (for testing) and 1 vCPU, it won't allow it, yet I can double the memory and processors for other machines. What's the hangup?
How is HA calculating this failover? As of this moment, DRS has the VMs spread across the cluster like so:
ESX1: 23 VMs
ESX2: 22 VMs
ESX3: 14 VMs
Thanks!!!
It all has to do with "slots"...see HA Failover Capacity for more info...
Ken Cline
Technical Director, Virtualization
VMware Communities User Moderator
Ken,
Thank you. The formula on that site does not match my environment; it's off by 27 VMs.
I have 3 ESX servers with 1 host failover allowed.
The smallest amount of RAM in an ESX host is 32GB.
The largest configured amount of RAM for a virtual machine is 2GB.
32GB / 2GB = 16 slots per ESX host
3 servers - 1 failover server = 2 servers
16 slots x 2 servers = 32 VMs across this 3-host cluster.
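Spelled out as a quick Python sketch (the numbers are from my cluster; the function just encodes the formula from that page):

```python
def naive_slot_capacity(min_host_ram_gb, max_vm_ram_gb, hosts, failover_hosts):
    """Slot capacity per the published HA formula:
    slots per host = smallest host RAM / largest configured VM RAM,
    multiplied by the hosts remaining after failover."""
    slots_per_host = min_host_ram_gb // max_vm_ram_gb
    return slots_per_host * (hosts - failover_hosts)

# My cluster: 32GB smallest host, 2GB largest VM, 3 hosts, 1 host failover
print(naive_slot_capacity(32, 2, 3, 1))  # 32 -- yet 59 VMs are running
```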
Currently, 59 virtual machines are running on the cluster. HA is set to not power on virtual machines if they violate availability constraints. No static reservations are made on any virtual machines.
Any other thoughts???
Thanks.
Are you using Resource pools?
3 Resource Pools are defined by name, but none are configured for reservations and none have machines in them. This cluster mainly serves XP virtual machines for programmers, so there wasn't much of a need to set reservations for any particular VM.
Here is a snapshot of resource allocation on the cluster:
CPU Reservation:46375 MHz
CPU Reservation Used: 15360 MHz
CPU Unreserved: 31015 MHz
Mem Reservation: 89696 MB
Mem Res. Used: 18814 MB
Mem Unreserved: 70882 MB
With all of that memory and CPU time unused, and no explicit reservations, one would think many more machines could be powered on?
I would almost suggest disabling HA/DRS, creating a new cluster, moving the hosts into the new cluster, re-enabling HA/DRS, and seeing what happens.
I am not so much hurting for more capacity at the moment as seeking a formula to justify to the investor the # of machines that I can fit on the cluster at any given time. They are adding new developers to the team soon, and we are adding 4 more hosts. The estimated # of VMs I can provision will affect their planning for user workstations. When I tell them "I don't know, I just keep adding until VI3 tells me I can't anymore..." I don't sound very informed. Does anyone know if these types of calculations are covered in the official VMware training classes? I am supposed to attend the FastTrack in May.
Anyone else have an up-to-date formula for calculating?
Thanks everyone.
I think the docs cover the max # of vCPUs per host (I believe 128 vCPUs per host, but don't quote me).
RAM reservation is now equal to 1/2 of the amount of RAM you allocate to your VM (it used to be a calculation but isn't anymore). I believe this can also be adjusted.
Typically, if you need failover in a 2-host environment, you'd not exceed 50% of a single host's capacity; with 3 hosts it was something like 66% capacity per host, etc.
If you have 32GB of RAM, I'd expect you could put at least 40-50 1GB machines on a host if they were underutilized. The typical recommendation is 4-5 VMs per CPU (and each core typically only equals about 75% of a full CPU), so somewhere around the 30-35 VM range would give you the expected capacity of normal workload machines without overcommitting memory (which in a development world might not be a good thing anyhow).
So, on those boxes I would typically expect to get close to 100VM's if not more...
Okay, I'm thinking what you're thinking, in a nutshell. I'd really like to get detailed info on what's going on instead of breaking down the cluster. There's gotta be something, somewhere, that allows me to assess this problem in detail and troubleshoot the existing cluster. I'm thinking it's just the algorithm doing its thing, and unfortunately I don't know what the heck that algorithm is, or why it isn't made public. Someone had to write it, after all.
I'd suggest opening a support case with VMware and see what they have to say about this...
Ken Cline
Technical Director, Virtualization
VMware Communities User Moderator
Proden, did you ever get anywhere with this? The vmwolf site is now down and I'm having the same problem in my clusters.
2 hosts, 2 dual-core CPUs each, 32GB per host, 30 VMs spread across both; half have 1GB RAM assigned, none with any more.
Memory unreserved: 57472MB
CPU unreserved: 18000MHz
I go to power on VM #31 with 1 vCPU and 512MB RAM = insufficient resource error. WTF??
Hi Weestro. I did make some progress gathering information, but haven't resolved the issue. I opened a case with tech support. Their first suggestion was to move to 3.5 Update 1, as they noted the algorithm for HA calculations changed. I have not applied Update 1 because I see people are having problems with it; I'm waiting for most issues to come to the surface. In addition, the support rep did his best to relay the manner in which HA is calculated:
-
"Here are some more details on the HA slot size calculations that we discussed earlier on the phone.
For ESX 3.5.0:
A minimum Memory reservation of 256MB is assumed.
A custom minimum slot size value can be set. To do this, select your cluster, click VMware HA, and click the Advanced Options button.
Add a value for "das.vmMemoryMinMB".
A minimum CPU reservation of 256MHz is assumed.
A custom minimum slot size value can be set. To do this, select your cluster, click VMware HA, and click the Advanced Options button.
Add a value for "das.vmCpuMinMHz".
In ESX 3.5.0, the maximum number of vCPUs is considered, and this is used as a multiplier for the maximum CPU reservation. This multiplier is removed in ESX 3.5.0 Update 1.
The CPU slot size is the biggest CPU reservation or 256MHz, whichever is greater, multiplied by the biggest number of vCPUs. The VM with the biggest number of vCPUs can be a different VM from the VM with the largest reservation.
The memory slot size is the biggest memory reservation or 256MB, whichever is greater, added to the biggest value for overhead memory for any VM.
For each ESX host, the total memory available to VMs (i.e. minus service console memory) is divided by the slot size to get a number of usable slots.
A similar division is done for CPU slots.
HA then assumes worst case scenarios, i.e. that a failover will take down the ESX with the largest capacity. So the ESX with the highest number of CPU slots is removed from the total number of available CPU slots, and the ESX with the highest number of memory slots is removed from the total number of available memory slots.
The remaining slot numbers determine the maximum number of VMs that can be started. The lower number determines the number of VMs.
In your case, the three ESX servers are identical.
So, find the VM with the most vCPUs, then the VM with the highest overhead RAM, and as you have reservations set to 0 everywhere use 256 as the default reservation values (unless you customized them).
Multiply 256MHz by the max number of vCPUs - that's your CPU slot size.
Add 256MB to the highest overhead memory - that's your memory slot size.
This should help you figure out where the limitations are.
Please note that this is based on my current understanding of HA following discussions with a senior engineer. I'm confident that it's fairly accurate but it may not be 100%, as the product is constantly evolving."
-
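To make sure I follow, here's my attempt to turn his description into a rough Python sketch. The 256 defaults come straight from his note (das.vmCpuMinMHz / das.vmMemoryMinMB); the 100MB overhead in the example is just a placeholder I picked, not a real figure:

```python
def ha_slot_sizes(max_cpu_res_mhz, max_vcpus, max_mem_res_mb, max_overhead_mb,
                  cpu_min_mhz=256, mem_min_mb=256):
    """Slot sizes per the support rep's ESX 3.5.0 description:
    CPU slot = max(largest CPU reservation, minimum) * most vCPUs on any VM;
    memory slot = max(largest memory reservation, minimum) + largest overhead."""
    cpu_slot = max(max_cpu_res_mhz, cpu_min_mhz) * max_vcpus
    mem_slot = max(max_mem_res_mb, mem_min_mb) + max_overhead_mb
    return cpu_slot, mem_slot

def cluster_slots(host_cpu_mhz, host_mem_mb, cpu_slot, mem_slot):
    """Worst-case failover: drop the host with the most slots from each
    total, then the lower of the two remaining totals wins."""
    cpu_slots = sorted(mhz // cpu_slot for mhz in host_cpu_mhz)
    mem_slots = sorted(mb // mem_slot for mb in host_mem_mb)
    return min(sum(cpu_slots[:-1]), sum(mem_slots[:-1]))

# Example: no reservations, a 2-vCPU VM, 100MB overhead on the biggest VM
cpu_slot, mem_slot = ha_slot_sizes(0, 2, 0, 100)
print(cpu_slot, mem_slot)  # 512 356
```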
I have not been able to get workable results from this information, but I would love to know if it worked for you, my math could be wrong. I have no reservations set, so I'm still thinking I should have many more VM's. Please let me know if any of this helped you, my next step is Update 1 but I'm being overly cautious.
Thanks.
Well, the cluster I'm testing on is Update 1 and I was hit with the same restrictions. I too found the das.* advanced settings and set a custom value of 150 for both CPU and RAM. This allowed me to re-enable strict admission checking with no flags. I didn't mention it before, but I only have 19 VMs powered on in this cluster. My largest VM is 2 vCPU / 1GB RAM. I'll be curious to see how many more VMs this will allow me to power on here. I really wish VMW could give us a definitive response on the calculation used (like you said). It makes architecture and capacity planning very difficult, plus wastes a tremendous amount of hardware given the current scheme.
Check my math here but is this what support is telling you? (assuming no das customizations)
CPU slot size = 256MHz (default) * max number of assigned vCPUs (i.e. the highest # of vCPUs assigned to a single VM)
RAM slot size = largest assigned memory reservation or 256MB (whichever is larger) + the largest assigned RAM value of any VM
Usable memory slots = total RAM available to VMs in the cluster (minus service console) / slot size
Usable CPU slots = total Mhz available to VMs in the cluster / slot size
Supportable VMs = the lower of the two slot numbers
So for my environment with my 150 das customizations I get:
CPU slot size = 300 (150 * 2)
RAM slot size = 1174 (150 + 1024)
Usable mem slots = 55 ((65536 - 512) / 1174)
Usable CPU slots = 60 (18000 / 300)
Total VMs = 55
That look right??
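In script form, same numbers as above (so anyone can check my math):

```python
# My cluster with das.vmCpuMinMHz = das.vmMemoryMinMB = 150
cpu_slot = 150 * 2         # 300 MHz: 150 minimum * largest vCPU count (2)
mem_slot = 150 + 1024      # 1174 MB: 150 minimum + largest assigned RAM (1GB)

mem_slots = (65536 - 512) // mem_slot  # total RAM minus service console
cpu_slots = 18000 // cpu_slot          # unreserved CPU MHz
total_vms = min(mem_slots, cpu_slots)  # lower number wins
print(mem_slots, cpu_slots, total_vms)  # 55 60 55
```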
I changed my values to 128 and voila... machines started powering on. However, how far can you push this before you harm the ability to fail over? If you are like me and are not currently reserving resources for any system, could you just keep decreasing these values and only suffer from a performance standpoint when a host fails?
That's a very good question, how low is too low? 256 to 150 for me amounted to 20 additional VMs that I can power on. I've posed this question to our VMW account team to see what they can dig up. Interesting that changing 150 to 128 nets 1 additional VM...
