VMware Cloud Community
dtracey
Expert

Insufficient resources to satisfy HA failover level on cluster in datacenter

Hi Guys,

We have just migrated our 4 ESX clusters to a new VirtualCenter (2.5 Base Release -> 2.5U4), and one of the clusters is showing the "Insufficient resources to satisfy HA..." error message.

Firstly - is there any way to narrow down whether a particular host (or VC itself) is causing the issue, or whether it is a genuine capacity problem?

I have followed a number of VMTN articles to rule out the usual bugs:

  • Ensuring all host FQDN and domain names are in lowercase within VC - Host | Configuration | DNS and Routing

  • Ensuring all lowercase within /etc/hosts, /etc/sysconfig/network, /etc/vmware/esx.conf & /proc/sys/kernel/hostname on all 4 hosts.

  • Disabled / re-enabled HA on the cluster

  • Looked in the VC vpxd*.log for the period when HA was being configured - but I'm not sure what to look for. There are a few warnings such as:

    • 'Locale' 5940 warning] FormatField: Optional unset (vim.event.VmUuidAssignedEvent.vm)

    • 'VpxProfiler' 2752 warning] InvtHostSyncLRO::StartWork took 2203 ms

    • 'Locale' 6032 warning] FormatField: Optional unset (vim.event.HostDasErrorEvent.message)

    • 'PropertyJournal' 5380 warning] ERProviderImpl<BaseT>::_GetChanges: Aggregate version Overflow host-600 name

    • ...etc

There are 4 hosts in the cluster - and capacity looks OK. I've looked at vmwarewolf's capacity doc (http://www.vmwarewolf.com/ha-failover-capacity/), but am not sure I totally understand it.

Our 4 hosts look something like:

1 - 32 GB RAM (22 in use) / 17600 MHz CPU (5548 in use)

2 - 64 GB RAM (31 in use) / 17600 MHz CPU (7661 in use)

3 - 32 GB RAM (18 in use) / 17600 MHz CPU (4210 in use)

4 - 64 GB RAM (20 in use) / 17600 MHz CPU (7440 in use)
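For a rough sanity check against the figures above, the VC 2.5-era slot-based admission control can be sketched as follows. The slot sizes and the powered-on VM count are assumptions for illustration (a 2 GB memory reservation versus small defaults), not values taken from the cluster:

```python
# Sketch of HA slot-based admission control (VC 2.5 era).
# Slot sizes and the VM count below are illustrative assumptions.

hosts = [
    {"mem_mb": 32 * 1024, "cpu_mhz": 17600},  # host 1
    {"mem_mb": 64 * 1024, "cpu_mhz": 17600},  # host 2
    {"mem_mb": 32 * 1024, "cpu_mhz": 17600},  # host 3
    {"mem_mb": 64 * 1024, "cpu_mhz": 17600},  # host 4
]

def slots_per_host(host, slot_mem_mb, slot_cpu_mhz):
    # A host holds as many slots as fit in BOTH its memory and its CPU.
    return min(host["mem_mb"] // slot_mem_mb, host["cpu_mhz"] // slot_cpu_mhz)

def failover_capacity(hosts, slot_mem_mb, slot_cpu_mhz, powered_on_vms):
    # HA assumes the worst case: the hosts with the most slots fail first.
    counts = sorted(slots_per_host(h, slot_mem_mb, slot_cpu_mhz) for h in hosts)
    tolerated = 0
    while len(counts) > 1:
        counts.pop()  # lose the biggest remaining host
        if sum(counts) >= powered_on_vms:
            tolerated += 1
        else:
            break
    return tolerated

# With a 2 GB reservation inflating the slot, capacity drops;
# with small default slot sizes it recovers:
big_slot = failover_capacity(hosts, 2048, 256, powered_on_vms=40)
small_slot = failover_capacity(hosts, 256, 256, powered_on_vms=40)
```

With these assumed numbers the 2 GB slot tolerates only 1 host failure while the small slot tolerates 3 - the same pattern as a large per-VM reservation eating the cluster's failover capacity.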

How do I work out whether there really are insufficient resources, or whether it's a bug that will be worked around by creating a new cluster and migrating the hosts over in maintenance mode?

Thanks for your time guys!

Dan

5 Replies
hicksj
Virtuoso

It's likely not related to your true resource utilization... there is probably some very minor misconfiguration somewhere.

We had problems with HA not coming online after applying U4. It turned out that one of our VMotion interfaces had the incorrect subnet mask (a Class C instead of a Class B). That one inconsistent element caused HA on that cluster to choke. (Note: VMotion still worked, as the IPs still aligned properly anyway.) Check all of your IP settings very closely.
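To illustrate the point with made-up addresses (a minimal sketch, not the actual environment): the two interfaces can reach each other fine, yet their configured networks disagree:

```python
import ipaddress

# Two hypothetical VMotion interfaces (addresses are made up): both
# reachable in 10.1.2.x, but one host got a /24 (Class C) mask instead
# of the intended /16 (Class B).
correct = ipaddress.ip_interface("10.1.2.11/16")
wrong = ipaddress.ip_interface("10.1.2.12/24")

# Traffic still flows because the addresses line up, yet the configured
# networks disagree - exactly the kind of inconsistency HA chokes on.
consistent = correct.network == wrong.network  # False
```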

Edit: BTW, there was never anything specific in the HA logs to indicate where the problem was. You'd think the consistency checks would log their exact reason for failing.

admin
Immortal
Accepted Solution

Do you have any VMs with a high reservation (memory or CPU)? If so, this can make the HA admission-control algorithm excessively conservative. Check out the vSphere Availability Guide for details on the "slot" algorithm - http://www.vmware.com/pdf/vsphere4/r40/vsp_40_availability.pdf, especially pages 13-16. Some of the options mentioned there are only available in vSphere 4, but the basic mechanism also applies to VC 2.5.

VC 2.5 doesn't expose the slot details in the UI the way vSphere 4 does, but you can get some information from the vpxd logs. Try powering on a VM (you should get an error about insufficient failover capacity), then look in the vpxd logs for "Slot info" - the following lines should contain the details. Upload those to this thread if you need any help deciphering them.

Elisha

T3Steve
Enthusiast

The first place I would look is the cluster's HA settings. If you have a cluster with 4 hosts and failover capacity set to 3, your cluster should be able to handle 3 host failures and run everything on the single remaining node.

The failover capacity is determined by the use of "slots". A slot is (or was) loosely defined by the largest VM in the cluster (memory- and CPU-weighted), also taking into account the memory and CPU reservations of VMs.

If you had 100 VMs, all of them at 512 MB of RAM and 1 vCPU, the slot would match that VM size. If you had 101 VMs and that 101st VM had 2 GB and 4 vCPUs, the HA calculation would treat all 101 VMs as 2 GB / 4 vCPU VMs. Not logical, but loosely speaking that's how it works. I found that U4 was a bit more sensible in its calculations.
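The 100-vs-101 example above can be written out as a small sketch of that loose max-reservation rule (VM sizes from the example; the 32 GB host size is an assumption):

```python
# Slot sizing per the loose definition: the largest VM (or reservation)
# in the cluster sets the slot size for every VM.
vms = [{"mem_mb": 512, "vcpus": 1}] * 100   # 100 small VMs
vms.append({"mem_mb": 2048, "vcpus": 4})    # the one larger VM

slot_mem_mb = max(vm["mem_mb"] for vm in vms)   # 2048 - set by one VM
slot_vcpus = max(vm["vcpus"] for vm in vms)     # 4

# On an assumed 32 GB host, that one big VM shrinks the slot count:
host_mem_mb = 32 * 1024
slots_without_big_vm = host_mem_mb // 512        # 64
slots_with_big_vm = host_mem_mb // slot_mem_mb   # 16
```

One oversized reservation quarters the host's slot count here, which is why trimming reservations gives capacity back.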

I'd check for reservations on VMs, determine whether they are really needed, do your own HA calculation, and put an ESX host or two into maintenance mode during business hours. (They never fail off-peak.) Get a first-hand view of whether the cluster can support the load. Use real-world data and be happy with it. I have 20+ clusters, and some show the same HA warning you have. I have tested them all and know for a fact that we can support the VMs at capacity if a failure or multiple failures should occur.

VCP3|VCP4|VSP|VTSP
dtracey
Expert

Hi Guys,

Thanks for all the replies - I'm in the UK, so apologies for the time difference in answering!

This morning I checked (and double-checked) the IPs and subnet masks of the VMkernel ports - they are all OK. We are using standard Class C masks. One thing I noticed is that we're not using a VMkernel default gateway - but from reading other posts I don't think we need one. All hosts in this cluster are on the same subnet, etc. Plus, my other 3 clusters don't use VMkernel default gateways and are fine.

I also double-checked the HA settings to ensure there was no schoolboy error in the number of host failures tolerated - this was definitely set to 1.

There were a couple of machines with a memory reservation of 2GB, so I have set these to zero for now (need to undertake another exercise to prioritise and create resource pools, reservations, limits and shares - any good documentation on this will be happily received!).

When I set HA Admission Control back to 'Prevent...' from 'Allow...', I now appear to have a 2-host failover capacity!

I'm assuming it was setting the CPU and memory reservations back to zero (these were badly configured anyway) that fixed it.

I think I still need to get my head around the whole 'slot size' thing. I have it in my head as a metric that HA works out based on the heaviest VM (say one with 4 GB RAM and lots of CPU reserved), and then says "right - I can only power on 8 of you as I only have 32 GB of physical RAM, therefore my slot size is 8..." Am I close?! :)

Thanks guys,

Dan

T3Steve
Enthusiast

You're getting there with the concepts!

If you enable verbose logging on the VC server and restart the VC service, you can search through the vpxd logs to find the slot info.

Here's what it will look like.

Slot info:

Slot cpuPerVcpu=256, cpu=256, numVcpus=4, memory=462

Total slots=7, Total VMs=4

Total hosts=1, Total good hosts=1

Slots per host: 7
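As a hypothetical aid to deciphering that block (field names are taken from the excerpt above, not from any documented VMware log format), the numbers can be pulled out mechanically:

```python
import re

# The "Slot info" block quoted above, as found in a verbose vpxd log.
log = """Slot info:
Slot cpuPerVcpu=256, cpu=256, numVcpus=4, memory=462
Total slots=7, Total VMs=4
Total hosts=1, Total good hosts=1
Slots per host: 7"""

# Pull out every "name=value" / "name: value" pair with a numeric value.
fields = {k.strip(): int(v)
          for k, v in re.findall(r"([A-Za-z][\w ]*?)\s*[=:]\s*(\d+)", log)}

# Headroom before admission control complains: total slots minus the VMs
# already powered on (minus slots reserved for failover - here there is
# only one host, so there is no failover headroom to subtract).
free_slots = fields["Total slots"] - fields["Total VMs"]
```

In this excerpt free_slots comes out to 3: 7 slots total, 4 VMs powered on.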

What I do is sort through my largest VMs and run reports on memory/CPU usage. I then reduce them in size if they are underutilized and gain back slots for the HA calculations.

Vizioncore's vFoglight will show you VMs with x% of memory filled with zeros, indicating VMs that are unnecessarily oversubscribing memory.

Good luck,

Steve

VCP3|VCP4|VSP|VTSP