mrnick1234567
Enthusiast

Alarm Insufficient Failover Resources - Troubleshooting

Hello,

We have a cluster of 6 Dell R610s running ESXi 5. Twice now I've come in to see an email from vCenter saying: "Insufficient resources to satisfy vSphere HA failover level"

CPU usage is very low on all hosts, and memory use is 60-70% apart from one host at 80%. Could memory use be spiking from time to time and triggering this message?

Where could I look to get more details on what caused this? The alert itself is very little help. I know the time from the event log - are there any particular logs on the hosts that might help?

Thanks

Nick

5 Replies
Boloo
Enthusiast

Hi

Here is a troubleshooting guide to see what is happening. If you haven't tried it already, disable HA and re-enable it - that may solve your problem.

Troubleshooting VMware High Availability (HA) in vSphere

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=100159...

Regards,

Infrastructure Technical Leader at Tui Destination Services
depping
Leadership

mrnick1234567 wrote:

Hello,

We have a cluster of 6 Dell R610s running ESXi 5. Twice now I've come in to see an email from vCenter saying: "Insufficient resources to satisfy vSphere HA failover level"

CPU usage is very low on all hosts, and memory use is 60-70% apart from one host at 80%. Could memory use be spiking from time to time and triggering this message?

Where could I look to get more details on what caused this? The alert itself is very little help. I know the time from the event log - are there any particular logs on the hosts that might help?

Thanks

Nick

This has nothing to do with CPU or memory "usage" from a vCenter perspective. You have "Admission Control" enabled in your environment, which is part of vSphere HA. Admission Control ensures virtual machines can be restarted by vSphere HA by setting aside "guaranteed resources". Guaranteed resources in this case are:

VM level reservations (CPU or Memory)

Memory Overhead Reservations (each VM reserves memory for critical worlds)

So in your case, more than likely you have a large reservation on one of your virtual machines which skews the numbers. I am guessing you are using the "number of host failures" Admission Control policy. Now you can solve this by:

1) removing the VM reservation

2) changing the admission control policy to "percentage based"

If you'd like to know more about this, I suggest reading my deepdive article:

http://www.yellow-bricks.com/vmware-high-availability-deepdiv/

or buy my vSphere 5.0 Clustering Deepdive book. (It's also available for 5.1; that one is probably preferable, as it also calls out the differences between 5.0 and 5.1.)
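The slot mechanics described above can be sketched in a few lines. This is a rough illustration of the idea, not HA's actual implementation; the 32MHz default and all VM figures below are assumptions for the example:

```python
def slot_size(vms, default_cpu_mhz=32):
    """Cluster slot size under the "host failures" policy (sketch):
    the largest CPU reservation and the largest memory reservation
    plus overhead across all powered-on VMs."""
    cpu = max([vm["cpu_res_mhz"] for vm in vms] + [default_cpu_mhz])
    mem = max(vm["mem_res_mb"] + vm["overhead_mb"] for vm in vms)
    return cpu, mem

def slots_per_host(host_cpu_mhz, host_mem_mb, slot_cpu, slot_mem):
    # A host provides as many slots as its most constrained resource allows.
    return min(host_cpu_mhz // slot_cpu, host_mem_mb // slot_mem)

# Hypothetical cluster: one VM with an 8GB memory reservation inflates
# the slot size, so a 48GB host suddenly offers only a handful of slots.
vms = [
    {"cpu_res_mhz": 0, "mem_res_mb": 0,    "overhead_mb": 122},
    {"cpu_res_mhz": 0, "mem_res_mb": 8192, "overhead_mb": 150},
]
slot_cpu, slot_mem = slot_size(vms)
print(slot_cpu, slot_mem)                                # 32 8342
print(slots_per_host(57600, 49152, slot_cpu, slot_mem))  # 5
```

This is why a single large reservation can trip the alarm even when overall usage is low: the slot size is driven by the worst-case VM, not by average consumption.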

mrnick1234567
Enthusiast

Hi Duncan,

Thanks for the pointers (great blog by the way - it's helped me out a number of times on my VMware learning curve!). I've read the deepdive amongst other things, but I'm still a little unclear on why I'm getting these messages.

You are right in assuming I have admission control set to tolerate one host failure. But I don't have any reservations set for any VMs. The cluster slot size is listed as:

32MHz (the default with no reservation I believe)

4vCPU

122MB

Other info is:

2203 slots in the cluster

64 used slots

1771 slots available

368 failover slots.

64 VMs powered on.

What does the "failover slots" value mean? I notice it's roughly (total slots / num servers), so is that the number of slots that would need to be failed over if one host was lost? If so, it seems to me there are ample remaining slots.

My largest VM has 4 vCPUs and 12GB RAM - am I correct in thinking that would take 100 slots ((12 x 1024MB) / 122MB)? Even then there seem to be ample slots.

All physical hosts are the same spec (24 CPUs, 48GB RAM), so I don't think any variation in server config is skewing things.

Thanks again, sorry if these are dumb questions.

jklick
Enthusiast

Definitely seems like some weird behavior. All the information you've presented points towards more than enough capacity. In answer to your question about the number of slots:

(1771 slots available) + (368 slots reserved for HA to work) + (64 slots already in use) = (2203 total slots in cluster)
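A quick sanity check of those numbers (the second part is just your own observation that the failover bucket is roughly one host's share of the total, rounded up):

```python
import math

total, available, used, failover = 2203, 1771, 64, 368
hosts = 6

# The three buckets account for every slot in the cluster:
assert available + failover + used == total

# And the failover bucket is roughly one host's worth of slots:
print(math.ceil(total / hosts))  # 368
```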

Per the previous recommendation, have you tried disabling/enabling HA to see if that helps?

@JonathanKlick | www.vkernel.com
mrnick1234567
Enthusiast

I haven't disabled/re-enabled HA on the whole cluster yet, but we had some weird behaviour the other weekend that might be related.

We had noticed that after upgrading the 6 hosts to ESXi 5, two hosts were not graphing any performance data. During the weekend those 2 hosts had some kind of issue where DRS decided to migrate nearly all the VMs off them both. Both hosts were showing 0 secs uptime in vCenter, although that was incorrect, as logging in via ssh showed they hadn't gone down. So it wasn't an HA or host isolation event as far as I know. Nor were the VMs left on the hosts eating up CPU/RAM, so I can't see how they would have caused everything else to get migrated off.

I manually migrated the last few VMs off these hosts and rebooted them, and now the graphing and the uptime have come back. I wonder if this issue had anything to do with the failover messages? I'm currently using percentage-based admission control rather than the host failures / slot count mode, but I might change it back and see if the messages reappear.
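For my own notes, my understanding of the percentage-based check (from the deepdive) boils down to something like this - memory side only, and all the reservation figures are made up:

```python
def admits_power_on(total_mb, reserved_mb, new_vm_mb, failover_pct=25):
    """Percentage-based admission control, memory side only (sketch):
    allow the power-on if at least failover_pct% of cluster memory
    remains unreserved afterwards."""
    free_after = total_mb - reserved_mb - new_vm_mb
    return free_after >= total_mb * failover_pct / 100

# Our cluster: 6 hosts x 48GB = 294912 MB (reservation figures made up)
print(admits_power_on(294912, 180000, 8192))  # True  - power-on allowed
print(admits_power_on(294912, 230000, 8192))  # False - blocked
```

No slot size in sight, which is why this mode copes better with the odd large reservation.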

Annoyingly the reboots caused the logs to be lost (or so I was told by VMware support when they had a look), so there was not much to go on in terms of fault diagnosis.
