Enthusiast

VC 2.5 HA bug?

I just noticed that after upgrading to VC 2.5, both of my clusters (one ESX 3.5, the other 3.0.2) show a "Current Failover Capacity" of zero under HA on the summary screen. Before the upgrade one cluster showed 4 and the other 2, so I know the capacity is there. I have completely disabled and then re-enabled HA on both clusters, with no luck. Is anyone else seeing this?

10 Replies
Immortal

A few questions for you:

Are there any HA errors on any of the hosts in the cluster?

Is DRS enabled on this cluster?

What is the maximum cpu and memory reservation among all powered on vms in the cluster?

Do you have both 1-cpu and 2-cpu vms in the cluster?

Enthusiast

Same problem here.

I have a 4-host cluster (DL380 G5, each with 16GB of RAM and 2 quad-core CPUs). About 40% of RAM is free on every host.

Before the VC upgrade, I was able to power on all my VMs without allowing HA constraint violations. After the upgrade, to power on the same number of VMs (with the same resources) I must allow constraint violations, and "Current Failover Capacity" shows 0.

- No HA error on any host

- No CPU or RAM reservations

- DRS is enabled

- 1-cpu and 2-cpu VMs

Enthusiast

Thanks for replying eziskind,

Are there any HA errors on any of the hosts in the cluster?

No, I have checked every node.

Is DRS enabled on this cluster?

Yes, on both clusters.

What is the maximum cpu and memory reservation among all powered on vms in the cluster?

Not exactly sure what you mean, but one cluster shows 105GHz CPU and 132GB RAM on the summary screen, and only one resource pool has a reservation: 10GHz and 20GB memory. The other cluster shows 49GHz and 90GB, and again only one resource pool has a reservation: 10GHz and 12GB memory. We do not have reservations set on a per-VM basis.

Do you have both 1-cpu and 2-cpu vms in the cluster?

Yes, in both clusters.

Like I said before, the capacity was there pre-2.5. Unless the algorithm for determining capacity has changed it looks like a bug to me.

Enthusiast

Not sure how the HA algorithm works, but if you set it to "allow constraint violations", HA will work regardless.

This happens in my 3.0.1 environment too, and I have seen HA still work.

Immortal

The HA admission control algorithm has become somewhat more conservative in VC 2.5 to cover some corner cases. One case where it can be overly conservative is a cluster with both 1-CPU and 2-CPU VMs (in general, VMs with mixed numbers of virtual CPUs).

I can verify whether this is the problem if you can collect some extra logging:

1) Make sure HA admission control is enabled (that is, set to not allow constraint violations).

2) Enable verbose logging on the VirtualCenter server (Administration->"VirtualCenter Management Server Configuration..."->"Logging Options").

3) Try to power on a VM (this should fail).

4) Check the vpxd.log file (C:\Documents and Settings\All Users\Application Data\VMware\VMware VirtualCenter\Logs) for a line like this: "[VpxdDas] Slot info". Post the 5 lines that follow it.
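
If scrolling through a verbose vpxd.log by hand is painful, something like the following Python sketch might help pull out the relevant lines. It only assumes the marker text quoted above; the function name and the script name in the usage comment are made up for illustration, and the exact layout of a real log line may differ.

```python
import sys
from typing import Iterable, List

MARKER = "[VpxdDas] Slot info"  # marker text quoted in the post above


def slot_info_blocks(lines: Iterable[str], following: int = 5) -> List[List[str]]:
    """For each line containing MARKER, return that line plus the next `following` lines."""
    all_lines = [line.rstrip("\n") for line in lines]
    blocks = []
    for i, line in enumerate(all_lines):
        if MARKER in line:
            blocks.append(all_lines[i : i + 1 + following])
    return blocks


# Hypothetical usage: python slot_info.py path\to\vpxd.log
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], errors="replace") as log:
        for block in slot_info_blocks(log):
            print("\n".join(block))
            print("-" * 40)
```

Each printed block is the marker line followed by up to five lines, which is what is being asked for here.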

Enthusiast

The 6 host cluster:

Das admission check failed. Configured failover: 2, Expected new failover: 0

Slot info:

Slot CPU=256, Slot numVcpus=4, Slot memory=457

Total slots=90, Total VMs=101

Total hosts=6, Total good hosts=6

Slots per host: 21 21 21 9 9 9

Exit DAS_PROFILE CheckPowerOnVm (203 ms)

The 3 host cluster:

Slot info:

Slot CPU=256, Slot numVcpus=4, Slot memory=401

Total slots=44, Total VMs=40

Total hosts=3, Total good hosts=3

Slots per host: 17 17 10

VpxDrmRetrieveDomainConfigInfo: current activation is NULL, skipping privilege checking.
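
One way to read these numbers is sketched below in Python. It is only a guess at the model, not VMware's actual code: it assumes failover capacity is the number of hosts that can be lost, largest first, while the remaining slots still cover every VM counted in "Total VMs". The function name is made up for illustration.

```python
from typing import List


def failover_capacity(slots_per_host: List[int], vms: int) -> int:
    """Hosts that can fail, largest first, while remaining slots still cover `vms`.

    Assumed worst-case model for illustration only.
    """
    remaining = sorted(slots_per_host)  # ascending, so the largest host is last
    if sum(remaining) < vms:
        return 0  # not enough slots even with every host up
    failures = 0
    while remaining:
        remaining.pop()  # worst case: lose the largest surviving host
        if sum(remaining) < vms:
            break
        failures += 1
    return failures


print(failover_capacity([21, 21, 21, 9, 9, 9], 101))  # 6-host cluster
print(failover_capacity([17, 17, 10], 40))            # 3-host cluster
```

Both calls print 0 under this model. The 6-host cluster has fewer slots (90) than VMs (101) to begin with, and the 3-host cluster drops to 27 slots for 40 VMs once a 17-slot host is lost.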

Immortal

Looks like you have some 4-cpu vms in the clusters too. That will really skew things. You're being hit by the combination of 2 new things in the HA admission control for VC 2.5:

1) If no reservation is set for a VM (or it is set to 0), use a default of 256MHz and 256MB. (These values can be changed using the HA advanced options das.vmMemoryMinMB and das.vmCpuMinMHz.)

2) For the CPU component of the slot, use (max MHz per virtual CPU) * (max number of vCPUs per VM).

The HA admission control algorithm is overly conservative in non-homogeneous clusters, i.e. ones with VMs that have different reservations and/or vCPU counts. #2 above makes it more conservative. Given these limitations, it's best to keep the cluster as homogeneous as possible. Is it possible to put the 4-CPU VMs in a separate cluster? If not, you can try setting the default VM resources to 0 (using the advanced options in #1). This is how things worked in VC 2.0.
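
To make the two rules concrete, here is a rough Python sketch of the slot-size arithmetic. It is an illustration under assumptions, not the real implementation: the function name and the (MHz per vCPU, vCPU count, memory MB) tuple layout are invented, and a value of 0 is taken to mean "no reservation set". The log above reports slot memory of 457 and 401, so the real calculation evidently adds something this sketch does not model.

```python
from typing import List, Tuple

# Defaults named in rule 1 (das.vmCpuMinMHz, das.vmMemoryMinMB)
DEFAULT_CPU_MHZ = 256
DEFAULT_MEM_MB = 256


def slot_size(vms: List[Tuple[int, int, int]],
              cpu_min_mhz: int = DEFAULT_CPU_MHZ,
              mem_min_mb: int = DEFAULT_MEM_MB) -> Tuple[int, int]:
    """vms: one (mhz_per_vcpu, num_vcpus, mem_mb) tuple per VM; 0 means unset.

    Returns (slot_cpu_mhz, slot_mem_mb) under the assumed reading of rules 1 and 2.
    """
    max_mhz_per_vcpu = max((mhz or cpu_min_mhz) for mhz, _, _ in vms)  # rule 1
    max_vcpus = max(n for _, n, _ in vms)
    slot_mem = max((mb or mem_min_mb) for _, _, mb in vms)             # rule 1
    return max_mhz_per_vcpu * max_vcpus, slot_mem                      # rule 2


mixed = [(0, 1, 0), (0, 2, 0), (0, 4, 0)]   # 1-, 2- and 4-vCPU VMs, no reservations
no_quad = [(0, 1, 0), (0, 2, 0)]            # same cluster without the 4-vCPU VM

print(slot_size(mixed))      # (1024, 256)
print(slot_size(no_quad))    # (512, 256)
print(slot_size(mixed, cpu_min_mhz=0, mem_min_mb=0))  # (0, 0)
```

A single 4-vCPU VM doubles the CPU slot for every VM in the cluster compared with a 2-vCPU maximum, which is why a mixed cluster fits fewer slots per host.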

Enthusiast

I changed all of our 4-vCPU VMs to 2-vCPU and now the failover capacity on both clusters shows 1. Thanks for letting us know about the HA changes and variables.

Contributor

Is there better documentation somewhere that explains this? The current documentation is a little weak in that regard, and not everyone has the luxury of having 'clean' clusters or being able to downgrade a machine.

Thanks.

CP

Contributor

Gentlemen,

Thanks for all the info in this thread; we were experiencing the same issue. We had one VM with 4 CPUs in a farm of 1- and 2-CPU VMs. Once we moved the 4-CPU VM back to 2 CPUs, the farm showed many servers available for failover.

Steve
