VMware Cloud Community
pearlyshells
Contributor

Understanding HA and DRS in ESX3.5 and VirtualCenter 2.5

We have ESX3.5 and VC2.5 with HA/DRS enabled. DRS in one of the clusters is in Partial mode.

We had a situation yesterday with a VM in the HA/DRS cluster (only one resource pool). One of our administrators needed to replace a bad memory module on one of the three hosts in the cluster, so he manually migrated all VMs off that host. Later, another administrator worked an issue with one of the VMs in the same cluster and powered it off and on again. Either he migrated the VM or it somehow moved on its own to the empty host (the one that was due to have its memory module replaced).

Note: since the cluster was in Partial mode, I don't know why the VM would automatically migrate to the empty host. I believe that, at best, a recommendation would have popped up instead. Regardless, the VM somehow got back onto the host where NO VMs should have existed.

Later that day, the first administrator SHUT DOWN the host (he did not put it in maintenance mode) to replace the memory module. He did not check for running VMs on the host; after all, he had manually migrated them all off earlier. From here, things get a bit blurred based on testimony and logs. From what I could tell, the sole VM on the shut-down host stayed on that host and was powered off. This caused an alert and obviously caused issues for the users who were on the VM.

From what I read about HA and DRS, HA uses a "worst case scenario" to determine failover capacity, based on the VMs running in the cluster when one or more hosts fail. This "worst case scenario" takes the LARGEST CPU reservation of any running VM in the cluster and the LARGEST memory reservation and applies them to every running VM to calculate the total resources required, if I read that correctly. If that total exceeds what is available, some VMs being failed over will not be allowed to power on because they do not meet Admission Control requirements.

If my interpretation is correct, that may explain why the VM on the shut-down host did not power back up. Is that correct?

kjb007
Immortal

Partial automation gives you migration recommendations AND automatically powers VMs on at the best host. So if you had a host that was not in maintenance mode and had no VMs on it, DRS in partial mode would have started your VM on that host. This is what maintenance mode is for.
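To make the difference between the automation levels concrete, here is a rough Python sketch. It is only a toy model, not DRS itself: the host names and VM counts are made up, and "best host" is simplified to "eligible host with the fewest running VMs", which is an assumption for illustration.

```python
# Toy model of DRS automation levels (illustrative only, not VMware code).
# "recommend" = an administrator must approve; "automatic" = DRS just does it.
DRS_LEVELS = {
    "manual": {"initial_placement": "recommend", "migration": "recommend"},
    "partially automated": {"initial_placement": "automatic", "migration": "recommend"},
    "fully automated": {"initial_placement": "automatic", "migration": "automatic"},
}


def place_at_power_on(level, hosts):
    """Return (how the placement is applied, chosen host) for a VM power-on.

    Hosts in maintenance mode are never candidates. The "best" host is
    simplified here to the one running the fewest VMs (an assumption).
    """
    eligible = {name: h for name, h in hosts.items() if not h["maintenance"]}
    if not eligible:
        raise ValueError("no eligible host in the cluster")
    best = min(eligible, key=lambda name: eligible[name]["running_vms"])
    return DRS_LEVELS[level]["initial_placement"], best


# Hypothetical three-host cluster; esx3 was emptied by hand but NOT put
# into maintenance mode.
hosts = {
    "esx1": {"running_vms": 6, "maintenance": False},
    "esx2": {"running_vms": 5, "maintenance": False},
    "esx3": {"running_vms": 0, "maintenance": False},
}
print(place_at_power_on("partially automated", hosts))  # ('automatic', 'esx3')

# With esx3 in maintenance mode, the VM lands elsewhere.
hosts["esx3"]["maintenance"] = True
print(place_at_power_on("partially automated", hosts))  # ('automatic', 'esx2')
```

In this model the emptied host looks like the most attractive target at power-on, which matches what happened to the VM in question.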

What you describe as "worst case" is the slot size calculation. The slot size is set by the highest CPU reservation and the highest memory reservation among the VMs in the cluster. From that, HA determines how many slots the cluster has and how many are already in use, which tells you whether the cluster can handle any more VMs powering on. Do you have automatic startup/shutdown enabled on the ESX host that was shut down? If so, that would explain why your VM was not forced into an HA event.
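The slot arithmetic can be sketched roughly as follows. This is a simplified model, not the actual HA algorithm: it ignores memory overhead, the minimum slot values (256 MHz / 256 MB) are placeholder assumptions, and the host and VM figures are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class VM:
    name: str
    cpu_reservation_mhz: int
    mem_reservation_mb: int


@dataclass
class Host:
    name: str
    cpu_mhz: int
    mem_mb: int


def slot_size(vms, min_cpu_mhz=256, min_mem_mb=256):
    """Slot = largest CPU reservation and largest memory reservation of any VM.

    The minimum values are assumed placeholders for VMs with no reservation.
    """
    cpu = max([vm.cpu_reservation_mhz for vm in vms] + [min_cpu_mhz])
    mem = max([vm.mem_reservation_mb for vm in vms] + [min_mem_mb])
    return cpu, mem


def slots_per_host(host, slot):
    """A host holds as many slots as its scarcer resource allows."""
    cpu_slot, mem_slot = slot
    return min(host.cpu_mhz // cpu_slot, host.mem_mb // mem_slot)


def free_slots(hosts, vms, host_failures_tolerated=1):
    """Slots left after assuming the largest hosts fail (the worst case)."""
    slot = slot_size(vms)
    per_host = sorted(slots_per_host(h, slot) for h in hosts)
    surviving = per_host[: len(per_host) - host_failures_tolerated]
    return sum(surviving) - len(vms)  # every running VM occupies one slot


hosts = [Host(f"esx{i}", cpu_mhz=8000, mem_mb=16384) for i in (1, 2, 3)]
vms = [
    VM("a", 0, 0),
    VM("b", 2000, 4096),  # largest CPU reservation
    VM("c", 500, 8192),   # largest memory reservation
]

print(slot_size(vms))          # (2000, 8192)
print(free_slots(hosts, vms))  # 1
```

Note how two VMs with large reservations define the slot for everyone: each 8000 MHz / 16 GB host holds only two slots, so after reserving one host for failover the cluster has room for a single additional VM.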

-KjB

vExpert/VCP/VCAP vmwise.com / @vmwise -KjB
pearlyshells
Contributor

Thank you for your reply. To answer your question...no. We do not have auto startup/shutdown enabled on the host.

Troy_Clavell
Immortal

Here are a couple of things that may be useful regarding slot size and calculations. There are some advanced options you can use to push the slot size calculations beyond the default conservative approach.

On a side note, you cannot use startup/shutdown options for your guests if your hosts are part of an HA cluster.

Message was edited by: Troy Clavell. The new advanced options apply only to vSphere 4, not VI3.

kjb007
Immortal

I've seen those options affect VMs even though they should not.

The shutdown of the ESX host does not in itself constitute an HA event, though (at least I don't believe so), because the host still responds to the other hosts querying its HA agent. So that may still be why your VM did not restart.

-KjB

Troy_Clavell
Immortal

kjb007 wrote: "The shutdown of the ESX host does not in itself constitute an HA event though, at least I don't believe so, because the host still responds to the other hosts querying its HA agent. So that may still be why your vm did not restart."

That holds if your isolation response is set to "Leave powered on". If not, I would think that with no network connectivity to the COS because the host was shut down, an HA event would be triggered and the guest restarted on another host in the cluster. I can see that if the VM was powered off, it won't be restarted.

pearlyshells
Contributor

Thanks Troy,

We do have the option set to "Leave powered on".

kjb007
Immortal

That's an isolation response, and not HA. But the point is still taken. The machine should have restarted correctly. I just validated that it would. I'll try the startup/shutdown options as well just to throw that into the mix.

-KjB

Troy_Clavell
Immortal

Upon reading the thread in full, like I should have done in the first place: do you have admission control set? If so, and your cluster thinks there are no available slots, it won't power on that guest. So your thinking is correct. With vSphere 4, you have das.slotCpuInMHz and das.slotMemInMB, which will help increase the slot count, so you can still leave admission control on. For VI3, which doesn't have these advanced options, we just disabled admission control.
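As a rough illustration of why those two vSphere 4 options raise the slot count, here is a small Python sketch. It assumes the options act as upper bounds on the slot size (which is how I understand them), and the host and reservation figures are made up.

```python
def capped_slot_size(max_cpu_res_mhz, max_mem_res_mb,
                     slot_cpu_cap_mhz=None, slot_mem_cap_mb=None):
    """Slot size from the largest reservations, optionally capped.

    The caps stand in for das.slotCpuInMHz / das.slotMemInMB, treated here
    as upper bounds on the slot size (an assumption of this sketch).
    """
    cpu = max_cpu_res_mhz
    mem = max_mem_res_mb
    if slot_cpu_cap_mhz is not None:
        cpu = min(cpu, slot_cpu_cap_mhz)
    if slot_mem_cap_mb is not None:
        mem = min(mem, slot_mem_cap_mb)
    return cpu, mem


def slots_on_host(host_cpu_mhz, host_mem_mb, slot):
    """Number of slots a host can hold, limited by its scarcer resource."""
    cpu_slot, mem_slot = slot
    return min(host_cpu_mhz // cpu_slot, host_mem_mb // mem_slot)


# Hypothetical cluster: largest reservations are 2000 MHz and 8192 MB,
# and each host has 8000 MHz and 16384 MB.
uncapped = capped_slot_size(2000, 8192)
capped = capped_slot_size(2000, 8192, slot_cpu_cap_mhz=500, slot_mem_cap_mb=1024)

print(slots_on_host(8000, 16384, uncapped))  # 2
print(slots_on_host(8000, 16384, capped))    # 16
```

A smaller slot means more slots per host, so admission control stops blocking power-ons just because one or two VMs carry large reservations.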

pearlyshells
Contributor

It is set to "prevent VMs from being powered on...."

Troy_Clavell
Immortal

Are there any events listed in the Tasks & Events tab for your cluster that say something to the effect of "unable to power on guest due to violation of resource constraints"?

pearlyshells
Contributor

I'm learning that there are some settings we should probably change from what the past Pro Services rep originally set up, and the Task/Event log is one of them: there is no entry, and the log shows only one day at a time. I also gather that it might be best to change Admission Control to "power on...". And we've decided that we don't need to keep DRS set to Partial. That was done for a specific purpose and is no longer needed, so we'll set it to Auto.
