VMware Cloud Community
lil328i
Enthusiast

VMs did not vMotion after host was isolated!

Hi everyone!

We recovered from this one, but we had to do it manually!

One of our hosts running 4.1 became isolated, in that vSphere still showed the host, but we could not even place it in maintenance mode because most of the options were grayed out. The host showed as disconnected in vSphere, and clicking CONNECT did not fix it. When we rebooted the host, it came back with a "boot file missing ..." error, and we had to use the REPAIR CD to restore it.

My question is: why didn't all of our VMs vMotion to the other two hosts when this host became isolated?

HA is enabled.

VMware HA: "Enable host monitoring" is checked.

Admission control is enabled.

Admission control policy: "Host failures cluster tolerates" is set to 1.

Virtual Machine Options (cluster settings): VM restart priority is Medium, and host isolation response is set to Power off.

VM monitoring: VM Monitoring Only, with sensitivity set to Medium.

We use vSphere 4.1 Essentials Plus on three HP DL365 G5 hosts and one EMC CLARiiON SAN.

Thanks!

Peter

Thanks & Regards Peter K. VCP 5.1

Accepted Solutions
GaryHertz
Enthusiast

Was your DNS server on the host that went down? I had a similar situation where both my primary and secondary DNS servers were on the host that went down. vCenter eventually lost communication with all the hosts because it couldn't resolve their names. Rebooting the host didn't help because the DNS servers had been shut down as part of the HA process, and vCenter couldn't restart them on the other hosts because of the DNS issue. I had to open a web console on the first host to start the DNS servers manually.

I now have a DRS rule set up so that my DNS servers can't run on the same host. I also added the names of all my hosts to the hosts file on my vCenter Server.
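Keeping the two DNS servers on separate hosts is what DRS calls an anti-affinity rule. The check below only illustrates what such a rule enforces; it does not use any VMware API, and the VM and host names are hypothetical.

```python
from collections import defaultdict


def anti_affinity_violations(placement, rule_vms):
    """Return {host: [vms]} for every host running more than one VM from rule_vms.

    placement maps VM name -> host name (a hypothetical inventory snapshot).
    rule_vms is the set of VMs that must run on separate hosts.
    """
    by_host = defaultdict(list)
    for vm, host in placement.items():
        if vm in rule_vms:
            by_host[host].append(vm)
    # Only hosts carrying two or more members of the rule violate it.
    return {host: sorted(vms) for host, vms in by_host.items() if len(vms) > 1}


if __name__ == "__main__":
    # Hypothetical names: both DNS servers ended up on the same host.
    placement = {"dns01": "esx01", "dns02": "esx01", "app01": "esx02"}
    print(anti_affinity_violations(placement, {"dns01", "dns02"}))
```

With both DNS VMs on `esx01`, the check reports that host; once they are spread across hosts, it returns an empty dict.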

a_p_
Leadership

Don't mix up HA and vMotion/DRS. vMotion/DRS migrates running virtual machine workloads online between running hosts. HA kicks in if a host's management network becomes isolated (no heartbeats from the other hosts and no connection to the isolation address). If isolation occurs, the VMs would be powered off and restarted on other hosts (not vMotioned).
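The distinction can be summarized as a small decision model. This sketch is a simplification for illustration, not VMware's HA agent code: it covers only the isolated host's own check (both the peer heartbeats and the isolation address must fail) and the three isolation responses offered in the cluster settings. The function names are hypothetical.

```python
from enum import Enum


class IsolationResponse(Enum):
    LEAVE_POWERED_ON = "Leave powered on"
    POWER_OFF = "Power off"
    SHUT_DOWN = "Shut down"


def is_isolated(heartbeats_from_peers: bool, isolation_address_reachable: bool) -> bool:
    # Isolated only when BOTH checks fail: no HA heartbeats from peer hosts
    # and no reply from the isolation address (by default the gateway).
    return not heartbeats_from_peers and not isolation_address_reachable


def vm_action(heartbeats_from_peers: bool, isolation_address_reachable: bool,
              response: IsolationResponse) -> str:
    """What happens to a VM on this host under the simplified model."""
    if not is_isolated(heartbeats_from_peers, isolation_address_reachable):
        # Not isolated: the VM keeps running; nothing is migrated or restarted.
        return "keep running"
    if response is IsolationResponse.LEAVE_POWERED_ON:
        return "keep running"
    # Power off / shut down stops the VM so another host can restart it.
    # The VM is restarted cold, not vMotioned.
    return "stop, then restart on another host"
```

In this model, a host that is only disconnected from vCenter Server but still exchanging heartbeats never reaches the isolation branch, so its VMs are neither powered off nor restarted elsewhere.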

I guess the host was just disconnected from vCenter Server for some reason (e.g. DNS/name resolution issues), but not isolated from the network.

André

lil328i
Enthusiast

We don't have DRS, just HA.

But I know what you're saying.

We could ping the host, but the VMs were all in a disconnected state, as was the host.

Even though we powered down the host, the VMs still did not fail over to the other two hosts. I'm not sure whether powering down the host should trigger HA for all the VMs on the host.

So you're saying the host wasn't "isolated" even though it wasn't available to manage, and none of the VMs could be vMotioned.

Peter

arturka
Expert

Hi

Quote: So you're saying the host wasn't "isolated" even though it wasn't available to manage, and none of the VMs could be vMotioned.

Yep. An isolation event occurs when the isolation address is not reachable from the ESX(i) host; by default, that is the gateway address.

Quote: Even though we powered down the host, the VMs still did not fail over to the other two hosts. I'm not sure whether powering down the host should trigger HA for all the VMs on the host.

It depends on what you have set in the Virtual Machine Startup/Shutdown policy (an advanced option on the ESX(i) Configuration tab). By default, starting and stopping VMs with the system is disabled, which means that if you powered off your server, HA should restart your VMs on the remaining two hosts.

Why they didn't get restarted, I don't know.

Do you have enough capacity on the remaining two hosts to restart the VMs from the failed host?

Is host monitoring enabled ?

Is admission control enabled or disabled? If it's enabled, what's the type?
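The capacity question can be made concrete with a rough model of the "Host failures cluster tolerates" policy, which in vSphere 4.x is based on slot sizes. This is a simplified sketch, not VMware's implementation: it ignores the real per-VM memory overhead calculation and advanced options, and `min_cpu_mhz`, `mem_overhead_mb`, and the host/VM figures are assumed values for illustration.

```python
from dataclasses import dataclass


@dataclass
class Vm:
    cpu_reservation_mhz: int = 0
    mem_reservation_mb: int = 0


@dataclass
class Host:
    cpu_mhz: int
    mem_mb: int


def slot_size(vms, min_cpu_mhz=256, mem_overhead_mb=100):
    """Slot = largest CPU and largest memory reservation of the powered-on VMs.

    min_cpu_mhz and mem_overhead_mb are assumed values for this sketch.
    """
    cpu = max([min_cpu_mhz] + [vm.cpu_reservation_mhz for vm in vms])
    mem = max([0] + [vm.mem_reservation_mb for vm in vms]) + mem_overhead_mb
    return cpu, mem


def host_slots(host, slot):
    cpu_slot, mem_slot = slot
    # A host holds as many slots as its scarcer resource allows.
    return min(host.cpu_mhz // cpu_slot, host.mem_mb // mem_slot)


def tolerates_host_failures(hosts, vms, failures=1, **slot_options):
    """True if every powered-on VM still gets a slot after losing the
    `failures` hosts that hold the most slots (the worst case)."""
    slot = slot_size(vms, **slot_options)
    slots = sorted((host_slots(h, slot) for h in hosts), reverse=True)
    return sum(slots[failures:]) >= len(vms)


if __name__ == "__main__":
    hosts = [Host(cpu_mhz=8000, mem_mb=16384) for _ in range(3)]  # hypothetical
    print(tolerates_host_failures(hosts, [Vm() for _ in range(40)]))
    big = [Vm(cpu_reservation_mhz=4000)] + [Vm() for _ in range(4)]
    print(tolerates_host_failures(hosts, big))
```

With these assumed numbers, 40 unreserved VMs fit on two surviving hosts, but a single VM with a 4000 MHz reservation raises the slot size for every VM, so even five VMs no longer fit. That is one common reason a cluster has less failover capacity than expected.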

Cheers

Artur

Visit my blog

Please don't forget to award points for "helpful" and/or "correct" answers.
VCDX77 My blog - http://vmwaremine.com
cdc1
Expert

Powering off the host most likely wouldn't have released the SCSI locks on the VMs' VMDKs. While those locks are present, the other hosts in your cluster would not be able to power on the VMs because of SCSI reservations.

This, of course, assumes that you powered off the host by literally pushing the power button on the physical hardware.

Also, how long did you wait before deciding that the HA restart of the VMs (that were on the "failed" host) was not working? VM restarts can take up to 30 minutes to complete during an isolation event.

arturka
Expert

Message deleted; I misinterpreted cdc's reply.

Sorry.

cdc1
Expert

That's not what I said.

I was implying that there may have been operations in progress on the VMs' VMDKs (which cause SCSI reservations) at the time the host was powered off.

However, in the end, I suspect it's more likely that the OP didn't wait long enough before deciding whether the HA failover had failed. After all, 30 minutes is a long time to wait. (I know that if it were me, I would probably lose patience before it got to that point.)

Message was edited by: cdc

lil328i
Enthusiast

Yes, my primary DNS server was on this host, but the secondary was on another host.

I have now changed the IP address to point to another DNS server outside of the SAN, a physical DC.

Hopefully this will take care of it.

Thank You!
