TG11
Contributor

Lost console connection caused HA to migrate the VMs


We have 6 ESX 3.0.2 servers running HA and DRS, and VirtualCenter 2.0.2, with a SAN. Today we were replacing a connection for the console port on one of our ESX servers, and while the network connection was down, HA moved all of the VMs on that ESX server to a different server. Since I was only losing the console connection, I didn't think the VMs would get moved. I have 4 NICs on the server (1 for console, 2 for VMs, and 1 for VMotion). I thought my VM group (which has 2 NICs running NIC teaming) and my VMotion group should not have been affected. Can someone explain why my VMs went down and had to be migrated?

Did I miss a config somewhere, or am I just not understanding HA?


Accepted Solutions
fordian
Hot Shot

Host failure detection occurs 15 seconds after the HA service on a host has stopped sending heartbeats to the other hosts in the cluster. A host stops sending heartbeats if it is isolated from the network (the COS network). At that time, the other hosts in the cluster treat this host as failed, while the host declares itself isolated from the network. By default, the isolated host powers off its VMs. These VMs can then successfully fail over to other hosts in the cluster.

If the isolated host has SAN access, it retains the disk lock on the VM files, and any attempt to fail over the VM to another host fails. The VM continues to run on the isolated host. VMFS disk locking prevents simultaneous write operations to the VM disk files and potential corruption.

If the network connection is restored within 12 seconds, the other hosts in the cluster do not treat this as a host failure, the host that had the network problem does not declare itself isolated, and its VMs continue to run.

As a result, if the network connection is restored in the window between 12 and 14 seconds after the host lost connectivity, the virtual machines are powered off but not failed over.
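The timing windows above can be summarised in a few lines of Python. This is only an illustrative sketch, not anything HA itself runs: the function name `ha_outcome` is made up, the 12 and 15 second thresholds are the ones quoted in this thread, and the default "power off" isolation response is assumed.

```python
# Thresholds as quoted in this thread for ESX 3.x HA; treat them as assumptions.
ISOLATION_SECONDS = 12   # isolated host starts powering off its VMs
FAILURE_SECONDS = 15     # other hosts declare the host failed and restart its VMs


def ha_outcome(outage_seconds: float) -> str:
    """Classify what happens to the VMs for a console-network outage of this length."""
    if outage_seconds < ISOLATION_SECONDS:
        # Connection came back before the host declared itself isolated.
        return "VMs keep running"
    if outage_seconds < FAILURE_SECONDS:
        # Host isolated itself, but the cluster never declared it failed.
        return "VMs powered off, not failed over"
    return "VMs powered off and restarted on other hosts"


for seconds in (5, 13, 60):
    print(f"{seconds:>3}s outage: {ha_outcome(seconds)}")
```

The middle case is the awkward one: the outage is long enough for the host to power off its VMs, but too short for the rest of the cluster to restart them.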

Thank you

Dominic

4 Replies
dominic7
Virtuoso

That's working as designed IIRC. If you lose console connection for more than 15 seconds your VMs get migrated. There is another caveat for losing console connection between 12s and 15s but I can't remember what it is.

TCronin
Expert

The caveat you're forgetting is that losing the console connection will cause the host to shut down its VMs, so that when they come up on other hosts there aren't duplicates.

The problem is that if you lose multiple connections (switches go down or reset), you can end up with all of your VMs powered off and no other host bringing them up. You need to have redundancy on the service console for that reason.
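To make that failure mode concrete, here is a rough Python sketch. It is a toy model, not how HA actually places VMs: `vm_states` and the host names are hypothetical, the outage is assumed to last longer than 15 seconds, and the default "power off" isolation response is assumed.

```python
def vm_states(hosts: dict[str, bool]) -> dict[str, str]:
    """Map each host to the fate of its VMs.

    `hosts` maps a host name to True if its service console network is isolated.
    Simplification: VMs from an isolated host are restarted on the first healthy host.
    """
    healthy = [name for name, isolated in hosts.items() if not isolated]
    states = {}
    for name, isolated in hosts.items():
        if not isolated:
            states[name] = "VMs running"
        elif healthy:
            states[name] = f"VMs powered off, restarted on {healthy[0]}"
        else:
            # Every host isolated itself, so nobody is left to restart anything.
            states[name] = "VMs powered off, nothing restarts them"
    return states


# One host loses its console link: its VMs come back up elsewhere.
print(vm_states({"esx1": True, "esx2": False}))
# The console switch resets and every host is isolated: all VMs stay down.
print(vm_states({"esx1": True, "esx2": True}))
```

With a redundant service console connection, a single switch reset would not mark every host as isolated at once, which is the point of the recommendation above.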

Tom Cronin, VCP, VMware vExpert 2009 - 2021, Co-Leader Buffalo, NY VMUG
depping
Leadership

ESX starts powering off the VMs at the 12th second, and the VMs are started on another host at the 15th second. So if the network connection returns between the 12th and the 15th second, your VMs stay down.

And technically it isn't a migration, because no data is moved; HA just powers the VM off on esx1 and powers it on on esx2. Note that it's a real power off, not a shutdown.

Duncan
