VMware Cloud Community
bigusdadius
Contributor

Isolation Response?

This has probably been covered already, but I thought I would ask anyway since I couldn't find much on it.

If you lose network connectivity to all ESX servers in the HA cluster, will they individually declare themselves isolated and power down all VMs?

31 Replies
Rumple
Virtuoso

I agree that the default should have been set to leave powered on and should probably stay that way.

In reality, if you have it set to leave powered on and HA thinks a host went down, it will try to power the VMs on elsewhere and fail, because the files are still locked by the running VMs.

In a situation where a host has failed, wouldn't the VMs..umm..already be off anyhow? Why would you bother trying a power-off of VMs that should in theory be down anyhow?

Initiate a power on and if they are already running the host isn't down.

Seems like a pretty basic idea made exceptionally difficult, but I am probably only thinking of whatever scenario was used to design the system.

dsolerdelcampo
Enthusiast

I agree with you.

"If VMware gave me the ability to do a shutdown instead of a power off in case of isolation then I'd say 'go for it'. Since they don't give me that option, I've about convinced myself that I don't want the software to determine when to pull the power plug."

conyards
Expert

Having reread this thread, I cannot fault the logic behind the leave powered on idea. I'd be very interested to see how this works in a real world scenario.

My understanding of the power-down command issued to the VM at the time of isolation is that it runs something along the lines of:

vmware-cmd /vmfs/volumes/path/to/*.vmx stop trysoft hard

So it in theory shouldn't be as drastic as a plug pull...

Simon

https://virtual-simon.co.uk/
VirtualKenneth
Virtuoso

I just reread the thread as well, and I think the logic in leaving VMs powered on is this: when a host gets isolated (that is, its service console (SC) network gets isolated), the VMs keep on running, so there is no need to shut them down. This only applies to situations where the SC network and the VM network are separated.

koit
Contributor

I had a problem yesterday when I was changing network cables on my hosts.

All VMs on one of the hosts powered off due to host isolation.

This is by design, because I have configured HA with the default isolation settings.

The problem was that the powered off VMs didn't start on any of the other hosts. They had to be manually started.

I can't figure out why this happened.

stuten
Enthusiast

I've tested how this works after being bitten by it in production, and it definitely doesn't try a soft shutdown; it powers off the VM. I have set all my VMs to stay on and have observed what I assumed to be the case: when they stay powered on, the other hosts still try to relocate them and power them on, and those attempts fail because the VMs are still powered on. So if the host had truly blown up and failed, the VMs would still restart on other hosts. I decided that in my environment it was safer to keep them running. With the number of NICs in the hosts trunked and going to different switches, if the network was causing an isolation, it was most likely a much larger problem than powering the VMs up on other hosts would solve.

I am planning to open a case with VMware, however. Recently, when our network team was doing a switch refresh, all my hosts went isolated even though I have the SC on a trunk, because the link never went down (spanning tree had the trunk tied up). I decided that beacon probing would have kept that from happening, so I switched failover detection to beacon probing. Much to my dismay, I then lost all connectivity to the host. Looking around on the boards, it appears that beacon probing doesn't work (it was real fun fixing that through the SC). I then created two separate SC connections and pinned each to one of the trunk lines. Various VMware documentation suggests having more than one SC connection to help cut down on false isolation conditions. Unfortunately this didn't work either; the host still went isolated. Soooo, I want to see what VMware suggests on this topic.
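
To illustrate why a trunk whose link never drops can still end in isolation, here is a minimal Python sketch. It is only a toy model, not ESX behaviour or any VMware API: the `Uplink` class, the `forwarding` flag and the uplink name `vmnic0` are all assumptions made up for the example. It does not explain why beacon probing failed in the case described above.

```python
from dataclasses import dataclass


@dataclass
class Uplink:
    """Toy model of one physical uplink (hypothetical, for illustration only)."""
    name: str
    link_up: bool      # physical carrier is present
    forwarding: bool   # the switch port actually passes traffic


def healthy_by_link_status(uplink: Uplink) -> bool:
    # Link-status detection only looks at the carrier.
    return uplink.link_up


def healthy_by_probe(uplink: Uplink) -> bool:
    # A probe-style check only succeeds if traffic really gets through.
    return uplink.link_up and uplink.forwarding


# Link is up, but the port is not forwarding (for example, held by spanning tree).
blocked = Uplink(name="vmnic0", link_up=True, forwarding=False)

link_verdict = healthy_by_link_status(blocked)
probe_verdict = healthy_by_probe(blocked)
print(link_verdict, probe_verdict)
```

In this model the link-status check still calls the uplink healthy, so no failover to the other trunk member would be triggered, while a traffic-based check would flag it.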

dsolerdelcampo
Enthusiast

The problem we had recently is documented in http://www.vmware.com/community/thread.jspa?threadID=81052&tstart=0 . Due to a problem in our LAN, our ESX host was without LAN connectivity for two minutes, and for this reason HA powered off all our VMs. It took between ten minutes and half an hour for our VMs to recover from the 2-minute LAN problem. I guess that with the VMs left powered on, our problem would have been resolved in 2 or 3 minutes. If you look at the Event Viewer you can see the "The previous system shutdown at xx:xx:xx on xx/xx/2007 was unexpected." message in all the VMs.

Besides, due to a spanning tree recalculation, we also had problems with the VMs on two ESX hosts in the other building (we are using extended VLANs).

admin
Immortal

In that case, you would want to disable HA on the cluster first, do the network maintenance, and then re-enable HA.

Network is an important component of HA.
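
The disable, maintain, re-enable sequence can be sketched in Python as a context manager, so that HA is restored even if the maintenance step raises an error. This is only an illustration: `Cluster` is a hypothetical in-memory stand-in, not a VMware API, and the cluster name is made up.

```python
from contextlib import contextmanager


class Cluster:
    """Hypothetical in-memory stand-in for an HA cluster (not a VMware API)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.ha_enabled = True


@contextmanager
def ha_disabled(cluster: Cluster):
    """Turn HA off for the duration of the block, then restore the old state."""
    previous = cluster.ha_enabled
    cluster.ha_enabled = False
    try:
        yield cluster
    finally:
        # Runs even if the maintenance work fails part-way through.
        cluster.ha_enabled = previous


cluster = Cluster("example-cluster")  # hypothetical name
with ha_disabled(cluster):
    state_during = cluster.ha_enabled
    # ... network maintenance would happen here ...
state_after = cluster.ha_enabled
print(state_during, state_after)
```

The point of the `finally` clause is that nobody has to remember to re-enable HA by hand after a maintenance window that went wrong.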

admin
Immortal

The network is an important part of HA, for heartbeating and communications for host failure detection.

If network maintenance is done, it must be coordinated with the HA admin. The ideal thing would be to disable HA, do maintenance and then re-enable HA. Or do a phased network maintenance such that one path will still be available.

With redundant networks, network isolation should be a relatively rare occurrence.

The default of power-off on isolation exists for the following reason:

The host is isolated from the network, which means the VM is likely unreachable as well. The other hosts in the cluster that are not isolated will think this host has failed, and will be trying to failover its VMs. If the VMs are still powered-on, they cannot be restarted elsewhere. So the isolated host gives up its resources as soon as possible, so that its VMs can be restarted in a timely manner on other hosts in the cluster.

Depending on your VM network configuration, and other factors, such as VM payload, you may want to choose "leave powered-on".
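
The trade-off between the two isolation responses can be modelled with a small Python sketch. It is a simplified toy under stated assumptions, not how HA is implemented: the `VM` class, the function names, the response strings and the VM names are all hypothetical, and the only rule modelled is the one described in this thread, that a VM cannot be restarted elsewhere while the original host still holds its files locked.

```python
from dataclasses import dataclass


@dataclass
class VM:
    """Toy VM record (hypothetical); starts out running on the isolated host."""
    name: str
    powered_on: bool = True
    lock_held: bool = True  # files locked by the host currently running the VM


def apply_isolation_response(vms, response):
    """Isolated host's side: power off (releasing locks) or leave VMs running."""
    if response not in ("power_off", "leave_powered_on"):
        raise ValueError(f"unknown isolation response: {response}")
    if response == "power_off":
        for vm in vms:
            vm.powered_on = False
            vm.lock_held = False


def host_fails(vms):
    """A real host failure: VMs stop and their locks go away regardless of setting."""
    for vm in vms:
        vm.powered_on = False
        vm.lock_held = False


def attempt_failover(vms):
    """Surviving hosts' side: a restart succeeds only if the lock is free."""
    return {vm.name: not vm.lock_held for vm in vms}


isolated = [VM("vm-a"), VM("vm-b")]  # hypothetical VM names
apply_isolation_response(isolated, "leave_powered_on")
print(attempt_failover(isolated))  # restarts are refused while locks are held
```

With "power_off" the same call reports every restart as possible, at the cost of a hard stop for the guests. With "leave_powered_on" the restarts are refused during an isolation, but a genuine host failure (modelled by `host_fails`) still frees the locks so that failover can proceed.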

admin
Immortal

Another option for the network would be to have a private network for the hosts in the cluster, to do the heartbeating and HA communication, as the redundant secondary network.

There will be less chance of this network being affected by some network maintenance.

In any case, a network maintenance activity must be coordinated with HA.

bigusdadius
Contributor

Just recently at TSX in Las Vegas it was indicated that the heartbeat will be more intelligent in the future and will not rely on just the one connection.

admin
Immortal

It still will use multiple network connections, if they are available. It could use storage also for heartbeating, later on.
