VMware Cloud Community
nsabinske
Enthusiast

VMware HA's default behavior screwed us; Suggestions, Experiences?

This evening, for the third time, our core switch rebooted itself. Since the last time it did this we've implemented a sizeable VMware cluster. The end result is that the reboot is seen as multiple NIC failures on every host in the HA cluster (logged as "Link is Down"), and HA's isolation response turns every VM off, never to turn it on again. This really sucks: a feature that's supposed to give us HIGH availability has, by design, given us NO availability.

Obviously, this is a situation we'd like to avoid in the future, as I have to assume the switch will do this again someday. However, I WOULD like to retain VMware HA's ability to restart guests when an individual ESX host fails. (So yeah, we'd have to monitor and fix network failures manually, but that's better than losing the entire infrastructure.)

Does anyone know the cleanest way to turn off the network isolation response? I know you can go into the cluster configuration and set individual VMs to "Leave Running", but Power Off seems to be the default. Can I change that default, or disable the whole feature in one fell swoop?

Also, has anyone run into this kind of issue before, or found another solution? How likely is a split-brain scenario if we don't use the Power Off response, realllllly? (We use a FC SAN, and from what I can tell an isolated host holds its file locks indefinitely, so another host shouldn't be able to power on the same VM?)

Thanks for any help/insights in advance!

4 Replies
SimonStrutt
Enthusiast

We have a similar problem; we manually change each VM to "Leave Running", as you do. I'm not aware of a way to make this the default or to apply it to multiple VMs at once.

"The greatest challenge to any thinker is stating the problem in a way that will allow a solution." - Bertrand Russell
williamarrata
Expert

What you might want to do is, in your HA settings, set your VMs to stay powered on. Your isolation response might currently be set to Power Off.

Hope that helped. 🙂
virtualdud3
Expert

What "level" of VC are you using?

You can upgrade to VC 2.0.2 (if you haven't already), which gives you the option of increasing the failure/isolation detection timeout beyond the default of 15 seconds (see the link below). You could set this to a value longer than the time it takes the "problem" switch to reboot.

http://kb.vmware.com/kb/1002080

(This will redirect you to "Setting Failure and Isolation Detection Timeout and Multiple Isolation Response Addresses"; then see the attachment "HA_Tech_Best_Practices.pdf".)
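For reference, those settings are entered as HA advanced options on the cluster (Edit Settings → VMware HA → Advanced Options). A rough sketch, assuming the option names from that KB article; the values and addresses below are illustrative only:

```
# Cluster-level HA advanced options (values are examples, not recommendations)
das.failuredetectiontime = 60000     # failure detection timeout in ms (default 15000)
das.isolationaddress    = 10.0.0.254 # extra address to ping before declaring isolation
```

Raising the timeout above your switch's reboot time should keep a brief outage from triggering the isolation response at all.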

############### Under no circumstances are you to award me any points. Thanks!!!
nsabinske
Enthusiast

The solution we're going with is multihoming the service console on a completely separate switch, which tested very well this weekend. It took pulling both network cables at the same time to get HA to respond; pulling either one alone generated no response even after minutes of waiting. This is in line with other cluster configurations anyway, which often use two heartbeat networks.
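For anyone wanting to replicate this on ESX 3.x, the second service console connection can be sketched roughly as below. The vSwitch, vmnic, and IP names are assumptions for illustration; substitute a NIC cabled to the second physical switch:

```
# Create a second vSwitch and uplink it to the NIC on the second switch
esxcfg-vswitch -a vSwitch2
esxcfg-vswitch -L vmnic3 vSwitch2
esxcfg-vswitch -A "Service Console 2" vSwitch2

# Add a second service-console interface on that port group
esxcfg-vswif -a vswif1 -p "Service Console 2" -i 192.168.50.10 -n 255.255.255.0
```

With two console interfaces up, HA only declares isolation when both heartbeat paths are down.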

The chances of BOTH switches deciding to core at the same time are extremely slim. Of course, having your main switch do that at all is an issue in and of itself, but that's another story 🙂

Thanks all for your input!
