HA Isolation Response triggered?

vmproteau · ‎03-03-2008

VMware ESX 3.0.2
VMware VirtualCenter 2.0.2

I have a 2-Host cluster. Each Host has 6 interfaces:

vSwitch1: Service Console (2 NICs, each on a separate physical switch)
vSwitch2: VMkernel (2 NICs, each on a separate physical switch)
vSwitch3: VM Network (2 NICs, each on a separate physical switch)

While testing, I disconnected each Service Console interface individually and verified various combinations, and all failed over properly. I can't test a switch reboot, so to simulate one I simultaneously pulled a Service Console interface cable from both Hosts. Everything worked fine and I continued to access both Hosts from Virtual Center. The problem seemed to occur when I plugged them back in. When I got back to my desk, each Host was reconfiguring HA and the VM had been powered off. I presume this was the result of the HA Isolation Response. Of course, it worries me that a switch reboot might cause this behavior. Is the simulation I did valid? If not, why not?

ctfoster · ‎03-03-2008

Since you have two Service Console NICs, each on a separate switch, I guess the risk is limited to the chance of both switches rebooting at the same time.

However, it is also possible to set the following parameters:

- das.isolationaddress2: An additional isolation response address can be specified. Use the IP address of the switch used by the second NIC of the virtual switch.

- das.failuredetectiontime: Controls the default timeout used for failure & isolation detection (default 15 seconds; the value is entered in milliseconds). You could set this to 60 seconds (60000) to allow for switch failover/spanning tree/portfast.
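
As a rough illustration of the two options above, here is a minimal Python sketch that builds the option names and values as they might be typed into the cluster's HA advanced options. The helper name `ha_advanced_options` and the address 192.0.2.2 are made up for the example; it only assumes that das.failuredetectiontime takes its value in milliseconds.

```python
def ha_advanced_options(failure_detection_seconds, second_isolation_address):
    """Return HA advanced option names mapped to string values.

    Hypothetical helper for illustration only; it does not talk to
    VirtualCenter, it just formats the values.
    """
    return {
        # Entered in milliseconds, so 60 seconds becomes "60000".
        "das.failuredetectiontime": str(failure_detection_seconds * 1000),
        # Placeholder address; use the switch behind the second NIC.
        "das.isolationaddress2": second_isolation_address,
    }


options = ha_advanced_options(60, "192.0.2.2")
for name, value in options.items():
    print(f"{name} = {value}")
```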

vmproteau · ‎03-03-2008

I thought it would; however, my test was simulating a single switch reboot. I pulled the cables from each Host that plug into the same switch. That worked fine, but plugging them back in triggered the Isolation Response. Since each Host had a Service Console connection that never went down, I wouldn't expect HA or the Isolation Response to be triggered.

I do have "Rolling Failover" set to No, so plugging the cables back in would have made the Service Console switch back to these NICs. Could that be what causes the issue? I verified that setting it to Yes (so the Service Console doesn't swap back) doesn't trigger an Isolation Response. It doesn't seem that it should be necessary to do this though?

Thanks for those parameters. I did find them while I was researching the VMware KB:

I guess I'm looking for a technical explanation of why the simulation I did caused HA to power down VMs, since I had at least one Service Console NIC up at all times on each Host.

vmproteau · ‎03-03-2008

I think I understand what happened. We don't have "portfast" configured on our Service Console switch ports. So, when I unplugged the cables, everything failed over as expected. When I plugged them back in, the "Rolling Failover" setting told the vSwitch to fail back to the original NICs. Because Spanning Tree hadn't finished bringing the ports up within 15 seconds or so, HA was triggered.

So the solutions are any of the following:

Set portfast on the Service Console ports. This should make the port available immediately when the vSwitch fails back.
Set "Rolling Failover" on the Service Console vSwitch to Yes. This will keep it from failing back.
Increase the default timeout for failure and isolation response. That should give the port enough time for spanning tree to complete.
Add additional isolation response addresses.
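
The timing behind the first three options can be sketched as a simplified Python model. It assumes the 802.1D default forward delay of 15 seconds for each of the listening and learning states, and that a portfast port forwards immediately. The function names are made up for illustration, and the model ignores extra isolation addresses and any other network delays.

```python
STP_FORWARD_DELAY_S = 15  # 802.1D default, applied to listening and learning


def seconds_until_forwarding(portfast):
    """Seconds a freshly connected port waits before it passes traffic."""
    return 0 if portfast else 2 * STP_FORWARD_DELAY_S


def isolation_triggered(portfast, failback, failure_detection_s=15):
    """True if the Service Console is unreachable for the whole detection window.

    failback=True corresponds to "Rolling Failover" set to No, i.e. the
    vSwitch moves back to the reconnected NIC as soon as link returns.
    """
    if not failback:
        # Traffic stays on the NIC that never went down.
        return False
    return seconds_until_forwarding(portfast) >= failure_detection_s


print(isolation_triggered(portfast=False, failback=True))   # original setup
print(isolation_triggered(portfast=True, failback=True))    # portfast set
print(isolation_triggered(portfast=False, failback=False))  # Rolling Failover = Yes
print(isolation_triggered(portfast=False, failback=True, failure_detection_s=60))
```

Only the first case, matching the original setup, reports an isolation in this model; each of the other three changes one setting and avoids it.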

Thanks ctfoster

rossb2b · ‎03-03-2008

I ran into the same issue when testing my environment, with one exception. I use trunking, and there is a separate portfast trunk command that I had never heard of. If you are using trunk ports, portfast alone will not fix the issue.
