VMware Cloud Community
kurtwest
Contributor

vSwitch Failback Question

Does setting the Failback option to No work when you are using Active/Active for the vSwitch uplinks? It doesn't appear to be working for me. I had to change out a switch in my blade chassis today, and when I pulled the switch out everything failed over to the other uplink fine. However, when I put the new switch in, several VMs became inaccessible. I believe they were failing back to their original uplink. The VMs started working again once I got all the cables plugged in to the chassis switch and configured the switch.

I use blades for my ESX hosts, and the ESX hosts detect the link from the internal chassis switch before the port channel going to my chassis switch has finished the Spanning Tree Protocol process. This is why I don't want to use failback.

Are there any known issues with this, or any workarounds? It caused me some major headaches today.

13 Replies
depping
Leadership

In 3.0, setting failback to No means it will fail back.

Duncan

My virtualisation blog:

If you find this information useful, please award points for "correct" or "helpful".

kurtwest
Contributor

I think you are mistaken. Here is a section from the ESX Configuration Guide, pg. 41:

Failback - Select Yes or No to disable or enable failback.

This option determines how a physical adapter is returned to active duty after recovering from a failure. If failback is set to Yes (default), the adapter is returned to active duty immediately upon recovery, displacing the standby adapter that took over its slot, if any. If failback is set to No, a failed adapter is left inactive even after recovery until another currently active adapter fails, requiring its replacement.

Am I misunderstanding something?
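
The Failback wording quoted above can be read as a small state machine. The following is only a rough sketch of that reading for a single active/standby pair; the function name, the event format, and the adapter names are hypothetical illustrations, not anything taken from ESX itself.

```python
def simulate(events, failback, primary="vmnic0", standby="vmnic1"):
    """Return the adapter carrying traffic after each (nic, state) link event."""
    link_up = {primary: True, standby: True}
    in_use = primary
    history = []
    for nic, state in events:
        link_up[nic] = state == "up"
        if in_use is None or not link_up[in_use]:
            # The adapter in use has no link: take whichever adapter still has one.
            candidates = [n for n in (primary, standby) if link_up[n]]
            in_use = candidates[0] if candidates else None
        elif failback and link_up[primary]:
            # Failback = Yes: the recovered primary displaces the standby.
            in_use = primary
        # Failback = No: the recovered primary stays idle until the
        # adapter currently in use fails.
        history.append(in_use)
    return history


events = [("vmnic0", "down"), ("vmnic0", "up"), ("vmnic1", "down")]
print(simulate(events, failback=True))
print(simulate(events, failback=False))
```

With Failback set to Yes, traffic returns to vmnic0 as soon as its link comes back; with No, it stays on vmnic1 until vmnic1 itself fails.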

depping
Leadership

Then you probably posted this in the wrong section of the forums. Which version are you running, 3.0.x or 3.5.x? In 3.5.x the terminology changed from "rolling failover" (which I assumed you were talking about because of the 3.0 forum you posted this in) to "failback".

You are right: when failback is set to Yes, the switch back to the original active adapter should occur. But do you have an "active/standby" setup, or do you load balance on virtual port ID?

Duncan

kurtwest
Contributor

I have an active/active setup and I am using load balancing based on virtual port ID. I think I know what went wrong. I was looking at all my vSwitches in the cluster and, sure enough, there was one on which I forgot to set failback to No. I am going to confirm that the VMs that had the issue are on the host that didn't have that setting set correctly.

Kurt

admin
Immortal

The Failback option only applies if you are using Active/Standby adapters - it is not taken into consideration if you have no Standby adapters.

In an Active/Active setup, the VM may or may not change back to a vmnic if it fails and then comes back. It depends on the calculation done by the load balancing policy, which is separate from the "Failback" option.

kurtwest
Contributor

I was afraid that might be the case. Do you know where I can find that answer in some official VMware documentation?

depping
Leadership

If he's using virtual port ID, then there's no real calculation being done: the first VM uses the first NIC, the second VM the second NIC, the third the first, etc.

The above link states that in case of a failover the standby NIC takes over, and traffic will return to the "active" link when it comes back. When you use "virtual port ID" all NICs are active, so there's no reason to fail back. I've also read in a VMworld 2008 presentation somewhere that there needs to be at least 1 standby NIC for this setting.

Duncan

kurtwest
Contributor

Good grief this is giving me a major case of "tired head". The virtual port ID stuff has to do with Load Balancing, not failback...correct?

Thanks for all the help.

depping
Leadership

okay, let's try again:

You've got vSwitches with two NICs load balancing based on virtual port ID, so both NICs are active. For the "failback" setting to work you would need at least 1 standby NIC. If you don't have a standby NIC, all the VMs will keep running on the NIC they were switched over to when the other NIC died. The recovered NIC will only start serving VMs when a VM is rebooted, started, or VMotioned to the server.

Duncan


kurtwest
Contributor

Okay, very good. I understand now. The only question left is why some of my VMs did try to fail back when the original link came back up.

I am going to test this in the lab and I will post my results.

admin
Immortal

In an active/active setup using virtual port ID as the load balancing policy, some VMs will fail back.

As depping has said, when both vmnics are up, VM1 will use vmnic1, VM2 will use vmnic2, VM3 will use vmnic1 etc.

In the case where vmnic1 goes down, all VM's will move to vmnic2.

When vmnic1 comes back, VM1 and VM3 should be moved back to vmnic1.

What you experienced is quite common, as ESX will activate a vmnic as soon as it detects carrier on the link, regardless of whether the physical switch or network is operating correctly on that path. For instance, even though a physical port is active, Spanning Tree Protocol will block that port for a number of seconds when it goes from down to up.

Physical network configurations that help mitigate these failback problems include configuring portfast on the physical switch ports connected to ESX and enabling Link State Tracking.
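
The behaviour described above can be sketched in a few lines. This is only a toy model that follows the round-robin description given in this thread; the real ESX port ID policy is not reproduced here, and the function names and the `forwarding` map (standing in for the spanning tree state of each switch port) are assumptions made for illustration.

```python
def assign_uplinks(vms, uplinks, carrier):
    """Spread VMs round-robin over the uplinks that report carrier."""
    usable = [u for u in uplinks if carrier[u]]
    if not usable:
        return {vm: None for vm in vms}
    return {vm: usable[i % len(usable)] for i, vm in enumerate(vms)}


def blackholed(assignment, forwarding):
    """VMs whose uplink has carrier but whose switch port is not forwarding yet."""
    return [vm for vm, nic in assignment.items()
            if nic is not None and not forwarding[nic]]


vms = ["VM1", "VM2", "VM3"]
uplinks = ["vmnic1", "vmnic2"]

normal = assign_uplinks(vms, uplinks, {"vmnic1": True, "vmnic2": True})
failed = assign_uplinks(vms, uplinks, {"vmnic1": False, "vmnic2": True})
# vmnic1 regains carrier, but spanning tree is still blocking its switch port.
recovered = assign_uplinks(vms, uplinks, {"vmnic1": True, "vmnic2": True})
stranded = blackholed(recovered, {"vmnic1": False, "vmnic2": True})

print(normal)
print(failed)
print(stranded)
```

In this model VM1 and VM3 move back to vmnic1 the moment carrier returns, so they are the ones that lose connectivity while the port is still blocked.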

It's possible to configure individual Active/standby and failback options per portgroup so if the failback issue is a problem for some critical VM's, you can place them in another portgroup using active/standby and failback set to No.

It's also possible to change the ESX failure detection from link status to beaconing. Beaconing sends out a layer 2 broadcast packet on the expectation that the other vmnics will receive it. If one doesn't, ESX disables that vmnic.

Beaconing does not work well with only 2 vmnics, though, as it's impossible to tell which vmnic has the issue.

For example, vmnic1 sends out the packet but it is never received by vmnic2. It's impossible to tell where the packet failed: was there an issue with vmnic1's network or vmnic2's? Either, or both, result in the packet being dropped. ESX therefore doesn't know which vmnic has the problem and will send every packet through both vmnics simultaneously, which can lead to further problems, as you can imagine.
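
The two-vmnic ambiguity can be illustrated with a simplified model. This is a sketch under stated assumptions, not ESX's actual beacon probing logic: `beacon_results` and `suspects` are hypothetical helpers, and a NIC is treated as suspect only if it exchanged no beacons with any other NIC in the team.

```python
from itertools import permutations


def beacon_results(nics, broken):
    """received[(src, dst)] is True if dst saw the beacon sent by src."""
    return {(s, d): s not in broken and d not in broken
            for s, d in permutations(nics, 2)}


def suspects(nics, received):
    """NICs that exchanged no beacons with any other NIC in the team."""
    isolated = []
    for nic in nics:
        others = [n for n in nics if n != nic]
        heard_any = any(received[(o, nic)] or received[(nic, o)] for o in others)
        if not heard_any:
            isolated.append(nic)
    return isolated


two = ["vmnic1", "vmnic2"]
three = ["vmnic1", "vmnic2", "vmnic3"]

# The path behind vmnic1 is broken in both cases.
print(suspects(two, beacon_results(two, {"vmnic1"})))
print(suspects(three, beacon_results(three, {"vmnic1"})))
```

With two NICs both look isolated, so the fault cannot be pinned on either one; with a third NIC, vmnic2 and vmnic3 still hear each other and only vmnic1 stands out.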

Message was edited by: appk

Alextan75
Contributor

Hi to all.

We are testing our C-Class enclosure system with ESX 3.5 U3.

We have 6 HP BL465c blades and 4 Cisco 3020 switches (IOS 12.2(46)SE), each connected to 2 3750s in a stack.

The configuration uses 2 teams of 2 NICs each: the first is dedicated to SC + VMotion, the other to virtual machines.

We set an Active-Active NIC configuration, (vmnic0+vmnic2) and (vmnic1+vmnic3). Each internal 3020 switch port (1-16) has been configured in trunk mode with spanning-tree PORTFAST enabled (as suggested in this post).

We are trying to simulate a failover and failback of a NIC on a blade by disabling and then re-enabling a switch port (attached to a NIC of a blade host).

During failover we noticed a one-second downtime, short enough for sessions to remain valid (file copy on the VM, 1 ICMP reply lost).

The issue is the failback behaviour. When the port is administratively brought back online, we lose the VM for at least 30 seconds, and every kind of session is lost.

Attached you can find 2 capture files taken from a SPAN port that mirrors the traffic in these 2 situations (UP-DOWN = failover / DOWN-UP = failback).

The filter to apply is eth.addr == (00:50:56:8c:61:10). This is the MAC of the test VM.

Also attached are the network diagram and the vmnic and vSwitch configurations.



Thank you in advance

tanino
Contributor

Nobody? Any info on this topic? Should I just disable failback? (It seems more like a workaround than a solution...)

Thank you all
