VMware Cloud Community
RoscoT
Enthusiast

Management Network Failover Peculiarities

Hello all,

I have a couple of ESXi 5.1 Update 1 hosts that are only connected to a pair of 10G switches with two fibre cables. All management, VM, iSCSI and vMotion traffic goes through those two 10G NICs. The hosts are using a distributed switch.

The dvPort Group used for the management traffic is set to use vmnic4 as the Active NIC and vmnic5 is set as Standby. The Teaming and Failover Failback policy is set to No.

I wanted to test my redundancy so I performed the following actions:

  • I physically unplug the fibre cable associated with vmnic4. From a test machine I can still ping the ESXi management address, although it drops one ping as it transitions to vmnic5.
  • I plug the cable back in to vmnic4 and the pings continue fine.
  • When I unplug the fibre cable associated with vmnic5, I can no longer ping the ESXi management address.
  • I plug the cable back in to vmnic5 and I still cannot ping the ESXi management address.
  • The only way to get the pings to the ESXi management address to start again is to connect to the DCUI and choose "Restart Management Network".
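
To put numbers on a test like this, the ping results can be reduced to outage windows. Below is a minimal Python sketch, assuming the ping output has already been collected as (timestamp, reply received) pairs. The function name `outage_windows` and the sample data are hypothetical and are not taken from the actual test.

```python
from typing import List, Tuple


def outage_windows(results: List[Tuple[float, bool]]) -> List[Tuple[float, float, int]]:
    """Return (first_lost_ts, last_lost_ts, lost_count) for each run of lost pings."""
    windows = []
    start = None   # timestamp of the first lost ping in the current run
    last = None    # timestamp of the most recent lost ping in the current run
    lost = 0       # number of lost pings in the current run
    for ts, ok in results:
        if not ok:
            if start is None:
                start = ts
            last = ts
            lost += 1
        elif start is not None:
            # A reply arrived, so the current outage run is over.
            windows.append((start, last, lost))
            start, last, lost = None, None, 0
    if start is not None:
        # The capture ended while pings were still being lost.
        windows.append((start, last, lost))
    return windows


# Hypothetical capture: one ping per second, a single drop at t=3,
# then no replies at all from t=8 until the capture ends at t=11.
sample = [(float(t), t != 3 and t < 8) for t in range(12)]

if __name__ == "__main__":
    for first, last, count in outage_windows(sample):
        print(f"lost {count} ping(s) between t={first} and t={last}")
```

A single lost ping corresponds to the kind of failover described in the first bullet, while a window that never closes corresponds to the case where only "Restart Management Network" brings the pings back.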

It would seem that I have redundancy, but only for a single failure of the Active NIC. After that, even though the link has been "repaired" by plugging the cable back in, the management traffic does not move back to vmnic4. When Failback is set to "No", will it never fail back, or is there some time delay after which it will?

When I change the Failback to "Yes", the following happens:

  • I physically unplug the fibre cable associated with vmnic4. From a test machine I can still ping the ESXi management address, although it drops one ping as it transitions to vmnic5.
  • I plug the cable back in to vmnic4 and the pings immediately stop, presumably because it has failed back to vmnic4.
  • The only way to get the pings to the ESXi management address to start again is to connect to the DCUI and choose "Restart Management Network".

Any ideas?

Thanks in advance.

Rosco

5 Replies
jrmunday
Commander

Hi Rosco,

How is your physical networking configured, and what load balancing policy are you using? Are there any hints in the vmkernel.log?

One thing that comes to mind is that PortFast might not be configured in this instance - could you confirm if this has been enabled?

Cheers,

Jon

Message was edited by: Jon Munday

vExpert 2014 - 2022 | VCP6-DCV | http://www.jonmunday.net | @JonMunday77
mjha
Hot Shot

When Failback is set to No, traffic will not fail back to vmnic4 when its link is restored; it stays on vmnic5 until vmnic5 itself fails. As Jon suggested, please check that PortFast is enabled on the physical switch ports serving vmnic4 and vmnic5.
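
The policy behaviour described above can be sketched as a small state machine. This is a simplified Python model for illustration only, not ESXi's actual implementation; the class `TeamingPolicy`, its methods and the `replay` helper are hypothetical names, and link state is the only failure signal it considers.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TeamingPolicy:
    """Toy model of an explicit failover order with a Failback switch."""
    active: List[str]
    standby: List[str]
    failback: bool
    link_up: Dict[str, bool] = field(default_factory=dict)
    current: Optional[str] = None

    def __post_init__(self) -> None:
        for nic in self.active + self.standby:
            self.link_up.setdefault(nic, True)
        self.current = self._pick()

    def _pick(self) -> Optional[str]:
        # First NIC with link, in failover order (active first, then standby).
        for nic in self.active + self.standby:
            if self.link_up[nic]:
                return nic
        return None

    def set_link(self, nic: str, up: bool) -> Optional[str]:
        """Apply a link event and return the NIC carrying traffic afterwards."""
        self.link_up[nic] = up
        if self.current is None or not self.link_up[self.current]:
            # The NIC in use is down: move to the next one with link.
            self.current = self._pick()
        elif self.failback:
            # Failback = Yes: return to the preferred NIC once it is healthy.
            self.current = self._pick()
        # Failback = No and the current NIC is healthy: stay where we are.
        return self.current


def replay(failback: bool) -> List[Optional[str]]:
    """Replay the cable test: unplug/replug vmnic4, then unplug/replug vmnic5."""
    team = TeamingPolicy(active=["vmnic4"], standby=["vmnic5"], failback=failback)
    events = [("vmnic4", False), ("vmnic4", True), ("vmnic5", False), ("vmnic5", True)]
    return [team.set_link(nic, up) for nic, up in events]


if __name__ == "__main__":
    print("Failback=No :", replay(False))
    print("Failback=Yes:", replay(True))
```

In this model, with Failback set to No, unplugging vmnic5 moves traffic back to vmnic4, so the policy on its own would not explain a permanent loss of connectivity at that step.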


Alex Hunt | IT Operations Analyst | VCP-DCV

Website : https://alexhunt86.wordpress.com

Blog    : https://communities.vmware.com/blogs/vgeeks/

Manish Jha | Operations Support Engineer | vCloud Air Operations vExpert 2015-17 | vExpert-NSX | vExpert-Cloud | VCAP6-DCV | VCP6-DCV | RHCE-7 Website : http://vstellar.com
RoscoT
Enthusiast

Hi Jon / Alex,

Thanks for the ideas. I did some testing today with one of our network engineers and it turned out to be a problem with the ARP cache timeout on the Huawei switches. After reducing it, failback completed fine after about 10 seconds, which I can live with. It seems slightly faster when connected to Cisco kit, but all in all I'm happy with the failover and failback times now.

Cheers,

Rosco


MKguy
Virtuoso

Still seems a bit odd.

Failover should always work immediately (or just ~1 ping) since the host sends gratuitous ARP broadcast frames on the new link for all attached vNICs. Do you have the "notify switches" option enabled on the port group?

It works essentially the same way as when you vMotion a VM from one host to another: the new host sends gratuitous ARPs on behalf of the VM to update the physical switch's CAM/MAC tables. So you should see the same issue when you migrate VMs between hosts.
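
Why the notification frame matters can be sketched with a toy MAC learning table. This is a minimal Python model of a single learning switch, written for illustration; the class `LearningSwitch`, the port names and the MAC address are hypothetical, and a real pair of switches with ARP caches and aging timers is more involved.

```python
from typing import Dict, Optional


class LearningSwitch:
    """Toy layer 2 switch: remembers which port each source MAC was last seen on."""

    def __init__(self) -> None:
        self.mac_table: Dict[str, str] = {}

    def receive(self, src_mac: str, ingress_port: str) -> None:
        # Any frame from src_mac (data or notification) refreshes the entry.
        self.mac_table[src_mac] = ingress_port

    def egress_port(self, dst_mac: str) -> Optional[str]:
        # None means "unknown destination"; a real switch would flood the frame.
        return self.mac_table.get(dst_mac)


VMK_MAC = "00:50:56:aa:bb:cc"  # hypothetical management vmknic MAC

switch = LearningSwitch()

# Normal operation: the host talks through the uplink on port-1.
switch.receive(VMK_MAC, "port-1")
before_failover = switch.egress_port(VMK_MAC)

# The host fails over to the uplink on port-2 but has not sent anything yet,
# so the switch still forwards replies towards the old port.
stale = switch.egress_port(VMK_MAC)

# The host sends a notification frame on the new uplink; the entry moves.
switch.receive(VMK_MAC, "port-2")
after_notify = switch.egress_port(VMK_MAC)

if __name__ == "__main__":
    print(before_failover, stale, after_notify)
```

Until a frame from the host arrives on the new port, the table keeps pointing at the old one, which is the window the notification is meant to close.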

-- http://alpacapowered.wordpress.com
RoscoT
Enthusiast


Thanks for that MKguy. Notify Switches is set to "Yes". I just tested a vMotion and you're right, it does drop about 6 pings so something's not right with the switches. I'll get our network engineer to have another look.
