I am seeing a weird issue in my environment that I can't quite figure out how to resolve. I have 3 hosts that were originally configured with a LAG (LACP) for VXLAN and the VTEP interface (1 per host). Each host only has two vmnics (vmnic0 and vmnic1). I have now removed the LAG. I was able to go in and change the existing vxw-dvs port groups to match the new failover order, with Uplink 1 and Uplink 2 in the active adapters on the teaming and failover settings page.
The issue arises when I change the vxw-vmknicPG port group, which I believe is the one containing the VTEP vmknics. If I change that to have both adapters listed under active, even with explicit failover specified as the policy, one of the vDS port groups stops passing traffic, but only that one; the rest continue on happily. As soon as I drop the second uplink back to standby or unused, traffic flows again. I also tried reversing the uplinks, and traffic still flows, so it doesn't seem to care which uplink is active, only that there is exactly one.
It also doesn't matter what settings I have on the affected port group itself; it already has both uplinks under active.
It just seems strange that only a single port group would be affected by this.
Are there any other steps I should be taking to convert this from LACP to failover mode so I can fully remove the LAG? I did follow the blog post about using the API to specify the failover order and policy, changing it from what it was on the Manager. I tested that by creating a brand new port group, and it does pull in the correct failover policy now.
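For anyone following along, the API change from that blog post looks roughly like this; a sketch only, the Manager address, credentials, and vDS object ID are placeholders, and the `/api/2.0/vdn/switches` endpoint and `FAILOVER_ORDER` teaming value are from the NSX-v API as I understand it, so verify against your version's API guide:

```shell
# Sketch: update the VXLAN teaming policy NSX records for the vDS.
# nsxmgr.example.com, admin:password, and dvs-21 are placeholders for
# your NSX Manager, its credentials, and the vDS managed object ID.
curl -k -u 'admin:password' -X PUT \
  -H 'Content-Type: application/xml' \
  https://nsxmgr.example.com/api/2.0/vdn/switches \
  -d '<vdsContext>
        <switch>
          <objectId>dvs-21</objectId>
        </switch>
        <mtu>1600</mtu>
        <teaming>FAILOVER_ORDER</teaming>
      </vdsContext>'
```

Note that this only changes the policy NSX applies going forward (e.g. to newly created port groups); it does not retroactively rewrite port groups that already exist.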
Just trying to figure out what is up with this single port group...
Any thoughts/ideas would be greatly appreciated!
Is the issue specific to VTEP interfaces? If possible, try creating a port group directly on vSphere, attach these NICs, and test the teaming policy.
Sorry, just want to clarify.
So create a new port group on a new standard switch, attach the vmnics to it (removing them from the dvSwitch uplinks), and test VM connectivity using that?
Yes, let me know the result of that test.
OK, I moved everything on one of my hosts to a standard vSwitch, set the port group for a VM to explicit failover, and put vmnic0 and vmnic1 in the active section. The VM kept connectivity the whole time, no matter what order or settings I used in the failover order section (as expected).
What is odd to me is that it really seems to be just this one VXLAN port group on the dvSwitch that is affected when I change the teaming policy on the VTEP port group. I can't explain why only this one group struggles with it. It seems like either they all would or none would.
I had a support ticket a while ago with NSX 6.2.1, I believe it was (a big bug relating to the bridges generated by NSX), and I learned from the support engineer that it is not supported to change the configuration of the vxw-dvs port groups as you mention. That part is merely "unsupported" and may still "work", but the real big NO NO is changing anything in vxw-vmknicPG. Even though the GUI lets you change things (damn GUI guys...), NSX at a low level keeps some kind of database of those configurations, and if you mess with that, things are going to break or show strange behaviors like the one you are experiencing.
If you want to change things like this on your cluster, you should, from NSX:
- Remove the cluster from the transport zones to which it was added (you may have more than one)
- Unconfigure the cluster (unprepare it for NSX)
- Make the changes you want to the vDS and all of the distributed port groups
- Configure the cluster once again, and choose Failover in the VMKNic Teaming Policy part of the GUI when configuring the cluster
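After the last step, you can sanity-check from the host side what NSX actually recorded. Something like the following on an NSX-prepared ESXi host should do it; the command names are as I recall them and the vDS name is a placeholder, so verify against your build:

```shell
# Sketch: verify the VXLAN configuration on an NSX-prepared host.
# Shows the vDS-level VXLAN settings, including the teaming policy
# NSX recorded (it should now show the failover policy, not LACP).
esxcli network vswitch dvs vmware vxlan list

# List the VTEP vmknics NSX created on the vDS ("MyDVS" is a placeholder).
esxcli network vswitch dvs vmware vxlan vmknic list --vds-name=MyDVS
```

If the teaming policy shown here does not match what you set, the host-level state and the Manager are out of sync, which is exactly the sort of mismatch that causes the odd single-port-group behavior.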
I don't have any documentation links for you, since I learned this by hand with that great VMware support engineer. It is cumbersome, to say the least, but I have done it several times: you essentially watch the cluster go back to the way it was before NSX, and then you configure NSX on it once again.
I'm actually playing with these settings myself, since I want to go from failover to some kind of configuration that lets me use more than one uplink on my Management/Edge cluster (Management and Edge clusters are not supposed to be used with LACP, according to the documentation). I already have LACP working with my Compute cluster and NSX.
Well, good luck.