2 Replies Latest reply on May 23, 2020 12:41 PM by ZibiM

    vCenter destroyed, trying to recreate LAG to restore connectivity and failing miserably

    wowitsdave Novice

      Hey, guys,

       

      I'm dealing with a client right now who I hear from every year or so, so I am not in the environment all the time.

       

      He had a bunch of drives fail in short succession which took out his vCenter. We tried to bring up the backup but it didn't work. We created a new vCenter and pulled in all the hosts.

       

      He had a Distributed Virtual Switch with all the internal traffic going over it and a standard switch/portgroup for WAN. Since the vCenter is gone, we now have the temporary proxy switch, and at least host management is running.

       

      He was also using dual 10Gb fiber for everything (that's plenty), set up as a LAG (vmnic4 and vmnic5).

       

      When I go into the new DvS under LACP, I can create a LAG, but I only want to assign one vmnic (vmnic5) to it for now, because if it doesn't work, it will take down management on all the hosts (since the temporary proxy switch from the old DvS has both of these interfaces).

       

      When I get this LAG created on my new DvS and set the other interfaces as unused in Teaming and Failover, I cannot get VMs on that switch to connect to anything. They have a NIC but can't ping the default gateway. I plugged a cable into one host for a management network so I can bring vCenter back up, so I have a standard switch and port group up and going. I have connectivity on that standard switch and portgroup (no VLAN).

       

      He says the physical network gear has not been reconfigured and the fiber cables have not moved. However, when I put vmnic5 into the LAG, it has no network connectivity. I'm afraid to move vmnic4 to the LAG for fear that some configuration on vmnic4 will vanish and host management will go away.

       

      I've been referring to this article: LACP Support on a vSphere Distributed Switch. That's how I built my new LAG. I also used esxcli network vswitch dvs vmware lacp config get and compared the LACP config on both the new and old switch to match up the settings (Active/Passive, etc) and they now match.
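      For anyone following along, this is what the comparison looks like when run on the host over SSH (the switch names DLanSwitch and DvSwitch are specific to this environment):

      ```shell
      # Dump the LACP configuration (mode, etc.) for every DvS the host knows about,
      # so the old and new switch settings can be compared side by side:
      esxcli network vswitch dvs vmware lacp config get

      # Dump the live LACP negotiation status (local info, partner info, port state):
      esxcli network vswitch dvs vmware lacp status get
      ```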

       

      Here's my LACP status on one of the hosts I've been focusing on. Notice the partner information on the new one is different/missing and the Port State is also different.

       

      [root@BLAH:~] esxcli network vswitch dvs vmware lacp status get

      DLanSwitch

         DVSwitch: DLanSwitch (the old one)

         Flags: S - Device is sending Slow LACPDUs, F - Device is sending fast LACPDUs, A - Device is in active mode, P - Device is in passive mode

         LAGID: 171963862

         Mode: Active

         Nic List:

               Local Information:

               Admin Key: 15

               Flags: SA

               Oper Key: 15

               Port Number: 2

               Port Priority: 255

               Port State: ACT,AGG,SYN,COL,DIST,

               Nic: vmnic4

              Partner Information:

               Age: 00:00:08

               Device ID: 00:05:33:68:12:23

               Flags: SA

               Oper Key: 32

               Port Number: 5

               Port Priority: 32768

               Port State: ACT,AGG,SYN,COL,DIST,

               State: Bundled

       

       

      DvSwitch

         DVSwitch: DvSwitch (the new one)

         Flags: S - Device is sending Slow LACPDUs, F - Device is sending fast LACPDUs, A - Device is in active mode, P - Device is in passive mode

         LAGID: 2649772321

         Mode: Active

         Nic List:

               Local Information:

               Admin Key: 79

               Flags: SA

               Oper Key: 79

               Port Number: 1

               Port Priority: 255

               Port State: ACT,AGG,SYN,

               Nic: vmnic5

              Partner Information:

               Age: 00:00:00

               Device ID:

               Flags:

               Oper Key: 0

               Port Number: 0

               Port Priority: 0

               Port State:

               State: Independent

       

      In summary:

       

      vmnic4 is on the old DvS from the gone vCenter. Host management and all other traffic are on vmnic4 and it works. The physical vmnic is seeing networks. VMs and the vmk port have connectivity.

      vmnic5 is on the newly-created DvS. The physical vmnic is not seeing networks. The test VM has no network connectivity (can't ping the gateway, can't get DHCP).
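      A quick way to double-check which switch each vmnic is actually attached to, and whether the physical links themselves are up (run on the host; output omitted here):

      ```shell
      # Physical NIC state: confirms link status and speed for vmnic4 and vmnic5
      esxcli network nic list

      # Which uplinks each distributed switch currently owns on this host
      esxcli network vswitch dvs vmware list

      # Same for the standard switches (the temporary management vSwitch)
      esxcli network vswitch standard list
      ```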

       

      I have been fussing with this for about 10 hours. What am I missing here? What is different about vmnic4 (working in the proxy DvS) and vmnic5 (not working in my new LAG)?

       

      Thank you!

        • 1. Re: vCenter destroyed, trying to recreate LAG to restore connectivity and failing miserably
          wowitsdave Novice

          I do realize that having everything on one DvS is not optimal (probably). I'm just trying to get this guy working again. I also want to create an ephemeral port group on the new DvS so he can recover vCenter more easily.

           

          I didn't build this, I'm just trying to put him back together.

          • 2. Re: vCenter destroyed, trying to recreate LAG to restore connectivity and failing miserably
            ZibiM Enthusiast

            Hi

             

            LAGs are a bit unfriendly to handle in a VMware environment.

            Especially if you cannot move the ESXi management onto dedicated uplinks outside of the LAG.

            1. You need to switch one of the uplinks on the physical switch side to a normal trunk config

            2. You need to assign this uplink to a standard vSwitch and have the vmkernel port used for management there

            3. You need to create a distributed switch with your usual number of uplinks

            4. You need to create the LAG in the DvS (Configure / Edit Settings / LACP)

            5. You need to assign your other uplink (the one still with the LAG config on the physical switch) to the LAG uplink port of the DvS

            6. You need to create all the needed port groups on the DvS and change the uplink policy on them to: uplink used: lag, uplinks unused: dvUplink1, dvUplink2

            7. You can now migrate the vmkernel port used for management traffic from the standard vSwitch to the distributed vSwitch port group

            8. You can now assign the uplink from step 1 to the distributed switch LAG uplink port

            9. You can now revert the changes made on the physical switch in step 1 -> that is, you can now configure the LAG on the physical switch side
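            A rough sketch of what steps 1 and 2 look like from the CLI, assuming the standard switch is vSwitch0 and the physical switch is Cisco-style (both are just examples; the physical-side syntax depends entirely on the vendor):

            ```shell
            # Step 1 happens on the physical switch, not on ESXi. On Cisco IOS it
            # would look roughly like this (interface/channel numbers hypothetical):
            #   interface TenGigabitEthernet1/0/5
            #    no channel-group 5
            #    switchport mode trunk

            # Step 2, on the ESXi host: link vmnic5 to the standard vSwitch
            esxcfg-vswitch -L vmnic5 vSwitch0

            # Verify the uplink moved
            esxcfg-vswitch -l
            ```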

             

            I know it looks convoluted and unnecessarily long, but you cannot simply recover from a failed LAG or establish a LAG in VMware in one go if you don't have independent management uplinks.

            You need to:

            1. establish the LAG, and

            2. migrate the management traffic onto it.

            While doing the first step, the ESXi host gets disconnected, which triggers the distributed switch change rollback mechanism.