VMware Cloud Community
wowitsdave
Enthusiast

vCenter destroyed, trying to recreate LAG to restore connectivity and failing miserably

Hey, guys,

I'm dealing with a client right now who I hear from every year or so, so I am not in the environment all the time.

He had a bunch of drives fail in short succession which took out his vCenter. We tried to bring up the backup but it didn't work. We created a new vCenter and pulled in all the hosts.

He had a Distributed Virtual Switch with all the traffic going over it for internal and a standard switch/portgroup for WAN. Since the vCenter is gone, we now have the temporary proxy switch, and at least host management is running.

He was also using dual 10Gb fiber for everything (that's plenty), set up as a LAG (vmnic4 and vmnic5).

When I go into the new DvS under LACP, I can create a LAG, but I only want to assign one vmnic (vmnic5) to it for now, because if it doesn't work it will take all the hosts' management down (the temporary old DvS has both of these interfaces).

Once this LAG is created on my new DvS and the other interfaces are set as unused in Teaming and Failover, I cannot get VMs on that switch to connect to anything. They have a NIC, but can't ping the default gateway. I plugged a cable into one host for a management network so I could bring vCenter up, so I have a standard switch and port group up and running. I have connectivity on that standard switch and port group (no VLAN).

He says the physical network gear has not been reconfigured and the fiber cables have not moved. However, when I put vmnic5 into the LAG, it has no network connectivity. I'm afraid to move vmnic4 into the LAG for fear of some configuration on vmnic4 vanishing and the host management going away.

I've been referring to this article: LACP Support on a vSphere Distributed Switch. That's how I built my new LAG. I also ran esxcli network vswitch dvs vmware lacp config get and compared the LACP config on the new and old switches (Active/Passive, etc.); the settings now match.

Here's my LACP status on one of the hosts I've been focusing on. Notice the partner information on the new one is different/missing and the Port State is also different.

[root@BLAH:~] esxcli network vswitch dvs vmware lacp status get

DLanSwitch
   DVSwitch: DLanSwitch (the old one)
   Flags: S - Device is sending Slow LACPDUs, F - Device is sending fast LACPDUs, A - Device is in active mode, P - Device is in passive mode
   LAGID: 171963862
   Mode: Active
   Nic List:
         Local Information:
         Admin Key: 15
         Flags: SA
         Oper Key: 15
         Port Number: 2
         Port Priority: 255
         Port State: ACT,AGG,SYN,COL,DIST,
         Nic: vmnic4
         Partner Information:
         Age: 00:00:08
         Device ID: 00:05:33:68:12:23
         Flags: SA
         Oper Key: 32
         Port Number: 5
         Port Priority: 32768
         Port State: ACT,AGG,SYN,COL,DIST,
         State: Bundled

DvSwitch
   DVSwitch: DvSwitch (the new one)
   Flags: S - Device is sending Slow LACPDUs, F - Device is sending fast LACPDUs, A - Device is in active mode, P - Device is in passive mode
   LAGID: 2649772321
   Mode: Active
   Nic List:
         Local Information:
         Admin Key: 79
         Flags: SA
         Oper Key: 79
         Port Number: 1
         Port Priority: 255
         Port State: ACT,AGG,SYN,
         Nic: vmnic5
         Partner Information:
         Age: 00:00:00
         Device ID:
         Flags:
         Oper Key: 0
         Port Number: 0
         Port Priority: 0
         Port State:
         State: Independent
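The difference between the two outputs can also be checked mechanically. Below is a minimal Python sketch, not an official tool, that parses pasted output of the status command above and flags any NIC that has not bundled. The function names (parse_lacp_status, diagnose) are hypothetical, the sample is trimmed from the output above, and the parser assumes the "Key: value" layout shown here; other ESXi builds may print it differently. The reading of the fields is standard LACP: an empty partner Device ID means no LACPDUs are arriving from the physical switch on that port, and without COL/DIST the port is not carrying traffic.

```python
SAMPLE = """\
DLanSwitch
   DVSwitch: DLanSwitch
   Mode: Active
   Nic List:
         Local Information:
         Port State: ACT,AGG,SYN,COL,DIST,
         Nic: vmnic4
         Partner Information:
         Device ID: 00:05:33:68:12:23
         Port State: ACT,AGG,SYN,COL,DIST,
         State: Bundled
DvSwitch
   DVSwitch: DvSwitch
   Mode: Active
   Nic List:
         Local Information:
         Port State: ACT,AGG,SYN,
         Nic: vmnic5
         Partner Information:
         Device ID:
         Port State:
         State: Independent
"""


def parse_lacp_status(text):
    """Return one dict per NIC with its switch, local/partner fields and state."""
    entries, switch, current, section = [], None, None, None
    for raw in text.splitlines():
        key, sep, value = raw.strip().partition(":")
        key, value = key.strip(), value.strip()
        if not sep:
            continue  # bare switch-name line or blank line
        if key == "DVSwitch":
            switch, current, section = value, None, None
        elif key == "Local Information":
            current = {"switch": switch, "nic": None, "state": None,
                       "local": {}, "partner": {}}
            entries.append(current)
            section = "local"
        elif key == "Partner Information":
            section = "partner"
        elif current is None:
            continue  # switch-level fields such as Mode or LAGID
        elif key == "State":
            current["state"] = value
        elif key == "Nic":
            current["nic"] = value
        else:
            current[section][key] = value
    return entries


def diagnose(entry):
    """List the reasons a NIC does not look like a healthy LAG member."""
    problems = []
    if entry["state"] != "Bundled":
        problems.append("state is %s, not Bundled" % entry["state"])
    if not entry["partner"].get("Device ID"):
        problems.append("no partner Device ID (no LACPDUs received)")
    flags = {f for f in entry["local"].get("Port State", "").split(",") if f}
    missing = sorted({"COL", "DIST"} - flags)
    if missing:
        problems.append("local port state missing " + ",".join(missing))
    return problems


if __name__ == "__main__":
    for entry in parse_lacp_status(SAMPLE):
        issues = diagnose(entry) or ["looks healthy"]
        print(entry["switch"], entry["nic"], "->", "; ".join(issues))
```

Run against the sample, vmnic4 comes back clean and vmnic5 is flagged on all three checks, which matches what the raw output shows by eye.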

In summary:

vmnic4 is on the old DvS from the gone vCenter. Host management and all other traffic is on vmnic4 and it works. The physical vmnic is seeing networks. VMs and the vmk port have connectivity.

vmnic5 is on the newly-created DvS. The physical vmnic is not seeing networks. The test VM does not have network connectivity (can't ping gateway, can't get DHCP).

I have been fussing with this for about 10 hours. What am I missing here? What is different about vmnic4 (working in the proxy DvS) and vmnic5 (not working in my new LAG)?

Thank you!

2 Replies
wowitsdave
Enthusiast

I do realize that having everything on one DvS is not optimal (probably). I'm just trying to get this guy working again. I also want to create him an ephemeral port on the new DvS so he can recover vCenter more easily.

I didn't build this, I'm just trying to put him back together.

ZibiM
Enthusiast

Hi

LAGs are a bit unfriendly to handle in a VMware environment, especially if you cannot move the ESXi management off them onto dedicated uplinks.

1. You need to switch one of the uplinks on the physical switch side to a normal trunk config

2. You need to assign this uplink to a standard vSwitch, and you need to have the vmkernel used for management there

3. You need to create the distributed switch with your usual number of uplinks

4. You need to create the LAG in the dvSwitch (configure / edit settings / lacp)

5. You need to assign your other uplink (the one that still has the LAG config on the physical switch) to the LAG uplink port of the dvSwitch

6. You need to create all the needed port groups on the dvSwitch and change the uplink policy on them to: uplink used: lag, uplinks unused: dvuplink1, dvuplink2

7. You can now migrate the vmkernel used for the mgmt traffic from the standard vSwitch to the distributed vSwitch port group

8. You can now assign the uplink from point 1 above to the distributed switch LAG uplink port

9. You can now revert the changes on the physical switch done in point 1 above, that is, you can now configure the LAG on the physical switch side for that port

I know it looks convoluted and unnecessarily long, but you cannot simply recover from a failed LAG or establish a LAG in VMware in one go if you don't have independent mgmt uplinks.

You need to:

1. establish LAG and

2. migrate mgmt traffic into it.

While doing the 1st step you disconnect the ESXi host, which triggers the dvSwitch change rollback mechanism.
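To see why the order of the nine points matters, here is a toy Python model, offered only as a sketch. It assumes, as a deliberate simplification, that an uplink passes traffic only when the host side and the physical switch side agree (both "lag" or both "plain"), and that management is reachable when the vSwitch holding the mgmt vmkernel has at least one such uplink. The names "a", "b", "old-dvs", "vss" and "new-dvs" are hypothetical labels, and points 3, 4 and 6 are left out because they do not move an uplink or the vmkernel.

```python
from dataclasses import dataclass


@dataclass
class Uplink:
    vswitch: str      # virtual switch that currently owns the NIC
    host_side: str    # "lag" or "plain" on the ESXi side
    switch_side: str  # "lag" or "plain" on the physical switch port

    def passes_traffic(self) -> bool:
        # Simplifying assumption: both ends must be configured the same way.
        return self.host_side == self.switch_side


@dataclass
class Host:
    uplinks: dict
    mgmt_vswitch: str

    def mgmt_reachable(self) -> bool:
        return any(u.vswitch == self.mgmt_vswitch and u.passes_traffic()
                   for u in self.uplinks.values())


def fresh_host() -> Host:
    # Starting point: both uplinks in a LAG on the old switch, mgmt on it too.
    return Host(uplinks={"a": Uplink("old-dvs", "lag", "lag"),
                         "b": Uplink("old-dvs", "lag", "lag")},
                mgmt_vswitch="old-dvs")


def set_uplink(host, name, **changes):
    for attr, value in changes.items():
        setattr(host.uplinks[name], attr, value)


def move_mgmt(host, vswitch):
    host.mgmt_vswitch = vswitch


STAGED = [
    ("1: switch port of a to plain trunk",
     lambda h: set_uplink(h, "a", switch_side="plain")),
    ("2: uplink a to standard vSwitch",
     lambda h: set_uplink(h, "a", vswitch="vss", host_side="plain")),
    ("2: mgmt vmkernel to standard vSwitch", lambda h: move_mgmt(h, "vss")),
    ("5: uplink b into LAG on new dvSwitch",
     lambda h: set_uplink(h, "b", vswitch="new-dvs", host_side="lag")),
    ("7: mgmt vmkernel to new dvSwitch", lambda h: move_mgmt(h, "new-dvs")),
    ("8: uplink a into LAG on new dvSwitch",
     lambda h: set_uplink(h, "a", vswitch="new-dvs", host_side="lag")),
    ("9: switch port of a back to LAG",
     lambda h: set_uplink(h, "a", switch_side="lag")),
]

ONE_GO = [
    ("both uplinks into LAG on new dvSwitch",
     lambda h: [set_uplink(h, n, vswitch="new-dvs") for n in ("a", "b")]),
    ("mgmt vmkernel to new dvSwitch", lambda h: move_mgmt(h, "new-dvs")),
]


def run(steps):
    """Apply each step in order and record whether mgmt is still reachable."""
    host = fresh_host()
    results = []
    for label, action in steps:
        action(host)
        results.append((label, host.mgmt_reachable()))
    return results


if __name__ == "__main__":
    for name, steps in (("staged", STAGED), ("one go", ONE_GO)):
        for label, ok in run(steps):
            print(name, "|", label, "| mgmt reachable:", ok)
```

In this model the staged order never leaves management without a working uplink, while the "one go" attempt cuts it on its very first action, which is the point where the rollback would kick in.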
