Re: nsx-t temporary network drop after vmotion

Erik_Horn · ‎05-27-2022

I'm running into an issue where VMs drop off the network for a short time after a vmotion completes. I've seen up to 7 seconds during the handful of vmotions that I tested with. I understand that a brief drop is normal, and with nsx-v 1 dropped ping was typical, but 7 seconds is long enough to cause failures, which is what triggered my investigation.

I reviewed the nsx manager logs, and they seemed to process the changes in a fraction of a second, and I didn't see anything concerning, not that I'd really know what should be concerning.

nsx-t: 3.1.3.5

vsphere: 7.0.3

single cluster, 10 nodes.

Any help would be appreciated.

Thanks,

Erik

Sreec · ‎05-29-2022

Are we losing L2 or L3 connectivity?

Is the issue specific to VMs connected to NSX segments?

How about a VM that is connected to a native DVS port group?

Can you explain a bit more about upstream connectivity - Edge to Router and Host to Switches?

Cheers,
Sree | VCIX-5X| VCAP-5X| VExpert 7x|Cisco Certified Specialist
Please KUDO helpful posts and mark the thread as solved if answered

Erik_Horn · ‎05-31-2022

The issue does not affect a vm that is attached directly to the VDS. It's only an issue with nsx segments.

Edge to Router connectivity: 2 failure domains each containing 2 edge vms and 2 routers, dynamic routing with bgp

Host to Switches: 2x10Gb connections per host, lacp

L2 or L3 dropping? After further testing, I believe the answer is that both L2 and L3 are working, but DFW may be blocking some of the packets during the outage.

To test L2/L3 connectivity I had wireshark logging all of the network traffic on the machine being vmotion'd. I was pinging from another vm on the same nsx segment. The packet trace showed about 16 seconds of missing ping packets, but in the middle of the ping outage are successful arp, dns, and ldap requests. The successful arp requests are from the gateway and the vm I'm pinging from.
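For anyone repeating this test, the length of the ping outage can be worked out from the reply timestamps in the capture instead of eyeballing the trace. Below is a minimal sketch in Python, assuming the timestamps of the received echo packets have already been exported from wireshark as seconds; the function name `longest_gap` and the sample values are hypothetical and only illustrate the idea.

```python
from typing import List, Tuple


def longest_gap(reply_times: List[float]) -> Tuple[float, float, float]:
    """Return (gap_seconds, gap_start, gap_end) for the largest interval
    between two consecutive ping timestamps."""
    if len(reply_times) < 2:
        raise ValueError("need at least two timestamps")
    times = sorted(reply_times)
    best = (0.0, times[0], times[0])
    for prev, cur in zip(times, times[1:]):
        gap = cur - prev
        if gap > best[0]:
            best = (gap, prev, cur)
    return best


# Hypothetical timestamps (seconds): one ping per second, then a hole.
sample = [0.0, 1.0, 2.0, 3.0, 19.0, 20.0, 21.0]
gap, start, end = longest_gap(sample)
print(f"longest gap: {gap:.1f}s between t={start:.1f} and t={end:.1f}")
```

Comparing the start of the gap with the time vcenter reports the vmotion as complete shows whether the outage begins before or after the switchover.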

The firewall rules for the dns and ldap requests allow requests from large IP ranges to security groups and are "applied to" DFW.

The firewall rules for the pings allow requests from any to security groups, and are "applied to" security groups.

The default firewall rule is set to drop.

Here is a picture of the packet trace:

vcenter reported that the vmotion completed during the same second that packets 1753-1754 were received.

10.7.6.76 is the system capturing the traffic and being vmotion'd

10.7.6.77 is the system sending the test ping packets. It's on the same network segment as the capturing system

10.7.6.73 is the tier-1 gateway for the nsx segment

10.7.2.40 and 10.7.2.18 are servers outside of this virtual infrastructure.

Thanks,

Erik

Sreec · ‎06-03-2022

Thanks for the detailed reply. Have you tried excluding the VM from firewall protection and testing once more? If the issue occurs only when the rules are applied, it will take a deeper look to determine whether misconfigured or nested rules are causing the delay/drop. If you are sure the rules are configured correctly, I would recommend opening an SR. BTW, is the LACP configuration from the DVS uplinks to the upstream switches?

Cheers,
Sree | VCIX-5X| VCAP-5X| VExpert 7x|Cisco Certified Specialist
