andreir
Enthusiast

NSX-T 2.5.1 - no geneve tunnels

Hi all,

hitting a really puzzling issue: configured the latest NSX-T on Cisco UCS, created T0, overlay TZ, created a segment and added two VMs. VMs are able to ping the gateway on T0, can ping each other if on the same host, but cannot ping each other if on different hosts. Upon closer inspection, it appears that no tunnels are formed between the ESXi nodes.

I'm able to ping between the TEPs with a large MTU, so there are no underlay networking issues as far as I can see, yet the tunnels do not form. BFD shows the tunnels as down (please see the output at the bottom).

Not seeing any related error messages in /var/log/vmkernel or /var/log/nsx-syslog.log on the hosts.

Anything else I can check? Would be happy to provide any other output. Please help!!!

Tested VXLAN connectivity and it looks good:

[root@NSX02:~] ping ++netstack=vxlan 10.12.0.151 -s 1600 -d

PING 10.12.0.151 (10.12.0.151): 1600 data bytes

1608 bytes from 10.12.0.151: icmp_seq=0 ttl=64 time=0.271 ms
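For anyone repeating this test, here is a minimal sketch of the arithmetic behind it. It assumes a plain 20-byte IPv4 header (no options) and the 8-byte ICMP echo header; the function names are made up for illustration. With -s 1600 and -d (don't fragment), a reply proves the TEP path carries IP packets of at least 1628 bytes.

```python
IPV4_HEADER_BYTES = 20  # assumption: no IP options
ICMP_HEADER_BYTES = 8   # ICMP echo header

def ip_packet_size(icmp_payload_bytes):
    """Size of the IP packet that a ping with the given payload puts on the wire."""
    return IPV4_HEADER_BYTES + ICMP_HEADER_BYTES + icmp_payload_bytes

def fits_in_mtu(icmp_payload_bytes, mtu):
    """True if the ping passes unfragmented through a link with this MTU."""
    return ip_packet_size(icmp_payload_bytes) <= mtu

print(ip_packet_size(1600))     # 1628
print(fits_in_mtu(1600, 1600))  # False
```

So this ping rules out an MTU problem between the TEPs, but it says nothing about how Geneve-encapsulated traffic is handled.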

Checked the logical switches on the hosts and they look good (the switch I'm using is called "test"):

NSX-Manager> get logical-switch

VNI     UUID                                  Name                                              Type

71688   5cce3073-c5c9-4cf6-9cad-8db50dd06b68  OV-WEB                                            DEFAULT

71689   8208b2cd-7d0c-407e-aacf-ee9297ef5cf2  OV-DB                                             DEFAULT

71691   fedd3ec3-d3e4-4d02-ac4f-cd94bde02fdf  transit-bp-2a5f80db-676d-41f4-b305-1e8591266f94   TRANSIT

71692   c9b96c71-ebff-4572-88a9-7639d2923743  transit-bp-8871e348-42da-447f-9193-70781b09730f   TRANSIT

71690   50db354a-bf9c-483f-9637-c397e78d05b7  transit-rl-8871e348-42da-447f-9193-70781b09730f   TRANSIT

71681   97655bd6-dd20-4746-8138-656a0c06e9b0  test                                              DEFAULT

71687   6fa865f8-4bb6-439a-a428-a94e27e02090  OV-APP                                            DEFAULT

[root@NSX02:~] nsxcli -c get logical-switch 71681 vtep-table

                                   Logical Switch VTEP Table

-----------------------------------------------------------------------------------------------

                                       Host Kernel Entry

===============================================================================================

Label      VTEP IP           Segment ID     Is MTEP       VTEP MAC       BFD count

124941    10.12.0.151        10.12.0.128     False  00:50:56:67:31:cb   0

                                       LCP Remote Entry

===============================================================================================

Label      VTEP IP           Segment ID          VTEP MAC                  DEVICE NAME

124941    10.12.0.151        10.12.0.128     00:50:56:67:31:cb                 None

                                        LCP Local Entry

===============================================================================================

Label      VTEP IP           Segment ID          VTEP MAC                  DEVICE NAME

124942    10.12.0.152        10.12.0.128     00:50:56:63:b0:56                 None

[root@NSX03:~] nsxcli -c get logical-switch 71681 vtep-table

                                   Logical Switch VTEP Table

-----------------------------------------------------------------------------------------------

                                       Host Kernel Entry

===============================================================================================

Label      VTEP IP           Segment ID     Is MTEP       VTEP MAC       BFD count

124942    10.12.0.152        10.12.0.128     False  00:50:56:63:b0:56   0

                                       LCP Remote Entry

===============================================================================================

Label      VTEP IP           Segment ID          VTEP MAC                  DEVICE NAME

124942    10.12.0.152        10.12.0.128     00:50:56:63:b0:56                 None

                                        LCP Local Entry

===============================================================================================

Label      VTEP IP           Segment ID          VTEP MAC                  DEVICE NAME

124941    10.12.0.151        10.12.0.128     00:50:56:67:31:cb                 None

Checked the BFD sessions: all tunnels are down, with no diagnostic code:

[root@NSX03:/var/log]  net-vdl2 -M bfd -s nvds

BFD count:      3

===========================

Local IP: 10.12.0.151, Remote IP: 10.12.0.153, Local State: down, Remote State: down, Local Diag: No Diagnostic, Remote Diag: No Diagnostic, minRx: 1000, isDisabled: 0, l2SpanCount: 1, l3SpanCount: 1

Roundtrip Latency: NOT READY

VNI List: 71687

Routing Domain List: 8871e348-42da-447f-9193-70781b09730f

Local IP: 10.12.0.151, Remote IP: 10.12.0.200, Local State: down, Remote State: down, Local Diag: No Diagnostic, Remote Diag: No Diagnostic, minRx: 1000, isDisabled: 0, l2SpanCount: 3, l3SpanCount: 2

Roundtrip Latency: NOT READY

VNI List: 71690 71691   71692

Routing Domain List: 2a5f80db-676d-41f4-b305-1e8591266f94       8871e348-42da-447f-9193-70781b09730f

Local IP: 10.12.0.151, Remote IP: 10.12.0.152, Local State: down, Remote State: down, Local Diag: No Diagnostic, Remote Diag: No Diagnostic, minRx: 1000, isDisabled: 0, l2SpanCount: 2, l3SpanCount: 2

Roundtrip Latency: NOT READY

VNI List: 71681 71688

Routing Domain List: 2a5f80db-676d-41f4-b305-1e8591266f94       8871e348-42da-447f-9193-70781b09730f
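With more than a few hosts, this output gets hard to scan. Below is a rough sketch that pulls the peer and state fields out of text in the format shown above. The field layout is taken only from this paste, so treat the regular expression as an assumption about the format, and the function names as hypothetical.

```python
import re

# Pattern based solely on the "Local IP: ..., Remote IP: ..." lines pasted above.
SESSION_RE = re.compile(
    r"Local IP: (?P<local>[^,]+), Remote IP: (?P<remote>[^,]+), "
    r"Local State: (?P<local_state>[^,]+), Remote State: (?P<remote_state>[^,]+)"
)

def parse_bfd_sessions(text):
    """Return one dict per BFD session found in the text."""
    return [m.groupdict() for m in SESSION_RE.finditer(text)]

def down_peers(text):
    """Remote TEP IPs whose local BFD state is anything other than 'up'."""
    return [s["remote"] for s in parse_bfd_sessions(text)
            if s["local_state"].strip().lower() != "up"]

SAMPLE = (
    "Local IP: 10.12.0.151, Remote IP: 10.12.0.153, Local State: down, "
    "Remote State: down, Local Diag: No Diagnostic, Remote Diag: No Diagnostic, "
    "minRx: 1000, isDisabled: 0, l2SpanCount: 1, l3SpanCount: 1\n"
    "Roundtrip Latency: NOT READY\n"
    "VNI List: 71687\n"
    "Local IP: 10.12.0.151, Remote IP: 10.12.0.152, Local State: down, "
    "Remote State: down, Local Diag: No Diagnostic, Remote Diag: No Diagnostic, "
    "minRx: 1000, isDisabled: 0, l2SpanCount: 2, l3SpanCount: 2\n"
)

print(down_peers(SAMPLE))  # ['10.12.0.153', '10.12.0.152']
```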

Accepted Solutions
mbangouraAnect
Contributor

The problem is that an NSX-T transport node (TN) by default offloads IP checksum calculation to hardware (here the UCS VIC, i.e. the M81KR CNA firmware). Unfortunately, for some reason the CNA cannot calculate a correct outer IP checksum for Geneve-encapsulated packets. So Geneve packets sent from TN A to TN B do arrive on the uplink interface of TN B, but with a bad outer IP checksum (the inner IP checksum is OK), and are therefore discarded by the system.

One can verify this by capturing incoming packets on the TN via nsxcli: start capture interface _uplink1_ direction input file xyz.pcap. After transferring xyz.pcap from /tmp/ (via WinSCP or another utility) and opening it in Wireshark, the outer IP checksums of the Geneve packets show as incorrect (turn on the IPv4 protocol preference: Validate the IPv4 checksums...).
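If Wireshark is not at hand, the same check can be done by hand on the 20-byte outer IPv4 header. This is a minimal sketch of the standard Internet checksum (RFC 1071), not anything NSX-specific. The sample header is a generic textbook example, not a packet from this environment, and the function names are made up.

```python
import struct

def ipv4_header_checksum(header):
    """Internet checksum (RFC 1071) over a header whose checksum field is zeroed."""
    if len(header) % 2:
        header += b"\x00"
    total = sum(word for (word,) in struct.iter_unpack("!H", header))
    while total >> 16:  # fold the carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def checksum_is_valid(header):
    """Compare the stored checksum (bytes 10-11) with a freshly computed one."""
    ihl = (header[0] & 0x0F) * 4  # header length in bytes
    header = header[:ihl]
    stored = int.from_bytes(header[10:12], "big")
    zeroed = header[:10] + b"\x00\x00" + header[12:]
    return ipv4_header_checksum(zeroed) == stored

good = bytes.fromhex("45000073000040004011b861c0a80001c0a800c7")
bad = bytes.fromhex("45000073000040004011b862c0a80001c0a800c7")
print(checksum_is_valid(good), checksum_is_valid(bad))  # True False
```

Feeding it the outer header bytes of a captured Geneve packet should return False on an affected host.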

There is little to no chance that Cisco will fix this for the old M81KR CNA, so it has to be worked around on the ESXi side.

Workaround: turn off IP checksum HW offloading for all NSX-T vmnics on all TNs using Cisco VICs (in this case vmnicX and vmnicY):

esxcli network nic software set --ipv4cso=1 -n vmnicX

esxcli network nic software set --ipv4cso=1 -n vmnicY

Parameter --ipv4cso=1 means the IP checksum is calculated in SW; --ipv4cso=0 means it is offloaded to HW.

The setting persists across reboots.
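With many hosts and uplinks, it may be easier to generate the command lines than to type them. A small sketch, assuming the NSX-T uplinks are called vmnic4 and vmnic5 (placeholder names, substitute your own); it only builds the strings shown above and does not run anything on the host.

```python
def software_cso_commands(vmnics, software=True):
    """Build the esxcli command lines from the workaround, one per vmnic.

    software=True  -> --ipv4cso=1 (IP checksum calculated in SW)
    software=False -> --ipv4cso=0 (IP checksum offloaded to HW)
    """
    flag = 1 if software else 0
    return [
        f"esxcli network nic software set --ipv4cso={flag} -n {nic}"
        for nic in vmnics
    ]

# Hypothetical uplink names, for illustration only.
for command in software_cso_commands(["vmnic4", "vmnic5"]):
    print(command)
```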

To verify that IP checksum calculations are done in SW (vmkernel) run:

esxcli network nic software list

IPv4 CSO = on means the IP checksum is calculated in SW.

Once SW IP checksum calculation is active on the NSX-T vmnics, the Geneve tunnels should come UP instantly (to verify, run "nsxdp-cli bfd sessions list").

PS: It seems that if you are testing a nested ESXi deployment that uses vmxnet3 with DirectPath I/O enabled, the same workaround must be applied to the virtual vmxnet3 vmnics if they are backed by Cisco VICs (does vmxnet3 offload IP checksum calculation to the VIC?).

Regarding performance concerns with SW IP checksum calculation: VM to VM throughput is similar (VMs residing on different B200 M1 blades):

- 9.67 Gbits/sec with DSwitch vs. 9.13 Gbits/sec with NSX-T SDN.

- NSX-T DR L3 routing: 8.04 Gbits/sec.
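To put those numbers in relative terms, here is a quick back-of-the-envelope calculation. It only uses the three throughput figures above and takes the DSwitch result as the baseline, which is my own choice for the comparison.

```python
def percent_drop(baseline_gbps, measured_gbps):
    """Throughput reduction relative to the baseline, in percent."""
    return (baseline_gbps - measured_gbps) / baseline_gbps * 100

DSWITCH = 9.67      # Gbits/sec, VM to VM over DSwitch
NSXT_L2 = 9.13      # Gbits/sec, VM to VM over NSX-T
NSXT_DR_L3 = 8.04   # Gbits/sec, NSX-T DR L3 routing

print(round(percent_drop(DSWITCH, NSXT_L2), 1))     # 5.6
print(round(percent_drop(DSWITCH, NSXT_DR_L3), 1))  # 16.9
```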

With this workaround we have successfully tested both NSX-T 2.5 and 3.0 using:

- Cisco B200 M1 blades with M81KR CNA/VIC in 5108 blade chassis

- FI 6100 with UCSM 2.2(8i)

- ESXi 6.5u3

(edge nodes must be on a different cluster with newer servers, due to the AES-NI CPU requirement)

IMHO newer VIC cards like VIC1200 / VIC1300 have/had similar problems with Geneve packets, because previously we were unable to run NSX-T 2.4 on C240-M4 using VIC1300 (geneve tunnels down).

Lastly, I can confirm that NIC HW offloading of Geneve encapsulation is not a requirement for NSX-T 3.0.

9 Replies
daphnissov
Immortal

From the NSX-T Manager UI, do a Traceflow using the ICMP protocol from VM1 to VM2 when they are across hosts. What is the result? Post the screenshot.

andreir
Enthusiast

pastedImage_0.png

Zooming in:

pastedImage_1.png

pastedImage_2.png

daphnissov
Immortal

Has some event happened to these ESXi hosts after they were initially prepared with the NSX-T bits? Asked more directly, has this *ever* worked or no? Are these nested ESXi hosts? Have you tried to reboot each of them?

andreir
Enthusiast

These are physical UCS blades with a fresh install as far as I know, so nothing should have happened to the hosts. NSX-T has never worked correctly since it was set up. Will try to reboot.

andreir
Enthusiast

Reboot did not help, unfortunately.

I also did a packet capture while running the traceflow, and it looks like the ICMP packet is received by the destination host:

on source (nsx02 host):

[root@NSX02:~] nsxcli -c start capture interface vmnic5 direction output expression dstip 10.12.0.151

01:30:24.292172 00:50:56:63:b0:56 > 00:50:56:67:31:cb, ethertype IPv4 (0x0800), length 116: 10.12.0.152.54710 > 10.12.0.151.6081: Geneve, Flags [O], vni 0x0, proto TEB (0x6558): 00:50:56:63:b0:56 > 00:50:56:67:31:cb, ethertype IPv4 (0x0800), length 66: 10.12.0.152.49152 > 10.12.0.151.3784: BFDv1, Control, State Down, Flags: [Poll], length: 24

01:30:25.292190 00:50:56:63:b0:56 > 00:50:56:67:31:cb, ethertype IPv4 (0x0800), length 116: 10.12.0.152.54710 > 10.12.0.151.6081: Geneve, Flags [O], vni 0x0, proto TEB (0x6558): 00:50:56:63:b0:56 > 00:50:56:67:31:cb, ethertype IPv4 (0x0800), length 66: 10.12.0.152.49152 > 10.12.0.151.3784: BFDv1, Control, State Down, Flags: [Poll], length: 24

01:30:25.606948 00:50:56:63:b0:56 > 00:50:56:67:31:cb, ethertype IPv4 (0x0800), length 186: 10.12.0.152.49168 > 10.12.0.151.6081: Geneve, Flags [C], vni 0x11801, proto TEB (0x6558), options [8 bytes]: 00:50:56:b7:85:57 > 00:50:56:b7:c8:2e, ethertype IPv4 (0x0800), length 128: 10.12.67.172 > 10.12.67.10: ICMP echo request, id 0, seq 0, length 94

01:30:26.192260 00:50:56:63:b0:56 > 00:50:56:67:31:cb, ethertype IPv4 (0x0800), length 116: 10.12.0.152.54710 > 10.12.0.151.6081: Geneve, Flags [O], vni 0x0, proto TEB (0x6558): 00:50:56:63:b0:56 > 00:50:56:67:31:cb, ethertype IPv4 (0x0800), length 66: 10.12.0.152.49152 > 10.12.0.151.3784: BFDv1, Control, State Down, Flags: [Poll], length: 24

01:30:27.092236 00:50:56:63:b0:56 > 00:50:56:67:31:cb, ethertype IPv4 (0x0800), length 116: 10.12.0.152.54710 > 10.12.0.151.6081: Geneve, Flags [O], vni 0x0, proto TEB (0x6558): 00:50:5

on destination (nsx03 host):

[root@NSX03:~] nsxcli -c start capture interface vmnic5 direction input expression srcip 10.12.0.152

01:30:24.278450 00:50:56:63:b0:56 > 00:50:56:67:31:cb, ethertype IPv4 (0x0800), length 116: 10.12.0.152.54710 > 10.12.0.151.6081: Geneve, Flags [O], vni 0x0, proto TEB (0x6558): 00:50:56:63:b0:56 > 00:50:56:67:31:cb, ethertype IPv4 (0x0800), length 66: 10.12.0.152.49152 > 10.12.0.151.3784: BFDv1, Control, State Down, Flags: [Poll], length: 24

01:30:25.278480 00:50:56:63:b0:56 > 00:50:56:67:31:cb, ethertype IPv4 (0x0800), length 116: 10.12.0.152.54710 > 10.12.0.151.6081: Geneve, Flags [O], vni 0x0, proto TEB (0x6558): 00:50:56:63:b0:56 > 00:50:56:67:31:cb, ethertype IPv4 (0x0800), length 66: 10.12.0.152.49152 > 10.12.0.151.3784: BFDv1, Control, State Down, Flags: [Poll], length: 24

01:30:25.593242 00:50:56:63:b0:56 > 00:50:56:67:31:cb, ethertype IPv4 (0x0800), length 186: 10.12.0.152.49168 > 10.12.0.151.6081: Geneve, Flags [C], vni 0x11801, proto TEB (0x6558), options [8 bytes]: 00:50:56:b7:85:57 > 00:50:56:b7:c8:2e, ethertype IPv4 (0x0800), length 128: 10.12.67.172 > 10.12.67.10: ICMP echo request, id 0, seq 0, length 94

01:30:26.178576 00:50:56:63:b0:56 > 00:50:56:67:31:cb, ethertype IPv4 (0x0800), length 116: 10.12.0.152.54710 > 10.12.0.151.6081: Geneve, Flags [O], vni 0x0, proto TEB (0x6558): 00:50:56:63:b0:56 > 00:50:56:67:31:cb, ethertype IPv4 (0x0800), length 66: 10.12.0.152.49152 > 10.12.0.151.3784: BFDv1, Control, State Down, Flags: [Poll], length: 24

01:30:27.078539 00:50:56:63:b0:56 > 00:50:56:67:31:cb, ethertype IPv4 (0x0800), length 116: 10.12.0.152.54710 > 10.12.0.151.6081: Geneve, Flags [O], vni 0x0, proto TEB (0x6558): 00:50:56:63:b0:56 > 00:50:56:67:31:cb, ethertype IPv4 (0x0800), length 66: 10.12.0.152.49152 > 10.12.0.151.3784: BFDv1, Control, State Down, Flags: [Poll], length: 24
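A side note on reading these captures: the VNI is printed in hex, so it helps to convert it before comparing it with the logical switch table. A minimal sketch, assuming the tcpdump-style "Geneve, Flags [...], vni 0x..." text seen above. The regular expression is based only on these lines and the function name is made up.

```python
import re

# Matches the "Geneve, Flags [O], vni 0x0" fragment of the capture lines above.
VNI_RE = re.compile(r"Geneve, Flags \[\w+\], vni (0x[0-9a-fA-F]+)")

def geneve_vnis(capture_text):
    """Return the VNI of every Geneve packet in the capture, as decimal integers."""
    return [int(match, 16) for match in VNI_RE.findall(capture_text)]

SAMPLE = (
    "10.12.0.152.54710 > 10.12.0.151.6081: Geneve, Flags [O], vni 0x0, proto TEB (0x6558)\n"
    "10.12.0.152.49168 > 10.12.0.151.6081: Geneve, Flags [C], vni 0x11801, proto TEB (0x6558), options [8 bytes]\n"
)

print(geneve_vnis(SAMPLE))  # [0, 71681]
```

VNI 0x11801 is 71681, the "test" switch, so the encapsulated echo request does reach the destination uplink. The BFD control packets use VNI 0.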

But the traceflow does not show that it was delivered on the other side?

pastedImage_6.png

I'll be happy to provide full packet captures if that might help...

mbangouraAnect
Contributor

Any luck solving this? I think that we have the same problem :)

Seems like Cisco VIC has a problem with decapsulating geneve packets upon arrival...

Any hint is appreciated.

andreir
Enthusiast

Unfortunately we were not successful in resolving this issue. The original problem was observed on Cisco UCS B200 M2 blades running 2.2(8i) firmware and M81KR CNA with 2.2(3b). Since those are no longer supported on ESXi 6.7, we gave up. Best guess is that the VIC driver somehow mangles the encapsulated packet.

andreir
Enthusiast

This is great, thank you very much! I can confirm that disabling the checksum offload fixes the issue.