Hi all,
hitting a really puzzling issue: I configured the latest NSX-T on Cisco UCS, created a T0 gateway and an overlay transport zone, then created a segment and attached two VMs. The VMs can ping the gateway on the T0, and can ping each other when they are on the same host, but not when they are on different hosts. On closer inspection, it appears that no tunnels are formed between the ESXi nodes.
I can ping between the TEPs with a large MTU, so I don't see an underlay networking issue, yet the tunnels are not formed... BFD shows the tunnels as down (please see the output at the bottom).
I'm not seeing any related error messages in /var/log/vmkernel.log or /var/log/nsx-syslog.log on the hosts.
Anything else I can check? I'd be happy to provide any other output. Please help!
Tested TEP connectivity over the vxlan netstack and it looks good:
[root@NSX02:~] ping ++netstack=vxlan 10.12.0.151 -s 1600 -d
PING 10.12.0.151 (10.12.0.151): 1600 data bytes
1608 bytes from 10.12.0.151: icmp_seq=0 ttl=64 time=0.271 ms
Checked the logical switches on the hosts and they look good (the switch I'm using is called "test"):
NSX-Manager> get logical-switch
VNI UUID Name Type
71688 5cce3073-c5c9-4cf6-9cad-8db50dd06b68 OV-WEB DEFAULT
71689 8208b2cd-7d0c-407e-aacf-ee9297ef5cf2 OV-DB DEFAULT
71691 fedd3ec3-d3e4-4d02-ac4f-cd94bde02fdf transit-bp-2a5f80db-676d-41f4-b305-1e8591266f94 TRANSIT
71692 c9b96c71-ebff-4572-88a9-7639d2923743 transit-bp-8871e348-42da-447f-9193-70781b09730f TRANSIT
71690 50db354a-bf9c-483f-9637-c397e78d05b7 transit-rl-8871e348-42da-447f-9193-70781b09730f TRANSIT
71681 97655bd6-dd20-4746-8138-656a0c06e9b0 test DEFAULT
71687 6fa865f8-4bb6-439a-a428-a94e27e02090 OV-APP DEFAULT
[root@NSX02:~] nsxcli -c get logical-switch 71681 vtep-table
Logical Switch VTEP Table
-----------------------------------------------------------------------------------------------
Host Kernel Entry
===============================================================================================
Label VTEP IP Segment ID Is MTEP VTEP MAC BFD count
124941 10.12.0.151 10.12.0.128 False 00:50:56:67:31:cb 0
LCP Remote Entry
===============================================================================================
Label VTEP IP Segment ID VTEP MAC DEVICE NAME
124941 10.12.0.151 10.12.0.128 00:50:56:67:31:cb None
LCP Local Entry
===============================================================================================
Label VTEP IP Segment ID VTEP MAC DEVICE NAME
124942 10.12.0.152 10.12.0.128 00:50:56:63:b0:56 None
[root@NSX03:~] nsxcli -c get logical-switch 71681 vtep-table
Logical Switch VTEP Table
-----------------------------------------------------------------------------------------------
Host Kernel Entry
===============================================================================================
Label VTEP IP Segment ID Is MTEP VTEP MAC BFD count
124942 10.12.0.152 10.12.0.128 False 00:50:56:63:b0:56 0
LCP Remote Entry
===============================================================================================
Label VTEP IP Segment ID VTEP MAC DEVICE NAME
124942 10.12.0.152 10.12.0.128 00:50:56:63:b0:56 None
LCP Local Entry
===============================================================================================
Label VTEP IP Segment ID VTEP MAC DEVICE NAME
124941 10.12.0.151 10.12.0.128 00:50:56:67:31:cb None
Checked the BFD sessions: the tunnels are down with no diagnostic...
[root@NSX03:/var/log] net-vdl2 -M bfd -s nvds
BFD count: 3
===========================
Local IP: 10.12.0.151, Remote IP: 10.12.0.153, Local State: down, Remote State: down, Local Diag: No Diagnostic, Remote Diag: No Diagnostic, minRx: 1000, isDisabled: 0, l2SpanCount: 1, l3SpanCount: 1
Roundtrip Latency: NOT READY
VNI List: 71687
Routing Domain List: 8871e348-42da-447f-9193-70781b09730f
Local IP: 10.12.0.151, Remote IP: 10.12.0.200, Local State: down, Remote State: down, Local Diag: No Diagnostic, Remote Diag: No Diagnostic, minRx: 1000, isDisabled: 0, l2SpanCount: 3, l3SpanCount: 2
Roundtrip Latency: NOT READY
VNI List: 71690 71691 71692
Routing Domain List: 2a5f80db-676d-41f4-b305-1e8591266f94 8871e348-42da-447f-9193-70781b09730f
Local IP: 10.12.0.151, Remote IP: 10.12.0.152, Local State: down, Remote State: down, Local Diag: No Diagnostic, Remote Diag: No Diagnostic, minRx: 1000, isDisabled: 0, l2SpanCount: 2, l3SpanCount: 2
Roundtrip Latency: NOT READY
VNI List: 71681 71688
Routing Domain List: 2a5f80db-676d-41f4-b305-1e8591266f94 8871e348-42da-447f-9193-70781b09730f
The problem is that the NSX-T transport node offloads IP checksum calculation to hardware by default (the UCS VIC, here an M81KR CNA). Unfortunately, for some reason the CNA cannot compute a correct outer IP checksum for Geneve-encapsulated packets. Incoming Geneve packets from TN A are received on the uplink interface of TN B, but with a bad outer IP checksum (the inner IP checksum is fine), and are therefore discarded by the system.
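For reference, the outer IPv4 header checksum that the VIC miscomputes is just the ones'-complement sum over the 20-byte header (RFC 791). A minimal sketch in Python (the sample header bytes are the well-known RFC example, not taken from this capture):

```python
import struct

def ipv4_header_checksum(header: bytes) -> int:
    """Compute the IPv4 header checksum: ones'-complement sum of
    16-bit words, with the checksum field (bytes 10-11) zeroed first."""
    data = header[:10] + b"\x00\x00" + header[12:]
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry
    return ~total & 0xFFFF

# Sample 20-byte header with a zeroed checksum field
hdr = bytes.fromhex("450000730000400040110000c0a80001c0a800c7")
print(hex(ipv4_header_checksum(hdr)))  # 0xb861
```

A receiver that validates the outer header recomputes this value and drops the packet on mismatch, which is exactly what happens to the Geneve frames here.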
One can verify this by capturing incoming packets on the TN via nsxcli: start capture interface _uplink1_ direction input file xyz.pcap. Transfer xyz.pcap from /tmp/ (via WinSCP or a similar utility) and open it in Wireshark: the outer IP checksums of the Geneve packets will show as incorrect (enable the protocol preference "Validate the IPv4 checksum if possible").
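If Wireshark isn't handy, the same outer-checksum check can be scripted. A stdlib-only sketch (my own helper, not an NSX tool; it assumes a classic .pcap file with Ethernet link type, as produced by the nsxcli capture above):

```python
import struct

def _ipv4_checksum(header: bytes) -> int:
    # Ones'-complement sum of 16-bit words, checksum field zeroed.
    data = header[:10] + b"\x00\x00" + header[12:]
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def bad_outer_ipv4_checksums(pcap_path: str):
    """Yield (packet_no, stored, computed) for each Ethernet/IPv4 packet
    in a classic libpcap file whose outer IP header checksum is wrong."""
    with open(pcap_path, "rb") as f:
        magic = f.read(24)[:4]          # 24-byte global header
        if magic == b"\xd4\xc3\xb2\xa1":
            endian = "<"                # little-endian capture
        elif magic == b"\xa1\xb2\xc3\xd4":
            endian = ">"                # big-endian capture
        else:
            raise ValueError("not a classic .pcap file")
        num = 0
        while True:
            rec = f.read(16)            # per-packet record header
            if len(rec) < 16:
                break
            _, _, incl_len, _ = struct.unpack(endian + "IIII", rec)
            pkt = f.read(incl_len)
            num += 1
            # Only Ethernet frames carrying IPv4 (EtherType 0x0800)
            if len(pkt) < 34 or pkt[12:14] != b"\x08\x00":
                continue
            ihl = (pkt[14] & 0x0F) * 4  # outer IP header length
            if len(pkt) < 14 + ihl:
                continue
            hdr = pkt[14:14 + ihl]
            stored = struct.unpack("!H", hdr[10:12])[0]
            computed = _ipv4_checksum(hdr)
            if stored != computed:
                yield num, stored, computed
```

Running it over the capture of incoming Geneve traffic should list every frame whose outer checksum the sender's VIC got wrong.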
There is little to no chance that Cisco will fix this for the old M81KR CNA, so it must be worked around on the ESXi side...
Workaround: turn off IP checksum HW offloading for all NSX-T vmnics on all TNs that use Cisco VICs (in this case vmnicX and vmnicY):
esxcli network nic software set --ipv4cso=1 -n vmnicX
esxcli network nic software set --ipv4cso=1 -n vmnicY
The parameter --ipv4cso=1 means the IP checksum is computed in SW; --ipv4cso=0 means it is HW-offloaded.
The settings persist across reboots.
To verify that IP checksum calculation is done in SW (vmkernel), run:
esxcli network nic software list
IPv4 CSO = on means the IP checksum is done in SW.
Once IP checksum in SW is active for the NSX-T vmnics, the Geneve tunnels should come UP instantly (verify with "nsxdp-cli bfd sessions list").
PS: If you are testing a nested ESXi deployment that uses vmxnet3 with DirectPath I/O enabled, it seems the same workaround must be applied to the virtual vmxnet3 vmnics when they are backed by Cisco VICs (vmxnet3 apparently passes IP checksum calculation through to the VIC).
Regarding performance concerns with SW IP checksum calculation: VM-to-VM throughput is similar (VMs residing on different B200 M1 blades):
- 9.67 Gbits/sec with DSwitch vs. 9.13 Gbits/sec with NSX-T SDN.
- NSX-T DR L3 routing: 8.04 Gbits/sec.
With this workaround we have successfully tested both NSX-T 2.5 and 3.0 using:
- Cisco B200 M1 blades with M81KR CNA/VIC in 5108 blade chassis
- FI 6100 with UCSM 2.2(8i)
- ESXi 6.5u3
(edge nodes must be on a different cluster with newer servers due to the AES-NI CPU requirement)
IMHO newer VIC cards like the VIC 1200 / VIC 1300 have (or had) similar problems with Geneve packets: we were previously unable to run NSX-T 2.4 on a C240 M4 with a VIC 1300 (Geneve tunnels down).
Lastly, I can confirm that NIC HW offloading of Geneve encapsulation is not a requirement for NSX-T 3.0.
From the NSX-T Manager UI, run a Traceflow using the ICMP protocol from VM1 to VM2 while they are on different hosts. What is the result? Post a screenshot.
Zooming in:
Has anything happened to these ESXi hosts after they were initially prepared with the NSX-T bits? Asked more directly: has this *ever* worked? Are these nested ESXi hosts? Have you tried rebooting each of them?
These are physical UCS blades with a fresh install as far as I know, so nothing should have happened on the hosts. NSX-T never worked correctly after it was set up. Will try a reboot.
Reboot did not help, unfortunately.
I also ran a packet capture while doing the traceflow, and it looks like the ICMP packet is received by the destination host:
on source (nsx02 host):
[root@NSX02:~] nsxcli -c start capture interface vmnic5 direction output expression dstip 10.12.0.151
01:30:24.292172 00:50:56:63:b0:56 > 00:50:56:67:31:cb, ethertype IPv4 (0x0800), length 116: 10.12.0.152.54710 > 10.12.0.151.6081: Geneve, Flags [O], vni 0x0, proto TEB (0x6558): 00:50:56:63:b0:56 > 00:50:56:67:31:cb, ethertype IPv4 (0x0800), length 66: 10.12.0.152.49152 > 10.12.0.151.3784: BFDv1, Control, State Down, Flags: [Poll], length: 24
01:30:25.292190 00:50:56:63:b0:56 > 00:50:56:67:31:cb, ethertype IPv4 (0x0800), length 116: 10.12.0.152.54710 > 10.12.0.151.6081: Geneve, Flags [O], vni 0x0, proto TEB (0x6558): 00:50:56:63:b0:56 > 00:50:56:67:31:cb, ethertype IPv4 (0x0800), length 66: 10.12.0.152.49152 > 10.12.0.151.3784: BFDv1, Control, State Down, Flags: [Poll], length: 24
01:30:25.606948 00:50:56:63:b0:56 > 00:50:56:67:31:cb, ethertype IPv4 (0x0800), length 186: 10.12.0.152.49168 > 10.12.0.151.6081: Geneve, Flags [C], vni 0x11801, proto TEB (0x6558), options [8 bytes]: 00:50:56:b7:85:57 > 00:50:56:b7:c8:2e, ethertype IPv4 (0x0800), length 128: 10.12.67.172 > 10.12.67.10: ICMP echo request, id 0, seq 0, length 94
01:30:26.192260 00:50:56:63:b0:56 > 00:50:56:67:31:cb, ethertype IPv4 (0x0800), length 116: 10.12.0.152.54710 > 10.12.0.151.6081: Geneve, Flags [O], vni 0x0, proto TEB (0x6558): 00:50:56:63:b0:56 > 00:50:56:67:31:cb, ethertype IPv4 (0x0800), length 66: 10.12.0.152.49152 > 10.12.0.151.3784: BFDv1, Control, State Down, Flags: [Poll], length: 24
01:30:27.092236 00:50:56:63:b0:56 > 00:50:56:67:31:cb, ethertype IPv4 (0x0800), length 116: 10.12.0.152.54710 > 10.12.0.151.6081: Geneve, Flags [O], vni 0x0, proto TEB (0x6558): 00:50:5
on destination (nsx03 host):
[root@NSX03:~] nsxcli -c start capture interface vmnic5 direction input expression srcip 10.12.0.152
01:30:24.278450 00:50:56:63:b0:56 > 00:50:56:67:31:cb, ethertype IPv4 (0x0800), length 116: 10.12.0.152.54710 > 10.12.0.151.6081: Geneve, Flags [O], vni 0x0, proto TEB (0x6558): 00:50:56:63:b0:56 > 00:50:56:67:31:cb, ethertype IPv4 (0x0800), length 66: 10.12.0.152.49152 > 10.12.0.151.3784: BFDv1, Control, State Down, Flags: [Poll], length: 24
01:30:25.278480 00:50:56:63:b0:56 > 00:50:56:67:31:cb, ethertype IPv4 (0x0800), length 116: 10.12.0.152.54710 > 10.12.0.151.6081: Geneve, Flags [O], vni 0x0, proto TEB (0x6558): 00:50:56:63:b0:56 > 00:50:56:67:31:cb, ethertype IPv4 (0x0800), length 66: 10.12.0.152.49152 > 10.12.0.151.3784: BFDv1, Control, State Down, Flags: [Poll], length: 24
01:30:25.593242 00:50:56:63:b0:56 > 00:50:56:67:31:cb, ethertype IPv4 (0x0800), length 186: 10.12.0.152.49168 > 10.12.0.151.6081: Geneve, Flags [C], vni 0x11801, proto TEB (0x6558), options [8 bytes]: 00:50:56:b7:85:57 > 00:50:56:b7:c8:2e, ethertype IPv4 (0x0800), length 128: 10.12.67.172 > 10.12.67.10: ICMP echo request, id 0, seq 0, length 94
01:30:26.178576 00:50:56:63:b0:56 > 00:50:56:67:31:cb, ethertype IPv4 (0x0800), length 116: 10.12.0.152.54710 > 10.12.0.151.6081: Geneve, Flags [O], vni 0x0, proto TEB (0x6558): 00:50:56:63:b0:56 > 00:50:56:67:31:cb, ethertype IPv4 (0x0800), length 66: 10.12.0.152.49152 > 10.12.0.151.3784: BFDv1, Control, State Down, Flags: [Poll], length: 24
01:30:27.078539 00:50:56:63:b0:56 > 00:50:56:67:31:cb, ethertype IPv4 (0x0800), length 116: 10.12.0.152.54710 > 10.12.0.151.6081: Geneve, Flags [O], vni 0x0, proto TEB (0x6558): 00:50:56:63:b0:56 > 00:50:56:67:31:cb, ethertype IPv4 (0x0800), length 66: 10.12.0.152.49152 > 10.12.0.151.3784: BFDv1, Control, State Down, Flags: [Poll], length: 24
But the Traceflow does not show that it was delivered on the other side?
I'll be happy to provide full packet captures if that might help...
Any luck solving this? I think we have the same problem.
It seems like the Cisco VIC has a problem with decapsulating Geneve packets on arrival...
Any hint is appreciated.
Unfortunately, we were not successful in resolving this issue. The original problem was observed on Cisco UCS B200 M2 blades running 2.2(8i) firmware with an M81KR CNA on 2.2(3b). Since those are no longer supported on ESXi 6.7, we gave up. Our best guess is that the VIC driver somehow mangles the encapsulated packet.
This is great, thank you very much! I can confirm that disabling the checksum offload fixes the issue.