montybeato
Contributor

North-South TCP and UDP traffic discarded due to bad checksum when Checksum Offload is active

Hello,

Is anyone else experiencing known issue 2587257?

I think I might be suffering from this in a fresh installation... but no workaround is provided by VMware right now.

The release notes just say "in some cases"... Which cases? Where is the detailed information?

Any information would be appreciated.

Thanks,

Monty

nmichelnsbu
VMware Employee

Can you please provide more details?

Thanks a lot!

montybeato
Contributor

Hi,

TCP-based communication between a VM in NSX-T overlay and an external computer in the physical network results in the following, in both directions:

- SYN packet received.

- No more packets exchanged... the application fails to connect (tested RDP, CIFS/SMB, HTTP, HTTPS).

- Only Ping is successful.
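For anyone who wants to reproduce the symptom from either side, here is a minimal Python sketch (not from the original poster) that attempts a full TCP handshake on the ports of the protocols listed above. The function names, the default port numbers, and the timeout are assumptions for illustration; adjust them to your environment.

```python
import socket

# Default ports for the protocols mentioned above (assumed; adjust as needed).
PORTS = {"RDP": 3389, "SMB": 445, "HTTP": 80, "HTTPS": 443}


def tcp_connect_ok(host, port, timeout=3.0):
    """Return True if a full TCP three-way handshake completes."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and unreachable hosts.
        return False


def probe(host, ports=PORTS, timeout=3.0):
    """Try each named port once and report which handshakes succeeded."""
    return {name: tcp_connect_ok(host, port, timeout) for name, port in ports.items()}
```

Calling `probe("192.0.2.10")` (a placeholder address from the documentation range) returns a dictionary such as `{"RDP": False, ...}`. If ping works but every entry times out, that matches the behaviour described here. The script cannot tell a dropped packet from a closed port, so a packet capture is still needed to find the cause.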


We experimented with several connectivity options for the NSX Edge Node: single N-VDS, multiple N-VDS, same port-groups for overlay/vlan, distinct port-groups... even deleted and re-deployed Edge node.

We can successfully bring up an IPSec VPN tunnel (which uses UDP, right?) between NSX-T and a physical Fortigate FW, and even with an external virtual Sophos FW, but the traffic going through the tunnel behaves exactly the same: only ping succeeds.

We disabled DFW just in case (although it is allowing Any/Any by default) and reviewed every possible firewall function in the path... every policy is allowing all or FW disabled.

Also tried creating explicit policies allowing desired traffic... same result.

License is NSX-T Data Center Advanced, so there is no IDS/IPS.

-----------------
To rule out external FW issues, we tested bringing up an IPSec tunnel between a virtual Sophos FW appliance inside the cluster and the external Fortigate without changing configuration, same subnets, same tunnel settings, same computer.

Everything works fine there.
-----------------

 

We are opening a support case with HPE VMware team... all VMware licenses were bought through HPE.

 

nmichelnsbu
VMware Employee

Yes, please open a case, and once you have a VMware case number I'll be able to follow up on this.

Thanks!

Nicolas

 

montybeato
Contributor

The issue is related to Checksum Offload.

1. Using a Wireshark capture taken via port mirroring on the physical switches, we discovered that the TCP and UDP checksums of packets leaving NSX-T for the physical network are incorrect.

The switches deliver the frames to the router, but at the destination the packets are discarded because of the bad checksum in the Transport Layer header.

ICMP works because the Network Layer checksum (IPv4 header checksum) is calculated correctly.
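To illustrate what the receiver checks before discarding a packet, here is a minimal Python sketch of the standard Internet checksum (RFC 1071) applied to a TCP or UDP segment together with the IPv4 pseudo-header. It is not from the original poster. The function names are illustrative, it covers IPv4 only, and it assumes you have already extracted the raw segment bytes (Transport Layer header plus payload) from a capture.

```python
import socket
import struct


def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum used by IPv4, TCP, and UDP (RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"  # pad to a whole number of 16-bit words
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


def l4_checksum_valid(src_ip: str, dst_ip: str, proto: int, segment: bytes) -> bool:
    """Validate a TCP (proto 6) or UDP (proto 17) segment over IPv4.

    Note: a UDP checksum field of 0 means "not computed" and is not
    handled specially here.
    """
    pseudo = (
        socket.inet_aton(src_ip)
        + socket.inet_aton(dst_ip)
        + struct.pack("!BBH", 0, proto, len(segment))
    )
    # Summing data that already carries a correct checksum yields 0.
    return internet_checksum(pseudo + segment) == 0
```

A segment for which `l4_checksum_valid(...)` returns `False` is one the destination TCP/IP stack will drop. Keep in mind that a capture taken on the sending host itself often shows incorrect checksums when offload is enabled, because the NIC fills them in later. That is why a capture on the wire, such as the port mirror used here, is the meaningful one.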


2. To confirm the issue, we disabled TSO and CSO for the two external pNICs in one of the ESXi hosts, rebooted the host, and then in the test Virtual Machine we disabled all Offload functions for the VMXNET3 ethernet card in Windows.

After doing this, all traffic works!

This is only a workaround, but we are glad to have found the cause.

ChrisOk
Contributor

@montybeato So you had to disable both?

a) TSO and CSO for all physical NICs on the ESXi hosts

b) disable all Offload functions for the VMXNET3 ethernet card in Windows.

Would a) be enough for the workaround?

montybeato
Contributor

No, a) alone is not enough.

We also had to disable offload inside the VM for it to work.

 

dlapointe
Contributor

I just did a fresh NSX-T 3.1 install and we are having the same issue! ICMP worked fine but all TCP/UDP connections were failing.

We disabled all offloading options on the NIC inside the VM (without changing anything on the ESXi host) and everything is working now...

montybeato
Contributor

Ok!

My understanding is that disabling offload makes the VMs use more CPU, because they have to calculate the checksum for every TCP and UDP packet in software, and that is why offload is enabled by default.

We are looking at the firmware version of the NICs as well as the driver/firmware combinations. The minimum supported versions in the VMware Compatibility List can contain known issues/bugs, and more recent versions are available.

Please take a look at https://kb.vmware.com/s/article/2030818 and look for your NIC's manufacturer.

 

 

dlapointe
Contributor

I have just updated our ESXi host with the latest Mellanox firmware and it made no difference. I am not certain it is firmware related.

Did you get an update from VMware support? I also opened a support ticket.

Have you configured your NSX-T ESXi host deployment with a standard switch or Enhanced Datapath?

Berry526
Contributor

Instructions to configure checksum offload and the Load Balancer are given. In almost all cases, traffic is not forwarded with checksum offload enabled. When TCP/IP receives a packet with an invalid checksum, it discards it. allows promiscuous IP tracing and capture the packets with bad checksums.

montybeato
Contributor

@dlapointe Unfortunately we opened the ticket with the NSX-T team... and they do not acknowledge any issue with NSX-T until you prove that everything else in the hosts and ESXi is OK. So we closed that ticket.

We've upgraded the driver and firmware, but the issue persists.

Our NICs are all copper, Intel and Broadcom... it doesn't make a difference.

We use a vSphere Distributed Switch (vDS), without Enhanced Datapath.

 

 

montybeato
Contributor

@Berry526 Can you please share where these instructions are?

 

Thanks

dlapointe
Contributor

I heard the issue is resolved in NSX-T 3.1.2 (that is what I was told on my support ticket with VMware).

I have not tested it yet, but if you have not upgraded, it might be worth testing on your end as well.

montybeato
Contributor

Thanks for the info, but:

Yesterday morning we updated to the latest 3.1.2, but the issue is still there.

Do you have information on which issue was resolved in 3.1.2?

 

dlapointe
Contributor

Hey, sorry for replying late. I have just upgraded to NSX-T 3.1.2.1 and I am still having the same issue...

I will see what support has to say now.
