VMware Cloud Community
mpolok
Contributor

ESX 4.1: Disable TCP Checksum Offload

Hi All,

This is my first post on this forum. We have problems with the connection between an application server and a SQL Server. Both servers run Windows Server 2008 R2 and are hosted on the same ESX host. We are receiving the alerts below in the application log; they mean that an existing connection was forcibly closed by the remote host. The network team checked the problem, and the connection is working fine.

> STATE: 01000
> NATIVE CODE: 10054
> MESSAGE: [ImageNow][ODBC SQL Server Driver][DBNETLIB]ConnectionWrite (send()).
>
> STATE: 08S01
> NATIVE CODE: 11
> MESSAGE: [ImageNow][ODBC SQL Server Driver][DBNETLIB]General network error.
> Check your network documentation.
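Native code 10054 is the Winsock error WSAECONNRESET: the remote end answered with a TCP RST instead of closing the connection normally. A minimal Python sketch that reproduces the same condition on loopback, using only the standard library: the server side aborts the connection with `SO_LINGER` set to a zero timeout, which makes `close()` send a RST, and the client then sees `ConnectionResetError` (on Windows that exception carries `winerror` 10054). The function name is illustrative, and the `struct` packing of the linger option assumes Linux or macOS.

```python
import socket
import struct


def demonstrate_reset() -> str:
    """Open a loopback TCP connection, abort it from the server side,
    and return what the client observes on its next recv()."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS choose a free port
    server.listen(1)
    client = socket.create_connection(server.getsockname())
    conn, _ = server.accept()

    # SO_LINGER with l_onoff=1 and l_linger=0 makes close() discard any
    # queued data and send a TCP RST instead of a normal FIN.
    # "ii" matches struct linger on Linux/macOS (Windows uses two u_shorts).
    conn.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
    conn.close()

    try:
        data = client.recv(1024)
        return f"no error, received {data!r}"
    except ConnectionResetError as exc:
        # On Windows this is WSAECONNRESET, i.e. exc.winerror == 10054.
        return f"ConnectionResetError: {exc}"
    finally:
        client.close()
        server.close()


print(demonstrate_reset())
```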

The vendor suggested changing the TCP Checksum Offload setting. We changed the settings on Windows, but the problem still exists. My question is: how do I change the TCP Checksum Offload on ESX 4.1? (I can't find any VMware documentation about it, except for forum posts.) From your experience, are there any disadvantages to changing the settings (other than higher CPU utilization)? Can it impact other VMs?

I would be grateful if you could provide some more details.

Thanks

Mateusz

1 Solution

Accepted Solution: kjb007's reply further down the thread, beginning "Don't get me wrong, I have several SQL Server instances..."
23 Replies
MartinAmaro
Expert

In my experience, disabling checksum offload improves network performance by eliminating an excessive number of retries, freezes, lock-ups, and so on.

You do not disable checksum offload on the ESX host; you do it at the guest level, on both the client and the server.
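For context, the "checksum" being offloaded is the 16-bit one's-complement Internet checksum (RFC 1071) carried in IP and TCP headers; with offload enabled, the NIC rather than the guest CPU computes it for outgoing packets. A minimal Python sketch of the computation itself (the function name is illustrative):

```python
def internet_checksum(data: bytes) -> int:
    """RFC 1071 Internet checksum: the one's complement of the
    one's-complement sum of the data taken as 16-bit big-endian words."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a trailing zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)  # end-around carry
    return ~total & 0xFFFF


# Sample IPv4 header with its checksum field (bytes 10-11) zeroed.
header = bytes.fromhex("450000730000400040110000c0a80001c0a800c7")
print(hex(internet_checksum(header)))  # 0xb861
```

A receiver runs the same sum over the header with the checksum filled in; a result of 0 means the header is intact. This is also why a packet capture taken inside a guest with offload enabled can show outgoing packets with "incorrect" checksums: the field is filled in later by the adapter.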

Depending on the NIC driver in use (E1000, VMXNET2, or VMXNET3), you might or might not be able to disable this from the adapter's properties page, so you are better off doing it in the registry:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters
DWORD = DisableTaskOffload
Value = 1
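If you would rather apply the value from a file than edit it by hand, the Python sketch below prints a .reg file for it; save the output with a .reg extension and import it with `reg import <file>` or by double-clicking it. Further DWORD values, such as the EnableOffload setting mentioned next, can be added as extra tuples in `SETTINGS` (a name chosen for this sketch). Back up the registry before importing.

```python
# Registry value from this post: (registry key, DWORD value name, data).
SETTINGS = [
    (r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters",
     "DisableTaskOffload", 1),
]


def build_reg_file(settings) -> str:
    """Return the text of a .reg file that sets each listed DWORD value."""
    lines = ["Windows Registry Editor Version 5.00", ""]
    for key, name, data in settings:
        lines.append(f"[{key}]")
        lines.append(f'"{name}"=dword:{data:08x}')  # DWORDs are 8 hex digits
        lines.append("")
    return "\r\n".join(lines)  # .reg files use Windows line endings


print(build_reg_file(SETTINGS))
```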

It is also recommended to disable TCP Large Send Offload (ask your vendor):

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\BNNS\Parameters\
DWORD = EnableOffload
Value = 0

"The TCP Large Send Offload option allows the TCP layer to build a TCP message up to 64 KB long and send it in one call via IP and the Ethernet device driver. The adapter then re-segments the message into multiple TCP frames for wire transmission. The TCP packets sent on the wire are either 1500 byte frames for a Maximum Transmission Unit (MTU) of 1500 or up to 9000 byte frames for a MTU of 9000 (jumbo frames). Re-segmenting and queuing packets to send in large frames can cause latency and timeouts to the Provisioning Server and therefore this should be disabled on all Provisioning Servers and target devices."
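To put numbers on the re-segmentation described in the quote: with 20-byte IPv4 and 20-byte TCP headers and no options, each frame carries MTU minus 40 bytes of TCP payload (the MSS), so one 64 KB send becomes dozens of standard frames or a handful of jumbo frames. A rough Python sketch that treats the full 64 KB as payload (a simplification; the real per-send limit is slightly lower):

```python
import math

IPV4_HEADER = 20  # bytes, assuming no IP options
TCP_HEADER = 20   # bytes, assuming no TCP options


def frames_for_large_send(payload_bytes: int, mtu: int) -> int:
    """Number of wire frames the adapter emits when it re-segments one
    large send of `payload_bytes` of TCP payload for the given MTU."""
    mss = mtu - IPV4_HEADER - TCP_HEADER  # payload bytes per frame
    return math.ceil(payload_bytes / mss)


for mtu in (1500, 9000):
    print(mtu, frames_for_large_send(64 * 1024, mtu))
# 1500 45
# 9000 8
```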

Also, you might want to make sure that the duplex settings on the host and switch ports match.

Disabling Spanning Tree or enabling PortFast is recommended:

"With Spanning Tree Protocol (STP) or Rapid Spanning Tree Protocol, the ports are placed into a "blocked" state while the switch transmits Bridge Protocol Data Units (BPDU) and listens to ensure that the BPDUs are not in a loopback configuration. The amount of time it takes for this process to converge depends on the size of the switched network that might allow the Preboot Execution Environment (PXE) to time out causing the VM to enter a wait state or reboot until the condition is cleared and the PXE process can resume. To resolve this issue, disable STP on edge ports connected to clients or enable PortFast or Fast Link depending on the managed switch brand."

Lastly, you might want to look at the SQL Server on VMware best practices:

http://communities.vmware.com/servlet/JiveServlet/previewBody/13249-102-1-14546/SQL%20Server%20on%20...

I hope this helps

mpolok
Contributor

Martin,

So, in your opinion, changing the TCP Checksum Offload on the ESX host will not change anything from the VM perspective? I already disabled the setting on both virtual machines, and I also changed the NIC type to VMXNET3. We are still receiving the error message.

I will ask about the "TCP Large Send Offload" setting. The duplex settings on the host and switch ports match, and PortFast is enabled.

The weird thing is that there are no connection errors in the SQL log, only on the application side.

kjb007
Immortal

Since you are really looking to change this for a VM port group, you can't turn off this behavior from ESX. As of 4.1, you can't disable it for a VMkernel NIC either. You can, however, use the E1000 NIC, which will ignore TSO requests; that goes above and beyond what you are doing at the OS layer to disable the feature.

-KjB

vExpert/VCP/VCAP vmwise.com / @vmwise -KjB
mpolok
Contributor

Hi Kanuj,

We were using the E1000 adapter before. In the process of resolving our issue, we changed the adapter type to VMXNET3 :) so it looks like the changes recommended by the vendor will not help.

kjb007
Immortal

Don't get me wrong, I have several SQL Server instances, and use the vmxnet2/3 NICs almost everywhere, and don't see the connection issues.  And I don't disable the default offloading features.

I'm inclined to believe there is something else at work here causing you headaches.

-KjB

DoDo201110141
Contributor

Do you have a Broadcom 5708 or 5709?

rmathis
Contributor

I had the same issue. Not to jump off topic, but how did you set up the ODBC connection on both boxes? If the DB is x64 and your application server is 32-bit, you'll have to go into System32 for the 32-bit ODBC; if both are x64, ignore that. Also, are you using dynamic SQL ports? The Windows Firewall on 2008 can be a pain, and that's what got me: I had the same error, and it took me a while to figure that out. In my limited experience, I've never run into an issue where the ESX host's vSwitch was at fault (hence "limited"). I have Broadcom and Intel NICs in my host, split and load balanced so that one Intel and one Broadcom NIC go to one vSwitch and a single physical switch, with no issues so far.
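One quick way to test the firewall theory from the application server is to check whether the SQL Server TCP port accepts connections at all. A minimal Python sketch, assuming a default instance listening on TCP 1433; a named instance on a dynamic port listens elsewhere, which is exactly what makes firewall rules tricky. The function name is illustrative.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, timed out, or name not resolved
        return False
```

Usage, with a hypothetical host name: `tcp_port_open("sql-vm.example.local", 1433)`. A `False` result points at the firewall, the port, or name resolution rather than at the ODBC driver.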

If anything, you can always ignore this; it's just weird that I saw the same error a few days after I got the blasted thing.

mpolok
Contributor

Hi,

No, we are using an HP NC373i Multifunction Gigabit Server Adapter.

Both VMs are on the same ESX host, the firewall is disabled on both of them, and both run the same OS: Windows Server 2008 R2 64-bit.

We are planning to move the instances from the SQL VM to a central cluster; I hope it will resolve the issue.

Thanks!

RumataRus
Commander

Michael Lynch wrote:

Do you have a Broadcom 5708 or 5709?

Hi!

We have the same problem, and we use the Broadcom 5709.

What can you recommend?

P.S.: We also tried changing the TCP offload settings, but it did not help.

RumataRus
Commander

mpolok wrote:

We are planning to move the instances from the SQL VM to a central cluster; I hope it will resolve the issue.

Hi,

Have you resolved the problem by moving the instances from the SQL VM to the central cluster?

mpolok
Contributor

Hi,

We are still planning this change; it should be done in a few weeks. For now we do not have any solution for the problem. The vendor suggests making some changes on the vSwitch... if you have some test infrastructure, you can try it:

-CsumOffload Off -TcpSegmentation Off -zeroCopyXmit Off

RumataRus
Commander

Hi,

Do you use Veeam Backup & Replication on the SQL VM?

mpolok
Contributor

No, we are not using Veeam Backup & Replication. Some other information:

1. There are no dropped connections in the logs on the SQL Server.

2. We also noticed that when the problem appears, there is an entry in the application log: "12:35:20.816620(d38)       Failed to Disconnect from the ODBC connection". It looks like the ODBC driver fails to disconnect and then reconnect. In the ODBC settings we can see that "Connection Pooling Timeout" is set to 60 seconds, so the driver waits 60 seconds before removing an idle connection from the pool of open connections. We asked the vendor whether we can turn it off: it may degrade performance, but no open connections will be left in the pool, because each one will be closed immediately after its transaction.
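The pooling behaviour described in point 2 can be illustrated with a toy pool. This is a hypothetical Python sketch, not how the ODBC Driver Manager is actually implemented: a released connection stays open for reuse until it has been idle for `timeout` seconds (60 in this configuration), and a timeout of 0 closes every connection as soon as it is released, which is the trade-off put to the vendor. The clock is injectable so the behaviour can be checked without waiting.

```python
import time
from collections import deque


class IdleTimeoutPool:
    """Toy connection pool (hypothetical, for illustration only).

    Released connections are kept for reuse; any that have been idle for
    `timeout` seconds or more are discarded the next time the pool is used.
    A timeout of 0 disables pooling: connections are closed on release.
    """

    def __init__(self, connect, timeout=60.0, clock=time.monotonic):
        self._connect = connect  # factory that opens a new connection
        self._timeout = timeout
        self._clock = clock
        self._idle = deque()     # (connection, release time), oldest first
        self.closed = []         # connections the pool has discarded

    def _evict_expired(self):
        now = self._clock()
        while self._idle and now - self._idle[0][1] >= self._timeout:
            conn, _ = self._idle.popleft()
            self.closed.append(conn)  # a real pool would disconnect here

    def acquire(self):
        self._evict_expired()
        if self._idle:
            conn, _ = self._idle.pop()  # reuse the most recently released
            return conn
        return self._connect()

    def release(self, conn):
        if self._timeout <= 0:
            self.closed.append(conn)  # pooling disabled: close immediately
        else:
            self._evict_expired()
            self._idle.append((conn, self._clock()))
```

With `timeout=60`, a connection released at t=50 is handed out again if requested before t=110; requested at t=120, it is discarded and a fresh connection is opened instead.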

I will let you know when we receive a response.

Regards

Mateusz

RumataRus
Commander

mpolok wrote:

I will let you know when we receive a response.

Thank you!

That is interesting. I will wait.

RumataRus
Commander

Hi, Mateusz!

Do you have any news on this topic?

mpolok
Contributor

Hi,

Unfortunately, we have had a 'change freeze' for the last few weeks. Now we are waiting for approval and a proposed date. It should be done in the next 2-3 weeks... I will let you know.

RumataRus
Commander

Thanks!

mpolok
Contributor

Hi,

Unfortunately, we are still getting those alerts after migrating to the central cluster. We suspect this is an issue with ODBC or the drivers on the app server, and we are still in contact with the vendor. If I find a solution, I will let you know.

Regards

RumataRus
Commander

Hi!

Thank you for the information.

Meanwhile, we suspect that the problem is in our root switch.

We are waiting for the new switch to test our suspicion.

I will also let you know.
