Solved: Re: ESX 4.1. Disable TCP Checksum offload

mpolok · ‎09-20-2011

Hi All,

This is my first post on this forum. We have problems with the connection between an application server and a SQL server. Both servers run Windows Server 2008 R2 and are hosted on the same ESX host. We are receiving the alerts below in the application log; they mean that an existing connection was forcibly closed by the remote host. The network team checked the problem and says the connection is working fine.

> STATE: 01000
> NATIVE CODE: 10054
> MESSAGE: [ImageNow][ODBC SQL Server Driver][DBNETLIB]ConnectionWrite (send()).
>
> STATE: 08S01
> NATIVE CODE: 11
> MESSAGE: [ImageNow][ODBC SQL Server Driver][DBNETLIB]General network error.
> Check your network documentation.
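For anyone collecting these alerts from a larger application log, here is a minimal Python sketch that pulls out the STATE / NATIVE CODE / MESSAGE triples and counts them per native code. The function name `parse_entries` and the lookup table are made up for illustration, and the sample text is just the excerpt above. The only code given a meaning here is 10054, the Winsock WSAECONNRESET value.

```python
import re
from collections import Counter

# Only 10054 is labelled here: it is the Winsock WSAECONNRESET code.
# Other codes are reported without an interpretation.
KNOWN_CODES = {10054: "WSAECONNRESET: connection reset by peer"}

# One log entry: three consecutive fields, each on its own line.
ENTRY = re.compile(
    r"STATE:\s*(?P<state>\w+)\s+"
    r"NATIVE CODE:\s*(?P<code>\d+)\s+"
    r"MESSAGE:\s*(?P<msg>.+)"
)

def parse_entries(text):
    """Return a list of (state, native_code, first_message_line) tuples."""
    return [
        (m["state"], int(m["code"]), m["msg"].strip())
        for m in ENTRY.finditer(text)
    ]

SAMPLE = """\
STATE: 01000
NATIVE CODE: 10054
MESSAGE: [ImageNow][ODBC SQL Server Driver][DBNETLIB]ConnectionWrite (send()).

STATE: 08S01
NATIVE CODE: 11
MESSAGE: [ImageNow][ODBC SQL Server Driver][DBNETLIB]General network error.
Check your network documentation.
"""

entries = parse_entries(SAMPLE)
counts = Counter(code for _, code, _ in entries)
for code, n in counts.items():
    print(code, n, KNOWN_CODES.get(code, "not in the lookup table"))
```

Counting per code over a full day of logs can show whether the resets cluster at particular times, which is useful to compare against backup windows or idle periods.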

The vendor suggests changing the TCP checksum offload setting. We changed the settings in Windows, but the problem still exists. My question is: how do I change TCP checksum offload on ESX 4.1? (I can't find any VMware documentation on this, apart from forum posts.) From your experience, are there any disadvantages to changing the settings, other than higher CPU utilization? Can it impact other VMs?

I would be grateful if you could provide some more details.

Thanks

Mateusz

MartinAmaro · ‎09-20-2011

In my experience, disabling checksum offload improves network performance by eliminating excessive retries, freezing, locking and so on.

You do not disable checksum offload at the ESX host; you do that in the guest OS, on both the client and the server.

Depending on the NIC driver in use (E1000, VMXNET2 or VMXNET3), you might or might not be able to disable this from the properties page, so you are better off doing it from the registry.

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters
DWORD = DisableTaskOffload
Value = 1
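If the same value has to be set on several guests, the registry change can be scripted rather than clicked through. The sketch below only builds the `reg add` and `reg query` command lines as strings; it does not touch the registry. The helper names are hypothetical. The commands would need to be run from an elevated prompt inside the guest, and changes under Tcpip\Parameters are usually picked up only after a restart.

```python
TCPIP_PARAMS = r"HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"

def reg_add_dword(key, name, value):
    """Build a `reg add` command line that sets a REG_DWORD value."""
    return f'reg add "{key}" /v {name} /t REG_DWORD /d {value} /f'

def reg_query(key, name):
    """Build a `reg query` command line to read the value back."""
    return f'reg query "{key}" /v {name}'

# Print the commands so they can be reviewed before running them in the guest.
print(reg_add_dword(TCPIP_PARAMS, "DisableTaskOffload", 1))
print(reg_query(TCPIP_PARAMS, "DisableTaskOffload"))
```

Setting the value back to 0 (or deleting it) re-enables task offload, so the change is easy to revert if it makes no difference.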

It is also recommended to disable TCP Large Send Offload (ask your vendor).

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\BNNS\Parameters\
DWORD = EnableOffload
Value = 0

"The TCP Large Send Offload option allows the TCP layer to build a TCP message up
to 64 KB long and send it in one call via IP and the Ethernet device driver. The adapter then re-segments the message
into multiple TCP frames for wire transmission. The TCP packets sent on the wire are either 1500 byte frames for a
Maximum Transmission Unit (MTU) of 1500 or up to 9000 byte frames for a MTU of 9000 (jumbo frames). Re-segmenting
and queuing packets to send in large frames can cause latency and timeouts to the Provisioning Server and therefore this
should be disabled on all Provisioning Servers and target devices."
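To put numbers on the quoted description, here is a small Python calculation of how many TCP segments the adapter has to cut from one 64 KB send. It assumes plain IPv4 and TCP headers with no options (20 bytes each), so the payload per frame is MTU minus 40; the function name is made up for the example.

```python
import math

IP_TCP_HEADERS = 40  # 20-byte IPv4 header + 20-byte TCP header, no options

def segments_needed(payload_bytes, mtu):
    """Number of TCP segments needed to carry payload_bytes at a given MTU."""
    mss = mtu - IP_TCP_HEADERS  # maximum segment size per frame
    return math.ceil(payload_bytes / mss)

for mtu in (1500, 9000):
    print(mtu, segments_needed(64 * 1024, mtu))
# 1500 45
# 9000 8
```

With offload disabled, the guest's TCP stack does this segmentation itself, which is where the extra CPU cost comes from.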

Also, you might want to make sure that the duplex settings on the host and switch ports match.

Disabling Spanning Tree or enabling PortFast is recommended.

"With Spanning Tree Protocol (STP) or Rapid Spanning Tree Protocol, the
ports are placed into a "blocked" state while the switch transmits Bridged Protocol Data Units (BPDU) and listens to
ensure that the BPDUs are not in a loopback configuration. The amount of time it takes for this process to converge
depends on the size of the switched network that might allow the Preboot Execution Environment (PXE) to time out
causing the VM to enter a wait state or reboot until the condition is cleared and the PXE process can resume. To resolve
this issue, disable STP on edge ports connected to clients or enable PortFast or Fast Link depending on the managed
switch brand. Refer to the table below:"

Finally, you might want to look at the SQL Server on VMware best practices.

http://communities.vmware.com/servlet/JiveServlet/previewBody/13249-102-1-14546/SQL%20Server%20on%20...

I hope this helps

mpolok · ‎09-21-2011

Martin,

So in your opinion, changing TCP checksum offload on the ESX host will not change anything from the VM's perspective? I have already disabled the setting on both virtual machines, and I also changed the NIC type to vmxnet3. We are still receiving the error message.

I will ask about the "TCP Large Send Offload" setting. Duplex settings on the host and switch ports match, and PortFast is enabled.

The weird thing is that there are no connection errors in the SQL log, only on the application side.

kjb007 · ‎09-21-2011

Since you are really looking for a vm portgroup, you can't really turn off this behavior from ESX. As of 4.1, you can't disable this for a vmkernel NIC either. You can however, use the e1000 NIC, which will ignore TSO requests, which will be above and beyond what you are doing at the OS layer to disable the feature.

-KjB

vExpert/VCP/VCAP vmwise.com / @vmwise -KjB

mpolok · ‎09-21-2011

Hi Kanuj,

We were using the e1000 NIC adapter before. In the process of resolving our issue we changed the adapter type to vmxnet3, so it looks like the changes recommended by the vendor will not help.

kjb007 · ‎09-21-2011

Don't get me wrong, I have several SQL Server instances, and use the vmxnet2/3 NICs almost everywhere, and don't see the connection issues. And I don't disable the default offloading features.

I'm inclined to believe there is something else at work here causing you headaches.

-KjB

DoDo201110141 · ‎10-24-2011

Do you have a Broadcom, 5708 or 5709 ?

rmathis · ‎10-24-2011

I had the same issue. Not to jump off topic, but how did you set up the ODBC connection on both boxes? If the DB is x64 and your application server is 32-bit, you'll have to use the 32-bit ODBC administrator (odbcad32.exe under SysWOW64 on 64-bit Windows). If both are x64, ignore that. Also, are you using dynamic SQL ports? The Windows Firewall on 2k8 can be a pain with those, and that's what got me. I had the same error and it took me a while to figure that out. In my limited experience I've never run into an issue where the ESX vSwitch was at fault. I have Broadcom and Intel NICs in my host, split and load balanced (one Intel and one Broadcom per vSwitch, going to a single physical switch), with no issues so far.

Feel free to ignore this; it's just weird to see the same error a few days after I got the blasted thing myself.
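A quick way to rule the firewall and port question in or out is to test the TCP port directly from the application server, outside of ODBC. This is a minimal sketch using Python's standard `socket` module; the function name is made up, the host name in the comment is a placeholder, and 1433 is only the default port for a default SQL Server instance. A named instance on dynamic ports will listen somewhere else.

```python
import socket

def port_is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False

# Example call (placeholder host name, default-instance port):
#   port_is_reachable("sqlserver.example.local", 1433)
```

A successful connect only shows that the port is open at that moment. It says nothing about connections that are reset later, which is the symptom in this thread.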

mpolok · ‎10-26-2011

Hi,

No, we are using the HP NC373i Multifunction Gigabit Server Adapter.

Both VMs are on the same ESX host, the firewall is disabled on both of them, and both run the same OS, Windows Server 2008 R2 64-bit.

We are planning to move the instances from the SQL VM to the central cluster; I hope it will resolve the issue.

Thanks !

RumataRus · ‎12-02-2011

Michael Lynch wrote:
Do you have a Broadcom, 5708 or 5709 ?

Hi!

We have the same problem, and we use Broadcom 5709.

What can you recommend?

P.S.: We also tried to change the TCP offload settings but it did not help.

RumataRus · ‎12-05-2011

mpolok wrote:
We are planning to move the instances from the SQL VM to the central cluster; I hope it will resolve the issue.

Hi,

Did moving the instances from the SQL VM to the central cluster resolve the problem?

mpolok · ‎12-05-2011

Hi,

We are still planning this change; it should be done in a few weeks. For now we do not have any solution for the problem. The vendor suggests making some changes on the vSwitch. If you have some test infrastructure you can try it:

-CsumOffload Off -TcpSegmentation Off -zeroCopyXmit Off

RumataRus · ‎12-06-2011

Hi,

do you use Veeam Backup&Replication on the SQL VM?

mpolok · ‎12-07-2011

No, we are not using Veeam Backup & Replication. Some other information:

1. There are no dropped connections in the logs on SQL Server.

2. We also noticed that when the problem appears there is an entry in the application logs: "12:35:20.816620(d38) Failed to Disconnect from the ODBC connection". It looks like the ODBC driver fails to disconnect and then reconnect. In the ODBC settings we can see that "Connection Pooling Timeout" is set to 60 seconds, so it waits 60 seconds before removing the connection from the pool of open connections. We asked the vendor if we can turn it off. It may result in degraded performance, but there will be no open connections left in the pool; all will be closed immediately after the transaction.
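To make the pooling behaviour described above concrete, here is a toy Python model of a connection pool with an idle timeout. It is not how the ODBC Driver Manager is implemented; the class `IdlePool` and its parameters are invented for illustration. It only shows the idea: a released connection stays open for up to the timeout and may be handed out again, while a timeout of 0 closes every connection as soon as it is released.

```python
import time

class IdlePool:
    """Toy connection pool that drops connections idle longer than `timeout`."""

    def __init__(self, connect, timeout=60.0, clock=time.monotonic):
        self._connect = connect   # factory that opens a new connection
        self._timeout = timeout   # seconds a released connection may sit idle
        self._clock = clock       # injectable clock, handy for testing
        self._idle = []           # list of (connection, time_released)

    def acquire(self):
        now = self._clock()
        while self._idle:
            conn, released = self._idle.pop()
            if now - released < self._timeout:
                return conn       # still within the timeout: reuse it
            conn.close()          # idle too long: discard it
        return self._connect()    # nothing reusable: open a new connection

    def release(self, conn):
        if self._timeout <= 0:
            conn.close()          # pooling effectively off: close right away
        else:
            self._idle.append((conn, self._clock()))
```

In this model a connection that was reset by the other side while sitting idle would still be handed out by `acquire`, and the failure would only show up on the next write. That matches the application-side-only errors reported here, but it is a guess, not a diagnosis.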

I will let you know when we receive a response.

Regards

Mateusz

RumataRus · ‎12-07-2011

mpolok wrote:
I will let you know when we receive a response.

Thank you!

It is interesting. I will wait.

RumataRus · ‎01-10-2012

Hi, Mateusz!

Do you have any news on this topic?

mpolok · ‎01-11-2012

Hi,

Unfortunately we had a 'change freeze' for the last few weeks. Now we are waiting for approval and a proposed date. It should be done in the next 2-3 weeks. I will let you know.

RumataRus · ‎01-11-2012

Thanks!

mpolok · ‎01-24-2012

Hi,

Unfortunately we are still getting those alerts after migrating to the central cluster. We suspect that this is some issue with ODBC or the drivers on the app server; we are still in contact with the vendor. If I find a solution I will let you know.

Regards

RumataRus · ‎01-24-2012

Hi!

Thank you for the information.

Meanwhile we suspect that the problem is in our root switch.

We are waiting for the new switch to test our suspicion.

I will also let you know.