VMware Cloud Community
RKAYSWISS
Contributor

Network Issue

Hello fellow VMers

For weeks now an issue has been bothering me, and not even the hosting company responsible for the hardware underneath the vSphere system has been able to help so far.

The issue

Between two directly connected ESXi hosts, TCP connections (especially those targeting MS-SQL) very often fail with so-called semaphore timeouts. Sometimes even simple file transfers fail.

The setup

Both hosts are Supermicro SuperServers, each with dual 10-core Broadwell Xeons and 256 GB RAM. The hosts are directly connected over NICs used strictly for this interconnect (these NICs have no uplink or any other connection, just a 1 Gbit/s link from A to B). Both of these are Intel I210 interfaces. There is no vMotion or Fault Tolerance traffic going on that could interfere with the regular TCP/IP traffic. Nothing fancy routing-wise (just simple subnet routing).
Both hosts are running ESXi 6.7.

Checked / Adjusted

I can exclude an issue with the NIC(s): when I use the same NICs for a VPN connection from the datacenter to my office, the issue does not occur (or at least not noticeably). Load and/or traffic also has no influence; even with just the SQL DB on A and a single client accessing it on B, the problem occurs. I tried out all the suggestions I found on the MS-SQL side (software/OS), and I'm pretty sure by now that this part of the story is not the root cause.
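To take MS-SQL completely out of the picture, I figure a raw TCP echo test between a VM on host A and one on host B would show whether the link itself stalls. A minimal sketch of what I have in mind (IP, port and thresholds are placeholders, not my real addresses):

#!/usr/bin/env python3
"""Minimal TCP echo probe to exercise the direct link without MS-SQL involved.

Run it in "server" mode in a VM on host A and in "client" mode in a VM on
host B. The default IP/port are placeholders - use the VM addresses on the
direct-connect subnet.
"""
import argparse
import socket
import time

PAYLOAD = b"x" * 8192          # 8 KiB per round trip
SLOW_THRESHOLD = 2.0           # seconds; report anything slower than this


def run_server(bind_ip, port):
    with socket.create_server((bind_ip, port)) as srv:
        print("echo server listening on %s:%d" % (bind_ip, port))
        conn, addr = srv.accept()
        with conn:
            print("client connected from", addr)
            while True:
                data = conn.recv(65536)
                if not data:
                    break
                conn.sendall(data)       # echo everything straight back


def run_client(server_ip, port, rounds):
    with socket.create_connection((server_ip, port), timeout=30) as sock:
        for i in range(rounds):
            start = time.monotonic()
            try:
                sock.sendall(PAYLOAD)
                received = 0
                while received < len(PAYLOAD):
                    chunk = sock.recv(65536)
                    if not chunk:
                        raise ConnectionError("server closed the connection")
                    received += len(chunk)
            except socket.timeout:
                print("round %d: no echo within 30 s - connection stalled" % i)
                return
            rtt = time.monotonic() - start
            if rtt > SLOW_THRESHOLD:
                print("round %d: SLOW - %.2f s for an 8 KiB round trip" % (i, rtt))
            time.sleep(0.05)
        print("completed %d rounds without a stall" % rounds)


if __name__ == "__main__":
    p = argparse.ArgumentParser()
    p.add_argument("mode", choices=["server", "client"])
    p.add_argument("--ip", default="192.168.10.1", help="echo server IP (placeholder)")
    p.add_argument("--port", type=int, default=5001)
    p.add_argument("--rounds", type=int, default=10000)
    args = p.parse_args()
    if args.mode == "server":
        run_server("0.0.0.0", args.port)
    else:
        run_client(args.ip, args.port, args.rounds)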

I checked thousands of lines of enhanced verbose mode log files from ESXi - nothing really jumped out at me, but I have to say that in terms of network tuning & monitoring, ESXi doesn't offer very much out of the box - I guess that's available as some sort of "add-on" or "upgrade".
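In case it helps anyone sift the same logs, something like this little filter over a copy of vmkernel.log should surface the NIC/driver lines (the keyword list is only my guess at what matters for the Intel I210, not an official one):

#!/usr/bin/env python3
"""Rough filter for NIC/driver-related lines in a copy of vmkernel.log.

The keyword list below is just a guess at what matters for the Intel I210
(igbn/e1000 drivers), not an official list.
"""
import re
import sys

KEYWORDS = re.compile(
    r"vmnic|igbn|e1000|link (up|down)|rx error|tx error|reset",
    re.IGNORECASE,
)


def main(path):
    with open(path, errors="replace") as log:
        for line in log:
            if KEYWORDS.search(line):
                print(line.rstrip())


if __name__ == "__main__":
    main(sys.argv[1] if len(sys.argv) > 1 else "vmkernel.log")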

I have to say: I'm neither very experienced in networking nor in virtualization concepts. I'm a developer; what I know (or better, what I think I know) I learned because of the requirements of the project I'm working on. I managed to create stable site-to-site VPN connections between my office and the datacenter that handle the same traffic absolutely flawlessly, but the direct cable connection between two hosts, which barely requires any clicks & inputs, is winning the fight against me - this is really driving me nuts ^^

Maybe one of you experienced guys can point me in the right direction.

4 Replies
ChrisFD2
VMware Employee

Directly connected, as in not using a physical switch? Can you share the vSwitch and port group configuration please? Also, are you sure that the underlying storage can keep up with the application? It might be prudent to look at the underlying datastore performance.
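If you have pyVmomi handy, a quick script along these lines should dump the vSwitch, port group and VMkernel settings from each host so we're comparing exactly the same data - the host name and credentials below are placeholders, of course:

#!/usr/bin/env python3
"""Dump vSwitch / port group / VMkernel NIC settings from an ESXi host.

A minimal pyVmomi sketch only - the host name and credentials are
placeholders, and certificate checking is disabled for brevity.
"""
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

HOST = "esxi-a.example.local"   # placeholder - run once per host
USER = "root"
PASSWORD = "********"

ctx = ssl._create_unverified_context()   # skip cert validation (lab use only)
si = SmartConnect(host=HOST, user=USER, pwd=PASSWORD, sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.HostSystem], True)
    for host in view.view:
        net = host.config.network
        print("==", host.name, "==")
        for vsw in net.vswitch:
            print("vSwitch %s: mtu=%d, uplink keys=%s"
                  % (vsw.name, vsw.mtu, list(vsw.pnic)))
        for pg in net.portgroup:
            print("  portgroup %s (vlan %d) on %s"
                  % (pg.spec.name, pg.spec.vlanId, pg.spec.vswitchName))
        for vnic in net.vnic:
            print("  %s -> %s, ip=%s, mtu=%s"
                  % (vnic.device, vnic.portgroup,
                     vnic.spec.ip.ipAddress, vnic.spec.mtu))
finally:
    Disconnect(si)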

Also, it might be worth changing the crossover cable between the hosts. I know NIC ports will do Auto MDI-X, however I've always been a stickler for using the correct cable type.

Regards,
Chris
VCIX-DCV 2024 | VCIX-NV 2024 | vExpert 6x | CCNA R&S
hussainbte
Expert

We can start by verifying that the network adapter you are using is supported with 6.7. Also, please share whatever vmkernel log entries you have seen regarding the NIC or NIC driver.

If you found my answers useful, please consider marking them as Correct OR Helpful. Regards, Hussain https://virtualcubes.wordpress.com/
RKAYSWISS
Contributor

Hey guys

First of all: thank you very much for taking the time to try to help me, I really appreciate it.

I think the datastore could be at least partially the root of evil. The host running the database is on pure SSD storage, with most of the SQL tables being in-memory. I didn't really take the datastore into account, as I read (or better: I think I remember having read) that on ESXi all the network traffic buffering is done purely in RAM - and my office host (addressing the same SQL DB via the OpenVPN site-to-site link), which actually does have datastore issues, is not showing these timeouts. But thinking about it more, it makes sense: since OpenVPN itself adds extra latency and acts as a limiting factor - on top of a limited WAN connection across country borders - the datastore on the requesting side cannot make as huge a difference as it does in the datacenter. I'll check that further, but massive thanks already! I think I could make absolutely sure it's the datastore if I artificially throttle things down to the same levels as with OpenVPN - if the issue is gone, it's clear. If it really is the datastore, I would be so happy, because it's an easy fix: just get the hosting company to put an SSD RAID 10 into the host and move the DB over - that's it :) Simple, quick, cheap (compared to what an expert checking my VDC would cost).
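For the throttling test, a tiny latency-injecting TCP proxy in front of the SQL port might be enough as a rough stand-in for the OpenVPN path - something like this sketch (addresses, port and delay are placeholders, and it only adds latency, it doesn't cap bandwidth):

#!/usr/bin/env python3
"""Tiny TCP proxy that injects artificial latency, as a rough stand-in for
the OpenVPN/WAN path on the direct link.

Sketch only: point the SQL client at LISTEN_ADDR instead of the real server
and the proxy forwards to TARGET with EXTRA_DELAY added before each chunk.
"""
import socket
import threading
import time

LISTEN_ADDR = ("0.0.0.0", 14330)            # SQL client connects here
TARGET = ("192.168.10.10", 1433)            # real SQL server VM (placeholder)
EXTRA_DELAY = 0.03                          # 30 ms each way, tune as needed


def pipe(src, dst):
    """Copy bytes from src to dst, sleeping before each forward."""
    try:
        while True:
            data = src.recv(65536)
            if not data:
                break
            time.sleep(EXTRA_DELAY)
            dst.sendall(data)
    except OSError:
        pass
    finally:
        try:
            dst.shutdown(socket.SHUT_WR)
        except OSError:
            pass


def main():
    with socket.create_server(LISTEN_ADDR) as srv:
        print("proxy on %s:%d -> %s:%d" % (LISTEN_ADDR + TARGET))
        while True:
            client, addr = srv.accept()
            upstream = socket.create_connection(TARGET)
            print("connection from", addr)
            threading.Thread(target=pipe, args=(client, upstream), daemon=True).start()
            threading.Thread(target=pipe, args=(upstream, client), daemon=True).start()


if __name__ == "__main__":
    main()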

RKAYSWISS
Contributor

Hi Chris
Seems like my 'explanation' wasn't as clear as it seemed to me.

Directly connected: I mean there is no switch/hub/etc. between the hosts - it's just a network cable between the NICs of the hosts.

The vSwitches:

Host A

vSwitch - Standard

Nic1 as 'Uplink'

Portgroup 1  -  VM (just the SQL server VM)

Portgroup 2 - VMkernel (just the one VMkernel NIC)

VMKernel Port
IPv4 20.1.1.1


Stack: Standard TCP/IP

DNS: local DNS servers (should not have an effect, as I'm addressing the problematic service explicitly via IP)

Host B

vSwitch - Standard

Nic1 as 'Uplink'

Portgroup 1  -  VM (just a bunch of 'client' VMs)

Portgroup 2 - VMkernel (just the one VMkernel NIC)

VMKernel Port
IPv4 20.1.1.2

Settings (exactly the same across all layers/levels and both hosts)

Promiscuous Mode, Forged Transmits, MAC Changes - all disabled
MTU: 1500
No failover (as there's only one physical NIC).

No traffic shaping

Network Failover Detection - Signal Only

Did some general testing now, enabled vMotion and so on... moved VMs back and forth and had no issues - so I'm thinking more and more that my issues are not directly network related, but related to the datastore. I've activated Storage I/O Control now and set policies for all VMs - let's see if it changes something :)
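To double-check the datastore theory from inside the SQL VM, I guess a quick fsync latency test like this would already tell a lot (nothing scientific, and the file path is a placeholder):

#!/usr/bin/env python3
"""Quick-and-dirty write/fsync latency check to run inside the SQL VM.

Not a substitute for esxtop or a real storage benchmark - it just writes
small blocks with fsync and prints the latency spread, which is usually
enough to see whether the datastore stalls even under light load.
"""
import os
import statistics
import time

TEST_FILE = r"C:\temp\latency_test.bin"   # placeholder path on the data drive
BLOCK = os.urandom(64 * 1024)             # 64 KiB per write
ROUNDS = 500


def main():
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | getattr(os, "O_BINARY", 0)
    latencies = []
    fd = os.open(TEST_FILE, flags)
    try:
        for _ in range(ROUNDS):
            start = time.monotonic()
            os.write(fd, BLOCK)
            os.fsync(fd)                  # force the write through to the datastore
            latencies.append((time.monotonic() - start) * 1000)
    finally:
        os.close(fd)
        os.remove(TEST_FILE)
    latencies.sort()
    print("min    %8.2f ms" % latencies[0])
    print("median %8.2f ms" % statistics.median(latencies))
    print("p99    %8.2f ms" % latencies[int(len(latencies) * 0.99)])
    print("max    %8.2f ms" % latencies[-1])


if __name__ == "__main__":
    main()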
