VMware Horizon Community
talavital
Contributor

Horizon F5 TCP Reset using UAG

Hi all

Here is the scenario I have:

Horizon works perfectly with UAG for internal network users, who do not go through the F5.

The issue is with external clients, which go through the F5 and then to the UAG.

Those external users are able to get the Horizon catalog (which means the connection to the View server is fine), so this leaves us with the connection from the F5 and UAG to the VMs themselves.

We inspected the F5 together with the F5 vendor's engineers and observed the following:

Nov 11 09:58:25 F5ApmTlv debug tmm3[20058]: 019cffff:7: /Common/VMwareViewClient_AP_copy-07-05-2018:Common:c16bb787: RD: [S] 192.x.x.x.443 i 192.x.x.x.3338: server-side connection was reset, reason: Flow expired (sweeper)
Nov 11 09:58:25 F5ApmTlv debug vdi[16377]: 019cffff:7: /Common/VMwareViewClient_AP_copy-07-05-2018:Common:c16bb787: {f8.C} TMEVT_CLOSE
Nov 11 09:58:25 F5ApmTlv debug vdi[16377]: 019cffff:7: /Common/VMwareViewClient_AP_copy-07-05-2018:Common:c16bb787: {f8.C} -> ~D
Nov 11 09:58:25 F5ApmTlv debug vdi[16377]: 019cffff:7: /Common/VMwareViewClient_AP_copy-07-05-2018:Common:c16bb787: {f8.C} BHL_OnDispose
Nov 11 09:58:26 F5ApmTlv debug tmm1[20058]: 019cffff:7: /Common/VMwareViewClient_AP_copy-07-05-2018:Common:c16bb787: RD: [S] 192.x.x.x.443 i 192.x.x.x.50121: server-side connection was reset, reason: Flow expired (sweeper)

All ports and prerequisites have been validated and are in place, including the special configurations and iRules for the F5.

We can't yet determine the root cause or pinpoint where the issue is coming from.

Any ideas?

Thank you.

sjesse
Leadership

With the UAGs, the web interface goes to the Connection Server and then directly to the desktop if the secure gateways aren't enabled. If the secure gateways are enabled, the connections to the desktops are made through the UAG and from there to the virtual desktops. I'm wondering if your VMs allow direct connections internally but are blocked externally. Check any firewall rules you have set up, check whether the secure gateways are enabled on the UAG, and make sure they aren't enabled on the Connection Servers themselves.

Look at this guide for reference if you haven't seen it:

https://www.f5.com/pdf/solution-center/load-balancing-vmware-unified-access-gateway-servers-deployme...

talavital
Contributor

Thank you for your reply,

I am aware of everything you have mentioned; I have many Horizon installations with this same architecture.

We have validated everything ten times over from top to bottom, including the firewall, and no drops are seen.

The gateways are disabled on the View server since they are now handled by the UAG; that is the first thing I do right away.

As I mentioned, internal users also go through the UAG and it works.

That means something is wrong with the F5, but I can't tell where to point. The F5 vendor, of course, is pointing the finger and blaming Horizon...

agalliasistju
Enthusiast

What's currently configured in this folder on your broker?  C:\Program Files\VMware\VMware View\Server\sslgateway\conf

I ask because we had a hiccup with our F5 config early on and needed to add the following lines to the locked.properties file:

serverProtocol=http

checkOrigin=false

The serverProtocol=http line was in our config from the start to get the F5 setup to work. We added checkOrigin=false after the Horizon 7 upgrade for this: VMware Knowledge Base
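
For reference, here's a minimal sketch of what that locked.properties can end up looking like with those two entries (the comments and the service-restart note are my own assumptions, not something from the original post):

    # C:\Program Files\VMware\VMware View\Server\sslgateway\conf\locked.properties
    # Tell the broker that the front-end device (the F5) talks to it over HTTP
    serverProtocol=http
    # Relax origin checking after the Horizon 7 upgrade, per the KB referenced above
    checkOrigin=false
    # Changes to locked.properties generally require a restart of the Connection Server service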

SteveWH
Enthusiast

To know for sure you would need to set up packet captures to see what is happening on the wire. If you can recreate the issue with a client, you would ideally run a capture on the client, the F5, and the server, but most likely just the F5 and the remote server are needed. This will give you insight into why the RST packets are being sent from the F5 to the client, causing the disconnection. If you can't reproduce the issue on demand, you would need to connect and run a rotating tcpdump on the F5 until the issue occurs.

In the log excerpt you provided, the RST reason is 'Flow expired (sweeper)'. The BIG-IP system will reap a connection from the connection table and send a TCP RST packet to the client when one of the following two conditions is met:

1) An idle timeout for the connection expired. This may be influenced by the Idle Timeout setting in the TCP profile assigned to the affected virtual server.

2) Memory usage on the BIG-IP system increased beyond the reaper high-water mark and triggered adaptive reaping.

K13223: Configuring the BIG-IP system to log TCP RST packets

https://support.f5.com/csp/article/K13223

K411: Overview of packet tracing with the tcpdump utility

https://support.f5.com/csp/article/K411

K13637: Capturing internal TMM information with tcpdump

https://support.f5.com/csp/article/K13637
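
If the disconnects can't be reproduced on demand, a rotating capture on the BIG-IP along these lines will keep a ring buffer running until the reset shows up (the interface, filter addresses, and file sizes here are illustrative assumptions, not values taken from the K-articles):

    # Rotating tcpdump on the BIG-IP: 10 files of ~100 MB each, oldest overwritten first
    # 0.0 captures across all VLANs; narrow the filter to the affected client/UAG as needed
    tcpdump -ni 0.0 -s0 -C 100 -W 10 -w /var/tmp/horizon_rst.pcap host 192.0.2.10 and port 443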

The vendor you are working with can then analyze the capture files to see the communications leading up to the RST packet being sent. For example, we had an issue where connections were intermittently being reset, resulting in clients displaying a generic 'network error'. The client and server logs didn't show anything meaningful beyond the client being disconnected with a generic error. The F5 qkviews were clean, and we were on the latest versions with iRules in line with our version of Horizon.

We connected to the F5 and checked the connections in real time to monitor idle timers and session information (tmsh show sys connection, and tmsh show sys connection cs-client-addr x.x.x.x all-properties), but the timers weren't being reached and the system resources were in the green.

The traces ended up showing TCP SYNs being sent but the server not responding with the corresponding SYN/ACKs. Ultimately the F5 gave up and sent the RST to the client since the server wasn't responding. I'm not sure what you will see on your trace, but feel free to post the results. For us, the problem ended up being a mismatch between the default TIME_WAIT timer in the TCP profile created by the Horizon iApp and the default Windows Server TIME_WAIT timer.

Your issue sounds like it has a different root cause than ours, but I'll share the details of our case to show how the logs aren't always clear and why a packet capture is needed.

By default the iApp creates virtual server profiles that use SNAT Automap with the source port preserved. What this means is that the Horizon client's source port used to make the initial connection to the F5 is then reused to make the second connection from the F5 to the Connection Server. For example:

CLIENT:50493 – F5:443 – F5:50493 – CONNECTIONSERVER1:443

The client uses a temporary ephemeral port to make the 443 connection to the F5. The F5 then uses that same ephemeral port to make the server-side connection. This usually isn't a problem, but it can become one if you have many connections in a short period of time and the probability of a port being reused increases.
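
As a hedged illustration (the virtual server name below is a placeholder, not something the iApp creates), you can check and change this behaviour in tmsh; 'preserve' is what the iApp deploys, while 'change' lets the F5 allocate its own ephemeral ports:

    # Show the current source-port setting on the Horizon virtual server
    tmsh list ltm virtual horizon_uag_https_vs source-port
    # Switch from 'preserve' to 'change' so the F5 assigns its own source ports
    tmsh modify ltm virtual horizon_uag_https_vs source-port change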

The problem started when we changed the Horizon global setting for the SSO timeout. We changed it from the default 15 minutes to the lowest available value of 1 minute. With the SSO timeout set to 1 minute, the client sends a heartbeat every 20 seconds instead of every 5 minutes. This increases the number of connections to the Connection Servers by a factor of 15.

This becomes a problem because clients are now communicating more frequently, so the chance of reusing an ephemeral port within a short period of time goes up. This isn't an issue for clients in a non-load-balanced environment, because they all have unique client IP addresses, but in a load-balanced environment the Connection Servers see all incoming requests from the same IPs: the IP(s) of the F5, depending on how many floating F5 IPs you have in the pool.

In accordance with RFC 793, Windows implements the usual TCP connection transition states. The one we are concerned with in this issue is TIME_WAIT. TIME_WAIT is supposed to last twice the maximum segment lifetime (2MSL), and by default Windows has this set to 4 minutes. If you run 'netstat' you will see connections in all the possible states: established, syn_sent, syn_recv, fin_wait1, fin_wait2, time_wait, close, close_wait, last_ack, listen, closing, unknown, etc.
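
For example, on a Connection Server you can get a quick count of how many source IP/port pairs are currently parked in TIME_WAIT with something like this (just standard netstat usage, not commands from the original post):

    REM Count connections currently held in TIME_WAIT
    netstat -ano | find /c "TIME_WAIT"
    REM Or list them to see which source IP/port pairs are blocked from immediate reuse
    netstat -ano | findstr TIME_WAIT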

When a connection is closed by the application, the underlying TCP stack keeps the source IP address/source port combination in a TIME_WAIT state until the 2MSL timer is up. From the local endpoint's point of view the connection is closed, but the stack is still waiting before accepting a new connection in order to prevent delayed duplicate packets from the previous connection being accepted by a new one. TCP blocks any second connection from the same source IP/source port pair until the TIME_WAIT state is finished. Any new connections coming in from that combination will be ignored, and no SYN/ACKs will be sent in response to the incoming SYNs.

This is a problem for load balancers like the F5 because they may reuse an ephemeral port if another client happens to pick it. The F5 also has a TIME_WAIT timer to prevent this from happening, which should match the destination device's TIME_WAIT timer, but it does not. By default the F5 timer is only 2 seconds whereas Windows uses 4 minutes. This means a source IP/source port combination can be reused after only 2 seconds on the F5, and the F5 will try to proxy the connection to a server that still has that same combination blocked for another 4 minutes.

This is what was happening to us: we ran a tcpdump on the F5 and saw it sending SYNs to the Connection Server and getting no response. It then retries the SYN three times, and when it eventually fails it sends a RST back to the client telling it to disconnect, hence the 'Network Error' our end users were receiving.

The Windows Server TIME_WAIT can be adjusted using the TcpTimedWaitDelay registry value, which accepts values from 30 to 300 seconds, with the default being 240 (4 minutes).
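
A sketch of setting that value, assuming the standard TCP/IP parameters key (the registry path comes from general Windows documentation, not from this thread):

    REM Set TIME_WAIT (2MSL) on the Connection Server to 30 seconds; the value is in seconds
    reg add "HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters" /v TcpTimedWaitDelay /t REG_DWORD /d 30 /f
    REM A reboot is generally required before the new value takes effect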

I adjusted the Connection Servers to use a 30-second TIME_WAIT and changed the F5 iApp TCP profiles to custom profiles using 30 seconds instead of 2, as well as disabling the TIME_WAIT recycle option.
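
The matching change on the F5 side would look roughly like this in tmsh (the profile name is a placeholder; note that time-wait-timeout is specified in milliseconds, so 30 seconds is 30000):

    # Raise the custom TCP profile's TIME_WAIT from the 2-second default to 30 seconds and disable recycling
    tmsh modify ltm profile tcp horizon_tcp_custom time-wait-timeout 30000 time-wait-recycle disabled
    tmsh save sys config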

This issue would most likely not be seen in smaller networks, because two different clients connecting with the same ephemeral port in a short window is unlikely. Depending on the configuration, there may be many F5 floating IPs that connections go across, which also helps mitigate the occurrence by providing more source IP/source port combinations. The source port setting can also be changed so that, instead of the F5 preserving the source port the client used, it keeps its own table and assigns ports itself. All of these options can mask the underlying problem, but the real fix is ensuring the timeout values match, to limit port reuse during a TIME_WAIT period.
