elgwhoppo
Hot Shot

NSX Load Balancer: Traffic still being passed to a pool member marked as down?

Hey guys,

What I am trying to set up right now is very simple: a load balancer for two security servers that, for now, just handles 443 traffic. The layout is pictured below; the ESG is the load balancer and the targets are 10.0.112.2 and 10.0.112.3.

[layout.png: network layout with the ESG load balancing to 10.0.112.2 and 10.0.112.3]
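
(For anyone reproducing this, a quick sanity check from a test client that each piece answers on 443; the VIP address below is just a placeholder since I haven't listed the real one.)

VIP=10.0.112.10   # placeholder, substitute the real VIP address
for host in 10.0.112.2 10.0.112.3 $VIP; do
  # -k skips certificate verification (lab), -I fetches headers only, -s silences progress
  curl -kIs --connect-timeout 5 "https://$host/" >/dev/null \
    && echo "$host answers on 443" \
    || echo "$host did not answer on 443"
done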

One of the things I'm struggling with is that if a member in the pool becomes unavailable, the load balancer still sends traffic to it, even though the status is DOWN.

I've tried the following combinations of load balancing and monitoring policies without luck.

Round Robin + HTTPS monitor

Round Robin + ICMP monitor

Least Connections + HTTPS monitor

Least Connections + ICMP monitor

Even running debug packet display interface vNic_0 host_10.0.112.3_and_port_443 on the ESG shows that traffic is still being passed through, even though the member is marked as DOWN. Then, when I go into the pool and MANUALLY disable the down node, the load balancer finally starts handing traffic to the up node, and the status in the window below changes to MAINT. Is this expected behavior? The whole point of health monitoring is to automatically detect that a node is down and take it out of the distribution. Kind of at a loss here.

[still.PNG: pool member status showing the manually disabled node in MAINT]

VCDX-Desktop
5 Replies
ddesmidt
VMware Employee

When a server is marked down, new sessions will never go to that server.

That said, existing sessions will still go to that server.
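
For example, roughly (example VIP and a test page that echoes which backend answered; a separate curl invocation opens a new TCP connection, while one invocation with several URLs reuses the same connection):

VIP_URL=http://20.20.20.2/test.php   # example VIP and a page that reports which backend answered

# separate invocations = a new TCP connection each time,
# so these should only ever land on members that are UP
for i in 1 2 3; do curl -s "$VIP_URL"; done

# one invocation with several URLs reuses the same TCP connection,
# so every request stays on whichever server that connection was balanced to
curl -s "$VIP_URL" "$VIP_URL" "$VIP_URL"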

Is that what you're experiencing?

Dimitri

elgwhoppo
Hot Shot

How do you define 'existing session'? The point of having a load balancer is not just to distribute load, but to take failed nodes out of rotation faster than we could by manually changing the service configuration or DNS records, whether or not there is an existing session.

From what I can tell, yes, that is what I'm experiencing. In my testing, when I reconnected from an endpoint that had a pre-established connection, no amount of waiting changed the fact that it was still sent to the same server, even though that server had changed to DOWN status. Only when I changed the IP address of the client computer did the load balancer send the traffic to the UP member.
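
Roughly what I mean (placeholder addresses; curl can bind to a second IP configured on the client so the load balancer sees a different source address, and you can watch which member receives the traffic with the debug packet display command above):

VIP=10.0.112.10            # placeholder VIP address
PRIMARY_IP=10.0.200.50     # the client's normal address (example)
SECONDARY_IP=10.0.200.51   # a second address configured on the same client (example)

# with SOURCEIP persistence, requests from the primary IP keep going to the
# server recorded in the persistence table, even after it is marked DOWN
curl -ks --interface "$PRIMARY_IP" "https://$VIP/" -o /dev/null

# the same request bound to the secondary IP looks like a new client, so it
# gets balanced fresh and lands on the UP member
curl -ks --interface "$SECONDARY_IP" "https://$VIP/" -o /dev/null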

The only comment from the pubs is: "Each pool is monitored by the associated service monitor. When the load balancer detects a problem with a pool member, it is marked as down."

From this location: Set Up Load Balancing

This needs a little more clarification in the pubs, as it sounds to me like this is either a bug or an undocumented lack of basic load balancer functionality.

VCDX-Desktop
larsonm
VMware Employee

On the application profile:  By chance are you using SOURCEIP as your persistence profile?  What do you have set for Expires in?

elgwhoppo
Hot Shot

Hey larsonm, yes, SOURCEIP is the persistence method in the application profile, and Expires in is blank. I also tried setting Expires in to 30 without luck. BTW, this is all with SSL passthrough; I'm trying the easier configs before moving on to the more difficult ones.

I think I finally got it.

I tried using SSL Session ID persistence in the application profile in combination with the IP-HASH algorithm in the pool. This seems to work pretty well; it reconnects me quickly without issue.
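
Roughly how I'm checking it (placeholder VIP; every curl invocation is a brand-new TCP connection and TLS handshake, so each attempt should get re-balanced onto an UP member):

VIP=10.0.112.10   # placeholder VIP address

# each attempt is a new TCP connection + TLS session; -k skips certificate
# verification in the lab, -w prints the HTTP status so a failed member shows up fast
for i in 1 2 3; do
  curl -kso /dev/null -w "attempt $i -> HTTP %{http_code}\n" "https://$VIP/"
done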

VCDX-Desktop
ddesmidt
VMware Employee

I tried the same thing you did: a VIP with source-IP persistence and no expiration time.

Step0: both servers UP

NSX-edge-3-0> show service loadbalancer pool Pool-Web01-http

-----------------------------------------------------------------------

Loadbalancer Pool Statistics:

POOL Pool-Web01-http

|  LB METHOD round-robin

|  LB PROTOCOL L7

|  Transparent enabled

|  SESSION (cur, max, total) = (0, 1, 7)

|  BYTES in = (1197), out = (1911)

  +->POOL MEMBER: Pool-Web01-http/web01, STATUS: UP

  |  |  HEALTH MONITOR = BUILT-IN, default_http_monitor:L7OK

  |  |  |  LAST STATE CHANGE: 2015-09-25 06:51:49

  |  |  SESSION (cur, max, total) = (0, 1, 5)

  |  |  BYTES in = (855), out = (1365)

  +->POOL MEMBER: Pool-Web01-http/web02, STATUS: UP

  |  |  HEALTH MONITOR = BUILT-IN, default_http_monitor:L7OK

  |  |  |  LAST STATE CHANGE: 2015-09-25 06:51:49

  |  |  SESSION (cur, max, total) = (0, 1, 1)

  |  |  BYTES in = (129), out = (225)

Step1: I access the VIP from client 20.20.20.1 and I'm directed to server2 (test.php is a simple page that displays the server IP address).

vyatta@vyatta:~$ curl http://20.20.20.2/test.php

The Client IP@ is: 20.20.20.1<br>

The Server IP@ is: 10.1.1.12

I also check the persistence table for the client (20.20.20.1):

NSX-edge-3-0> show service loadbalancer table ipv4_ip_table_Pool-Web01-http

-----------------------------------------------------------------------

L7 Loadbalancer Sticky Table [ipv4_ip_table_Pool-Web01-http] Status:

# table: ipv4_ip_table_Pool-Web01-http, type: ip, size:1048576, used:1

0x341154e94d8: key=20.20.20.1 use=0 exp=223550 server_id=2 conn_cnt=0 conn_rate(60000)=0 conn_cur=0 sess_cnt=0 sess_rate(60000)=0 http_req_cnt=0 http_req_rate(60000)=0

And I check that the client (20.20.20.1) no longer has any open TCP connection to the VIP (so when the client reconnects to the VIP, it will be over a new TCP session):

NSX-edge-3-0> show service loadbalancer session l7

-----------------------------------------------------------------------

L7 Loadbalancer Current Sessions:

0x341154e3310: proto=unix_stream src=unix:1 fe=GLOBAL be=<NONE> srv=<none> ts=09 age=0s calls=2 rq[f=c08200h,i=0,an=00h,rx=20s,wx=,ax=] rp[f=008000h,i=0,an=00h,rx=,wx=,ax=] s0=[7,8h,fd=1,ex=] s1=[7,0h,fd=-1,ex=] exp=20s

Step2: I stop Apache on Server2

root@Web02:~# service apache2 stop

And I validate that the Edge detected Server2 as DOWN:

NSX-edge-3-0> show service loadbalancer pool Pool-Web01-http

-----------------------------------------------------------------------

Loadbalancer Pool Statistics:

POOL Pool-Web01-http

|  LB METHOD round-robin

|  LB PROTOCOL L7

|  Transparent enabled

|  SESSION (cur, max, total) = (0, 1, 7)

|  BYTES in = (1197), out = (1911)

  +->POOL MEMBER: Pool-Web01-http/web01, STATUS: UP

  |  |  HEALTH MONITOR = BUILT-IN, default_http_monitor:L7OK

  |  |  |  LAST STATE CHANGE: 2015-09-25 06:51:49

  |  |  SESSION (cur, max, total) = (0, 1, 5)

  |  |  BYTES in = (855), out = (1365)

  +->POOL MEMBER: Pool-Web01-http/web02, STATUS: DOWN

  |  |  HEALTH MONITOR = BUILT-IN, default_http_monitor:L4CON

  |  |  |  LAST STATE CHANGE: 2015-09-25 07:08:32

  |  |  |  FAILURE DETAIL: Connection refused

  |  |  SESSION (cur, max, total) = (0, 1, 2)

  |  |  BYTES in = (342), out = (546)

I can also validate that the persistence entry is still there and still points to server2 (server_id=2):

NSX-edge-3-0> show service loadbalancer table ipv4_ip_table_Pool-Web01-http

-----------------------------------------------------------------------

L7 Loadbalancer Sticky Table [ipv4_ip_table_Pool-Web01-http] Status:

# table: ipv4_ip_table_Pool-Web01-http, type: ip, size:1048576, used:1

0x341154e94d8: key=20.20.20.1 use=0 exp=183452 server_id=2 conn_cnt=0 conn_rate(60000)=0 conn_cur=0 sess_cnt=0 sess_rate(60000)=0 http_req_cnt=0 http_req_rate(60000)=0

Step3: I access the VIP from client 20.20.20.1 and now I'm directed to server1 (test.php is a simple page that displays the server IP address).

vyatta@vyatta:~$ curl http://20.20.20.2/test.php

The Client IP@ is: 20.20.20.1<br>

The Server IP@ is: 10.1.1.11

I also check the persistence table for the client (20.20.20.1), which now points to server1 (server_id=1):

NSX-edge-3-0> show service loadbalancer table ipv4_ip_table_Pool-Web01-http

-----------------------------------------------------------------------

L7 Loadbalancer Sticky Table [ipv4_ip_table_Pool-Web01-http] Status:

# table: ipv4_ip_table_Pool-Web01-http, type: ip, size:1048576, used:1

0x341154e94d8: key=20.20.20.1 use=0 exp=288973 server_id=1 conn_cnt=0 conn_rate(60000)=0 conn_cur=0 sess_cnt=0 sess_rate(60000)=0 http_req_cnt=0 http_req_rate(60000)=0




Conclusion:

I'm using the latest NSX-v build (6.2.0), but I don't think this would be different in older releases.

The only explanation I can think of is that you're using a browser, and the browser's TCP connection to the VIP (which was load balanced to the server) is NOT closed by the browser.

In that case, when the server goes DOWN, the Edge LB still forwards the packets of that TCP session to the DOWN server.

The reason is that the load balancer can NOT move the session to the other server, since the other server did not see the beginning of that TCP session.

However, a new TCP session from that client would go to the other (UP) server, and the persistence table would be updated accordingly (as shown above in the detailed steps).
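
If you want to see that from the client side, something like the sketch below mimics a browser: it holds one TCP connection open across the failure while new requests go out on fresh connections (addresses are from my lab above; on the Edge you can watch where the packets go with the debug packet display command mentioned earlier and re-check the sticky table with show service loadbalancer table).

# mimic a browser: open one TCP connection to the VIP and hold it (bash /dev/tcp)
exec 3<>/dev/tcp/20.20.20.2/80

# ... Apache on server2 is stopped here and the member goes DOWN ...

# a request sent over the held connection is still forwarded toward server2,
# because the Edge can not move an established TCP session to server1
printf 'GET /test.php HTTP/1.1\r\nHost: 20.20.20.2\r\n\r\n' >&3

# a request on a brand-new TCP connection is re-balanced to server1,
# and the sticky table entry for this client is updated (server_id 2 -> 1)
curl -s http://20.20.20.2/test.php

# drop the held connection
exec 3>&- 3<&-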

Dimitri