Solved: vRA 6.2 with F5 loadbalancer

Yankora · ‎03-30-2015

We have a problem deploying a vRA 6.2 distributed install behind an F5 load balancer, and the issue seems to be with the load balancer. When we power up the first appliance, all services are listed as registered after about 10 minutes, which is the normal behaviour. However, when we change the host settings to point to the load balancer VIP FQDN, most of the services fail to start, with the errors below in catalina.out:

---------- catalina.out ----------

2015-03-30 14:03:01,687 vcac: [component="cafe:catalog" priority="INFO" thread="taskScheduler-1" tenant=""] org.apache.http.impl.execchain.RetryExec.execute:106 - Retrying request

2015-03-30 14:03:01,710 vcac: [component="cafe:catalog" priority="WARN" thread="taskScheduler-1" tenant=""] com.vmware.vcac.platform.rest.client.support.RetriableOperation.call:74 - Exception handled during retry operation with message: I/O error on GET request for "https://mycloud.tedata.net/component-registry/endpoints/types/sso":Connection reset; nested exception is java.net.SocketException: Connection reset

2015-03-30 14:03:01,710 vcac: [component="cafe:catalog" priority="INFO" thread="taskScheduler-1" tenant=""] com.vmware.vcac.platform.rest.client.support.RetriableOperation.call:76 - Retries left: [10]. Sleeping for [20] seconds before the next retry attempt.

2015-03-30 14:03:03,036 vcac: [component="cafe:component-registry" priority="INFO" thread="eventPublisherExecutor-1" tenant=""] org.apache.http.impl.execchain.RetryExec.execute:93 - I/O exception (java.net.SocketException) caught when processing request: Connection reset

2015-03-30 14:03:03,036 vcac: [component="cafe:component-registry" priority="INFO" thread="eventPublisherExecutor-1" tenant=""] org.apache.http.impl.execchain.RetryExec.execute:106 - Retrying request

2015-03-30 14:03:03,045 vcac: [component="cafe:component-registry" priority="INFO" thread="eventPublisherExecutor-1" tenant=""] org.apache.http.impl.execchain.RetryExec.execute:93 - I/O exception (java.net.SocketException) caught when processing request: Connection reset

2015-03-30 14:03:03,045 vcac: [component="cafe:component-registry" priority="INFO" thread="eventPublisherExecutor-1" tenant=""] org.apache.http.impl.execchain.RetryExec.execute:106 - Retrying request

2015-03-30 14:03:03,053 vcac: [component="cafe:component-registry" priority="INFO" thread="eventPublisherExecutor-1" tenant=""] org.apache.http.impl.execchain.RetryExec.execute:93 - I/O exception (java.net.SocketException) caught when processing request: Connection reset

2015-03-30 14:03:03,054 vcac: [component="cafe:component-registry" priority="INFO" thread="eventPublisherExecutor-1" tenant=""] org.apache.http.impl.execchain.RetryExec.execute:106 - Retrying request

2015-03-30 14:03:03,062 vcac: [component="cafe:component-registry" priority="WARN" thread="eventPublisherExecutor-1" tenant=""] com.vmware.vcac.platform.rest.client.support.RetriableOperation.call:74 - Exception handled during retry operation with message: I/O error on GET request for "https://mycloud.tedata.net/component-registry/endpoints/types/sso":Connection reset; nested exception is java.net.SocketException: Connection reset

2015-03-30 14:03:03,062 vcac: [component="cafe:component-registry" priority="INFO" thread="eventPublisherExecutor-1" tenant=""] com.vmware.vcac.platform.rest.client.support.RetriableOperation.call:76 - Retries left: [39]. Sleeping for [20] seconds before the next retry attempt.

2015-03-30 14:03:06,355 vcac: [component="cafe:shell" priority="INFO" thread="org.springframework.scheduling.config.TaskExecutorFactoryBean#321ff047-1" tenant=""] org.apache.http.impl.execchain.RetryExec.execute:93 - I/O exception (java.net.SocketException) caught when processing request: Connection reset

2015-03-30 14:03:06,356 vcac: [component="cafe:shell" priority="INFO" thread="org.springframework.scheduling.config.TaskExecutorFactoryBean#321ff047-1" tenant=""] org.apache.http.impl.execchain.RetryExec.execute:106 - Retrying request

2015-03-30 14:03:06,365 vcac: [component="cafe:shell" priority="INFO" thread="org.springframework.scheduling.config.TaskExecutorFactoryBean#321ff047-1" tenant=""] org.apache.http.impl.execchain.RetryExec.execute:93 - I/O exception (java.net.SocketException) caught when processing request: Connection reset

2015-03-30 14:03:06,366 vcac: [component="cafe:shell" priority="INFO" thread="org.springframework.scheduling.config.TaskExecutorFactoryBean#321ff047-1" tenant=""] org.apache.http.impl.execchain.RetryExec.execute:106 - Retrying request

2015-03-30 14:03:06,374 vcac: [component="cafe:shell" priority="INFO" thread="org.springframework.scheduling.config.TaskExecutorFactoryBean#321ff047-1" tenant=""] org.apache.http.impl.execchain.RetryExec.execute:93 - I/O exception (java.net.SocketException) caught when processing request: Connection reset

2015-03-30 14:03:06,375 vcac: [component="cafe:shell" priority="INFO" thread="org.springframework.scheduling.config.TaskExecutorFactoryBean#321ff047-1" tenant=""] org.apache.http.impl.execchain.RetryExec.execute:106 - Retrying request

2015-03-30 14:03:06,420 vcac: [component="cafe:shell" priority="ERROR" thread="org.springframework.scheduling.config.TaskExecutorFactoryBean#321ff047-1" tenant=""] com.vmware.vcac.core.service.registry.ServiceRegistryManager.register:123 -

---------- End ----------
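To see at a glance which components are hitting the resets, an excerpt like the one above can be summarised per component. The following is only a minimal sketch: it assumes every line carries the `[component="..." priority="..."]` header seen in this excerpt, and the function name `count_connection_resets` is made up for illustration.

```python
import re
from collections import Counter

# Matches the bracketed header seen in the excerpt above, for example:
# [component="cafe:catalog" priority="WARN" thread="..." tenant=""]
HEADER = re.compile(r'\[component="(?P<component>[^"]+)" priority="(?P<priority>[^"]+)"')


def count_connection_resets(lines):
    """Count log lines mentioning 'Connection reset', grouped by component."""
    counts = Counter()
    for line in lines:
        if "Connection reset" not in line:
            continue
        match = HEADER.search(line)
        if match:
            counts[match.group("component")] += 1
    return counts
```

Passing it an open file handle for catalina.out and printing `counts.most_common()` would list the noisiest components first. If every component shows resets against the VIP FQDN, that points at the path through the load balancer rather than at one service.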

The certificates are loaded properly on the load balancer VIP, and the configuration follows https://www.vmware.com/files/pdf/products/vCloud/VMW-vRealize-Automation-61-Deployment-Guide-HA.pdf

While I have not done the F5 configuration myself, I trust that the team followed the procedure. Has anyone experienced the same problem before?

SkyCoop · ‎03-30-2015

The health monitors require the services to be up before the load balancer passes traffic to a particular node, but the services can't start until traffic is being passed to the first node that is starting up, so the two wait on each other. I typically add a record to the hosts file on both the virtual appliance and the IaaS web nodes that points the VIP name to the local IP; the stack can then start up with the health monitoring left in place.

If you do this on the virtual appliance, make sure you put the hosts entry outside of the VAMI GENERATED (or something like that) comments, as the section between them is rebuilt each time the appliance restarts.
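The hosts-file workaround could be scripted roughly as follows. This is a sketch under assumptions, not a tested procedure: the function name `add_vip_override` is hypothetical, `192.0.2.10` in the usage note is a placeholder for the node's own IP, and appending at the end of the file is only assumed to land outside the VAMI-generated comments, so check the actual markers on the appliance first.

```python
def add_vip_override(hosts_text, vip_fqdn, local_ip):
    """Return hosts_text with '<local_ip> <vip_fqdn>' appended at the end.

    The text is returned unchanged if an active (non-comment) line already
    maps vip_fqdn, so running this repeatedly does not add duplicates.
    """
    for line in hosts_text.splitlines():
        # Drop any trailing comment, then split into address and names.
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and vip_fqdn in fields[1:]:
            return hosts_text  # already mapped; leave the file untouched
    if hosts_text and not hosts_text.endswith("\n"):
        hosts_text += "\n"
    return hosts_text + f"{local_ip} {vip_fqdn}\n"
```

For example, `add_vip_override(open("/etc/hosts").read(), "mycloud.tedata.net", "192.0.2.10")` returns the new file contents; writing them back is left to the caller so the change can be reviewed before it is applied.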

Just saw that the certificates are loaded on the load balancer. I don't do that; I just pass the traffic through (I think it is layer4-performance in F5 terms).

Yankora · ‎03-30-2015

Thanks SkyCoop. I will manage the /etc/hosts entry using crontab, test it, and get back to you tomorrow. But I am wondering: what is the impact of loading the certificate on the LB VIP?

SkyCoop · ‎03-30-2015

Is the certificate on the load balancer a SAN or wildcard cert, and is it the same one that is installed on the appliances / IaaS hosts?

Yankora · ‎03-31-2015

Yes SkyCoop, it is a wildcard cert loaded on the VIP and on the two vRA nodes behind it.
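One way to double-check that the VIP and the nodes really present the same certificate is to compare fingerprints. The sketch below uses only the Python standard library; the function names are made up for illustration, port 443 is an assumption, and the node hostnames in the usage note are placeholders.

```python
import hashlib
import ssl


def fetch_der(host, port=443):
    """Fetch the certificate a server presents, as DER bytes (no validation)."""
    pem = ssl.get_server_certificate((host, port))
    return ssl.PEM_cert_to_DER_cert(pem)


def certificate_fingerprints(hosts, fetch=fetch_der):
    """Map each host to the SHA-256 fingerprint of the certificate it presents."""
    return {host: hashlib.sha256(fetch(host)).hexdigest() for host in hosts}


def all_match(fingerprints):
    """Return True when every host presented the same certificate."""
    return len(set(fingerprints.values())) <= 1
```

Calling `certificate_fingerprints(["mycloud.tedata.net", "node1.example", "node2.example"])` and passing the result to `all_match` would show whether the VIP and both nodes agree. The `fetch` parameter is there so the comparison can be exercised without network access.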

jwhites · ‎04-06-2015

I would be interested in what caused this as we had no such issues when publishing them behind a load balancer.

It's interesting that most of the issues with vRA seem to come with the distributed install, which is what VMware recommends for a production environment, yet the documentation for that install is somewhat unclear in areas. A lot of the tutorials, videos, and walkthroughs you see are also of people doing a basic install. Does anyone have any realistic numbers on users/operations for a basic install vs. a distributed one? It's never good when the scenario we are implementing for high availability ends up causing us downtime. I'm not suggesting that you don't go the distributed route, but it would be interesting to see when we should realistically be doing this instead of making something more complex just because 'we can' and 'we should'.

Hopefully the installation of 'automation center' will be somewhat more 'automated' in the future. I'd like to see the IaaS Windows box go away completely, as well as the dependency on MSSQL.
