VMware Cloud Community
skoch
Enthusiast

Steer clear of vRA 7

I'm working on a fully distributed vRA 7 HA environment with external vROs, and the platform is not stable. While installation may go smoothly, HA is not currently possible due to bugs in either the vRA appliances or the vRO appliances. Both embedded vRO and external vRO result in platform outages, either randomly or when a vRO node is taken offline.

The behavior seen with the embedded vRO is that one of the nodes will randomly stop working. Then, if you reboot the other node, you're no longer able to bring the environment online at all. This is due to the new load balancer rules that check the health of all services on the vRA appliances. Because the first appliance isn't completely healthy (the embedded vRO service isn't running), the load balancer doesn't send traffic to it. The other appliance, which is attempting to boot, can't start because vRO is unavailable and the shell-ui-app and advanced-designer-service services refuse to start until it is.

The behavior with external vRO appliances is similar, but slightly different. The first difference is that the vRO appliances don't randomly stop working as seen with the embedded vRO service. Also, because there is a separate load balancer VIP (and associated health checks), one would think that vRO can continue to run even if vRA is offline. Unfortunately that isn't the case, because to assign licenses to the vRO appliances you need to authenticate using the vRealize Automation authentication provider. This would also be fine if that authentication method didn't use the vRA load balancer VIP, which requires all services on the vRA appliances to be running before traffic is passed to them.

So essentially you end up with an infinite loop, because vRA requires vRO to be online for it to start and vRO requires vRA to be online for it to start. The only way around this is to temporarily disable the health checks on the vRA appliances any time you need to take either of the vRO nodes offline (or one goes offline). Even this sometimes doesn't work, and you then have to restart the vRA appliances for the environment to come back online. Ultimately this results in unplanned downtime.
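
To make the loop concrete, here is a minimal sketch that models the startup dependencies described above as a graph and looks for a cycle. The node names (`vra-services`, `vro`, `vra-vip`) are labels made up for illustration, not real service identifiers, and the graph is only my reading of the behavior, not anything taken from the product.

```python
from typing import Dict, List, Optional

# Hypothetical model of "X cannot come up until Y is up", per the description above.
WITH_HEALTH_CHECKS: Dict[str, List[str]] = {
    "vra-services": ["vro"],      # shell-ui-app etc. wait for vRO
    "vro": ["vra-vip"],           # vRO authenticates through the vRA VIP
    "vra-vip": ["vra-services"],  # VIP passes traffic only if all vRA services are healthy
}

# Same model with the VIP health checks disabled (the temporary workaround).
WITHOUT_HEALTH_CHECKS: Dict[str, List[str]] = {
    "vra-services": ["vro"],
    "vro": ["vra-vip"],
    "vra-vip": [],
}


def find_cycle(graph: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of nodes, or None if there is none."""
    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state: Dict[str, int] = {}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        state[node] = IN_PROGRESS
        path.append(node)
        for dep in graph.get(node, []):
            if state.get(dep, UNSEEN) == IN_PROGRESS:
                # dep is already on the current path, so we have looped back to it.
                return path[path.index(dep):] + [dep]
            if state.get(dep, UNSEEN) == UNSEEN:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        state[node] = DONE
        return None

    for node in graph:
        if state.get(node, UNSEEN) == UNSEEN:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


if __name__ == "__main__":
    print(find_cycle(WITH_HEALTH_CHECKS))
    print(find_cycle(WITHOUT_HEALTH_CHECKS))
```

With the health checks in place the model contains a cycle, so nothing can start first. Removing the VIP's dependency on healthy vRA services breaks the cycle, which is why disabling the health checks lets the environment recover.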

11 Replies
AlexJudge
VMware Employee

Perhaps your post title should reflect the issue is with load balancing?

skoch
Enthusiast

Is there a way to do an HA distributed environment without load balancing?

AlexJudge
VMware Employee

Not to sound pedantic, but distributed vRA isn't the same thing as HA vRA. You can deploy vRA distributed, without load balancers. You just don't get the availability.

So in short, the answer is yes, you can do distributed without load balancers. But if you want high availability, you have to install it in a distributed manner, including load balancers.

skoch
Enthusiast

Yep, I'm familiar with the non-HA distributed deployment (I believe it's section 4 of the E07 install and config guide). That's why I made sure in my post to state right off the bat that the issue is with a distributed HA environment which is the recommended architecture for production use.

gradinka
VMware Employee

the health check dependency of shell-UI / advanced-service-designer can be easily disabled by modifying one of the setenv files in /etc/vcac

but the key here is why is vro stopping - any hints? logs?

that should not happen

Michael_Rudloff
Enthusiast

I wouldn't necessarily say to stay clear but there are indeed parts of the product which are just broken - vRB7 with vRA7 is just one example, just can't get it to work properly ..

If you did hit a bug, I highly suggest getting in contact with VMware Support. If this is a production environment, you should have support. If not, and it is just a POC and you are a partner, you have a set of complimentary support tickets.

But I certainly hope 7.1 is nearby myself ..

___ My own knowledge base made public: http://open902.com
zwal1986
Enthusiast

We saw similar issues in our distributed HA environment (6.2.2), which uses F5s.

Our issues weren't with the orchestrators but with the IaaS components not correctly registering to the vRA appliances after the vRA appliances were rebooted. This was due to the fact that the IaaS components had all been configured to use the load balanced URL for communication with vRA and the load balancer was reporting that the appliances were down due to the failure of all the services to register correctly. We had a bit of a chicken and egg issue going on.

I opened a ticket with GSS and determined that the easiest way to fix the issue was to simply edit the hosts file on the appliances and point the appliance back to itself there. This allowed the vRA appliances to come back up, which brought the load balancer up, which then allowed all of the services to register. Perhaps doing something similar would take care of the issues you are seeing with Orchestrator as well.
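
As a rough illustration of that hosts-file change, the sketch below builds new hosts-file text in which the load-balanced name resolves to the appliance's own address. The hostname `vra.example.com` and the IP addresses are placeholders, and the helper `pin_vip_to_self` is made up for this example. It only returns text and does not touch `/etc/hosts`. It also assumes the load-balanced name is the only name on its line; a line that lists other names next to it would be dropped along with it.

```python
def pin_vip_to_self(hosts_text: str, own_ip: str, vip_fqdn: str) -> str:
    """Return hosts-file text where vip_fqdn maps to own_ip.

    Any existing line that lists vip_fqdn as a hostname is dropped,
    then a single new mapping is appended.
    """
    kept = []
    for line in hosts_text.splitlines():
        # Ignore trailing comments; the first field is the address,
        # the remaining fields are hostnames.
        fields = line.split("#", 1)[0].split()
        if vip_fqdn in fields[1:]:
            continue
        kept.append(line)
    kept.append(f"{own_ip}\t{vip_fqdn}")
    return "\n".join(kept) + "\n"


if __name__ == "__main__":
    current = "127.0.0.1 localhost\n198.51.100.5 vra.example.com\n"
    print(pin_vip_to_self(current, "192.0.2.10", "vra.example.com"), end="")
```

Whether the override should stay in place permanently or only during recovery is a judgment call for your environment; GSS can advise on that.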

hughesr9
Contributor

I am finding vRA 7.0 to be quite unstable; I have had to re-install twice.

Today vRA just stopped working again and I tried a reboot. I can log in to the vRA Appliance, but I can see that the shell-ui-app has failed and only component-registry, licensing-service and plugin-service are registered.

Is anybody else having similar issues? I really don't want to have to re-install again. Is there a way to get the shell-ui-app to work?

GrantOrchardVMw
Commander

I'm not sure that these kinds of posts really help. In essence you are saying "we are having a problem, therefore everyone should stay clear".

Can you provide your SR number and I can assist with getting engineering to work on this with you?

Grant

Grant http://grantorchard.com
GrantOrchardVMw
Commander

Just bumping this. If you can provide me an SR number we can get you some assistance.

Grant

Grant http://grantorchard.com
skoch
Enthusiast

Thanks Grant, and I understand why that might seem to be the case; however, I very much meant what I said. Customers looking to put this into production should wait until VMware has QC'd the product and resolved these errors that impact functionality and stability. The issues we were having aren't limited to our environment, as was determined through three separate SRs. Some of the issues we were able to work around, others not so much.

Two of the issues VMware knew about but didn't include in the release notes (they still haven't been added to the release notes). Some might have been new to VMware; however, many should have been caught in QC. Here's the list of issues from my most recent engagement:

  • You can't modify the size of a secondary disk when requesting a machine. No workaround available.

  • When using the embedded vRO Appliances you're unable to failover to the secondary node. Workaround is to use external vRO appliances.

  • When using the embedded vRO Appliances the secondary appliance's "vco" and "advanced-services-designer" services randomly fail and require a reboot to bring back online. Workaround is to use external vRO appliances.

  • When using the embedded vRO Appliances, vRA continuously attempts to connect to the load balanced vRO VIP on port 8281, despite that port not being present with the embedded vRO. Workaround is to use external vRO appliances.

  • If you move to external vRO appliances the vRA appliances don't correctly update a file so it fails to point to the external vRO appliances and instead continues to attempt to connect to the embedded vRO. This may or may not happen after changing the authentication method in vRO. The workaround is to modify the setenv-server to disable the vco status check.

  • You can't change the IaaS web or manager certificates through the VAMI. Fails with a "can't match thumbprint" error. Workaround is to manually change the certificates like in 6.2.

  • When you modify the name of a component blueprint, the multi-machine blueprint isn't updated with the new name and gives an error. Workaround is to delete the blueprint, save and then add it back in.

  • The vIDM doesn't work with integrated Windows Authentication. Workaround is to use AD over LDAP.

  • The vIDM doesn't work with AD over LDAP when using the DNS Service Location. Workaround is to uncheck the DNS service lookup option and specify a Domain controller.
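
For anyone trying to confirm the port 8281 item above in their own environment, a quick reachability probe can show whether anything is listening behind the vRO VIP. This is just a generic TCP check using Python's standard library; the hostname in the usage comment is a placeholder, and a successful connection only shows that the port accepts connections, not that vRO is healthy.

```python
import socket


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


# Example usage (hypothetical VIP name, substitute your own):
#   port_open("vro-vip.example.com", 8281)
```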

In total I spent around 30 hours working with support over the course of two weeks to work around these issues. This is why I would caution anyone considering placing vRA 7 into production.
